Because the text-similarity measures used in traditional cluster analysis are ill-suited to short texts, this paper adopts a similarity measure based on sentence constituents to compare Weibo posts. Each post is first split into sentences, and syntactic parsing then identifies the constituents of each sentence; the words that make up those constituents are taken as feature words. HowNet is used to compute the semantic similarity between words that fill the same constituent in two posts, and these values are combined in a weighted sum, with one weight per constituent type, to give the similarity between the two posts. From these pairwise similarities a text similarity matrix is built and clustered to identify hot topics on Weibo. Finally, experiments demonstrate the feasibility of the proposed method.
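The pipeline described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the constituent names and weights in `WEIGHTS`, the exact-match `word_sim` stand-in for a HowNet-based word similarity, the best-match averaging in `constituent_sim`, and the threshold-based single-link grouping in `cluster` are all assumptions chosen only to make the example self-contained. Posts are assumed to be already parsed into a dictionary mapping each constituent type to its feature words.

```python
from itertools import combinations

# Hypothetical constituent weights; the abstract does not state the actual values.
WEIGHTS = {"subject": 0.3, "predicate": 0.4, "object": 0.3}


def word_sim(a, b):
    """Stand-in for a HowNet-based word similarity (assumption: exact match only)."""
    return 1.0 if a == b else 0.0


def constituent_sim(words_a, words_b):
    """Average best-match word similarity between two word lists (assumed aggregation)."""
    if not words_a or not words_b:
        return 0.0
    best_a = sum(max(word_sim(a, b) for b in words_b) for a in words_a)
    best_b = sum(max(word_sim(a, b) for a in words_a) for b in words_b)
    return (best_a + best_b) / (len(words_a) + len(words_b))


def text_sim(post_a, post_b):
    """Weighted sum of per-constituent similarities between two parsed posts."""
    return sum(
        weight * constituent_sim(post_a.get(kind, []), post_b.get(kind, []))
        for kind, weight in WEIGHTS.items()
    )


def similarity_matrix(posts):
    """Symmetric matrix of pairwise post similarities; the diagonal is set to 1.0."""
    n = len(posts)
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = text_sim(posts[i], posts[j])
    return matrix


def cluster(matrix, threshold=0.5):
    """Single-link grouping: posts are merged when their similarity >= threshold."""
    parent = list(range(len(matrix)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in combinations(range(len(matrix)), 2):
        if matrix[i][j] >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(matrix)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy input: each post is already reduced to its constituent feature words.
posts = [
    {"subject": ["team"], "predicate": ["wins"], "object": ["final"]},
    {"subject": ["team"], "predicate": ["wins"], "object": ["match"]},
    {"subject": ["storm"], "predicate": ["hits"], "object": ["coast"]},
]
topics = cluster(similarity_matrix(posts))
```

With the toy input, the first two posts share a subject and a predicate and so fall into one group, while the third forms its own. A real system would replace `word_sim` with a sememe-based HowNet measure, obtain the constituents from a syntactic parser, and could swap in any clustering algorithm that accepts a precomputed similarity matrix.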