Clustering Chinese Short Texts with K-means
I. Overall Workflow
1. Read in the texts and segment them into words.
2. Remove stop words from the segmented texts.
3. Compute word weights with TF-IDF.
4. Cluster the results with K-means.
(Since the author's skill is limited, everything is written in the way that was easiest for me to understand, so it may look clumsy. Apologies.)
II. Reading and Segmenting the Text
1. Reading the text
(1) The texts come from the Sogou news corpus (link: )
(2) Read in the texts (code below):
def read_from_file(file_name):
    # note: pass an explicit encoding= here if the default fails on the Sogou files
    with open(file_name) as fp:
        words = fp.read()
    return words

words = read_from_file("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\sougou_all\\互联网\\1.txt")
words1 = read_from_file("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\sougou_all\\互联网\\2.txt")
words2 = read_from_file("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\sougou_all\\互联网\\3.txt")
words3 = read_from_file("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\sougou_all\\互联网\\4.txt")
listall = [words, words1, words2, words3]
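Since the four calls above repeat the same long path, a glob-based variant can load every .txt in the folder in one pass. A minimal sketch (it reuses the author's directory but is not from the original post):

import glob

# load every .txt under the 互联网 folder, sorted by file name
paths = sorted(glob.glob("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\sougou_all\\互联网\\*.txt"))
listall = [read_from_file(p) for p in paths]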
2. Word segmentation
(1) Install the jieba library: segmentation requires jieba. In PyCharm, open Settings → Project Interpreter, click the + in the upper-right corner, search for jieba, and click Install (or simply run pip install jieba).
(2) Segment the text (code below):
import jieba

def cut_words(words):
    result = jieba.cut(words)
    words = []
    for r in result:
        words.append(r)
    return words
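A quick check of what cut_words returns (the sample sentence is made up; the exact segmentation depends on jieba's version and dictionary):

print(cut_words("我爱自然语言处理"))  # e.g. ['我', '爱', '自然语言', '处理']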
III. Removing Stop Words
1. Download a stop-word list (the HIT stop-word list is recommended) and save it as a .txt file.
2. Read in the stop-word file.
def stop_words(stop_word_file):
    with open(stop_word_file) as fp:
        words = fp.read()
    # print(words)
    result = words.split('\n')
    return result

# the path below points at a directory; the stop-word file name still needs to be appended
stopwords = stop_words("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\")
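A quick sanity check on what was loaded:

print(len(stopwords))   # number of stop words read in
print(stopwords[:5])    # peek at the first few entries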
3. Compare the segmented texts against the stop-word list and drop every word that appears in it, giving us preliminarily preprocessed data.
def del_stop_words(words, stop_words_set):
    new_words = []
    for k in words:
        if k not in stop_words_set:
            new_words.append(k)
    return new_words

# remove stop words from each document
list1 = del_stop_words(cut_words(words), stopwords)
list2 = del_stop_words(cut_words(words1), stopwords)
list3 = del_stop_words(cut_words(words2), stopwords)
list4 = del_stop_words(cut_words(words3), stopwords)
# pool every token so we can build a shared vocabulary
# (the original looped over listall without ever filling list0, so list0 stayed empty)
list0 = list1 + list2 + list3 + list4
4. Because the pooled list combines several documents, it contains duplicates. Converting it to a set removes them, but a set is unordered, so when converting back to a list we sort by each word's first-occurrence index in list0. The result is a fully preprocessed vocabulary.
listall1 = list(set(list0))        # deduplicate
listall1.sort(key=list0.index)     # restore first-occurrence order
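A toy illustration of the dedupe-then-restore-order trick:

tokens = ["网络", "新闻", "网络", "用户"]
vocab = list(set(tokens))
vocab.sort(key=tokens.index)
print(vocab)  # ['网络', '新闻', '用户'] — duplicates gone, original order kept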
IV. TF-IDF
1. Segmentation leaves us with a great many words. Some are insignificant while others act as keywords in the text, so their weights differ; we implement the steps of TF-IDF to obtain a weight for each word.
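Spelled out, the scheme implemented in steps 2–5 below is (with N = 4 documents here; this is read off the code rather than stated explicitly in the original):

$$\mathrm{tf}(t,d)=\frac{n_{t,d}}{\sum_{t'} n_{t',d}},\qquad \mathrm{idf}(t)=\log\frac{N}{\mathrm{df}(t)+1},\qquad \text{tf-idf}(t,d)=\mathrm{tf}(t,d)\cdot\mathrm{idf}(t)$$

where n_{t,d} is the count of word t in document d and df(t) is the number of documents containing t.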
2. First, count word frequencies; the code is roughly as follows.
import pandas

def frequency(strlist, listall1):
    array = []
    for word in listall1:
        array.append(strlist.count(word))
    return array

# count the words in each document
arrayall = [frequency(list1, listall1), frequency(list2, listall1),
            frequency(list3, listall1), frequency(list4, listall1)]
# output the counts as a DataFrame
df = pandas.DataFrame(data=arrayall, columns=listall1)
3. Compute TF. (Steps 3, 4, and 5 can all be skipped: sklearn ships a dedicated TF-IDF implementation, TfidfVectorizer, that can be called directly; a sketch of that route appears at the end of this part. What follows is the author's own implementation of the TF-IDF definition, written to aid understanding. If anything is wrong or can be improved, please point it out, thanks!)
def tf(arrayall):
    # total word count of each document
    a = []
    for j in arrayall:
        b = 0
        for i in j:
            b = b + i
        a.append(b)
    # tf: divide each count by its own document's total
    # (the original paired every row with a[0] because of a stray break, which skewed the result)
    Tf = []
    for row, total in zip(arrayall, a):
        tf1 = []
        for j in row:
            tf1.append(j / total)
        Tf.append(tf1)
    return Tf

Tf = pandas.DataFrame(data=tf(arrayall), columns=listall1)
TF = tf(arrayall)
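A quick check that follows from the definition: each document's TF row should sum to 1.

for row in TF:
    assert abs(sum(row) - 1.0) < 1e-9  # the term frequencies of one document sum to 1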
4. Compute IDF
import math

def idf(df):
    # reference: blog.csdn.net/kkkkkiko/article/details/80845859
    x = []
    # number of documents containing each word
    a = (df != 0).astype(int).sum(axis=0)
    for i in a.values:
        # 4 is the total number of documents; +1 smooths the denominator
        x.append(math.log(4 / (i + 1)))
    return [x]  # a single row, so it lines up with the vocabulary columns

Idf = pandas.DataFrame(data=idf(df), columns=listall1)  # output as a DataFrame for easy inspection
IDF = idf(df)
5. Compute TF-IDF
def tf_idf(TF, IDF):
    # convert the TF matrix to floats
    c = []
    for i in TF:
        d = []
        for k in i:
            d.append(float(k))
        c.append(d)
    # IDF holds a single row
    b = [float(v) for v in IDF[0]]
    # multiply each document's TF row elementwise by the IDF row
    a = []
    for i in c:
        x = []
        for h in range(len(b)):
            x.append(b[h] * i[h])
        a.append(x)
    return a

Tf_idf = pandas.DataFrame(data=tf_idf(TF, IDF), columns=listall1)
print(tf_idf(TF, IDF))
print(Tf_idf)
(The author ran into an overflow here; pointers from experts are welcome! A likely culprit: in the original loop, h was never incremented, so the while loop never terminated and the result list grew without bound. The version above advances h on every iteration.)
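For comparison, here is a minimal sketch of the sklearn route mentioned in step 3. It assumes the stop-word-filtered token lists list1 through list4 from Part III; note that TfidfVectorizer normalizes rows by default, so its numbers will differ slightly from the hand-rolled version.

from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer expects strings, so rejoin each token list with spaces
docs = [" ".join(tokens) for tokens in (list1, list2, list3, list4)]
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")  # treat every whitespace-separated token as a term
tfidf_matrix = vectorizer.fit_transform(docs)           # sparse matrix: 4 documents x vocabulary size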
V. K-means Clustering
1. Create random centroids
from numpy import *

def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))  # converting with mat lets us use linear-algebra operations later
    for j in range(n):  # create centroids within the bounds of each dimension
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids
2. Compute distances
def distEclud(vecA, vecB):
    return math.sqrt(sum(power(vecA - vecB, 2)))
3. Decide the number of clusters yourself and pick the initial points; distances between points are computed with the Euclidean distance.
# dataSet: the sample points; k: the number of clusters
# distMeas: the distance metric, Euclidean by default
# createCent: how the initial points are chosen
def K_means(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]  # number of samples
    clusterAssment = mat(zeros((m, 2)))  # an m*2 matrix
    centroids = createCent(dataSet, k)  # initialize k centroids
    clusterChanged = True
    while clusterChanged:  # repeat until the assignments stop changing
        clusterChanged = False
        for i in range(m):
            minDist = math.inf; minIndex = -1
            for j in range(k):  # find the nearest centroid
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI; minIndex = j
            if clusterAssment[i, 0] != minIndex: clusterChanged = True
            # column 1: the assigned centroid; column 2: the squared distance to it
            clusterAssment[i, :] = minIndex, minDist ** 2
        print(centroids)  # print the centroids each round to watch them move
        # move each centroid to the mean of the points assigned to it
        for cent in range(k):
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment
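To tie the pieces together, a minimal usage sketch (an assumption, not from the original post): feed the TF-IDF matrix from Part IV into K_means. With only 4 documents, k = 2 is purely for illustration.

dataMat = mat(tf_idf(TF, IDF))                  # 4 documents x vocabulary size
centroids, clusterAssment = K_means(dataMat, 2)
print(clusterAssment[:, 0])                     # the cluster index assigned to each document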
VI. Summary
1. An overflow occurred at the TF-IDF step even though only a few texts were parsed; the method was probably at fault and needed fixing.
2. Some doubts about the K-means code remain to be resolved.
3. My grasp of clustering Chinese short texts is still shaky and needs more practice.
Criticism and corrections are welcome!