Clustering Chinese Short Texts with K-means
I. Overall Workflow
1. Read in the texts and segment them into words.
2. Remove stop words from the segmented texts.
3. Compute word weights with TF-IDF.
4. Cluster the results with K-means.
(Since the author's skill is limited, everything is written in the way that was easiest for me to understand, so it may look clumsy. Apologies.)
II. Reading and Segmenting the Text
1. Reading the text
(1) The texts come from the Sogou news corpus (link: )
(2) Read in the texts (code below):
def read_from_file(file_name):
    # note: pass an explicit encoding= here if the default fails on the Sogou files
    with open(file_name) as fp:
        words = fp.read()
    return words

words = read_from_file("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\sougou_all\\互联网\\1.txt")
words1 = read_from_file("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\sougou_all\\互联网\\2.txt")
words2 = read_from_file("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\sougou_all\\互联网\\3.txt")
words3 = read_from_file("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\sougou_all\\互联网\\4.txt")
listall = [words, words1, words2, words3]
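Since the four calls above repeat the same long path, a glob-based variant can load every .txt in the folder in one pass. A minimal sketch (it reuses the author's directory but is not from the original post):

import glob

# load every .txt under the 互联网 folder, sorted by file name
paths = sorted(glob.glob("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\sougou_all\\互联网\\*.txt"))
listall = [read_from_file(p) for p in paths]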
2. Word segmentation
(1) Install the jieba library: segmentation requires jieba. In PyCharm, open Settings → Project Interpreter, click the + in the upper-right corner, search for jieba, and click Install (or simply run pip install jieba).
(2) Segment the text (code below):
import jieba

def cut_words(words):
    result = jieba.cut(words)
    words = []
    for r in result:
        words.append(r)
    return words
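A quick check of what cut_words returns (the sample sentence is made up; the exact segmentation depends on jieba's version and dictionary):

print(cut_words("我爱自然语言处理"))  # e.g. ['我', '爱', '自然语言', '处理']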
III. Removing Stop Words
1. Download a stop-word list (the HIT stop-word list is recommended) and save it as a .txt file.
2. Read in the stop-word file.
def stop_words(stop_word_file):
    with open(stop_word_file) as fp:
        words = fp.read()
    # print(words)
    result = words.split('\n')
    return result

# the path below points at a directory; the stop-word file name still needs to be appended
stopwords = stop_words("D:\\PyCharm Community Edition 2018.2.4\\python\\day20181127\\")
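A quick sanity check on what was loaded:

print(len(stopwords))   # number of stop words read in
print(stopwords[:5])    # peek at the first few entries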
3. Compare the segmented texts against the stop-word list and drop every word that appears in it, giving us preliminarily preprocessed data.
def del_stop_words(words, stop_words_set):
    new_words = []
    for k in words:
        if k not in stop_words_set:
            new_words.append(k)
    return new_words

# remove stop words from each document
list1 = del_stop_words(cut_words(words), stopwords)
list2 = del_stop_words(cut_words(words1), stopwords)
list3 = del_stop_words(cut_words(words2), stopwords)
list4 = del_stop_words(cut_words(words3), stopwords)
# pool every token so we can build a shared vocabulary
# (the original looped over listall without ever filling list0, so list0 stayed empty)
list0 = list1 + list2 + list3 + list4
4. Because the pooled list combines several documents, it contains duplicates. Converting it to a set removes them, but a set is unordered, so when converting back to a list we sort by each word's first-occurrence index in list0. The result is a fully preprocessed vocabulary.
listall1 = list(set(list0))        # deduplicate
listall1.sort(key=list0.index)     # restore first-occurrence order
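A toy illustration of the dedupe-then-restore-order trick:

tokens = ["网络", "新闻", "网络", "用户"]
vocab = list(set(tokens))
vocab.sort(key=tokens.index)
print(vocab)  # ['网络', '新闻', '用户'] — duplicates gone, original order kept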
IV. TF-IDF
1. Segmentation leaves us with a great many words. Some are insignificant while others act as keywords in the text, so their weights differ; we implement the steps of TF-IDF to obtain a weight for each word.
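Spelled out, the scheme implemented in steps 2–5 below is (with N = 4 documents here; this is read off the code rather than stated explicitly in the original):

$$\mathrm{tf}(t,d)=\frac{n_{t,d}}{\sum_{t'} n_{t',d}},\qquad \mathrm{idf}(t)=\log\frac{N}{\mathrm{df}(t)+1},\qquad \text{tf-idf}(t,d)=\mathrm{tf}(t,d)\cdot\mathrm{idf}(t)$$

where n_{t,d} is the count of word t in document d and df(t) is the number of documents containing t.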
2. First, count word frequencies; the code is roughly as follows.
import pandas

def frequency(strlist, listall1):
    array = []
    for word in listall1:
        array.append(strlist.count(word))
    return array

# count the words in each document
arrayall = [frequency(list1, listall1), frequency(list2, listall1),
            frequency(list3, listall1), frequency(list4, listall1)]
# output the counts as a DataFrame
df = pandas.DataFrame(data=arrayall, columns=listall1)
3. Compute TF. (Steps 3, 4, and 5 can all be skipped: sklearn ships a dedicated TF-IDF implementation, TfidfVectorizer, that can be called directly; a sketch of that route appears at the end of this part. What follows is the author's own implementation of the TF-IDF definition, written to aid understanding. If anything is wrong or can be improved, please point it out, thanks!)
def tf(arrayall):
    # total word count of each document
    a = []
    for j in arrayall:
        b = 0
        for i in j:
            b = b + i
        a.append(b)
    # tf: divide each count by its own document's total
    # (the original paired every row with a[0] because of a stray break, which skewed the result)
    Tf = []
    for row, total in zip(arrayall, a):
        tf1 = []
        for j in row:
            tf1.append(j / total)
        Tf.append(tf1)
    return Tf

Tf = pandas.DataFrame(data=tf(arrayall), columns=listall1)
TF = tf(arrayall)
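A quick check that follows from the definition: each document's TF row should sum to 1.

for row in TF:
    assert abs(sum(row) - 1.0) < 1e-9  # the term frequencies of one document sum to 1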
4. Compute IDF
import math

def idf(df):
    # reference: blog.csdn.net/kkkkkiko/article/details/80845859
    x = []
    # number of documents containing each word
    a = (df != 0).astype(int).sum(axis=0)
    for i in a.values:
        # 4 is the total number of documents; +1 smooths the denominator
        x.append(math.log(4 / (i + 1)))
    return [x]  # a single row, so it lines up with the vocabulary columns

Idf = pandas.DataFrame(data=idf(df), columns=listall1)  # output as a DataFrame for easy inspection
IDF = idf(df)
5. Compute TF-IDF
def tf_idf(TF, IDF):
    # convert the TF matrix to floats
    c = []
    for i in TF:
        d = []
        for k in i:
            d.append(float(k))
        c.append(d)
    # IDF holds a single row
    b = [float(v) for v in IDF[0]]
    # multiply each document's TF row elementwise by the IDF row
    a = []
    for i in c:
        x = []
        for h in range(len(b)):
            x.append(b[h] * i[h])
        a.append(x)
    return a

Tf_idf = pandas.DataFrame(data=tf_idf(TF, IDF), columns=listall1)
print(tf_idf(TF, IDF))
print(Tf_idf)
(The author ran into an overflow here; pointers from experts are welcome! A likely culprit: in the original loop, h was never incremented, so the while loop never terminated and the result list grew without bound. The version above advances h on every iteration.)
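For comparison, here is a minimal sketch of the sklearn route mentioned in step 3. It assumes the stop-word-filtered token lists list1 through list4 from Part III; note that TfidfVectorizer normalizes rows by default, so its numbers will differ slightly from the hand-rolled version.

from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer expects strings, so rejoin each token list with spaces
docs = [" ".join(tokens) for tokens in (list1, list2, list3, list4)]
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")  # treat every whitespace-separated token as a term
tfidf_matrix = vectorizer.fit_transform(docs)           # sparse matrix: 4 documents x vocabulary size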
V. K-means Clustering
1. Create random centroids
from numpy import *

def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))  # converting with mat lets us use linear-algebra operations later
    for j in range(n):  # create centroids within the bounds of each dimension
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids
2. Compute distances
def distEclud(vecA, vecB):
    return math.sqrt(sum(power(vecA - vecB, 2)))
3. Decide the number of clusters yourself and pick the initial points; distances between points are computed with the Euclidean distance.
# dataSet: the sample points; k: the number of clusters
# distMeas: the distance metric, Euclidean by default
# createCent: how the initial points are chosen
def K_means(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]  # number of samples
    clusterAssment = mat(zeros((m, 2)))  # an m*2 matrix
    centroids = createCent(dataSet, k)  # initialize k centroids
    clusterChanged = True
    while clusterChanged:  # repeat until the assignments stop changing
        clusterChanged = False
        for i in range(m):
            minDist = math.inf; minIndex = -1
            for j in range(k):  # find the nearest centroid
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI; minIndex = j
            if clusterAssment[i, 0] != minIndex: clusterChanged = True
            # column 1: the assigned centroid; column 2: the squared distance to it
            clusterAssment[i, :] = minIndex, minDist ** 2
        print(centroids)  # print the centroids each round to watch them move
        # move each centroid to the mean of the points assigned to it
        for cent in range(k):
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment
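To tie the pieces together, a minimal usage sketch (an assumption, not from the original post): feed the TF-IDF matrix from Part IV into K_means. With only 4 documents, k = 2 is purely for illustration.

dataMat = mat(tf_idf(TF, IDF))                  # 4 documents x vocabulary size
centroids, clusterAssment = K_means(dataMat, 2)
print(clusterAssment[:, 0])                     # the cluster index assigned to each document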
VI. Summary
1. An overflow occurred at the TF-IDF step even though only a few texts were parsed; the method was probably at fault and needed fixing.
2. Some doubts about the K-means code remain to be resolved.
3. My grasp of clustering Chinese short texts is still shaky and needs more practice.
Criticism and corrections are welcome!