Intelligent Question Answering Based on a Traditional Chinese Medicine Knowledge Graph (Part 1) -- 688IT编程网


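Acknowledgements: This project was inspired by the code of Liu Huanyong (刘焕勇) and IrvingBei; without their work, this rough little project of mine would not exist. My academic advisor also gave me a great deal of guidance and help along the way. Standing on the shoulders of those who came before us, we can see further.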

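Abstract: Knowledge graphs are increasingly used together with natural language processing, and the combination has become one of the areas that major search engine companies prioritize. Although scientific innovation and the popularization of traditional Chinese medicine (TCM) knowledge are advancing steadily, visualizing, analyzing, and retrieving the complex herbal data of the TCM domain remains a hard problem. This study therefore focuses on Chinese herbal medicine. Using the open Bencao Gangmu (本草纲目) data of vertical TCM websites as its source, it builds a herbal knowledge graph with 9 types of knowledge entities (about 7k entities in total) and 7 types of relations. On top of this graph it implements automatic question answering about herbal knowledge and assisted prescription writing. The system is of practical and reference value for popularizing TCM knowledge among the general public and for providing decision support in TCM clinical practice, research, and teaching.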

What is a knowledge graph?

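A knowledge graph is a semantic network that describes real-world entities and concepts and the relations between them. It makes full use of visualization: it can describe knowledge resources and their carriers, and it can also analyze and depict knowledge and the links between pieces of knowledge. Backed by big-data storage technology, large-scale knowledge graphs combined with data mining, machine learning, and information analysis make it possible to draw and display a complex knowledge domain graphically. Google released its "Knowledge Graph" as early as 2012 to make its search results more intelligent, expressing the information on the Internet in a form closer to the way people understand the world. This marked the successful application of large-scale knowledge to semantic search on the Internet. Knowledge graphs provide a better way to organize, manage, and understand the massive amount of information on the Internet and, together with big data and deep learning, have become a core driving force behind the development of artificial intelligence [1].

Knowledge graphs are divided into general-purpose and domain-specific graphs. Typical general-purpose knowledge graphs include the language-oriented WordNet and the large-scale open knowledge graphs Yago, DBPedia, and Freebase [2-7]. Typical domain-specific knowledge graphs include Kinships, which describes family relations between people, the medical graph UMLS, and the Traditional Chinese Medicine Language System (TCMLS) developed by the Institute of Information on Traditional Chinese Medicine of the China Academy of Chinese Medical Sciences.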
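To make the idea of a semantic network concrete, here is a minimal sketch of a knowledge graph stored as (head, relation, tail) triples. The entity and relation names are placeholders invented for this illustration and do not come from the project's data.

```python
from collections import defaultdict

# Placeholder triples (head, relation, tail); all names are made up for illustration.
TRIPLES = [
    ("herb_A", "treats", "symptom_X"),
    ("herb_A", "has_alias", "alias_1"),
    ("herb_B", "treats", "symptom_X"),
    ("prescription_P", "contains", "herb_A"),
]


def build_index(triples):
    """Index the triples by head (outgoing edges) and by tail (incoming edges)."""
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for head, relation, tail in triples:
        outgoing[head].append((relation, tail))
        incoming[tail].append((relation, head))
    return outgoing, incoming


def heads_for(incoming, relation, tail):
    """Return every entity linked to `tail` through `relation`, sorted by name."""
    return sorted(head for rel, head in incoming[tail] if rel == relation)


outgoing, incoming = build_index(TRIPLES)
print(heads_for(incoming, "treats", "symptom_X"))  # ['herb_A', 'herb_B']
```

A question such as "which herbs treat symptom X?" then becomes a lookup over the incoming `treats` edges of one node, which is the kind of query a graph database answers directly.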

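Among these, TCMLS is an integrated language system for TCM that is built around the TCM discipline system, follows the linguistic characteristics of TCM, and borrows the idea of a semantic network. It contains about 100,000 concepts, 300,000 terms, and 1.27 million semantic relations [8]. In recent years, some researchers have used TCMLS as a backbone to develop intelligent applications based on TCM knowledge graphs.

Project workflow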

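This study uses two vertical websites, 中药网 (a Chinese herbal medicine site) and the A+ medical site (A+医学), as its main data sources. Crawler scripts scrape the sites and convert the data into structured form, and the Neo4j graph database is then used to build the herbal knowledge graph. Herbal question answering and assisted prescribing are implemented through several key steps: rule-based matching, keyword matching, and question classification. To improve the user experience and make the system more visual, a front-end question answering interface was built with the web.py framework.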
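The keyword matching and question classification steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the herb vocabulary, the keyword lists, the question-type names, the `Herb` label, and the property names in the Cypher templates are all hypothetical, and plain substring matching stands in for whatever matching algorithm the project uses.

```python
# Hypothetical entity vocabulary; a real system would load it from the graph.
HERBS = {"甘草", "黄芪", "当归"}

# Hypothetical keyword lists, one per question type.
QUESTION_KEYWORDS = {
    "herb_cure": ["主治", "治什么", "功效"],
    "herb_alias": ["别名", "又叫", "又称"],
    "herb_smell": ["气味", "性味"],
}

# Hypothetical Cypher templates; the label and property names are assumptions.
CYPHER_TEMPLATES = {
    "herb_cure": "MATCH (h:Herb {name: $name}) RETURN h.cure",
    "herb_alias": "MATCH (h:Herb {name: $name}) RETURN h.alias",
    "herb_smell": "MATCH (h:Herb {name: $name}) RETURN h.smell",
}


def classify(question):
    """Return the herbs mentioned in the question and the question types it matches."""
    entities = [herb for herb in sorted(HERBS) if herb in question]
    types = [
        qtype
        for qtype, keywords in QUESTION_KEYWORDS.items()
        if any(keyword in question for keyword in keywords)
    ]
    return entities, types


def build_queries(question):
    """Turn a question into (cypher, parameters) pairs, one per entity and type."""
    entities, types = classify(question)
    return [
        (CYPHER_TEMPLATES[qtype], {"name": entity})
        for entity in entities
        for qtype in types
    ]


for query, params in build_queries("甘草的主治是什么"):
    print(query, params)
```

A question that mentions no known herb or no known keyword yields an empty list, which the front end could answer with a fallback message.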

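Overall design

The design of the system is divided into two parts: building the TCM knowledge graph and building the intelligent question answering module.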

Building the TCM knowledge graph

- Data acquisition and preprocessing

- Building the TCM knowledge graph with Neo4j

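Data acquisition and preprocessing

For a general-purpose knowledge graph, the main data source is open data on web pages. This study takes a bottom-up approach: knowledge entities are first identified in the web pages and then generalized into a suitable data schema. The data on the two websites is mostly structured, so the relevant entities and attribute concepts can easily be extracted from the page structure while crawling. The structured page data is parsed with xpath and assigned the corresponding labels. The results are stored as Excel sheets, and the sheet data is then imported into the Neo4j database.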
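Before the full scripts, the extraction idea can be shown on a small inline page. This sketch uses the standard-library `xml.etree.ElementTree` (which supports only a subset of XPath) instead of lxml so that it runs without extra dependencies. The sample page, its field labels, and the helper names are invented for the illustration and only mimic the layout the scraper expects: a `firstHeading` title and labelled paragraphs under `bodyContent`.

```python
import xml.etree.ElementTree as ET

# Invented sample page that mimics the assumed layout of an entry page.
SAMPLE_PAGE = """
<html><body>
  <h1 id="firstHeading">本草纲目/示例药</h1>
  <div id="bodyContent">
    <p>「释名」示例别名</p>
    <p>「气味」示例气味</p>
    <p>「主治」示例主治</p>
  </div>
</body></html>
"""


def field_after_label(paragraph):
    """Return the paragraph text that follows the closing bracket of its label."""
    text = "".join(paragraph.itertext())
    parts = text.split("」", 1)
    return parts[1].strip() if len(parts) > 1 else ""


def parse_page(page_source):
    """Extract name, alias, smell, and cure from one entry page."""
    root = ET.fromstring(page_source.strip())
    title = root.findtext(".//*[@id='firstHeading']", default="")
    paragraphs = root.findall(".//*[@id='bodyContent']/p")
    fields = [field_after_label(p) for p in paragraphs[:3]]
    fields += [""] * (3 - len(fields))  # pad when a page has fewer paragraphs
    alias, smell, cure = fields
    return {
        "name": title.split("/")[-1],  # keep the part to the right of '/'
        "alias": alias,
        "smell": smell,
        "cure": cure,
    }


record = parse_page(SAMPLE_PAGE)
print(record)
```

Each record maps directly onto one spreadsheet row with the columns name, alias, smell, and cure.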

    import requests
    from lxml import etree
    from openpyxl import Workbook


    class zhongyao():

        def __init__(self):
            # URL as printed in the source; complete the scheme and host before running.
            self.url = "www.a-hospital/w/%E6%9C%AC%E8%8D%89%E7%BA%B2%E7%9B%AE"
            self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"}

        def get_parse_html(self, url):
            """Download a page and convert its source into an object that supports xpath."""
            response = requests.get(url, headers=self.headers)
            html_text = response.text
            parse_html = etree.HTML(html_text)
            return parse_html

        def get_url(self):
            en = self.get_parse_html(self.url)
            text_title = en.xpath('//h3/span[@class="mw-headline"]/text()')
            # Paragraphs p[5]..p[19] of the index page each hold the links of one category.
            text_urls = [en.xpath('//*[@id="bodyContent"]/p[%d]/a/@href' % i)
                         for i in range(5, 20)]
            # Only the first category is crawled per run; change the index to crawl another.
            self.get_nate(text_urls[0])

        def get_text_alias(self, parse_html_text):
            try:
                text_alias = parse_html_text.xpath('//*[@id="bodyContent"]/p[1]/text()')
                text_alias = ''.join(text_alias)
                alias_data = text_alias.split('」')[1]
            except Exception:
                return ""
            return alias_data

        def get_text_smell(self, parse_html_text):
            try:
                text_smell = parse_html_text.xpath('//*[@id="bodyContent"]/p[2]/text()')
                text_smell = ''.join(text_smell)
                smell_data = text_smell.split('」')[1]
            except Exception:
                return ""
            return smell_data

        def get_text_cure(self, parse_html_text):
            try:
                text_cure1 = ''.join(parse_html_text.xpath('//*[@id="bodyContent"]/p[3]/text()'))
                text_cure2 = parse_html_text.xpath('string(//*[@id="bodyContent"]/p[4])')
                text_cure3 = parse_html_text.xpath('string(//*[@id="bodyContent"]/p[5])')
                text_cure4 = parse_html_text.xpath('string(//*[@id="bodyContent"]/p[6])')
                new_cure = text_cure1 + text_cure2 + text_cure3 + text_cure4
                new_cure = new_cure.split('」')[1]
            except Exception:
                return ""  # some entries have no indications section
            return new_cure

        def get_nate(self, text_url):
            count = 0
            rows = []
            for link in text_url:  # iterate over the on-site links and build each full entry URL
                t_url = "www.a-hospital" + link
                parse_html_text = self.get_parse_html(t_url)
                text_name = parse_html_text.xpath('//*[@id="firstHeading"]/text()')
                text_name = ''.join(text_name)
                text_name = text_name.split('/')[1]  # keep the part to the right of '/'
                text_alias = self.get_text_alias(parse_html_text)
                text_smell = self.get_text_smell(parse_html_text)
                text_cure = self.get_text_cure(parse_html_text)
                rows.append([text_name, text_alias, text_smell, text_cure])

            # Create the workbook once, write the header row, then one row per entry.
            wb = Workbook()
            ws = wb.active
            ws.append(['name', 'alias', 'smell', 'cure'])
            for new_row in rows:
                ws.append(new_row)
                count += 1
                print(count)
            wb.save('草部.xlsx')


    if __name__ == "__main__":
        spider = zhongyao()
        spider.get_url()

The script above fetches the data from A+ Medical Encyclopedia (A+医学百科).

    import requests
    from lxml import etree
    from openpyxl import Workbook


    class zhongyao():

        def __init__(self):
            # URLs as printed in the source; complete the scheme and host before running.
            self.url = "www.zhzyw/zyts/pfmf/Index.html"
            self.basic_url = "www.zhzyw/"
            self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"}

        def get_parse_html(self, url):
            """Download a page and convert its source into an object that supports xpath."""
            response = requests.get(url, headers=self.headers)
            html_text = response.text
            parse_html = etree.HTML(html_text)
            return parse_html

        def get_one(self, url):
            """Join a relative link onto the base URL and return the parsed page."""
            text_url1one = self.basic_url + ''.join(url)
            en = self.get_parse_html(text_url1one)
            return en

        def get_url(self):
            en = self.get_parse_html(self.url)
            # Items li[1]..li[7] of the index list each link to one department.
            text_urls = [en.xpath('//*[@id="title"]/ul/li[%d]/a/@href' % i)
                         for i in range(1, 8)]
            # Only the seventh department is crawled per run; change the index to crawl another.
            enone = self.get_one(text_urls[6])
            # All folk prescriptions listed under that department.
            text_urltwo = enone.xpath('//*[@id="left"]/div[4]/ul/li/a/@href')
            self.get_nate(text_urltwo)

        def get_text_part(self, parse_html_text):
            try:
                text_alias = parse_html_text.xpath('//*[@id="wzdh"]/a[5]/text()')
                text_alias = ''.join(text_alias)
            except Exception:
                return "没到"  # placeholder meaning "not found"
            return text_alias

        def get_text_drug(self, parse_html_text):
            try:
                text_smell = parse_html_text.xpath('//*[@id="left"]/h1/text()')
                text_smell = ''.join(text_smell)
            except Exception:
                return "不到"  # placeholder meaning "not found"
            return text_smell

        def get_text_cure(self, parse_html_text):
            try:
                text_cure1 = parse_html_text.xpath('//*[@id="left"]/div[2]/text()')
                text_cure1 = ''.join(text_cure1)
            except Exception:
                return "不到"  # placeholder meaning "not found"
            return text_cure1

        def get_nate(self, text_url):
            count = 0
            rows = []
            text_part = ""
            for link in text_url:  # iterate over the on-site links and build each full page URL
                t_url = self.basic_url + link
                parse_html_text = self.get_parse_html(t_url)
                text_name = parse_html_text.xpath('//*[@id="left"]/div[3]/ul/li/a/@href')
                for t_link in text_name:
                    all_url = self.basic_url + t_link
                    all_html_text = self.get_parse_html(all_url)
                    text_part = self.get_text_part(all_html_text)
                    text_drug = self.get_text_drug(all_html_text)
                    text_prescript = self.get_text_cure(all_html_text)
                    rows.append([text_drug, text_prescript, text_part])

            # Create the workbook once, write the header row, then one row per prescription.
            wb = Workbook()
            ws = wb.active
            ws.append(['drug', 'prescript', 'part'])
            for new_row in rows:
                ws.append(new_row)
                count += 1
                print(count)
            wb.save(text_part + '.xlsx')  # the file is named after the last part value seen


    if __name__ == "__main__":
        spider = zhongyao()
        spider.get_url()

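The script above fetches the data from the TCM website (中医药网).

I wrote this code last year. The websites' markup has changed somewhat since then, and my coding skills were not great at the time, so anyone who is up to it is welcome to refactor this part. (Looking at this code now, my only thought is: good grief, who wrote this?)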
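The import step mentioned in the data acquisition section (loading the spreadsheet rows into Neo4j) is not shown in this part of the series. As a rough sketch only, the rows could be turned into parameterized Cypher `MERGE` statements like this. The `Herb` label, the property names, and the helper functions are assumptions made for the illustration, and the rows are given inline instead of being read from the Excel files.

```python
# Assumed Cypher statement; the label and property names are hypothetical.
HERB_QUERY = (
    "MERGE (h:Herb {name: $name}) "
    "SET h.alias = $alias, h.smell = $smell, h.cure = $cure"
)


def herb_to_statement(row):
    """Build one (query, parameters) pair from a row with the scraper's columns."""
    params = {key: row.get(key, "") for key in ("name", "alias", "smell", "cure")}
    return HERB_QUERY, params


def rows_to_statements(rows):
    """Skip rows without a name and repeated names, then build the statements."""
    seen = set()
    statements = []
    for row in rows:
        name = (row.get("name") or "").strip()
        if not name or name in seen:
            continue
        seen.add(name)
        statements.append(herb_to_statement({**row, "name": name}))
    return statements


# Placeholder rows standing in for the contents of one spreadsheet.
sample_rows = [
    {"name": "herb_A", "alias": "alias_1", "smell": "smell_1", "cure": "cure_1"},
    {"name": "herb_A", "alias": "duplicate", "smell": "", "cure": ""},
    {"name": "", "alias": "no name", "smell": "", "cure": ""},
]

for query, params in rows_to_statements(sample_rows):
    print(query, params)
```

Using `MERGE` on the name keeps the import idempotent, so rerunning it after a new crawl updates existing nodes instead of duplicating them. With the official Neo4j Python driver, each pair could be passed to `session.run(query, params)`.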
