Scraping the content and images of all answers under a Zhihu question or favorites folder with Python Scrapy
The previous post walked through scraping Zhihu question metadata. This post covers scraping the content and images of every answer under a question; the overall flow is similar, but some of the core code differs.
The flow for scraping all answers to a single question is roughly as follows:
1. Start from a question URL
2. Request the URL and get the number of answers under the question (I skip this step, since the answer count was already saved while scraping the question info)
3. Fetch answers through the answer API (at 5 answers per request with 100 answers total, that works out to 20 API calls) [the answer API address is shown below]
4. Save the content returned by the answer API to MySQL
5. Extract the image URLs from the content and save the images locally
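The arithmetic in step 3 can be sketched as a small helper (the function name is mine, not from the post):

```python
import math

def answer_api_offsets(answer_num, page_size=5):
    """Offsets to request so that every answer is covered exactly once."""
    pages = math.ceil(answer_num / page_size)
    return [i * page_size for i in range(pages)]

print(answer_api_offsets(100))  # 20 offsets: 0, 5, ..., 95
```

With 100 answers and a page size of 5 this yields the 20 requests mentioned above; a remainder (e.g. 101 answers) simply adds one more page.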
Zhihu scraping code:
Read the question ids from the MySQL database, then call the answer API directly to fetch the data.
answer_template = "https://www.zhihu.com/api/v4/questions/%s/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nobody,upvoted_followees;data[*].mark_infos[*].url;data[*].author.follower_count,badge[?(type=best_answerer)].topics&limit=5&offset=%s&sort_by=default"
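To show how the two `%s` placeholders get filled in, here is a trimmed-down sketch of the same template (the include list is shortened for readability and the question id is made up):

```python
# Trimmed version of the answer API template; the question id is hypothetical
answer_template = ("https://www.zhihu.com/api/v4/questions/%s/answers"
                   "?include=data[*].content,voteup_count,comment_count"
                   "&limit=5&offset=%s&sort_by=default")

url = answer_template % ("39763970", "10")  # third page of 5 answers
print(url)
```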
def check_login(self, response):
    # Read the question info from MySQL and crawl from there
    # (requires pymysql, imported at the top of the spider)
    db = pymysql.connect("localhost", "root", "", "crawl", charset='utf8')
    cursor = db.cursor()
    selectsql = "select questionid, answer_num from zhihu_question where id in (251,138,93,233,96,293,47,24,288,151,120,311,214,33);"
    try:
        cursor.execute(selectsql)
        results = cursor.fetchall()
        for row in results:
            questionid = row[0]
            answer_num = row[1]
            fornum = answer_num // 5  # number of answer-API requests needed
            print("questionid: " + str(questionid) + "  answer_num: " + str(answer_num))
            for i in range(fornum + 1):
                answer_url = self.answer_template % (str(questionid), str(i * 5))
                yield scrapy.Request(answer_url, callback=self.parse_answer, headers=self.headers)
    except Exception as e:
        print(e)
    db.close()
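One Python 3 pitfall in the loop above: `/` is true division and returns a float, which `range()` rejects, so floor division `//` is needed, and `range(fornum + 1)` adds one extra request to cover any remainder page:

```python
# Python 3 division semantics behind `fornum = answer_num // 5`
assert 100 / 5 == 20.0    # true division: a float, unusable as a range() bound
assert 100 // 5 == 20     # floor division: an int
assert 101 // 5 == 20     # the 101st answer is picked up by range(fornum + 1)
```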
Parsing the response
parse_answer parses the content returned by the API. This part is straightforward, since the response is JSON.
The code is as follows:
def parse_answer(self, response):
    # While testing, write the response body to a local file, then exercise it
    # from a standalone Python main method; the test methods live in the
    # test_code directory.
    #temfn = str(random.randint(0, 100))
    #f = open("/var/www/html/scrapy/answer/" + temfn, 'wb')
    #f.write(response.body)
    #f.write("------")
    #f.close()
    res = json.loads(response.text)
    #print(res)
    data = res['data']
    # Each response carries several answers (5 by default), so iterate
    for od in data:
        #print(od)
        item = AnswerItem()
        item['answer_id'] = str(od['id'])  # answer id
        item['question_id'] = str(od['question']['id'])
        item['question_title'] = od['question']['title']
        item['author_url_token'] = od['author']['url_token']
        item['author_name'] = od['author']['name']
        item['voteup_count'] = str(od['voteup_count'])
        item['comment_count'] = str(od["comment_count"])
        item['content'] = od['content']
        yield item
        testh = etree.HTML(od['content'])
        itemimg = MyImageItem()
        itemimg['question_answer_id'] = str(od['question']['id']) + "/" + str(od['id'])
        itemimg['image_urls'] = testh.xpath("//img/@data-original")
        yield itemimg
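The image extraction above uses lxml's XPath `//img/@data-original`. As a self-contained illustration of the same idea (not the post's actual code), the standard library's `HTMLParser` can collect the same attribute:

```python
from html.parser import HTMLParser

class ImgOriginalParser(HTMLParser):
    """Collect the data-original attribute of every <img> tag."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "img":
            url = dict(attrs).get("data-original")
            if url:
                self.urls.append(url)

parser = ImgOriginalParser()
parser.feed('<p>hi</p><img src="thumb.jpg" data-original="https://pic1.zhimg.com/full.jpg">')
print(parser.urls)  # ['https://pic1.zhimg.com/full.jpg']
```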
Results
In total I scraped 40,000+ answers and 12 GB of images (my personal server only had 12 GB of space left~).
Scraping the answer content and images under a favorites folder:
The flow for scraping the answers in a favorites folder is basically the same as for a question. The differences:
1. A question spider starts from multiple start_urls; favorites folders are crawled one at a time.
2. On the question page I found a content API that returns JSON, which is convenient. On the favorites page there is no such API (at least I couldn't find one), so I request each page and parse the HTML. Construct the start address of each page:
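A minimal sketch of building the per-page start addresses for one collection; the page size and the exact URL pattern (`https://www.zhihu.com/collection/<id>?page=N`) are my assumptions, not taken from the post:

```python
# Hypothetical pattern for a favorites-folder (collection) listing page
collection_template = "https://www.zhihu.com/collection/%s?page=%s"

def collection_page_urls(collection_id, total_answers, per_page=10):
    """Build the URL of every listing page in a favorites folder."""
    pages = (total_answers + per_page - 1) // per_page  # ceiling division
    return [collection_template % (collection_id, page) for page in range(1, pages + 1)]

print(collection_page_urls("123456", 25))  # 3 pages: ?page=1 .. ?page=3
```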
Core code for parsing the HTML: