Python crawler study (3): using the re library to scrape Taobao products and write the results to a txt file - 688IT编程网


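The second example uses the requests library together with the re library to scrape product information from Taobao's search results page.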

(1) Analyzing the page source

Open Taobao, enter the keyword "python", and search; the search results are displayed.

Then page through the results. On the second page the URL becomes:

On the third page the URL becomes:

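Comparing the URLs shows that the parameter that changes when paging is s: with each page, s grows by a multiple of 44 (count the products shown on one page and you get exactly 44).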

So the "s=" parameter can be used to set the crawl depth (how many pages to fetch).
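The paging rule above can be sketched as a small helper that builds one URL per page. This is only an illustration: the function name build_page_urls and the constant ITEMS_PER_PAGE are made up here, and the base URL is an assumption about the Taobao search address.

```python
from urllib.parse import urlencode

ITEMS_PER_PAGE = 44  # products per results page, as counted above


def build_page_urls(base_url, keyword, depth):
    """Return one search URL per page; page i starts at offset 44 * i."""
    urls = []
    for page in range(depth):
        query = urlencode({"q": keyword, "s": ITEMS_PER_PAGE * page})
        urls.append(base_url + "?" + query)
    return urls


# Assumed base URL of the Taobao search page (hypothetical for this sketch)
urls = build_page_urls("https://s.taobao.com/search", "python", 3)
for u in urls:
    print(u)
```

With a depth of 3 the offsets are 0, 44 and 88, so the first page is requested with s=0.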

Right-click and view the page source:

Work out which keys hold the product name and the product price:

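Candidate keys for the product name are "title" and "raw_title"; after checking the names of a few more products, "raw_title" turns out to be the better choice. The product price is clearly "view_price" (confirmed by comparing with the Taobao product listing page).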

So the product name and product price appear as key/value pairs of the form "raw_title":"name" and "view_price":"price".
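To see how a regular expression picks out these key/value pairs, here is a minimal sketch run against a made-up fragment rather than a real Taobao page. The sample string and its product names are invented for illustration.

```python
import re

# Made-up fragment imitating the key/value pairs described above
sample = (
    '{"raw_title":"Python basics","view_price":"59.00"},'
    '{"raw_title":"Learn Python","view_price":"35.5"}'
)

# .*? is non-greedy, so each match stops at the first closing quote
titles = re.findall(r'"raw_title":".*?"', sample)
# [\d.]* matches only digits and the decimal point
prices = re.findall(r'"view_price":"[\d.]*"', sample)

print(titles)
print(prices)
```

Each match still includes the key, which is why a later step has to strip it off.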

(2) Working out the implementation

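Unlike the previous example, which scraped the "best universities ranking", Taobao's product information is not embedded as HTML the way the university data was. The product data here is not wrapped in HTML tags but placed directly in the page as script data, so there is no need to parse it with BeautifulSoup; a regular expression can extract the key information directly.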

(3) Extracting the information

Write a demo to see how the information is parsed step by step:

# coding:utf-8
import requests
import re

goods = '水杯'  # search keyword ("water cup")
url = 'https://s.taobao.com/search?q=' + goods
r = requests.get(url=url, timeout=10)
html = r.text
tlist = re.findall(r'\"raw_title\"\:\".*?\"', html)  # extract product names with a regex
plist = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)  # extract product prices with a regex
print(tlist)
print(plist)
print(type(plist))  # re.findall returns the product names and prices as lists

Strip the keys from the list entries and keep only the values, that is, remove "raw_title" and "view_price" from each item:

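print('Key/value pair of the first product:', tlist[0])  # inspect the first product's key/value pair
a = tlist[0].split(':')[1]  # split() on ":" separates key from value; take the value, i.e. the product name
print('Name of the first product:', a)
print(type(a))  # check the type of a
b = eval(a)  # eval() evaluates the quoted string literal, which removes the surrounding quotes
print('Product name with the quotes removed:', b)  # see the result after removing the quotes
print(type(b))  # check the type of b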
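The split() plus eval() approach works for simple names, but eval() executes whatever expression it is given, which is risky on text fetched from the network, and split(':') cuts a name that itself contains a colon. One possible alternative, not the method used in this article, is a regex capture group, sketched here on a made-up fragment:

```python
import re

# Made-up fragment; note the colon inside the first product name
sample = (
    '"raw_title":"Cup: 500ml","view_price":"12.80",'
    '"raw_title":"Glass mug","view_price":"9.9"'
)

# The parentheses capture only the value, so findall returns the values directly
titles = re.findall(r'"raw_title":"(.*?)"', sample)
prices = re.findall(r'"view_price":"([\d.]*)"', sample)

print(titles)
print(prices)
```

With a single capture group, re.findall returns just the captured text, so no key or quotes remain to strip.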

Use a for loop to combine each product's name and price into a list, then append these lists to one big list:

goodlist = []
for i in range(len(tlist)):
    title = eval(tlist[i].split(':')[1])  # eval() is used here simply to strip the quotes from the string
    price = eval(plist[i].split(':')[1])
    goodlist.append([title, price])  # pair each product's name and price in a small list and append it to the big list
print(goodlist)
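The index-based loop assumes tlist and plist have the same length; if the page yields one more name than prices, plist[i] raises an IndexError. A hedged variant using zip(), which stops at the shorter list, is sketched below. The helper value_of and the sample lists are hypothetical.

```python
# Hypothetical matches; tlist deliberately has one extra entry
tlist = ['"raw_title":"Glass mug"', '"raw_title":"Steel bottle"', '"raw_title":"Extra"']
plist = ['"view_price":"9.9"', '"view_price":"25.00"']


def value_of(pair):
    """Return the value part of a '"key":"value"' string, without quotes."""
    # maxsplit=1 keeps any colon that appears inside the value itself
    return pair.split(':', 1)[1].strip('"')


# zip() pairs entries position by position and stops at the shorter list
goodlist = [[value_of(t), value_of(p)] for t, p in zip(tlist, plist)]
print(goodlist)
```

The unmatched third name is silently dropped, which may or may not be what you want; checking len(tlist) == len(plist) first is another option.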

Complete code:

# coding: utf-8
import requests
import re

# def getHTMLText(url):
#     try:
#         r = requests.get(url, timeout=30)
#         r.raise_for_status()
#         r.encoding = r.apparent_encoding
#         return r.text
#     except:
#         return ""
#
# def parsePage(ilt, html):
#     try:
#         plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
#         tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
#         for i in range(len(plt)):
#             price = eval(plt[i].split(':')[1])
#             title = eval(tlt[i].split(':')[1])
#             ilt.append([price, title])
#     except:
#         print()
#
# def printGoodsList(ilt):
#     tplt = "{:4}\t{:8}\t{:16}"
#     print(tplt.format("No.", "Price", "Product name"))
#     count = 0
#     for t in ilt:
#         count = count + 1
#         print(tplt.format(count, t[0], t[1]))
#
# def main():
#     goods = '高达'
#     depth = 3
#     start_url = 'https://s.taobao.com/search?q=' + goods
#     infoList = []
#     for i in range(depth):
#         try:
#             url = start_url + '&s=' + str(44 * i)
#             html = getHTMLText(url)
#             parsePage(infoList, html)
#         except:
#             continue
#     printGoodsList(infoList)
#
# main()

def get_html(url):
    """Fetch the page source (HTML)."""
    try:
        r = requests.get(url=url, timeout=10)
        return r.text
    except:
        print("Fetch failed")
        return ""


def get_data(html, goodlist):
    """Parse product names and prices with the re library.
    tlist: list of product names
    plist: list of product prices"""
    tlist = re.findall(r'\"raw_title\"\:\".*?\"', html)
    plist = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
    for i in range(len(tlist)):
        title = eval(tlist[i].split(':')[1])  # eval() is used here simply to strip the quotes from the string
        price = eval(plist[i].split(':')[1])
        goodlist.append([title, price])


def write_data(list, num):
    # with open('E:/Crawler/', 'a') as data:
    #     print(list, file=data)
    for i in range(num):  # num controls how many of the scraped products are written to the text file
        u = list[i]
        # NOTE: the path below has no file name; append the target txt file name before running
        with open('E:/Crawler/', 'a') as data:
            print(u, file=data)


def main():
    goods = '水杯'  # search keyword ("water cup")
    depth = 3  # crawl depth, i.e. how many pages to fetch
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            # Taobao shows 44 products per page; the first page has i=0, and i increases by one per page
            url = start_url + '&s=' + str(44 * i)
            html = get_html(url)
            get_data(html, infoList)
        except:
            continue
    write_data(infoList, len(infoList))


if __name__ == '__main__':
    main()
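The writing step can also be tried on its own, without touching the network. The sketch below is an assumption-laden variant, not the article's write_data: the function write_goods, the sample data and the temporary output path are all hypothetical. It opens the file once and writes one tab-separated line per product.

```python
import os
import tempfile


def write_goods(goods, path, limit=None):
    """Append up to `limit` [title, price] pairs to `path`, one per line."""
    rows = goods if limit is None else goods[:limit]
    # Open the file once; utf-8 keeps Chinese product names intact
    with open(path, "a", encoding="utf-8") as f:
        for title, price in rows:
            f.write(f"{title}\t{price}\n")
    return len(rows)


# Hypothetical sample data and a temporary output file
sample_goods = [["Glass mug", "9.9"], ["Steel bottle", "25.00"], ["水杯", "12.80"]]
out_path = os.path.join(tempfile.mkdtemp(), "goods.txt")
written = write_goods(sample_goods, out_path, limit=2)

with open(out_path, encoding="utf-8") as f:
    content = f.read()
print(content)
```

Opening the file once per call, rather than once per product, avoids repeated open/close overhead when many products are written.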
