Python Web Page Text Crawler -- 688IT编程网

Python Crawlers

1. Introduction to Python crawlers

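A web crawler (also called a web spider or web robot) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. The major search engines all use crawlers to cache URLs and provide search services. Advanced crawlers are technically demanding and have to deal with many concerns, such as connection optimization, proxy servers, crawl optimization for large data volumes, and the design of per-site crawl rules. A basic crawler, by contrast, only needs to fetch, save, and process information, and its crawl rules are usually very simple.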

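Take crawling a novel site as an example. You first need some Python basics, such as using urllib. You also need to be able to do string operations in Python, with regular expressions for the slightly more complex cases. Finally, you need basic program logic. With these three skills you can start crawling novels.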

Crawler code example

Here is the complete code first:

import sys
import urllib.request


def getHtml(url):
    # Download the page; novel sites usually serve GBK-encoded HTML.
    page = urllib.request.urlopen(url)
    html = page.read()
    return html.decode('gbk') + '\r\n'


def interstr(src, begin, end):
    # Return the text between the first `begin` and the next `end`, or None.
    index1 = src.find(begin)
    if index1 == -1:
        return None
    index1 += len(begin)
    tmp = src[index1:]
    index2 = tmp.find(end)
    if index2 == -1:
        return None
    dst = tmp[:index2]
    return dst


def getTitle(html):
    # Chapter title; the markers are specific to the target site.
    title = interstr(html, 'title = " ', '";</script>')
    if title is None:
        return None
    return title


def getNextPage(html):
    # Build the next chapter's URL; relies on the global `url` set in __main__.
    pageNum = interstr(html, 'next_page = "', '.html";')
    bookID = interstr(html, 'bookid = "', '";')
    if pageNum is None or bookID is None:
        return None
    nextPage = url + bookID + pageNum
    return nextPage


def getContent(html):
    # Chapter body with the site's script tag and line-break markup stripped.
    data = interstr(html, '<div id="content">', '</div>')
    if data is None:
        return None
    data = data.replace('<script>read();</script>', '')
    data = data.replace(' ', '\n')
    data = data.replace('<br /><br />', '')
    return data + '\n'


def forstr(src, begin, end):
    # Collect every substring found between a `begin` and the following `end`.
    tmpSrc = src
    strList = []
    while True:
        indexBegin = tmpSrc.find(begin)
        if indexBegin == -1:
            break
        indexBegin += len(begin)
        tmp = tmpSrc[indexBegin:]
        indexEnd = tmp.find(end)
        if indexEnd == -1:
            break
        tmpString = tmp[:indexEnd]
        strList.append(tmpString)
        tmpSrc = tmp
    return strList


if __name__ == '__main__':
    book = sys.argv[1]
    url = sys.argv[2]
    nextPage = url + book
    html = getHtml(nextPage)
    # Chapter index: the <dl> list inside the "list" container.
    listSrc = interstr(html, '<div id="list">', '</div>')
    listSrc = interstr(listSrc, '<dl>', '</dl>')
    print(listSrc)
    myList = forstr(listSrc, '<a href="', '">')
    print(myList)
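To make the last step of the script concrete, the sketch below runs the same find-and-slice idea on a small, made-up chapter index. It uses the start argument of str.find instead of copying the remaining string on every pass. The HTML fragment and the name `find_all_between` are illustrative assumptions, not taken from any real site.

```python
def find_all_between(src, begin, end):
    """Collect every substring that sits between `begin` and the next `end`."""
    results = []
    pos = 0
    while True:
        start = src.find(begin, pos)
        if start == -1:
            break
        start += len(begin)
        stop = src.find(end, start)
        if stop == -1:
            break
        results.append(src[start:stop])
        pos = stop  # continue searching after this match
    return results


# Hypothetical chapter index, shaped like the <dl> block the script prints.
sample_index = (
    '<dl>'
    '<dd><a href="1001.html">Chapter 1</a></dd>'
    '<dd><a href="1002.html">Chapter 2</a></dd>'
    '</dl>'
)

chapter_links = find_all_between(sample_index, '<a href="', '">')
print(chapter_links)  # ['1001.html', '1002.html']
```

Each entry in the resulting list is a relative link that still has to be joined with the book's base address before it can be downloaded.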

2. Code walkthrough

def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html.decode('gbk') + '\r\n'

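This function downloads an HTML page. The url parameter is the target address, here the address of a novel chapter, and the download is done with the urllib library. What comes back is the page source. Novel sites usually serve GBK-encoded pages, so the Chinese characters in the source have to be transcoded from GBK before they can be processed and saved as UTF-8.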
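As an illustration of the transcoding step, the following sketch decodes a byte string and re-encodes it as UTF-8. It is written for Python 3 and works on in-memory bytes rather than a live download. The helper name `to_utf8` and the choice of `errors='replace'` are assumptions made for this example.

```python
def to_utf8(raw_bytes, source_encoding='gbk'):
    """Decode bytes in `source_encoding` and return them as UTF-8 bytes.

    Undecodable bytes become U+FFFD instead of raising, so one bad
    character does not abort a whole chapter.
    """
    text = raw_bytes.decode(source_encoding, errors='replace')
    return text.encode('utf-8')


# Simulate a GBK-encoded page body instead of hitting a real site.
gbk_page = '第一章 开始'.encode('gbk')
utf8_page = to_utf8(gbk_page)

print(utf8_page.decode('utf-8'))  # 第一章 开始
```

Whether to replace or reject undecodable bytes is a design choice. Replacing keeps a long crawl running, while strict decoding surfaces pages that are not GBK at all.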

def interstr(src, begin, end):
    index1 = src.find(begin)
    if index1 == -1:
        return None
    index1 += len(begin)
    tmp = src[index1:]
    index2 = tmp.find(end)
    if index2 == -1:
        return None
    dst = tmp[:index2]
    return dst

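This function extracts a substring: it returns the data in src that lies between the begin and end strings. The approach is basic and not very efficient, but the idea is simple and everything is implemented by hand.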
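For comparison, the same extraction can be sketched with `str.partition`, which avoids the manual index arithmetic. The function name `between` and the sample string are hypothetical.

```python
def between(src, begin, end):
    """Return the text between the first `begin` and the next `end`, or None."""
    _, found_begin, rest = src.partition(begin)
    if not found_begin:
        return None
    middle, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    return middle


sample = 'var title = "Chapter 1";</script>'
print(between(sample, 'title = "', '";'))   # Chapter 1
print(between(sample, 'author = "', '";'))  # None
```

`partition` returns an empty separator when the marker is missing, which is what the two `if not` checks rely on.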

def getTitle(html):
    title = interstr(html, 'title = " ', '";</script>')
    if title is None:
        return None
    return title

This function extracts the chapter title. It is also very simple, but the extraction rules differ from site to site.
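Because the markers change from site to site, one way to keep the extraction code generic is to move the markers into a lookup table. The sketch below assumes two made-up sites. The site keys, markers, and sample pages are all hypothetical.

```python
# Hypothetical per-site markers: (text before the title, text after it).
TITLE_RULES = {
    'site-a': ('title = "', '";'),
    'site-b': ('<h1>', '</h1>'),
}


def extract_title(html, site):
    """Return the chapter title using the markers registered for `site`."""
    begin, end = TITLE_RULES[site]
    start = html.find(begin)
    if start == -1:
        return None
    start += len(begin)
    stop = html.find(end, start)
    if stop == -1:
        return None
    return html[start:stop]


print(extract_title('<script>title = "Chapter 3";</script>', 'site-a'))  # Chapter 3
print(extract_title('<h1>Chapter 3</h1>', 'site-b'))                     # Chapter 3
```

Supporting another site then means adding one entry to the table rather than writing another function.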

def getNextPage(html):
    pageNum = interstr(html, 'next_page = "', '.html";')
    bookID = interstr(html, 'bookid = "', '";')
    if pageNum is None or bookID is None:
        return None
    # `url` is the module-level base address set in the main block.
    nextPage = url + bookID + pageNum
    return nextPage

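This function gets the URL of the next chapter. A chapter page usually contains the chapter text together with the addresses of the previous and next chapters. Extracting those addresses lets a loop fetch the next chapter automatically.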
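A loop of that kind might look like the following sketch. To keep it runnable offline, the pages come from an in-memory dictionary and the download step is passed in as a function. `crawl_chapters`, `fake_fetch`, the URLs, and the `next = "..."` marker are hypothetical stand-ins for a real site's layout.

```python
def text_between(src, begin, end):
    """Return the text between `begin` and the next `end`, or None."""
    start = src.find(begin)
    if start == -1:
        return None
    start += len(begin)
    stop = src.find(end, start)
    return None if stop == -1 else src[start:stop]


def crawl_chapters(first_url, fetch, max_pages=1000):
    """Follow next-chapter links from `first_url`, yielding (url, html)."""
    url = first_url
    seen = set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        # Stop when the page carries no next-chapter marker.
        url = text_between(html, 'next = "', '";')


# Hypothetical three-chapter book held in memory.
PAGES = {
    '/book/1.html': 'chapter one next = "/book/2.html";',
    '/book/2.html': 'chapter two next = "/book/3.html";',
    '/book/3.html': 'chapter three (last page)',
}


def fake_fetch(url):
    return PAGES[url]


visited = [url for url, _ in crawl_chapters('/book/1.html', fake_fetch)]
print(visited)  # ['/book/1.html', '/book/2.html', '/book/3.html']
```

The `seen` set and the `max_pages` cap guard against a site whose last chapter links back to an earlier page, which would otherwise loop forever.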

def getContent(html):
    data = interstr(html, '<div id="content">', '</div>')
    if data is None:
        return None
    data = data.replace('<script>read();</script>', '')
    data = data.replace(' ', '\n')
    data = data.replace('<br /><br />', '')
    return data + '\n'

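This function parses the novel text out of the downloaded page source. The rules differ from site to site. The matching scheme here is simple and uses only string operations. Regular expressions are an option, but they are arguably less readable than this approach.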
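For readers who want to compare the two styles, here is a minimal regular-expression version of the content extraction. The sample page is made up, and the cleanup rule (turning runs of `<br />` into a newline) is an assumption for this example rather than the rule of any particular site.

```python
import re

# Non-greedy match of everything inside <div id="content"> ... </div>.
CONTENT_RE = re.compile(r'<div id="content">(.*?)</div>', re.DOTALL)
# One or more consecutive <br> / <br /> tags.
BREAK_RE = re.compile(r'(?:<br\s*/?>)+')


def extract_content(html):
    """Return the chapter text with markup stripped, or None if not found."""
    match = CONTENT_RE.search(html)
    if match is None:
        return None
    data = match.group(1)
    data = data.replace('<script>read();</script>', '')
    data = BREAK_RE.sub('\n', data)
    return data.strip() + '\n'


sample_page = (
    '<div id="content"><script>read();</script>'
    'First paragraph.<br /><br />Second paragraph.</div>'
)
print(extract_content(sample_page))
```

The non-greedy pattern stops at the first `</div>`, so a content block that contains nested `<div>` elements would be cut short. The plain string version has the same limitation.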

Once parsed, the data can be appended to a text file, which can then be saved as .txt and read on a phone.
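The append step can be sketched as follows, assuming Python 3 and UTF-8 output. The function name `append_chapter`, the file name, and the sample chapters are hypothetical. The example writes into a temporary directory so it leaves nothing behind.

```python
import os
import tempfile


def append_chapter(path, title, content):
    """Append one chapter to `path`, creating the file if needed."""
    with open(path, 'a', encoding='utf-8') as book_file:
        book_file.write(title + '\n\n')
        book_file.write(content.rstrip('\n') + '\n\n')


with tempfile.TemporaryDirectory() as tmp_dir:
    book_path = os.path.join(tmp_dir, 'novel.txt')
    append_chapter(book_path, 'Chapter 1', 'It was a dark night.')
    append_chapter(book_path, 'Chapter 2', 'The rain had stopped.')
    with open(book_path, encoding='utf-8') as book_file:
        saved_text = book_file.read()

print(saved_text)
```

Opening the file in append mode for each chapter means an interrupted crawl keeps everything written so far.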

Python also has an interpreter that runs on Android, so a phone can download novels directly to its own storage.

A crawler like this could use a more advanced approach: with persistent HTTP connections (keep-alive), downloads become much faster.
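One way to reuse a connection with only the standard library is `http.client`, sketched below for Python 3. The function names, the timeout value, and any host or paths passed in are assumptions. The sketch uses plain HTTP and does no error handling or retries.

```python
import http.client


def fetch_paths(conn, paths):
    """Request each path over one already-open connection, in order."""
    bodies = []
    for path in paths:
        conn.request('GET', path)
        response = conn.getresponse()
        # The body must be read fully before the connection can be reused.
        bodies.append(response.read())
    return bodies


def download_chapters(host, paths):
    """Open a single HTTP connection to `host` and reuse it for all paths."""
    # http.client speaks HTTP/1.1, where connections are persistent by default.
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        return fetch_paths(conn, paths)
    finally:
        conn.close()
```

A call such as `download_chapters('www.example.com', ['/book/1.html', '/book/2.html'])` (hypothetical host and paths) would send both requests over one connection, provided the server does not close it between requests.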

Multithreading can be used as well, but synchronizing the threads is troublesome.
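Much of the synchronization burden can be handed to a thread pool. The Python 3 sketch below assumes the chapter URLs are already known, for example from the chapter index, and uses a stand-in download function so it runs offline. `download_all`, `fake_fetch`, and the URLs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, workers=4):
    """Fetch every URL concurrently and return the pages in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of `urls`, regardless of
        # which download finishes first.
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real download so the example runs offline.
    return 'page for ' + url


chapter_urls = ['/book/%d.html' % n for n in range(1, 6)]
pages = download_all(chapter_urls, fake_fetch)
print(pages[0])    # page for /book/1.html
print(len(pages))  # 5
```

Because the results come back in input order, the chapters can be written to the file sequentially without any explicit locks. This does not help when each chapter's address is only discovered from the previous page.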

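I have used this script for a long time. As long as the novel site is well built, the crawled content is generally free of errors and duplicates.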
