On Chinese encoding issues in the requests library
Reposted from:
When Python requests is used as a proxy crawler node to scrape sites with different character sets, you run into a few problems, which boil down to garbled Chinese text (mojibake). If you only scrape Weibo or e-commerce sites, the charset is easy to pin down, and you can even hard-code the encoding on your side. But for public-opinion monitoring we scrape sensitive data from hundreds of thousands of different sites every day, so we need a more reliable way to determine the character encoding and avoid garbled Chinese.

Let's start with an example. You will notice some interesting things.
In [9]: r = requests.get('/en/latest/')

In [10]: r.encoding
Out[10]: 'ISO-8859-1'

In [11]: type(r.text)
Out[11]: unicode

In [12]: type(r.content)
Out[12]: str

In [13]: r.apparent_encoding
Out[13]: 'utf-8'

In [14]: chardet.detect(r.content)
Out[14]: {'confidence': 0.99, 'encoding': 'utf-8'}
The first question: why do we end up with a character encoding like ISO-8859-1?

What is ISO-8859-1? It is also known as Latin-1, or "Western European". To me this counts as a bug in requests, and on the requests GitHub repo you can see that it is not only Chinese users who have filed this issue, but the official reply is that the behaviour follows the HTTP RFC.
Let's walk through the requests source code to see how this happens.

requests takes the character encoding from the Content-Type header of the server's response. Only if Content-Type carries a charset field can requests identify the encoding correctly; otherwise it falls back to the default, ISO-8859-1. Poorly configured pages often have exactly this problem.
In [52]: r.headers
Out[52]: {'content-length': '16907', 'via': 'BJ-H-NX-116(EXPIRED), http/1.1 BJ-UNI-1-JCS-116 ( [cHs f ])', 'ser': '3.81', 'content-encoding': 'gzip', 'age': '23', 'expires': ...}

File: requests/utils.py
def get_encoding_from_headers(headers):
    """Get the character encoding from the headers dict."""
    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
The second question: how do we get the correct encoding?

The requests response object has an apparent_encoding property, which calls chardet.detect() to identify the text encoding. Note, however, that this costs a fair amount of computation. To see why, read the chardet source code.
@property
def apparent_encoding(self):
    """Use chardet to work out the encoding."""
    return chardet.detect(self.content)['encoding']
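To get a feel for the difference in cost, here is a rough sketch (the URL is only a placeholder and the exact timings will vary): r.encoding comes straight from the already-parsed headers, while r.apparent_encoding has to run chardet.detect() over the whole body.

import time
import requests

r = requests.get('http://example.com/')       # placeholder URL for illustration

t0 = time.time()
header_encoding = r.encoding                   # parsed once from the Content-Type header
header_cost = time.time() - t0                 # essentially zero

t0 = time.time()
guessed_encoding = r.apparent_encoding         # runs chardet.detect() over the whole body
guess_cost = time.time() - t0                  # grows with the size of r.content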
The third question: what is the difference between r.text and r.content?

After requests fetches a network resource, there are two ways to look at the body: r.text and r.content. So what is the difference between them?

Reading the requests source code shows that r.text returns processed Unicode data, while r.content returns the raw bytes. In other words, r.content is cheaper than r.text: r.content hands back the body as bytes, while r.text decodes it to Unicode, and if the headers carry no charset, text will call chardet to detect the character set, which again burns CPU.
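If you already know (or can cheaply determine) the charset, a minimal sketch of avoiding that chardet cost is to decode the raw bytes yourself; the utf-8 fallback below is my own assumption, not something requests does:

raw = r.content                      # raw bytes, no decoding work done yet
charset = r.encoding or 'utf-8'      # assumption: fall back to utf-8 when nothing better is known
html = raw.decode(charset, 'replace')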
Let's compare text and content by reading the requests code.

File: models.py
@property
def apparent_encoding(self):
    """The apparent encoding, provided by the chardet library"""
    return chardet.detect(self.content)['encoding']

@property
def content(self):
    """Content of the response, in bytes."""

    if self._content is False:
        # Read the contents.
        try:
            if self._content_consumed:
                raise RuntimeError(
                    'The content for this response was already consumed')

            if self.status_code == 0:
                self._content = None
            else:
                self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()

        except AttributeError:
            self._content = None

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content
@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # When no encoding was set, fall back to chardet's guess.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content
Solutions:

There are several ways to fix garbled Chinese text with requests.

1. Since content is the raw byte string of the HTTP response, you can decode content to Unicode using the charset from the headers, as long as that charset is not ISO-8859-1.

In [96]: r.encoding
Out[96]: 'gbk'

In [98]: r.content.decode(r.encoding)[200:300]
Out[98]: ="keywords" content="Python数据分析与挖掘实战,,机械⼯业出版社,9787111521235,,在线购买,折扣,打折"/>
Another, particularly blunt approach is to encode straight to utf-8 based on chardet's result.

In [22]: r = requests.get('http://item.jd.com/1012551875.html')

In [23]: r.text
KeyboardInterrupt

In [23]: r.apparent_encoding
Out[23]: 'GB2312'

In [24]: r.encoding
Out[24]: 'gbk'

In [25]: r.content.decode(r.apparent_encoding).encode('utf-8')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-25-918324cdc053> in <module>()
----> 1 r.content.decode(r.apparent_encoding).encode('utf-8')

UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 49882-49883: illegal multibyte sequence

In [27]: r.content.decode(r.apparent_encoding, 'replace').encode('utf-8')
If you are sure you will use text and you already know the site's character encoding, you can simply set r.encoding = 'xxx'; once you specify the encoding, requests will use it when decoding text.

>>> import requests
>>> r = requests.get('')
>>> r.text
>>> r.encoding
'gbk'
>>> r.encoding = 'utf-8'
2. Use a regular expression to pull the charset out of the HTML meta tag.

In my experience of scraping hundreds of thousands of sites, most of them are fairly well-behaved: if the headers carry no charset, you can extract it from the HTML meta tag.

In [78]: s
Out[78]: '    <meta http-equiv="Content-Type" content="text/html; charset=gbk"'

In [79]: b = re.compile("<meta.*content=.*charset=(?P<charset>[^;\s]+)", flags=re.I)

In [80]: b.search(s).group(1)
Out[80]: 'gbk"'
python requests' utils.py already has a complete function for extracting the meta charset from HTML. At the end of the day it is still just a pile of regular expressions.

In [32]: requests.utils.get_encodings_from_content(r.content)
Out[32]: ['gbk']
File: utils.py
def get_encodings_from_content(content):
    charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

    return (charset_re.findall(content) +
            pragma_re.findall(content) +
            xml_re.findall(content))
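As a quick sanity check, here is a small usage sketch of that helper (assuming your requests version still exposes requests.utils.get_encodings_from_content, and that you pass it a decoded string rather than undecodable bytes):

import requests

html = '<head><meta http-equiv="Content-Type" content="text/html; charset=gbk"></head>'
encodings = requests.utils.get_encodings_from_content(html)   # -> ['gbk']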
Finally, a summary of the requests Chinese-mojibake problem:

Unify your encodings: either convert everything to utf-8, or use Unicode as the intermediate representation!

Chinese sites generally use utf-8, gbk or gb2312; when requests reports one of those encodings, you can decode straight to Unicode.

But when you find that the encoding is ISO-8859-1, you can combine a regex over the HTML with chardet to work out the real encoding, and package that logic as a patch you pull in.
import requests

def monkey_patch():
    prop = requests.models.Response.content

    def content(self):
        _content = prop.fget(self)
        if self.encoding == 'ISO-8859-1':
            encodings = requests.utils.get_encodings_from_content(_content)
            if encodings:
                self.encoding = encodings[0]
            else:
                self.encoding = self.apparent_encoding
            _content = _content.decode(self.encoding, 'replace').encode('utf8', 'replace')
            self._content = _content
        return _content

    requests.models.Response.content = property(content)

monkey_patch()
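With the patch applied, a response whose headers made requests fall back to ISO-8859-1 comes back from r.content already converted to utf-8 bytes. A short usage sketch, assuming monkey_patch() above has already run (the URL is only a placeholder):

r = requests.get('http://example.com/')   # placeholder URL
data = r.content                          # if the headers fell back to ISO-8859-1, this is now utf-8 bytes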
Python 3.x has solved this encoding problem; if you are still on Python 2.6 or 2.7, you will still need the methods above to deal with garbled Chinese. END.