Python: Why `requests` decodes text as ISO-8859-1 instead of UTF-8
Sometimes `requests` decodes the text in `response.text` incorrectly: it uses ISO-8859-1 (Latin-1) instead of UTF-8, even if the HTML contains `<meta charset="utf-8">` or `<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">`.
The problem is that it doesn't use `<meta>` but the `Content-Type` header. Usually this header has the value `text/html; charset=UTF-8`, but sometimes the server sends only `text/html`, and for a `text/*` type without `charset` `requests` falls back to ISO-8859-1, the default from the old HTTP/1.1 specification (RFC 2616). You can see both values in `response.headers['content-type']` and `response.encoding`.
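The header rule described above can be imitated with the standard library alone. This is a simplified sketch and not the actual `requests` code: the function name `encoding_from_content_type` is made up for illustration, and it only covers the plain cases mentioned here.

```python
from email.message import Message


def encoding_from_content_type(value):
    """Guess the text encoding from a Content-Type header value.

    Hypothetical helper that mimics the rule described above.
    """
    msg = Message()
    msg["Content-Type"] = value

    # An explicit charset parameter always wins
    charset = msg.get_param("charset")
    if charset:
        return charset

    # text/* without charset: fall back to the old HTTP default
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"

    # Anything else: no guess
    return None


print(encoding_from_content_type("text/html; charset=UTF-8"))
print(encoding_from_content_type("text/html"))
```

The first call returns the charset from the header, the second one falls back to ISO-8859-1, which is the situation that produces the wrong text.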
Luckily, `requests` also uses a detection module (`chardet`, or `charset_normalizer` in newer versions) to guess the charset from the content, and it exposes the guess as `response.apparent_encoding`, so you can do:
print(response.content.decode(response.apparent_encoding))
or
response.encoding = response.apparent_encoding
print(response.text)
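If you would rather trust the `<meta>` declaration that the page itself carries, you can look for it in the raw bytes before decoding. The sketch below is only an illustration using the standard library: `charset_from_meta` is a hypothetical helper, the regular expression covers just the two `<meta>` forms mentioned earlier, and a real HTML parser would be more robust.

```python
import re

# Matches <meta charset="utf-8"> and
# <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-]+)""",
    re.IGNORECASE,
)


def charset_from_meta(content, default=None):
    """Return the charset declared in a <meta> tag, or `default`."""
    # The declaration is expected near the start of the document
    match = META_CHARSET.search(content[:4096])
    if match:
        return match.group(1).decode("ascii")
    return default


# Sample bytes standing in for response.content
html = '<html><head><meta charset="utf-8"><title>国内旅行</title></head></html>'.encode("utf-8")

encoding = charset_from_meta(html, default="ISO-8859-1")
print(encoding)
print(html.decode(encoding))
```

With a real response you would pass `response.content` to the helper and assign the result to `response.encoding` before reading `response.text`.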
Example code
import requests
import lxml.html
import chardet
r = requests.get('https://travel.rakuten.co.jp/')
print('content-type:', r.headers['content-type'])
print('encoding:', r.encoding)
print('apparent:', r.apparent_encoding)
print('chardet :', chardet.detect(r.content) )
# wrong result: r.text decodes with r.encoding (ISO-8859-1 here)
html = r.text
tree = lxml.html.fromstring(html)
result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]
print('1:', result.text)
# correct result: decode the raw bytes with the encoding detected by chardet
detected_encoding = chardet.detect(r.content)['encoding']
html = r.content.decode(detected_encoding)
tree = lxml.html.fromstring(html)
result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]
print('2:', result.text)
# correct result: decode the raw bytes with r.apparent_encoding
html = r.content.decode(r.apparent_encoding)
tree = lxml.html.fromstring(html)
result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]
print('3:', result.text)
# correct result: override r.encoding so r.text decodes correctly
r.encoding = r.apparent_encoding
html = r.text
tree = lxml.html.fromstring(html)
result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]
print('4:', result.text)
Result:
encoding: ISO-8859-1
apparent: utf-8
chardet : {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
1: å½å
æ
è¡
2: 国内旅行
3: 国内旅行
4: 国内旅行
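The garbled line `1:` can be reproduced without any network access. The short sketch below uses only built-in codecs: it encodes the expected text as UTF-8 and then decodes the bytes as ISO-8859-1, which is effectively what `r.text` does when `r.encoding` is wrong.

```python
text = "国内旅行"
raw = text.encode("utf-8")          # bytes as the server sends them

wrong = raw.decode("iso-8859-1")    # every byte becomes one Latin-1 character
right = raw.decode("utf-8")         # three bytes become one character

print(len(raw), len(wrong), len(right))
print(repr(wrong))
print(right)

# The damage is reversible as long as the wrong text was not modified
repaired = wrong.encode("iso-8859-1").decode("utf-8")
print(repaired == text)
```

Some of the bytes decode to invisible control characters, which is why the wrong result above looks shorter than the raw data and is split over several lines.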
