
Python: Why `requests` decodes text as ISO-8859-1 instead of UTF-8

Sometimes requests decodes response.text incorrectly: it uses ISO-8859-1 (Latin-1) instead of UTF-8, even if the HTML contains <meta charset="utf-8"> or <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">.

The problem is that requests doesn't use <meta>. It uses the Content-Type header, which usually has the value text/html; charset=UTF-8. Sometimes the server sends only text/html, and for a text/* type without a charset requests falls back to ISO-8859-1 (the default from the HTTP/1.1 specification, RFC 2616).
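To make the header rule concrete, here is a minimal pure-Python sketch of it. The function name `encoding_from_content_type` is hypothetical and this is a simplified re-implementation for illustration only; in requests itself the equivalent logic lives in `requests.utils.get_encoding_from_headers`, which may handle more cases.

```python
def encoding_from_content_type(content_type):
    """Guess the text encoding from a Content-Type header value (simplified sketch)."""
    if not content_type:
        return None

    # Split "text/html; charset=UTF-8" into the media type and its parameters.
    media_type, *params = [part.strip() for part in content_type.split(';')]

    # An explicit charset parameter always wins.
    for param in params:
        key, _, value = param.partition('=')
        if key.strip().lower() == 'charset' and value:
            return value.strip().strip('\'"')

    # No charset: text/* types fall back to ISO-8859-1.
    if 'text' in media_type.lower():
        return 'ISO-8859-1'

    # Anything else: no guess from the header.
    return None


print(encoding_from_content_type('text/html; charset=UTF-8'))
print(encoding_from_content_type('text/html'))
```

The second call is the problematic case from this post: the page may really be UTF-8, but the header alone gives no way to know that.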

You can check both values in response.headers['content-type'] and response.encoding.

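Luckily, requests also uses the chardet module (newer versions use charset_normalizer by default) to detect the charset from the response body, and it exposes the result as response.apparent_encoding. So you can either decode the raw bytes yourself, or override response.encoding before reading response.text: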

print(response.content.decode(response.apparent_encoding))

response.encoding = response.apparent_encoding

print(response.text)
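If you don't want to override the encoding on every response, one option is to do it only when the server sent no charset. This is a minimal sketch under that assumption; `text_with_fallback` is a hypothetical helper name, not part of requests.

```python
def text_with_fallback(response):
    """Return response.text, using the detected encoding when the header has no charset."""
    content_type = response.headers.get('content-type', '')

    if 'charset=' not in content_type.lower():
        # No charset declared: requests would assume ISO-8859-1 for text/*,
        # so switch to the encoding detected from the body instead.
        response.encoding = response.apparent_encoding

    return response.text


# Usage (needs network access, so it is left commented out):
# import requests
# r = requests.get('https://travel.rakuten.co.jp/')
# print(text_with_fallback(r))
```

When the header does declare a charset, the helper leaves it alone, because detection from the body is only a guess and can be wrong for short or mixed content.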

Example code

import requests
import lxml.html
import chardet

r = requests.get('https://travel.rakuten.co.jp/')

print('content-type:', r.headers['content-type'])
print('encoding:', r.encoding)
print('apparent:', r.apparent_encoding)
print('chardet :', chardet.detect(r.content))

# wrong result: r.text is decoded with r.encoding (ISO-8859-1)
html = r.text
tree = lxml.html.fromstring(html)
result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]
print('1:', result.text)

# correct result: decode the raw bytes with the encoding detected by chardet
detected_encoding = chardet.detect(r.content)['encoding']
html = r.content.decode(detected_encoding)
tree = lxml.html.fromstring(html)
result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]
print('2:', result.text)

# correct result: decode the raw bytes with r.apparent_encoding
html = r.content.decode(r.apparent_encoding)
tree = lxml.html.fromstring(html)
result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]
print('3:', result.text)

# correct result: override r.encoding so r.text is decoded with the detected encoding
r.encoding = r.apparent_encoding
html = r.text
tree = lxml.html.fromstring(html)
result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]
print('4:', result.text)

Result:

encoding: ISO-8859-1
apparent: utf-8
chardet : {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
1: å½åæè¡
2: 国内旅行
3: 国内旅行
4: 国内旅行
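The garbled text in line 1 is what you get when UTF-8 bytes are decoded as ISO-8859-1. This small offline sketch reproduces the effect with the same string, and shows that the damage is reversible because ISO-8859-1 maps every byte to exactly one character. The variable names are only for illustration.

```python
original = '国内旅行'

# Bytes as the server would send them.
raw = original.encode('utf-8')

# What happens when the bytes are decoded with the wrong encoding.
wrong = raw.decode('iso-8859-1')
print('wrong   :', wrong)

# Undo the wrong decoding: back to the same bytes, then decode as UTF-8.
repaired = wrong.encode('iso-8859-1').decode('utf-8')
print('repaired:', repaired)
```

Some of the wrongly decoded characters are invisible control characters, so the printed garbage may look shorter than it really is.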
