Garbled text from BeautifulSoup in Python (the garbling actually comes from the requests response)
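import re

import bs4
import requests as rq

# urls: the list of page URLs to scrape (defined elsewhere)
i = 0  # counts the pages that contain a matching table
for url in urls:
    resp = rq.get(url)
    # print(resp.content)
    bs = bs4.BeautifulSoup(resp.text, "html.parser")
    h1 = bs.find_all("h1")
    pattern = re.compile("^2019年(.+)招生计划$")
    pattern.match(h1[0].text)
    print(h1[0].text)  # .encode("utf8") string.decode("utf8")
    # res = bs.findAll(is_entry_class)
    res = bs.select("div.entry table")
    if res:  # select() returns a list, never None, so test for emptiness
        i = i + 1
        print(i)
        # Print each table row as one tab-separated line
        for child in res[0].tbody.children:
            row = []
            for son in child.children:
                row.append(son.text)
            print("\t".join(row))
        print()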
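Debugging showed that requests had decoded resp as ISO-8859-1, even though the page's own head declares the charset as utf-8. requests takes the encoding from the HTTP Content-Type header and falls back to ISO-8859-1 for text responses that carry no charset there; it does not read the <meta> tag inside the HTML: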
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>2019年...</title>
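To show why the output looks garbled, here is a minimal sketch that needs no network access. It decodes UTF-8 bytes as ISO-8859-1, which is what happens to resp.text in this case, and then reverses the mistake. The sample title is made up for illustration and is not taken from the real page.

```python
# Illustration only: a made-up title, not the content of the real page.
title = "2019年某校招生计划"

raw = title.encode("utf-8")          # the bytes the server actually sends
garbled = raw.decode("iso-8859-1")   # what the text looks like with the wrong encoding
# ISO-8859-1 maps every byte to exactly one character, so the bad decode is reversible.
restored = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(restored)
```

Because the bad decode loses no information, the bytes are intact and only the choice of encoding needs to be corrected.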
Fix:
Set the encoding of the response directly
for url in urls:
    resp = rq.get(url)
    # print(resp.content)
    resp.encoding = "utf8"  # must be set before resp.text is read
    ...
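Hard-coding "utf8" works for this site. As a possible alternative when the pages may use different charsets, the sketch below reads the charset from the page's own <meta> tag. The helper name charset_from_meta and its regular expression are assumptions made for this illustration; they are not part of requests or BeautifulSoup, and the pattern only covers the common forms of the declaration.

```python
import re
from typing import Optional

# Hypothetical helper, not a requests or bs4 API: find "charset=..." inside a <meta> tag.
_META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([\w-]+)", re.IGNORECASE)


def charset_from_meta(raw_html: bytes) -> Optional[str]:
    """Return the charset declared in a <meta> tag, or None if there is none."""
    # Declarations normally sit near the top, so only scan the first 2048 bytes.
    match = _META_CHARSET.search(raw_html[:2048])
    return match.group(1).decode("ascii") if match else None


# The head of the page quoted above, as raw bytes.
sample = (
    b'<html xmlns="http://www.w3.org/1999/xhtml"><head>\n'
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
)
print(charset_from_meta(sample))
```

Inside the loop, the result could then be applied as resp.encoding = charset_from_meta(resp.content) or resp.apparent_encoding, where apparent_encoding is the guess that requests derives from the response bytes.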
飞火➹流萤