I'm trying to scrape a website with BeautifulSoup and have written the following code:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://gematsu.com/tag/media-create-sales")
soup = BeautifulSoup(page.text, 'html.parser')
try:
    content = soup.find('div', id='main')
    print(content)
except Exception:
    print("Exception")
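However, this returns NoneType even though the div has the correct ID on the website. Am I doing something wrong?
I can see the div with ID main on the page, and when I print the soup I also find the main div.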
Solution:
This is covered briefly in BeautifulSoup's documentation:
Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers
[ … ]
Here’s the same document parsed with Python’s built-in HTML parser:
BeautifulSoup("<a></p>", "html.parser")
Like html5lib, this parser ignores the closing </p> tag. Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn’t even bother to add an <html> tag.
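The problem you're running into is probably caused by malformed HTML that html.parser cannot handle correctly. As a result, the element with id="main" gets stripped when BeautifulSoup parses the HTML. If you change the parser to html5lib or lxml, BeautifulSoup handles the malformed HTML differently than it does with html.parser.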
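One way to check whether the parser is the culprit is to run the same lookup under each parser and see which one finds the element. The following is a minimal sketch, not a definitive fix: `find_with_fallback` is a hypothetical helper name, and it assumes lxml and html5lib may or may not be installed (both are separate packages), skipping any parser that is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def find_with_fallback(html, name,
                       parsers=("html.parser", "lxml", "html5lib"),
                       **attrs):
    """Hypothetical helper: try each parser in order and return
    (parser_name, tag) for the first parser whose tree contains the tag.
    Returns (None, None) if no available parser finds it."""
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # lxml and html5lib are optional; skip whichever is not installed.
            continue
        tag = soup.find(name, **attrs)
        if tag is not None:
            return parser, tag
    return None, None
```

With the page from the question, the call would be `find_with_fallback(page.text, "div", id="main")`. If the returned parser name is something other than "html.parser", the built-in parser lost the element while another parser kept it, which points to malformed markup rather than a mistake in the lookup.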