Web scraping is when we write a program that pretends to be a web browser and retrieves pages, then examines the data in those pages looking for patterns.
1 Parsing HTML with Regular Expressions
▲ Example: extracting a link from page content:
Page content:
<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>
提取"http://www.dr-chuck.com/page2.htm"的正则表达式:
href="http[s]?://.+?"
Here, [s]? matches zero or one 's', so both http:// and https:// are accepted; .+? then matches one or more characters, and the +? makes the match non-greedy, stopping at the first closing ".
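A quick sketch (using made-up URLs) of what non-greedy matching buys us when a line contains more than one link:
import re
text = ('<a href="http://a.example/one">one</a> '
        '<a href="https://b.example/two">two</a>')
# Greedy .+ runs past the first closing quote and swallows both links as one match
print(re.findall('href="(http[s]?://.+)"', text))
# Non-greedy .+? stops at the first closing quote, yielding two clean URLs
print(re.findall('href="(http[s]?://.+?)"', text))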
▲ Code using urllib plus a regular expression
import urllib.request, urllib.parse, urllib.error
import re
import ssl
# Ignore SSL certificate errors; this lets the program access sites that strictly enforce HTTPS even when certificate verification would fail
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')  # the user enters the page to scrape
html = urllib.request.urlopen(url, context=ctx).read()  # read the whole page at once, as a bytes object
links = re.findall(b'href="(http[s]?://.*?)"', html)  # find and extract the links into a list; the b'' prefix makes the pattern a bytes object
for link in links:
print(link.decode())  # print each link; decode() converts bytes to str
Output:
Enter - https://docs.python.org
https://docs.python.org/3/index.html
https://www.python.org/
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
https://www.python.org/
https://www.python.org/psf/donations/
http://sphinx.pocoo.org/
▲ Regular expressions work very well when the HTML is well-formed and predictable. But because there are many broken HTML pages out there, a regex-only solution can miss some valid links or end up with bad data.
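For instance, a link written with single quotes is valid HTML, yet the double-quote pattern above misses it entirely (a small made-up snippet):
import re
html = b"<a href='http://www.example.com/page'>Example</a>"
print(re.findall(b'href="(http[s]?://.*?)"', html))  # prints [] -- the valid link is missed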
2 Parsing HTML with BeautifulSoup
▲ Installing BeautifulSoup
After downloading the wheel, run from the command line:
# install or upgrade pip first
py -m pip install --upgrade pip setuptools wheel
pip install desktop/beautifulsoup4-4.10.0-py3-none-any.whl
# I downloaded the wheel to my desktop, hence desktop/; change the path to wherever you saved it
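To confirm the install worked, a quick check from Python (bs4 exposes its version string):
import bs4
print(bs4.__version__)  # should print 4.10.0 for the wheel above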
▲ Parsing HTML
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')  # the user enters the page to scrape
html = urllib.request.urlopen(url, context=ctx).read()  # open the URL with urllib and read the content
soup = BeautifulSoup(html, 'html.parser')  # hand the content to BeautifulSoup for parsing
# Retrieve all of the anchor tags; see https://www.w3school.com.cn/tags/index.asp for an HTML tag reference
tags = soup('a')
for tag in tags:
print(tag.get('href', None))
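Note that calling the soup object, as in soup('a'), is shorthand for soup.find_all('a'); a minimal check of the equivalence, reusing the sample markup from section 1:
from bs4 import BeautifulSoup
html = '<a href="http://www.dr-chuck.com/page2.htm">Second Page</a>'
soup = BeautifulSoup(html, 'html.parser')
print(soup('a') == soup.find_all('a'))  # True: the two calls return the same tags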
▲ You can also extract other parts of a tag
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urlopen(url, context=ctx).read()
print(html.decode())  # dump the raw page before parsing
soup = BeautifulSoup(html, "html.parser")
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
# Look at the parts of a tag
print('TAG:', tag)
print('URL:', tag.get('href', None))
print('Contents:', tag.contents[0])
print('Attrs:', tag.attrs)
Output:
Enter - http://www.dr-chuck.com/page1.htm
TAG: <a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>
URL: http://www.dr-chuck.com/page2.htm
Contents: 
Second Page
Attrs: {'href': 'http://www.dr-chuck.com/page2.htm'}
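Since tag.attrs is a dict, attributes can also be read dict-style; tag.get() is just the safe variant. A small sketch reusing the page2 link:
from bs4 import BeautifulSoup
html = '<a href="http://www.dr-chuck.com/page2.htm">Second Page</a>'
tag = BeautifulSoup(html, 'html.parser').a
print(tag['href'])            # raises KeyError if the attribute is absent
print(tag.get('href', None))  # returns None instead of raising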