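Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document.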
import requests
from bs4 import BeautifulSoup
link = "http://category.tudou.com/category/c_96_r_2019_p_1.html"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'}
req = requests.get(link, headers=headers)
# Build a BeautifulSoup object and print the document with standard indentation:
soup = BeautifulSoup(req.text, "lxml")
print(soup.prettify())
# Find every <link> tag in the document and print its href:
for lk in soup.find_all('link'):
    print(lk.get('href'))
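Not every `<link>` tag has an `href`, in which case `lk.get('href')` returns `None`. The sketch below shows one way to skip such tags. It runs on a small inline HTML string (made up for illustration, not the real page) and uses `html.parser` so it needs neither network access nor lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = """
<html><head>
<link href="/favicon.ico" rel="shortcut icon"/>
<link href="//example.com/site.css" rel="stylesheet"/>
<link rel="preload"/>
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <link> tags that actually carry an href attribute.
hrefs = [tag["href"] for tag in soup.find_all("link", href=True)]
print(hrefs)
```

The third `<link>` has no `href`, so it is filtered out rather than appearing as `None` in the result.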
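Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml.
The table below lists the main parsers with their advantages and disadvantages: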
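Parser | Typical usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Bundled with Python; decent speed; lenient in recent Python versions | Slower than lxml; less lenient than html5lib |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient | Requires an external C library |
lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") | Very fast; the only XML parser currently supported | Requires an external C library |
html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the way a web browser does; produces valid HTML5 | Very slow; requires an external Python package |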
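Since lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when the preferred one is missing. The following is a minimal sketch of that idea; the helper name `make_soup` is hypothetical, and it relies on Beautiful Soup raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is an illustrative helper, not part of Beautiful Soup.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard library one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)
```

Different parsers can build slightly different trees from invalid markup, so a fallback like this is safest when the input is reasonably well formed.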
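Navigating the tree:
A Tag may contain several strings or other Tags; all of these are the Tag's child nodes. Beautiful Soup provides many attributes for navigating and iterating over a Tag's children.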
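As a rough illustration of child navigation, the sketch below walks a small made-up fragment with `.contents`, `.children`, and `.descendants`. The markup is an assumption chosen for brevity, and `html.parser` is used so the fragment is not wrapped in extra `<html>` or `<body>` tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: a <div> with two <p> tags and a bare string between them.
html = "<div><p>one</p>text<p>two <b>bold</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .contents is a list of direct children (tags and strings alike).
direct_count = len(div.contents)

# .children iterates over the same direct children; keep only the tags.
child_tags = [child.name for child in div.children if isinstance(child, Tag)]

# .descendants walks the whole subtree in document order.
descendant_tags = [node.name for node in div.descendants if isinstance(node, Tag)]

print(direct_count, child_tags, descendant_tags)
```

Here `.children` stops at the two `<p>` tags, while `.descendants` also reaches the nested `<b>`.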
print(soup.head)
<head><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><link href="//static.youku.com" rel="dns-prefetch"/><link href="//static.ykimg.com" rel="dns-prefetch"/><link href="//r1.ykimg.com" rel="dns-prefetch"/><link href="//r2.ykimg.com" rel="dns-prefetch"/><link href="//r3.ykimg.com" rel="dns-prefetch"/><link href="//r4.ykimg.com" rel="dns-prefetch"/><link href="//g1.ykimg.com" rel="dns-prefetch"/><link href="//g2.ykimg.com" rel="dns-prefetch"/><link href="//g3.ykimg.com" rel="dns-prefetch"/><link href="//g4.ykimg.com" rel="dns-prefetch"/><link href="//p.l.youku.com" rel="dns-prefetch"/><link href="//urchin.lstat.youku.com" rel="dns-prefetch"/><link href="//html.atm.youku.com" rel="dns-prefetch"/><meta content="text/html; charset=utf-8" http-equiv="content-type"/><meta content="zh-cn" http-equiv="content-language"/><title>剧综影漫_土豆视频</title><meta content="视频,视频分享,视频搜索,视频播放,土豆视频" name="keywords"/><meta content="土豆-中国第一视频网站,提供视频播放,视频发布,视频搜索 - 视频服务平台,提供视频播放,视频发布,视频搜索,视频分享 - 土豆视频" name="description"/><meta content="a2h28" name="data-spm"/><link href="/favicon.ico" rel="shortcut icon"/><link href="//static.youku.com/yk/lib/css/tudou.8f50c0ed37.css" rel="stylesheet"/><link href="//static.youku.com/yk/newtudou/css/pc/category/category.340b6db21c.css" rel="stylesheet"/><script>var Local={"domain":{"default":"www.youku.com","test":"test.youku.com","subscribe":"ding.youku.com","uc":"i.youku.com","video":"v.youku.com","rz":"rz.youku.com","userlive":"userlive.youku.com","esign":"hetong.youku.com","listpage":"list.youku.com","xinterest":"x.youku.com","ypartner":"yp.youku.com","interact":"hudong.pl.youku.com","creation":"mp.tudou.com","uctg":"uctg.youku.com","playlists":"playlists.youku.com","static":"static.youku.com","passport":"account.youku.com","static_ext":"static.ykimg.com","static_ext_js":"js.ykimg.com","static_ext_css":"css.ykimg.com"},"service":{"push":"push.youku.com","interact":"hudong.pl.youku.com"},"debug":false};</script><script>var require = {"baseUrl": 
"//static.youku.com/newtudou/js/"};</script><script>if(require){require.paths={"main.category": "//static.youku.com/yk/newtudou/js/pc/category/main.category.f22a91da07"};}</script><script data-main="main.category" src="//static.youku.com/yk/lib/js/base.tudou.464a1349ea.js"></script></head>
print(soup.title)
<title>剧综影漫_土豆视频</title>
print(soup.title.text)
剧综影漫_土豆视频
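Accessing a tag by its name as an attribute is a handy shortcut, and it can be chained: you can apply it repeatedly to tags within the tree. The following code gets the first <b> tag inside the <body> tag: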
print(soup.body.b)
<b class="line-after"></b>
print(soup.body.b['class'])
['line-after']
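Two details of this shortcut are easy to miss: `class` is a multi-valued attribute, so it comes back as a list, and accessing a tag name that does not exist yields `None` rather than raising an error. The sketch below demonstrates both on a small made-up document (the markup is an assumption, not the page fetched above).

```python
from bs4 import BeautifulSoup

# Hypothetical document with two <b> tags and no <title> or <table>.
html = '<html><body><p>Intro</p><b class="line-after note">x</b><b>y</b></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Chained attribute access returns the first matching tag at each step.
first_b = soup.body.b
classes = first_b["class"]  # a list, one entry per class name

# A missing tag gives None, so guard before chaining further.
missing = soup.body.table
title_text = soup.title.text if soup.title else ""

print(classes, missing, repr(title_text))
```

Checking for `None` before chaining avoids an `AttributeError` on pages that lack the expected tag.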