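Official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
Beautiful Soup has one important advantage over other HTML parsing approaches: the HTML is broken down into objects, so the whole document can be accessed much like dictionaries and lists.
Compared with scraping by regular expressions, it spares you the high cost of learning regex.
Compared with XPath-based parsing, it also saves learning time, although XPath is already somewhat simpler than regex. (The Scrapy crawling framework uses XPath.)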
Installation
On Linux (Debian or Ubuntu) you can run:
- apt-get install python-bs4
You can also install it with Python's package tools:
- easy_install beautifulsoup4
- pip install beautifulsoup4
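Usage overview
This section covers how to use BeautifulSoup.
Parsing HTML is about extracting data, which mostly comes down to three things:
1: Getting the content of a specific tag.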
- <p>hello, watsy</p><br><p>hello, beautiful soup.</p>
2: Getting an attribute of a specific tag.
- <a href="http://blog.csdn.net/watsy">watsy's blog</a>
3: Locating those tags in the first place, which is what the search methods are for.
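The first two points can be sketched with the two snippets above. This is a minimal illustration, assuming Python 3 and the built-in "html.parser" backend; the variable names are only for this example.

```python
from bs4 import BeautifulSoup

# The two snippets from the list above, joined into one small document.
snippet = (
    '<p>hello, watsy</p><br><p>hello, beautiful soup.</p>'
    '<a href="http://blog.csdn.net/watsy">watsy\'s blog</a>'
)
soup = BeautifulSoup(snippet, "html.parser")

# 1: the content of specific tags (every <p> in the document)
paragraph_texts = [str(p.string) for p in soup.find_all("p")]

# 2: an attribute of a specific tag (href of the first <a>)
link_href = soup.a["href"]

print(paragraph_texts)
print(link_href)
```

Both lookups rely on searching (point 3): soup.a is shorthand for the first matching tag, and find_all returns every match.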
The examples below use the sample document from the official documentation.
- html_doc = """
- <html><head><title>The Dormouse's story</title></head>
- <body>
- <p class="title"><b>The Dormouse's story</b></p>
- <p class="story">Once upon a time there were three little sisters; and their names were
- <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
- <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
- <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
- and they lived at the bottom of a well.</p>
- <p class="story">...</p>
- """
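Pretty-printing the output.
- from bs4 import BeautifulSoup
- # Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
- soup = BeautifulSoup(html_doc, 'html.parser')
- print(soup.prettify())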
- # <html>
- #  <head>
- #   <title>
- #    The Dormouse's story
- #   </title>
- #  </head>
- #  <body>
- #   <p class="title">
- #    <b>
- #     The Dormouse's story
- #    </b>
- #   </p>
- #   <p class="story">
- #    Once upon a time there were three little sisters; and their names were
- #    <a class="sister" href="http://example.com/elsie" id="link1">
- #     Elsie
- #    </a>
- #    ,
- #    <a class="sister" href="http://example.com/lacie" id="link2">
- #     Lacie
- #    </a>
- #    and
- #    <a class="sister" href="http://example.com/tillie" id="link3">
- #     Tillie
- #    </a>
- #    ; and they lived at the bottom of a well.
- #   </p>
- #   <p class="story">
- #    ...
- #   </p>
- #  </body>
- # </html>
Getting the content of a specific tag
- soup.title
- # <title>The Dormouse's story</title>
- soup.title.name
- # u'title'
- soup.title.string
- # u"The Dormouse's story"
- soup.title.parent.name
- # u'head'
- soup.p
- # <p class="title"><b>The Dormouse's story</b></p>
- soup.a
- # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
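The example above shows four things:
1: Getting a tag
soup.title
2: Getting the tag's name
soup.title.name
3: Getting the content of the title tag
soup.title.string
4: Getting the name of the title tag's parent
soup.title.parent.name
As you can see, the API is very object-oriented.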
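Extracting tag attributes
Next, how to extract attributes such as href.
- soup.p['class']
- # [u'title']  (class is a multi-valued attribute, so the value is a list)
To get an attribute, use
soup.tag['attribute name']
- <a href="http://blog.csdn.net/watsy">watsy's blog</a>
The most common case is extracting a link like the one above.
The code is
- soup.a['href']
Pretty easy.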
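A slightly fuller sketch of attribute access follows. It assumes Python 3 and "html.parser"; the third anchor, which has no href, is made up for illustration. It shows that indexing raises KeyError for a missing attribute while .get() returns None, and that href=True restricts a search to tags that actually carry the attribute.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a name="anchor-without-href">no link here</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a
href = first_link["href"]       # indexing: raises KeyError if the attribute is absent
classes = first_link["class"]   # multi-valued attribute, returned as a list
all_attrs = first_link.attrs    # every attribute of the tag as a dict

# .get() returns None instead of raising when the attribute is missing.
missing = soup.find_all("a")[2].get("href")

# href=True keeps only the <a> tags that have an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```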
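Searching and filtering
Now for the important part: searching the whole document and extracting matches.
soup provides find and find_all for searching. Internally, find is implemented by calling find_all with a limit of 1, so only find_all is covered here.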
- def find_all(self, name=None, attrs={}, recursive=True, text=None,
- limit=None, **kwargs):
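Look at the parameters.
The first is the tag name and the second is the attributes to match. The third controls whether the search recurses into all descendants or only looks at direct children. text matches against string content (Beautiful Soup 4.4.0 and later also accept this argument under the name string). limit caps the number of results. **kwargs lets you pass attribute filters as keyword arguments.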
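The parameters described above can be sketched on a small document. This is an illustration only, assuming Python 3 and "html.parser"; the HTML and the variable names are invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<div id="outer">
  <p class="story">first</p>
  <div id="inner"><p class="story">nested</p></div>
  <p class="title">third</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="outer")

# name: every <p> below outer, at any depth
by_name = outer.find_all("p")

# attrs: filter on attribute values with a dict
by_attrs = outer.find_all("p", attrs={"class": "story"})

# recursive=False: only direct children of outer are considered
direct_only = outer.find_all("p", recursive=False)

# limit: stop after the first match
limited = outer.find_all("p", limit=1)

# **kwargs: attribute filters as keyword arguments
# (class is a Python keyword, so bs4 spells it class_)
by_kwargs = soup.find_all(id="inner")
by_class_kw = soup.find_all("p", class_="title")

print([str(p.string) for p in direct_only])
```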
Examples.
- # By tag name
- soup.find_all('b')
- # [<b>The Dormouse's story</b>]
- # By regular expression
- import re
- for tag in soup.find_all(re.compile("^b")):
-     print(tag.name)
- # body
- # b
- for tag in soup.find_all(re.compile("t")):
-     print(tag.name)
- # html
- # title
- # By list of tag names
- soup.find_all(["a", "b"])
- # [<b>The Dormouse's story</b>,
- # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
- # By filter function
- def has_class_but_no_id(tag):
-     return tag.has_attr('class') and not tag.has_attr('id')
- soup.find_all(has_class_but_no_id)
- # [<p class="title"><b>The Dormouse's story</b></p>,
- # <p class="story">Once upon a time there were...</p>,
- # <p class="story">...</p>]
- # By tag name and CSS class
- soup.find_all("p", "title")
- # [<p class="title"><b>The Dormouse's story</b></p>]
- # Filtering by tag
- soup.find_all("a")
- # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
- # Filtering by tag attribute
- soup.find_all(id="link2")
- # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
- # Filtering text by regular expression
- import re
- soup.find(text=re.compile("sisters"))
- # u'Once upon a time there were three little sisters; and their names were\n'
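Since find is built on find_all, the practical difference is the return value. The sketch below assumes Python 3 and "html.parser" and uses a made-up list; it shows that find returns a single tag or None, while find_all always returns a list, which may be empty.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")

# find returns the first match itself ...
first = soup.find("li")
# ... while find_all with limit=1 returns a one-element list holding that match.
as_list = soup.find_all("li", limit=1)

# With no match, find returns None and find_all returns an empty list.
missing_one = soup.find("table")
missing_all = soup.find_all("table")

print(first, as_list, missing_one, len(missing_all))
```

Because find can return None, check the result before chaining attribute access such as .string onto it.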
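Getting contents and strings
- title_tag = soup.title
- title_tag.string
- # u"The Dormouse's story"
Note: in practice, convert the result to a plain string with unicode(title_tag.string) on Python 2, or str(title_tag.string) on Python 3.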
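The reason for the conversion can be sketched as follows, assuming Python 3 (where str plays the role of Python 2's unicode). The value of .string is a NavigableString, which behaves like a string but still points back into the parse tree; converting it yields an ordinary string with no such link.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<title>The Dormouse's story</title>", "html.parser")

raw = soup.title.string   # NavigableString, still attached to the tree
plain = str(raw)          # ordinary Python string, detached from the tree

is_navigable = isinstance(raw, NavigableString)
keeps_tree_reference = raw.parent is soup.title
plain_is_str = type(plain) is str

print(is_navigable, keeps_tree_reference, plain_is_str)
```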
- for string in soup.strings:
-     print(repr(string))
- # u"The Dormouse's story"
- # u'\n\n'
- # u"The Dormouse's story"
- # u'\n\n'
- # u'Once upon a time there were three little sisters; and their names were\n'
- # u'Elsie'
- # u',\n'
- # u'Lacie'
- # u' and\n'
- # u'Tillie'
- # u';\nand they lived at the bottom of a well.'
- # u'\n\n'
- # u'...'
- # u'\n'
- head_tag = soup.head
- head_tag
- # <head><title>The Dormouse's story</title></head>
- head_tag.contents
- # [<title>The Dormouse's story</title>]
- title_tag = head_tag.contents[0]
- title_tag
- # <title>The Dormouse's story</title>
- title_tag.contents
- # [u"The Dormouse's story"]
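A small sketch of how these accessors differ on a tag with mixed children, assuming Python 3 and "html.parser" with a shortened, made-up paragraph. .contents lists direct children, .string is None when a tag has more than one child, and stripped_strings and get_text() collect the text of all descendants.

```python
from bs4 import BeautifulSoup

html = ('<p class="story">Once upon a time; '
        '<a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>')
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# Direct children only: text, <a>, text, <a>
children = p.contents

# .string is None because <p> has more than one child
single = p.string

# All descendant strings, whitespace-stripped, empty ones skipped
texts = list(p.stripped_strings)

# All descendant text joined into one string
full_text = p.get_text()

print(len(children), single, texts, full_text)
```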
Summary
- soup = BeautifulSoup(data, 'html.parser')
- soup.title
- soup.p['title']
- divs = soup.find_all('div', content='tpc_content')
- divs[0].contents[0].string
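The summary can be turned into a runnable end-to-end sketch. Everything about the page here is an assumption: the sample HTML is invented, and it supposes that post bodies are marked with class="tpc_content", so the search uses class_ rather than the content= keyword shown above. Adjust the filter to whatever the real page uses.

```python
from bs4 import BeautifulSoup

# Hypothetical page: post bodies are assumed to live in <div class="tpc_content">.
data = """
<html><head><title>Sample thread</title></head>
<body>
<div class="tpc_content">First post body</div>
<div class="sidebar">ignore me</div>
<div class="tpc_content">Second post body</div>
</body></html>
"""
soup = BeautifulSoup(data, "html.parser")

# Tag content, converted to a plain string
page_title = str(soup.title.string)

# Search by tag name plus CSS class
divs = soup.find_all("div", class_="tpc_content")

# Text of each matching div, with surrounding whitespace removed
posts = [div.get_text(strip=True) for div in divs]

print(page_title)
print(posts)
```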