BeautifulSoup basic usage and the find selectors

2022-11-01 19:06:36

　　This summary is based on the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-all

Example code

html_doc = """
<html>
    <head><title>The Dormouse's story <!--Hey, buddy. Want to buy a used parser?-->
    <a><!--Hey, buddy. Want to buy a used parser?--></a></title>
    </head>
<body>
    <p class="title">
        <b>The Dormouse's story</b>
        <a><!--Hey, buddy. Want to buy a used parser?--></a>
    </p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1 link4">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')  # the examples below all use this soup object

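　　1. Quick operations (the outputs shown in the comments are those of the unmodified sample from the official documentation):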

soup.title  == soup.find('title')

# <title>The Dormouse's story</title>

soup.title.name

# 'title'

soup.title.string  == soup.title.text  == soup.title.get_text()  # equal when the tag holds a single text node

# "The Dormouse's story"

soup.title.parent.name

# 'head'

soup.p   == soup.find('p')  # dot access returns only the first matching tag under the current tag

# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']

# ['title']

soup.a  == soup.find('a')

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(['a','b'])  # find all <a> tags and all <b> tags
soup.find_all(id=["link1","link2"])  # find all tags whose id is link1 or link2

soup.find(id="link3")

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

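　　2. Beautiful Soup has four kinds of objects:

　　　　1. BeautifulSoup

　　　　2. Tag: a tag

　　　　3. NavigableString: the text inside a tag (Comment is a subclass of it)

　　　　4. Comment: a comment inside a tag; comment only, with no body text

　　Tag attributes are read and modified just like dictionary entries.

　　HTML multi-valued attributes (this does not apply to XML):

　　　　A multi-valued attribute can hold several values under one name, and its value is returned as a list even when it contains only one value,

　　　　for example: class, rel, rev, accept-charset, headers, accesskey

　　　　All other attributes are single-valued: even if the value contains several space-separated parts, a single string is returned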

soup.a['class']  #['sister']

id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')

id_soup.p['id']  #'my id'
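
A minimal sketch of the two points above: dictionary-style attribute access, and the difference between multi-valued and single-valued attributes. The snippet string and the variable names are made up for illustration and are not part of the sample document.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration.
snippet = '<a class="sister big" id="link1 link4" href="/elsie">Elsie</a>'
tag = BeautifulSoup(snippet, "html.parser").a

class_value = tag["class"]   # multi-valued attribute: returned as a list
id_value = tag["id"]         # single-valued attribute: returned as one string

# Attributes behave like dictionary entries.
tag["title"] = "first sister"   # add or overwrite an attribute
del tag["href"]                 # remove an attribute
missing = tag.get("href")       # .get() returns None instead of raising KeyError

print(class_value, id_value, tag.attrs)
```

Using .get() is the safer choice when an attribute may be absent, since tag["href"] would raise KeyError after the deletion.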

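　　3. Getting the content of a tag:

　　　　string: returns the text or comment of the tag's single child (one or the other; if the tag contains both kinds of content, the result is None)

　　　　strings: a generator over the text of all descendants (comments excluded)

　　　　stripped_strings: a generator over the text of all descendants (comments excluded, with the blank lines and surrounding whitespace of strings removed)

　　　　text: returns text only, joining the text of all descendants

　　　　get_text(): returns text only, joining the text of all descendants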

　　string:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"

soup = BeautifulSoup(markup, 'html.parser')

comm = soup.b.string

print(comm)  # Hey, buddy. Want to buy a used parser?

print(type(comm))  #<class 'bs4.element.Comment'>

　　strings:

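soup = BeautifulSoup(html_doc, 'html.parser')  # re-parse the full sample; the soup above held only the <b> snippet

head_tag = soup.body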

for s in head_tag.strings:

    print(repr(s))

Result:

'\n'

"The Dormouse's story"

'\n'

'Once upon a time there were three little sisters; and their names were\n        '

'Elsie'

',\n        '

'Lacie'

' and\n        '

'Tillie'

';\n        and they lived at the bottom of a well.\n    '

'\n'

'...'

'\n'

　　stripped_strings:

head_tag = soup.body

for s in head_tag.stripped_strings:

    print(repr(s))

Result:

"The Dormouse's story"

'Once upon a time there were three little sisters; and their names were'

'Elsie'

','

'Lacie'

'and'

'Tillie'

';\n        and they lived at the bottom of a well.'

'...'

　　text:

soup = BeautifulSoup(html_doc, 'html.parser')

head_tag = soup.body

print(head_tag.text)

Result:

The Dormouse's story

Once upon a time there were three little sisters; and their names were

        Elsie,

        Lacie and

        Tillie;

        and they lived at the bottom of a well.

...

soup = BeautifulSoup(html_doc, 'html.parser')

head_tag = soup.body

print(repr(head_tag.text))

Result:

"\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\n        Elsie,\n        Lacie and\n        Tillie;\n        and they lived at the bottom of a well.\n    \n...\n"

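The differences between these accessors are easier to see side by side on a small tag. The following is a sketch on a made-up snippet (the markup and variable names are assumptions for illustration), with one tag that has a single child and one that mixes text, a comment, and child tags.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: <p> holds a tag, text, a comment, and another tag.
snippet = "<p><b>one</b> two <!--hidden--><i> three </i></p>"
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

single = soup.b.string                # <b> has exactly one child, so its text is returned
multiple = p.string                   # <p> has several children, so this is None
all_strings = list(p.strings)         # text nodes of all descendants, comment skipped
stripped = list(p.stripped_strings)   # same, with surrounding whitespace removed
text = p.get_text()                   # all text joined into one str
joined = p.get_text("|", strip=True)  # custom separator, each piece stripped

print(repr(single), repr(multiple), all_strings, stripped, repr(text), joined)
```
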
　　4. Getting the child nodes:

　　　　.contents: returns the direct children of a node as a list

　　　　.children: returns the direct children of a node as an iterator

soup = BeautifulSoup(html_doc, 'html.parser')

head_tag = soup.head

print(head_tag)

print(head_tag.contents)

print(head_tag.contents[0])

print(head_tag.contents[0].contents)

for ch in head_tag.children:

    print(ch)

Result:

<head><title>The Dormouse's story</title></head>

[<title>The Dormouse's story</title>]

<title>The Dormouse's story</title>

["The Dormouse's story"]

<title>The Dormouse's story</title>

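　　5. Getting a generator over the descendant nodes:

　　　　　.descendants: returns all descendants of a tag (children, grandchildren, and so on) as a generator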

for ch in head_tag.descendants:

    print(ch)

Result:

<title>The Dormouse's story</title>

The Dormouse's story
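
As a compact comparison of .contents, .children and .descendants, here is a sketch on a one-line head element (the variable names are chosen for illustration only). It shows that the first two cover the same direct children, while .descendants also reaches the text inside the title.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

direct = head_tag.contents                # a real list: supports indexing and len()
children = list(head_tag.children)        # iterator over the same direct children
descendants = list(head_tag.descendants)  # generator: the <title> tag, then its text

print(len(direct), len(children), len(descendants))
```
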

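　　6. Parent (parent): any bs4 object, whether it is a Tag or a NavigableString, can reach its parent through .parent; a plain str, such as the value of .text, cannot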

soup = BeautifulSoup(html_doc, 'html.parser')

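tag_title = soup.b  # the <b> tag

print(tag_title.parent)  # parent of the <b> tag: <p>

print(type(tag_title.string))  # type of the text inside <b>: <class 'bs4.element.NavigableString'> (.string is None if the tag also contains a comment)

print(tag_title.string.parent)  # parent of the text inside <b>: the <b> tag

print(type(tag_title.text))  # .text is a plain str, which has no bs4 attributes for reaching a parent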

　　7. All ancestors (parents): iterates over every ancestor of an element, from the nearest outwards

soup = BeautifulSoup(html_doc, 'html.parser')

link = soup.a

for parent in link.parents:

    print(parent.name)

Result:

p
body
html
[document]

　　8. Previous and next sibling nodes (nodes, not just tags: text is a node too): previous_sibling, next_sibling
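
A small sketch of why "node" matters here, using a made-up two-link snippet (markup and names are assumptions for illustration): the node right after the first link is the text ", ", not the second link.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with a text node between two <a> tags.
snippet = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(snippet, "html.parser")
first = soup.find(id="link1")

after_first = first.next_sibling        # the text node ", ", not the second <a>
second = after_first.next_sibling       # one step further: the second <a>
before_first = first.previous_sibling   # None: nothing precedes the first <a>

print(repr(after_first), second, before_first)
```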

　　9. Iterating over all sibling nodes with a generator: next_siblings, previous_siblings

for sib in soup.a.next_siblings:

    print(sib)

    print("---------")

Result:

-------------

,

---------

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

---------

---------

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

---------

;

        and they lived at the bottom of a well.

---------

　　10. Searching the tree

　　　　Filters:

　　　　　　1. String

　　　　　　2. Regular expression

　　　　　　3. List

　　　　　　4. True

　　　　　　5. Function

html_doc = """<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were</p>

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.

<p class="story">...</p>

</body>

"""

from bs4 import BeautifulSoup

import re

soup = BeautifulSoup(html_doc, 'html.parser')

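soup.find_all("a")  # string filter

soup.find_all(re.compile("^b"))  # regular expression: tag names starting with "b"

soup.find_all(re.compile("a"))  # regular expression: tag names containing "a"

soup.find_all(re.compile("l$"))  # regular expression: tag names ending with "l"

soup.find_all(["a", "b"])  # list of tag names

soup.find_all(True)  # returns every tag

def has_class_no_id(tag):

    return tag.has_attr("class") and not tag.has_attr("id")

soup.find_all(has_class_no_id)  # function filter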
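
To make the effect of each filter type visible, the sketch below runs all five against a shortened, made-up document and collects only the tag names. The snippet and the names() helper are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened document.
snippet = """<html><head><title>T</title></head>
<body><p class="title"><b>bold</b></p>
<a class="sister" id="link1" href="/elsie">Elsie</a></body></html>"""
soup = BeautifulSoup(snippet, "html.parser")

def names(tags):
    """Reduce a result list to tag names so the matches are easy to compare."""
    return [tag.name for tag in tags]

def has_class_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

by_string = names(soup.find_all("a"))               # exact tag name
by_regex = names(soup.find_all(re.compile("^b")))   # names starting with "b"
by_list = names(soup.find_all(["a", "b"]))          # any name in the list
by_true = names(soup.find_all(True))                # every tag, no text nodes
by_function = names(soup.find_all(has_class_no_id)) # custom predicate

print(by_string, by_regex, by_list, by_true, by_function)
```

Results always come back in document order, which is why the list filter yields b before a here.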

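　　11. The find selectors:

　　　　Syntax:

　　　　# find_all( name , attrs , recursive , text , **kwargs )

　　　　#  name: the tag name to search for

　　　　#  attrs: the tag's attributes

　　　　#  recursive: whether to search all descendants (True, the default) or only direct children (False)

　　　　#  text: the text to search for (newer versions name this argument string)

　　　　# **kwargs: other keyword arguments, treated as attribute filters

　　Special cases:
　　　　data-foo="value": a hyphen is not allowed in a Python keyword argument name, so this can only be written as attrs={"data-foo":"value"}

　　　　class="value": class is a Python keyword, so write class_="value" or attrs={"class":"value"}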
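
The two special cases can be sketched as follows. The two-div snippet and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with a data-* attribute and a multi-valued class.
snippet = '<div data-foo="value" class="box">x</div><div class="box wide">y</div>'
soup = BeautifulSoup(snippet, "html.parser")

# data-foo cannot be a keyword argument name, so it goes through attrs.
by_data = soup.find_all(attrs={"data-foo": "value"})

# class is a Python keyword: use class_ or the attrs dictionary.
by_class_kw = soup.find_all(class_="box")
by_class_attrs = soup.find_all(attrs={"class": "box"})

print(by_data, by_class_kw, by_class_attrs)
```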

from bs4 import BeautifulSoup

import re

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

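# find_all( name , attrs , recursive , text , **kwargs )

#  name: the tag name to search for (string, regular expression, function, or True)

#  attrs: the tag's attributes

#  recursive: whether to search all descendants or only direct children

#  text: the text to search for

# **kwargs: other keyword arguments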

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all('p', 'title')) # <p> tags with class="title"

soup.find_all('title')  # returns all <title> tags as a list

soup.find_all(attrs={"class":"sister"})  # returns all tags whose class is sister, as a list

soup.find_all(id='link2')  # returns all tags whose id is link2

soup.find_all(href=re.compile("elsie")) # returns all tags whose href contains elsie

soup.find_all(id=True)  # returns all tags that have an id attribute

soup.find_all(id="link1", href=re.compile('elsie'))  # id is link1 and href contains elsie

Searching by class

soup = BeautifulSoup(html_doc, 'html.parser')

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')

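css_soup.find_all("p", class_="body")  # for a multi-valued class, matching any one value is enough

css_soup.find_all("p", class_="strikeout")

css_soup.find_all("p", class_="body strikeout")  # exact match against the whole attribute string

# the text argument can be a string, a regular expression, a list, a function, or True

soup.find_all("a", text="Elsie")  # <a> tags whose text is "Elsie"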
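
One consequence of the exact-match rule is worth a quick check: the full class string matches only in the order it appears in the markup. A minimal sketch, reusing the same one-tag snippet (the variable names are for illustration):

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

one_value = css_soup.find_all("p", class_="body")            # any single class value matches
exact = css_soup.find_all("p", class_="body strikeout")      # whole string, same order
reordered = css_soup.find_all("p", class_="strikeout body")  # same values, different order

print(len(one_value), len(exact), len(reordered))
```

When the order of class values is not known in advance, searching for each value separately is the more robust approach.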

　　12. Parent-search methods:

　　　　find_parents( name , attrs , recursive , text , **kwargs )

　　　　find_parent( name , attrs , recursive , text , **kwargs )

html_doc = """<html>

    <head>

        <title>The Dormouse's story</title>

    </head>

<body>

    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were</p>

    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

    <p>

        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

    </p>

    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

    and they lived at the bottom of a well.

    <p class="story">...</p>

</body>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

a_string = soup.find(text="Lacie")  # the text node "Lacie"

type(a_string), a_string  # <class 'bs4.element.NavigableString'> Lacie

a_parent = a_string.find_parent()  # the nearest ancestor of a_string

a_parent = a_string.find_parent("p")  # the nearest <p> ancestor of a_string

a_parents = a_string.find_parents()  # all ancestors of a_string

a_parents = a_string.find_parents("a")  # all <a> ancestors of a_string
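
The calls above do not show their results, so here is a sketch on a smaller, made-up document where the ancestor chain is easy to follow. It uses string=, the newer name of the text argument; the snippet and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical document: the text "Lacie" sits inside a > p > div > body > html.
snippet = '<html><body><div><p><a id="link2">Lacie</a></p></div></body></html>'
soup = BeautifulSoup(snippet, "html.parser")
a_string = soup.find(string="Lacie")

nearest = a_string.find_parent()        # the closest ancestor: the <a> tag
nearest_p = a_string.find_parent("p")   # the closest <p> ancestor
ancestor_names = [tag.name for tag in a_string.find_parents()]  # nearest first
no_match = a_string.find_parent("span") # None when no ancestor matches

print(nearest.name, nearest_p.name, ancestor_names, no_match)
```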

　　13. Following siblings:

　　　　find_next_siblings( name , attrs , recursive , text , **kwargs )

　　　　find_next_sibling( name , attrs , recursive , text , **kwargs )

html_doc = """<html><head><title>The Dormouse's story</title></head>

<body>

    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were</p>

    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

    <b href="http://example.com/elsie" class="sister" id="link1">Elsie</b>,

    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

        and they lived at the bottom of a well.

    <p class="story">...</p>

</body>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

first_link = soup.a  # the first <a> tag

a_sibling = first_link.find_next_sibling()  # the first following sibling

a_sibling = first_link.find_next_sibling("a")  # the first following <a> sibling

a_siblings = first_link.find_next_siblings()  # all following siblings

a_siblings = first_link.find_next_siblings("a")  # all following <a> siblings

　　14. Preceding siblings:

　　　　find_previous_siblings( name , attrs , recursive , text , **kwargs )

　　　　find_previous_sibling( name , attrs , recursive , text , **kwargs )
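
The sibling searches in both directions can be sketched on one made-up paragraph (markup and names are assumptions for illustration). Note that the backward search returns the nearest match first.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph: three <a> tags and one <b> tag as siblings.
snippet = ('<p><a id="link1">Elsie</a>, <b>bold</b> '
           '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(snippet, "html.parser")
first_link = soup.find(id="link1")
last_link = soup.find(id="link3")

next_b = first_link.find_next_sibling("b")                             # first following <b>
next_ids = [t["id"] for t in first_link.find_next_siblings("a")]       # all following <a>
prev_a = last_link.find_previous_sibling("a")                          # nearest preceding <a>
prev_ids = [t["id"] for t in last_link.find_previous_siblings("a")]    # nearest first

print(next_b, next_ids, prev_a, prev_ids)
```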

　　15. Following nodes:

　　　　find_all_next( name , attrs , recursive , text , **kwargs )

　　　　find_next( name , attrs , recursive , text , **kwargs )

html_doc = """<html>

    <head>

        <title>The Dormouse's story</title>

    </head>

<body>

    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were</p>

    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

    <p>

        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

    </p>

    <p>

        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

    </p>

        and they lived at the bottom of a well.

    <p class="story">...</p>

</body>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

a_string = soup.find(text="Lacie")

a_next = a_string.find_next()  # the first tag after a_string in document order

a_next = a_string.find_next('a')  # the first <a> tag after a_string

a_nexts = a_string.find_all_next()  # all tags after a_string

a_nexts = a_string.find_all_next('a')  # all <a> tags after a_string

　　16. Preceding nodes:

　　　　find_all_previous( name , attrs , recursive , text , **kwargs )

　　　　find_previous( name , attrs , recursive , text , **kwargs )
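
Unlike the sibling methods, these searches follow document order and are not limited to one level of the tree. A sketch on a made-up document (markup and names are assumptions for illustration); note that searching backwards from a link also finds the <p> that contains it, because that <p> starts earlier in the document.

```python
from bs4 import BeautifulSoup

# Hypothetical document: a title, a paragraph holding a link, and a second paragraph.
snippet = (
    "<html><head><title>Story</title></head><body>"
    '<p class="story">Once <a id="link1">Elsie</a></p>'
    "<p>end</p></body></html>"
)
soup = BeautifulSoup(snippet, "html.parser")
link = soup.a

prev_title = link.find_previous("title")   # earlier in the document, in another branch
prev_ps = link.find_all_previous("p")      # includes the <p> that contains the link
next_p = link.find_next("p")               # the first <p> that starts after the link

print(prev_title, prev_ps, next_p)
```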

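　　17. Parsing only part of a document:

　　　　Parsing a whole document just to find its <a> tags wastes memory and time. It is faster to ignore everything except the <a> tags from the start. The SoupStrainer class describes which parts of the document to keep, so only the content that matches the SoupStrainer is parsed rather than the whole document. Create a SoupStrainer object and pass it to the BeautifulSoup constructor as the parse_only argument.

　　SoupStrainer arguments: name, attrs, text, **kwargs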

html_doc = """<html>

    <head>

        <title>The Dormouse's story</title>

    </head>

<body>

    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were

        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

    </p>

        and they lived at the bottom of a well.

    <p class="story">...</p>

</body>

"""

from bs4 import SoupStrainer

a_tags = SoupStrainer('a')  # all <a> tags

id_tags = SoupStrainer(id="link2")  # tags whose id is link2

def is_short_string(string):

    return len(string) < 10  # True when the string is shorter than 10 characters

short_string = SoupStrainer(text=is_short_string)  # text that satisfies the condition

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser', parse_only=a_tags).prettify()

soup = BeautifulSoup(html_doc, 'html.parser', parse_only=id_tags).prettify()

soup = BeautifulSoup(html_doc, 'html.parser', parse_only=short_string).prettify()
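
To check what actually survives a partial parse, the following sketch parses a shortened, made-up document with an <a>-only SoupStrainer and then searches the result. The snippet and variable names are assumptions for illustration, and it relies on html.parser, which accepts parse_only.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical, shortened document.
html = """<html><head><title>Story</title></head><body>
<p class="story"><a id="link1" href="/elsie">Elsie</a>,
<a id="link2" href="/lacie">Lacie</a></p>
</body></html>"""

only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

link_ids = [a["id"] for a in soup.find_all("a")]  # the <a> tags were kept
title = soup.find("title")                        # everything else was never parsed
paragraph = soup.find("p")

print(link_ids, title, paragraph)
```

Because the rest of the document is discarded during parsing, later searches for other tags on this soup object return nothing; a second parse is needed if they are required.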
