python - BeautifulSoup4 find_all() behaves strangely after extract() or decompose()

I am seeing strange behavior when using BeautifulSoup4.
I have the following XML (file name: fake_product.xml):

<product acronym="ACRO1">
<formats>
    <format id="format1">
    </format>
    <format id="format2">
    </format>
    <format id="format3">
    </format>
    <format id="format4">
    </format>
    <format id="format5">
    </format>
    <format id="format6">
    </format>
</formats>
</product>

This TestCase fails:

import unittest
from bs4 import BeautifulSoup


class Test(unittest.TestCase):

    def setUp(self):
        with open('fake_product.xml') as f:
            self.soup = BeautifulSoup(f, 'xml')

    def test_product_removal(self):
        output = len(self.soup.find_all('format'))
        expected = 6
        self.assertEqual(output, expected)

        format_to_delete = self.soup.find(id='format2')
        format_to_delete.extract()
        #self.soup = BeautifulSoup(self.soup.prettify(), 'xml')
        output = len(self.soup.find_all('format'))
        expected -= 1
        self.assertEqual(output, expected)

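The reason is that find_all() no longer finds all of the format elements. If I print self.soup.prettify(), everything looks fine to me.
If I uncomment the commented-out line in the TestCase and create a new BeautifulSoup object after extract(), find_all() seems to work correctly again and the TestCase passes.

Can someone explain this behavior to me?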

Answer:

This is a bug introduced in 4.4.0; see the BeautifulSoup 4 project bug tracker:

In some situations, it seems calling extract() does not correctly adjust the next_sibling attribute of the previous element. This leaves the extracted element in the descendant generator. When later calling find(...) or find_all(...), the search then terminates at the extracted element, causing results to be missed.

This bug is also related, and it contains a potential fix:

Lines 265, 267, 274, 277 need != changing to is not

Line 290 needs == changing to is

I can confirm that it fixes your specific test.
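To illustrate why swapping != for is not could matter, here is a minimal pure-Python sketch. It is not BeautifulSoup's actual code. TextNode and both helper functions are hypothetical names, and the assumption is that the affected comparisons involve whitespace text nodes (NavigableString is a str subclass, so two separate nodes with the same text compare equal by value).

```python
class TextNode(str):
    """Hypothetical text node; as a str subclass it compares by value."""


# Two distinct whitespace nodes, like the indentation between sibling tags.
first_gap = TextNode("\n    ")
second_gap = TextNode("\n    ")


def differs_by_value(previous, following):
    """Guard written with != : treats equal-looking nodes as the same node."""
    return previous != following


def differs_by_identity(previous, following):
    """Guard written with `is not`: only the very same object counts as same."""
    return previous is not following


print(differs_by_value(first_gap, second_gap))     # False
print(differs_by_identity(first_gap, second_gap))  # True
```

If a relinking step is guarded by the value comparison, it would be skipped for two different nodes that merely look alike, which is consistent with the stale next_sibling described in the bug report.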

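If you are not comfortable editing the BeautifulSoup source code, the workaround is to rebuild the tree as you did, or to downgrade to 4.3.2 until a fix is released.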
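The rebuild workaround can be sketched roughly as follows. remove_and_rebuild is a hypothetical helper, the XML is a shortened copy of the file from the question, and 'html.parser' is used only so the sketch runs without lxml installed; the question itself uses the 'xml' parser and prettify() instead of str().

```python
from bs4 import BeautifulSoup

XML = """<product acronym="ACRO1">
<formats>
    <format id="format1">
    </format>
    <format id="format2">
    </format>
    <format id="format3">
    </format>
</formats>
</product>"""


def remove_and_rebuild(soup, element_id, parser="html.parser"):
    """Extract the element with the given id, then re-parse the serialized tree."""
    element = soup.find(id=element_id)
    if element is not None:
        element.extract()
    # Re-parsing builds fresh sibling links, so nothing stale survives extract().
    return BeautifulSoup(str(soup), parser)


soup = BeautifulSoup(XML, "html.parser")
soup = remove_and_rebuild(soup, "format2")
remaining = [tag["id"] for tag in soup.find_all("format")]
print(remaining)  # ['format1', 'format3']
```

Re-parsing costs an extra serialize and parse pass on every removal, so for large documents it is probably better to batch the removals and rebuild once at the end.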
