I'm seeing strange behavior when using BeautifulSoup4.
I have the following XML (file name: fake_product.xml):
<product acronym="ACRO1">
  <formats>
    <format id="format1">
    </format>
    <format id="format2">
    </format>
    <format id="format3">
    </format>
    <format id="format4">
    </format>
    <format id="format5">
    </format>
    <format id="format6">
    </format>
  </formats>
</product>
This TestCase fails:
import unittest

from bs4 import BeautifulSoup


class Test(unittest.TestCase):
    def setUp(self):
        with open('fake_product.xml') as f:
            self.soup = BeautifulSoup(f, 'xml')

    def test_product_removal(self):
        # All six <format> elements are found before the removal.
        output = len(self.soup.find_all('format'))
        expected = 6
        self.assertEqual(output, expected)

        format_to_delete = self.soup.find(id='format2')
        format_to_delete.extract()
        # Uncommenting the next line (re-parsing the tree) makes the test pass:
        #self.soup = BeautifulSoup(self.soup.prettify(), 'xml')

        # One element was removed, so five should remain; this assertion fails.
        output = len(self.soup.find_all('format'))
        expected -= 1
        self.assertEqual(output, expected)
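The reason is that find_all() no longer finds all the formats after the extract() call. If I print self.soup.prettify(), everything looks fine to me.
If I uncomment the commented line in the TestCase and create a new BeautifulSoup object after extract(), find_all() seems to work correctly again and the TestCase passes.
Can someone explain this behavior to me?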
Solution:
This is a bug introduced in 4.4.0; see the BeautifulSoup 4 project bug tracker:
In some situations, it seems calling extract() does not correctly adjust the next_sibling attribute of the previous element. This leaves the extracted element in the descendant generator. When later calling find(...) or find_all(...), the search then terminates at the extracted element, causing results to be missed.
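To make the failure mode in that report easier to picture, here is a toy sketch. It is not BeautifulSoup's code: the Node class, the single next_element pointer, and both extract functions are hypothetical simplifications. It only shows how a removal that forgets to repoint the previous node leaves the removed node reachable and cuts the traversal short.

```python
class Node:
    """Toy node in a document-order linked list (not bs4's real classes)."""

    def __init__(self, name):
        self.name = name
        self.next_element = None


def link(nodes):
    """Chain the nodes together in list order."""
    for current, following in zip(nodes, nodes[1:]):
        current.next_element = following


def walk(start):
    """Yield node names by following next_element until it is None."""
    node = start
    while node is not None:
        yield node.name
        node = node.next_element


def buggy_extract(node):
    """Clear the node's own forward link but leave its predecessor untouched."""
    node.next_element = None


def correct_extract(previous, node):
    """Repoint the predecessor past the node, then detach the node."""
    previous.next_element = node.next_element
    node.next_element = None


def build():
    nodes = [Node(n) for n in ("format1", "format2", "format3", "format4")]
    link(nodes)
    return nodes


nodes = build()
buggy_extract(nodes[1])
after_buggy = list(walk(nodes[0]))      # ['format1', 'format2']

nodes = build()
correct_extract(nodes[0], nodes[1])
after_correct = list(walk(nodes[0]))    # ['format1', 'format3', 'format4']

print(after_buggy)
print(after_correct)
```

In the buggy case the walk still visits the removed node and then stops there, which matches the symptom in the question: find_all() misses everything after the extracted element, while a freshly parsed tree is fine.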
This bug is also related and includes a potential fix:
Lines 265, 267, 274, 277 need != changing to is not
Line 290 needs == changing to is
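As a rough illustration of why swapping == for is can matter here: text nodes in BeautifulSoup are str subclasses, so two separate whitespace nodes such as "\n" compare equal even though they are different objects. The sketch below uses a hypothetical TextNode class and helper functions, not the library's own code, and I am assuming this is the kind of confusion the patch addresses.

```python
class TextNode(str):
    """Hypothetical stand-in for a text node; as a str subclass, == compares content."""


def index_by_equality(items, target):
    """Return the index of the first item that compares equal to target."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1


def index_by_identity(items, target):
    """Return the index of the item that is the very same object as target."""
    for i, item in enumerate(items):
        if item is target:
            return i
    return -1


first_newline = TextNode("\n")
second_newline = TextNode("\n")
siblings = [first_newline, TextNode("format2"), second_newline]

same_content = first_newline == second_newline   # True: same characters
same_object = first_newline is second_newline    # False: two distinct objects

found_by_equality = index_by_equality(siblings, second_newline)   # 0, the wrong node
found_by_identity = index_by_identity(siblings, second_newline)   # 2, the intended node

print(same_content, same_object, found_by_equality, found_by_identity)
```

When code that rewires sibling links picks a node by equality, it can land on a different node with the same content, so an identity check is the safer comparison for this kind of bookkeeping.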
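I can confirm that it fixes your particular test.
If you're not comfortable editing the BeautifulSoup source code, the workaround is to rebuild the tree as you did, or to downgrade to 4.3.2 until a fix is released.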
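If you go with the rebuild workaround, something along these lines might keep it in one place. This is only a sketch: extract_and_rebuild is a hypothetical helper name, the XML is a shortened inline version of fake_product.xml, and it assumes lxml is installed for the 'xml' parser as in the question.

```python
from bs4 import BeautifulSoup

# Shortened inline version of the question's file, for illustration only.
XML = """<product acronym="ACRO1">
<formats>
<format id="format1">
</format>
<format id="format2">
</format>
<format id="format3">
</format>
</formats>
</product>"""


def extract_and_rebuild(soup, element_id, features="xml"):
    """Hypothetical helper: remove one element, then re-parse the serialized tree."""
    target = soup.find(id=element_id)
    if target is not None:
        target.extract()
    # Re-parsing the serialized document creates fresh nodes, so no stale
    # links from the extraction can survive into the new tree.
    return BeautifulSoup(str(soup), features)


soup = extract_and_rebuild(BeautifulSoup(XML, "xml"), "format2")
remaining = [tag["id"] for tag in soup.find_all("format")]
print(remaining)
```

The cost is a full serialize-and-parse after each removal, so for large documents it may be worth batching all the extract() calls and rebuilding once at the end.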