python - Beautiful Soup turns S&P into S&P; and AT&T into AT&T;?

2021-07-07 13:20:52

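I'm using BeautifulSoup 4 (4.3.2) to parse some fairly messy HTML documents, and I've run into a problem: it turns company names such as S&P (Standard and Poor's), M&S (Marks and Spencer), and AT&T into S&P;, M&S;, and AT&T;. It seems to want to complete the &[A-Z]+ pattern into an HTML entity, but it isn't actually consulting an HTML entity lookup table, because &P; is not an HTML entity.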

How can I stop it from doing this, or do I just have to regex-match the invalid entities and change them back?

>>> import bs4
>>> soup = bs4.BeautifulSoup('AT&T announces new plans')
>>> soup.text
u'AT&T; announces new plans'

>>> import bs4
>>> soup = bs4.BeautifulSoup('AT&TOP announces new plans')
>>> soup.text
u'AT&TOP; announces new plans'

I've tried the above on OS X 10.8.5 with Python 2.7.5 and on Scientific Linux 6 with Python 2.7.5.

Solution:

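This appears to be a bug (or a feature) in the way BeautifulSoup 4 handles unknown HTML entity references. As Ignacio said in the comments above, it is probably best to preprocess the input and replace the '&' symbols with the HTML entity ('&amp;').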
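A minimal sketch of that preprocessing step, in case it helps. The helper name escape_bare_ampersands is made up for illustration, and the check against the standard library's name2codepoint table is an assumption about what should count as a "real" entity; it is not what BeautifulSoup itself does.

```python
import re

try:
    from html.entities import name2codepoint   # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2

# An ampersand, optionally followed by something shaped like an entity
# reference: a decimal or hex character reference, or a name, ending in ';'.
_AMP = re.compile(r'&(?:(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)?')


def escape_bare_ampersands(markup):
    """Hypothetical helper: turn every '&' that does not start a known
    entity reference into '&amp;', leaving valid references untouched."""
    def fix(match):
        ref = match.group(1)
        if ref is not None and (ref.startswith('#') or ref in name2codepoint):
            return match.group(0)             # known reference: keep as-is
        return '&amp;' + match.group(0)[1:]   # bare or unknown: escape the '&'
    return _AMP.sub(fix, markup)


print(escape_bare_ampersands('AT&T announces new plans'))
print(escape_bare_ampersands('Fish &amp; chips, S&P and &copy; 2013'))
```

The cleaned string can then be passed to bs4.BeautifulSoup in place of the raw markup. Note that this sketch rewrites ampersands everywhere in the input, including inside script blocks, so it may need narrowing for documents that contain inline JavaScript.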

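However, if for some reason you don't want to do that, the only way I could find to fix the problem is to "monkey-patch" the code. This script worked for me (Python 2.7.3 on Mac OS X):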

import bs4

def my_handle_entityref(self, name):
    # Look the name up in bs4's own entity table.
    character = bs4.dammit.EntitySubstitution.HTML_ENTITY_TO_CHARACTER.get(name)
    if character is not None:
        data = character
    else:
        # The original code mishandles unknown entities by appending ';'
        # (the commented-out line below); emit the text unchanged instead.
        # data = "&%s;" % name
        data = "&%s" % name
    self.handle_data(data)

bs4.builder._htmlparser.BeautifulSoupHTMLParser.handle_entityref = my_handle_entityref
soup = bs4.BeautifulSoup('AT&T announces new plans')
print soup.text
soup = bs4.BeautifulSoup('AT&TOP announces new plans')
print soup.text

It produces the output:

AT&T announces new plans
AT&TOP announces new plans

You can see the problematic method here:

http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L81

and the offending line here:

http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L86
