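I am trying to create a div element from the string below, which contains HTML entities. Because the string contains HTML entities, the reserved character & in each entity is escaped as &amp; in the output, so the entities show up as plain text. How can I avoid this so that the HTML entities are rendered correctly?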
from lxml import etree
import lxml.html

s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'
div = etree.Element("div")
div.text = s
lxml.html.tostring(div)
Output:
<div>Actress Adamari L&amp;#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&amp;#8482; Website And Resources</div>
Solution:
You can specify the encoding when calling tostring():
>>> from lxml.html import fromstring, tostring
>>> s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'
>>> div = fromstring(s)
>>> print(tostring(div, encoding='unicode'))
<p>Actress Adamari López And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts™ Website And Resources</p>
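If the result has to be a div rather than the <p> that fromstring() wraps bare text in, one possible approach is to let the parser decode the entities first and then assign the decoded text to a new element. This is only a sketch: the intermediate name `fragment` is illustrative, and it assumes the input is plain text with entities rather than nested markup.

```python
from lxml import etree
from lxml.html import fromstring, tostring

s = ('Actress Adamari L&#243;pez And Amgen Launch Spanish-Language '
     'Chemotherapy: Myths Or Facts&#8482; Website And Resources')

# Parsing decodes the entities, so the text node holds the real characters.
fragment = fromstring(s)

# Build the div by hand and give it the already-decoded text.
div = etree.Element("div")
div.text = fragment.text_content()

# encoding='unicode' returns a str with the characters written out as-is.
result = tostring(div, encoding='unicode')
print(result)
```

Because the text assigned to the div no longer contains a literal &, the serializer has nothing to escape into &amp;.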
As a side note, when working with HTML data you should definitely use lxml.html.tostring():
"Note that you should use lxml.html.tostring and not lxml.tostring. lxml.tostring(doc) will return the XML representation of the document, which is not valid HTML. In particular, things like <script src="..."></script> will be serialized as <script src="..." />, which completely confuses browsers."
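The difference described in the quote can be seen with a small comparison. This is a hedged sketch: it assumes the XML serializer the quote refers to is lxml.etree.tostring, and the file name app.js is made up for illustration.

```python
from lxml import etree
import lxml.html

# A fragment containing an empty <script> element (app.js is a placeholder).
doc = lxml.html.fromstring('<div><script src="app.js"></script></div>')

# XML serialization collapses the empty element into a self-closing tag.
xml_out = etree.tostring(doc, encoding='unicode')

# HTML serialization keeps the explicit closing tag that browsers expect.
html_out = lxml.html.tostring(doc, encoding='unicode')

print(xml_out)
print(html_out)
```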
See also:
> Serialising to Unicode strings