python - lxml: doctype is missing when serializing

In [1]: from lxml import etree

I have an HTML document:

In [2]: root = etree.fromstring(u'''<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">\n<HTML></HTML>''', etree.HTMLParser())

Its doctype is parsed correctly:

In [3]: root.getroottree().docinfo.doctype
Out[3]: u'<!DOCTYPE html PUBLIC "-//IETF//DTD HTML//EN">'

But when I serialize the document, the doctype is lost:

In [4]: etree.tostring(root.getroottree(), method='html')
Out[4]: '<html></html>'

What should I do to get the doctype serialized as well?

Debian GNU/Linux, sid. Python 2.6.6. lxml 2.2.8-2.

Solution:

So far, the only way I have found to make this work is to use the default XML parser and add a non-empty system URL to the document:

>>> from StringIO import StringIO
>>> from lxml import etree
>>> html = etree.parse(StringIO('''<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML></HTML>'''))
>>> etree.tostring(html, method="xml")
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML/>'
>>> etree.tostring(html, method="html")
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML></HTML>'

Doing the same thing with HTMLParser yields the same docinfo, but not the desired output:

>>> html = etree.parse(StringIO('''<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML></HTML>'''), etree.HTMLParser())
>>> etree.tostring(html, method="html")
'<html></html>'
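
If the HTML parser has to stay, one possible fallback is to prepend the parsed doctype by hand whenever the serializer leaves it out. The following is only a minimal sketch under some assumptions: it targets Python 3 and a recent lxml rather than the 2.2.8 setup above, and `with_doctype` and `serialize_html` are hypothetical helper names, not part of lxml.

```python
def with_doctype(doctype, markup):
    """Return markup with doctype in front, unless one is already there.

    Both arguments are text strings. An empty doctype leaves the markup
    untouched, and markup that already begins with a doctype declaration
    is returned as-is so the declaration is never duplicated.
    """
    if not doctype:
        return markup
    if markup.lstrip().lower().startswith("<!doctype"):
        return markup
    return doctype + "\n" + markup


def serialize_html(tree):
    """Serialize an lxml ElementTree as HTML text, keeping its doctype.

    Assumption: tree was produced by etree.parse(...), so tree.docinfo
    is available.
    """
    # Imported here so that with_doctype stays usable without lxml installed.
    from lxml import etree

    markup = etree.tostring(tree, method="html", encoding="unicode")
    return with_doctype(tree.docinfo.doctype, markup)
```

With the document from the question, `serialize_html(root.getroottree())` should return the doctype line followed by the serialized `<html>` element. Newer lxml releases also accept a `doctype` keyword argument in `etree.tostring`, which may make a helper like this unnecessary.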