python - Efficient way to iterate through XML elements

I have XML like this:

<a>
    <b>hello</b>
    <b>world</b>
</a>
<x>
    <y></y>
</x>
<a>
    <b>first</b>
    <b>second</b>
    <b>third</b>
</a>

I need to iterate over all the <a> and <b> tags, but I don't know how many of them there are in the document. So I use XPath to handle that:

from lxml import etree

doc = etree.fromstring(xml)

atags = doc.xpath('//a')
for a in atags:
    btags = a.xpath('b')
    for b in btags:
        print(b)

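It works, but I have pretty big files, and cProfile shows me that xpath is very expensive to use.

I wonder, is there a more efficient way to iterate through an indefinite number of XML elements?

Solution:

XPath should be fast. You can reduce the number of XPath calls to one: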

doc = etree.fromstring(xml)
btags = doc.xpath('//a/b')
for b in btags:
    print(b.text)
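
The sample in the question has several top-level elements, so it would need a single enclosing root before etree.fromstring accepts it. The sketch below assumes a hypothetical wrapper element named root (not part of the original question) and uses findall('.//a/b'), the ElementPath counterpart of the '//a/b' XPath, because that call exists in both lxml and the standard library's ElementTree. Treat it as an illustration of the one-query approach, not as a benchmark.

```python
try:
    from lxml import etree  # preferred, matches the answer above
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with the same findall API

# Assumption: the fragments from the question, wrapped in a hypothetical
# <root> element so that the document is well-formed.
xml = b"""<root>
<a><b>hello</b><b>world</b></a>
<x><y></y></x>
<a><b>first</b><b>second</b><b>third</b></a>
</root>"""

doc = etree.fromstring(xml)

# One query for every <b> that is a direct child of an <a>, in document order.
b_texts = [b.text for b in doc.findall('.//a/b')]
print(b_texts)
```

Whether findall or xpath is quicker on a given file is something to measure with cProfile on the real data.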

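If that is not fast enough, you could try Liza Daly's fast_iter. It has the advantage of not requiring the entire XML to be processed with etree.fromstring first, and parent nodes are discarded after their children have been visited. Both of these help reduce memory requirements. Below is a modified version of fast_iter that is more aggressive about removing other elements that are no longer needed.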

def fast_iter(context, func, *args, **kwargs):
    """
    fast_iter is useful if you need to free memory while iterating through a
    very large XML file.

    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def process_element(elt):
    print(elt.text)

import io  # xml is assumed to be a bytes object holding the document

context = etree.iterparse(io.BytesIO(xml), events=('end',), tag='b')
fast_iter(context, process_element)

Liza Daly's article on parsing large XML files may also be useful to you. According to the article, lxml with fast_iter can be faster than cElementTree's iterparse (see Table 1).
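
For comparison, the effbot page cited in the docstring describes a similar incremental pattern for ElementTree's iterparse. Here is a minimal sketch using only the standard library. The helper name iter_b_texts and the <root> wrapper are assumptions made for illustration and do not come from the original answer. It takes the root from the first start event and clears it after each <b> has been handled, so processed elements do not pile up in memory.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: the question's fragments wrapped in a hypothetical <root> element.
xml = b"""<root>
<a><b>hello</b><b>world</b></a>
<x><y></y></x>
<a><b>first</b><b>second</b><b>third</b></a>
</root>"""


def iter_b_texts(source):
    """Yield the text of each <b> element while keeping the tree small."""
    context = iter(ET.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == 'b':
            yield elem.text
            elem.clear()  # drop the element's own content
            root.clear()  # drop already-built children hanging off the root


texts = list(iter_b_texts(io.BytesIO(xml)))
print(texts)
```

Unlike the lxml version, this filters on elem.tag inside the loop, because the tag argument used above is an lxml feature.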
