Extracting text from a table with Python and lxml

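I recently saw another user's question about extracting information from a web table, "Extracting information from a webpage with python". The answer from ekhumoro works well on the page that user asked about. See below.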

from urllib2 import urlopen
from lxml import etree

url = 'http://www.uscho.com/standings/division-i-men/2011-2012/'

tree = etree.HTML(urlopen(url).read())

for section in tree.xpath('//section[starts-with(@id, "section_")]'):
    print section.xpath('h3[1]/text()')[0]
    for row in section.xpath('table/tbody/tr'):
        cols = row.xpath('td//text()')
        print '  ', cols[0].ljust(25), ' '.join(cols[1:])
    print

My problem is using this code as a guide for parsing this page: http://www.uscho.com/rankings/d-i-mens-poll/. With the following changes, I can only get the h1 and h3 to print.

Input

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
tree = etree.HTML(urlopen(url).read())

for section in tree.xpath('//section[starts-with(@id, "rankings")]'):
    print section.xpath('h1[1]/text()')[0]
    print section.xpath('h3[1]/text()')[0]
    for row in section.xpath('table/tbody/tr'):
        cols = row.xpath('td/b/text()')
        print '  ', cols[0].ljust(25), ' '.join(cols[1:])
    print

Output

USCHO.com Division I Men's Poll
December 12, 2011

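The structure of the tables seems to be the same, so I don't understand why similar code doesn't work here. I'm just a mechanical engineer. Any help is appreciated.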

Answer:

lxml is great, but if you're not familiar with xpath, I'd recommend BeautifulSoup:

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
soup = BeautifulSoup(urlopen(url).read())

section = soup.find('section', id='rankings')
h1 = section.find('h1')
print h1.text
h3 = section.find('h3')
print h3.text
print

rows = section.find('table').findAll('tr')[1:-1]
for row in rows:
    columns = [data.text for data in row.findAll('td')[1:]]
    print '{0:20} {1:4} {2:>6} {3:>4}'.format(*columns)

The output of this script is:

USCHO.com Division I Men's Poll
December 12, 2011

Minnesota-Duluth     (49) 12-3-3  999
Minnesota                 14-5-1  901
Boston College            12-6-0  875
Ohio State           ( 1) 13-4-1  848
Merrimack                 10-2-2  844
Notre Dame                11-6-3  667
Colorado College           9-5-0  650
Western Michigan           9-4-5  647
Boston University         10-5-1  581
Ferris State              11-6-1  521
Union                      8-3-5  510
Colgate                   11-4-2  495
Cornell                    7-3-1  347
Denver                     7-6-3  329
Michigan State            10-6-2  306
Lake Superior             11-7-2  258
Massachusetts-Lowell      10-5-0  251
North Dakota               9-8-1   88
Yale                       6-5-1   69
Michigan                   9-8-3   62
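
For anyone who would rather stay with lxml: I can't see the page's markup from here, so this is only a guess at why table/tbody/tr matched nothing. Two common culprits are a table that is not a direct child of the section, and a tbody that shows up in the browser's inspector but is not in the HTML that lxml actually parses. The sketch below uses made-up markup (SAMPLE, the team names, and the helper names cell_text and extract_rows are all invented for illustration) to show a child-only path coming back empty while a descendant path (.//tr) still finds the rows. It is written for Python 3, and it falls back to the standard library's ElementTree, whose findall() accepts the same simple paths, if lxml is not installed.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Standard-library fallback; findall() understands the simple paths used here.
    import xml.etree.ElementTree as etree

# Invented, well-formed markup. This is NOT the real page: the table sits
# inside a <div> and has no <tbody>, two things that break 'table/tbody/tr'.
SAMPLE = """
<section id="rankings">
  <h1>Sample Poll</h1>
  <h3>January 1, 2000</h3>
  <div>
    <table>
      <tr><th>Rank</th><th>Team</th><th>1st</th><th>Record</th><th>Points</th></tr>
      <tr><td>1</td><td><b>Team A</b></td><td>(49)</td><td>12-3-3</td><td>999</td></tr>
      <tr><td>2</td><td><b>Team B</b></td><td></td><td>14-5-1</td><td>901</td></tr>
    </table>
  </div>
</section>
"""

section = etree.fromstring(SAMPLE.strip())

# Child-only path from the question: every step must be a direct child.
strict_rows = section.findall('table/tbody/tr')

# Descendant path: matches <tr> elements at any depth below the section.
loose_rows = section.findall('.//tr')


def cell_text(cell):
    """Return all text inside a cell, including text nested in <b> and similar tags."""
    return ''.join(cell.itertext()).strip()


def extract_rows(section):
    """Collect the <td> texts of each data row, skipping the rank column."""
    rows = []
    for tr in section.findall('.//tr'):
        cells = [cell_text(td) for td in tr.findall('td')]
        if cells:  # the header row has only <th> cells, so it yields no <td>
            rows.append(cells[1:])
    return rows


if __name__ == '__main__':
    print(len(strict_rows), 'rows via table/tbody/tr')
    print(len(loose_rows), 'rows via .//tr')
    for columns in extract_rows(section):
        # Same column layout as the BeautifulSoup answer above.
        print('{0:20} {1:4} {2:>6} {3:>4}'.format(*columns))
```

If the real page behaves like this sample, swapping 'table/tbody/tr' for './/tr' (or './/table//tr' with xpath()) in the original loop would be the smallest change worth trying. Collecting the full text of each td also avoids depending on whether a given cell wraps its text in a b tag.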