python - Getting the HTML between h3/h2 tags using XPath / BeautifulSoup

I am using Scrapy for a project, and I get the following HTML:

<h3><span class="my_class">First title</span></h3>
<ul>
    <li>Text for the first title... li #1</li>
</ul>
<ul>
    <li>Text for the first title... li #2</li>
</ul>
<h3><span class="my_class">Second title</span></h3>
<ul>
    <li>Text for the second title... li #1</li>
</ul>
<ul>
    <li>Text for the second title... li #2</li>
</ul>

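Now, when I use response.xpath(".//ul/li/text()").extract(), it does work and gives me ["Text for the first title... li #1", "Text for the first title... li #2", "Text for the second title... li #1", "Text for the second title... li #2"]. However, this is only partly what I want.

I would like two lists, one for the first title and one for the second title.
The result would then be: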

first_title = ["Text for the first title... li #1", "Text for the first title... li #2"]
second_title = ["Text for the second title... li #1", "Text for the second title... li #2"]

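I still have no idea how to achieve this. I am currently using Scrapy to fetch the HTML, so a solution that combines XPath with plain Python would be ideal for me. Still, I suspect BeautifulSoup could be useful for this kind of task.

Do you have any ideas on how to do this in Python?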

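Solution:

You can use both XPath and CSS selectors in Scrapy.

Here is an example solution (in an ipython session; I only changed #1 and #2 in the 2nd block to #3 and #4, to make it more obvious):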

In [1]: import scrapy

In [2]: selector = scrapy.Selector(text="""<h3><span class="my_class">First title</span></h3>
   ...: <ul>
   ...:     <li>Text for the first title... li #1</li>
   ...:     <li>Text for the first title... li #2</li>
   ...: </ul>
   ...: <h3><span class="my_class">Second title</span></h3>
   ...: <ul>
   ...:     <li>Text for the second title... li #3</li>
   ...:     <li>Text for the second title... li #4</li>
   ...: </ul>""")

In [3]: for title_list in selector.css('h3 + ul'):
   ...:     print(title_list.xpath('./li/text()').extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']

In [4]: for title_list in selector.css('h3 + ul'):
   ...:     print(title_list.css('li::text').extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']

In [5]: 

Edit, after the OP's follow-up question:

Every <li> tag is enclosed in its own <ul> (…) Is there any way to extend that to make it look for all the ul tags below the h3 tag?

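If the h3 and ul elements are all siblings, one way to select the ul elements that come before the next h3 is to count their preceding h3 siblings.

Consider the following input HTML snippet: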

<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>

<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>

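The first <ul><li> line has 1 preceding h3 sibling; the third <ul><li> line has 2 preceding h3 siblings.

So, for each h3, you want the following ul siblings whose number of preceding h3 siblings equals the number of h3 elements you have seen so far.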

First:

following-sibling::ul[count(preceding-sibling::h3)=1]

Then:

following-sibling::ul[count(preceding-sibling::h3)=2]

And so on.
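
To see what the predicate count(preceding-sibling::h3)=N picks out, the counting can be mimicked in plain Python with the standard library. This is only a sketch, not part of the original answer: it assumes the snippet parses as well-formed XML once wrapped in a <div> root, and the wrapper as well as the names groups and h3_seen are illustrative.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: the snippet is wrapped in a <div> so it has a single root
# element and can be parsed as XML.
html = """<div>
<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
</div>"""

root = ET.fromstring(html)

groups = defaultdict(list)  # number of preceding h3 siblings -> li texts
h3_seen = 0                 # plays the role of count(preceding-sibling::h3)
for child in root:          # direct children in document order (the siblings)
    if child.tag == "h3":
        h3_seen += 1
    elif child.tag == "ul":
        groups[h3_seen].extend(li.text for li in child.findall("li"))

first_title, second_title = groups[1], groups[2]
print(first_title)
print(second_title)
```

Every ul that follows the first h3 but precedes the second lands in groups[1], which is the same set the XPath predicate with =1 selects when evaluated from the first h3.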

Here is this idea in action, using enumerate() while iterating over the selected h3 elements (remember that XPath positions start at 1, not 0):

In [1]: import scrapy

In [2]: selector = scrapy.Selector(text="""
<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>

<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #3</li></ul>
<ul><li>Text for the second title... li #4</li></ul>
""")

In [3]: for cnt, title in enumerate(selector.css('h3'), start=1):
   ...:     print(title.xpath('following-sibling::ul[count(preceding-sibling::h3)=%d]/li/text()' % cnt).extract())
   ...:
['Text for the first title... li #1', 'Text for the first title... li #2']
['Text for the second title... li #3', 'Text for the second title... li #4']
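
If the goal is one named list per title, as in the first_title / second_title lists from the question, the same grouping can be keyed by the heading text. The sketch below swaps in a different technique from the answer above: instead of XPath it uses the standard-library html.parser and remembers the most recent h3 while streaming through the document. The class name TitleGrouper and its attributes are hypothetical, and it assumes h3 and li elements are not nested inside one another.

```python
from html.parser import HTMLParser


class TitleGrouper(HTMLParser):
    """Collect the text of each <li> under the text of the latest <h3>."""

    def __init__(self):
        super().__init__()
        self.groups = {}      # h3 text -> list of li texts, in document order
        self._current = None  # text of the most recent h3, if any
        self._inside = None   # "h3" or "li" while inside one, else None
        self._buffer = []     # text fragments of the element being read

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "li"):
            self._inside = tag
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._inside:
            return  # e.g. the </span> inside an h3
        text = "".join(self._buffer).strip()
        if tag == "h3":
            self._current = text
            self.groups.setdefault(text, [])
        elif self._current is not None:
            self.groups[self._current].append(text)
        self._inside = None


html = """<h3><span class="my_class">First title</span></h3>
<ul><li>Text for the first title... li #1</li></ul>
<ul><li>Text for the first title... li #2</li></ul>
<h3><span class="my_class">Second title</span></h3>
<ul><li>Text for the second title... li #1</li></ul>
<ul><li>Text for the second title... li #2</li></ul>"""

parser = TitleGrouper()
parser.feed(html)
first_title = parser.groups["First title"]
second_title = parser.groups["Second title"]
print(first_title)
print(second_title)
```

Keying by title text assumes the headings are unique; with duplicate headings, the items of both sections would end up in the same list.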