Scraping page content from a div with BeautifulSoup

I am trying to scrape the heading, summary, date, and link from each div with the class "row" on http://www.indiainfoline.com/top-news.

import urllib2
from bs4 import BeautifulSoup

link = 'http://www.indiainfoline.com/top-news'
redditFile = urllib2.urlopen(link)
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml, "lxml")
productDivs = soup.findAll('div', attrs={'class': 'row'})
for div in productDivs:
    result = {}
    try:
        # keep the tag here; get_text() must be called on the tag,
        # not on text that has already been extracted
        heading = div.find('p', attrs={'class': 'heading fs20e robo_slab mb10'})
        title = heading.get_text()
        article_link = "http://www.indiainfoline.com" + heading.find('a')['href']
        summary = div.find('p')
    except AttributeError:
        continue

But none of the components are being fetched. Any suggestions on how to fix this?

Solution:

The HTML source contains many elements with class="row", so you need to filter down to the section that holds the actual row data. All 16 expected rows sit under id="search-list", so extract that section first and then the rows inside it. Since .select returns a list, use [0] to get the element. Once you have the row data, iterate over it and extract the heading, article URL, summary, and so on.

import urllib2
from bs4 import BeautifulSoup

link = 'http://www.indiainfoline.com/top-news'
redditFile = urllib2.urlopen(link)
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml, "lxml")

# narrow the search to the section that actually holds the article rows
section = soup.select('#search-list')
rowdata = section[0].select('.row')

for row in rowdata[1:]:  # the first .row is a header, so skip it
    heading = row.select('.heading.fs20e.robo_slab.mb10')[0].text
    title = 'http://www.indiainfoline.com' + row.select('a')[0]['href']
    summary = row.select('p')[0].text
    print(heading)
    print(title)
    print(summary)

Output:

PFC board to consider bonus issue; stock surges by 4%     
http://www.indiainfoline.com/article/news-top-story/pfc-pfc-board-to-consider-bonus-issue-stock-surges-by-4-117080300814_1.html
PFC board to consider bonus issue; stock surges by 4%
...
...
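The selector-narrowing idea can be checked offline on a small sample document. The markup below is a hypothetical stand-in for the real page (the class and id names come from the answer above; everything else is invented for illustration), so it runs without a network connection:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup: 'row' divs exist both inside and outside
# '#search-list', which is why selecting '.row' globally picks up
# unrelated nodes.
html = """
<div class="row">navigation clutter</div>
<section id="search-list">
  <div class="row">header row</div>
  <div class="row">
    <p class="heading fs20e robo_slab mb10"><a href="/article/one.html">Story one</a></p>
    <p>Summary of story one.</p>
  </div>
</section>
"""

# html.parser is used here only to avoid the lxml dependency;
# the technique is identical with "lxml"
soup = BeautifulSoup(html, "html.parser")
section = soup.select('#search-list')      # .select returns a list
rows = section[0].select('.row')           # only rows inside the section

articles = []
for row in rows[1:]:                       # skip the header row
    heading = row.select('.heading.fs20e.robo_slab.mb10')[0].text.strip()
    url = 'http://www.indiainfoline.com' + row.select('a')[0]['href']
    summary = row.select('p')[-1].text.strip()  # last <p> holds the summary in this sample
    articles.append({'heading': heading, 'url': url, 'summary': summary})

print(articles)
```

Collecting each row into a dict, as sketched here, also matches the unused `result = {}` in the question's loop and makes the scraped data easier to reuse later.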