Python - Web Scraping - BeautifulSoup

I am new to BeautifulSoup and am trying to extract data from the following website:
http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city

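I am trying to extract the summary percentage for each category (food, housing, clothes, transportation, personal care, and entertainment). So, for the link above, I want to extract the percentages: 48%, 129%, 63%, 43%, 42%, 42%, and 72%.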

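Unfortunately, my current Python code using BeautifulSoup extracts the following percentages instead: 12%, 85%, 63%, 21%, 42%, and 48%. I don't know why this happens. Any help would be greatly appreciated! Here is my code: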

# Python 2 code: urllib2 is not available in Python 3.
import urllib2
from bs4 import BeautifulSoup

url = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city"
page = urllib2.urlopen(url)
soup_expatistan = BeautifulSoup(page)
page.close()

# Each category summary is a <tr class="expandable"> inside <table class="comparison">.
expatistan_table = soup_expatistan.find("table", class_="comparison")
expatistan_titles = expatistan_table.find_all("tr", class_="expandable")

for expatistan_title in expatistan_titles:
    # The percentage is the text of the <span> inside <th class="percent">.
    published_date = expatistan_title.find("th", class_="percent")
    print(published_date.span.string)

Solution:

I can't pin down the exact cause, but it seems to be related to urllib2. Simply switching to requests made it work. Here is the code:

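import requests
from bs4 import BeautifulSoup

url = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city"
page = requests.get(url).text
# Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning.
soup_expatistan = BeautifulSoup(page, "html.parser")

# Each category summary is a <tr class="expandable"> inside <table class="comparison">.
expatistan_table = soup_expatistan.find("table", class_="comparison")
expatistan_titles = expatistan_table.find_all("tr", class_="expandable")

for expatistan_title in expatistan_titles:
    # The percentage is the text of the <span> inside <th class="percent">.
    published_date = expatistan_title.find("th", class_="percent")
    print(published_date.span.string)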

You can install requests with pip:

$ pip install requests
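
The parsing step can be checked without touching the network. The following is a minimal sketch, assuming markup shaped like the selectors used above (a table with class "comparison", rows with class "expandable", and a th with class "percent" wrapping a span). SAMPLE_HTML and extract_percentages are hypothetical names made up for this illustration, and the sample markup is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the selectors used in the answer above.
SAMPLE_HTML = """
<table class="comparison">
  <tr class="expandable"><th class="percent"><span>48%</span></th></tr>
  <tr class="expandable"><th class="percent"><span>129%</span></th></tr>
  <tr><th class="percent"><span>5%</span></th></tr>
</table>
"""


def extract_percentages(html):
    """Return the text of every th.percent > span in the expandable rows."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="comparison")
    if table is None:
        # The expected table is missing, e.g. the server sent a different page.
        return []
    percentages = []
    for row in table.find_all("tr", class_="expandable"):
        cell = row.find("th", class_="percent")
        if cell is not None and cell.span is not None:
            percentages.append(cell.span.get_text(strip=True))
    return percentages


# Only the two rows marked "expandable" are picked up.
print(extract_percentages(SAMPLE_HTML))
```

Feeding the same function the HTML returned by urllib2 and by requests is one way to see whether the two clients actually received different pages.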

Edit

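The problem is indeed related to urllib2. It seems that the www.expatistan.com server responds differently depending on the User-Agent header set in the request. To get the same response with urllib2 as with requests, you have to do the following: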

import urllib2

url = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city"
request = urllib2.Request(url)
# Send a browser-like User-Agent instead of urllib2's default one.
request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0')
opener = urllib2.build_opener()
page = opener.open(request).read()
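
On Python 3, urllib2 no longer exists; its functionality lives in urllib.request. A rough equivalent of the snippet above might look like the sketch below. The helper names build_request and fetch_html are made up for this illustration, the timeout value is arbitrary, and falling back to UTF-8 when the server sends no charset is an assumption.

```python
import urllib.request

URL = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city"
# Browser-like User-Agent string taken from the urllib2 snippet above.
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=USER_AGENT, timeout=10):
    """Download the page and decode it to text (UTF-8 assumed if no charset is sent)."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


request = build_request(URL)
# urllib stores header names capitalized, hence the key "User-agent".
print(request.get_header("User-agent"))
```

Calling fetch_html(URL) performs the actual download, and the returned string can be passed to BeautifulSoup exactly as in the requests version.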