Web scraping with Python: selecting div, h2 and h3 classes

This is my first time using Python and web scraping. I have looked around and still can't work out how to do what I need.

Below is a screenshot of the elements I am inspecting in Chrome.

What I want to do is get the apartment name and address for each city selected from the list.

import requests
from bs4 import BeautifulSoup

#url = 'http://www.homestead.ca/apartments-for-rent/'                           
rootURL = 'http://www.homestead.ca'
response = requests.get(rootURL)                                                   
html = response.content
soup = BeautifulSoup(html,'lxml')

dropdown_list = soup.select(".primary .child-pages a")

#city_names=[dropdown_list_value.text for dropdown_list_value in dropdown_list]
#print (city_names)

cityLinks=[rootURL + dropdown_list_value['href'] for dropdown_list_value in dropdown_list]

for cityLinks_select in dropdown_list:                                       #Looping each city from the Apartment drop down list
    print ('Selecting city:',cityLinks_select.text)
    cityResponse = requests.get(cityLinks)
    cityHtml = cityResponse.content
    citySoup = BeautifulSoup(cityHtml,'lxml')

    community_list = soup.select(".extended-search .property-container a[h2 h3]")
    # TODO: get and print the apartment link
    # TODO: get and print the apartment name
    # TODO: get and print the address of the apartment

Solution:

As I commented, some of the data is created dynamically. If we look at the page source itself, we see:

                        <div class="content">
                                    <div class="title-container">
                                        <h2 class="building-name"><%= building.get('name') %></h2>
                                        <h3 class="address"><%= building.get('address').address %></h3>
                                    </div>

                                    <div class="rent">
                                        <h4 class="sub-title">Rent from</h4>
                                        <% if (building.get('statistics').suites.rates.min !== 'undefined') { %>
                                            <% $min_rate = commaSeparateNumber(parseInt(building.get('statistics').suites.rates.min)); %>
                                            <span class="rent-value">$<%= $min_rate %></span>
                                        <% } %>
                                    </div>
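
The <%= ... %> markers in that source are template tags that get filled in by JavaScript in the browser, which is why requests only ever sees them unrendered. As a rough check for this situation, a sketch like the following (Python 3) can flag such tags in raw HTML. The regular expression and the helper name find_template_tags are my own illustration, not something the site provides:

```python
import re

# Matches template tags such as <%= ... %> and <% ... %>.
TEMPLATE_TAG = re.compile(r"<%.*?%>", re.DOTALL)


def find_template_tags(html):
    """Return every unrendered template tag found in the raw HTML."""
    return TEMPLATE_TAG.findall(html)


# Two lines copied from the page source shown above.
sample = '''
<h2 class="building-name"><%= building.get('name') %></h2>
<h3 class="address"><%= building.get('address').address %></h3>
'''

tags = find_template_tags(sample)
print(len(tags), "template tags found")
for tag in tags:
    print(tag)
```

If this finds tags inside the elements you are after, the values are most likely loaded by a separate request and will not be in the HTML you downloaded.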

All we can get from the source itself is the building name, address and phone number:

cityLinks = [rootURL + dropdown_list_value['href'] for dropdown_list_value in dropdown_list]

# you need to iterate over the joined urls
for city in cityLinks:  # Looping each city from the Apartment drop down list
    cityResponse = requests.get(city)
    cityHtml = cityResponse.content
    citySoup = BeautifulSoup(cityHtml, 'lxml')
    # all the info we can parse is inside the div class="building-info"
    for div in citySoup.select("div.building-info"):
        print(div.select_one("h1.building-name").text.strip())
        print(div.select_one("h2.location").text.strip())
        print(div.select_one("div.contact-container div.phone").text.strip())
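
One thing to watch when building cityLinks: rootURL + href only works while every href is a root-relative path. A slightly more defensive sketch uses urljoin from the standard library (Python 3 import shown). The href values below are made up for illustration and stand in for dropdown_list_value['href']:

```python
from urllib.parse import urljoin

root_url = 'http://www.homestead.ca'

# Hypothetical href values, not taken from the real page.
hrefs = [
    '/apartments-for-rent/brantford',                      # root-relative
    'apartments-for-rent/kingston',                        # relative, no leading slash
    'http://www.homestead.ca/apartments-for-rent/ottawa',  # already absolute
]

# The trailing slash makes relative paths resolve against the site root.
city_links = [urljoin(root_url + '/', href) for href in hrefs]

for link in city_links:
    print(link)
```

All three forms end up as full URLs, whereas plain concatenation would mangle the second and third.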

If we mimic the ajax request, we can get all the data in JSON format:

import requests
from bs4 import BeautifulSoup
from pprint import pprint as pp

rootURL = 'http://www.homestead.ca'
response = requests.get(rootURL)
html = response.content
soup = BeautifulSoup(html, 'lxml')

dropdown_list = soup.select(".primary .child-pages a")


cityLinks = (rootURL + dropdown_list_value['href'] for dropdown_list_value in dropdown_list)

# params for our request
params = {"show_promotions": "true",
        "show_custom_fields": "true",
        "client_id": "6",
        "auth_token": "sswpREkUtyeYjeoahA2i",
        "min_bed": "-1",
        "max_bed": "100",
        "min_bath": "0",
        "max_bath": "10",
        "min_rate": "0",
        "max_rate": "4000",
        "keyword": "false",
        "property_types": "low-rise-apartment,mid-rise-apartment,high-rise-apartment,luxury-apartment,townhouse,house,multi-unit-house,single-family-home,duplex,tripex,semi",
        "order": "max_rate ASC, min_rate ASC, min_bed ASC, max_bath ASC",
        "limit": "50",
        "offset": "0",
        "count": "false"}

for city in cityLinks:  # Looping each city from the Apartment drop down list
    with requests.Session() as s:
        r = s.get(city)
        # we need to parse the city_id for our next request to work
        soup = BeautifulSoup(r.content, 'lxml')
        city_id = soup.select_one("div.hidden.search-data")["data-city-id"]
        # update params with the city id
        params["city_id"] = city_id
        js = s.get("http://api.theliftsystem.com/v2/search", params=params).json()
        pp(js)
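
The params above ask for "limit": "50" starting at "offset": "0", so a city with more than 50 buildings would presumably need more than one request. I have not verified how the API pages its results, so treat this as a sketch under the assumption that a page shorter than limit means there is nothing left. The fetch function is passed in, and fake_fetch stands in for the real request so the example runs without network access:

```python
def iter_listings(fetch_page, limit=50):
    """Yield listings page by page until a short or empty page comes back.

    fetch_page(offset, limit) is supplied by the caller and returns a list.
    """
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        for item in page:
            yield item
        if len(page) < limit:
            break  # assumed to be the last page
        offset += limit


# Stand-in data and fetch function (hypothetical, for illustration only).
fake_data = [{"id": n} for n in range(120)]


def fake_fetch(offset, limit):
    return fake_data[offset:offset + limit]


all_items = list(iter_listings(fake_fetch, limit=50))
print(len(all_items))
```

With the real API, fetch_page would copy params, set "offset" and "limit" on the copy, and return s.get(...).json() as in the loop above.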

Now we get data like the following:

[{u'address': {u'address': u'325 North Park Street',
               u'city': u'Brantford',
               u'city_id': 332,
               u'country': u'Canada',
               u'country_code': u'CAN',
               u'intersection': u'',
               u'neighbourhood': u'',
               u'postal_code': u'N3R 2X4',
               u'province': u'Ontario',
               u'province_code': u'ON'},
  u'availability_count': 6,
  u'availability_status': 1,
  u'availability_status_label': u'Available Now',
  u'building_header': u'',
  u'client': {u'email': u'bcadieux@homestead.ca',
              u'id': 6,
              u'name': u'Homestead Land Holdings',
              u'phone': u'613-546-3146',
              u'website': u'www.homestead.ca'},
  u'contact': {u'alt_extension': u'',
               u'alt_phone': u'',
               u'email': u'rentals@homestead.ca',
               u'extension': u'',
               u'fax': u'(519) 752-6855',
               u'name': u'',
               u'phone': u'519-752-3596'},
  u'details': {u'features': u'',
               u'location': u'',
               u'overview': u"Located on North Park Street and Memorial Avenue,this quiet building is within walking distance of the following: - Zehrs Plaza, North Park Plaza, Shoppers Drug Mart, Zehrs Grocery Store, Zellers, Pet Store, Party Supply Store, furniture store, variety store, Black's Photography, paint shop and veterinary clinic\xa0  - Restaurants and coffee shops\xa0  - Wayne Gretzky Recreational Arena\xa0  - Medical Clinic,Shoppers Home Health Care Clinic and Pharmacy\xa0  - Catholic Elementary School\xa0  - On bus route ",
               u'suite': u''},
  u'geocode': {u'distance': None,
               u'latitude': u'43.1703624',
               u'longitude': u'-80.2605725'},
  u'id': 309,
  u'matched_beds': [u'0', u'1', u'2'],
  u'matched_suite_names': [u'Bachelor', u'One Bedroom', u'Two Bedroom'],
  u'min_availability_date': u'',
  u'name': u'North Park Tower',
  u'office_hours': u'',
  u'parking': {u'additional': u'', u'indoor': u'', u'outdoor': u''},
  u'permalink': u'http://www.homestead.ca/apartments/325-north-park-street-brantford',
  u'pet_friendly': True,
  u'photo': u'1443018148_2.jpg',
  u'photo_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/full/1443018148_2.jpg',
  u'promotion': {u'featured': 0},
  u'property_type': u'High-rise-apartment',
  u'statistics': {u'suites': {u'bathrooms': {u'average': 1.0,
                                             u'max': 1.0,
                                             u'min': 1.0},
                              u'bedrooms': {u'average': u'1.0',
                                            u'max': 2,
                                            u'min': 0},
                              u'rates': {u'average': 950.0,
                                         u'max': 1275.0,
                                         u'min': 625.0},
                              u'square_feet': {u'average': 0.0,
                                               u'max': u'0.0',
                                               u'min': u'0.0'}}},
  u'thumbnail_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/256/1443018148_2.jpg',
  u'website': {u'description': u'', u'title': u'', u'url': u''}},
 {u'address': {u'address': u'661 West Street',
               u'city': u'Brantford',
               u'city_id': 332,
               u'country': u'Canada',
               u'country_code': u'CAN',
               u'intersection': u'',
               u'neighbourhood': u'',
               u'postal_code': u'N3R 6W9',
               u'province': u'Ontario',
               u'province_code': u'ON'},
  u'availability_count': 6,
  u'availability_status': 1,
  u'availability_status_label': u'Available Now',
  u'building_header': u'',
  u'client': {u'email': u'bcadieux@homestead.ca',
              u'id': 6,
              u'name': u'Homestead Land Holdings',
              u'phone': u'613-546-3146',
              u'website': u'www.homestead.ca'},
  u'contact': {u'alt_extension': u'',
               u'alt_phone': u'',
               u'email': u'rentals@homestead.ca',
               u'extension': u'',
               u'fax': u'(519) 751-0379',
               u'name': u'',
               u'phone': u'519-751-3867'},
  u'details': {u'features': u'',
               u'location': u'',
               u'overview': u'Located in the North end of Brantford, Westgate Tower is in an area that resembles a city within a city. There are a variety of banks, grocery stores, drug stores, malls, a wide selection of fast food, fine dining restaurants and an after hours medical centre, within waking distance.',
               u'suite': u''},
  u'geocode': {u'distance': None,
               u'latitude': u'43.1733242',
               u'longitude': u'-80.2482991'},
  u'id': 310,
  u'matched_beds': [u'0', u'1', u'2'],
  u'matched_suite_names': [u'Bachelor', u'One Bedroom', u'Two Bedroom'],
  u'min_availability_date': u'',
  u'name': u'Westgate Apartments',
  u'office_hours': u'',
  u'parking': {u'additional': u'', u'indoor': u'', u'outdoor': u''},
  u'permalink': u'http://www.homestead.ca/apartments/661-west-street-brantford',
  u'pet_friendly': True,
  u'photo': u'1443017488_1.jpg',
  u'photo_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/full/1443017488_1.jpg',
  u'promotion': {u'featured': 0},
  u'property_type': u'High-rise-apartment',
  u'statistics': {u'suites': {u'bathrooms': {u'average': 1.0,
                                             u'max': 1.0,
                                             u'min': 1.0},
                              u'bedrooms': {u'average': u'1.0',
                                            u'max': 2,
                                            u'min': 0},
                              u'rates': {u'average': 975.0,
                                         u'max': 1300.0,
                                         u'min': 650.0},
                              u'square_feet': {u'average': 0.0,
                                               u'max': u'0.0',
                                               u'min': u'0.0'}}},
  u'thumbnail_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/256/1443017488_1.jpg',
  u'website': {u'description': u'', u'title': u'', u'url': u''}},
 {u'address': {u'address': u'321 Fairview Drive',
               u'city': u'Brantford',
               u'city_id': 332,
               u'country': u'Canada',
               u'country_code': u'CAN',
               u'intersection': u'',
               u'neighbourhood': u'',
               u'postal_code': u'N3R 2X6',
               u'province': u'Ontario',
               u'province_code': u'ON'},
  u'availability_count': 8,
  u'availability_status': 1,
  u'availability_status_label': u'Available Now',
  u'building_header': u'',
  u'client': {u'email': u'bcadieux@homestead.ca',
              u'id': 6,
              u'name': u'Homestead Land Holdings',
              u'phone': u'613-546-3146',
              u'website': u'www.homestead.ca'},
  u'contact': {u'alt_extension': u'',
               u'alt_phone': u'',
               u'email': u'rentals@homestead.ca',
               u'extension': u'',
               u'fax': u'(519) 752-6855',
               u'name': u'',
               u'phone': u'519-752-3596'},
  u'details': {u'features': u'',
               u'location': u'',
               u'overview': u'Dornia Manor is a quiet, ninety-two unit apartment building located in the North end of Brantford. We offer one, two and three bedroom units and one penthouse suite. The building is located in close proximity to many major services such as banking, shopping, health services, recreational facilities, beauty shops, dry cleaners, schools and churches. There is a bus stop at the front door and highway 403 is within minutes.',
               u'suite': u''},
  u'geocode': {u'distance': None,
               u'latitude': u'43.1706331',
               u'longitude': u'-80.2584034'},
  u'id': 308,
  u'matched_beds': [u'1', u'2', u'3'],
  u'matched_suite_names': [u'One Bedroom', u'Two Bedroom', u'Three Bedroom'],
  u'min_availability_date': u'',
  u'name': u'Dornia Manor',
  u'office_hours': u'',
  u'parking': {u'additional': u'', u'indoor': u'', u'outdoor': u''},
  u'permalink': u'http://www.homestead.ca/apartments/321-fairview-drive-brantford',
  u'pet_friendly': True,
  u'photo': u'1443017947_1.jpg',
  u'photo_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/full/1443017947_1.jpg',
  u'promotion': {u'featured': 0},
  u'property_type': u'High-rise-apartment',
  u'statistics': {u'suites': {u'bathrooms': {u'average': 1.375,
                                             u'max': 2.0,
                                             u'min': 1.0},
                              u'bedrooms': {u'average': u'2.25',
                                            u'max': 3,
                                            u'min': 1},
                              u'rates': {u'average': 1124.5,
                                         u'max': 1350.0,
                                         u'min': 899.0},
                              u'square_feet': {u'average': 0.0,
                                               u'max': u'0.0',
                                               u'min': u'0.0'}}},
  u'thumbnail_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/256/1443017947_1.jpg',
  u'website': {u'description': u'', u'title': u'', u'url': u''}}]

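That gives you the URL, the bedrooms and pretty much everything else you could want. Each dict in the list is one listing; you just need to access it by key to pull out the data you want, e.g.: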

for dct in js:
    add = dct["address"]
    print(add["city"])
    print(add["postal_code"])
    print(add["province"])
    print(dct["permalink"])

Which would give you:

Brantford
N3R 2X4
Ontario
http://www.homestead.ca/apartments/325-north-park-street-brantford
Brantford
N3R 6W9
Ontario
http://www.homestead.ca/apartments/661-west-street-brantford
Brantford
N3R 2X6
Ontario
http://www.homestead.ca/apartments/321-fairview-drive-brantford
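
Since the original goal was the apartment name and address per city, the same key access can be collected into flat rows and written out as CSV. This is only a sketch (Python 3): listing_row is a helper name I made up, and js here is a trimmed copy of two entries from the output above rather than a live response:

```python
import csv
import io

FIELDS = ["name", "address", "city", "postal_code", "permalink", "min_rate"]


def listing_row(dct):
    """Flatten one listing dict into a single-level row."""
    add = dct["address"]
    rates = dct["statistics"]["suites"]["rates"]
    return {
        "name": dct["name"],
        "address": add["address"],
        "city": add["city"],
        "postal_code": add["postal_code"],
        "permalink": dct["permalink"],
        "min_rate": rates["min"],
    }


# Trimmed sample, copied from the JSON shown earlier.
js = [
    {"name": "North Park Tower",
     "address": {"address": "325 North Park Street", "city": "Brantford",
                 "postal_code": "N3R 2X4"},
     "permalink": "http://www.homestead.ca/apartments/325-north-park-street-brantford",
     "statistics": {"suites": {"rates": {"average": 950.0, "max": 1275.0, "min": 625.0}}}},
    {"name": "Westgate Apartments",
     "address": {"address": "661 West Street", "city": "Brantford",
                 "postal_code": "N3R 6W9"},
     "permalink": "http://www.homestead.ca/apartments/661-west-street-brantford",
     "statistics": {"suites": {"rates": {"average": 975.0, "max": 1300.0, "min": 650.0}}}},
]

# Write to an in-memory buffer; swap in open("listings.csv", "w", newline="") for a file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
for dct in js:
    writer.writerow(listing_row(dct))

csv_text = buf.getvalue()
print(csv_text)
```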

The contact info is under dct["contact"] and the statistics are under dct["statistics"]:

for dct in js:
        contact = dct["contact"]
        print(contact)
        stats = dct["statistics"]
        print(stats["suites"])

That would give you:

{u'alt_phone': u'', u'fax': u'(519) 752-6855', u'name': u'', u'alt_extension': u'', u'phone': u'519-752-3596', u'extension': u'', u'email': u'rentals@homestead.ca'}
{u'rates': {u'max': 1275.0, u'average': 950.0, u'min': 625.0}, u'bedrooms': {u'max': 2, u'average': u'1.0', u'min': 0}, u'bathrooms': {u'max': 1.0, u'average': 1.0, u'min': 1.0}, u'square_feet': {u'max': u'0.0', u'average': 0.0, u'min': u'0.0'}}
{u'alt_phone': u'', u'fax': u'(519) 751-0379', u'name': u'', u'alt_extension': u'', u'phone': u'519-751-3867', u'extension': u'', u'email': u'rentals@homestead.ca'}
{u'rates': {u'max': 1300.0, u'average': 975.0, u'min': 650.0}, u'bedrooms': {u'max': 2, u'average': u'1.0', u'min': 0}, u'bathrooms': {u'max': 1.0, u'average': 1.0, u'min': 1.0}, u'square_feet': {u'max': u'0.0', u'average': 0.0, u'min': u'0.0'}}
{u'alt_phone': u'', u'fax': u'(519) 752-6855', u'name': u'', u'alt_extension': u'', u'phone': u'519-752-3596', u'extension': u'', u'email': u'rentals@homestead.ca'}
{u'rates': {u'max': 1350.0, u'average': 1124.5, u'min': 899.0}, u'bedrooms': {u'max': 3, u'average': u'2.25', u'min': 1}, u'bathrooms': {u'max': 2.0, u'average': 1.375, u'min': 1.0}, u'square_feet': {u'max': u'0.0', u'average': 0.0, u'min': u'0.0'}}
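
Note that the numbers in the statistics are not typed consistently: in the output above, the bedrooms average is the string '2.25' while max is the integer 3, and the square_feet values mix 0.0 with '0.0'. If you plan to compare or sum these, a small coercion helper may save some trouble. to_float is a name I made up for this sketch, and the sample dict is copied from the third listing:

```python
def to_float(value, default=None):
    """Convert a number or numeric string to float; return default if that fails."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


# Values as they appear in the statistics for one listing above.
bedrooms = {"average": "2.25", "max": 3, "min": 1}

clean = {key: to_float(value) for key, value in bedrooms.items()}
print(clean)
```

Empty strings, which show up in many of the other fields, fall back to the default instead of raising.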

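You can put all of that together to get everything you need. You can also tweak the params; there are actually more of them if you inspect the request with Chrome's developer tools or Firebug.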
