Python Crawler Example: Scraping the Douban Top 250

2022-11-26 15:52:16

This is usually the first crawler a beginner writes, because it really is that simple. It uses the requests and bs4 libraries.

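1. Inspect the page elements, then extract and save the information you need. bs4 is enough for this, and earlier posts have already covered its usage in detail.

2. Find the URL of the next page. There are two ways to do this here. The first is to follow the URL pattern: comparing the page URLs shows that only the value of the start parameter changes. The second is to read the next page's URL from the "next page" link tag. The code uses the first method.

3. Decide on an exit condition, since a crawler cannot loop forever.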
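Steps 2 and 3 can be sketched on their own, without any network access. This is a minimal illustration that assumes the Top 250 spans 10 pages of 25 entries each; the names `page_urls`, `total`, and `page_size` are hypothetical and chosen only for this sketch. Unlike the recursive code in this post, it generates the URLs in a loop, so the upper bound of `range` serves as the exit condition.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"


def page_urls(total=250, page_size=25):
    """Yield one URL per page; the range bound acts as the exit condition."""
    for start in range(0, total, page_size):
        # Only the start parameter changes from page to page
        yield f"{BASE_URL}?{urlencode({'start': start, 'filter': ''})}"


urls = list(page_urls())
print(len(urls))   # number of pages to fetch
print(urls[0])     # first page
print(urls[-1])    # last page
```

Each URL can then be fetched in turn, with a pause between requests.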

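In this simplest of examples, those three steps make up the whole crawler. It is simple enough that no further explanation is needed, so go straight to the code.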

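"""
Scrape the Douban Movie Top 250.
"""

import csv
import os
import re
import time

import requests
from bs4 import BeautifulSoup

OUTPUT_FILE = 'movie_top250.csv'
# Assumption: the site may reject requests that lack a browser-like User-Agent
HEADERS = {'User-Agent': 'Mozilla/5.0'}


def download(url, page):
    print(f"Fetching: {url}")
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    # .text is the decoded HTML; the bare response object prints as <Response [200]>
    soup = BeautifulSoup(response.text, 'html.parser')
    for li in soup.select("ol li"):
        index = li.find('em').text
        title = li.find('span', class_='title').text
        rating = li.find('span', class_='rating_num').text
        # The text after <br/> in the info paragraph reads "year / region / genres"
        str_info = re.search("(?<=<br/>).*?(?=<)", str(li.select_one(".bd p")), re.S | re.M).group().strip()
        infos = str_info.split('/')
        year = infos[0].strip()
        area = infos[1].strip()
        genre = infos[2].strip()   # renamed from "type", which shadows the builtin
        write_to_file(index, title, rating, year, area, genre)
    # Each page lists 25 movies; stop once all 250 have been covered
    page += 25
    if page < 250:
        time.sleep(2)
        download(f"https://movie.douban.com/top250?start={page}&filter=", page)


def write_to_file(index, title, rating, year, area, genre):
    # Append one row; the with block closes the file, and csv.writer quotes fields containing commas
    with open(OUTPUT_FILE, 'a', newline='', encoding='utf-8') as f:
        csv.writer(f).writerow([index, title, rating, year, area, genre])


def main():
    # Rows are appended, so start from a clean file
    if os.path.exists(OUTPUT_FILE):
        os.remove(OUTPUT_FILE)
    url = 'https://movie.douban.com/top250'
    download(url, 0)
    print("Scraping finished.")


if __name__ == '__main__':
    main()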
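The code above reads the year, region, and genre from infos[0], infos[1], and infos[2], which assumes the info line has exactly three slash-separated fields. If an entry happened to list more than one year, those indices would shift. The following is a hedged sketch of a more tolerant split, not the method used in this post: `parse_info` is a hypothetical helper, and both sample strings are made-up inputs rather than scraped data.

```python
def parse_info(text):
    """Split a 'year / region / genres' line into three fields.

    The last two fields are taken as region and genres; anything before
    them is kept together as the year field.
    """
    # Normalize non-breaking spaces before splitting on the slashes
    parts = [p.strip() for p in text.replace("\xa0", " ").split("/")]
    if len(parts) < 3:
        raise ValueError(f"unexpected info line: {text!r}")
    year = " / ".join(parts[:-2])
    return year, parts[-2], parts[-1]


# Hypothetical sample lines, one plain and one with two years
print(parse_info("1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情"))
print(parse_info("1961 / 1964 / 中国大陆 / 剧情 动画"))
```

Counting fields from the end keeps the region and genre columns aligned even when the front of the line varies.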
