Web Scraping Project 1 - Douban Movie Top 250

Douban Movie Top 250

Steps

  1. Define the fetch function
import requests
import re
import csv

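def parse_html(url, headers, params):
    """Fetch a page and return its HTML as text, or None if the request fails."""
    try:
        # A timeout keeps the scraper from hanging on a stalled connection.
        res = requests.get(url=url, headers=headers, params=params, timeout=10)
        # Treat HTTP error statuses (4xx/5xx) as failures as well.
        res.raise_for_status()
        return res.content.decode('utf-8')
    except requests.RequestException:
        return None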
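
`parse_html` forwards `params` to `requests.get`, which encodes the dictionary into the URL's query string. As a minimal offline sketch of that idea, the standard library's `urlencode` can build the paged URLs without hard-coding the query. `BASE_URL` and `page_url` are hypothetical names introduced here for illustration, not part of the original script.

```python
from urllib.parse import urlencode

# Hypothetical names for illustration only.
BASE_URL = 'https://movie.douban.com/top250'


def page_url(page, page_size=25):
    """Return the URL of a zero-based results page."""
    query = urlencode({'start': page * page_size, 'filter': ''})
    return '{}?{}'.format(BASE_URL, query)


if __name__ == '__main__':
    for page in range(3):
        print(page_url(page))
```

With requests, the same query could presumably be passed as `parse_html(BASE_URL, headers, {'start': 25, 'filter': ''})` instead of formatting it into the URL string.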
  2. Use a regular expression to extract the required fields
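def get_data(html):
    # Nothing to parse if the request failed.
    if html is None:
        return

    # Raw strings so that \d is not treated as a string escape. The one-line
    # quote (class "inq") is optional because an entry may not have one;
    # requiring it would let the match run on into the next <li>.
    pattern = re.compile(
        r'<li>.*?<em class="">(\d+)</em>'
        r'.*?<span class="title">(.*?)</span>'
        r'.*?<span class="rating_num" property="v:average">(.*?)</span>'
        r'.*?(?:<span class="inq">(.*?)</span>.*?)?</li>',
        re.S)
    for rank, title, score, quote in pattern.findall(html):
        # Build a fresh dict per movie; 'comments' holds the one-line quote.
        movie = {
            'rank': rank,
            'title': title,
            'score': score,
            'comments': quote
        }
        save(movie)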
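
To see how a pattern of this shape behaves, here is a small sketch run against a hypothetical HTML fragment (the real page has far more markup between these tags). In this sketch the quote group is optional, so an entry without an `inq` span yields an empty string instead of swallowing the next item. `SAMPLE`, `PATTERN` and `extract` are illustrative names only.

```python
import re

# Hypothetical fragment modelled on the fields the pattern expects.
SAMPLE = '''
<li>
  <em class="">1</em>
  <span class="title">Movie A</span>
  <span class="rating_num" property="v:average">9.7</span>
  <span class="inq">A short quote.</span>
</li>
<li>
  <em class="">2</em>
  <span class="title">Movie B</span>
  <span class="rating_num" property="v:average">9.6</span>
</li>
'''

# re.S lets '.' match newlines, so one pattern can span a whole <li> item.
PATTERN = re.compile(
    r'<li>.*?<em class="">(\d+)</em>'
    r'.*?<span class="title">(.*?)</span>'
    r'.*?<span class="rating_num" property="v:average">(.*?)</span>'
    r'.*?(?:<span class="inq">(.*?)</span>.*?)?</li>',
    re.S)


def extract(html):
    """Return a list of (rank, title, score, quote) tuples."""
    return PATTERN.findall(html)


if __name__ == '__main__':
    for row in extract(SAMPLE):
        print(row)
```

`re.findall` returns an empty string for a group that did not take part in the match, which is why the second tuple ends with `''`.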
  3. Save the results to a local CSV file.
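def save(item):
    # Append mode: each call adds one row. newline='' stops the csv module
    # from writing blank lines between rows on Windows.
    with open('./Doubantop250.csv', 'a', encoding='utf-8', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(item.values())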

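def run():
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"}
    # 10 pages of 25 movies each: start = 0, 25, ..., 225.
    for i in range(10):
        url = 'https://movie.douban.com/top250?start={}&filter='.format(i*25)
        html = parse_html(url, headers, {})
        if html is None:
            # Skip pages that could not be fetched instead of crashing.
            print('Failed to fetch', url)
            continue
        get_data(html)
    print('Done!')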

if __name__ == '__main__':
    run()
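
Because `save` appends bare rows, the resulting file has no header row. One possible variation, sketched below, uses `csv.DictWriter` and writes the header only when the file is new or empty. `save_with_header` and `FIELDS` are hypothetical names, not part of the original script.

```python
import csv
import os

# Column order for the output file (hypothetical constant).
FIELDS = ['rank', 'title', 'score', 'comments']


def save_with_header(item, path):
    """Append one movie dict to path, writing the header row first if needed."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding='utf-8', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=FIELDS)
        if need_header:
            writer.writeheader()
        writer.writerow(item)


if __name__ == '__main__':
    save_with_header(
        {'rank': '1', 'title': 'Movie A', 'score': '9.7', 'comments': ''},
        './example.csv')
```

Since the file is still opened in append mode, running the scraper twice adds the same movies again; deleting the file before a new run avoids that.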
  4. Done. This is an entry-level scraper and very simple.