Advanced Scrapy for Python Crawlers (Passing Request Data, Images, Middleware)

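1 Passing data between Scrapy requests

1.1 How it works

Use case: the data you want to scrape is spread across more than one page, so the crawl has to go deeper, for example from a list page to a detail page for each entry.
In the spider file, attach a meta dictionary to the request that names the callback.
Usage: when you send a request manually, you supply your own callback. If that callback needs extra data, send it through meta. Pass it as a dictionary, meta={...}, when creating the request, and read the value back inside the callback with response.meta['xxx'].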

1.2 Example

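import scrapy
# Assumption: the item class lives in bossPro/items.py; adjust to your project name
from bossPro.items import BossproItem

class BossSpider(scrapy.Spider):
    name = 'boss'
    start_urls = ['https://www.test.com/chaxun/']

    def parse_detail(self, response):
        # Retrieve the item that was passed in through meta
        item = response.meta['item']
        # Extract the text nodes as strings so they can be joined
        job_desc = response.xpath('//div[2]/div/div[1]/ul/li[1]/a//text()').extract()
        job_desc = ''.join(job_desc)
        item['job_desc'] = job_desc
        # Hand the completed item to the pipelines
        yield item

    def parse(self, response):
        div_list = response.xpath('//div[@class="shici_list_main"]')
        for div in div_list:
            item = BossproItem()
            job_name = div.xpath('.//div/h3/a/text()').extract()
            item['job_name'] = job_name
            detail_url = 'https://www.test.com' + div.xpath('.//div/h3/h4/text()').extract_first()
            # Send the request manually.
            # meta={} hands the dictionary to the callback of this request.
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})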

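2 Scraping images with Scrapy

2.1 Understanding ImagesPipeline

Scraping image data with ImagesPipeline
How scraping string data differs from scraping image data in Scrapy:

  • Strings: parse them with xpath and submit them to a pipeline for persistent storage
  • Images: xpath only yields the src attribute of the image, so a separate request to that address is needed to obtain the binary image data
  • ImagesPipeline: parse the src attribute of the img tag and submit it to the pipeline; the pipeline requests each address, retrieves the binary image data, and stores it for you

2.2 Using ImagesPipeline

Note: some sites rename the src attribute of an image to src2 and only restore the normal attribute once the image scrolls into the visible window. This lazy loading avoids fetching too many images at once and reduces the load on the server.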
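Steps:

  • Parse the data
  • Submit the item that holds the image address to the designated pipeline class
  • In the pipeline file, define a custom pipeline class based on ImagesPipeline and override three methods: get_media_requests, file_path, and item_completed
  • In settings.py, register the custom pipeline class and set the image storage directory: IMAGES_STORE='./imgs'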

2.2.1 The image spider file

The spider file contains the main parsing logic.
Spider file:

import scrapy
from imgsPro.items import ImgsproItem

class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    # Stand-in address for the stock-image site
    start_urls = ['http://sc.test.com/tupian/']

    # Parse the data
    def parse(self, response):
        div_list = response.xpath('//div[@id="container"]/div')
        for div in div_list:
            # The real image address is kept in src2 until the image scrolls into view
            src = 'https:' + div.xpath('./div/a/img/@src2').extract_first()
            print(src)
            item = ImgsproItem()
            item['src'] = src
            # yield instead of return, so every image is submitted and not only the first
            yield item

Item file:

import scrapy

class ImgsproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    src=scrapy.Field()

2.2.2 A pipeline class based on ImagesPipeline

Write a pipeline class that inherits from ImagesPipeline and override three methods: get_media_requests, file_path, and item_completed.

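from scrapy.pipelines.images import ImagesPipeline
import scrapy

class ImgsPipeline(ImagesPipeline):

    # Request the image data from the image address
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # Set the file name; the directory itself comes from IMAGES_STORE in settings.py.
    # The keyword-only item argument is accepted because newer Scrapy versions pass it.
    def file_path(self, request, response=None, info=None, *, item=None):
        imgName = request.url.split('/')[-1]
        return imgName

    # Return the item so that the next pipeline class can process it
    def item_completed(self, results, item, info):
        return item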
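Taking the last segment of the URL works for plain addresses but keeps any query string in the file name. The sketch below shows one possible alternative using only the standard library; the helper name image_name_from_url and the fallback name are hypothetical, and the function could be called from file_path in place of the split.

```python
import posixpath
from urllib.parse import unquote, urlparse


def image_name_from_url(url, default='unnamed.jpg'):
    """Derive a file name from the path part of an image URL."""
    # urlparse separates the path from the query string and fragment
    path = unquote(urlparse(url).path)
    name = posixpath.basename(path)
    # Fall back to a fixed name when the URL ends in a slash
    return name or default


plain = image_name_from_url('https://pic.test.com/imgs/cat.jpg')
with_query = image_name_from_url('https://pic.test.com/imgs/cat.jpg?v=3')
no_name = image_name_from_url('https://pic.test.com/')
print(plain, with_query, no_name)
```

A fixed fallback name means several such URLs would overwrite one another, so a real project would probably want something unique there, such as a hash of the URL.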

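2.2.3 settings.py

Register the newly created pipeline class and set the image storage directory.

# Register the pipeline class
ITEM_PIPELINES = {
   'imgsPro.pipelines.ImgsPipeline': 300,
}

ROBOTSTXT_OBEY = False
# Log level
LOG_LEVEL='ERROR'
# Image directory
IMAGES_STORE='./imgs_sucai/'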

Note: images may fail to download. If the code raises no error but no image is saved, try installing the pillow package: pip install pillow
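ImagesPipeline relies on Pillow to process images, and with LOG_LEVEL='ERROR' a lower-level message about a disabled pipeline would probably not be shown, which may be why the failure looks silent. A quick check such as the following, run in the same environment as the crawler, can rule this out; the function name pillow_available is hypothetical.

```python
import importlib.util


def pillow_available():
    """Return True if the PIL package provided by Pillow can be imported."""
    return importlib.util.find_spec('PIL') is not None


if pillow_available():
    print('Pillow is installed')
else:
    print('Pillow is missing; run: pip install pillow')
```

Temporarily lowering LOG_LEVEL to 'WARNING' while debugging is another way to see messages that the 'ERROR' level hides.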

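3 Middleware

3.1 A brief introduction to middleware

Middleware lives in middlewares.py inside the Scrapy project.
Spider middleware (MiddleproSpiderMiddleware): sits between the engine and the spider.

Downloader middleware (MiddleproDownloaderMiddleware):

  • Position: between the engine and the downloader
  • Purpose: intercept every request and response in the project
  • Intercepting requests: User-Agent spoofing; proxy IPs
  • Intercepting responses: alter the response data or replace the response object (for dynamically loaded content, selenium can fetch the rendered page and return it as the response)
  • Three methods deserve the most attention: process_request intercepts requests, process_response intercepts responses, and process_exception intercepts exceptions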


3.2 Handling requests in middleware

User-Agent spoofing at request time
Downloader middleware file:
middlewares.py

import random

class MiddleproDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    agents_list = [
        "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
        "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
        "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
        "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
        "Mozilla/2.02E (Win95; U)",
        "Mozilla/3.01Gold (Win95; I)",
        "Mozilla/4.8 [en] (Windows NT 5.1; U)",
        "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
        "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
        "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
        "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
        "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
        "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
        "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3",
        "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
        "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
        "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
        "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    ]

   

    def process_request(self, request, spider):
        request.headers['User-Agent']=random.choice(self.agents_list)
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Enable the downloader middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
   'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
}

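3.3 Handling responses in middleware

This example simulates scraping news from NetEase News (news.163.com).

3.3.1 Spider file

Dynamically loaded response data has to be fetched with selenium, so the spider file creates a browser driver object.
middleSpider.py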

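import scrapy
from selenium import webdriver

class MiddlespiderSpider(scrapy.Spider):
    name = 'MiddleSpider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://news.163.com/']
    models_urls = []  # stores the URLs of the section pages

    # Create one browser object for the whole crawl
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # executable_path is deprecated in Selenium 4, which expects a Service object instead
        self.bro = webdriver.Chrome(executable_path='path/to/chromedriver')

    def parse(self, response):
        # Collect the section URLs into self.models_urls here
        pass

    # Close the browser object when the spider finishes
    def closed(self, reason):
        self.bro.quit()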

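3.3.2 Downloader middleware file

MiddleproDownloaderMiddleware is the downloader middleware class.
Altering a response means changing the process_response method of the downloader middleware, which here uses selenium to fetch the rendered page.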

import time
from scrapy.http import HtmlResponse

class MiddleproDownloaderMiddleware:
    def process_request(self, request, spider):
        return None

    # The spider argument is the running spider object, here MiddleSpider
    def process_response(self, request, response, spider):
        # Pick out the responses that need to be replaced:
        # the URL identifies the request, and the request identifies its response.
        bro = spider.bro
        if request.url in spider.models_urls:
            bro.get(request.url)
            time.sleep(2)  # give the page time to load its dynamic content
            page_text = bro.page_source
            # Responses for the five section pages are replaced with the rendered page
            new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
            return new_response
        else:
            return response

    def process_exception(self, request, exception, spider):
        pass

3.3.3 settings.py

After modifying the downloader middleware, enable it in settings.py:

DOWNLOADER_MIDDLEWARES = {
   'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
}