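1 Passing data between Scrapy requests
1.1 How request parameters work
Use case: the data to be scraped and parsed is not all on one page, so a deep crawl is needed.
In the spider file, attach a meta dictionary to the request whose callback needs the data.
Usage: when you send a request manually you pass in a newly defined callback function. If that callback needs extra data, send it through meta: pass a dictionary with meta={} when creating the request, and read the values back in the callback with response.meta['xxx'].
1.2 Worked example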
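import scrapy
# BossproItem is assumed to be defined in this project's items.py;
# import it from there (the project package name is not given in this article).


class BossSpider(scrapy.Spider):
    name = 'boss'
    start_urls = ['https://www.test.com/chaxun/']

    def parse_detail(self, response):
        # Retrieve the item that was passed in through meta
        item = response.meta['item']
        # extract() returns a list of strings, which join() can concatenate
        job_desc = response.xpath('//div[2]/div/div[1]/ul/li[1]/a//text()').extract()
        job_desc = ''.join(job_desc)
        item['job_desc'] = job_desc
        # Hand the completed item over to the pipelines
        yield item

    def parse(self, response):
        div_list = response.xpath('//div[@class="shici_list_main"]')
        for div in div_list:
            item = BossproItem()
            job_name = div.xpath('.//div/h3/a/text()').extract()
            item['job_name'] = job_name
            detail_url = 'https://www.test.com' + div.xpath('.//div/h3/h4/text()').extract_first()
            # Send the request manually.
            # Request parameter passing: the meta dict is delivered to the request's callback
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})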
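To make the flow of meta easier to follow, here is a small standard-library model of it. It is not Scrapy's implementation: ToyRequest, ToyResponse, PAGES and crawl are hypothetical stand-ins invented for this sketch, and the job data is made up. It only shows that the dictionary attached to a request reappears on the response handed to that request's callback.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyRequest:
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class ToyResponse:
    url: str
    text: str
    meta: dict


# Hypothetical "site": maps a detail URL to its page text.
PAGES = {"https://www.test.com/detail/1": "Builds crawlers"}


def parse(response):
    # First level: create the item and attach it to the follow-up request.
    item = {"job_name": "Python engineer"}
    yield ToyRequest("https://www.test.com/detail/1",
                     callback=parse_detail, meta={"item": item})


def parse_detail(response):
    # Second level: read the same item back and complete it.
    item = response.meta["item"]
    item["job_desc"] = response.text
    yield item


def crawl(start_url):
    # Toy engine: requests are "downloaded" and routed to their callback,
    # anything else is treated as a finished item.
    queue = list(parse(ToyResponse(start_url, "", {})))
    items = []
    while queue:
        obj = queue.pop(0)
        if isinstance(obj, ToyRequest):
            response = ToyResponse(obj.url, PAGES[obj.url], obj.meta)
            queue.extend(obj.callback(response))
        else:
            items.append(obj)
    return items


items = crawl("https://www.test.com/chaxun/")
```

Because meta holds a reference to the item rather than a copy, the fields set in parse and in parse_detail end up on the same object.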
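2 Scraping images with Scrapy
2.1 Understanding ImagesPipeline
Scraping image data with ImagesPipeline, based on Scrapy
How scraping string data differs from scraping image data:
- Strings: only need to be parsed with xpath and submitted to a pipeline for persistent storage
- Images: xpath parses out the image's src attribute value, then a separate request is sent to that image URL to fetch the binary image data
- ImagesPipeline: you only parse the src attribute value of the img tag and submit it to the pipeline; the pipeline requests that URL to fetch the binary image data and also takes care of persistent storage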
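2.2 Using ImagesPipeline
Note: some sites rename the src attribute of img tags to a placeholder such as src2, and only restore the normal attribute once the image scrolls into the current viewport. This lazy loading avoids loading too many images at once and reduces server load.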
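As a rough sketch of how a parser might cope with such lazy-loaded attributes, the helper below prefers a lazy-load attribute and falls back to src. The function name pick_image_url and the extra attribute names data-src and data-original are assumptions for illustration; only src2 comes from this article.

```python
from urllib.parse import urljoin


def pick_image_url(attrs, page_url, lazy_attrs=("src2", "data-src", "data-original")):
    """Return an absolute image URL, preferring lazy-load attributes over src.

    attrs is a plain dict of the img tag's attributes; returns None if no
    usable attribute is present.
    """
    for name in (*lazy_attrs, "src"):
        value = attrs.get(name)
        if value:
            # urljoin resolves relative and protocol-relative ('//host/...') URLs
            return urljoin(page_url, value.strip())
    return None


page = "http://sc.test.com/tupian/"
lazy = pick_image_url({"src2": "//img.test.com/a.jpg", "src": "/placeholder.gif"}, page)
plain = pick_image_url({"src": "/static/b.png"}, page)
```

Note that urljoin takes the scheme from the page URL, so a protocol-relative address found on an http page becomes an http URL rather than being forced to https.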
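Steps:
- Parse the data
- Submit the item that stores the image URL to the designated pipeline class
- In the pipelines file, define a custom pipeline class based on ImagesPipeline and override three methods: get_media_requests, file_path, item_completed
- In the settings.py configuration file, register the custom pipeline class and set the image storage directory: IMAGES_STORE='./imgs'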
2.2.1 Image spider file
The image spider file contains the main parsing logic
Spider file
import scrapy
from imgsPro.items import ImgsproItem


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    # Placeholder URL standing in for the 站长素材 image site
    start_urls = ['http://sc.test.com/tupian/']

    # Parse the data
    def parse(self, response):
        div_list = response.xpath('//div[@id="container"]/div')
        for div in div_list:
            src = 'https:' + div.xpath('./div/a/img/@src2').extract_first()
            print(src)
            item = ImgsproItem()
            item['src'] = src
            # yield rather than return, so every image in the loop is submitted
            yield item
Item file
import scrapy


class ImgsproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    src = scrapy.Field()
2.2.2 Pipeline class based on ImagesPipeline
Write a pipeline class that inherits from ImagesPipeline and override three methods: get_media_requests, file_path, item_completed
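from scrapy.pipelines.images import ImagesPipeline
import scrapy


class ImgsPipeline(ImagesPipeline):
    # Request the image data from the image URL
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # Set the stored file name; the directory itself is configured in settings.py.
    # The keyword-only item argument is part of the signature in Scrapy 2.4 and later.
    def file_path(self, request, response=None, info=None, *, item=None):
        imgName = request.url.split('/')[-1]
        return imgName

    # Return the item so it is passed on to the next pipeline class to be executed
    def item_completed(self, results, item, info):
        return item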
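One thing to watch with request.url.split('/')[-1] is that any query string or fragment ends up in the file name. The sketch below shows one possible way to derive a cleaner name with the standard library; image_name_from_url and its default value are hypothetical names for this example, not part of Scrapy.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def image_name_from_url(url, default="unnamed.jpg"):
    """Return the last path segment of url, without query string or fragment."""
    path = urlsplit(url).path          # drops '?v=3' and '#top'
    name = posixpath.basename(unquote(path))
    return name or default             # URLs ending in '/' have no file name


clean = image_name_from_url("https://img.test.com/pics/cat.jpg?v=3#top")
naive = "https://img.test.com/pics/cat.jpg?v=3#top".split("/")[-1]
```

Inside file_path, such a helper could be called as image_name_from_url(request.url) in place of the split expression.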
2.2.3 settings.py
Register the newly created pipeline class and set the image storage directory
# Pipeline class
ITEM_PIPELINES = {
    'imgsPro.pipelines.ImgsPipeline': 300,
}
ROBOTSTXT_OBEY = False
# Log level
LOG_LEVEL = 'ERROR'
# Image directory
IMAGES_STORE = './imgs_sucai/'
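Note: images may fail to download even though the code reports no error. ImagesPipeline depends on the Pillow package, so if nothing is being saved, try installing it: pip install pillow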
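A quick way to check for this situation is to test whether Pillow can be imported before starting the crawl. With LOG_LEVEL set to 'ERROR', lower-level messages about a disabled pipeline would likely be hidden, which may be why the failure looks silent. The helper name module_available is made up for this sketch.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


# Pillow is imported under the module name PIL.
if not module_available("PIL"):
    print("Pillow is missing: run `pip install pillow` before using ImagesPipeline")
```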
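3 Middleware
3.1 A brief introduction to middleware
Middleware lives in middlewares.py in the Scrapy project
Spider middleware (MiddleproSpiderMiddleware): sits between the engine and the spider file
Downloader middleware (MiddleproDownloaderMiddleware):
- Position: sits between the engine and the downloader
- Role: intercepts every request and response in the whole project
- Intercepting requests: UA spoofing; proxy IPs
- Intercepting responses: tamper with the response data or replace the response object (for dynamically loaded content, selenium can fetch the rendered page and return it)
- Three methods deserve the most attention: process_request intercepts requests, process_response intercepts responses, process_exception intercepts exceptions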
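The order in which downloader middlewares see traffic can be modelled in a few lines. Scrapy calls process_request in increasing priority order and process_response in decreasing order; the toy code below only imitates that ordering with plain Python. RecordingMiddleware, trace_chain and the middleware names are hypothetical and do not use Scrapy itself.

```python
class RecordingMiddleware:
    """Stand-in middleware that records when its hooks are called."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def process_request(self, request):
        self.log.append(f"{self.name}.process_request")

    def process_response(self, request, response):
        self.log.append(f"{self.name}.process_response")
        return response


def trace_chain(priorities):
    """priorities maps a middleware name to its number, as in DOWNLOADER_MIDDLEWARES."""
    log = []
    chain = [RecordingMiddleware(name, log)
             for name, _ in sorted(priorities.items(), key=lambda pair: pair[1])]
    request = {"url": "https://www.test.com/"}
    for middleware in chain:                 # engine -> downloader
        middleware.process_request(request)
    log.append("downloader")
    response = {"url": request["url"], "body": b""}
    for middleware in reversed(chain):       # downloader -> engine
        response = middleware.process_response(request, response)
    return log


order = trace_chain({"Custom543": 543, "Early100": 100})
```

A middleware registered with a lower number therefore sits closer to the engine, and one with a higher number sits closer to the downloader.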
3.2 Handling requests in middleware
UA spoofing at request time
Downloader middleware file:
middlewares.py
import random


class MiddleproDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    agents_list = [
"Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
"Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
"Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
"Mozilla/2.02E (Win95; U)",
"Mozilla/3.01Gold (Win95; I)",
"Mozilla/4.8 [en] (Windows NT 5.1; U)",
"Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
"HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3",
"Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.agents_list)
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Enable the downloader middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
}
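Section 3.1 also lists proxy IPs as something to set while intercepting requests. A minimal sketch of a middleware that sets both a random User-Agent and a proxy could look like this. The class name RandomIdentityMiddleware and the proxy addresses are placeholders invented for this example; request.meta['proxy'] is the key read by Scrapy's built-in HttpProxyMiddleware. A SimpleNamespace stands in for a real request so the sketch runs without Scrapy.

```python
import random
from types import SimpleNamespace

# Hypothetical pools; replace them with real values in an actual project.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) "
    "Chrome/16.0.912.36 Safari/535.7",
]
PROXIES = ["http://127.0.0.1:8888", "http://127.0.0.1:8889"]


class RandomIdentityMiddleware:
    """Downloader-middleware sketch: random User-Agent plus random proxy."""

    def __init__(self, user_agents=USER_AGENTS, proxies=PROXIES, rng=None):
        # dict.fromkeys drops duplicate entries while keeping their order
        self.user_agents = list(dict.fromkeys(user_agents))
        self.proxies = list(proxies)
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        if self.proxies:
            request.meta["proxy"] = self.rng.choice(self.proxies)
        return None  # let the request continue through the chain


# Stand-in for a request object, with the two attributes the sketch touches.
fake_request = SimpleNamespace(headers={}, meta={})
result = RandomIdentityMiddleware(rng=random.Random(0)).process_request(
    fake_request, spider=None)
```

Deduplicating the pool is optional; it simply keeps repeated entries from being chosen more often than the others.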
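3.3 Handling responses in middleware
Example: scraping news from NetEase News (网易新闻)
3.3.1 Spider file
Because dynamically loaded response data has to be fetched with selenium, the spider file needs to create a browser driver object
middleSpider.py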
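import scrapy
from selenium import webdriver


class MiddlespiderSpider(scrapy.Spider):
    name = 'MiddleSpider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://news.163.com/']
    models_urls = []  # URLs of the pages inside each news section

    # Instantiate a browser object
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # executable_path is the Selenium 3 keyword; newer Selenium 4 releases
        # take the driver path through a Service object instead
        self.bro = webdriver.Chrome(executable_path='<path to the browser driver>')

    def parse(self, response):
        pass

    # Close the browser object when the spider finishes
    def closed(self, reason):
        self.bro.quit()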
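3.3.2 Downloader middleware file
MiddleproDownloaderMiddleware is the downloader middleware class
Tampering with responses mainly means modifying the process_response method of the downloader middleware, which uses selenium here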
import time
from scrapy.http import HtmlResponse


class MiddleproDownloaderMiddleware:
    def process_request(self, request, spider):
        return None

    # The spider argument is the spider object, i.e. MiddleSpider
    def process_response(self, request, response, spider):
        # Pick out the specific responses to replace:
        # the url identifies the request, and the request identifies its response
        bro = spider.bro
        if request.url in spider.models_urls:
            bro.get(request.url)
            time.sleep(2)
            page_text = bro.page_source
            # Responses for the five news sections need to be replaced
            new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
            return new_response
        else:
            return response

    def process_exception(self, request, exception, spider):
        pass
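The fixed time.sleep(2) above either wastes time or is too short, depending on the page. One possible alternative, shown here as a heuristic sketch rather than the method used in this article, is to poll until the page source stops changing. wait_for_stable is a hypothetical helper; the sleep and clock parameters exist only so the function can be exercised without real delays.

```python
import time


def wait_for_stable(get_value, timeout=10.0, interval=0.5,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_value() until two consecutive reads match or timeout expires.

    Returns the last value read in either case.
    """
    deadline = clock() + timeout
    previous = get_value()
    while clock() < deadline:
        sleep(interval)
        current = get_value()
        if current == previous:
            return current
        previous = current
    return previous


# Simulated page source that keeps growing for a few reads, then settles.
snapshots = iter(["<p>a</p>", "<p>ab</p>", "<p>abc</p>", "<p>abc</p>", "<p>late</p>"])
stable = wait_for_stable(lambda: next(snapshots), sleep=lambda seconds: None)
```

In the middleware this could be used as page_text = wait_for_stable(lambda: bro.page_source). When a specific element marks the page as ready, Selenium's own WebDriverWait is another option.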
3.3.3 settings.py
After modifying the downloader middleware, enable it in settings.py
DOWNLOADER_MIDDLEWARES = {
    'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
}