- Passing data between requests (request meta)
Use case: the data to be parsed is not all on a single page (deep crawling).
For example: suppose we first crawl the listing page and then need to parse each detail page. How do we do that?
```
 1  # parse the job names on the listing page
 2  def parse(self, response):
 3      li_list = response.xpath('//*[@id="main"]/div/div[3]/ul/li')
 4      for li in li_list:
 5          # instantiate an item object
 6          item = BossproItem()
 7
 8          detail_page_url = 'https://www.zhipin.com' + li.xpath('./div/div[1]/div[1]/div/@href').extract_first()
 9          job_name = li.xpath('.//span[@class="job-name"]//text()').extract()
10
11          # pagination
12          if self.page_num <= 5:
13              new_url = format(self.url % self.page_num)
14              self.page_num += 1
15          # send a request to the detail page to get its page source
16          yield scrapy.Request(detail_page_url, callback=self.parse_detail)
17
18          item['job_name'] = job_name
```
In an earlier version of this code, the callback argument on line 16 was parse(). But the detail data lives on a second page: if the response were handed back to parse(), it would again be parsed as listing-page data. The callback here must therefore be the function that parses the detail page, parse_detail().
In addition, the data parsed in parse() (such as job_name) is stored in the item, and parse_detail() needs access to that same item so it can attach the detail-page data to it before yielding. So we also have to pass the item along with the request on line 16, i.e. line 16 becomes:
```python
yield scrapy.Request(detail_page_url, callback=self.parse_detail, meta={'item': item})
```
Then add a parse_detail() that retrieves the item from the request's meta and parses the detail data into it:
```python
# the callback receives the item via response.meta
# parse the job description on the detail page
def parse_detail(self, response):
    item = response.meta['item']
    job_desc = response.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]//text()').extract()
    job_desc = ''.join(job_desc)
    # print(job_desc)
    item['job_desc'] = job_desc
    yield item
```
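For reference, a minimal sketch of what the matching items.py might look like; the class name BossproItem and the fields job_name and job_desc are the ones used in the code above, the rest is assumed:

```python
import scrapy


class BossproItem(scrapy.Item):
    # filled in by parse() on the listing page
    job_name = scrapy.Field()
    # filled in by parse_detail() on the detail page
    job_desc = scrapy.Field()
```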
- Image crawling
1. Image crawling in Scrapy uses a dedicated pipeline class, ImagesPipeline, because the two kinds of data are handled differently:
(1) Strings: they only need to be parsed with XPath and submitted to a pipeline for persistent storage.
(2) Images: XPath only yields the value of the image's src attribute; a separate request must then be sent to that image address to fetch the binary image data.
2. ImagesPipeline:
You only need to parse out the img src attribute value and submit it to the pipeline; the pipeline then sends a request to that src, fetches the binary image data, and takes care of persistent storage for you.
3. Usage flow
(1) Data parsing (only the image address needs to be extracted).
(2) Submit the item holding the image address to the designated pipeline class.
(3) In the pipelines file, write a custom pipeline class based on ImagesPipeline:
```python
from scrapy.pipelines.images import ImagesPipeline
import scrapy


class imgsPipeLine(ImagesPipeline):
    # send a request for the image data based on the image address
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # specify the file name used to store the image (relative to IMAGES_STORE)
    def file_path(self, request, response=None, info=None, *, item=None):
        imgName = request.url.split('/')[-1]
        return imgName

    # hand the item over to the next pipeline class to be executed, if any
    def item_completed(self, results, item, info):
        return item
```
(4) In the settings file, specify the directory where images are stored, IMAGES_STORE = './imgsName', and enable the custom pipeline class.
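To make steps (1), (2) and (4) concrete, here is a rough sketch. The item class ImgItem (with a single src field), the XPath, and the project name imgsPro are placeholders invented for illustration; only IMAGES_STORE = './imgsName' and the imgsPipeLine class come from the notes above.

```python
# spider side: only the image address is parsed out and submitted to the pipeline
def parse(self, response):
    for div in response.xpath('//div[@class="img-item"]'):   # placeholder XPath
        item = ImgItem()                                      # hypothetical item with one `src` field
        item['src'] = div.xpath('./img/@src').extract_first()
        yield item
```

```python
# settings.py
IMAGES_STORE = './imgsName'   # directory where downloaded images are saved

ITEM_PIPELINES = {
    'imgsPro.pipelines.imgsPipeLine': 300,   # enable the custom ImagesPipeline subclass
}
```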
- Middleware (divided into spider middleware, which is rarely used, and downloader middleware, the focus here)
Downloader middleware: sits between the engine and the downloader, defined as a class in middlewares.py.
Purpose: intercept, in one place, all requests and responses in the whole project.
Intercepting requests: UA spoofing (process_request) and proxy IPs (process_exception: return request), as in the code below.
```python
import random


class MiddleproDownloaderMiddleware:   # placeholder name for the downloader middleware class in middlewares.py
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055',
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508',
    ]

    def process_request(self, request, spider):
        # UA spoofing
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        # set a proxy here mainly to verify that the proxy IP takes effect
        request.meta['proxy'] = 'http://' + random.choice(self.PROXY_http)
        return None

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from another downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops the process_exception() chain
        # - return a Request object: stops the process_exception() chain

        # assign a proxy IP to the failed request
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(self.PROXY_http)
        else:
            request.meta['proxy'] = 'https://' + random.choice(self.PROXY_https)
        # resend the corrected request object
        return request
```
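For the middleware to take effect it also has to be enabled in the settings file; a minimal sketch, assuming the project is named middlePro and using the placeholder class name from above:

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
}
```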
Intercepting responses: tampering with the response data / replacing the response object (process_response).
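The notes stop here, but as a rough illustration of what replacing a response object can look like: in process_response you can rebuild the body by other means (for example with a headless browser) and return a new HtmlResponse, so the spider parses the replaced body instead of the one the downloader fetched. The URL condition and the helper get_rendered_page below are hypothetical, invented only for this sketch:

```python
from scrapy.http import HtmlResponse


def process_response(self, request, response, spider):
    # example: replace the response object only for certain URLs (placeholder condition)
    if 'detail' in request.url:
        # hypothetical helper that produces the real (e.g. browser-rendered) page text
        new_body = get_rendered_page(request.url)
        # wrap the new body in a fresh response object tied to the original request
        return HtmlResponse(url=request.url, body=new_body,
                            encoding='utf-8', request=request)
    # all other responses pass through unchanged
    return response
```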