Scrapy Middleware
Scrapy has two kinds of middleware: spider middleware and downloader middleware.
Spider middleware: sits between the engine and the spider.
Downloader middleware: sits between the engine and the downloader.
This section focuses on downloader middleware.
Downloader Middleware
Purpose: intercept requests and responses in bulk.
Intercepting requests
UA spoofing: assign as many different User-Agent identities to the outgoing requests as possible:
request.headers['User-Agent'] = 'xxx'
This requires building a User-Agent pool:
user_agent_list = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
"(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
"(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
"(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
"(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
"(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
"(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
"(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
"(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
"(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
Proxy handling: send requests through a proxy:
request.meta['proxy'] = 'http://ip:port'
This requires building a proxy pool:
# Source your own proxies; the ones below are no longer valid
PROXY_http = [
'153.180.102.104:80',
'195.208.131.189:56055',
]
PROXY_https = [
'120.83.49.90:9000',
'95.189.112.214:35508',
]
Note: set the proxy not only in process_request but also in process_exception.
Reason: when your IP is banned, some sites still return a "successful" response that is actually an error page, while others fail the request outright. These two cases should be handled by separate middleware methods.
Intercepting responses
Tamper with the response data, or replace the response object outright.
Request-interception example:
import random

class MovieproDownloaderMiddleware(object):
    # Intercepts every normal request; `request` is the intercepted request object
    def process_request(self, request, spider):
        print('i am process_request()')
        # Assign the intercepted requests as many different User-Agent identities as possible
        request.headers['User-Agent'] = random.choice(user_agent_list)
        # Proxy handling: pick a proxy matching the request's scheme
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(PROXY_http)  # http://ip:port
        else:
            request.meta['proxy'] = 'https://' + random.choice(PROXY_https)  # https://ip:port
        return None

    # Intercepts responses; `response` is the intercepted response object
    def process_response(self, request, response, spider):
        print('i am process_response()')
        return response

    # Intercepts requests that raised an exception
    def process_exception(self, request, exception, spider):
        print('i am process_exception()')
        # Fix the failed request (here: switch to a fresh proxy), then resend it
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(PROXY_http)  # http://ip:port
        else:
            request.meta['proxy'] = 'https://' + random.choice(PROXY_https)  # https://ip:port
        return request  # returning the request reschedules it for sending
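None of this runs until the middleware is enabled in settings.py. A minimal sketch, assuming the project module is named moviePro (inferred from the class name above; adjust to your own project):

# settings.py -- the number is the middleware's priority in the chain
DOWNLOADER_MIDDLEWARES = {
    'moviePro.middlewares.MovieproDownloaderMiddleware': 543,
}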
Using Selenium in Scrapy
Use case: the page loads its data dynamically, so requesting the start URL alone does not return the desired data; Selenium is used to fetch the fully rendered page.
Workflow
1. Instantiate the browser object in the spider class's constructor:
bro = webdriver.Chrome(executable_path=r'C:\Users\oldboy-python\Desktop\爬虫+数据\tools\chromedriver.exe')
2. Perform the browser automation in the middleware.
3. Close the browser in the spider class's closed(self, spider) method:
def closed(self, spider):
    self.bro.quit()
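Putting the three steps together, a minimal spider sketch (the spider name, start URL, and chromedriver path are placeholder assumptions):

import scrapy
from selenium import webdriver

class WangyiSpider(scrapy.Spider):
    name = 'wangyi'  # hypothetical spider name
    start_urls = ['https://news.163.com/']  # hypothetical start URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Step 1: one shared browser instance for the whole crawl
        self.bro = webdriver.Chrome(executable_path=r'./chromedriver')

    def parse(self, response):
        pass  # normal parsing; step 2 happens in the downloader middleware

    # Step 3: called once when the spider closes; release the browser
    def closed(self, spider):
        self.bro.quit()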
Response-interception middleware example
1. First find the request objects whose responses do not meet our needs.
   You can define a container in the spider class that stores the request URLs of those responses,
   and read it through the spider parameter; spider is the instantiated spider object from the spider file (see the spider-side sketch after the code below).
2. Build a new response object with the HtmlResponse class from the Selenium-rendered page source, and return it in place of the unsatisfactory one:
from time import sleep
from scrapy.http import HtmlResponse

def process_response(self, request, response, spider):
    # spider.five_model_urls: the URLs of the five news sections
    bro = spider.bro
    if request.url in spider.five_model_urls:
        bro.get(request.url)
        sleep(1)
        page_text = bro.page_source  # includes the dynamically loaded news data
        # If we get here, this response belongs to one of the five sections:
        # replace it with one built from the rendered page source
        new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
        return new_response
    return response
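For completeness, a sketch of the spider side that fills the five_model_urls container, extending the spider sketch shown earlier (the XPath and list indices are assumptions modelled on a NetEase-news style page; adapt them to your target site):

import scrapy
from selenium import webdriver

class WangyiSpider(scrapy.Spider):
    name = 'wangyi'  # hypothetical spider name
    start_urls = ['https://news.163.com/']  # hypothetical start URL
    five_model_urls = []  # read by process_response in the middleware

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bro = webdriver.Chrome(executable_path=r'./chromedriver')

    def parse(self, response):
        li_list = response.xpath('//div[@class="ns_area list"]/ul/li')  # assumed selector
        for index in [3, 4, 6, 7, 8]:  # assumed positions of the five sections
            model_url = li_list[index].xpath('./a/@href').extract_first()
            self.five_model_urls.append(model_url)
            # these requests get their responses swapped in process_response
            yield scrapy.Request(url=model_url, callback=self.parse_model)

    def parse_model(self, response):
        pass  # parse the Selenium-rendered section page here

    def closed(self, spider):
        self.bro.quit()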