Scrapy CrawlSpider Source Code Analysis

2023-10-13 16:28:40

crawl.py mainly contains two classes:

1. CrawlSpider

2. Rule

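    link_extractor: a LinkExtractor instance

    callback: the callback for extracted links, passed as a method-name string such as "func_name" (a callable is also accepted)

    cb_kwargs=None: keyword arguments passed on to the callback

    follow=None: combined with the CRAWLSPIDER_FOLLOW_LINKS setting by a logical and, so links are followed only when both are True. When left as None, it defaults to True if no callback is given and to False otherwise

    process_links=None: used to preprocess the extracted links (URLs)

    process_request=identity: by default, calling process_request simply returns the request it was given, unchanged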
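How the follow default and the identity default fit together can be sketched in plain Python. This is a simplified model and not Scrapy's own code: SketchRule, identity and should_follow are hypothetical names used only for illustration, and the crawlspider_follow_links argument stands in for the CRAWLSPIDER_FOLLOW_LINKS setting.

```python
def identity(x):
    """Return the argument unchanged (models the default process_request)."""
    return x


class SketchRule:
    """Simplified stand-in for Rule; hypothetical, not Scrapy's actual class."""

    def __init__(self, link_extractor, callback=None, cb_kwargs=None,
                 follow=None, process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        # When follow is not given, follow only if there is no callback.
        self.follow = (not callback) if follow is None else follow


def should_follow(rule, crawlspider_follow_links=True):
    """Links are followed only if both the rule and the setting allow it."""
    return bool(rule.follow and crawlspider_follow_links)
```

Under this model, a rule with a callback and no explicit follow stops at the matched page, while a rule without a callback keeps following links as long as the setting allows it.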

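CrawlSpider: inherits from the Spider class

1) Entry point: Spider's start_requests is called, and the responses are handled by parse by default

2) CrawlSpider overrides Spider's parse method: it returns the result of calling _parse_response (* so a custom spider must not override parse to handle responses)

3) _parse_response method: Scrapy reserves the parse_start_url and process_results methods as hooks for custom response handling. The output of process_results is iterated and yielded (if neither hook is overridden, this step effectively does nothing). The method then checks the setting (CRAWLSPIDER_FOLLOW_LINKS=True) together with the follow flag, calls _requests_to_follow, and yields each result

4) _requests_to_follow method: for each Rule in rules, calls the extract_links method of its LinkExtractor, takes each extracted link and adds it to a set for deduplication, calls _build_request to create the Request object, and yields the result of the Rule's process_request method with that request as the argument (by default this is equivalent to yielding the Request itself)

5) _build_request method: instantiates a Request (with _response_downloaded as its callback) and returns the Request instance

6) _response_downloaded method: looks up the Rule that produced the request and returns the result of calling _parse_response with it

7) _parse_response method: this time it calls rule.callback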
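The seven steps above can be sketched as a small self-contained model. It is an illustration under stated assumptions, not Scrapy's implementation: FakeResponse, FakeRequest, SketchRule and SketchCrawlSpider are hypothetical stand-ins, links are plain URL strings, and process_links and process_request are left out to keep the flow visible.

```python
class FakeResponse:
    """Hypothetical stand-in for a downloaded page."""
    def __init__(self, url, links=(), meta=None):
        self.url = url
        self.links = list(links)
        self.meta = meta or {}


class FakeRequest:
    """Hypothetical stand-in for a Request."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class SketchRule:
    def __init__(self, extract_links, callback=None, follow=None):
        self.extract_links = extract_links  # function: response -> list of URLs
        self.callback = callback
        self.follow = (not callback) if follow is None else follow


class SketchCrawlSpider:
    follow_links = True  # models the CRAWLSPIDER_FOLLOW_LINKS setting

    def __init__(self, rules):
        self.rules = rules

    def parse_start_url(self, response):
        return []  # hook: does nothing unless overridden

    def process_results(self, response, results):
        return results  # hook: passes results through unless overridden

    def parse(self, response):
        # Step 2: parse only delegates, which is why it must not be overridden.
        return self._parse_response(response, self.parse_start_url, follow=True)

    def _parse_response(self, response, callback, follow):
        # Steps 3 and 7: run the callback, then follow links if allowed.
        if callback:
            results = callback(response) or []
            yield from self.process_results(response, results)
        if follow and self.follow_links:
            yield from self._requests_to_follow(response)

    def _requests_to_follow(self, response):
        # Step 4: extract links per rule and skip the ones already seen.
        seen = set()
        for index, rule in enumerate(self.rules):
            for url in rule.extract_links(response):
                if url in seen:
                    continue
                seen.add(url)
                yield self._build_request(index, url)

    def _build_request(self, rule_index, url):
        # Step 5: remember which rule produced the request.
        return FakeRequest(url, callback=self._response_downloaded,
                           meta={"rule": rule_index})

    def _response_downloaded(self, response):
        # Step 6: recover the rule and dispatch to its callback.
        rule = self.rules[response.meta["rule"]]
        return self._parse_response(response, rule.callback, rule.follow)


def parse_item(response):
    return [{"url": response.url}]


spider = SketchCrawlSpider([SketchRule(lambda r: r.links, callback=parse_item)])
start = FakeResponse("http://example.com/", links=[
    "http://example.com/a", "http://example.com/a", "http://example.com/b"])
requests = list(spider.parse(start))
page = FakeResponse(requests[0].url, meta=requests[0].meta)
items = list(requests[0].callback(page))
```

In this model the start page yields no items because parse_start_url is not overridden, the duplicate link is dropped, and the item only appears once the follow-up request goes back through _response_downloaded to parse_item.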

Key points:

1. Override the reserved hooks: the parse_start_url and process_results methods

2. Customize the Rule parameters: process_links (to preprocess URLs)
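As a rough illustration of what such hooks might do, the functions below normalize links and post-process results. They are hypothetical sketches that work on plain URL strings and dicts; in a real spider, process_links receives the link objects produced by the LinkExtractor and process_results is a spider method that receives the response.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Drop the query string and fragment from a URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def process_links(urls):
    """Hypothetical process_links hook: normalize URLs and drop duplicates."""
    cleaned = []
    for url in urls:
        url = strip_query(url)
        if url not in cleaned:
            cleaned.append(url)
    return cleaned


def process_results(response_url, results):
    """Hypothetical process_results hook: tag each item with its source page."""
    return [dict(item, source=response_url) for item in results]
```

Preprocessing links this way keeps URLs that differ only in their query string from being requested separately, which may or may not be desirable depending on the site.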
