Scrapy: Spiders

Official documentation: https://docs.scrapy.org/en/latest/topics/spiders.html#

In one sentence: a spider is where you define the crawling behavior (e.g. whether to follow new links) and how to parse the page structure (extract data and return items).

I scrapy.Spider

  1 name

  2 allowed_domains  <----------------------->  OffsiteMiddleware

  3 start_urls  <-----------------------> start_requests()

  4 custom_settings  <-------------------------> Built-in settings reference

  It must be defined as a class attribute since the settings are updated before instantiation.

import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    # allowed_domains expects bare domain names, not full URLs
    allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.baidu.com/']
    # custom_settings must be a class attribute; keys use the built-in setting names
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    }

    def parse(self, response):
        pass

  5 crawler <----------> from_crawler()

  6 settings

  7 logger

  8 from_crawler(crawler, *args, **kwargs)

  This is the class method used by Scrapy to create your spiders.
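  As a rough sketch (not from the original post; the download_delay usage is only for illustration), overriding from_crawler is one way to reach the Crawler object, e.g. to read settings or connect signal handlers:

import scrapy
from scrapy import signals

class FromCrawlerSpider(scrapy.Spider):
    name = 'from_crawler_demo'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # call the base implementation so crawler/settings are attached as usual
        spider = super().from_crawler(crawler, *args, **kwargs)
        # read a setting and connect a signal handler via the crawler
        spider.download_delay = crawler.settings.getfloat('DOWNLOAD_DELAY', 0)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        self.logger.info('spider closed: %s', spider.name)

    def parse(self, response):
        pass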

  9 start_requests()

  It is called by Scrapy when the spider is opened for scraping.

  Core code:

for url in self.start_urls:
    yield Request(url, dont_filter=True)

   A note on Request: below is the constructor signature from the Request source code.

class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):

  As the source shows, Request issues a GET request by default; to send a POST request you need to override start_requests(). This is where the Request class (here, its FormRequest subclass) comes in.

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

  10 parse(response)

  This method, as well as any other Request callback, must return an iterable of Request and/or dicts or Item objects.
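  A minimal sketch (not in the original post; the site, selectors, and field names are only for illustration, and response.follow assumes Scrapy 1.4+) of a parse callback that yields both item dicts and follow-up requests:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes_demo'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # yield one dict per quote block on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # follow pagination with the same callback
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)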

  11 log(message[, level, component])

  12 closed(reason)
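  A rough sketch (not from the original post; the URL is a placeholder) of how logger and closed(reason) are typically used together:

import scrapy

class StatsSpider(scrapy.Spider):
    name = 'stats_demo'
    start_urls = ['https://www.example.com/']

    def parse(self, response):
        # self.logger is a standard Logger named after the spider
        self.logger.info('parsed %s (status %s)', response.url, response.status)

    def closed(self, reason):
        # called when the spider finishes; reason is e.g. 'finished' or 'shutdown'
        self.logger.info('spider closed, reason: %s', reason)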

  

II Spider arguments

  Spider arguments are passed on the command line with the -a option and become keyword arguments of the spider's constructor (see the sketch below).
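  A minimal sketch (the category argument and URL pattern are just for illustration) of receiving a -a argument:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # run with: scrapy crawl myspider -a category=electronics
        self.start_urls = [f'http://www.example.com/categories/{category}']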

III Generic Spiders

  1 CrawlSpider

    Recommended.

    Adds rules, which simplify link following and extraction; see the sketch after this list.

  2 XMLFeedSpider

  3 CSVFeedSpider

  4 SitemapSpider
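  As referenced above, a rough CrawlSpider sketch (the domain, URL patterns, and selectors are made up for illustration; note that CrawlSpider uses rules plus its own callbacks instead of overriding parse):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MyCrawlSpider(CrawlSpider):
    name = 'crawl_demo'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        # follow category links and keep crawling (no callback, follow=True)
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        # parse item pages with parse_item
        Rule(LinkExtractor(allow=r'/item/\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }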
