Full-site data scraping with a Spider
- Full-site scraping means crawling the page data for every page number under a given section of a site
- Requirement: scrape the photo names from the xiaohua site (www.521609.com)
- Implementation options:
  - Add the URL of every page to the start_urls list (not recommended)
  - Send the follow-up requests manually (recommended)
    - yield scrapy.Request(url, callback): the callback function is dedicated to parsing the response
import scrapy

class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/tuku/index.html']
    # URL template for the follow-up pages; the first page was already in start_urls
    url = 'http://www.521609.com/tuku/index_%d.html'
    page_num = 2

    def parse(self, response):
        # Extract the photo name from each list item on the current page
        li_list = response.xpath('/html/body/div[4]/div[3]/ul/li')
        for li in li_list:
            img_name = li.xpath('./a/p/text()').extract_first()
            print(img_name)
        # Crawl pages 2 through 6
        if self.page_num <= 6:
            new_url = self.url % self.page_num
            print(new_url)
            self.page_num += 1
            # Send the request manually; the callback handles parsing
            yield scrapy.Request(url=new_url, callback=self.parse)
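The %-template pagination logic can be sanity-checked on its own, outside Scrapy. This is a minimal sketch (the `paginated_urls` helper is illustrative, not part of the spider) that builds the same follow-up URLs, pages 2 through 6, that the spider above would request:

```python
# Same URL template the spider uses for follow-up pages
URL_TEMPLATE = 'http://www.521609.com/tuku/index_%d.html'

def paginated_urls(template, first=2, last=6):
    """Yield the URL for each page number from first to last, inclusive."""
    for page_num in range(first, last + 1):
        yield template % page_num

urls = list(paginated_urls(URL_TEMPLATE))
print(urls[0])   # http://www.521609.com/tuku/index_2.html
print(len(urls)) # 5
```

In the real spider this loop is unrolled across requests: each `parse` call emits one `scrapy.Request` for the next page, with `parse` itself as the callback, until `page_num` passes the limit.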