Basic file design
(i.e., the design of every file other than spider.py)
Apart from item.py, the other generated files needed no significant changes.
item.py is designed as follows:
class WorldjournalspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    n_url = scrapy.Field()    # article URL
    n_title = scrapy.Field()  # headline
    n_scrip = scrapy.Field()  # summary / description
    n_time = scrapy.Field()   # publication time
    n_tag = scrapy.Field()    # category tag
spider.py design
1. Initializing the crawl parameters
name = 'worldjournal'
allowed_domains = ['www.worldjournal.com']
keyword = "关键词"  # the search keyword (placeholder value)
search = "https://www.worldjournal.com/search/word/8877/" + keyword + "?zh-cn"
start_urls = [search]
The final crawl URL is built from the site's search URL, followed by the keyword and the Simplified Chinese suffix.
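Concatenating a Chinese keyword into the URL as raw text can break depending on how the request is encoded. A minimal sketch of safer URL construction, using a hypothetical `build_search_url` helper that percent-encodes the keyword before appending the suffix:

```python
from urllib.parse import quote

def build_search_url(keyword: str) -> str:
    """Hypothetical helper: percent-encode the keyword so non-ASCII
    characters survive the HTTP request, then append the Simplified
    Chinese suffix used by the site."""
    base = "https://www.worldjournal.com/search/word/8877/"
    return base + quote(keyword) + "?zh-cn"

print(build_search_url("关键词"))
# → https://www.worldjournal.com/search/word/8877/%E5%85%B3%E9%94%AE%E8%AF%8D?zh-cn
```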
2. Locating the target data with XPath
item = WorldjournalspiderItem()
url = response.xpath('//div[@class="subcate-list__link tag-page"]/a/@href').extract()
title = response.xpath('//div[@class="subcate-list__link tag-page"]/div/a/h2/text()').extract()
scription = response.xpath('//div[@class="subcate-list__link tag-page"]/div/a/p/text()').extract()
time = response.xpath('//div[@class="subcate-list__link tag-page"]/div/div/span[@class="subcate-list__time"]/text()').extract()
tag = response.xpath('//div[@class="subcate-list__link tag-page"]/div/div/a/text()').extract()
After inspecting the page source, each piece of information can be extracted with the XPath expressions above.
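The XPath expressions can be checked outside Scrapy against a simplified mock of one search-result entry (the real markup is richer; the snippet below only keeps the nodes the expressions touch, and uses lxml rather than Scrapy's response object):

```python
from lxml import html

# Simplified, assumed mock of one search-result entry.
doc = html.fromstring("""
<div class="subcate-list__link tag-page">
  <a href="/news/1"></a>
  <div>
    <a href="/news/1"><h2>Sample headline</h2><p>Sample summary</p></a>
    <div>
      <span class="subcate-list__time">2020-01-01</span>
      <a>WorldNews</a>
    </div>
  </div>
</div>
""")

# Same path expressions as in the spider.
titles = doc.xpath('//div[@class="subcate-list__link tag-page"]/div/a/h2/text()')
times = doc.xpath('//div[@class="subcate-list__link tag-page"]/div/div/span[@class="subcate-list__time"]/text()')
print(titles, times)
# → ['Sample headline'] ['2020-01-01']
```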
3. Yielding the items
for i in range(len(title)):
    item = WorldjournalspiderItem()  # create a fresh item per result instead of reusing one mutable instance
    item['n_title'] = title[i]
    item['n_url'] = url[i]
    item['n_scrip'] = scription[i]
    item['n_time'] = time[i]
    item['n_tag'] = tag[i]
    yield item
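The loop indexes five parallel lists, so if any XPath expression matches a different number of nodes, an IndexError follows. A minimal sketch of an alternative using zip, which stops at the shortest list (plain dicts stand in for WorldjournalspiderItem here, and `make_items` is a hypothetical name):

```python
def make_items(urls, titles, scriptions, times, tags):
    """Sketch: zip the parallel result lists instead of indexing.
    A row with a missing field is dropped rather than raising IndexError."""
    for n_url, n_title, n_scrip, n_time, n_tag in zip(urls, titles, scriptions, times, tags):
        yield {
            'n_url': n_url,
            'n_title': n_title,
            'n_scrip': n_scrip,
            'n_time': n_time,
            'n_tag': n_tag,
        }

items = list(make_items(['/a'], ['t1'], ['s1'], ['2020-01-01'], ['tag1']))
# → [{'n_url': '/a', 'n_title': 't1', 'n_scrip': 's1', 'n_time': '2020-01-01', 'n_tag': 'tag1'}]
```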