Project Training Report 11 (Supplement): Crawling the World Journal Website

Basic File Design

(i.e., the design of all files other than spider.py)

Apart from items.py, the other files need no major changes.

The design of items.py is as follows:

import scrapy


class WorldjournalspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    n_url = scrapy.Field()    # article URL
    n_title = scrapy.Field()  # article title
    n_scrip = scrapy.Field()  # article summary / description
    n_time = scrapy.Field()   # publication time
    n_tag = scrapy.Field()    # category tag

spider.py Design

1. Initializing the crawl

    name = 'worldjournal'
    allowed_domains = ['www.worldjournal.com']
    keyword = "关键词"  # placeholder for the search keyword
    search = "https://www.worldjournal.com/search/word/8877/" + keyword + "?zh-cn"
    start_urls = [search]

The final crawl URL is composed of the search URL, followed by the keyword, followed by the Simplified-Chinese-edition suffix.
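As a minimal sketch of that composition, the keyword can be percent-encoded before it is placed in the URL path (the `build_search_url` helper and the use of `quote` are illustrative additions, not part of the original spider, though encoding is generally needed when the keyword contains non-ASCII characters):

```python
from urllib.parse import quote


def build_search_url(keyword: str) -> str:
    # Percent-encode the keyword so non-ASCII characters are safe in the URL path
    base = "https://www.worldjournal.com/search/word/8877/"
    return base + quote(keyword) + "?zh-cn"


url = build_search_url("关键词")
# e.g. https://www.worldjournal.com/search/word/8877/%E5%85%B3%E9%94%AE%E8%AF%8D?zh-cn
```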

2. Locating the information with XPath

        item = WorldjournalspiderItem()
        # Each search result sits in a <div class="subcate-list__link tag-page"> card
        url = response.xpath('//div[@class="subcate-list__link tag-page"]/a/@href').extract()
        title = response.xpath('//div[@class="subcate-list__link tag-page"]/div/a/h2/text()').extract()
        scription = response.xpath('//div[@class="subcate-list__link tag-page"]/div/a/p/text()').extract()
        time = response.xpath('//div[@class="subcate-list__link tag-page"]/div/div/span[@class="subcate-list__time"]/text()').extract()
        tag = response.xpath('//div[@class="subcate-list__link tag-page"]/div/div/a/text()').extract()

After analyzing the page source, each piece of information can be located with the XPath expressions above.
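The page structure these XPaths assume can be exercised offline against a sample result card. The markup below is reconstructed from the XPath expressions, not taken from the live site; the standard-library `xml.etree.ElementTree` supports enough of XPath to demonstrate the same lookups:

```python
import xml.etree.ElementTree as ET

# Hypothetical result-card markup, reconstructed from the XPaths above
html = """
<body>
  <div class="subcate-list__link tag-page">
    <a href="https://www.worldjournal.com/wj/story/example"/>
    <div>
      <a>
        <h2>Example headline</h2>
        <p>Example summary text</p>
      </a>
      <div>
        <span class="subcate-list__time">2021-06-01 12:00</span>
        <a>Example tag</a>
      </div>
    </div>
  </div>
</body>
"""

root = ET.fromstring(html)
cards = root.findall('.//div[@class="subcate-list__link tag-page"]')

# Same relative paths as the Scrapy selectors, one list per field
urls = [c.find('a').get('href') for c in cards]
titles = [c.find('div/a/h2').text for c in cards]
descs = [c.find('div/a/p').text for c in cards]
times = [c.find('div/div/span[@class="subcate-list__time"]').text for c in cards]
tags = [c.find('div/div/a').text for c in cards]
```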

3. Yielding the items

        for i in range(len(title)):
            # Create a fresh item for each result: mutating and re-yielding a
            # single item instance can let later results overwrite earlier ones
            # before the pipeline processes them
            item = WorldjournalspiderItem()
            item['n_title'] = title[i]
            item['n_url'] = url[i]
            item['n_scrip'] = scription[i]
            item['n_time'] = time[i]
            item['n_tag'] = tag[i]
            yield item
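The loop above essentially zips five parallel lists into one record per result. A stand-alone sketch of that pattern, with plain dicts standing in for the Scrapy item and illustrative sample values:

```python
def build_records(urls, titles, scrips, times, tags):
    # One fresh dict per result, mirroring one new Item per yield in the spider
    return [
        {'n_url': u, 'n_title': t, 'n_scrip': s, 'n_time': tm, 'n_tag': tg}
        for u, t, s, tm, tg in zip(urls, titles, scrips, times, tags)
    ]


records = build_records(
    ['https://example.com/a'], ['Title A'], ['Summary A'], ['2021-06-01'], ['Tag A'],
)
```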
