Persistent Storage in Scrapy

2023-10-31 16:28:40

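Method 1: Via a terminal command

　　Note: this approach can only store the return value of parse() in a local file, and the output format must be one of: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'

　　Command: in a terminal, run scrapy crawl xxx -o filePath, where xxx is the spider name; the output format is inferred from the file extension of filePath

　　Pros and cons: concise, convenient, and efficient, but fairly limited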
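The same export can also be declared in the project's settings file instead of on the command line. The fragment below is a minimal sketch assuming Scrapy 2.1 or later, where the FEEDS setting is available; the output path baoxiao.csv is a hypothetical name chosen for this example.

```python
# settings.py (sketch): declarative counterpart of `scrapy crawl xxx -o baoxiao.csv`.
# "baoxiao.csv" is a hypothetical output path, not one taken from the project.
FEEDS = {
    "baoxiao.csv": {
        "format": "csv",     # one of the feed formats listed above
        "encoding": "utf8",  # write non-ASCII text as UTF-8
    },
}
```

With this in place, a plain scrapy crawl xxx should write the feed without needing the -o option.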

    # Persistent storage via a terminal command
    def parse(self, response):
        div_list = response.xpath('//*[@id="content-left"]/div')
        all_data = []
        for div in div_list[1:]:
            # xpath() always returns a list (a SelectorList) whose elements are Selector objects
            # extract() pulls out the string held in a Selector's data attribute
            # title = div.xpath('./div[2]/a/text()')[0].extract()
            # extract_first() returns the first match as a string (None if nothing matched), so [0] is not needed
            title = div.xpath('./div[2]/a/text()').extract_first()

            # //text() can match several nodes, so [0] would keep only the first; calling extract() on the list returns the data of every Selector as a list of strings
            content = div.xpath('./div[3]/a//text()').extract()
            # 'sep'.join(seq) joins the elements of seq into one string, separated by sep
            content = ''.join(content)

            dic = {
                'title': title,
                'content': content
            }
            all_data.append(dic)
        return all_data
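To make the difference between the output formats concrete, here is a rough stdlib-only imitation of what ends up in the file for 'json', 'jsonlines' and 'csv'. This is not Scrapy's exporter code, just a sketch under the assumption that parse() returned the two hypothetical dicts shown.

```python
import csv
import io
import json

# Hypothetical sample of what parse() might return: a list of dicts.
all_data = [
    {"title": "first", "content": "hello"},
    {"title": "second", "content": "world"},
]

# 'json': the whole result as a single JSON array.
as_json = json.dumps(all_data, ensure_ascii=False)

# 'jsonlines' / 'jl': one JSON object per line.
as_jsonlines = "\n".join(json.dumps(row, ensure_ascii=False) for row in all_data)

# 'csv': a header row built from the dict keys, then one row per dict.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "content"])
writer.writeheader()
writer.writerows(all_data)
as_csv = buffer.getvalue()

print(as_json)
print(as_jsonlines)
print(as_csv)
```

The line-oriented formats are usually easier to append to and to process record by record, while a single JSON array has to be parsed as a whole.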

Method 2: Via an item pipeline

　　Coding workflow:

　　　　Parse the data

    # Persistent storage via an item pipeline
    def parse(self, response):
        div_list = response.xpath('//*[@id="content-left"]/div')
        for div in div_list[1:]:
            # xpath() always returns a list (a SelectorList) whose elements are Selector objects
            # extract() pulls out the string held in a Selector's data attribute
            # title = div.xpath('./div[2]/a/text()')[0].extract()
            # extract_first() returns the first match as a string (None if nothing matched), so [0] is not needed
            title = div.xpath('./div[2]/a/text()').extract_first()

            # //text() can match several nodes, so [0] would keep only the first; calling extract() on the list returns the data of every Selector as a list of strings
            content = div.xpath('./div[3]/a//text()').extract()
            # 'sep'.join(seq) joins the elements of seq into one string, separated by sep
            content = ''.join(content)

　　　　Define the relevant fields in the item class

import scrapy


class BaoxiaoproItem(scrapy.Item):
    # define the fields for your item here like:
    # Declare one Field per piece of data the spider extracts
    title = scrapy.Field()
    content = scrapy.Field()

　　　　Wrap the parsed data in an item object and submit that item to the pipeline for persistent storage (this continues the data-parsing code above)

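            # Store the parsed data in an item object
            # (BaoxiaoproItem must be imported at the top of the spider file, e.g. from baoxiaoPro.items import BaoxiaoproItem)
            item = BaoxiaoproItem()
            item['title'] = title
            item['content'] = content

            # Submit the item to the pipeline
            yield item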

　　　　In the pipeline class's process_item, persist the data held in each item it receives

class BaoxiaoproPipeline(object):
    fp = None

    # Hook called by Scrapy exactly once, when the spider is opened
    def open_spider(self, spider):
        print('Spider started......')
        self.fp = open('./baoxiao.txt', 'w', encoding='utf-8')

    # Handles the item objects submitted by the spider file
    # Called once for every item received
    def process_item(self, item, spider):
        title = item['title']
        content = item['content']

        self.fp.write(title + ':' + '\n' + content + '\n')

        # Pass the item on to the next pipeline class in line
        return item

    # Hook called by Scrapy exactly once, when the spider is closed
    def close_spider(self, spider):
        print('Spider finished!!!')
        self.fp.close()

　　　　Enable the pipelines in the settings file (settings.py)

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'baoxiaoPro.pipelines.BaoxiaoproPipeline': 300,
    # The number sets the priority: the lower the value, the earlier the pipeline runs
    'baoxiaoPro.pipelines.mysqlPipeline': 301,
}
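The order in which the pipeline hooks fire can be tried out without running a crawl. The sketch below is a stand-in, not Scrapy's own machinery: FilePipeline mirrors the file-writing pipeline but takes its output path as a parameter, and run_pipeline is a hypothetical driver that imitates the call order (open once, process each item, close once).

```python
import os
import tempfile


class FilePipeline:
    """Stand-in for the file-writing pipeline; the output path is a parameter here."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Runs once before any item is processed.
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Runs once per item.
        self.fp.write(item["title"] + ":\n" + item["content"] + "\n")
        return item

    def close_spider(self, spider):
        # Runs once after the last item.
        self.fp.close()


def run_pipeline(pipeline, items, spider=None):
    """Hypothetical driver imitating the order in which the hooks are called."""
    pipeline.open_spider(spider)
    try:
        for item in items:
            pipeline.process_item(item, spider)
    finally:
        pipeline.close_spider(spider)


# Hypothetical items standing in for what the spider would yield.
items = [{"title": "t1", "content": "c1"}, {"title": "t2", "content": "c2"}]
out_path = os.path.join(tempfile.mkdtemp(), "baoxiao.txt")
run_pipeline(FilePipeline(out_path), items)

with open(out_path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

Opening the file in open_spider rather than in process_item means it is opened once per crawl instead of once per item.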

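　　Advantage: highly general-purpose

　　Interview question: how would you store one copy of the scraped data locally and another copy in a database?

　　　　First, each pipeline class in the pipelines file is responsible for storing the data to one destination;

　　　　Second, the item submitted by the spider file is handed only to the pipeline class that runs first;

　　　　Finally, return item in process_item() passes the item on to the next pipeline class to run.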
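The three points above can be sketched as a small simulation. Everything here is an assumption made for illustration: LocalPipeline collects lines in memory instead of writing a file, SqlitePipeline uses the stdlib sqlite3 module in place of a MySQL client so the example runs without a server, and run_chain is a hypothetical driver that orders the pipelines by their priority number and feeds each pipeline's return value to the next one.

```python
import sqlite3


class LocalPipeline:
    """Stand-in for the local-storage pipeline; keeps lines in memory."""

    def __init__(self):
        self.lines = []

    def process_item(self, item, spider):
        self.lines.append(item["title"] + ":" + item["content"])
        return item  # without this, the next pipeline would receive None


class SqlitePipeline:
    """Stand-in for the database pipeline, using sqlite3 instead of MySQL."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE baoxiao (title TEXT, content TEXT)")

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO baoxiao (title, content) VALUES (?, ?)",
            (item["title"], item["content"]),
        )
        self.conn.commit()
        return item


def run_chain(pipelines, items, spider=None):
    """Hypothetical driver: lower priority number runs first."""
    ordered = [p for p, _ in sorted(pipelines, key=lambda pair: pair[1])]
    for item in items:
        for pipeline in ordered:
            # Each pipeline receives whatever the previous one returned.
            item = pipeline.process_item(item, spider)
    return ordered


local = LocalPipeline()
database = SqlitePipeline()
ordered = run_chain(
    [(database, 301), (local, 300)],
    [{"title": "t1", "content": "c1"}],
)
rows = database.conn.execute("SELECT title, content FROM baoxiao").fetchall()
print(local.lines, rows)
```

Because each class handles exactly one destination, adding a third storage target would only need one more class and one more entry in ITEM_PIPELINES.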
