Scraping xiaohuar.com with Scrapy/Selenium

Target site: http://www.xiaohuar.com/

Campus-beauty listing, page 1: http://www.xiaohuar.com/list-1-0.html

Page 2: http://www.xiaohuar.com/list-1-1.html

and so on.
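As a quick sketch, these page URLs can be generated from the pattern above:

```python
# Listing pages follow the pattern list-1-<n>.html, with n starting at 0.
base = 'http://www.xiaohuar.com/list-1-{}.html'
urls = [base.format(n) for n in range(3)]
# urls[0] is the first page, urls[1] the second, and so on.
```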

 

Steps:

1. Create the project (run the commands below in a terminal, in the appropriate directory)

source activate spider

scrapy startproject xiaohua
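For reference, scrapy startproject xiaohua generates roughly this layout (Scrapy's standard project template):

```
xiaohua/
    scrapy.cfg            # deploy configuration
    xiaohua/              # the project's Python module
        __init__.py
        items.py          # item definitions (step 2)
        middlewares.py    # middlewares (steps 6-7)
        pipelines.py      # pipelines (step 4)
        settings.py       # settings (step 5)
        spiders/          # spiders go here (step 3)
            __init__.py
```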

 

2. Define the item (in items.py)

class XiaohuaItem(scrapy.Item):

    title = scrapy.Field()

    href = scrapy.Field()

    imgsrc = scrapy.Field()

3. Write the spider (note: base the XPath for the bookmark blocks on the raw page source; the div attribute values there differ from what DevTools/F12 shows)

import scrapy

from urllib import request

import re

from xiaohua.items import XiaohuaItem

class XiaohuaSpider(scrapy.Spider):

    name = 'xiaohua'

    allowed_domains = ['xiaohuar.com']

    start_urls = ['http://www.xiaohuar.com/list-1-0.html']

    def parse(self,response):

        # Note: base this XPath on the raw page source; the div attribute values there differ from what DevTools (F12) shows

        bookmarks = response.xpath('//div[@class="item masonry_brick"]')

        print('bookmarks length:', len(bookmarks))

        for bm in bookmarks:

            item = XiaohuaItem()

            title = bm.xpath('.//div[@class="title"]/span/a/text()').extract()[0]

            href = bm.xpath('.//div[@class="title"]/span/a/@href').extract()[0]

            imgsrc = bm.xpath('.//div[@class ="img"]/a/img/@src').extract()[0]

            item['title'] = title

            item['href'] = href

            item['imgsrc'] = response.urljoin(imgsrc)  # resolve relative image URLs against the page URL

            '''

            Following the video tutorial, multi-page crawling works.

            This block must be inside the for loop; otherwise a next-page

            request would be issued even for an empty page. With it in the

            loop, an empty bookmarks list yields no further requests, so the

            crawl stops at the last page. (Scrapy's default duplicate filter

            drops the repeated next-page requests yielded once per bookmark.)

            '''

            # Extract the current page number from the URL

            curpage = re.search(r'(\d+)-(\d+)', response.url).group(2)  # group(2) is the second parenthesized group

            # Compute the next page number

            pagenum = int(curpage) + 1

            # Build the next-page URL

            url = re.sub(r'1-(\d+)', '1-'+str(pagenum), response.url)

            # Yield a Request for the next page; note how callback is passed

            yield scrapy.Request(url, callback=self.parse)

            yield item
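The next-page and URL-joining logic above can be checked outside Scrapy; a minimal sketch (the sample URL and image path are hypothetical):

```python
import re
from urllib.parse import urljoin

# Hypothetical current-page URL, standing in for the spider's response.url.
url = 'http://www.xiaohuar.com/list-1-3.html'

# group(2) is the second parenthesized group: the page index after "1-".
curpage = re.search(r'(\d+)-(\d+)', url).group(2)

# Replace "1-<n>" with "1-<n+1>" to build the next page's URL.
next_url = re.sub(r'1-(\d+)', '1-' + str(int(curpage) + 1), url)

# Relative image paths are resolved against the page URL, as the spider does.
imgsrc = urljoin(url, '/d/file/beauty/pic.jpg')
```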

 

4. Write the pipeline (produces an xiaohua.json file)

import json

class XiaohuaPipeline(object):

    '''

    Open the json file when the pipeline is initialized and close it when the

    spider closes, so the file is opened only once for the whole crawl.

    '''

    def __init__(self):

        self.file = open('xiaohua.json', 'w', encoding='utf-8')  # utf-8 so non-ASCII titles are written correctly

    def process_item(self,item,spider):

        # an item can be converted directly to a dict

        content = json.dumps(dict(item),ensure_ascii=False) + '\n'

        self.file.write(content)

        return item

    def close_spider(self,spider):

        self.file.close()

    '''

    # The following also works, but it reopens the json file for every item

    # (and writes no newline between items):

    def process_item(self,item,spider):

        with open('xiaohua.json', 'a') as f:

            json.dump(dict(item), f, ensure_ascii=False)

        return item

    '''
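The pipeline's output format, one JSON object per line (JSON Lines), can be sketched standalone; the item values below are made up:

```python
import json

# Made-up item, shaped like the spider's XiaohuaItem fields.
item = {'title': '某校花', 'href': '/p-1-100.html',
        'imgsrc': 'http://www.xiaohuar.com/d/file/pic.jpg'}

# ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes.
line = json.dumps(item, ensure_ascii=False) + '\n'

# Reading the file back later: parse each non-empty line on its own.
records = [json.loads(l) for l in line.splitlines() if l]
```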

5. Enable the pipeline (in settings.py)

ITEM_PIPELINES = {

    'xiaohua.pipelines.XiaohuaPipeline': 300,

}

6. Middleware, meant to use Selenium (in the end Selenium was not used)

 

7. Enable the middleware (in settings.py, configured the same way as the pipeline; not used here)

 

8. Run the crawl from a terminal (in the spider/exec/xiaohua directory):

scrapy crawl xiaohua
