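Crawling and Saving Website Images with Python Scrapy
Coding tool
Visual Studio Code
Steps
1. Create the ImageSpider project
Open a new terminal in VS Code and enter the following commands in order: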
scrapy startproject ImageSpider
cd ImageSpider
code .
Open the ImageSpider project.
2. Source code
pipelines.py
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class ImagespiderPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Overrides the ImagesPipeline hook that decides which URLs to download
        for image_url in item['imgurls']:
            # Build a scrapy.Request for each image URL; the pipeline downloads it
            yield Request(image_url)
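Because this pipeline does not override file naming, downloaded images keep the ImagesPipeline default. As far as I know, current Scrapy releases name each file after the SHA-1 hash of its URL and place it in a full/ subdirectory of IMAGES_STORE. The sketch below reproduces that naming in plain Python so a saved file can be traced back to its URL; the helper name default_image_path and the example URL are illustrative assumptions, not part of Scrapy or of this project.

```python
import hashlib


def default_image_path(image_url: str) -> str:
    """Mimic the relative path ImagesPipeline assigns to an image by default."""
    # The file name is the hex SHA-1 digest of the request URL
    image_guid = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    # Downloaded images are stored as .jpg files under "full/"
    return f"full/{image_guid}.jpg"


# Hypothetical image URL, used only for illustration
example_path = default_image_path("http://example.com/pictures/cat.png")
print(example_path)
```

The same URL always maps to the same path, so re-running the crawl does not produce differently named copies of an image.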
settings.py
BOT_NAME = 'ImageSpider'
SPIDER_MODULES = ['ImageSpider.spiders']
NEWSPIDER_MODULE = 'ImageSpider.spiders'
ROBOTSTXT_OBEY = False
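ITEM_PIPELINES = {
    'ImageSpider.pipelines.ImagespiderPipeline': 300,
}
# Directory where downloaded images are stored
# (raw string, so the backslash is not treated as an escape character)
IMAGES_STORE = r'D:\MyImages'
# Wait 0.3 seconds between requests
DOWNLOAD_DELAY = 0.3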
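Windows paths such as the IMAGES_STORE value deserve some care, because a backslash inside an ordinary Python string can start an escape sequence. The sketch below compares three safe spellings of the same directory and shows one way it can go wrong; the folder name new_images is a hypothetical example chosen only because it begins with the letter n.

```python
from pathlib import PureWindowsPath

raw_path = r'D:\MyImages'        # raw string: the backslash is kept literally
escaped_path = 'D:\\MyImages'    # ordinary string with an escaped backslash
forward_path = 'D:/MyImages'     # forward slashes are also accepted on Windows

# The first two spellings produce identical text
same_text = raw_path == escaped_path
# All three refer to the same location once parsed as Windows paths
same_location = PureWindowsPath(raw_path) == PureWindowsPath(forward_path)

# Hypothetical folder name: here "\n" silently becomes a newline character
risky_path = 'D:\new_images'
has_newline = '\n' in risky_path

print(same_text, same_location, has_newline)
```

Using a raw string or forward slashes avoids depending on which letters happen to follow the backslash.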
scrapy.cfg
[settings]
default = ImageSpider.settings
[deploy]
project = ImageSpider
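scrapy.cfg is a plain INI file: the default key under [settings] names the Python module that holds the project settings. As a small illustration, the sketch below parses the same text with the standard configparser module; the variable names are assumptions made for this example.

```python
import configparser

# The scrapy.cfg contents from above, embedded as a string for the demo
SCRAPY_CFG = """\
[settings]
default = ImageSpider.settings

[deploy]
project = ImageSpider
"""

config = configparser.ConfigParser()
config.read_string(SCRAPY_CFG)

# Dotted module path of the settings file (ImageSpider/settings.py)
settings_module = config["settings"]["default"]
# Project name used when deploying
deploy_project = config.get("deploy", "project")

print(settings_module, deploy_project)
```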
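__init__.py
The __init__.py file (in the spiders directory) contains crawling code for two different image sites; uncomment the second block and comment out the first to crawl the other site.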
import scrapy


class ImageSpider(scrapy.Spider):
    name = 'ImgSpider'
    # start_urls lists the pages where the crawl begins
    start_urls = [
        # www.veer.com
        'https://www.veer.com/photo/?utm_source=baidu&utm_medium%20=cpc&utm_campaign=%E9%80%9A%E7%94%A8%E8%AF%8D&utm_content=%E9%80%9A%E7%94%A8%E8%AF%8D-%E5%9B%BE%E7%89%87&utm_term=%E5%9B%BE%E7%89%87&chid=901&bd_vid=7293469400167961779',
        # Scrapy lab
        # 'http://lab.scrapyd.cn/archives/55.html',
    ]

    # parse is the callback invoked once a page has been fetched successfully;
    # it defines what to do with the page. The response argument provides the
    # API for working with the page content.
    def parse(self, response):
        # www.veer.com
        imgurls = []
        for src in response.css("img::attr(src)").getall():
            # Resolve protocol-relative ("//host/a.jpg") and relative src values
            # against the page URL; absolute URLs are left unchanged
            imgurls.append(response.urljoin(src))
        yield {'imgurls': imgurls}

        # Scrapy lab
        # groupName = 'pic'
        # imgurls = response.css(".post img::attr(src)").getall()
        # yield {'imgurls': imgurls}
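The loop in parse has to turn every src attribute into a full URL before the pipeline can request it, and src values come in several shapes. The sketch below shows how the standard urljoin function resolves each shape against the page URL; Scrapy's response.urljoin performs the same kind of resolution against the response's own URL. The helper normalize_image_urls and the sample src values are assumptions made up for this illustration.

```python
from urllib.parse import urljoin, urlparse


def normalize_image_urls(page_url, srcs):
    """Resolve src values against page_url and keep only http(s) URLs."""
    resolved = []
    for src in srcs:
        absolute = urljoin(page_url, src.strip())
        # Skip inline images (data: URIs) and other non-downloadable schemes
        if urlparse(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


page = "https://www.veer.com/photo/"
# Hypothetical src values covering the common cases
sample_srcs = [
    "//img.example.com/a.jpg",       # protocol-relative
    "/static/b.png",                 # root-relative
    "c.gif",                         # relative to the current page
    "http://cdn.example.com/d.jpg",  # already absolute
    "data:image/png;base64,AAAA",    # inline image, nothing to download
]
image_urls = normalize_image_urls(page, sample_srcs)
for url in image_urls:
    print(url)
```

Simply prepending a scheme only works for the protocol-relative case; resolving against the page URL handles all of them.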
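3. Run the program
Open a terminal, change to the ImageSpider project directory, and enter the following command. The argument to crawl is the spider's name attribute (ImgSpider), not the project name.
scrapy crawl ImgSpider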
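After the crawl finishes, a quick way to confirm that images were saved is to count the files in the output folder. The sketch below assumes the pipeline's default layout, in which images are written as .jpg files to a full/ subdirectory of IMAGES_STORE; the function name count_downloaded_images is hypothetical, and the demo runs against a temporary directory rather than the real store.

```python
import tempfile
from pathlib import Path


def count_downloaded_images(images_store):
    """Count .jpg files in <images_store>/full; return 0 if the folder is missing."""
    full_dir = Path(images_store) / "full"
    if not full_dir.is_dir():
        return 0
    return sum(1 for path in full_dir.glob("*.jpg") if path.is_file())


# Demo on a throwaway directory standing in for IMAGES_STORE
with tempfile.TemporaryDirectory() as store:
    full_dir = Path(store) / "full"
    full_dir.mkdir()
    for name in ("a.jpg", "b.jpg", "notes.txt"):
        (full_dir / name).write_bytes(b"")
    demo_count = count_downloaded_images(store)
    empty_count = count_downloaded_images(Path(store) / "missing")

print(demo_count, empty_count)
```

For the real project, the call would be count_downloaded_images(r'D:\MyImages'), matching the IMAGES_STORE setting.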