Introduction
Tired of paying to read comics online? Then why not write a little program and read them for free!
Basic environment
Python 3.6
PyCharm
Install Scrapy
pip install scrapy
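To confirm that the install worked, you can run:
scrapy version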
Create a Scrapy project
Run on the command line:
scrapy startproject <project name>
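For example, assuming the project is called ManhuataiSpider (the name used in the settings paths later in this article):
scrapy startproject ManhuataiSpider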
Create the spider file
Run on the command line:
scrapy genspider <spider name> <domain>
Note: the spider name cannot be the same as the project name.
Domain: for example, for https://www.baidu.com, the domain is baidu.com.
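For the target in this article, the command would be (the spider name manhuatai and the domain manhuatai.com both reappear in the spider file below):
scrapy genspider manhuatai manhuatai.com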
Create the start file
Create start.py at the same level as the spiders directory and write:
from scrapy import cmdline
cmdline.execute('scrapy crawl <spider name>'.split())
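With the spider created later in this article (named manhuatai), the line would read:
cmdline.execute('scrapy crawl manhuatai'.split())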
Target URL
This article scrapes comics from Manhuatai (漫画台).
QAQ, I scraped this site once before, but this time its URL rules have changed. No matter: when we meet a mountain we cut a path, and when we meet a river we build a bridge.
We'll take the comic 一剑独尊 as the example:
Analyze the URL pattern
In Chrome DevTools, under Network - XHR - Preview, we can see the JSON data returned by the chapter request; by analyzing this JSON we can scrape the comic.
Comparing it with the URLs of other chapters, only the chapter_newid part changes, and the JSON returned by one chapter's request already contains the chapter_newid of the next chapter. With that, the URL pattern is worked out.
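As a quick sanity check outside of Scrapy, you can fetch one chapter's JSON directly and look at the fields the spider will use below. This is only a sketch using the requests library; the user-agent value is an arbitrary placeholder:
import requests

url = ('https://www.manhuatai.com/api/getchapterinfo?product_id=2&productname=mht'
       '&platformname=pc&comic_id=108393&chapter_newid=di1hua-1600506710013')
data = requests.get(url, headers={'user-agent': 'Mozilla/5.0'}).json()['data']
print(data['current_chapter']['chapter_name'])   # chapter title
print(data['current_chapter']['rule'])           # image URL rule containing a $$ placeholder
print(data['current_chapter']['end_num'])        # number of pages in the chapter
print(data['next_chapter']['chapter_newid'])     # id used to build the next chapter's URL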
Extract the data
The extraction works as follows:
# Extract the chapter name, the image URL rule and the total number of pages in the chapter
chapter_name = response.json()['data']['current_chapter']['chapter_name']
rule = response.json()['data']['current_chapter']['rule']
end_num = response.json()['data']['current_chapter']['end_num']
Because the rule we get back is only a partial URL template, it needs further processing to become a full image URL:
origin = 'https://mhpic.dm300.com'
for num in range(1, end_num + 1):
    whole_url = origin + rule.replace('$$', str(num)) + '-mht.middle.webp'
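For illustration only, here is what the replacement does; the rule value below is hypothetical, the real one comes from the API response:
rule = '/comic/xxx/yyy$$'   # hypothetical value, just to show the $$ placeholder
print('https://mhpic.dm300.com' + rule.replace('$$', str(3)) + '-mht.middle.webp')
# https://mhpic.dm300.com/comic/xxx/yyy3-mht.middle.webp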
Putting the spider file together
import scrapy
from ..items import ManhuataispiderItem


class ManhuataiSpider(scrapy.Spider):
    name = 'manhuatai'
    allowed_domains = ['manhuatai.com']
    start_urls = [
        'https://www.manhuatai.com/api/getchapterinfo?product_id=2&productname=mht&platformname=pc&comic_id=108393&chapter_newid=di1hua-1600506710013']

    def parse(self, response):
        # Chapter name, image URL rule and total page count for the current chapter
        chapter_name = response.json()['data']['current_chapter']['chapter_name']
        rule = response.json()['data']['current_chapter']['rule']
        end_num = response.json()['data']['current_chapter']['end_num']
        origin = 'https://mhpic.dm300.com'
        for num in range(1, end_num + 1):
            # Fill the page number into the rule to get the full image URL
            whole_url = origin + rule.replace('$$', str(num)) + '-mht.middle.webp'
            yield ManhuataispiderItem(img_url=whole_url, chapter_name=chapter_name, num=num)
        # The JSON also carries the chapter_newid of the next chapter; follow it
        chapter_newid = response.json()['data']['next_chapter']['chapter_newid']
        if chapter_newid:
            next_url = f'https://www.manhuatai.com/api/getchapterinfo?product_id=2&productname=mht&platformname=pc&comic_id=108393&chapter_newid={chapter_newid}'
            yield scrapy.Request(url=next_url, callback=self.parse)
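The spider imports ManhuataispiderItem from items.py; that file is not shown above, but a minimal definition only needs the three fields the spider fills in (a sketch based on the fields used in parse):
import scrapy


class ManhuataispiderItem(scrapy.Item):
    img_url = scrapy.Field()       # full image URL of one page
    chapter_name = scrapy.Field()  # chapter title, later used as the folder name
    num = scrapy.Field()           # page number, later used as the file name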
Request header setup
The request headers are set in the middlewares file:
import faker


class HeadersDownloaderMiddleware:
    """Headers downloader middleware"""

    def process_request(self, request, spider):
        # process_request receives every outgoing request
        fake = faker.Faker()
        # request.headers is the request's header dict; add a random user-agent
        request.headers.update(
            {
                'user-agent': fake.user_agent(),
                'authority': 'www.manhuatai.com',
            }
        )
        return None
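faker is a third-party package; if it is not installed yet, run pip install faker. With the middleware enabled, every request goes out with a freshly generated user-agent.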
Save the data
Saving is done in the pipelines file. Because the images are binary data, we need to subclass ImagesPipeline and override a couple of its methods:
from scrapy.pipelines.images import ImagesPipeline
import scrapy


# Binary (image) downloads should inherit from ImagesPipeline
class DownloadPicturePipeline(ImagesPipeline):
    """Saves the binary image data"""

    # get_media_requests builds the requests for the binary data
    def get_media_requests(self, item, info):
        # Pull the image URL out of the item
        img_url = item['img_url']
        # Build the request for the binary data
        yield scrapy.Request(url=img_url)

    # file_path decides where each downloaded file is saved
    def file_path(self, request, response=None, info=None, *, item=None):
        num = item['num']
        chapter_name = item['chapter_name']
        return f'{chapter_name}/{num}.jpg'
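The path returned by file_path is relative to IMAGES_STORE (configured below), so each page ends up at ./images/<chapter name>/<page number>.jpg. Note that ImagesPipeline needs Pillow, so run pip install pillow if it is missing.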
settings configuration
- ROBOTSTXT_OBEY = False
- DOWNLOADER_MIDDLEWARES = { 'ManhuataiSpider.middlewares.HeadersDownloaderMiddleware': 543, } (the middleware overrides process_request, which is a downloader middleware hook, so it must be registered under DOWNLOADER_MIDDLEWARES rather than SPIDER_MIDDLEWARES)
- ITEM_PIPELINES = { 'ManhuataiSpider.pipelines.DownloadPicturePipeline': 300, }
- IMAGES_STORE = './images' # this must be added, otherwise the images cannot be saved
Run
Open start.py, run it, and after a short wait you will see the comic being scraped.