Use multiple coroutines and a queue to crawl the data of Mtime's TV series Top 100 (title, director, leading actors and synopsis), and save the data with the csv module (file name: time100.csv).
Mtime TV series ranking list: http://list.mtime.com/listIndex
Key points:
The site uses cookie-based anti-scraping, so you need to copy your request headers exactly, for example:
a = '''Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Connection: keep-alive
Cookie: userId=0; defaultCity=%25E5%258C%2597%25E4%25BA%25AC%257C290; waf_cookie=59ca4180-5a16-459e122021f2731eb3889667e33bee3b5cd0; _ydclearance=dca49a10afc623028d11eefe-48d8-4053-bde0-dea67b20ab57-1586501304; userCode=20204101248277038; userIdentity=2020410124827743; tt=731C76D4E29CB5ED5BD5F19F3774A2AC; Hm_lvt_6dd1e3b818c756974fb222f0eae5512e=1586494108; __utma=196937584.377597232.1586494108.1586494108.1586494108.1; __utmc=196937584; __utmz=196937584.1586494108.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmt=1; _utmt~1=1; __utmb=196937584.18.10.1586494108; Hm_lpvt_6dd1e3b818c756974fb222f0eae5512e=1586495472
Host: www.mtime.com
Referer: http://www.mtime.com/top/tv/top100/index-2.html
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'''
Note that the copied headers string has to be split line by line and converted into a dictionary, which takes a bit of string handling (a sketch follows below):
Split the string into lines: for line in a.split('\n')
Split each line once on ': ' to get a key/value pair: line.split(': ', 1)
Once the key-value pairs are formed, pass them to dict() to convert them into a dictionary
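A minimal sketch of that conversion, assuming the copied header text has been pasted into a triple-quoted string a (the shortened string below is only for illustration; in practice paste your own complete headers):

a = '''Accept-Language: zh-CN,zh;q=0.9
Connection: keep-alive
Host: www.mtime.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'''
# split the pasted text into lines, split each line once on ': ' into a key/value pair,
# and let dict() turn the pairs into a headers dictionary
headers = dict(line.split(': ', 1) for line in a.split('\n'))
print(headers['Host'])  # www.mtime.com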
Packages to import:
from gevent import monkey
monkey.patch_all()
import gevent,requests,bs4,csv
from gevent.queue import Queue
Key points for a multi-coroutine crawler with gevent (a sketch follows this list):
Define the crawl function
Create tasks with gevent.spawn()
Run the tasks with gevent.joinall()
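A minimal sketch of that pattern, assuming a placeholder crawl function crawler() and a small url_list (the list-page URLs are only examples):

from gevent import monkey
monkey.patch_all()  # patch the standard library before importing requests
import gevent
import requests

url_list = ['http://www.mtime.com/top/tv/top100/',
            'http://www.mtime.com/top/tv/top100/index-2.html']

def crawler(url):  # the crawl function: each task fetches one URL
    r = requests.get(url)
    print(url, r.status_code)

tasks = [gevent.spawn(crawler, u) for u in url_list]  # create one task per URL
gevent.joinall(tasks)  # run all tasks and wait for them to finish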
Key points for the queue module (a sketch follows this list):
Create a queue with Queue()
Store data with put_nowait()
Extract data with get_nowait()
Other queue methods: empty() checks whether the queue is empty, full() checks whether it is full, and qsize() returns how many items are still in it.
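A minimal sketch of the queue methods listed above (the page URLs follow the pattern of the Referer shown earlier and are placeholders only):

from gevent.queue import Queue

work = Queue()  # create the queue
for i in range(2, 6):
    work.put_nowait('http://www.mtime.com/top/tv/top100/index-%d.html' % i)  # store data

print(work.qsize())  # 4 items are waiting in the queue
while not work.empty():  # keep taking URLs until the queue is empty
    url = work.get_nowait()  # extract data
    print(url)
print(work.full())  # False: a Queue created without maxsize is never full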
Steps for writing CSV (a sketch follows this list):
Create the file by calling open()
Create a writer object with csv.writer()
Write rows by calling the writer object's writerow() method
Close the file with close()
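A minimal sketch of those four steps (the column names match the task; the data row is a placeholder):

import csv

f = open('time100.csv', 'w', newline='', encoding='gb18030')  # 1. create the file; newline='' avoids blank lines on Windows
csv_write = csv.writer(f)  # 2. create the writer object
csv_write.writerow(['剧名', '导演', '主演', '简介'])  # 3. write a row with writerow()
csv_write.writerow(['示例剧名', '示例导演', '示例主演', '示例简介'])
f.close()  # 4. close the file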
Solution approach:
The site uses cookie-based anti-scraping, so copy your headers exactly; start by locating the headers.
Find the fields we need (title, director, actors, synopsis) and loop over the Top 100 entries; when a field comes back empty (i.e. director == '' or isinstance(director, str) == 0), record it as unknown to avoid errors.
Use gevent: create tasks with gevent.spawn(), run them with gevent.joinall(), and finally store the data in CSV format.
from json.decoder import JSONDecodeError
from gevent import monkey
monkey.patch_all()  # patch the standard library before importing requests so I/O becomes cooperative
import gevent
import requests
import csv

# shared result lists; they are pre-allocated so each coroutine fills its own slots
# and the Top 100 order is preserved even though the tasks run concurrently
id = []
director = [None] * 100
actor = [None] * 100
movie = [None] * 100
story = [None] * 100
task = []

# request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63',
    'Content-Type': 'application/json'
}

# JSON API that returns the top lists
url = 'http://front-gateway.mtime.com/library/index/app/topList.api?tt=1616811596867&'

def get_ids():
    # fetch the TV top list once and collect the movieId of the first 100 entries
    request = requests.get(url, headers=headers)
    html = request.json()
    items = html['data']['tvTopList']['topListInfos'][0]['items']
    for i in range(100):  # first 100 TV series
        tvid = items[i]['movieInfo']['movieId']
        id.append(tvid)

def catch(x, y):
    # fetch the detail JSON for entries x to y-1 and fill the result lists
    for i in range(x, y):
        url1 = 'http://front-gateway.mtime.com/library/movie/detail.api?tt=1617412224076&movieId=' + str(id[i]) + '&locationId=290'
        try:
            request = requests.get(url=url1, headers=headers)
            tvhtml = request.json()
        except JSONDecodeError:
            # the response occasionally is not valid JSON; retry once
            request = requests.get(url=url1, headers=headers)
            tvhtml = request.json()
        basic = tvhtml['data']['basic']
        # director: fall back to the English name, or write 未知 (unknown) if the field is empty
        Movie = basic['director']
        if Movie is None:
            director[i] = '未知'
        elif Movie['name'] == '':
            director[i] = Movie['nameEn']
        else:
            director[i] = Movie['name']
        # actors: prefer the Chinese name, fall back to the English name
        a = []
        for j in basic['actors']:
            if j['name'] == '':
                a.append(j['nameEn'])
            else:
                a.append(j['name'])
        actor[i] = a
        # title
        movie[i] = basic['name']
        # synopsis: write 未知 (unknown) if the field is empty
        simple = basic['story']
        if simple is None:
            story[i] = '未知'
        else:
            story[i] = simple

if __name__ == '__main__':
    get_ids()
    # split the 100 entries into 10 tasks of 10 and run them as coroutines
    x = 0
    for i in range(10):
        task1 = gevent.spawn(catch, x, x + 10)  # pass the function and its arguments; do not call it here
        task.append(task1)
        x = x + 10
    gevent.joinall(task)
    # write the results to CSV
    f = open('time100.csv', 'w', newline='', encoding='gb18030')
    csv_write = csv.writer(f)
    for i in range(100):
        csv_write.writerow(['电视剧', movie[i]])
        csv_write.writerow(['导演', director[i]])
        csv_write.writerow(['演员'])
        for x in actor[i]:
            csv_write.writerow([x])
        csv_write.writerow(['剧情', story[i]])
    print("完成")
    f.close()