Learning Web Scraping: Scrape Center Challenge (ssr1)

Scenario

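I have recently been learning web scraping, and I practice on the environments provided by https://scrape.center/.
The first level (ssr1) has no anti-scraping restrictions. The goal is to scrape every movie's URL, title, theme (categories), score, and plot summary.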

Tools

The script mainly uses the requests library and BeautifulSoup, and at the end exports the results to a CSV file with pandas.
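As a rough illustration of the export step, here is a minimal sketch that writes one row per movie using only the standard library's csv module instead of pandas. The function name rows_to_csv, the column names, and the sample record are all made up for this sketch; the record is a placeholder, not scraped data.

```python
import csv
import io

# Hypothetical column names, chosen for this sketch only.
FIELDS = ["url", "title", "theme", "score", "summary"]


def rows_to_csv(rows):
    """Serialize a list of per-movie dicts to CSV text, header row first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # A field missing from a record is written as an empty cell,
        # so every row keeps the same number of columns.
        writer.writerow({name: row.get(name, "") for name in FIELDS})
    return buffer.getvalue()


# Placeholder record, not real scraped data.
sample = [{"url": "https://example.com/detail/1", "title": "Example Movie", "score": "9.0"}]
print(rows_to_csv(sample))
```

Keeping one dict per movie means a missing field only leaves one cell empty, whereas separate per-column lists can drift out of step with each other.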

Code

import pandas as pd
import requests
import urllib3
from bs4 import BeautifulSoup

# Requests below are sent with verify=False (no certificate verification),
# so silence the InsecureRequestWarning urllib3 would print for each one.
urllib3.disable_warnings()

BASE_URL = 'https://ssr1.scrape.center'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/87.0.4280.141 Safari/537.36'}


def text_of(tag):
    # Return the tag's text, or '' when the element is missing from the page.
    return tag.text if tag is not None else ''


# Step 1: walk the 10 list pages and collect every movie's detail-page URL.
url = []
for page in range(1, 11):
    the_url = BASE_URL + '/page/' + str(page)
    html = requests.get(the_url, headers=headers, verify=False)
    soup = BeautifulSoup(html.content, 'lxml')
    for link in soup.find_all(class_='name'):
        url.append(BASE_URL + link['href'])

# Step 2: visit each detail page and build one row per movie, so the URL
# always stays aligned with its fields even if an element is missing.
rows = []
for a in url:
    html = requests.get(a, headers=headers, verify=False)
    soup = BeautifulSoup(html.content, 'lxml')
    title = text_of(soup.find(class_='m-b-sm'))
    theme = text_of(soup.find(class_='categories'))
    score = text_of(soup.find(class_='score m-t-md m-b-n-sm'))
    content = text_of(soup.find('div', class_='drama'))
    rows.append({
        'URL': a,
        'Title': title.strip(),
        'Theme': theme.replace('\n', '').replace('\r', ''),
        'Score': score.strip(),
        # '剧情简介' ("Plot summary") is the heading inside the drama block; drop it.
        'Plot summary': content.replace('剧情简介', '').replace('\n', '').replace('\r', '').strip(),
    })

# Step 3: export. utf-8-sig lets Excel detect the encoding of the Chinese text.
work = pd.DataFrame(rows)
work.to_csv('work.csv', index=False, encoding='utf-8-sig')
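The URL building and text cleanup in the script can also be pulled out into small pure functions, which makes them easy to check without touching the network. The following is only a sketch under assumptions: the helper names list_page_urls, absolute_url, and clean_text are hypothetical, and '/detail/1' in the usage lines is an illustrative href rather than a value taken from the site. Note that clean_text collapses whitespace runs into single spaces, which is a slight variation on the script's approach of deleting line breaks outright.

```python
from urllib.parse import urljoin

BASE_URL = "https://ssr1.scrape.center"


def list_page_urls(first=1, last=10):
    """Build the numbered list-page URLs, inclusive on both ends."""
    return [f"{BASE_URL}/page/{n}" for n in range(first, last + 1)]


def absolute_url(href):
    """Resolve a link found on a list page against the site root."""
    return urljoin(BASE_URL + "/", href)


def clean_text(text, drop=""):
    """Remove an optional label, then collapse whitespace runs to single spaces."""
    if drop:
        text = text.replace(drop, "")
    return " ".join(text.split())


# Usage with illustrative inputs (no network access needed).
print(list_page_urls()[0])
print(absolute_url("/detail/1"))
print(clean_text("剧情简介\n  some   summary text \r\n", drop="剧情简介"))
```

Using urljoin instead of string concatenation also copes with hrefs that lack a leading slash or are already absolute.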

