Study Summary for October 22
I. Identifying yourself to the server with a Cookie to get past IP-ban anti-scraping measures
A Cookie is temporary data that the server writes to the browser. It is often used for user tracking (remembering who the user is).
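To make this concrete, here is a minimal sketch of how the `Cookie` request header copied from the browser's developer tools can be split into name/value pairs. The helper name `parse_cookie_header` and the sample cookie names and values are made up for illustration; real values come from your own logged-in session.

```python
def parse_cookie_header(header):
    """Split a 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for item in header.split(';'):
        item = item.strip()
        # Skip empty fragments and fragments without '=' instead of crashing.
        if '=' not in item:
            continue
        # Split on the first '=' only, because values may contain '='.
        name, value = item.split('=', 1)
        cookies[name.strip()] = value.strip()
    return cookies


if __name__ == '__main__':
    sample = 'bid=abc123; token=a=b=c'  # hypothetical values
    print(parse_cookie_header(sample))
```

The resulting dict can be passed to `requests.get(..., cookies=...)`, which accepts a plain dict as well as a `RequestsCookieJar`.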

    import re
    import time

    import requests
    from requests.cookies import RequestsCookieJar

    cookie_jar = RequestsCookieJar()
    # Placeholder: paste the Cookie request header copied from the browser here.
    cookie_str = 'Cookie'
    for item in cookie_str.split('; '):
        # Skip fragments without '=' so the placeholder does not raise ValueError.
        if '=' not in item:
            continue
        key, value = item.split('=', maxsplit=1)
        cookie_jar.set(key, value)

    for page in range(6):
        resp = requests.get(
            url='https://movie.douban.com/top250',
            params={
                'start': page * 25,
                'filter': ''
            },
            # '~' is a placeholder: copy the real header values from the browser.
            headers={
                'Accept-Language': '~',
                'User-Agent': '~'
            },
            cookies=cookie_jar
        )
        if resp.status_code == 200:
            titles = re.findall(r'<span class="title">([^&;]*?)</span>', resp.text)
            ratings = re.findall(r'<span class="rating_num" property="v:average">(.*?)</span>', resp.text)
            mottos = re.findall(r'<span class="inq">(.*?)</span>', resp.text)
            for title, rating, motto in zip(titles, ratings, mottos):
                print('Title:', title)
                print('Rating:', rating)
                print('Quote:', motto)
                print('-' * 50)
        # Pause between pages to avoid sending requests too quickly.
        time.sleep(2)
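
One caveat with three independent `re.findall` calls: if an entry has no `<span class="inq">` quote, `zip` silently pairs later quotes with the wrong titles. A possible workaround, sketched below, is to cut the page into one chunk per movie first and search inside each chunk. The HTML here is a hand-written, simplified stand-in, and the assumption that every entry starts with `<div class="item">` should be checked against the real page. The name `parse_items` is hypothetical.

```python
import re

# Hand-written, simplified stand-in for a Top 250 page (an assumption, not real markup).
SAMPLE_HTML = '''
<div class="item">
  <span class="title">Movie A</span>
  <span class="rating_num" property="v:average">9.7</span>
  <span class="inq">Quote for A</span>
</div>
<div class="item">
  <span class="title">Movie B</span>
  <span class="rating_num" property="v:average">9.5</span>
</div>
'''

TITLE = re.compile(r'<span class="title">([^&;]*?)</span>')
RATING = re.compile(r'<span class="rating_num" property="v:average">(.*?)</span>')
MOTTO = re.compile(r'<span class="inq">(.*?)</span>')


def parse_items(html):
    """Return one dict per movie; 'motto' is None when the quote is missing."""
    results = []
    # Everything after each '<div class="item">' marker belongs to one movie.
    for chunk in html.split('<div class="item">')[1:]:
        title = TITLE.search(chunk)
        rating = RATING.search(chunk)
        motto = MOTTO.search(chunk)
        if title is None or rating is None:
            continue  # not a complete entry, skip it
        results.append({
            'title': title.group(1),
            'rating': rating.group(1),
            'motto': motto.group(1) if motto else None,
        })
    return results


if __name__ == '__main__':
    for movie in parse_items(SAMPLE_HTML):
        print(movie)
```

Because each movie is parsed on its own, a missing quote only affects that one record instead of shifting every record after it.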
II. Accessing Douban through proxy servers to get past IP-ban anti-scraping measures
Commercial IP proxy providers: 蘑菇代理, 芝麻代理, 快代理, 讯代理

    import requests

    # '~' is a placeholder for the proxy provider's API URL that returns the proxy list.
    resp = requests.get(
        url='~'
    )
    print(resp.json())

    import random
    import re

    import requests

    # Placeholder entries: fill in the ip/port pairs returned by the proxy provider.
    proxies_list = [{'ip': '~', 'port': '~'}]

    for page in range(6):
        # Pick a random proxy for each request.
        proxy_dict = random.choice(proxies_list)
        print(proxy_dict)
        resp = requests.get(
            url='https://movie.douban.com/top250',
            params={
                'start': page * 25,
                'filter': ''
            },
            # '~' is a placeholder: copy the real header values from the browser.
            headers={
                'Accept-Language': '~',
                'User-Agent': '~',
            },
            proxies={
                'http': f'http://{proxy_dict["ip"]}:{proxy_dict["port"]}',
                'https': f'http://{proxy_dict["ip"]}:{proxy_dict["port"]}',
            },
            timeout=5,
            # Disables TLS certificate verification; requests will emit a warning.
            verify=False
        )
        print(resp.status_code)
        if resp.status_code == 200:
            titles = re.findall(r'<span class="title">([^&;]*?)</span>', resp.text)
            ratings = re.findall(r'<span class="rating_num" property="v:average">(.*?)</span>', resp.text)
            mottos = re.findall(r'<span class="inq">(.*?)</span>', resp.text)
            for title, rating, motto in zip(titles, ratings, mottos):
                print('Title:', title)
                print('Rating:', rating)
                print('Quote:', motto)
                print('-' * 50)
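
Proxies from a pool fail often, so a single `random.choice` per page can easily hit a dead one. The sketch below tries several randomly chosen proxies until one works. The function names, the `max_attempts` parameter and the sample addresses are all hypothetical, and `fetch` is passed in as a parameter so the rotation logic can be tried without any network access. In real use it would wrap a `requests.get(..., proxies=proxies, timeout=5)` call.

```python
import random


def build_proxies(proxy):
    """Turn {'ip': ..., 'port': ...} into the mapping that requests expects."""
    address = f'http://{proxy["ip"]}:{proxy["port"]}'
    return {'http': address, 'https': address}


def fetch_with_rotation(fetch, proxies_list, max_attempts=3):
    """Call fetch(proxies) with randomly chosen proxies until one succeeds.

    fetch is any callable that takes a proxies mapping and either returns
    a result or raises an exception (for example on a timeout).
    """
    count = min(max_attempts, len(proxies_list))
    candidates = random.sample(proxies_list, k=count)
    last_error = None
    for proxy in candidates:
        try:
            return fetch(build_proxies(proxy))
        except Exception as error:  # broad on purpose: any failure means "try the next proxy"
            last_error = error
    raise RuntimeError('all proxies failed') from last_error


if __name__ == '__main__':
    # Hypothetical addresses from a documentation-only range, not real proxies.
    pool = [{'ip': '192.0.2.1', 'port': '8080'}, {'ip': '192.0.2.2', 'port': '8080'}]

    def fake_fetch(proxies):
        # Pretend the first proxy is dead and the second one works.
        if '192.0.2.1' in proxies['http']:
            raise TimeoutError('proxy did not respond')
        return f'fetched via {proxies["http"]}'

    print(fetch_with_rotation(fake_fetch, pool))
```

Sampling without replacement means the same dead proxy is never retried within one call, at the cost of giving up once the sampled candidates are exhausted.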
III. Parsing pages with CSS (Cascading Style Sheets) selectors
Library setup: pip install beautifulsoup4
soup = bs4.BeautifulSoup(html_code, 'html.parser')
soup.select('selector') -> [Tag]
soup.select_one('selector') -> Tag -> text / attrs
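
The three calls above can be tried on a small inline document before pointing them at a real page. This is only a sketch: the HTML is hand-written for the demonstration and the URLs are placeholders. It uses the built-in `'html.parser'`, so only `beautifulsoup4` needs to be installed.

```python
import bs4

# Hand-written HTML used only for this demonstration.
HTML = '''
<div class="hd">
  <a href="https://example.com/1" title="first"><span class="title">Movie A</span></a>
  <a href="https://example.com/2"><span class="title">Movie B</span></a>
</div>
'''

soup = bs4.BeautifulSoup(HTML, 'html.parser')

# select() returns a list of Tag objects (empty when nothing matches).
anchors = soup.select('div.hd > a')
hrefs = [a.attrs['href'] for a in anchors]

# select_one() returns the first matching Tag, or None when nothing matches.
first_title = soup.select_one('a[title] > span.title')
missing = soup.select_one('p.quote')

print(hrefs)
print(first_title.text)
print(missing)
```

Since `select_one` can return `None`, it is worth checking the result before reading `.text` or `.attrs` on pages where an element may be absent.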
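
    import bs4
    import requests
    from bs4 import Tag

    resp = requests.get('https://www.sohu.com')
    # The 'lxml' parser requires the lxml package (pip install lxml).
    soup = bs4.BeautifulSoup(resp.text, 'lxml')
    # Select every <a> tag that has a title attribute.
    anchors_list = soup.select('a[title]')
    for anchor in anchors_list:  # type: Tag
        # href may be absent, so use get() instead of indexing.
        print(anchor.attrs.get('href'))
        print(anchor.attrs['title'])
    print(len(anchors_list))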

    import bs4
    import requests

    for page in range(6):
        resp = requests.get(
            url='https://movie.douban.com/top250',
            params={
                'start': page * 25,
                'filter': ''
            },
            # '~' is a placeholder: copy the real header values from the browser.
            headers={
                'Accept-Language': '~',
                'User-Agent': '~',
            },
            # Local SOCKS5 proxy; SOCKS support needs pip install "requests[socks]".
            proxies={
                'http': 'socks5://127.0.0.1:1086',
                'https': 'socks5://127.0.0.1:1086',
            }
        )
        print(resp.status_code)
        if resp.status_code == 200:
            soup = bs4.BeautifulSoup(resp.text, 'html.parser')
            # First title span, rating span and quote span of each movie.
            title_spans = soup.select('div.hd > a > span.title:nth-child(1)')
            rating_spans = soup.select('div.bd > div > span.rating_num')
            motto_spans = soup.select('div.bd > p.quote > span')
            for title_span, rating_span, motto_span in zip(title_spans, rating_spans, motto_spans):
                print(title_span.text)
                print(rating_span.text)
                print(motto_span.text)
                print('-' * 50)