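1. Introduction
Tech stack: Python 3, requests, re, BeautifulSoup (with the html5lib parser)
Target site: https://www.umei.net/p/gaoqing/cn/
Disclaimer: for learning purposes only. Do not use commercially!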
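2. Getting started
2.1 Steps
- Fetch the HTML
- Clean the data (extract the image tags)
- Read the src attribute inside each image tag
- Request each image and save it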
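Steps 2 and 3 can be tried offline before touching the network. The sketch below applies the same `src="(.*?)"` pattern used in this article to a small hand-written HTML fragment. The fragment, the example.com URLs and the function name `extract_img_srcs` are made up for illustration and are not taken from the target site.

```python
import re

# Hypothetical HTML fragment standing in for the downloaded page
SAMPLE_HTML = '''
<div class="TypeList">
  <a class="TypeBigPics" href="/p/gaoqing/cn/1.htm"><img src="http://example.com/a.jpg" alt="a"></a>
  <a class="TypeBigPics" href="/p/gaoqing/cn/2.htm"><img SRC="http://example.com/b.jpg" alt="b"></a>
</div>
'''


def extract_img_srcs(html):
    """Return the src value of every <img> tag found in html."""
    srcs = []
    # Step 2: isolate the <img ...> tags
    for tag in re.findall(r'<img\b[^>]*>', html, re.I):
        # Step 3: pull the src attribute out of each tag
        match = re.search(r'src="(.*?)"', tag, re.I)
        if match:
            srcs.append(match.group(1))
    return srcs


print(extract_img_srcs(SAMPLE_HTML))
```

The `re.I` flag is what lets the pattern match the upper-case `SRC` in the second tag.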
2.2 Implementation
import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.umei.net/p/gaoqing/cn/'
r = requests.get(url)
# with open('./meinv.html', 'wb+') as f:
#     f.write(r.content)
if r.status_code == 200:
    imgs = []
    soup = BeautifulSoup(r.content, 'html5lib')
    # Select every thumbnail img inside the .TypeBigPics elements
    img_list = soup.select('.TypeBigPics img')
    for i in img_list:
        # print(i)
        res = re.search('src="(.*?)"', str(i), re.M | re.I)
        if res:  # skip tags that have no src attribute
            imgs.append(res.group(1))
    for i, k in enumerate(imgs):
        # print(i, type(k))
        ans = requests.get(k)
        if ans.status_code == 200:
            # Save the images as 0.jpg, 1.jpg, ...
            with open(str(i) + '.jpg', 'wb+') as f:
                f.write(ans.content)
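The download loop assumes that every `src` is an absolute URL and that every file is a JPEG. Below is a minimal sketch of one way to relax both assumptions using only the standard library: `urljoin` resolves relative paths against the page URL, and the extension is read from the URL path. The helper names `absolute_url` and `local_name` and the sample paths are hypothetical.

```python
import os
from urllib.parse import urljoin, urlparse

PAGE_URL = 'https://www.umei.net/p/gaoqing/cn/'


def absolute_url(src, base=PAGE_URL):
    """Resolve src against the page URL; absolute URLs pass through unchanged."""
    return urljoin(base, src)


def local_name(index, img_url, default_ext='.jpg'):
    """Build a file name from the loop index and the extension in the URL path."""
    ext = os.path.splitext(urlparse(img_url).path)[1]
    # Fall back to .jpg when the URL path carries no extension
    return str(index) + (ext or default_ext)


print(absolute_url('/d/file/sample.png'))
print(local_name(3, 'http://example.com/pics/sample.png?v=1'))
```

In the loop, `requests.get(absolute_url(k))` and `open(local_name(i, k), 'wb')` would replace the hard-coded calls.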
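2.3 Results
2.4 Analyzing the results
- The images download successfully, but their resolution is too low, so the script can be improved further.
2.5 Optimization
- Analysis
Inspecting the page shows that clicking the a tag wrapping each image leads to the full-size version of that image.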
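That two-hop idea can be sketched offline as follows: collect the `href` of each anchor that wraps a thumbnail, then turn it into the URL of the detail page holding the large image. The HTML fragment and the name `detail_page_urls` are assumptions for illustration, and the real page markup may differ.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'https://www.umei.net/p/gaoqing/cn/'

# Hypothetical listing markup: each thumbnail sits inside an <a class="TypeBigPics">
SAMPLE_LISTING = '''
<a class="TypeBigPics" href="/p/gaoqing/cn/101.htm"><img src="http://example.com/t1.jpg"></a>
<a class="TypeBigPics" href="/p/gaoqing/cn/102.htm"><img src="http://example.com/t2.jpg"></a>
<a class="Other" href="/about.htm">about</a>
'''


def detail_page_urls(html, base=BASE_URL):
    """Return absolute URLs of the pages linked from the thumbnail anchors."""
    # Assumes the class attribute appears before href inside the tag
    hrefs = re.findall(r'<a\b[^>]*class="TypeBigPics"[^>]*href="(.*?)"', html, re.I)
    return [urljoin(base, h) for h in hrefs]


for page in detail_page_urls(SAMPLE_LISTING):
    print(page)
```

Each returned URL is a page to request in a second pass, where the large image's `src` is then extracted.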
2.6 Optimized code
import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.umei.net/p/gaoqing/cn/'
r = requests.get(url)
# with open('./meinv.html', 'wb+') as f:
#     f.write(r.content)
if r.status_code == 200:
    imgs = []
    soup = BeautifulSoup(r.content, 'html5lib')
    # Get the a tags that wrap each img
    aList = soup.select('.TypeBigPics')
    for item in aList:
        # Capture the part of the href that follows /cn/
        obj = re.search(r'.*?/cn/(.*?)".*', str(item), re.M | re.I)
        if obj:
            imgs.append(obj.group(1))
    ans_imgs = []
    for i, k in enumerate(imgs):
        # print(url + k)
        ans = requests.get(url + k)
        if ans.status_code == 200:
            soup1 = BeautifulSoup(ans.content, 'html5lib')
            imgBody = soup1.select('.ImageBody img')
            # print(imgBody)
            # Get the src of the full-size image
            obj = re.search('.*?src="(.*?)"', str(imgBody), re.M | re.I)
            if obj:  # skip detail pages where no image was found
                ans_imgs.append(obj.group(1))
    # print(ans_imgs)
    # Save the full-size images
    for i, k in enumerate(ans_imgs):
        b = requests.get(k)
        if b.status_code == 200:
            with open('./' + str(i) + '.jpg', 'wb+') as f:
                f.write(b.content)