唤醒手腕's Python Web Scraping Study Notes
1、Basics
Splitting a string
webString = 'www.baidu.com'
print(webString.split('.'))
# ['www', 'baidu', 'com']
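split also takes an optional maxsplit argument, and str.join reverses the operation; a quick sketch (the variable names here are my own):

```python
parts = 'www.baidu.com'.split('.')
print(parts)                          # ['www', 'baidu', 'com']
print('.'.join(parts))                # www.baidu.com
print('www.baidu.com'.split('.', 1))  # ['www', 'baidu.com'] (split at most once)
```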
Trimming leading/trailing whitespace, or stripping special characters
webString = ' www.baidu.com '
print(webString.strip())
# www.baidu.com
webString = '!*www.baidu.com*!'
print(webString.strip('!*'))
# www.baidu.com
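Two related methods, lstrip and rstrip, trim one side only. Also note that the argument to strip is treated as a set of characters to remove, not as a literal prefix/suffix, so the order inside '!*' does not matter. A short sketch:

```python
s = '  www.baidu.com  '
print(repr(s.lstrip()))   # 'www.baidu.com  '  (left side only)
print(repr(s.rstrip()))   # '  www.baidu.com'  (right side only)
# strip('*!') behaves the same as strip('!*') -- it is a character set
print('!*www.baidu.com*!'.strip('*!'))  # www.baidu.com
```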
String formatting
webString = '{}www.baidu.com'.format('https://')
print(webString)
# https://www.baidu.com
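Since Python 3.6, an f-string does the same job more concisely, and .format also accepts named placeholders; a sketch:

```python
scheme = 'https://'
url = f'{scheme}www.baidu.com'   # interpolate the variable directly
print(url)                       # https://www.baidu.com
# named placeholder with .format:
print('{scheme}www.baidu.com'.format(scheme='https://'))
```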
Custom functions
webString = input("Please input url = ")
print(webString)
def change_number(number):
return number.replace(number[3:7], '*'*4)
print(change_number("15916881234"))
# 159****1234
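One caveat: str.replace substitutes every occurrence of the substring, so the version above could mask the wrong digits if the middle segment also appeared elsewhere in the number. A slicing-based variant avoids that (mask_middle is a name I made up for this sketch):

```python
def mask_middle(s, start=3, end=7, ch='*'):
    # Rebuild the string: keep [0, start), mask [start, end), keep [end, len)
    return s[:start] + ch * (end - start) + s[end:]

print(mask_middle("15916881234"))  # 159****1234
```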
2、Basic scraping operations
First, install the third-party requests library (and beautifulsoup4 for parsing).
Note: constructing a BeautifulSoup object without naming a parser raises GuessedAtParserWarning: No parser was explicitly specified — pass one explicitly, e.g. "html.parser".
A basic request example
import requests
link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1'}
data = requests.get(link, headers=headers)
print(data.text)
Complete code example
import requests
from bs4 import BeautifulSoup
link = "http://www.santostang.com/"
headers = {
'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1'}
data = requests.get(link, headers=headers)
soup = BeautifulSoup(data.text, "html.parser")
print(soup.find("h1", class_="post-title").a.text)
# 第四章 – 4.3 通过selenium 模拟浏览器抓取  (the scraped title of the first post on the page)
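The same find / find_all pattern can be exercised without a network connection; here is a sketch against a hardcoded HTML snippet that imitates the blog's post-title markup (the snippet and titles are invented for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<h1 class="post-title"><a href="/post/1">First post</a></h1>
<h1 class="post-title"><a href="/post/2">Second post</a></h1>
'''
soup = BeautifulSoup(html, "html.parser")
# find returns only the first matching tag...
print(soup.find("h1", class_="post-title").a.text)   # First post
# ...while find_all returns every match
for h1 in soup.find_all("h1", class_="post-title"):
    print(h1.a.text)
```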