唤醒手腕 Python Web Crawler Study Notes (in progress, continuously updated)

1、Python basics

Splitting a string

webString = 'www.baidu.com'
print(webString.split('.'))
# ['www', 'baidu', 'com']
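
split() also takes an optional maxsplit argument that limits how many splits are made; a minimal sketch with the same example string:

webString = 'www.baidu.com'
# split only on the first '.'
print(webString.split('.', 1))
# ['www', 'baidu.com']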

Stripping whitespace or special characters from both ends of a string

webString = ' www.baidu.com '
print(webString.strip())

# www.baidu.com

webString = '!*www.baidu.com*!'
print(webString.strip('!*'))

# www.baidu.com
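
strip() only removes characters from both ends; lstrip() and rstrip() work on one side at a time. A minimal sketch with the same example string:

webString = '!*www.baidu.com*!'
print(webString.lstrip('!*'))
# www.baidu.com*!
print(webString.rstrip('!*'))
# !*www.baidu.com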

String formatting

webString = '{}www.baidu.com'.format('https://')
print(webString)

# https://www.baidu.com
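
On Python 3.6 and later, the same result can also be produced with an f-string; a minimal sketch:

protocol = 'https://'
webString = f'{protocol}www.baidu.com'
print(webString)
# https://www.baidu.com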

Defining your own functions

webString = input("Please input url = ")
print(webString)


def change_number(number):
    # replace digits 4-7 of the phone number with four asterisks
    return number.replace(number[3:7], '*' * 4)


print(change_number("15916881234"))
# 159****1234
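
Note that str.replace() substitutes every occurrence of the matched substring, so masking by position with slicing is a safer alternative; a minimal sketch of the same idea:

def change_number(number):
    # keep the first 3 and last 4 digits, mask the middle 4 with '*'
    return number[:3] + '*' * 4 + number[7:]

print(change_number("15916881234"))
# 159****1234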

2、Basic crawler operations

First, install the third-party requests library.
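
Both requests and beautifulsoup4 (needed for the parsing examples below) can be installed with pip:

pip install requests
pip install beautifulsoup4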

GuessedAtParserWarning: No parser was explicitly specified — this warning is raised when BeautifulSoup is created without an explicit parser.
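
Passing a parser explicitly, for example the built-in "html.parser", avoids the warning. A minimal sketch:

from bs4 import BeautifulSoup

# BeautifulSoup("<p>hi</p>") alone would emit GuessedAtParserWarning;
# naming the parser explicitly avoids it
soup = BeautifulSoup("<p>hi</p>", "html.parser")
print(soup.p.text)
# hi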

A basic request example

import requests

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1'}
data = requests.get(link, headers=headers)
print(data.text)
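
Before parsing the page it can help to check the response status code and encoding; a minimal sketch reusing the data response from the request above:

print(data.status_code)
# 200 means the request succeeded
print(data.encoding)
# encoding that requests guessed from the response headers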

The complete code

import requests
from bs4 import BeautifulSoup

link = "http://www.santostang.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1'}
data = requests.get(link, headers=headers)

soup = BeautifulSoup(data.text, "html.parser")
print(soup.find("h1", class_="post-title").a.text)

# 第四章 – 4.3 通过selenium 模拟浏览器抓取
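
find() returns only the first match, while find_all() returns every matching tag; a minimal sketch, assuming the other posts on the homepage use the same post-title class:

import requests
from bs4 import BeautifulSoup

link = "http://www.santostang.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1'}
data = requests.get(link, headers=headers)

soup = BeautifulSoup(data.text, "html.parser")
# print every post title on the page, not just the first one
for title in soup.find_all("h1", class_="post-title"):
    print(title.a.text.strip())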