Web Crawlers, Chapter 3

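Regular expressions: concept and purpose

Concept

A regular expression is a pattern that describes a set of strings; it is used to search for, match, and extract text.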
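To make the idea of a "pattern" concrete, here is a minimal sketch using Python's built-in re module. The sample text and variable names are made up for illustration only.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Updated on 2021-08-24, next update 2021-08-25."

# The pattern describes a shape rather than a fixed string:
# four digits, a dash, two digits, a dash, two digits.
date_pattern = r"\d{4}-\d{2}-\d{2}"

# re.search returns the first match object, or None if nothing matches.
first = re.search(date_pattern, text)
first_date = first.group() if first else None
print(first_date)  # 2021-08-24
```

The same pattern matches any date written in this shape, which is what distinguishes a regular expression from a plain substring search.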


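The re.findall() method

re.findall(pattern, string, flags=0) (important)

Purpose: scans the whole of string and returns a list of all non-overlapping matches of pattern.

Parameters:

- pattern: the regular expression

- string: the string to search

- flags: matching mode flags (default 0), for example re.DOTALL

Example: re.findall(r"\d", "chuan1zhi2") returns ['1', '2']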
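As a small sketch of what the flags parameter changes, the snippet below runs the same pattern with and without re.IGNORECASE. The sample text is hypothetical; re.IGNORECASE is a standard flag of the re module, chosen here only because its effect is easy to see.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Python python PYTHON"

# Without flags, matching is case-sensitive.
default_hits = re.findall(r"python", text)

# With re.IGNORECASE, letters match regardless of case.
ignorecase_hits = re.findall(r"python", text, flags=re.IGNORECASE)

print(default_hits)     # ['python']
print(ignorecase_hits)  # ['Python', 'python', 'PYTHON']
```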

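Example

import re

# findall returns a list of all matches
rs = re.findall(r'\d+', 'chuan13zhi24')
print(rs)

# Effect of the flags argument: with re.DOTALL (alias re.S), "." also matches "\n"
rs1 = re.findall('a.bc', 'a\nbc', re.DOTALL)
rs2 = re.findall('a.bc', 'a\nbc', re.S)
print(rs1)
print(rs2)

# Using groups with findall
# Without a group, the whole match is returned
rs3 = re.findall('a.+bc', 'a\nbc', re.DOTALL)
print(rs3)

# With a group, only the text captured by the group is returned
rs4 = re.findall('a(.+)bc', 'a\nbc', re.DOTALL)
print(rs4)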

Output

['13', '24']
['a\nbc']
['a\nbc']
['a\nbc']
['\n']
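The output above shows that one group makes findall return only the captured text. A related behaviour worth knowing is what happens with more than one group: findall then returns a list of tuples, one tuple per match. The sketch below reuses the sample string from the example; the variable names are illustrative.

```python
import re

text = "chuan13zhi24"

# No group: each item is the whole match.
no_group = re.findall(r"[a-z]+\d+", text)

# One group: each item is the text captured by that group.
one_group = re.findall(r"[a-z]+(\d+)", text)

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"([a-z]+)(\d+)", text)

print(no_group)    # ['chuan13', 'zhi24']
print(one_group)   # ['13', '24']
print(two_groups)  # [('chuan', '13'), ('zhi', '24')]
```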

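Using r raw strings in regular expressions

Writing the pattern as an r raw string stops Python from processing backslash escapes in the literal, so the regex engine receives the backslashes exactly as typed.

Rule of thumb: write the raw pattern with the same number of backslashes as you type when writing the target as an ordinary (non-raw) string literal. For example, the target "a\\nb" is matched by the raw pattern r"a\\nb".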

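Example

import re

# Pattern and target both contain a real newline character: match
rs0 = re.findall("a\nb", "a\nb")
print(rs0)

# Target is a, backslash, n, b; the regex sees \n and looks for a newline: no match
rs1 = re.findall("a\\nb", "a\\nb")
print(rs1)

# Without a raw string, four backslashes are needed to match one literal backslash
rs2 = re.findall("a\\\\nb", "a\\nb")
print(rs2)

# With a raw string, two backslashes are enough
rs3 = re.findall(r"a\\nb", "a\\nb")
print(rs3)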

Output

['a\nb']
[]
['a\\nb']
['a\\nb']
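The last two results are identical because the two pattern literals produce exactly the same pattern string. A short sketch, with illustrative variable names, that checks this directly:

```python
import re

# The same pattern written two ways.
normal_literal = "a\\\\nb"  # ordinary literal: four backslashes typed
raw_literal = r"a\\nb"      # raw literal: two backslashes typed

# Both literals produce the same 5-character pattern: a \ \ n b
same_pattern = normal_literal == raw_literal
pattern_length = len(raw_literal)

# The target holds a single real backslash: a \ n b (4 characters).
target = "a\\nb"
target_length = len(target)

matches = re.findall(raw_literal, target)

print(same_pattern)    # True
print(pattern_length)  # 5
print(target_length)   # 4
print(matches)         # ['a\\nb']
```

The list is printed with a doubled backslash only because print shows the repr of each string inside a list; the matched text still contains one backslash.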

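Example: extracting the JSON string of the latest epidemic data

Code

# 1. Import the required modules
import requests
from bs4 import BeautifulSoup
import re

# 2. Send the request and get the content of the epidemic home page
response = requests.get('https://ncov.dxy.cn/ncovh5/view/pneumonia')
home_page = response.content.decode()
# print(home_page)     # uncomment to inspect the page

# 3. Use BeautifulSoup to locate the script tag that holds the data
soup = BeautifulSoup(home_page, 'lxml')
script = soup.find(id='getListByCountryTypeService2true')
countries_text = script.string

# 4. Extract the JSON string: the greedy .* spans from the first "[" to the last "]"
json_str = re.findall(r'(\[.*\])', countries_text)
print(json_str)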

Result

['[{"id":10409664,"createTime":1629769854000,"modifyTime":1629769854000,"tags":"","countryType":2,"continents":"北美洲","provinceId":"8","provinceName":"美国","provinceShortName":"","cityName":"","currentConfirmedCount":6734506,"confirmedCount":37932709,"confirmedCountRank":1,"suspectedCount":0,"curedCount":30568819,"deadCount":629384,"deadCountRank":1,"deadRate":"1.65","deadRateRank":96,"comment":"","sort":0,"operator":"chengxinzhe1","locationId":971002,"countryShortCode":"USA","countryFullName":"United States of America","statisticsData":"https://file1.dxycdn.com/2020/0315/553/3402160512808052518-135.json","incrVo":{"currentConfirmedIncr":124655,"confirmedIncr":221550,"curedIncr":96015,"deadIncr":880},"showRank":true,"yesterdayConfirmedCount":2147383647,"yesterdayLocalConfirmedCount":2147383647,......
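re.findall returns a list, so the JSON text is its first element, and it is still a string rather than Python data. One possible next step, sketched below, is to parse it with the standard json module. To keep the sketch runnable offline, sample_text is a short hypothetical stand-in for the script content: its wrapper text and the "Exampleland" entry are assumptions, and only the first entry's field values are taken from the result above.

```python
import json
import re

# Hypothetical stand-in for the script tag's text (not the real page content).
sample_text = (
    'try { window.getListByCountryTypeService2true = '
    '[{"provinceName":"美国","confirmedCount":37932709,"deadCount":629384},'
    '{"provinceName":"Exampleland","confirmedCount":100,"deadCount":1}]'
    '}catch(e){}'
)

# Same pattern as in the example: greedy match from the first "[" to the last "]".
matches = re.findall(r"(\[.*\])", sample_text)

# findall returns a list; parse the first match if there is one.
countries = json.loads(matches[0]) if matches else []

names = [country["provinceName"] for country in countries]
print(names)
print(countries[0]["confirmedCount"])
```

Checking that matches is non-empty before indexing avoids an IndexError if the page layout changes and the pattern no longer finds anything.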
