Data parsing with bs4:
-The general principle of data parsing:
1. Locate the target tags
2. Extract the data values stored in the tags and in the tags' attributes
-How bs4 data parsing works:
1. Instantiate a BeautifulSoup object and load the page source into it
2. Call the BeautifulSoup object's attributes and methods to locate tags and extract data
-Environment setup:
1. pip install bs4
2. pip install lxml
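The two steps above can be sketched on a tiny hand-written HTML string (the snippet and its tag/attribute names are made up for illustration; the built-in html.parser is used here so the sketch runs even without lxml installed):

```python
from bs4 import BeautifulSoup

# Step 1: instantiate a BeautifulSoup object and load page source into it
# (a made-up snippet here, instead of a real page).
html = '<div class="title"><a href="/ch1">Chapter 1</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# Step 2: locate a tag, then extract its text and attribute data.
a = soup.find('a')
print(a.text)       # Chapter 1
print(a['href'])    # /ch1
```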
-How to instantiate a BeautifulSoup object:
1. from bs4 import BeautifulSoup
2. Instantiation:
1. Load the data of a local HTML file into the object:
fp=open('./dog.html','r',encoding='utf-8')
soup=BeautifulSoup(fp,'lxml')
2. Load page source fetched from the internet into the object:
page_text=response.text
soup=BeautifulSoup(page_text,'lxml')
3. Attributes and methods provided for data parsing:
1. soup.tagName: returns the first tag named tagName in the document
2. soup.find():
-find('tagName'): equivalent to soup.div
-Locating by attribute:
-soup.find('div',class_='song')
3. soup.find_all('tagName'): returns all matching tags (as a list)
4. select:
-select('some CSS selector (id, class, or tag selector)'): returns a list
-Hierarchical selectors:
-soup.select('.tang > ul > li > a'): > denotes one level
-soup.select('.tang > ul a'): a space denotes multiple levels
5. Getting the text between tags:
-soup.a.text / soup.a.string / soup.a.get_text()
-text / get_text(): returns all the text inside a tag, including text in nested tags
-string: returns only the text that is a direct child of the tag
6. Getting a tag's attribute value:
-soup.a['href']
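A compact sketch exercising each lookup above on a made-up snippet (the class names .song and .tang echo the examples above but are not from any real page; html.parser is used so the sketch runs without lxml):

```python
from bs4 import BeautifulSoup

html = """
<div class="song">
    <ul class="tang">
        <li><a href="http://example.com/1">first <b>poem</b></a></li>
        <li><a href="http://example.com/2">second poem</a></li>
    </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')  # 'lxml' behaves the same if installed

print(soup.a)                              # first <a> tag in the document
print(soup.find('div', class_='song'))     # locating by attribute
print(len(soup.find_all('a')))             # all matching tags, as a list -> 2
print(soup.select('.tang > li > a')[0]['href'])  # CSS levels + attribute value
print(soup.a.text)    # all nested text: 'first poem'
print(soup.a.string)  # only direct text; None here because of the nested <b>
```

Note how .text flattens the nested `<b>` tag while .string returns None as soon as the tag has more than one child, which is the difference the notes above describe.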
Example 1: scraping the chapter list of one author's work on Jinjiang Literature City (jjwxc)
from bs4 import BeautifulSoup
import re

if __name__ == "__main__":
    # load the data of a local HTML file into the object
    fp = open('./text.html', 'r', encoding='utf-8')
    soup = BeautifulSoup(fp, 'lxml')
    # print(soup.a)  # soup.tagName returns the first <a></a> tag
    # print(soup.find('tr', itemprop='chapter'))
    # print(soup.find_all('tr', itemprop='chapter'))
    # print(soup.select('#oneboolt > tbody > tr:nth-child(5)'))
    # print(type(soup.select('table tr')[5]))

    chapterlist = soup.find_all('tr', attrs={'itemprop': 'chapter', 'itemscope': '',
                                             'itemtype': 'http://schema.org/Chapter'})
    rows = [tr.text for tr in chapterlist]
    # split on runs of whitespace (spaces, tabs, \r, \n); note that the
    # zero-width pattern r"\s*|\t|\r|\n" would split between every character
    splitter = re.compile(r"\s+")
    for text in rows:
        # filter with a comprehension instead of calling remove() while
        # iterating, which skips elements
        txtlist = [txt for txt in splitter.split(text) if txt]
        print(txtlist)
Example 2: scraping the novel Romance of the Three Kingdoms
1. Using the select selector
# https://www.shicimingju.com/book/sanguoyanyi.html

import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',
        'Cookie': 'Hm_lvt_649f268280b553df1f778477ee743752=1613016932; key_kw=; key_cate=zuozhe; Hm_lpvt_649f268280b553df1f778477ee743752=1613016981'
    }
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    page_txt = response.text

    # parse the chapter titles and the detail-page URLs
    soup = BeautifulSoup(page_txt, 'lxml')
    li_list = soup.select('#main_left > div > div.book-mulu > ul > li')

    fp = open('./三国演义.txt', 'w', encoding='utf-8')
    for li in li_list:
        title = li.a.string
        title_url = 'https://www.shicimingju.com' + li.a['href']

        # fetch the HTML of this chapter
        detail_response = requests.get(url=title_url, headers=headers)
        detail_response.encoding = 'utf-8'
        detail_html = detail_response.text

        # parse the chapter HTML
        title_soup = BeautifulSoup(detail_html, 'lxml')
        content = title_soup.select('#main_left > div.card.bookmark-list > div')[0].text  # select returns a list
        fp.write('\n' + title + content + '\n')
        print(title + ' scraped successfully!')

    fp.close()
2. Combining the find and select selectors
# https://www.shicimingju.com/book/sanguoyanyi.html

import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',
        'Cookie': 'Hm_lvt_649f268280b553df1f778477ee743752=1613016932; key_kw=; key_cate=zuozhe; Hm_lpvt_649f268280b553df1f778477ee743752=1613016981'
    }
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    page_txt = response.text

    # parse the chapter titles and the detail-page URLs
    soup = BeautifulSoup(page_txt, 'lxml')
    li_list = soup.select('#main_left > div > div.book-mulu > ul > li')

    fp = open('./三国演义.txt', 'w', encoding='utf-8')
    for li in li_list:
        title = li.a.string
        title_url = 'https://www.shicimingju.com' + li.a['href']

        # fetch the HTML of this chapter
        detail_response = requests.get(url=title_url, headers=headers)
        detail_response.encoding = 'utf-8'
        detail_html = detail_response.text

        # parse the chapter HTML
        title_soup = BeautifulSoup(detail_html, 'lxml')
        detail_tag = title_soup.find('div', attrs={'class': 'chapter_content'})  # returns a bs4.element.Tag
        content = detail_tag.text
        fp.write('\n' + title + content + '\n')
        print(title + ' scraped successfully!')

    fp.close()