To learn web scraping, I crawled the scraping articles of crawler expert Cui Qingcai in three steps and saved them as a PDF for study

2024-01-15 23:14:04

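To learn web scraping, I crawled all the Python blog posts by Cui Qingcai, a well-known figure in the web-scraping community, and converted them to PDF for later study.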

1. Code outline

Get the URLs of all blog posts
Fetch the HTML content of each post and convert it to a PDF file
Merge the PDF files

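2. Getting all the blog URLs

First, Cui's blog site shows that the Python category currently spans 7 pages, as shown in the figure below.

These category pages make it easy to collect the URL of every post. The code is as follows: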

# Collect the URLs of all posts
def get_url(driver):
    url_list = []
    # Iterate over the 7 category pages in reverse order (page 7 first)
    for i in range(7, 0, -1):
        if i == 1:
            # The first page has no /page/N/ suffix
            url = "https://cuiqingcai.com/categories/Python"
        else:
            url = "https://cuiqingcai.com/categories/Python/page/{number}/".format(number=i)
        driver.get(url)
        time.sleep(5)  # give the page time to finish rendering
        soup = BeautifulSoup(driver.page_source, 'lxml')
        articles = soup.find('div', class_='posts-collapse').find_all('article')
        # Also reverse the posts within each page
        for article in articles[::-1]:
            url_list.append('https://cuiqingcai.com' + article.find('a', class_='post-title-link')['href'])
    return url_list
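
The URL pattern described above can also be isolated from the browser code, which makes it easy to check without launching Selenium. The following is a minimal sketch, not the author's code: the function names category_page_urls and absolute_post_url are made up for illustration, the path "/example-post.html" is a hypothetical href, and urljoin from the standard library replaces the plain string concatenation.

```python
from urllib.parse import urljoin

BASE_URL = "https://cuiqingcai.com"


def category_page_urls(last_page, category="Python"):
    """Return the category page URLs from the last page down to page 1."""
    urls = []
    for page in range(last_page, 0, -1):
        if page == 1:
            # The first page has no /page/N/ suffix.
            urls.append("{base}/categories/{cat}".format(base=BASE_URL, cat=category))
        else:
            urls.append("{base}/categories/{cat}/page/{n}/".format(
                base=BASE_URL, cat=category, n=page))
    return urls


def absolute_post_url(href):
    """Resolve an href taken from a category page against the site root."""
    # urljoin leaves hrefs that are already absolute unchanged.
    return urljoin(BASE_URL, href)


if __name__ == "__main__":
    for url in category_page_urls(3):
        print(url)
    # "/example-post.html" is a hypothetical href used only for illustration.
    print(absolute_post_url("/example-post.html"))
```

Passing the page count as an argument avoids editing the loop when the category grows beyond 7 pages.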

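3. Saving each post as a PDF

To save the post content as PDF, I mainly use Python's pdfkit package, which makes it easy to save a web page, an HTML file, or an HTML string as a PDF file.
pdfkit can be installed directly with pip install pdfkit.
pdfkit relies on the wkhtmltopdf application, so wkhtmltopdf must be downloaded and installed first. After installing it, either add it to the PATH environment variable or pass its location to pdfkit as a parameter.
Stable wkhtmltopdf releases can be downloaded from: https://wkhtmltopdf.org/downloads.html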
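
The two ways of locating wkhtmltopdf mentioned above (an explicit path, or the PATH environment variable) can be combined into one lookup. This is a rough sketch using only the standard library; find_wkhtmltopdf is a hypothetical helper name, and the Windows path is simply the install location used later in this article.

```python
import os
import shutil


def find_wkhtmltopdf(explicit_path=None):
    """Return a usable wkhtmltopdf path, or None if none is found."""
    # 1. Prefer a path supplied by the caller, if it points at a real file.
    if explicit_path and os.path.isfile(explicit_path):
        return explicit_path
    # 2. Otherwise look the executable up on PATH.
    return shutil.which("wkhtmltopdf")


if __name__ == "__main__":
    path = find_wkhtmltopdf(r"D:\wkhtmltopdf\bin\wkhtmltopdf.exe")
    if path is None:
        print("wkhtmltopdf not found; install it or pass its full path")
    else:
        print("using wkhtmltopdf at", path)
```

The returned path can then be handed to pdfkit.configuration(wkhtmltopdf=path), which is how the complete script passes the location to pdfkit.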
The code that saves the post content as PDF is as follows:

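def html_to_pdf(driver, url_list, html_template, options, config):
    for article_url in url_list:
        driver.get(article_url)
        time.sleep(5)  # give the page time to finish rendering
        soup = BeautifulSoup(driver.page_source, 'lxml')
        title = soup.find('h1', class_='post-title')
        body = soup.find('div', class_='post-body')

        # Remove the CSS-styled advertisement block, if the post has one
        ads = body.find('div', class_='advertisements')
        if ads is not None:
            ads.clear()

        # title and body are Tag objects, so format() inserts their HTML markup
        html = html_template.format(title=title, content=body)

        name = title.get_text().strip()
        print(name, ':', article_url)
        try:
            pdfkit.from_string(html, r"D:\toPDF\{name}.pdf".format(name=name), configuration=config, options=options)
        except OSError as err:
            # Report the failure and continue with the next post
            print('Failed to convert:', name, err)
            continue
        print('Article:', name, 'converted to PDF')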
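
Because the post title is used directly as the file name, a title containing characters such as ":" or "?" cannot be written on Windows. One possible way to handle this, sketched here as an assumption rather than as part of the original script, is to clean the title first. safe_filename is a hypothetical helper and the sample title is made up.

```python
import re


def safe_filename(title, replacement="_"):
    """Make a post title usable as a Windows file name (hypothetical helper)."""
    # Replace the characters Windows forbids in file names, plus control characters.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', replacement, title)
    # Collapse whitespace runs and trim leading/trailing spaces and dots.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    return cleaned or "untitled"


if __name__ == "__main__":
    # A made-up title containing several forbidden characters.
    print(safe_filename('Python3 scraping: requests/urllib compared?'))
```

The cleaned name would replace title.getText() when building the output path.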

4. Merging the PDF files

In Python, the PyPDF2 package makes it easy to merge PDF files. Install it with pip install PyPDF2.

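# Requires: from PyPDF2 import PdfFileMerger
# Note: PdfFileMerger and import_bookmarks are the names used by PyPDF2 releases before 3.0
def merge_pdf(file_path):
    merger = PdfFileMerger()
    output_name = "allPdfMerge.pdf"
    # Take only the PDF files directly inside file_path, skipping an earlier merged output
    files = [f for f in os.listdir(file_path)
             if f.lower().endswith('.pdf') and f != output_name]
    for file in files:
        print(file)
        with open(os.path.join(file_path, file), "rb") as input_pdf:
            merger.append(input_pdf, import_bookmarks=False)
    with open(os.path.join(file_path, output_name), "wb") as output:
        merger.write(output)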
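
The order in which the directory listing returns files is not guaranteed, so the merged PDF may not follow the order in which the posts were crawled. Assuming the PDFs were written one after another by the conversion step, sorting by modification time is one way to recover that order. This is only a sketch: list_pdfs_by_mtime is a hypothetical helper, and the demo uses placeholder files in a temporary folder.

```python
import os
import tempfile


def list_pdfs_by_mtime(folder, exclude=("allPdfMerge.pdf",)):
    """Return the paths of the PDFs in folder, oldest modification time first."""
    paths = [
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".pdf") and name not in exclude
    ]
    # Sort by modification time; use the path itself to break ties deterministically.
    return sorted(paths, key=lambda p: (os.path.getmtime(p), p))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        for i, name in enumerate(["b.pdf", "a.pdf", "notes.txt"]):
            path = os.path.join(folder, name)
            with open(path, "wb") as f:
                f.write(b"placeholder")
            os.utime(path, (1000 + i, 1000 + i))  # force distinct timestamps
        print([os.path.basename(p) for p in list_pdfs_by_mtime(folder)])
```

The returned paths could be fed to the merger in place of the unsorted file list.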

5. Complete code

The steps above cover crawling the blog and converting it into the final PDF. The complete code is as follows:

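from selenium import webdriver
from bs4 import BeautifulSoup
from PyPDF2 import PdfFileMerger  # name used by PyPDF2 releases before 3.0
import pdfkit
import time
import os

# Collect the URLs of all posts
def get_url(driver):
    url_list = []
    # Iterate over the 7 category pages in reverse order (page 7 first)
    for i in range(7, 0, -1):
        if i == 1:
            # The first page has no /page/N/ suffix
            url = "https://cuiqingcai.com/categories/Python"
        else:
            url = "https://cuiqingcai.com/categories/Python/page/{number}/".format(number=i)
        driver.get(url)
        time.sleep(5)  # give the page time to finish rendering
        soup = BeautifulSoup(driver.page_source, 'lxml')
        articles = soup.find('div', class_='posts-collapse').find_all('article')
        # Also reverse the posts within each page
        for article in articles[::-1]:
            url_list.append('https://cuiqingcai.com' + article.find('a', class_='post-title-link')['href'])
    return url_list

# Convert each post to its own PDF file
def html_to_pdf(driver, url_list, html_template, options, config):
    for article_url in url_list:
        driver.get(article_url)
        time.sleep(5)  # give the page time to finish rendering
        soup = BeautifulSoup(driver.page_source, 'lxml')
        title = soup.find('h1', class_='post-title')
        body = soup.find('div', class_='post-body')

        # Remove the CSS-styled advertisement block, if the post has one
        ads = body.find('div', class_='advertisements')
        if ads is not None:
            ads.clear()

        # title and body are Tag objects, so format() inserts their HTML markup
        html = html_template.format(title=title, content=body)

        name = title.get_text().strip()
        print(name, ':', article_url)
        try:
            pdfkit.from_string(html, r"D:\toPDF\{name}.pdf".format(name=name), configuration=config, options=options)
        except OSError as err:
            # Report the failure and continue with the next post
            print('Failed to convert:', name, err)
            continue
        print('Article:', name, 'converted to PDF')

# Merge the per-post PDFs into a single file
def merge_pdf(file_path):
    merger = PdfFileMerger()
    output_name = "allPdfMerge.pdf"
    # Take only the PDF files directly inside file_path, skipping an earlier merged output
    files = [f for f in os.listdir(file_path)
             if f.lower().endswith('.pdf') and f != output_name]
    for file in files:
        print(file)
        with open(os.path.join(file_path, file), "rb") as input_pdf:
            merger.append(input_pdf, import_bookmarks=False)
    with open(os.path.join(file_path, output_name), "wb") as output:
        merger.write(output)

def main():
    # The title goes inside <body> so that the generated HTML is well-formed
    html_template = """
    <!DOCTYPE html>
    <html lang="en">
    <head>
     <meta charset="UTF-8">
    </head>
    <body>
    {title}
    {content}
    </body>
    </html>
    """
    # Options passed through to wkhtmltopdf (the cookie entries are placeholders)
    options = {
        'page-size': 'Letter', 'margin-top': '0.75in', 'margin-right': '0.75in',
        'margin-bottom': '0.75in', 'margin-left': '0.75in', 'encoding': "UTF-8",
        'custom-header': [('Accept-Encoding', 'gzip')],
        'cookie': [('cookie-name1', 'cookie-value1'), ('cookie-name2', 'cookie-value2')],
        'outline-depth': 10,
    }
    path_wk = r"D:\wkhtmltopdf\bin\wkhtmltopdf.exe"  # wkhtmltopdf install location
    config = pdfkit.configuration(wkhtmltopdf=path_wk)

    driver = webdriver.Firefox()
    try:
        url_list = get_url(driver)
        html_to_pdf(driver, url_list, html_template, options, config)
    finally:
        driver.quit()  # always close the browser, even if a step fails

    file_path = r"D:\toPDF"
    merge_pdf(file_path)

if __name__ == "__main__":
    main()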

P.S. This article uses webdriver to get the page HTML; the requests.get() method can also be used to fetch it.
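
As a rough illustration of the requests.get() alternative, the sketch below fetches a page without a browser. It rests on assumptions: requests does not execute JavaScript, so this only works if the post content is already present in the HTML the server returns, which this article does not confirm. The function name fetch_html and its get parameter (an injection point so the function can be exercised without network access) are hypothetical.

```python
def fetch_html(url, get=None, timeout=10):
    """Fetch a page and return its HTML text.

    `get` is a hypothetical injection point: any callable with the same
    interface as requests.get. It defaults to requests.get itself.
    """
    if get is None:
        import requests  # third-party: pip install requests
        get = requests.get
    # Some sites reject the default client User-Agent, so send a browser-like one.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


# Example usage (performs a real network request):
#     html = fetch_html("https://cuiqingcai.com/categories/Python")
#     soup = BeautifulSoup(html, 'lxml')
```

The returned string can be passed to BeautifulSoup in the same way as driver.page_source.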

