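Tutorials on scraping the Maoyan ranking board are a dime a dozen online, so thanks to everyone who hit the pitfalls before me. I still managed to fall into every one of them all over again (cue manual crying).
Here is my approach:
Get the page URL -> fetch the page source -> parse the page with a regular expression -> write to a TXT file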
-----------------------------------------------------------------------------------------------------------------------------
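Get the page URL. Nothing much to say here.

    def get_page_url(n):
        # n is the page index (0-9); appending '0' turns it into the offset
        url = 'https://maoyan.com/board/4?offset=' + str(n) + '0'
        return url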
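One detail worth checking: because the offset is built by string concatenation, page 0 produces offset=00 rather than offset=0. The sketch below repeats the original helper next to a hypothetical variant, get_page_url_int, that computes the offset as an integer. The page_size parameter is my own name, and the value 10 is inferred from the offset step in the code above.

```python
def get_page_url(n):
    """Original approach: append '0' to the page index."""
    return 'https://maoyan.com/board/4?offset=' + str(n) + '0'


def get_page_url_int(n, page_size=10):
    """Hypothetical variant: compute the offset as an integer."""
    return 'https://maoyan.com/board/4?offset={}'.format(n * page_size)


print(get_page_url(0))      # https://maoyan.com/board/4?offset=00
print(get_page_url_int(0))  # https://maoyan.com/board/4?offset=0
print(get_page_url(3) == get_page_url_int(3))  # True
```

For every page index from 1 to 9 the two functions return the same URL; they differ only for page 0.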
Fetch the page source.

    def get_one_page(url):
        page = requests.get(url)
        return page.text
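The version above returns whatever the server sends back, including error pages. A slightly more defensive sketch is shown here. It assumes that sending a browser-like User-Agent is helpful (this post does not say whether Maoyan requires one, and the header string is only illustrative), and the session and timeout parameters are my own additions.

```python
import requests

# Assumption: a browser-like User-Agent; the exact string is illustrative.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}


def get_one_page(url, session=None, timeout=10):
    """Return the page HTML, or None if the request fails."""
    http = session or requests  # allow passing a requests.Session
    try:
        resp = http.get(url, headers=HEADERS, timeout=timeout)
        resp.raise_for_status()  # raise on 4xx/5xx responses
    except requests.RequestException:
        return None
    return resp.text
```

Returning None on failure lets the caller skip a page instead of feeding an error page into the regular expression.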
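Parse the page source with a regular expression. I hit a pitfall here: I forgot to wrap the regular expression in re.compile, which caused an error at runtime.

    def parse_page(page):
        # re.S lets '.' match newlines, so one match can span a whole <dd> block
        pattern = re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src=(.*?)alt=.*?data-act.*?>(.*?)</a>.*?class="star".*?>(.*?)</p>.*?releasetime">(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>', re.S)
        paged = re.findall(pattern, page)
        for item in paged:
            print(item)
        return paged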
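To see what the seven capture groups return, the pattern can be run against a small HTML fragment. The fragment below is made up to fit the pattern and is not the actual Maoyan markup. The clean-up step (stripping whitespace and quotes, joining the two score parts) and the dictionary keys are my own additions, not part of the original script.

```python
import re

PATTERN = re.compile(
    '<dd>.*?board-index.*?>(.*?)</i>.*?data-src=(.*?)alt=.*?data-act.*?>(.*?)</a>'
    '.*?class="star".*?>(.*?)</p>.*?releasetime">(.*?)</p>'
    '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>',
    re.S,  # let '.' match newlines, since each <dd> block spans several lines
)

# Hypothetical fragment shaped to match the pattern.
SAMPLE = '''
<dd>
<i class="board-index board-index-1">1</i>
<img data-src="https://example.com/poster.jpg" alt="Example Film" />
<p class="name"><a href="/films/1" data-act="boarditem-click">Example Film</a></p>
<p class="star">
    Starring: A, B
</p>
<p class="releasetime">Release date: 1993-01-01</p>
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''


def parse_page(page):
    """Return one dict per film, with the raw groups tidied up."""
    films = []
    for index, img, title, star, release, integer, fraction in PATTERN.findall(page):
        films.append({
            'rank': index,
            'image': img.strip().strip('"'),  # group keeps the quotes and a space
            'title': title,
            'star': star.strip(),             # group keeps surrounding newlines
            'release': release.strip(),
            'score': integer + fraction,      # '9.' + '5' -> '9.5'
        })
    return films


print(parse_page(SAMPLE))
```

The raw image group still carries its quotes and a trailing space, and the star group carries the surrounding newlines, so some clean-up is needed before the data is saved.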
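Write to the file. Another pitfall: I had imported the os module and called os.open, which kept raising errors. Normally you just call the built-in open().

    def write_to_txt(paged):
        paged = str(paged)
        # 'a' appends, so each page adds a line; utf-8 keeps the Chinese text intact
        with open('猫眼电影排行榜.txt', 'a', encoding='utf-8') as maoyan:
            maoyan.write(paged)
            maoyan.write('\n')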
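A plausible explanation for the os.open errors (the post does not show the traceback, so this is a guess): os.open is a low-level call that expects integer flags such as os.O_WRONLY instead of a mode string like 'a', and it returns a bare file descriptor, an int with no write() method. The sketch below contrasts the two in a temporary directory; the file name is arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Built-in open(): takes a mode string and returns a file object.
with open(path, 'a', encoding='utf-8') as f:
    f.write('first line\n')

# os.open(): takes integer flags and returns a file descriptor (an int).
fd = os.open(path, os.O_WRONLY | os.O_APPEND)
print(type(fd).__name__)        # int
os.write(fd, b'second line\n')  # os.write takes bytes, not str
os.close(fd)

with open(path, encoding='utf-8') as f:
    print(f.read().splitlines())  # ['first line', 'second line']
```

Calling os.open(path, 'a') raises a TypeError because the second argument must be an integer, which would match the symptom of the script failing every time.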
Full code

    import re

    import requests

    # First page URL: https://maoyan.com/board/4?offset=0


    def get_page_url(n):
        # n is the page index (0-9); appending '0' turns it into the offset
        url = 'https://maoyan.com/board/4?offset=' + str(n) + '0'
        return url


    def get_one_page(url):
        page = requests.get(url)
        return page.text


    def parse_page(page):
        # re.S lets '.' match newlines, so one match can span a whole <dd> block
        pattern = re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src=(.*?)alt=.*?data-act.*?>(.*?)</a>.*?class="star".*?>(.*?)</p>.*?releasetime">(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>', re.S)
        paged = re.findall(pattern, page)
        for item in paged:
            print(item)
        return paged


    def write_to_txt(paged):
        paged = str(paged)
        # 'a' appends, so each page adds a line; utf-8 keeps the Chinese text intact
        with open('猫眼电影排行榜.txt', 'a', encoding='utf-8') as maoyan:
            maoyan.write(paged)
            maoyan.write('\n')


    def main():
        # 10 pages, offsets 0 through 90
        for i in range(0, 10):
            url = get_page_url(i)
            page = get_one_page(url)
            writed = parse_page(page)
            write_to_txt(writed)


    if __name__ == '__main__':
        main()
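There are still gaps to fill. One is writing each element of the list on its own line (right now a whole list goes on one line); the other is writing the results to Excel for analysis.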
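Both remaining items could be approached roughly as follows. This is a sketch and not the author's eventual solution: it writes one film per line as tab-separated text, and it uses the standard csv module to produce a file that Excel can open, in place of a true .xlsx writer. The path parameters, the column names, and the sample rows are made up for illustration.

```python
import csv
import os
import tempfile


def write_to_txt(paged, path):
    """Append one line per matched film instead of one line per page."""
    with open(path, 'a', encoding='utf-8') as f:
        for item in paged:
            f.write('\t'.join(field.strip() for field in item) + '\n')


def write_to_csv(paged, path):
    """Write all films to a CSV file that Excel can open directly."""
    # utf-8-sig adds a byte-order mark so Excel detects the encoding.
    with open(path, 'w', encoding='utf-8-sig', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['rank', 'image', 'title', 'star',
                         'release', 'integer', 'fraction'])
        writer.writerows(paged)


# Made-up rows in the shape parse_page returns (7 strings per film).
sample = [
    ('1', 'a.jpg', 'Film A', ' Starring: X ', '1993-01-01', '9.', '5'),
    ('2', 'b.jpg', 'Film B', ' Starring: Y ', '1994-01-01', '9.', '4'),
]
out_dir = tempfile.mkdtemp()
txt_path = os.path.join(out_dir, 'maoyan.txt')
csv_path = os.path.join(out_dir, 'maoyan.csv')
write_to_txt(sample, txt_path)
write_to_csv(sample, csv_path)
```

Because write_to_csv opens the file in 'w' mode, it should be called once with the films from all pages collected into a single list.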