Python Web Scraping: A Simple Qiushibaike Scraper

2022-09-10 16:38:59

I have just started learning to write web scrapers in Python, so I wrote a simple Python program that scrapes Qiushibaike.

The steps are as follows. First, look at a Qiushibaike URL: http://www.qiushibaike.com/8hr/page/2/?s=4959489. The number after page is the page number.
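Since only the page number changes, the URL can be produced by a small helper. This is a minimal sketch: the name build_page_url is hypothetical, and the s value is simply copied from the example URL (what it means is not covered here).

```python
BASE_URL = "http://www.qiushibaike.com/8hr/page/{page}/?s={s}"


def build_page_url(page, s="4959489"):
    """Return the listing URL for a page number (hypothetical helper)."""
    return BASE_URL.format(page=page, s=s)


print(build_page_url(2))
```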

Next, build the request. Be sure to set a User-Agent header, since some sites reject urllib2's default one:

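 import urllib2

 # URL of the page to fetch (page 2 of the listing)
 url = 'http://www.qiushibaike.com/8hr/page/2/?s=4959489'
 user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
 headers = {'User-Agent': user_agent}
 # Attach the headers to the request, then send it
 request = urllib2.Request(url, headers=headers)
 response = urllib2.urlopen(request)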
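urllib2 exists only in Python 2. For readers on Python 3, the same request could be assembled with urllib.request. The sketch below is my assumption about how to port this step, not part of the original program. It only builds the request and inspects the header; the actual network call is left commented out.

```python
import urllib.request

url = "http://www.qiushibaike.com/8hr/page/2/?s=4959489"
user_agent = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
headers = {"User-Agent": user_agent}

# Build the request object; nothing is sent until urlopen() is called
request = urllib.request.Request(url, headers=headers)

# Request stores header names capitalized, i.e. as "User-agent"
print(request.get_header("User-agent"))

# Uncomment to actually fetch the page (requires network access):
# response = urllib.request.urlopen(request)
```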

Then read the returned data and decode it:

content=response.read().decode('utf-8')
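response.read() returns raw bytes, and decode('utf-8') turns them into text. A small sketch with a hand-made byte string, standing in for a real response body, shows the round trip (Python 3 syntax assumed):

```python
# Stand-in for the bytes that response.read() would return
raw = "<span>糗事百科</span>".encode("utf-8")

# Decode the UTF-8 bytes into a text string
content = raw.decode("utf-8")

print(type(raw).__name__, type(content).__name__)
print(content)
```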

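The key step comes next: use a regular expression to extract the text of every post. The browser's Inspect feature shows the page structure, and from that you can write a matching pattern. For example, the pattern used to match each <div class="content">...</div> block is: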

pattern=re.compile('<div class="content">....<span>(.*?)</span>...</div>',re.S)

items=re.findall(pattern,content)

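(.*?) is a capturing group: the part in parentheses is treated as a unit, and the text it matches is what gets returned. The ? makes the match non-greedy, so the group stops at the first </span> that follows. findall finds every non-overlapping match in the string and returns them as a list; because the pattern has exactly one group, each list element is the text captured by that group.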
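To see what the capturing group returns, here is a minimal sketch on a made-up HTML fragment (not the real page) that compares a non-greedy group with a greedy one:

```python
import re

html = "<span>first</span><span>second</span>"  # made-up sample

# Non-greedy: each group stops at the nearest closing tag
lazy = re.findall(r"<span>(.*?)</span>", html)

# Greedy: the group runs on to the last closing tag
greedy = re.findall(r"<span>(.*)</span>", html)

print(lazy)    # ['first', 'second']
print(greedy)  # ['first</span><span>second']
```

With several posts on one page, the non-greedy form is the one that yields one list element per post.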

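The re.S flag makes the dot (.) match any character, including newlines. It applies to every dot in the pattern, so .... matches exactly four characters of any kind and ... exactly three.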
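The effect of re.S is easiest to see on text that spans several lines. The fragment below is invented for illustration and only loosely resembles the real page:

```python
import re

# Made-up fragment whose <span> content contains a line break
html = '<div class="content">\n<span>line one<br/>\nline two</span>\n</div>'

pattern_text = r"<span>(.*?)</span>"

# Without re.S the dot stops at the newline, so nothing matches
without_flag = re.findall(pattern_text, html)

# With re.S the dot also matches the newline
with_flag = re.findall(pattern_text, html, re.S)

print(without_flag)  # []
print(with_flag)     # ['line one<br/>\nline two']
```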

The complete code follows:

 import re
 import time
 import urllib2

 page = 2
 # "w" creates the file if it is missing ("r+" fails when it does not exist);
 # the raw string keeps the backslash in the Windows path literal
 f = open(r"D:\qiushi.txt", "w")
 user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
 headers = {'User-Agent': user_agent}
 # Compile the pattern once instead of on every page
 pattern = re.compile('<div class="content">....<span>(.*?)</span>...</div>', re.S)

 while page < 100:
     url = "http://www.qiushibaike.com/8hr/page/" + str(page) + "/?s=4959460"
     print url
     try:
         request = urllib2.Request(url, headers=headers)
         response = urllib2.urlopen(request)
         content = response.read().decode('utf-8')
         # print content
         items = re.findall(pattern, content)
         f.write((url + "\n").encode('utf-8'))
         for item in items:
             print "------"
             item = item + "\n"
             print item
             f.write("------\n".encode('utf-8'))
             # Turn HTML line breaks into real newlines before saving
             f.write(item.replace('<br/>', '\n').encode('utf-8'))
     except urllib2.URLError as e:
         # HTTPError (a URLError subclass) also carries a status code
         if hasattr(e, "code"):
             print e.code
         if hasattr(e, "reason"):
             print e.reason
     finally:
         # Move on to the next page and pause between requests
         page += 1
         time.sleep(1)

 f.close()
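The except branch checks for both code and reason because HTTPError is a subclass of URLError: an HTTP failure carries a status code, while a plain connection failure only has a reason. The sketch below shows this with the Python 3 module urllib.error and hand-built error objects. The helper name describe_error and the example.com URL are placeholders of mine.

```python
import urllib.error


def describe_error(e):
    """Collect the attributes the except block prints (hypothetical helper)."""
    parts = []
    if hasattr(e, "code"):
        parts.append(e.code)
    if hasattr(e, "reason"):
        parts.append(e.reason)
    return parts


# Hand-built errors standing in for real failures
http_error = urllib.error.HTTPError("http://example.com/", 404, "Not Found", None, None)
url_error = urllib.error.URLError("connection refused")

print(describe_error(http_error))
print(describe_error(url_error))
```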

Here the matched posts are written to the file qiushi.txt on the D: drive.
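The saving step can also be sketched on its own. The version below assumes Python 3 and uses a with block so the file is closed automatically. The sample items are made up, and the sketch writes to a temporary directory rather than the D: drive so that it runs on any system.

```python
import os
import tempfile

# Made-up sample data standing in for the regex matches
items = ["first post<br/>second line", "another post"]

# Demo path in the temp directory; the scraper itself targets D:/qiushi.txt
path = os.path.join(tempfile.gettempdir(), "qiushi_demo.txt")

# The with block closes the file even if a write fails
with open(path, "w", encoding="utf-8") as f:
    for item in items:
        f.write("------\n")
        # Turn HTML line breaks into real newlines before saving
        f.write(item.replace("<br/>", "\n") + "\n")

# Read the file back to check what was saved
with open(path, encoding="utf-8") as f:
    saved = f.read()

print(saved)
```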
