I'd originally planned to spend tonight writing a Tieba crawler, but I ran into big trouble just analyzing the page! I picked a thread, and while scraping it I found that my regex wasn't matching everything.. so awkward!! Let's take a look first:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'ziv·chan'

import requests
import re

url = 'http://tieba.baidu.com/p/3138733512?see_lz=1&pn=3'
html = requests.get(url)
html.encoding = 'utf-8'
pageCode = html.text

# Each floor's body sits in a div with these two classes;
# re.S lets .*? match across line breaks
pattern = re.compile('d_post_content j_d_post_content ">(.*?)</div><br>', re.S)
items = re.findall(pattern, pageCode)

i = 1
for item in items:
    hasImg = re.search('<img', item)
    hasHref = re.search('href', item)
    # Filter out the image tags
    if hasImg:
        pattern_1 = re.compile('<img class="BDE_Image".*?<br><br>')
        item = re.sub(pattern_1, '', item)
    # Filter the href links: pull out just the @-mentioned usernames
    if hasHref:
        pattern_2 = re.compile('onclick="Stats.sendRequest.*?class="at">(.*?)</a>', re.S)
        item = re.findall(pattern_2, item)
    print str(i) + ':'
    # Extract the users from the href tags
    # (findall above replaced item with a list in that case)
    if type(item) is list:
        for each in item:
            print each
    else:
        # Replace the leftover '<br>' tags with real newlines
        pattern_Br = re.compile('<br>')
        item = re.sub(pattern_Br, '\n', item)
        # strip() removes surrounding whitespace by default
        print item.strip()
    print '\n'
    i += 1
    # if not hasImg and not hasHref:
    #     print i
    #     print item.strip()
    #     i += 1
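For context, here is roughly the markup shape the main pattern expects. This toy snippet is reconstructed from the regex itself, not copied from Tieba, so the real page will differ:

import re

sample = ('<div class="d_post_content j_d_post_content ">'
          'floor text<br>second line</div><br>')
pattern = re.compile('d_post_content j_d_post_content ">(.*?)</div><br>', re.S)
# Prints ['floor text<br>second line'] -- the <br> cleanup in the
# loop above is what turns that capture into readable lines.
print re.findall(pattern, sample)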
I thought it was all done, but.. but when extracting content that contains @ mentions, the output is always missing one thing or another... heartache. My regex kung fu clearly still needs more work, but at least I got to put the methods I learned earlier today straight to use. Got it!
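One idea to try before then: instead of special-casing <img> and href, strip every tag in one pass so the visible text of @ links survives. This is just a sketch of mine, not the fix; both regexes are my own guesses at what the post bodies need:

import re

def clean_post(raw):
    # Turn <br> into real newlines first so line breaks survive
    text = re.sub(r'<br\s*/?>', '\n', raw)
    # Drop every remaining tag but keep its inner text, so an
    # @-mention link leaves its visible username behind
    text = re.sub(r'<[^>]+>', '', text)
    return text.strip()

# Would slot into the loop above in place of the img/href branches:
# print clean_post(item)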
Tomorrow I'll check out how 静觅 does it. Another big meal to digest properly. Keep it up!!