If you are not familiar with Scrapy yet, take a look at the official Scrapy documentation first and then come back to this article.
For how to create a Scrapy project, see http://blog.csdn.net/chenguolinblog/article/details/19699865
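For quick reference, a project like this one is generated with Scrapy's standard startproject command (the project name firstScrapy is taken from the imports used in the code below), which produces the usual skeleton:

    scrapy startproject firstScrapy

    firstScrapy/
        scrapy.cfg
        firstScrapy/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py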
Below we go through the project file by file; the full source code on GitHub is linked at the end.
(1) The first file is spider.py. This is the spider we define ourselves to crawl the pages; see the comments below for details.
__author__ = 'chenguolin'
"""
Date: 2014-03-06
"""

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule   # CrawlSpider lets us define link-crawling rules (Rule)
from scrapy.selector import HtmlXPathSelector          # HtmlXPathSelector is used to parse the response
from firstScrapy.items import FirstscrapyItem

class firstScrapy(CrawlSpider):
    name = "firstScrapy"                               # the spider name must be unique
    allowed_domains = ["yuedu.baidu.com"]              # domains the spider is allowed to crawl
    start_urls = ["http://yuedu.baidu.com/book/list/0?od=0&show=1&pn=0"]   # the first URL to crawl

    # Two rules are defined below: the first matches the book-detail pages we want to parse,
    # with myparse as the callback; the second matches "next page" links, which are simply
    # followed without a callback.
    rules = [Rule(SgmlLinkExtractor(allow=('/ebook/[^/]+fr=booklist')), callback='myparse'),
             Rule(SgmlLinkExtractor(allow=('/book/list/[^/]+pn=[^/]+', )), follow=True)]

    # callback function
    def myparse(self, response):
        x = HtmlXPathSelector(response)
        item = FirstscrapyItem()

        # fill the item
        item['link'] = response.url
        item['title'] = ""
        strlist = x.select("//h1/@title").extract()
        if len(strlist) > 0:
            item['title'] = strlist[0]

        # return the item
        return item
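With the spider in place, it is started from the project root with Scrapy's standard crawl command, using the spider name defined above (this assumes the item pipeline and database described below are already set up):

    scrapy crawl firstScrapy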
(2) The second file is items.py, which defines the fields we need. Since we only scrape each book's title and link, both fields are serialized as str.
from scrapy.item import Item, Field

class FirstscrapyItem(Item):
    title = Field(serializer=str)
    link = Field(serializer=str)
(3) The third file is pipelines.py. Since the items need to be written to a database, we use Twisted's adbapi to connect to MySQL asynchronously.
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from twisted.enterprise import adbapi   # Twisted's asynchronous database API
import MySQLdb
import MySQLdb.cursors

class FirstscrapyPipeline(object):
    def __init__(self):
        # set up the MySQL connection pool
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
                                            db='bookInfo',
                                            user='root',
                                            passwd='123456',
                                            cursorclass=MySQLdb.cursors.DictCursor,
                                            charset='utf8',
                                            use_unicode=False)

    # this is the method the pipeline calls for every item by default
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        return item

    # insert the data into the database
    def _conditional_insert(self, tx, item):
        sql = "insert into book values (%s, %s)"
        tx.execute(sql, (item["title"], item["link"]))
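As the comment at the top of the file notes, the pipeline only runs if it is registered in settings.py. A minimal sketch (older Scrapy releases, which this code targets, take a list of class paths; newer releases expect a dict mapping the class path to a priority):

    # firstScrapy/settings.py
    ITEM_PIPELINES = ['firstScrapy.pipelines.FirstscrapyPipeline']
    # on newer Scrapy versions the equivalent would be:
    # ITEM_PIPELINES = {'firstScrapy.pipelines.FirstscrapyPipeline': 300}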
(4) A screenshot of a MySQL GUI tool on Ubuntu showing the crawled data.
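Before the crawl can write anything for that screenshot to show, the bookInfo database and the book table must already exist. A minimal setup sketch; the column names and sizes are my assumptions, only the database name, table name, and column order come from the pipeline code above:

    import MySQLdb

    # create the database and table the pipeline expects
    conn = MySQLdb.connect(user='root', passwd='123456', charset='utf8')
    cur = conn.cursor()
    cur.execute("CREATE DATABASE IF NOT EXISTS bookInfo DEFAULT CHARACTER SET utf8")
    cur.execute("USE bookInfo")
    # two columns, matching "insert into book values (%s, %s)" in the pipeline
    cur.execute("CREATE TABLE IF NOT EXISTS book (title VARCHAR(255), link VARCHAR(255))")
    conn.commit()
    conn.close()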
(5) You can clone the complete project directly from my GitHub: https://github.com/chenguolin/firstScrapyProject.git
==================================
== from: Chen Guolin (陈国林) ==
== email:cgl1079743846@gmail.com ==
== Please credit the source when reposting, thank you! ==
==================================