[scrapy] Using Goose for article body extraction in Scrapy

import scrapy
from goose import Goose


class Article(scrapy.Item):
    # Fields filled in from Goose's extraction result
    title = scrapy.Field()
    text = scrapy.Field()


class MyGooseSpider(scrapy.Spider):
    name = 'goose'
    start_urls = [
        'http://blog.scrapinghub.com/2014/06/18/extracting-schema-org-microdata-using-scrapy-selectors-and-xpath/',
        'http://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/',
    ]

    def parse(self, response):
        # Hand the HTML that Scrapy already downloaded to Goose via raw_html
        article = Goose().extract(raw_html=response.body)
        yield Article(title=article.title, text=article.cleaned_text)
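
To exercise the extraction step without crawling, one option is to move it out of parse into a small helper that receives the extractor as an argument. The sketch below is not part of the original answer: extract_fields and StubExtractor are hypothetical names introduced for illustration, and the only thing assumed about Goose is what the spider already relies on, namely an extract(raw_html=...) method returning an object with title and cleaned_text attributes.

```python
class StubExtractor:
    """Hypothetical stand-in for Goose(), used only to run the helper offline."""

    class _Result:
        title = "  XPath tips  "
        cleaned_text = "First paragraph.\n"

    def extract(self, raw_html=None):
        # Ignores the HTML and returns a fixed result with Goose-like attributes.
        return self._Result()


def extract_fields(raw_html, extractor):
    """Run the extractor on raw HTML and return the two Article fields as a dict."""
    article = extractor.extract(raw_html=raw_html)
    return {
        # Fall back to an empty string on None and trim surrounding whitespace.
        "title": (article.title or "").strip(),
        "text": (article.cleaned_text or "").strip(),
    }


if __name__ == "__main__":
    print(extract_fields(b"<html></html>", StubExtractor()))
```

With this helper, the body of parse could shrink to yield Article(**extract_fields(response.body, Goose())), and the whitespace and None handling can be checked without network access.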

Source: http://*.com/questions/26940002/can-i-use-scrapy-with-goose
