Python Web from Beginner to Mastery (1): Crawling a Weather Site with the Scrapy Framework and Storing the Data in a Database

  1. Create the project

    scrapy startproject <project_name>
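
    Assuming the project is named fuanWeather (the name used by the import paths in the later steps), startproject generates roughly the following layout; weather.py itself is added later by genspider:

    fuanWeather/
        scrapy.cfg                # deploy configuration
        fuanWeather/
            __init__.py
            items.py              # item field definitions (step 6)
            middlewares.py
            pipelines.py          # storage pipeline (step 8)
            settings.py           # project settings (steps 9 and 12)
            spiders/
                __init__.py
                weather.py        # generated by genspider in step 3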
    
  2. I personally prefer VS Code for coding; compared with PyCharm, VS Code is a lightweight editor. Open the terminal and enter the following commands:

    1. scrapy genspider <spider_name> <site_domain>
    2. For example: scrapy genspider weather tianqi.com (genspider expects the site's domain; the Fu'an start URL https://www.tianqi.com/fuan/ is set inside the spider afterwards)
    
  3. This generates weather.py under the project's spiders folder.
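
    The generated weather.py starts as a minimal skeleton roughly like the following (the exact template varies slightly between Scrapy versions); the start_urls entry is then pointed at the Fu'an page:

    import scrapy

    class WeatherSpider(scrapy.Spider):
        name = 'weather'
        allowed_domains = ['tianqi.com']
        start_urls = ['https://www.tianqi.com/fuan/']

        def parse(self, response):
            pass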

  4. Right-clicking on the weather page https://www.tianqi.com/fuan/ does not allow viewing the page source directly, so I first saved the HTML page to the desktop with Ctrl+S and then opened the local copy, where View Source works again from the right-click menu.
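
    Alternatively, Scrapy's interactive shell can download the page and let you test XPath expressions against the live response (if the site returns 403, the USER_AGENT fix from step 12 is needed first):

    scrapy shell https://www.tianqi.com/fuan/
    >>> response.xpath('//dd[@class="name"]/h2/text()').extract()
    >>> view(response)  # opens the downloaded HTML in a browser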

  5. With all of the above in place, we can start writing the core code.

  6. First, write items.py to define the fields we want to capture:

    import scrapy

    class FuanweatherItem(scrapy.Item):
        city = scrapy.Field()          # city name
        date = scrapy.Field()          # date
        week = scrapy.Field()          # day of the week
        Weather = scrapy.Field()       # weather description
        minTempeture = scrapy.Field()  # minimum temperature
        maxTempeture = scrapy.Field()  # maximum temperature
        wind = scrapy.Field()          # wind direction
    
  7. Write weather.py, which scrapes the data from the target page and stores it in items:

    import scrapy
    from fuanWeather.items import FuanweatherItem

    class WeatherSpider(scrapy.Spider):
        name = 'weather'
        allowed_domains = ['tianqi.com']
        start_urls = ['https://www.tianqi.com/fuan/']

        def parse(self, response):
            items = []
            # Extract each piece of the 7-day Fu'an forecast into its own list
            citys = response.xpath('//dd[@class="name"]/h2/text()').extract()
            container = response.xpath('//div[@class="day7"]')
            dates = container.xpath('ul[@class="week"]/li/b/text()').extract()
            weeks = container.xpath('ul[@class="week"]/li/span/text()').extract()
            weathers = container.xpath('ul[@class="txt txt2"]/li/text()').extract()
            minTempetures = container.xpath('div[@class="zxt_shuju"]/ul/li/b/text()').extract()
            maxTempetures = container.xpath('div[@class="zxt_shuju"]/ul/li/span/text()').extract()
            winds = container.xpath('ul[@class="txt"]/li/text()').extract()

            # Build one item per day and collect them in items
            for t in range(7):
                try:
                    item = FuanweatherItem()
                    item['city'] = citys[0]
                    item['date'] = dates[t]
                    item['week'] = weeks[t]
                    item['Weather'] = weathers[t]
                    item['minTempeture'] = minTempetures[t]
                    item['maxTempeture'] = maxTempetures[t]
                    item['wind'] = winds[t]
                    items.append(item)
                except IndexError:
                    # Fewer than 7 entries were scraped; stop instead of crashing
                    break
            return items
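
    Before wiring up the database, the parse method can be checked on its own with Scrapy's parse command, which runs it against a URL and prints the scraped items to the console:

    scrapy parse --spider=weather https://www.tianqi.com/fuan/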
    
  8. Next, write pipelines.py, which takes each scraped item and stores it in the database:

    from itemadapter import ItemAdapter
    import pymysql

    class FuanweatherPipeline:
        def process_item(self, item, spider):
            city = item['city']
            date = item['date']
            week = item['week']
            Weather = item['Weather']
            minTempeture = item['minTempeture']
            maxTempeture = item['maxTempeture']
            wind = item['wind']

            # Connect to the local database (a new connection per item is
            # simple, though reusing one connection would be cheaper)
            connection = pymysql.connect(
                host='localhost',
                user='root',
                passwd='123',
                db='scrapy',
                charset='utf8mb4',
                cursorclass=pymysql.cursors.DictCursor
            )
            # Insert the row
            try:
                with connection.cursor() as cursor:
                    # SQL statement for inserting one day's forecast
                    sql = """INSERT INTO WEATHER(city,date,week,Weather,minTempeture,maxTempeture,wind)
                        VALUES (%s, %s, %s, %s, %s, %s, %s)"""
                    # Execute the statement
                    cursor.execute(sql, (city, date, week, Weather, minTempeture, maxTempeture, wind))
                # Commit the insert
                connection.commit()
            finally:
                connection.close()
            return item
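
    The pipeline assumes a WEATHER table already exists in the scrapy database. A one-off setup sketch along these lines would create it (the column names mirror the item fields; the types are assumptions, since every field is scraped as text):

    import pymysql

    # One-off setup: create the WEATHER table used by the pipeline
    connection = pymysql.connect(host='localhost', user='root', passwd='123',
                                 db='scrapy', charset='utf8mb4')
    try:
        with connection.cursor() as cursor:
            # Columns match the INSERT in FuanweatherPipeline; adjust types as needed
            cursor.execute("""
                CREATE TABLE IF NOT EXISTS WEATHER (
                    id INT AUTO_INCREMENT PRIMARY KEY,
                    city VARCHAR(64),
                    date VARCHAR(32),
                    week VARCHAR(32),
                    Weather VARCHAR(64),
                    minTempeture VARCHAR(16),
                    maxTempeture VARCHAR(16),
                    wind VARCHAR(64)
                ) DEFAULT CHARSET = utf8mb4
            """)
        connection.commit()
    finally:
        connection.close()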
    
  9. Enable the pipeline in settings.py:

    ITEM_PIPELINES = {
        'fuanWeather.pipelines.FuanweatherPipeline': 300
    }
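
    One more settings.py item worth checking: the default project template sets ROBOTSTXT_OBEY = True, and if the site's robots.txt disallows the crawl, Scrapy will refuse to fetch the page. In that case, disable it:

    ROBOTSTXT_OBEY = False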
    
  10. Run the project:

    scrapy crawl weather
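
    To sanity-check the scraped items without touching the database, they can also be exported to a file (this is where the FEED_EXPORT_ENCODING setting from step 12 applies):

    scrapy crawl weather -o weather.json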
    
  11. After the run finishes, check the table in the database to confirm the rows were inserted.
    [Figure: contents of the WEATHER table after the crawl]

  12. Problems you may run into while crawling the weather site
    (1) The crawl fails with a 403 error:
    [Figure: 403 error in the crawl log]
    Solution: set a browser User-Agent in settings.py:

    # settings.py
    # Set exactly one USER_AGENT; a second assignment would silently
    # override the first.

    # Chrome:
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"

    # Firefox (alternative):
    # USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1"
    

    (2) Chinese text comes out as \uXXXX Unicode escapes instead of readable characters.
    Solution: add the following to settings.py:

    FEED_EXPORT_ENCODING = 'utf-8'

    Note that this setting only affects feed exports (e.g. scrapy crawl weather -o weather.json); for MySQL, the charset='utf8mb4' argument on the pipeline's connection is what keeps Chinese text intact.
    

    Feel free to leave comments and questions below.
