Crawling Free Proxy IPs from Xici (西刺)

2021-08-29 23:52:05

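When writing crawlers you often need to switch IPs, so it is well worth maintaining your own IP pool in a database; that way you can switch to a random IP whenever you need one. My approach is to crawl the free proxy IPs listed on Xici, store them in the database, and then add a tools directory to the scrapy project. That directory holds commonly used helper scripts, including this free IP pool.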

The code for crawl_ip_from_xichi.py is as follows:

import time

import pymysql
import requests
from fake_useragent import UserAgent
from scrapy.selector import Selector


class GetIPFromXichi(object):
    """Fetch usable proxy IPs from Xici and store them in the database."""

    def crawl_ip(self):
        """Crawl the free proxy IPs listed on Xici."""
        ip_list = []
        user_agent = UserAgent()
        for i in range(1, 20):  # pages 1 through 19
            # Send a random User-Agent with every request.
            headers = {"User-Agent": user_agent.random}
            # Request the i-th listing page.
            url = "http://www.xicidaili.com/nn/" + str(i)
            response = requests.get(url, headers=headers)
            # Be polite to the site: pause 3 seconds between pages.
            time.sleep(3)
            selector = Selector(text=response.text)
            all_tr = selector.css("#ip_list tr")
            for tr in all_tr[1:]:  # skip the table header row
                # The title attribute looks like "0.123秒" (seconds).
                speed_str = tr.css(".bar::attr(title)").extract_first()
                if speed_str:
                    speed = float(speed_str.split("秒")[0])
                else:
                    speed = 0
                all_text = tr.css("td ::text").extract()
                ip = all_text[0]
                port = all_text[1]
                proxy_type = all_text[6]
                # Fall back to HTTP when the column does not contain HTTP or HTTPS.
                if "HTTP" not in proxy_type.upper():
                    proxy_type = "HTTP"
                ip_list.append((ip, port, proxy_type, speed))

        conn = pymysql.connect(host="127.0.0.1", user="root", password="root", db="outback")
        cursor = conn.cursor()
        insert_sql = """insert into ip_proxy(ip,port,type,speed) VALUES (%s,%s,%s,%s) """
        for row in ip_list:
            try:
                cursor.execute(insert_sql, row)
                conn.commit()
            except Exception as e:
                print(e)
                conn.rollback()
        # Close the cursor and connection only after every row has been processed.
        cursor.close()
        conn.close()


if __name__ == "__main__":
    crawl_ip_from_xichi = GetIPFromXichi()
    crawl_ip_from_xichi.crawl_ip()
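The row-level parsing inside crawl_ip can also be pulled out into small pure functions, which makes it possible to check the logic without hitting the site. This is only a minimal sketch: parse_speed and normalize_type are hypothetical names that are not part of the original script, and normalize_type upper-cases the value, which the original code does not do.

```python
def parse_speed(speed_str):
    """Convert a title such as "0.123秒" into seconds; return 0.0 when it is missing."""
    if not speed_str:
        return 0.0
    return float(speed_str.split("秒")[0])


def normalize_type(raw_type):
    """Return "HTTP" or "HTTPS" in upper case, defaulting to "HTTP" for anything else."""
    cleaned = raw_type.strip().upper()
    return cleaned if cleaned in ("HTTP", "HTTPS") else "HTTP"
```

Inside the loop these would replace the inline if/else blocks, for example speed = parse_speed(speed_str).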

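There are a few places here where it is easy to go wrong:

1. Run the function under the if __name__ == "__main__" guard, so that it is not executed when the class is imported elsewhere later.

2. The database connection must be closed only after the whole loop has finished.

3. To make the crawler friendlier to the site, it sleeps 3 seconds after each page.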
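Once the table is filled, switching IPs at random comes down to picking one stored row and turning it into a proxy setting. The sketch below is an assumption about how that could look, not code from the original project: pick_random_proxy and max_speed are hypothetical names, rows is assumed to be the result of cursor.fetchall() for "select ip, port, type, speed from ip_proxy", and the returned dict follows the proxies format that requests accepts.

```python
import random


def pick_random_proxy(rows, max_speed=None):
    """Pick a random (ip, port, type, speed) row and return a requests-style proxies dict.

    Rows slower than max_speed (in seconds) are skipped when max_speed is given.
    Returns None when no row qualifies.
    """
    candidates = [r for r in rows if max_speed is None or r[3] <= max_speed]
    if not candidates:
        return None
    ip, port, proxy_type, _speed = random.choice(candidates)
    # Assumption: the proxy itself is reached over plain HTTP, and the stored
    # type only decides which request scheme it is used for.
    return {proxy_type.lower(): "http://{0}:{1}".format(ip, port)}
```

The result can be passed straight to requests, for example requests.get(url, proxies=pick_random_proxy(rows)). Free proxies fail often, so a caller would normally retry with another row when a request errors out.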

github https://github.com/573320328/tools
