I am trying to build a multithreaded crawler that uses a Tor proxy.
I use the following to establish the Tor connection:
import random
import socket

import socks
from stem import Signal
from stem.control import Controller

controller = Controller.from_port(port=9151)

def connectTor():
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
    socket.socket = socks.socksocket

def renew_tor():
    global request_headers
    request_headers = {
        "Accept-Language": "en-US,en;q=0.5",
        "User-Agent": random.choice(BROWSERS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Referer": "http://thewebsite2.com",
        "Connection": "close"
    }
    controller.authenticate()
    controller.signal(Signal.NEWNYM)
Here is the URL fetcher:
def get_soup(url):
    while True:
        try:
            connectTor()
            r = requests.Session()
            response = r.get(url, headers=request_headers)
            the_page = response.content.decode('utf-8', errors='ignore')
            the_soup = BeautifulSoup(the_page, 'html.parser')
            if "captcha" in the_page.lower():
                print("flag condition matched while url: ", url)
                #print(the_page)
                renew_tor()
            else:
                return the_soup
                break
        except Exception as e:
            print("Error while URL :", url, str(e))
Then I create the multithreaded fetch jobs:
with futures.ThreadPoolExecutor(200) as executor:
    for url in zurls:
        future = executor.submit(fetchjob, url)
I then get the following error, which I did not see when using multiprocessing:
Socket connection failed (Socket error: 0x01: General SOCKS server failure)
I would appreciate any suggestions for avoiding the SOCKS error and for improving the performance of the multithreaded crawl.
Solution:
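This is a perfect example of why monkey patching socket.socket is a bad idea.
The patch replaces the socket class used by (almost) every socket connection in the process with the SOCKS socket.
When you later connect to the controller, that connection tries to speak the SOCKS protocol instead of opening a direct connection.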
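The effect can be reproduced without Tor or SocksiPy. The sketch below is only an illustration: StandInSocksSocket is a hypothetical stand-in for socks.socksocket, and open_control_socket is a made-up function that mimics library code (such as a controller client) creating an ordinary socket.

```python
import socket


class StandInSocksSocket(socket.socket):
    """Hypothetical stand-in for socks.socksocket (no proxy logic here)."""


def open_control_socket():
    """Mimics unrelated library code that creates a plain TCP socket."""
    return socket.socket(socket.AF_INET, socket.SOCK_STREAM)


original_socket = socket.socket
socket.socket = StandInSocksSocket  # same kind of patch as in connectTor()
try:
    patched = open_control_socket()
    patched_type = type(patched).__name__
    patched.close()
finally:
    socket.socket = original_socket  # undo the patch

plain = open_control_socket()
plain_type = type(plain).__name__
plain.close()

print(patched_type, plain_type)
```

While the patch is active, code that never asked for a proxy still receives the replacement class, which is why the controller connection ends up going through SOCKS.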
Since you are already using requests, I suggest removing SocksiPy and the socket.socket = socks.socksocket line, and using the SOCKS proxy support built into requests:
# 9050 is the default SOCKS port of a standalone tor; Tor Browser listens on 9150
proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050'
}
response = r.get(url, headers=request_headers, proxies=proxies)
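To show how this might fit into the threaded crawler, here is a minimal sketch, not a definitive implementation. The names tor_proxies, get_session, fetch, fetch_all and renew_identity are hypothetical, as are the pool size of 20, the 30 second timeout and the lock around renewal. The ports assume a standalone tor (9050 and 9051); with Tor Browser, as in the question, they would be 9150 and 9151. SOCKS support in requests needs the optional extra (pip install "requests[socks]").

```python
import threading
from concurrent import futures

import requests

TOR_SOCKS_PORT = 9050  # assumption: standalone tor; Tor Browser uses 9150


def tor_proxies(port=TOR_SOCKS_PORT):
    """Build a requests proxy mapping; socks5h resolves DNS via the proxy."""
    address = f"socks5h://127.0.0.1:{port}"
    return {"http": address, "https": address}


_local = threading.local()
_renew_lock = threading.Lock()


def get_session():
    """Return one Session per thread, created on first use."""
    if not hasattr(_local, "session"):
        _local.session = requests.Session()
    return _local.session


def fetch(url):
    """Fetch one page through Tor and return its decoded text."""
    response = get_session().get(url, proxies=tor_proxies(), timeout=30)
    return response.content.decode("utf-8", errors="ignore")


def renew_identity(control_port=9051):
    """Ask Tor for a new circuit over a direct control connection."""
    # stem is imported lazily so the rest of the sketch runs without it
    from stem import Signal
    from stem.control import Controller

    with _renew_lock:  # avoid many threads signalling at the same time
        with Controller.from_port(port=control_port) as controller:
            controller.authenticate()
            controller.signal(Signal.NEWNYM)


def fetch_all(urls, fetch=fetch, max_workers=20):
    """Run fetch over urls in a thread pool, keeping input order."""
    with futures.ThreadPoolExecutor(max_workers) as executor:
        return list(executor.map(fetch, urls))
```

Because nothing is patched globally, the stem connection stays a plain TCP connection while only the requests traffic goes through the SOCKS proxy. Calling fetch_all(zurls) would replace the submit loop from the question.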