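It has been a while since I last wrote a blog post, so let's get straight to the point.
Most of us search with Google all the time. How can you extract the links from the search results?
To extract the result URLs from a Google results page, press F12, switch to the Console tab, paste the statement below, and press Enter.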
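// Print the first link inside every element with class "r".
var tag = document.getElementsByClassName('r');
for (var i = 0; i < tag.length; i++) {
  var a = tag[i].getElementsByTagName("a");
  // Skip containers without a link so a[0] is never undefined.
  if (a.length) {
    console.log(a[0].href);
  }
}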
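For a results page that has been saved to disk as HTML, the same extraction can be sketched in Python with the standard library. This is only an illustration: `ResultLinkParser` and `extract_links` are hypothetical names, and it assumes the saved markup still uses class `r` on the result containers, as the console snippet does. Unlike `a[0].href` in the browser, it returns the raw `href` attribute without resolving relative URLs.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'}


class ResultLinkParser(HTMLParser):
    """Collect the first <a href> inside each element whose class list contains 'r'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._depth = 0       # nesting depth inside the current 'r' element
        self._found = False   # True once this 'r' element has yielded a link

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self._depth:
            self._depth += 1
            if tag == 'a' and not self._found and attrs.get('href'):
                self.links.append(attrs['href'])
                self._found = True
        elif 'r' in (attrs.get('class') or '').split():
            self._depth = 1
            self._found = False

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth -= 1


def extract_links(html):
    """Return the first link of every class-'r' container, in document order."""
    parser = ResultLinkParser()
    parser.feed(html)
    return parser.links
```

Calling `extract_links(open('results.html', encoding='utf-8').read())` on a saved page (the file name is an assumption) would give a list that can be written out one URL per line.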
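Copy the extracted links into oldurl.txt, one URL or domain per line. Before checking them, remove duplicates and blank lines; the script below writes the cleaned list to url.txt.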
import io

readPath = 'oldurl.txt'
writePath = 'url.txt'
lines_seen = set()

# Copy each non-blank line to url.txt the first time it is seen.
with io.open(readPath, 'r', encoding='utf-8') as f, \
        io.open(writePath, 'w', encoding='utf-8') as outfile:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank and whitespace-only lines
        if line not in lines_seen:
            outfile.write(line + '\n')
            lines_seen.add(line)
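If the goal is to check each domain once rather than every individual URL, the list can be reduced to unique host names first. The sketch below is an assumption about how one might do that, not part of the original workflow; `unique_hosts` is a hypothetical helper built on `urllib.parse.urlsplit`.

```python
from urllib.parse import urlsplit


def unique_hosts(lines):
    """Return the distinct host names found in lines, in first-seen order."""
    seen = set()
    hosts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # urlsplit only recognises the host when a scheme or '//' is present,
        # so bare entries such as 'example.org/x' get a '//' prefix.
        if '://' not in line:
            line = '//' + line
        host = urlsplit(line).hostname
        if host and host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts
```

`hostname` is lower-cased and excludes the port, so `https://Example.com/a` and `http://example.com/b` collapse into a single entry.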
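Then check the URLs in bulk. The results are written to two files:
ok.txt: domains that are working normally
red.txt: domains and links that have already been blocked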
#!/usr/bin/env python3
# coding: utf-8
import time

import requests

# Marker the script looks for in the response; a match is treated as "ok".
strxx = '"Code":"102"'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

with open('url.txt', encoding='utf-8') as urls:
    for line in urls:
        y = line.strip()  # drop the trailing newline before building the query
        if not y:
            continue
        x = 'http://wx.rrbay.com/pro/wxUrlCheck.ashx?url=' + y
        html = ''
        try:
            response = requests.get(x, headers=headers, timeout=10)
            html = response.text
            time.sleep(3)  # pause between requests
        except Exception as e:
            print(e)
        if strxx in html:
            print('ok:')
            print(x)
            with open('ok.txt', 'a') as f:
                f.write(y + '\n')
        else:
            # Failed requests also end up here, because html stays empty.
            print('error:')
            print(y)
            with open('red.txt', 'a') as f:
                f.write(y + '\n')
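The checking script mixes network access with the ok/red decision, which makes it hard to try out without hitting the service. A minimal sketch of the decision on its own follows; `classify` and its `fetch` parameter are hypothetical names, and the only thing taken from the script is the rule that a response containing `"Code":"102"` counts as ok while anything else, including a failed request, counts as red.

```python
def classify(urls, fetch, marker='"Code":"102"'):
    """Split urls into (ok, red) lists.

    fetch is any callable that takes a URL and returns the response body
    as a string; an exception from fetch is treated like an empty body.
    """
    ok, red = [], []
    for url in urls:
        try:
            body = fetch(url)
        except Exception:
            body = ''
        if marker in body:
            ok.append(url)
        else:
            red.append(url)
    return ok, red
```

In real use, `fetch` would wrap the `requests.get` call against the checking service; in a dry run it can be a stub that returns canned responses.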