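1. Sample collection
The first step is to collect captcha images. A general-purpose recognition model labels each image, and the site's own login form is used to check whether the label is correct.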
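# Setup omitted in the original post: sess (presumably a requests.Session),
# ua (an object exposing .random, e.g. fake_useragent.UserAgent), get_proxy(),
# target_dir, and the SRP values public_a / client_evidence_m1.
# Selector is assumed to be parsel.Selector.
import hashlib
import os

import requests

true_count = 0
false_count = 0

for i in range(100000):
    sess.headers = {
        "User-Agent": ua.random
    }
    sess.proxies = get_proxy()
    # print(get_proxy())
    # Load the login page to obtain the CSRF token and the session timeout.
    before_url = "https://www.battlenet.com.cn/login/zh/"
    before_resp = sess.get(before_url)
    before_html = Selector(before_resp.text)
    csrf_token = before_html.xpath('//input[@name="csrftoken"]/@value').extract_first()
    session_timeout = before_html.xpath('//input[@name="sessionTimeout"]/@value').extract_first()
    # Fetch the captcha image that belongs to this session.
    captcha_url = "https://www.battlenet.com.cn/login/captcha.jpg"
    captcha_resp = sess.get(captcha_url)
    captcha_bytes = captcha_resp.content
    print(captcha_bytes)
    # Label the image with the locally deployed general-purpose model.
    captcha_text = requests.post("http://127.0.0.1:19952/captcha/v3", data=captcha_bytes).json()['message']
    payload = {
        "accountName": "00000",
        "password": ".",
        "srpEnabled": "true",
        "upgradeVerifier": "",
        "useSrp": "true",
        "publicA": public_a,
        "clientEvidenceM1": client_evidence_m1,
        "persistLogin": "on",
        "captchaInput": captcha_text,
        "csrftoken": csrf_token,
        "sessionTimeout": session_timeout
    }
    resp_submit = sess.post(before_url, data=payload)
    # "找不到该暴雪游戏通行证" ("Battle.net account not found") means the captcha
    # was accepted and only the dummy account was rejected, so the label is correct.
    if "找不到该暴雪游戏通行证" in resp_submit.text:
        tag = hashlib.md5(captcha_bytes).hexdigest()
        # File name format: <label>_<md5 of the image bytes>.png
        name = "{}_{}.png".format(captcha_text, tag)
        print('correct')
        true_count += 1
        with open(os.path.join(target_dir, name), "wb") as f:
            f.write(captcha_bytes)
    else:
        print('wrong')
        false_count += 1
        # print(before_resp.text)
    # Running total, latest label, and running accuracy of the labeling model.
    print(true_count+false_count, captcha_text, true_count / (true_count+false_count))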
Parts of the collection code are omitted. As the figure below shows, it collects correctly labeled samples.
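The collection loop stores each verified sample as <label>_<md5>.png, so the label can be read back from the file name later. The following is a minimal sketch of that naming convention; the helper names sample_name and parse_label are hypothetical and do not appear in the original code.

```python
import hashlib
import os


def sample_name(label: str, image_bytes: bytes, ext: str = "png") -> str:
    """Build a file name in the same format as the collection loop: <label>_<md5>.<ext>."""
    digest = hashlib.md5(image_bytes).hexdigest()
    return "{}_{}.{}".format(label, digest, ext)


def parse_label(file_name: str) -> str:
    """Recover the label: everything before the last underscore of the stem.

    Returns an empty string if the name contains no underscore.
    """
    stem, _ext = os.path.splitext(file_name)
    label, _sep, _digest = stem.rpartition("_")
    return label


if __name__ == "__main__":
    name = sample_name("ab3d", b"fake image bytes")
    print(name, "->", parse_label(name))
```

Because the MD5 digest is part of the name, saving the same image bytes with the same label twice simply overwrites the earlier file, which keeps exact duplicates out of the sample directory.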
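2. Training
Open the training tool
(https://github.com/kerlomz/captcha_trainer)
For a detailed tutorial, see: https://www.jianshu.com/p/80ef04b16efc
- First enter a project name; pressing Enter automatically generates the suffix shown in the figure
- Import the samples collected in step 1 into the sample area
- The network section is configured automatically from the imported samples; then package the samples, as shown below:
- Once packaging finishes, click [Start Training], as shown below:
- As the figure shows, training has started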
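Before importing, it can help to check what the sample set actually contains. The sketch below summarizes the labels encoded in the file names (character set and label lengths). It assumes the <label>_<md5>.png naming from step 1; summarize_labels is a hypothetical helper and is not how the training tool itself derives its configuration.

```python
import os
from collections import Counter


def summarize_labels(file_names):
    """Summarize the labels encoded as <label>_<hash>.<ext> in a list of file names."""
    # The label is everything before the last underscore of the file stem.
    labels = [os.path.splitext(name)[0].rpartition("_")[0] for name in file_names]
    charset = sorted(set("".join(labels)))
    lengths = Counter(len(label) for label in labels)
    return {"count": len(labels), "charset": charset, "lengths": dict(lengths)}


if __name__ == "__main__":
    # Replace "samples" with the directory used as target_dir during collection.
    if os.path.isdir("samples"):
        print(summarize_labels(os.listdir("samples")))
```

An unexpected character or an outlier label length in this summary usually points to a mislabeled or misnamed file that is worth removing before packaging.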
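Using the captcha model
Deployment project: https://github.com/kerlomz/captcha_platform
Prebuilt release (one-click deployment): https://github.com/kerlomz/captcha_platform/releases
A successful startup looks like the figure:
A sample call is shown in the figure:
As the figure shows, a single recognition takes under 10 ms, 8 ms on average, which is fast by industry standards.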
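For readers who want to reproduce the call and the timing, here is a minimal sketch. The endpoint and the 'message' field are taken from the collection script in step 1; the Content-Type header, the function names, and the run count are assumptions made for illustration. Latency measured this way includes HTTP overhead, so it may come out higher than the server-side figure quoted above.

```python
import json
import time
import urllib.request

# Endpoint used by the collection script in step 1.
CAPTCHA_API = "http://127.0.0.1:19952/captcha/v3"


def recognize(image_bytes, url=CAPTCHA_API, timeout=5.0):
    """POST raw image bytes to the deployed service and return the predicted text."""
    req = urllib.request.Request(
        url,
        data=image_bytes,
        method="POST",
        # Assumption: the service reads the raw request body regardless of this header.
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["message"]


def average_latency_ms(predict, image_bytes, runs=20):
    """Average wall-clock time per call in milliseconds, after one warm-up call."""
    predict(image_bytes)  # warm-up, not counted
    start = time.perf_counter()
    for _ in range(runs):
        predict(image_bytes)
    return (time.perf_counter() - start) * 1000.0 / runs
```

With the service running, a call such as average_latency_ms(recognize, open("captcha.jpg", "rb").read()) reports the client-side average, where captcha.jpg stands for any locally saved captcha image.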
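Test results against the official site:
All 13 attempts were recognized correctly, so the recognition rate should be reasonably high.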
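Thirteen trials is a small sample, so it is worth asking how much it actually proves. If every one of n independent trials succeeds, the one-sided lower confidence bound on the true accuracy p solves p^n = 1 - confidence. The sketch below computes that bound; the function name is hypothetical and the 95% level is an assumed choice.

```python
def lower_bound_all_correct(n_trials: int, confidence: float = 0.95) -> float:
    """Lower bound on accuracy when all n_trials independent trials succeeded.

    Solves p ** n_trials = 1 - confidence for p.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    return (1.0 - confidence) ** (1.0 / n_trials)


if __name__ == "__main__":
    print(round(lower_bound_all_correct(13), 3))
```

For 13 out of 13 this gives a bound of roughly 0.79, so the test supports "at least about 79% accurate" at the 95% level; a tighter claim needs more trials.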
Model download
https://github.com/kerlomz/captcha_platform/releases
Postscript
A quick plug for Muggle OCR:
https://pypi.org/project/muggle-ocr/1.0/
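This is a local recognition module that handles both OCR and captchas. It is simple to use: install it with pip and call it in three lines of code.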
import time
# Step 1: import the package
import muggle_ocr
# Step 2: initialize; model_type is either ModelType.OCR or ModelType.Captcha
sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.OCR)
# ModelType.OCR recognizes optically printed text
with open(r"test1.png", "rb") as f:
    b = f.read()
for i in range(5):
    st = time.time()
    # Step 3: call (plain OCR)
    text = sdk.predict(image_bytes=b)
    print(text, time.time() - st)
# ModelType.Captcha recognizes 4-6 character captchas
sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.Captcha)
with open(r"test2.jpg", "rb") as f:
    b = f.read()
for i in range(5):
    st = time.time()
    # Step 3: call (captcha recognition)
    text = sdk.predict(image_bytes=b)
    print(text, time.time() - st)
If you find this useful, you are welcome to join the QQ group: 857149419