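Introduction
The world of web scraping is never quiet. Where there are crawlers there is anti-crawling, and where there is anti-crawling there is anti-anti-crawling.
One of the most effective anti-crawling weapons is the CAPTCHA. The sheer variety of CAPTCHAs is dizzying, and it makes many people abandon a scraping project partway through, which is exactly what site builders want. But CAPTCHAs are not unbreakable: now that deep learning is everywhere, many simple CAPTCHAs may turn out to be surprisingly fragile.
This article shows how to use Python, OpenCV, and a CNN to break one class of CAPTCHA. I hope it gives you a feel for the appeal of deep learning.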
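Getting the Data
I collected 346 CAPTCHA images from an account-registration website.
These CAPTCHAs consist of uppercase letters and digits, contain a lot of noise, and some of the characters stick together.
Labeling the Data
The raw CAPTCHAs alone are not enough to build a model; they first have to be preprocessed into a form suitable for modeling.
The preprocessing method is described in the blog post "OpenCV入门之获取验证码的单个字符(二)" (Getting Started with OpenCV: Extracting Single Characters from a CAPTCHA, Part 2). After that, every character image is labeled by placing it in the appropriate folder. Yes, you read that right: each image is labeled by hand, one by one, which took me a little over 3 hours, o(╥﹏╥)o~ (Labeling data up front is unavoidable for modeling, and it is a painful process; think of WordNet, ImageNet, and so on.)
After labeling there are 31 folders, one per target class; the characters 0, M, W, I, and O do not appear in these CAPTCHAs. This yields 1371 valid characters, that is, 1371 samples. The folder for the letter U, for example, holds all the character images labeled U.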
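As a quick cross-check of the class count, the 31 labels can be derived from the full set of digits and uppercase letters minus the five characters that never appear. This is only a sketch; the names EXCLUDED and CLASSES are mine and do not come from the original scripts.

```python
import string

# Characters that never appear in the collected CAPTCHAs, per the text above.
EXCLUDED = set("0MWIO")

# All digits followed by all uppercase letters, minus the excluded ones.
CLASSES = "".join(
    ch for ch in string.digits + string.ascii_uppercase if ch not in EXCLUDED
)

print(CLASSES)       # 123456789ABCDEFGHJKLNPQRSTUVXYZ
print(len(CLASSES))  # 31
```

The resulting string is the same one the scripts in this article hard-code as their list of folder names.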
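Unifying the Size
Labeling alone still does not make the images ready for modeling, because the individual character images are not all the same size. The samples therefore need to be resized to a common size; after some inspection I settled on 16*20 (16 pixels wide, 20 pixels tall). The Python script is as follows: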
import os
import uuid

import cv2

def convert(dir, file):
    imagepath = dir + '/' + file
    # Read the image in grayscale mode
    image = cv2.imread(imagepath, 0)
    # Binarize
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    # Resize to 16*20; cv2.resize takes the target size as (width, height)
    img = cv2.resize(thresh, (16, 20), interpolation=cv2.INTER_AREA)
    # Save under a new unique name, then delete the original file
    cv2.imwrite('%s/%s.jpg' % (dir, uuid.uuid1()), img)
    os.remove(imagepath)

def main():
    chars = '123456789ABCDEFGHJKLNPQRSTUVXYZ'
    dirs = ['E://verifycode_data/%s' % char for char in chars]
    for dir in dirs:
        for file in os.listdir(dir):
            convert(dir, file)

main()
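To make the size convention concrete: OpenCV's cv2.resize takes its target size as (width, height), so (16, 20) produces an array with 20 rows and 16 columns. The pure-Python sketch below mimics that on a nested list. It uses nearest-neighbour sampling rather than the INTER_AREA interpolation used in the script, and resize_nearest and glyph are illustrative names of my own.

```python
def resize_nearest(pixels, width, height):
    """Resize a 2-D list of pixel values to `height` rows by `width` columns."""
    src_h, src_w = len(pixels), len(pixels[0])
    return [
        [pixels[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]

# A made-up 3x4 "character" (3 rows, 4 columns); 0 = black, 255 = white.
glyph = [
    [255, 0, 0, 255],
    [0, 255, 255, 0],
    [255, 0, 0, 255],
]

resized = resize_nearest(glyph, width=16, height=20)
print(len(resized), len(resized[0]))  # 20 16
```

Whatever the original size of a character crop, the result always has the same 20 x 16 = 320 pixels, which is what lets every sample share one feature layout.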
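The Sample Dataset
With uniformly sized character images in hand, the next step is to turn them into vectors. The images are black and white, so each one is read as a vector of 0/1 values, and its label (the y value) is the name of the folder that contains it. The Python script is as follows: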
import os

import cv2
import pandas as pd

table = []

def Read_Data(dir, file):
    imagepath = dir + '/' + file
    # Read the image in grayscale mode
    image = cv2.imread(imagepath, 0)
    # Binarize (the JPEG files are lossy, so threshold again after loading)
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    # Flatten to 0/1 values: white (255) becomes 1, everything else 0
    bin_values = [1 if pixel == 255 else 0 for pixel in thresh.ravel()]
    # The label is the name of the folder that holds the image
    label = dir.split('/')[-1]
    table.append(bin_values + [label])

def main():
    chars = '123456789ABCDEFGHJKLNPQRSTUVXYZ'
    dirs = ['E://verifycode_data/%s' % char for char in chars]
    print(dirs)
    for dir in dirs:
        for file in os.listdir(dir):
            Read_Data(dir, file)
    # 320 feature columns (v1 to v320) plus one label column
    features = ['v' + str(i) for i in range(1, 16*20 + 1)]
    label = ['label']
    df = pd.DataFrame(table, columns=features + label)
    # print(df.head())
    df.to_csv('E://verifycode_data/data.csv', index=False)

main()
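This converts the character images into labeled vectors stored in data.csv: each row holds the 320 pixel values (columns v1 to v320) followed by the label column.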
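The mapping from one image to one CSV row can be sketched without OpenCV. The helper below is illustrative only: to_row and the 2x2 image are assumptions of mine, but the 255-to-1 rule and the "label is the last path component" rule mirror the script above.

```python
def to_row(binary_pixels, folder):
    """Flatten a binarised image into 0/1 features and append its label."""
    # 255 (white) becomes 1, anything else becomes 0, row by row.
    features = [1 if p == 255 else 0 for row in binary_pixels for p in row]
    # The label is the last path component, e.g. '.../verifycode_data/U' -> 'U'.
    label = folder.rstrip('/').split('/')[-1]
    return features + [label]

tiny = [[255, 0], [0, 255]]          # hypothetical 2x2 image
row = to_row(tiny, 'E://verifycode_data/U')
print(row)  # [1, 0, 0, 1, 'U']
```

For the real 16*20 images the feature part of the row has 320 entries instead of 4.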
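CNN vs. CAPTCHA
With the sample dataset ready, we can build a model with a CNN. A typical CNN stacks several convolution layers and pooling layers and ends with fully connected layers that produce the output.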
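For readers new to these two building blocks, here is a minimal pure-Python sketch of what a convolution layer and a max-pooling layer compute on a single-channel image. It is not the TensorFlow implementation: there is no padding, no bias, and no activation, and the image and kernel are made up. Like most deep learning libraries, it computes a cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool_2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 window."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]

# A 5x5 image whose left three columns are bright.
image = [[1, 1, 1, 0, 0] for _ in range(5)]
# A hypothetical vertical-edge detector: left column minus right column.
kernel = [[1, 0, -1] for _ in range(3)]

features = conv2d_valid(image, kernel)
pooled = max_pool_2x2(features)
print(features)  # [[0, 3, 3], [0, 3, 3], [0, 3, 3]]
print(pooled)    # [[3]]
```

The feature map is zero where the image is flat and large where the bright region ends, and pooling keeps only the strongest response in each window.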
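The CNN in this article has two convolution layers and two pooling layers, followed by a fully connected layer, a dropout layer (intended to guard against overfitting), and finally a softmax layer that outputs the result. The loss function is the log loss (cross-entropy), and the parameters are updated with the Adam optimizer, a variant of gradient descent. The Python code (VerifyCodeCNN.py) is as follows: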
# -*- coding: utf-8 -*-
# Written for TensorFlow 1.x (uses tf.placeholder and tf.Session).
import tensorflow as tf
import logging

# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

class CNN:

    # Initialization
    # Parameters: epoch: number of training iterations
    #             learning_rate: learning rate passed to the Adam optimizer
    #             save_model_path: absolute path where the model is saved
    def __init__(self, epoch, learning_rate, save_model_path):
        self.epoch = epoch
        self.learning_rate = learning_rate
        self.save_model_path = save_model_path

        # Layer 1: convolution + pooling
        # x_image(batch, 16, 20, 1) -> h_pool1(batch, 8, 10, 10)
        x = tf.placeholder(tf.float32, [None, 320])
        self.x = x
        # The last dimension is the number of channels (it would be 3 for RGB).
        # NOTE: cv2.resize(..., (16, 20)) yields arrays with 20 rows and 16 columns,
        # so [-1, 20, 16, 1] would match the true pixel layout. [-1, 16, 20, 1] is
        # kept here because the results reported in this article were obtained with it.
        x_image = tf.reshape(x, [-1, 16, 20, 1])
        W_conv1 = self.weight_variable([3, 3, 1, 10])
        b_conv1 = self.bias_variable([10])
        h_conv1 = tf.nn.relu(self.conv2d(x_image, W_conv1) + b_conv1)
        h_pool1 = self.max_pool_2x2(h_conv1)

        # Layer 2: convolution + pooling
        # h_pool1(batch, 8, 10, 10) -> h_pool2(batch, 4, 5, 20)
        W_conv2 = self.weight_variable([3, 3, 10, 20])
        b_conv2 = self.bias_variable([20])
        h_conv2 = tf.nn.relu(self.conv2d(h_pool1, W_conv2) + b_conv2)
        h_pool2 = self.max_pool_2x2(h_conv2)

        # Layer 3: fully connected
        # h_pool2(batch, 4, 5, 20) -> h_fc1(batch, 200)
        W_fc1 = self.weight_variable([4 * 5 * 20, 200])
        b_fc1 = self.bias_variable([200])
        h_pool2_flat = tf.reshape(h_pool2, [-1, 4 * 5 * 20])
        h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

        # Layer 4: dropout
        # h_fc1 -> h_fc1_drop; keep_prob is supplied through feed_dict
        self.keep_prob = tf.placeholder(dtype=tf.float32)
        h_fc1_drop = tf.nn.dropout(h_fc1, self.keep_prob)

        # Layer 5: softmax output over the 31 classes
        W_fc2 = self.weight_variable([200, 31])
        b_fc2 = self.bias_variable([31])
        self.y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

        # Training and evaluation: cross-entropy loss minimized with Adam
        self.y_true = tf.placeholder(shape=[None, 31], dtype=tf.float32)
        # Cross-entropy, averaged over the batch
        self.cross_entropy = -tf.reduce_mean(tf.reduce_sum(self.y_true * tf.log(self.y_conv), axis=1))
        # Adam optimizer with the learning rate given to the constructor
        self.train_model = tf.train.AdamOptimizer(self.learning_rate).minimize(self.cross_entropy)
        self.saver = tf.train.Saver()
        logger.info('Initialize the model...')

    def train(self, x_data, y_data):
        logger.info('Training the model...')
        with tf.Session() as sess:
            # Initialize all variables
            sess.run(tf.global_variables_initializer())
            # NOTE: keep_prob is fed as 1.0, so dropout is effectively disabled
            # during training; feed a value below 1 (e.g. 0.5) to enable it.
            feed_dict = {self.x: x_data, self.y_true: y_data, self.keep_prob: 1.0}
            # Log the loss about 50 times; max() avoids a zero divisor when epoch < 50
            log_every = max(1, self.epoch // 50)
            # Training loop (full-batch: every step uses the whole training set)
            for i in range(self.epoch + 1):
                sess.run(self.train_model, feed_dict=feed_dict)
                if i % log_every == 0:
                    # Prints "trained i times, loss: ..." to show the improvement
                    print('已训练%d次, loss: %s.' % (i, sess.run(self.cross_entropy, feed_dict=feed_dict)))
            # Save the CNN model
            logger.info('Saving the model...')
            self.saver.save(sess, self.save_model_path)

    def predict(self, data):
        with tf.Session() as sess:
            logger.info('Restoring the model...')
            self.saver.restore(sess, self.save_model_path)
            # keep_prob = 1.0 turns dropout off at prediction time
            predict = sess.run(self.y_conv, feed_dict={self.x: data, self.keep_prob: 1.0})
        return predict

    # Weight initialization: small random values around 0 (truncated normal, stddev 0.1)
    def weight_variable(self, shape):
        initial = tf.truncated_normal(shape, stddev=0.1)
        return tf.Variable(initial)

    # Bias initialization: a small positive constant
    def bias_variable(self, shape):
        initial = tf.constant(0.1, shape=shape)
        return tf.Variable(initial)

    # Convolution with stride 1 and zero padding ('SAME'), so the output keeps the input size
    def conv2d(self, x, W):
        return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

    # Plain 2x2 max pooling with stride 2
    def max_pool_2x2(self, x):
        return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
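The tensor shapes noted in the code comments can be checked with a little arithmetic. The sketch below assumes the usual 'SAME' padding rules (stride-1 convolution keeps the size, stride-2 pooling gives ceil(size / 2)) and also tallies the trainable parameters; the helper names are mine.

```python
import math

def same_conv(h, w):
    # 'SAME' padding with stride 1 leaves height and width unchanged.
    return h, w

def same_pool(h, w, stride=2):
    # 'SAME' pooling rounds up: ceil(size / stride).
    return math.ceil(h / stride), math.ceil(w / stride)

h, w = 16, 20
h, w = same_pool(*same_conv(h, w))   # after conv1 + pool1
shape1 = (h, w, 10)
h, w = same_pool(*same_conv(h, w))   # after conv2 + pool2
shape2 = (h, w, 20)
flat = h * w * 20                    # input size of the fully connected layer

# Trainable parameters: weights plus biases for each layer.
params = {
    'conv1': 3 * 3 * 1 * 10 + 10,
    'conv2': 3 * 3 * 10 * 20 + 20,
    'fc1': flat * 200 + 200,
    'fc2': 200 * 31 + 31,
}
print(shape1, shape2, flat)    # (8, 10, 10) (4, 5, 20) 400
print(sum(params.values()))    # 88351
```

Almost all of the roughly 88 thousand parameters sit in the first fully connected layer, which is typical for small networks of this shape.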
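Training the Model
The 1371 samples above are used to train the CNN with a 70/30 split: about 960 samples for training and about 411 for testing. The model is trained for 1000 iterations, and the learning rate of the Adam optimizer is 0.0005.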
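What those 1000 iterations minimize is the cross-entropy between the one-hot labels and the softmax output. A minimal pure-Python sketch of both functions for a single sample follows; the three-class scores are made up for illustration, whereas the network in this article has 31 classes.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, y_prob):
    """-sum(y_true * log(y_prob)) for one sample with a one-hot y_true."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob) if t)

# Hypothetical 3-class example.
probs = softmax([2.0, 1.0, 0.1])
loss_right = cross_entropy([1, 0, 0], probs)   # true class has the top score
loss_wrong = cross_entropy([0, 0, 1], probs)   # true class has the lowest score
print(round(loss_right, 3), round(loss_wrong, 3))
```

The loss is small when the true class receives most of the probability and grows quickly when it does not, which is the signal the optimizer follows.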
The Python script for training the model is as follows:
# -*- coding: utf-8 -*-
"""
Digit and letter recognition:
multi-class classification of the CAPTCHA dataset with a CNN
"""
from VerifyCodeCNN import CNN
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelBinarizer

CSV_FILE_PATH = 'E://verifycode_data/data.csv'  # path to the CSV file
df = pd.read_csv(CSV_FILE_PATH)  # read the CSV file

# Feature columns of the dataset
features = ['v'+str(i+1) for i in range(16*20)]
labels = df['label'].unique()

# One-hot encode (binarize) the true labels
lb = LabelBinarizer()
lb.fit(labels)
y_true = pd.DataFrame(lb.transform(df['label']), columns=['y'+str(i) for i in range(31)])
y_bin_columns = list(y_true.columns)
for col in y_bin_columns:
    df[col] = y_true[col]

# Split the dataset: 70% training, 30% test
x_train, x_test, y_train, y_test = train_test_split(df[features], df[y_bin_columns],
                                                    train_size=0.7, test_size=0.3, random_state=123)

# Build the CNN
# Path where the model is saved
MODEL_SAVE_PATH = 'E://logs/cnn_verifycode.ckpt'
# Initialize the CNN
cnn = CNN(1000, 0.0005, MODEL_SAVE_PATH)

# Train the CNN
cnn.train(x_train, y_train)

# Predict on the test set
y_pred = cnn.predict(x_test)

# Map each output vector back to a class;
# lb.classes_ is the column order LabelBinarizer used for the one-hot encoding
prediction = []
for pred in y_pred:
    label = lb.classes_[list(pred).index(max(pred))]
    prediction.append(label)

# Compute the prediction accuracy
x_test = x_test.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
x_test['prediction'] = prediction
x_test['label'] = df.loc[y_test.index, 'label']
print(x_test.head())
accuracy = accuracy_score(x_test['label'], x_test['prediction'])
# Prints "The CNN's prediction accuracy is xx.xx%."
print('CNN的预测准确率为%.2f%%.' % (accuracy*100))
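The script leans on two small ideas: turning a label into a one-hot vector before training, and turning a vector of 31 probabilities back into a label afterwards. A standard-library sketch of both is given below. It is not scikit-learn's LabelBinarizer, but it uses the same sorted class order; one_hot and decode are names I made up.

```python
CLASSES = '123456789ABCDEFGHJKLNPQRSTUVXYZ'

def one_hot(label, classes=CLASSES):
    """Return a 0/1 list with a single 1 at the position of `label`."""
    return [1 if c == label else 0 for c in classes]

def decode(probabilities, classes=CLASSES):
    """Return the class whose probability is largest (argmax)."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best]

y = one_hot('U')
print(sum(y), y.index(1))   # 1 26

fake_output = [0.0] * 31    # a made-up network output
fake_output[13] = 0.9
print(decode(fake_output))  # E
```

Because digits sort before uppercase letters, sorting the 31 labels gives exactly the order of the CLASSES string, so index i of the network output always refers to CLASSES[i].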
Training this CNN took about 75 minutes in total. The output is as follows:
2018-09-24 11:51:17,784 - INFO: Initialize the model...
2018-09-24 11:51:17,784 - INFO: Training the model...
2018-09-24 11:51:17.793631: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
已训练0次, loss: 3.5277689.
已训练20次, loss: 3.2297606.
已训练40次, loss: 2.8372495.
已训练60次, loss: 1.9687067.
已训练80次, loss: 0.90995216.
已训练100次, loss: 0.42356998.
已训练120次, loss: 0.25189978.
已训练140次, loss: 0.16736577.
已训练160次, loss: 0.116674595.
已训练180次, loss: 0.08325087.
已训练200次, loss: 0.06060778.
已训练220次, loss: 0.045051433.
已训练240次, loss: 0.03401592.
已训练260次, loss: 0.026168587.
已训练280次, loss: 0.02056558.
已训练300次, loss: 0.01649161.
已训练320次, loss: 0.013489108.
已训练340次, loss: 0.011219621.
已训练360次, loss: 0.00946489.
已训练380次, loss: 0.008093053.
已训练400次, loss: 0.0069935927.
已训练420次, loss: 0.006101626.
已训练440次, loss: 0.0053245267.
已训练460次, loss: 0.004677901.
已训练480次, loss: 0.0041349586.
已训练500次, loss: 0.0036762774.
已训练520次, loss: 0.003284876.
已训练540次, loss: 0.0029500276.
已训练560次, loss: 0.0026618005.
已训练580次, loss: 0.0024126293.
已训练600次, loss: 0.0021957452.
已训练620次, loss: 0.0020071461.
已训练640次, loss: 0.0018413183.
已训练660次, loss: 0.001695599.
已训练680次, loss: 0.0015665392.
已训练700次, loss: 0.0014519279.
已训练720次, loss: 0.0013496162.
已训练740次, loss: 0.001257321.
已训练760次, loss: 0.0011744777.
已训练780次, loss: 0.001099603.
已训练800次, loss: 0.0010316349.
已训练820次, loss: 0.0009697884.
已训练840次, loss: 0.00091331534.
已训练860次, loss: 0.0008617487.
已训练880次, loss: 0.0008141668.
已训练900次, loss: 0.0007705136.
已训练920次, loss: 0.0007302323.
已训练940次, loss: 0.00069312396.
已训练960次, loss: 0.0006586343.
已训练980次, loss: 0.00062668725.
2018-09-24 13:07:42,272 - INFO: Saving the model...
已训练1000次, loss: 0.0005970755.
2018-09-24 13:07:42,538 - INFO: Restoring the model...
INFO:tensorflow:Restoring parameters from E://logs/cnn_verifycode.ckpt
2018-09-24 13:07:42,538 - INFO: Restoring parameters from E://logs/cnn_verifycode.ckpt
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 ... v313 v314 v315 v316 \
657 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1
18 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1
700 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1
221 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1
1219 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1
v317 v318 v319 v320 prediction label
657 1 1 1 1 G G
18 1 1 1 1 T 1
700 1 1 1 1 H H
221 1 1 1 1 5 5
1219 1 1 1 1 V V
[5 rows x 322 columns]
CNN的预测准确率为93.45%.
As the output shows, the CNN reaches a prediction accuracy of 93.45% on the test set, which is a decent result. The trained model is saved as E://logs/cnn_verifycode.ckpt.
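Two simple baselines help put these numbers in context. With 31 equally likely classes, a model that guesses uniformly has a cross-entropy of ln 31 and an accuracy of 1/31. The short calculation below is my own sanity check, not part of the original experiment.

```python
import math

n_classes = 31
uniform_loss = math.log(n_classes)   # cross-entropy of a uniform guess
chance_accuracy = 1 / n_classes      # accuracy of random guessing

print(round(uniform_loss, 3))            # 3.434
print(round(chance_accuracy * 100, 1))   # 3.2
```

The first logged loss, about 3.53, sits close to the uniform-guess value of 3.43, as one would expect from a freshly initialized network, and 93.45% is far above the roughly 3.2% chance level.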
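Predicting New CAPTCHAs
With the model trained, it is time to witness the magic!
I collected 60 fresh CAPTCHAs from the same account-registration website.
I wrote a Python script that predicts a CAPTCHA, as follows: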
# -*- coding: utf-8 -*-
"""
Recognize CAPTCHAs with the trained CNN model
(trained on 960 samples for 1000 iterations, loss: 0.00059, test-set accuracy: 93.45%.)
"""
import os
import cv2
import pandas as pd
from VerifyCodeCNN import CNN

def split_picture(imagepath):
    # Read the image in grayscale mode
    gray = cv2.imread(imagepath, 0)
    # Paint the image border white
    height, width = gray.shape
    gray[0, :] = 255
    gray[height-1, :] = 255
    gray[:, 0] = 255
    gray[:, width-1] = 255
    # Median filter with a 3*3 kernel
    blur = cv2.medianBlur(gray, 3)
    # Binarize
    ret, thresh1 = cv2.threshold(blur, 200, 255, cv2.THRESH_BINARY)
    # Extract the individual characters
    chars_list = []
    # [-2] picks the contour list on both OpenCV 3.x (3 return values) and 4.x (2 return values)
    contours = cv2.findContours(thresh1, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)[-2]
    for cnt in contours:
        # Upright bounding rectangle of the contour
        x, y, w, h = cv2.boundingRect(cnt)
        # Skip boxes that touch the top/left border and boxes that are too small
        if x != 0 and y != 0 and w*h >= 100:
            chars_list.append((x, y, w, h))
    # Order the characters from left to right
    sorted_chars_list = sorted(chars_list, key=lambda x: x[0])
    for i, item in enumerate(sorted_chars_list):
        x, y, w, h = item
        cv2.imwrite('E://test_verifycode/chars/%d.jpg' % (i+1), thresh1[y:y+h, x:x+w])

def remove_edge_picture(imagepath):
    # Delete a crop when at least three of its four corners are dark (treated as noise)
    image = cv2.imread(imagepath, 0)
    height, width = image.shape
    corner_list = [image[0, 0] < 127,
                   image[height-1, 0] < 127,
                   image[0, width-1] < 127,
                   image[height-1, width-1] < 127
                   ]
    if sum(corner_list) >= 3:
        os.remove(imagepath)

def resplit_with_parts(imagepath, parts):
    image = cv2.imread(imagepath, 0)
    os.remove(imagepath)
    height, width = image.shape
    file_name = imagepath.split('/')[-1].split(r'.')[0]
    # Cut the image into `parts` pieces of equal width
    step = width // parts  # step size
    start = 0  # start column
    for i in range(parts):
        cv2.imwrite('E://test_verifycode/chars/%s.jpg' % (file_name + '-' + str(i)),
                    image[:, start:start+step])
        start += step

def resplit(imagepath):
    # Wide crops are assumed to contain several characters stuck together
    image = cv2.imread(imagepath, 0)
    height, width = image.shape
    if width >= 64:
        resplit_with_parts(imagepath, 4)
    elif width >= 48:
        resplit_with_parts(imagepath, 3)
    elif width >= 26:
        resplit_with_parts(imagepath, 2)

# Convert to 16*20 size, overwriting the file in place
def convert(dir, file):
    imagepath = dir + '/' + file
    # Read the image in grayscale mode
    image = cv2.imread(imagepath, 0)
    # Binarize
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    img = cv2.resize(thresh, (16, 20), interpolation=cv2.INTER_AREA)
    # Save the image
    cv2.imwrite('%s/%s' % (dir, file), img)

# Read an image and convert it to 0/1 values
def Read_Data(dir, file):
    imagepath = dir + '/' + file
    # Read the image in grayscale mode
    image = cv2.imread(imagepath, 0)
    # Binarize
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    # White (255) becomes 1, everything else 0
    bin_values = [1 if pixel == 255 else 0 for pixel in thresh.ravel()]
    return bin_values

def main():
    VerifyCodePath = 'E://test_verifycode/E224.jpg'
    dir = 'E://test_verifycode/chars'
    os.makedirs(dir, exist_ok=True)  # make sure the working folder exists
    # Clear out files left over from a previous run
    for file in os.listdir(dir):
        os.remove(dir + '/' + file)

    split_picture(VerifyCodePath)

    files = os.listdir(dir)
    if not files:
        print('查看的文件夹为空!')  # "the folder is empty": no characters were extracted
    else:
        # Remove noise crops
        for file in files:
            remove_edge_picture(dir + '/' + file)
        # Re-split crops that contain stuck-together characters
        for file in os.listdir(dir):
            resplit(dir + '/' + file)
        # Resize every crop to 16*20
        for file in os.listdir(dir):
            convert(dir, file)
        # One 0/1 vector per character; sorted() keeps the files (1.jpg, 2-0.jpg, 2-1.jpg, ...)
        # in left-to-right order regardless of how the OS lists them
        table = [Read_Data(dir, file) for file in sorted(os.listdir(dir))]
        test_data = pd.DataFrame(table, columns=['v%d' % i for i in range(1, 321)])

        # Path where the model is saved
        MODEL_SAVE_PATH = 'E://logs/cnn_verifycode.ckpt'
        # Initialize the CNN
        cnn = CNN(1000, 0.0005, MODEL_SAVE_PATH)
        y_pred = cnn.predict(test_data)

        # Map each output vector back to a character
        prediction = []
        labels = '123456789ABCDEFGHJKLNPQRSTUVXYZ'
        for pred in y_pred:
            label = labels[list(pred).index(max(pred))]
            prediction.append(label)
        print(prediction)

main()
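The segmentation logic of the script boils down to three rules that can be tried without OpenCV: filter and sort the bounding boxes, decide from a crop's width how many characters it probably holds, and cut it into equal-width pieces. The sketch below restates those rules with the same thresholds as the script (area at least 100, widths 26/48/64); the function names and the sample boxes are hypothetical.

```python
def keep_and_sort(boxes, min_area=100):
    """Drop boxes touching the top/left border or smaller than min_area; sort left to right."""
    kept = [(x, y, w, h) for (x, y, w, h) in boxes
            if x != 0 and y != 0 and w * h >= min_area]
    return sorted(kept, key=lambda box: box[0])

def parts_for_width(width):
    """How many characters a crop of this width is assumed to contain."""
    if width >= 64:
        return 4
    if width >= 48:
        return 3
    if width >= 26:
        return 2
    return 1

def split_columns(width, parts):
    """Column ranges produced by cutting `width` into `parts` equal steps."""
    step = width // parts
    return [(i * step, (i + 1) * step) for i in range(parts)]

# Hypothetical bounding boxes (x, y, w, h), in the format cv2.boundingRect returns.
boxes = [(60, 5, 14, 20), (0, 0, 100, 40), (10, 6, 30, 21), (45, 8, 3, 3)]
for x, y, w, h in keep_and_sort(boxes):
    n = parts_for_width(w)
    print(x, n, split_columns(w, n))
# 10 2 [(0, 15), (15, 30)]
# 60 1 [(0, 14)]
```

Equal-width splitting is a heuristic: it works when stuck-together characters have similar widths, and a narrow character next to a wide one will be cut in the wrong place.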
Taking the image E224.jpg as an example, the output is:
2018-09-25 20:50:33,227 - INFO: Initialize the model...
2018-09-25 20:50:33.238309: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-09-25 20:50:33,227 - INFO: Restoring the model...
INFO:tensorflow:Restoring parameters from E://logs/cnn_verifycode.ckpt
2018-09-25 20:50:33,305 - INFO: Restoring parameters from E://logs/cnn_verifycode.ckpt
['E', '2', '2', '4']
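The prediction is completely correct. Running the same test on all 60 images, 54 are predicted entirely correctly and the other 6 contain some wrong characters, so the whole-CAPTCHA accuracy is 90%.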
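Note that the 93.45% figure is a per-character accuracy, while the 90% figure counts whole CAPTCHAs. The two can be related with a rough calculation under assumptions of my own: four characters per CAPTCHA (as in E224) and independent errors between characters.

```python
chars_per_captcha = 4          # assumption: four characters, as in E224
per_char_test_acc = 0.9345     # single-character accuracy reported on the test set
whole_captcha_acc = 54 / 60    # fully correct CAPTCHAs reported above

# If character errors were independent, a CAPTCHA is right only when all are right.
expected_whole = per_char_test_acc ** chars_per_captcha
# Conversely, the per-character accuracy implied by the whole-CAPTCHA result.
implied_per_char = whole_captcha_acc ** (1 / chars_per_captcha)

print(round(expected_whole, 3))     # 0.763
print(round(implied_per_char, 3))   # 0.974
```

Under these assumptions the new images were recognized somewhat better per character than the test set would suggest, though with only 60 CAPTCHAs the difference should not be over-interpreted.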
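Summary
The CNN model really shines in CAPTCHA recognition, and it gives a good sense of how powerful deep learning is~
Of course, text-based CAPTCHAs like these are relatively simple, and this is just one application of CNNs. Harder CAPTCHAs call for a more involved pipeline; I hope that after reading this article you will try recognizing some harder ones yourself~~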