Obtaining the data
I collected captchas from an account-registration website, 346 of them in total, shown below:
The captcha dataset
As you can see, these captchas consist of uppercase letters and digits, contain a fair amount of noise, and some of the characters stick together.
Labeling the data
These captchas alone are not enough to build a model; we first need to preprocess them so that they meet the requirements of modeling. The preprocessing method is described in the blog post OpenCV入門之獲取驗證碼的單個字元(二). After that, every image is labeled and moved into the appropriate folder. Yes, you read that right: each image was labeled by hand, one by one, which took me more than three hours o(╥﹏╥)o~ (For modeling, labeling the data up front is unavoidable, and of course it is a painful process; think of WordNet, ImageNet, etc.) The folders after labeling look like this:
The folders after labeling
As you can see, there are 31 folders in total, i.e. 31 target classes; the characters 0, M, W, I and O do not appear in the captchas. This yields 1371 valid characters, i.e. 1371 samples. Taking the letter U as an example, the images in its folder look like this:
Samples of the letter U
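Before moving on, it can be handy to sanity-check the class distribution. The following is only a small sketch, assuming the folder layout above under E://verifycode_data, that counts the samples per character folder and the total:

```python
import os

# Hypothetical sanity check: count samples per character folder.
# Assumes the labeled images live in E://verifycode_data/<char>/.
chars = '123456789ABCDEFGHJKLNPQRSTUVXYZ'
total = 0
for char in chars:
    folder = 'E://verifycode_data/%s' % char
    n = len(os.listdir(folder))
    total += n
    print('%s: %d samples' % (char, n))
print('total: %d samples' % total)   # should come out to 1371 here
```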
Unifying the image size
Labeling the images alone is still not enough for modeling, because the extracted character images come in different sizes. We therefore need to resize all sample characters to a common size; after some inspection I settled on 16*20. The Python script that does this is as follows:
```python
import os
import cv2
import uuid

def convert(dir, file):
    imagepath = dir+'/'+file
    # read the image in grayscale
    image = cv2.imread(imagepath, 0)
    # binarize
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    # resize to 16*20
    img = cv2.resize(thresh, (16, 20), interpolation=cv2.INTER_AREA)
    # save the resized image under a new (uuid) name and remove the original
    cv2.imwrite('%s/%s.jpg' % (dir, uuid.uuid1()), img)
    os.remove(imagepath)

def main():
    chars = '123456789ABCDEFGHJKLNPQRSTUVXYZ'
    dirs = ['E://verifycode_data/%s' % char for char in chars]
    for dir in dirs:
        for file in os.listdir(dir):
            convert(dir, file)

main()
```
The sample dataset
With the character images resized to a common size, we need to turn them into vectors. The images are black and white, so we read each image as a vector of 0/1 values; its label (the y value) is the name of the folder the image lives in. The Python script is as follows:
```python
import os
import cv2
import pandas as pd

table = []

def Read_Data(dir, file):
    imagepath = dir+'/'+file
    # read the image in grayscale
    image = cv2.imread(imagepath, 0)
    # binarize
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    # convert pixels to 0/1 values; the label is the folder name
    bin_values = [1 if pixel == 255 else 0 for pixel in thresh.ravel()]
    label = dir.split('/')[-1]
    table.append(bin_values + [label])

def main():
    chars = '123456789ABCDEFGHJKLNPQRSTUVXYZ'
    dirs = ['E://verifycode_data/%s' % char for char in chars]
    print(dirs)
    for dir in dirs:
        for file in os.listdir(dir):
            Read_Data(dir, file)
    features = ['v'+str(i) for i in range(1, 16*20+1)]
    label = ['label']
    df = pd.DataFrame(table, columns=features+label)
    # print(df.head())
    df.to_csv('E://verifycode_data/data.csv', index=False)

main()
```
This turns the sample character images into the vectors and labels stored in data.csv; part of data.csv looks like this:
Vectors and labels corresponding to the character images
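As a quick check that data.csv was written as intended, a short pandas snippet along these lines can be used; the column names match the script above, and the expected counts are only what I would anticipate for this dataset:

```python
import pandas as pd

# load the CSV produced by the script above
df = pd.read_csv('E://verifycode_data/data.csv')

print(df.shape)                      # expected: (1371, 321) -> 320 pixel columns + 1 label column
print(df['label'].value_counts())    # number of samples per character
print(df.iloc[0, :320].sum())        # number of white (1) pixels in the first sample
```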
CNN vs. the captcha
With the sample dataset in place, we can now build a CNN model. A typical CNN consists of several convolution layers and pooling layers, with a fully connected network producing the final output; a sketch of the structure is shown below:
Sketch of the CNN model structure
The CNN model in this article consists of two convolution layers and two pooling layers, on top of which a dropout layer is added (to prevent overfitting), followed by a fully connected layer, and finally a softmax layer that produces the output. The loss function is the cross-entropy (log) loss, and the model parameters are tuned with the gradient-based Adam optimizer. The Python code (VerifyCodeCNN.py) is as follows:
```python
# -*- coding: utf-8 -*-
import tensorflow as tf
import logging

# logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)


class CNN:
    # Initialization
    # Parameters: epoch: number of training iterations
    #             learning_rate: learning rate used by the optimizer
    #             save_model_path: absolute path where the model is saved
    def __init__(self, epoch, learning_rate, save_model_path):
        self.epoch = epoch
        self.learning_rate = learning_rate
        self.save_model_path = save_model_path

        """
        Layer 1: convolution + pooling
        x_image(batch, 16, 20, 1) -> h_pool1(batch, 8, 10, 10)
        """
        x = tf.placeholder(tf.float32, [None, 320])
        self.x = x
        x_image = tf.reshape(x, [-1, 16, 20, 1])   # the last dimension is the number of channels (3 for RGB)
        W_conv1 = self.weight_variable([3, 3, 1, 10])
        b_conv1 = self.bias_variable([10])
        h_conv1 = tf.nn.relu(self.conv2d(x_image, W_conv1) + b_conv1)
        h_pool1 = self.max_pool_2x2(h_conv1)

        """
        Layer 2: convolution + pooling
        h_pool1(batch, 8, 10, 10) -> h_pool2(batch, 4, 5, 20)
        """
        W_conv2 = self.weight_variable([3, 3, 10, 20])
        b_conv2 = self.bias_variable([20])
        h_conv2 = tf.nn.relu(self.conv2d(h_pool1, W_conv2) + b_conv2)
        h_pool2 = self.max_pool_2x2(h_conv2)

        """
        Layer 3: fully connected layer
        h_pool2(batch, 4, 5, 20) -> h_fc1(batch, 200)
        """
        W_fc1 = self.weight_variable([4 * 5 * 20, 200])
        b_fc1 = self.bias_variable([200])
        h_pool2_flat = tf.reshape(h_pool2, [-1, 4 * 5 * 20])
        h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

        """
        Layer 4: dropout layer
        h_fc1 -> h_fc1_drop, enabled during training, disabled during testing
        """
        self.keep_prob = tf.placeholder(dtype=tf.float32)
        h_fc1_drop = tf.nn.dropout(h_fc1, self.keep_prob)

        """
        Layer 5: softmax output layer
        """
        W_fc2 = self.weight_variable([200, 31])
        b_fc2 = self.bias_variable([31])
        self.y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

        """
        Training and evaluation
        The Adam optimizer performs the gradient updates; keep_prob in feed_dict controls the dropout rate
        """
        self.y_true = tf.placeholder(shape=[None, 31], dtype=tf.float32)
        # cross-entropy loss
        self.cross_entropy = -tf.reduce_mean(tf.reduce_sum(self.y_true * tf.log(self.y_conv), axis=1))
        # fine-tune with the Adam optimizer at the given learning rate
        self.train_model = tf.train.AdamOptimizer(self.learning_rate).minimize(self.cross_entropy)

        self.saver = tf.train.Saver()
        logger.info('Initialize the model...')

    def train(self, x_data, y_data):
        logger.info('Training the model...')
        with tf.Session() as sess:
            # initialize all variables
            sess.run(tf.global_variables_initializer())
            feed_dict = {self.x: x_data, self.y_true: y_data, self.keep_prob: 1.0}
            # iterative training
            for i in range(self.epoch + 1):
                sess.run(self.train_model, feed_dict=feed_dict)
                if i % int(self.epoch / 50) == 0:
                    # to see the step improvement
                    print('已訓練%d次, loss: %s.' % (i, sess.run(self.cross_entropy, feed_dict=feed_dict)))
            # save the model
            logger.info('Saving the model...')
            self.saver.save(sess, self.save_model_path)

    def predict(self, data):
        with tf.Session() as sess:
            logger.info('Restoring the model...')
            self.saver.restore(sess, self.save_model_path)
            predict = sess.run(self.y_conv, feed_dict={self.x: data, self.keep_prob: 1.0})
        return predict

    """
    Weight initialization: small positive values close to 0
    """
    def weight_variable(self, shape):
        initial = tf.truncated_normal(shape, stddev=0.1)
        return tf.Variable(initial)

    def bias_variable(self, shape):
        initial = tf.constant(0.1, shape=shape)
        return tf.Variable(initial)

    """
    Convolution and pooling: stride 1 and zero padding for the convolution,
    plain 2x2 max pooling
    """
    def conv2d(self, x, W):
        return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

    def max_pool_2x2(self, x):
        return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
```
Training the model
The CNN model is trained on the 1371 samples described above: 960 samples form the training set and the remaining 411 the test set. The model is trained for 1000 iterations with a learning rate of 0.0005. The training script is as follows:
```python
# -*- coding: utf-8 -*-
"""
Digit/letter recognition: multi-class classification of the captcha dataset with a CNN
"""
from VerifyCodeCNN import CNN
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelBinarizer

CSV_FILE_PATH = 'E://verifycode_data/data.csv'   # path of the CSV file
df = pd.read_csv(CSV_FILE_PATH)                  # read the CSV file

# features of the dataset
features = ['v'+str(i+1) for i in range(16*20)]
labels = df['label'].unique()

# binarize the true labels of the samples (one-hot encoding)
lb = LabelBinarizer()
lb.fit(labels)
y_true = pd.DataFrame(lb.transform(df['label']), columns=['y'+str(i) for i in range(31)])
y_bin_columns = list(y_true.columns)
for col in y_bin_columns:
    df[col] = y_true[col]

# split the dataset: 70% training set, 30% test set
x_train, x_test, y_train, y_test = train_test_split(df[features], df[y_bin_columns], \
                                                    train_size=0.7, test_size=0.3, random_state=123)

# build the CNN and make predictions
# path where the model is saved
MODEL_SAVE_PATH = 'E://logs/cnn_verifycode.ckpt'
# initialize the CNN
cnn = CNN(1000, 0.0005, MODEL_SAVE_PATH)
# train the CNN
cnn.train(x_train, y_train)
# predict on the test data
y_pred = cnn.predict(x_test)

# map each prediction back to its character
# (the one-hot columns follow lb.classes_, i.e. the sorted class labels)
prediction = []
for pred in y_pred:
    label = lb.classes_[list(pred).index(max(pred))]
    prediction.append(label)

# compute the prediction accuracy
x_test['prediction'] = prediction
x_test['label'] = df['label'][y_test.index]
print(x_test.head())
accuracy = accuracy_score(x_test['prediction'], x_test['label'])
print('CNN的預測準確率為%.2f%%.' % (accuracy*100))
```
Training the CNN model took 75 minutes in total; the output is as follows:
```
2018-09-24 11:51:17,784 - INFO: Initialize the model...
2018-09-24 11:51:17,784 - INFO: Training the model...
2018-09-24 11:51:17.793631: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
已訓練0次, loss: 3.5277689.
已訓練20次, loss: 3.2297606.
已訓練40次, loss: 2.8372495.
已訓練60次, loss: 1.9687067.
已訓練80次, loss: 0.90995216.
已訓練100次, loss: 0.42356998.
已訓練120次, loss: 0.25189978.
已訓練140次, loss: 0.16736577.
已訓練160次, loss: 0.116674595.
已訓練180次, loss: 0.08325087.
已訓練200次, loss: 0.06060778.
已訓練220次, loss: 0.045051433.
已訓練240次, loss: 0.03401592.
已訓練260次, loss: 0.026168587.
已訓練280次, loss: 0.02056558.
已訓練300次, loss: 0.01649161.
已訓練320次, loss: 0.013489108.
已訓練340次, loss: 0.011219621.
已訓練360次, loss: 0.00946489.
已訓練380次, loss: 0.008093053.
已訓練400次, loss: 0.0069935927.
已訓練420次, loss: 0.006101626.
已訓練440次, loss: 0.0053245267.
已訓練460次, loss: 0.004677901.
已訓練480次, loss: 0.0041349586.
已訓練500次, loss: 0.0036762774.
已訓練520次, loss: 0.003284876.
已訓練540次, loss: 0.0029500276.
已訓練560次, loss: 0.0026618005.
已訓練580次, loss: 0.0024126293.
已訓練600次, loss: 0.0021957452.
已訓練620次, loss: 0.0020071461.
已訓練640次, loss: 0.0018413183.
已訓練660次, loss: 0.001695599.
已訓練680次, loss: 0.0015665392.
已訓練700次, loss: 0.0014519279.
已訓練720次, loss: 0.0013496162.
已訓練740次, loss: 0.001257321.
已訓練760次, loss: 0.0011744777.
已訓練780次, loss: 0.001099603.
已訓練800次, loss: 0.0010316349.
已訓練820次, loss: 0.0009697884.
已訓練840次, loss: 0.00091331534.
已訓練860次, loss: 0.0008617487.
已訓練880次, loss: 0.0008141668.
已訓練900次, loss: 0.0007705136.
已訓練920次, loss: 0.0007302323.
已訓練940次, loss: 0.00069312396.
已訓練960次, loss: 0.0006586343.
已訓練980次, loss: 0.00062668725.
2018-09-24 13:07:42,272 - INFO: Saving the model...
已訓練1000次, loss: 0.0005970755.
2018-09-24 13:07:42,538 - INFO: Restoring the model...
INFO:tensorflow:Restoring parameters from E://logs/cnn_verifycode.ckpt
2018-09-24 13:07:42,538 - INFO: Restoring parameters from E://logs/cnn_verifycode.ckpt
      v1  v2  v3  v4  v5  v6  v7  v8  v9  v10 ...  v313  v314  v315  v316  \
657    1   1   1   1   1   1   1   1   1    1 ...     1     1     1     1
18     1   1   1   1   1   1   1   1   1    1 ...     1     1     1     1
700    1   1   1   1   1   1   1   1   1    1 ...     1     1     1     1
221    1   1   1   1   1   1   1   1   1    1 ...     1     1     1     1
1219   1   1   1   1   1   1   1   1   1    1 ...     1     1     1     1

      v317  v318  v319  v320 prediction label
657      1     1     1     1          G     G
18       1     1     1     1          T     1
700      1     1     1     1          H     H
221      1     1     1     1          5     5
1219     1     1     1     1          V     V

[5 rows x 322 columns]
CNN的預測準確率為93.45%.
```
As you can see, the CNN model achieves a prediction accuracy of 93.45% on the test set, which is quite acceptable. The trained model is saved as E://logs/cnn_verifycode.ckpt.
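Beyond the overall accuracy, it can be instructive to see which characters the model confuses. A small sketch like the following, run right after the training script above (it reuses the 'label' and 'prediction' columns added to x_test there; classification_report is just one convenient option), breaks the result down per class:

```python
from sklearn.metrics import classification_report, confusion_matrix

# per-class precision/recall on the test set,
# using the 'label' and 'prediction' columns added by the training script
print(classification_report(x_test['label'], x_test['prediction']))

# confusion matrix to see which characters get mixed up (e.g. 1 vs T)
print(confusion_matrix(x_test['label'], x_test['prediction']))
```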
Predicting new captchas
With the model trained, it is time for the moment of truth! I grabbed another 60 captchas from the same registration website; the new captchas look like this:
The new captchas
I wrote a Python script to predict these captchas, shown below:
```python
# -*- coding: utf-8 -*-
"""
Recognize captchas with the trained CNN model
(trained on 960 samples for 1000 iterations, loss: 0.00059, accuracy on the test set: 93.45%)
"""
import os
import cv2
import pandas as pd
from VerifyCodeCNN import CNN

def split_picture(imagepath):
    # read the image in grayscale mode
    gray = cv2.imread(imagepath, 0)
    # turn the border of the image white
    height, width = gray.shape
    for i in range(width):
        gray[0, i] = 255
        gray[height-1, i] = 255
    for j in range(height):
        gray[j, 0] = 255
        gray[j, width-1] = 255
    # median blur with a 3*3 kernel
    blur = cv2.medianBlur(gray, 3)
    # binarize
    ret, thresh1 = cv2.threshold(blur, 200, 255, cv2.THRESH_BINARY)
    # extract the single characters
    chars_list = []
    image, contours, hierarchy = cv2.findContours(thresh1, 2, 2)
    for cnt in contours:
        # minimal bounding rectangle
        x, y, w, h = cv2.boundingRect(cnt)
        if x != 0 and y != 0 and w*h >= 100:
            chars_list.append((x, y, w, h))
    sorted_chars_list = sorted(chars_list, key=lambda x: x[0])
    for i, item in enumerate(sorted_chars_list):
        x, y, w, h = item
        cv2.imwrite('E://test_verifycode/chars/%d.jpg' % (i+1), thresh1[y:y+h, x:x+w])

def remove_edge_picture(imagepath):
    image = cv2.imread(imagepath, 0)
    height, width = image.shape
    corner_list = [image[0, 0] < 127,
                   image[height-1, 0] < 127,
                   image[0, width-1] < 127,
                   image[height-1, width-1] < 127
                   ]
    if sum(corner_list) >= 3:
        os.remove(imagepath)

def resplit_with_parts(imagepath, parts):
    image = cv2.imread(imagepath, 0)
    os.remove(imagepath)
    height, width = image.shape
    file_name = imagepath.split('/')[-1].split(r'.')[0]
    # split the image again into `parts` pieces
    step = width//parts    # step size
    start = 0              # starting position
    for i in range(parts):
        cv2.imwrite('E://test_verifycode/chars/%s.jpg' % (file_name+'-'+str(i)), \
                    image[:, start:start+step])
        start += step

def resplit(imagepath):
    image = cv2.imread(imagepath, 0)
    height, width = image.shape
    if width >= 64:
        resplit_with_parts(imagepath, 4)
    elif width >= 48:
        resplit_with_parts(imagepath, 3)
    elif width >= 26:
        resplit_with_parts(imagepath, 2)

# rename and convert to 16*20 size
def convert(dir, file):
    imagepath = dir+'/'+file
    # read the image
    image = cv2.imread(imagepath, 0)
    # binarize
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    img = cv2.resize(thresh, (16, 20), interpolation=cv2.INTER_AREA)
    # save the image
    cv2.imwrite('%s/%s' % (dir, file), img)

# read the image data and convert it to 0-1 values
def Read_Data(dir, file):
    imagepath = dir+'/'+file
    # read the image
    image = cv2.imread(imagepath, 0)
    # binarize
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    # convert pixels to 0/1 values
    bin_values = [1 if pixel == 255 else 0 for pixel in thresh.ravel()]
    return bin_values

def main():
    VerifyCodePath = 'E://test_verifycode/E224.jpg'
    dir = 'E://test_verifycode/chars'
    files = os.listdir(dir)

    # clear any existing files
    if files:
        for file in files:
            os.remove(dir + '/' + file)

    split_picture(VerifyCodePath)

    files = os.listdir(dir)
    if not files:
        print('檢視的資料夾為空!')
    else:
        # remove noise images
        for file in files:
            remove_edge_picture(dir + '/' + file)
        # re-split images with glued characters
        for file in os.listdir(dir):
            resplit(dir + '/' + file)
        # resize all images to 16*20
        for file in os.listdir(dir):
            convert(dir, file)
        # vectors representing the characters in the image
        table = [Read_Data(dir, file) for file in os.listdir(dir)]
        test_data = pd.DataFrame(table, columns=['v%d' % i for i in range(1, 321)])

        # path where the model is saved
        MODEL_SAVE_PATH = 'E://logs/cnn_verifycode.ckpt'
        # initialize the CNN
        cnn = CNN(1000, 0.0005, MODEL_SAVE_PATH)
        y_pred = cnn.predict(test_data)

        # map each prediction back to its character
        prediction = []
        labels = '123456789ABCDEFGHJKLNPQRSTUVXYZ'
        for pred in y_pred:
            label = labels[list(pred).index(max(pred))]
            prediction.append(label)
        print(prediction)

main()
```
Taking the image E224.jpg as an example, the output is:
```
2018-09-25 20:50:33,227 - INFO: Initialize the model...
2018-09-25 20:50:33.238309: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-09-25 20:50:33,227 - INFO: Restoring the model...
INFO:tensorflow:Restoring parameters from E://logs/cnn_verifycode.ckpt
2018-09-25 20:50:33,305 - INFO: Restoring parameters from E://logs/cnn_verifycode.ckpt
['E', '2', '2', '4']
```
The prediction is completely correct. I then tested all 60 images: 54 of them were predicted entirely correctly, while the other 6 captchas contained partial errors, for an accuracy of 90%. A batch version of this check is sketched below.
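The batch check over the 60 new images can be automated with a loop along the following lines. This is only a sketch, assuming the new captchas sit in E://test_verifycode, that each file name (e.g. E224.jpg) is the ground-truth text, and that the splitting and prediction steps of the script above have been wrapped into a hypothetical recognize() helper that returns the predicted characters:

```python
import os

def evaluate_folder(folder='E://test_verifycode'):
    """Hypothetical batch check: compare the predicted string with the file name."""
    correct = 0
    files = [f for f in os.listdir(folder) if f.endswith('.jpg')]
    for file in files:
        truth = file.split('.')[0]                         # e.g. 'E224'
        pred = ''.join(recognize(folder + '/' + file))     # recognize() is assumed to wrap the pipeline above
        if pred == truth:
            correct += 1
        else:
            print('%s predicted as %s' % (truth, pred))
    print('%d/%d captchas fully correct (%.0f%%)' % (correct, len(files), 100*correct/len(files)))
```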
Summary
The CNN model really shines in captcha recognition, which gives us a taste of the power of deep learning~ Of course, the text captchas recognized here are relatively simple and only serve as one application of CNNs; harder captchas would require a more complicated processing pipeline. I hope that after reading this article you will go and try your hand at harder captcha recognition~~