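This section uses spam detection to show how a TensorFlow deep neural network (DNN) can be applied to network security, and compares its results with those of the naive Bayes (NB) algorithm.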
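1. Dataset and featurization
This section trains and tests on SpamBase, an entry-level spam dataset. Note that SpamBase does not contain raw email text; the data is already featurized. Each record has 58 attributes. The first 57 are features: frequencies of selected keywords and special characters, followed by three statistics on runs of capital letters. The last attribute is the spam label. The figure below shows an example of the feature structure:
The corresponding code is as follows: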
def load_SpamBase(filename):
    x = []
    y = []
    with open(filename) as f:
        for line in f:
            line = line.strip('\n')
            v = line.split(',')
            # The last column is the spam label.
            y.append(int(v[-1]))
            # The first 57 columns are the numeric features.
            t = []
            for i in range(57):
                t.append(float(v[i]))
            t = np.array(t)
            x.append(t)
    x = np.array(x)
    y = np.array(y)
    print(x.shape)
    print(y.shape)
    # Hold out 40% of the samples for testing; fix the seed for repeatability.
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
    print(x_train.shape)
    print(x_test.shape)
    return x_train, x_test, y_train, y_test

def main(unused_argv):
    x_train, x_test, y_train, y_test = load_SpamBase("../data/spambase/spambase.data")
    feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(x_train)
Printing the overall size of the dataset gives:
(4601, 57)
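This section splits the data into training and test sets at a ratio of 6:4. Printing the shapes of the training and test feature matrices gives: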
(2760, 57)
(1841, 57)
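The two shapes above follow from the 6:4 split of 4601 samples. The sketch below reproduces the arithmetic, assuming the test share is rounded up and the training set takes the remainder; `split_sizes` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    # Assumption: the test share is rounded up, the rest goes to training.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

print(split_sizes(4601, 0.4))  # (2760, 1841)
```

The result matches the row counts printed by the loader.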
Printing the 57 features of the first training sample gives:
[2.70e-01 0.00e+00 1.30e-01 0.00e+00 8.20e-01 0.00e+00 0.00e+00 0.00e+00
0.00e+00 0.00e+00 0.00e+00 5.50e-01 4.10e-01 0.00e+00 0.00e+00 0.00e+00
0.00e+00 0.00e+00 1.24e+00 0.00e+00 1.10e+00 0.00e+00 0.00e+00 0.00e+00
1.65e+00 8.20e-01 1.30e-01 1.30e-01 1.30e-01 1.30e-01 1.30e-01 1.30e-01
0.00e+00 1.30e-01 1.30e-01 1.30e-01 4.10e-01 0.00e+00 0.00e+00 1.30e-01
0.00e+00 4.10e-01 1.30e-01 0.00e+00 4.10e-01 0.00e+00 0.00e+00 2.70e-01
4.10e-02 1.02e-01 2.00e-02 2.00e-02 0.00e+00 0.00e+00 2.78e+00 3.40e+01
3.67e+02]
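To make the layout of the 57 features easier to read, here is a minimal sketch that slices one feature vector into its three groups. It assumes the usual SpamBase column order (48 word frequencies, 6 character frequencies, 3 capital-run statistics). `split_feature_groups` is a hypothetical helper, and the 48 word frequencies in the example row are zeroed placeholders; only the last nine values are copied from the sample printed above.

```python
def split_feature_groups(features):
    """Slice a 57-element SpamBase feature vector into its three groups."""
    if len(features) != 57:
        raise ValueError("expected 57 features, got %d" % len(features))
    return {
        "word_freq": features[:48],      # keyword frequencies
        "char_freq": features[48:54],    # special-character frequencies
        "capital_run": features[54:],    # average, longest, total run length
    }

# Placeholder word frequencies, followed by the last nine values of the
# sample row printed above.
row = [0.0] * 48 + [0.041, 0.102, 0.02, 0.02, 0.0, 0.0] + [2.78, 34.0, 367.0]
groups = split_feature_groups(row)
print(groups["capital_run"])  # [2.78, 34.0, 367.0]
```

The capital-run values are orders of magnitude larger than the frequency features, which is worth keeping in mind when comparing models on the raw data.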
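2. Training the DNN
This section uses two hidden layers, with 30 neurons in hidden layer 1 and 10 neurons in hidden layer 2, and classifies samples into two classes.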
classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns, hidden_units=[30, 10], n_classes=2)
classifier.fit(x_train, y_train, steps=500, batch_size=10)
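As a rough sense of scale for this configuration, the sketch below counts the weights and biases of a plain fully connected network with layer sizes 57, 30, 10, 2, and works out how much data 500 steps of batch size 10 cover. It assumes two output units; the actual estimator may build its output layer differently (for example, with a single logit for binary classification). `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(layer_sizes):
    """Count weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 57 inputs, hidden layers of 30 and 10, and an assumed 2 output units.
print(dense_param_count([57, 30, 10, 2]))  # 2072

# Training budget taken from the fit() call above.
samples_seen = 500 * 10          # steps * batch_size
epochs = samples_seen / 2760     # 2760 training rows
print(samples_seen, round(epochs, 2))  # 5000 1.81
```

Under these assumptions the network sees each training row fewer than twice.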
3. Evaluating the DNN on the test set
y_predict=list(classifier.predict(x_test, as_iterable=True))
score = metrics.accuracy_score(y_test, y_predict)
print('Accuracy: {0:f}'.format(score))
The result is as follows:
Accuracy: 0.724063
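Accuracy here is simply the fraction of test samples whose predicted label equals the true label. The sketch below computes it by hand on toy labels; `accuracy` is a hypothetical helper, not the scikit-learn function. The final line is a back-of-the-envelope check: the count of 1333 correct predictions is inferred from the reported score and the 1841 test rows, not taken from the original run.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels: three of four predictions match.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75

# Inferred check: 1333 correct out of 1841 reproduces the printed score.
print('Accuracy: {0:f}'.format(1333 / 1841))  # Accuracy: 0.724063
```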
4. Training and evaluating naive Bayes (NB)
gnb = GaussianNB()
y_predict = gnb.fit(x_train, y_train).predict(x_test)
score = metrics.accuracy_score(y_test, y_predict)
print('Accuracy: {0:f}'.format(score))
The test result is as follows:
Accuracy: 0.826181
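For readers who want to see what a Gaussian naive Bayes classifier does internally, here is a minimal pure-Python sketch on toy data. It estimates a prior, a mean and a variance per class and feature, then picks the class with the highest log-posterior. It illustrates the idea and is not scikit-learn's implementation: `fit_gaussian_nb`, `predict_one` and the `eps` constant are hypothetical names, and the toy data is made up.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate per-class prior, feature means and feature variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps variances strictly positive (an assumption of this sketch).
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_one(model, row):
    """Return the class with the highest Gaussian log-posterior."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up data: two features, two well-separated classes.
X = [[0.1, 1.0], [0.2, 1.2], [0.9, 5.0], [1.1, 5.5]]
y = [0, 0, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict_one(model, [0.15, 1.1]), predict_one(model, [1.0, 5.2]))  # 0 1
```

Because each feature is modeled with its own mean and variance, this kind of classifier is insensitive to the differing scales of the input columns.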
5. Complete code
# Requires TensorFlow 1.x: tf.contrib was removed in TensorFlow 2.x.
import tensorflow as tf
from tensorflow.contrib.learn.python import learn
from sklearn import metrics
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Sample row from spambase.data (57 features followed by the label):
#0,0.64,0.64,0,0.32,0,0,0,0,0,0,0.64,0,0,0,0.32,0,1.29,1.93,0,0.96,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
# 0,0,0,0,0.778,0,0,3.756,61,278,1

def load_SpamBase(filename):
    x = []
    y = []
    with open(filename) as f:
        for line in f:
            line = line.strip('\n')
            v = line.split(',')
            # The last column is the spam label.
            y.append(int(v[-1]))
            # The first 57 columns are the numeric features.
            t = []
            for i in range(57):
                t.append(float(v[i]))
            t = np.array(t)
            x.append(t)
    x = np.array(x)
    y = np.array(y)
    print(x.shape)
    print(y.shape)
    # Hold out 40% of the samples for testing; fix the seed for repeatability.
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
    print(x_train.shape)
    print(x_test.shape)
    return x_train, x_test, y_train, y_test

def main(unused_argv):
    x_train, x_test, y_train, y_test = load_SpamBase("../data/spambase/spambase.data")
    feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(x_train)

    # DNN with two hidden layers (30 and 10 neurons) and two output classes.
    classifier = tf.contrib.learn.DNNClassifier(
        feature_columns=feature_columns, hidden_units=[30, 10], n_classes=2)
    classifier.fit(x_train, y_train, steps=500, batch_size=10)
    y_predict = list(classifier.predict(x_test, as_iterable=True))
    score = metrics.accuracy_score(y_test, y_predict)
    print('Accuracy: {0:f}'.format(score))

    # Gaussian naive Bayes on the same split, for comparison.
    gnb = GaussianNB()
    y_predict = gnb.fit(x_train, y_train).predict(x_test)
    score = metrics.accuracy_score(y_test, y_predict)
    print('Accuracy: {0:f}'.format(score))

if __name__ == '__main__':
    tf.app.run()