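This section uses the Naive Bayes (NB) algorithm to recognize the handwritten digits in the MNIST dataset, although the results are mediocre.
1. Fixing the source code
Running the companion source code provided by the author fails with the following error: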
C:\ProgramData\Anaconda3\python.exe C:/Users/liujiannan/PycharmProjects/pythonProject/Web安全之机器学习入门/code/7-6.py
Traceback (most recent call last):
  File "C:/Users/liujiannan/PycharmProjects/pythonProject/Web安全之机器学习入门/code/7-6.py", line 25, in <module>
    training_data, valid_data, test_data=load_data()
  File "C:/Users/liujiannan/PycharmProjects/pythonProject/Web安全之机器学习入门/code/7-6.py", line 19, in load_data
    training_data, valid_data, test_data = pickle.load(fp)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 614: ordinal not in range(128)
Here is the part of the source that raises the error:
def load_data():
    with gzip.open('..') as fp:
        training_data, valid_data, test_data = pickle.load(fp)
    return training_data, valid_data, test_data
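This error is typical of loading a pickle written by Python 2 under Python 3, where pickle.load decodes old-style strings as ASCII by default. Passing an explicit encoding fixes it: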
def load_data():
    with gzip.open('../data/MNIST/mnist.pkl.gz') as fp:
        training_data, valid_data, test_data = pickle.load(fp, encoding="bytes")
    return training_data, valid_data, test_data
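To see why the encoding argument matters, here is a minimal sketch that reproduces the same kind of error without the MNIST file. The byte string `PY2_PICKLE` and the helper `try_load` are hypothetical names made up for this illustration: the bytes are a hand-written Python 2 style pickle holding a one-byte string `0x90`, the same byte the traceback complains about.

```python
import pickle

# Hand-written Python 2 style pickle (an assumption for illustration):
# SHORT_BINSTRING opcode 'U', length 1, the byte 0x90, then STOP '.'
PY2_PICKLE = b"U\x01\x90."


def try_load(data, **kwargs):
    """Return the unpickled object, or the UnicodeDecodeError if decoding fails."""
    try:
        return pickle.loads(data, **kwargs)
    except UnicodeDecodeError as exc:
        return exc


default_result = try_load(PY2_PICKLE)                    # default encoding is ASCII
bytes_result = try_load(PY2_PICKLE, encoding="bytes")    # keep the raw bytes
latin1_result = try_load(PY2_PICKLE, encoding="latin1")  # map each byte to one character

print(type(default_result).__name__)
print(repr(bytes_result))
print(repr(latin1_result))
```

Both `encoding="bytes"` and `encoding="latin1"` avoid the ASCII decode step; the difference is only whether old-style strings come back as `bytes` or as `str`.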
2. Dataset processing
def load_data():
    with gzip.open('../data/MNIST/mnist.pkl.gz') as fp:
        training_data, valid_data, test_data = pickle.load(fp, encoding="bytes")
    return training_data, valid_data, test_data

if __name__ == '__main__':
    training_data, valid_data, test_data = load_data()
    x1, y1 = training_data
    x2, y2 = test_data
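The load-and-unpack pattern above can be tried without downloading MNIST. The sketch below is only an illustration under assumptions: the tiny `toy` dataset and the `toy.pkl.gz` file name are invented, and `load_data` takes the path as a parameter (unlike the article's version) so it can point at a temporary file. The real file is laid out the same way, as three (inputs, labels) pairs.

```python
import gzip
import os
import pickle
import tempfile


def load_data(path):
    # Same pattern as the article's load_data, with the path passed in
    with gzip.open(path, "rb") as fp:
        training_data, valid_data, test_data = pickle.load(fp, encoding="bytes")
    return training_data, valid_data, test_data


# Hypothetical toy stand-in: each split is an (inputs, labels) pair
toy = (
    ([[0.0, 1.0], [1.0, 0.0]], [1, 0]),  # training split
    ([[0.5, 0.5]], [1]),                 # validation split
    ([[0.9, 0.1]], [0]),                 # test split
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl.gz")
    with gzip.open(path, "wb") as fp:
        pickle.dump(toy, fp)
    training_data, valid_data, test_data = load_data(path)

# Unpack inputs and labels exactly as the article does
x1, y1 = training_data
x2, y2 = test_data
print(len(x1), len(y1), len(x2), len(y2))
```

Note that the validation split is loaded but never used in this section; only the training and test splits are unpacked.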
3. Complete source code
# -*- coding:utf-8 -*-
import gzip
import pickle

from sklearn import model_selection
from sklearn.naive_bayes import GaussianNB


def load_data():
    # mnist.pkl.gz holds three (inputs, labels) splits; encoding="bytes" lets
    # Python 3 read the pickle without the ASCII decode error shown earlier
    with gzip.open('../data/MNIST/mnist.pkl.gz') as fp:
        training_data, valid_data, test_data = pickle.load(fp, encoding="bytes")
    return training_data, valid_data, test_data


if __name__ == '__main__':
    training_data, valid_data, test_data = load_data()
    x1, y1 = training_data
    x2, y2 = test_data
    clf = GaussianNB()
    clf.fit(x1, y1)
    # cross_val_score clones clf and refits it on folds of (x2, y2), so the
    # scores do not depend on the fit above. cv=3 matches the three scores
    # in the results (newer scikit-learn versions default to 5 folds).
    score = model_selection.cross_val_score(clf, x2, y2, cv=3, scoring="accuracy")
    print(score)
    print(score.mean())
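One thing worth noticing in the script above: `cross_val_score` works on copies of the estimator and refits them on folds of the data it is given, so the earlier `clf.fit(x1, y1)` has no effect on the printed scores. A rough sketch of the two evaluation styles side by side follows. It is an assumption-laden illustration, not the author's method: it uses scikit-learn's small built-in `load_digits` dataset as a stand-in for MNIST, and the 75/25 split and `random_state=0` are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Stand-in data (assumption): 8x8 digit images, not the 28x28 MNIST set
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Style 1: fit on the training split, score on the held-out test split
clf = GaussianNB()
clf.fit(X_train, y_train)
holdout_accuracy = clf.score(X_test, y_test)

# Style 2 (what the article's script does): 3-fold cross-validation on the
# test split alone; each fold trains a fresh copy of the estimator
cv_scores = cross_val_score(GaussianNB(), X_test, y_test, cv=3, scoring="accuracy")

print(holdout_accuracy)
print(cv_scores, cv_scores.mean())
```

With the real data, the first style would correspond to `clf.fit(x1, y1)` followed by `clf.score(x2, y2)`.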
4. Results
[0.53684841 0.58385839 0.6043857 ]
0.575030833157769
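Clearly the result is not good: the mean accuracy is only about 57.5%. NB does poorly on this multi-class task, although it does reasonably well on binary classification.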
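The binary versus multi-class comparison can be checked with a small experiment. The sketch below is only a rough illustration under assumptions: it again uses scikit-learn's built-in `load_digits` data rather than MNIST, and picking digits 0 and 1 for the binary task is an arbitrary choice, so the numbers it prints will not match the results above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Stand-in data (assumption): 8x8 digit images with labels 0..9
X, y = load_digits(return_X_y=True)

# Ten-class task: all digits
multi_scores = cross_val_score(GaussianNB(), X, y, cv=3, scoring="accuracy")

# Two-class task: keep only the samples labelled 0 or 1
mask = (y == 0) | (y == 1)
binary_scores = cross_val_score(
    GaussianNB(), X[mask], y[mask], cv=3, scoring="accuracy"
)

print("10 classes:", multi_scores.mean())
print("2 classes: ", binary_scores.mean())
```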