朴素贝叶斯分类算法是机器学习中十分经典而且应用十分广泛的算法,下面将逐步学习和说明。
一、条件概率:
条件概率是概率论中的一个重要实用的概念。所考虑的是事件A已经发生的条件下事件B发生到概率。
(一)条件概率 定义 设A,B是两个事件,且P(A)>0,称:P(B|A) = P(AB) / P(A); 为在事件A发生的条件下事件B发生的条件概率。
(二)乘法定理设P(A) > 0 , 则有P(AB) = P(B|A) * P(A); 此式成为乘法公式;
(三)全概率公式和贝叶斯公式
样本空间划分定义:假设S为试验E的样本空间,B1,B2,B3..Bn为E的一组事件。若:
(i)
(ii)
则称B1,B2,B3...Bn为样本空间S的一个划分。若B1,B2,...,Bn是样本空间的一个划分,那么,对每次试验,事件B1,B2,B3...Bn中必有一个且仅有一个发生。
定理 设试验E的样本空间为S,A为E的事件,B1,B2,...Bn为S的一个划分,且P(Bi)>0(i=1,2,...n),则:
P(A) = P(A|B1)*P(B1) + P(A|B2)*P(B2)+...+P(A|Bn)*P(Bn). 此式成为全概率公式。
在很多实际问题中P(A)不易直接求得,但是却容易找到S的一个划分B1,B2,...Bn,且P(Bi)和P(A|Bi)或为已知,或容易求得,那么就可以全概率公式求得P(A)。
定理 设试验E的样本空间为S,A为E的事件,B1,B2,...,Bn为S的一个划分,且P(A)>0,P(Bi)>0 (i=1,2,...,n),则
此式成为贝叶斯公式。
二、基于贝叶斯决策分类的分类方法:
优点:在数据较少的情况下仍然有效,可以处理多类别问题;
缺点:对于输入数据的准备方式较为敏感;
使用数据类型:标称性数据;
朴素贝叶斯是贝叶斯决策理论的一部分,所以讲述朴素贝叶斯分类之前有必要了解贝叶斯决策理论。我们之所以称之为“朴素”,是因为整个形式化过程只做最原始,最简单的假设。
假设我们有一个数据集,它由两类数据组成,数据分布:
图1:两个参数已知的概率分布,参数决定分布形状。
我们现在用p1(x,y)表示数据点(x,y)属于类别1(图中用圆点表示的类别)的概率,用p2(x,y)表示数据点(x,y)属于类别2(图中用三角形表示的类别)的概率,那么对于一个新数据点(x,y),可以用下面的规则来判断它的类别:
·如果p1(x,y) > p2(x,y) ,那么类别为1。
·如果p1(x,y) < p2(x,y) ,那么类别为2。
也就是说,我们也选择搞概率对应的类别。这就是贝叶斯决策理论的核心思想,即选择具有最高概率的决策。
三、朴素贝叶斯的一般过程:
(1)收集数据:可以使用任何方法。
(2)准备数据:需要数值型或者布尔型数据。
(3)分析数据:有大量特征时,绘制特征作用不大,此时使用直方图效果更好。
(4)训练算法:计算不同的独立特征的条件概率。
(5)测试算法:计算错误率。
(6)使用算法:一个常见的朴素贝叶斯应用是文档分类。可以在任意的分类场景中使用朴素贝叶斯分类器,不一定非要是文本。
四、文本分类:
要从文本中获取特征,需要先拆分文本。具体如何做?这里的特征是来自文本的词条(token),一个词条是字符的任意组合。可以把词条想象为单词,也可以使用非单词词条,如URL、IP地址或者任意其他字符串。
以在线社区的留言板为例。为了不影响社区的发展,我们要屏蔽侮辱性的言论,所以要构建一个快速过滤器,如果某条留言使用了负面或者侮辱性的语言,那么就该留言标识为内容不当。过滤这类内容是一个很常见的需求。对此问题建立两个类别,侮辱性和非侮辱性,使用1和0分别表示。
下面将用C++来设计数据结构和算法。
4.1 准备数据:从文本中构建词向量
把文本看成单词向量或者词条向量,也就是说将句子转换为向量。考虑出现在所有文档中的所有单词,再决定将哪些词纳入词汇表或者说所要的词汇集合,然后必须要将每一篇文档转换为在词汇表上的向量,为什么要这么做,不着急,先往下看。
4-1:词表到向量的转换函数:
/* * code list 4-1 : transfer func from docs list to vocabulary list * */ #include<iostream> #include<map> #include<set> #include<vector> #include<algorithm> #include<numeric> #include<cstring> #include<stdio.h> #include<cstdlib> using namespace std; string posting_list[6][10]={ {"my","dog","has","flea","problems","help","please","null"}, {"maybe","not","take","him","to","dog","park","stupid","null"}, {"my","dalmation","is","so","cute","I","love","him","null"}, {"stop","posting","stupid","worthless","garbage","null"}, {"mr","licks","ate","my","steak","how","to","stop","him","null"}, {"quit","buying","worthless","dog","food","stupid","null"} }; int class_vec[6] = {0,1,0,1,0,1}; //1 is abusive ,0 not class NaiveBayes { private: vector< vector<string> > list_of_posts; //词条向量 vector<int> list_classes; map<string,int> my_vocab_list; //单词列表 int *return_vec; public: NaiveBayes() { //posting_list --> list_of_posts vector<string> vec; for(int i=0;i<6;i++) { vec.clear(); for(int j=0;posting_list[i][j]!="null";j++) { vec.push_back( posting_list[i][j] ); } list_of_posts.push_back( vec ); } //class_vec --> list_classes for(int i=0;i<sizeof(class_vec)/sizeof(class_vec[0]);i++) { list_classes.push_back( class_vec[i] ); } } void create_vocab_list() { vector< vector<string> > :: iterator it = list_of_posts.begin(); int index = 1; while( it!=list_of_posts.end() ) { vector<string> vec = *it; vector<string> :: iterator tmp_it = vec.begin(); while( tmp_it!=vec.end() ) { if( my_vocab_list[*tmp_it] == 0 ) { my_vocab_list[*tmp_it] = index++; //index is the location of the vocabulary } tmp_it++; } it++; } map<string,int>::const_iterator itt = my_vocab_list.begin(); while( itt!=my_vocab_list.end() ) { cout<<itt->first<<" "<<itt->second<<" "; itt++; } }//create_vocab_list //set some one doc to vec with 0 and 1. void set_of_words_to_vec(int idx) { cout<<"set of words to vec begin the document id is : "<<idx<<endl; int len = my_vocab_list.size()+1; return_vec = new int[ len ](); //pay attention to the difference between "new int[len]". initalize all the element to zero. fill(return_vec,return_vec+len,0); for(int i=0;i<len;i++) cout<<return_vec[i]<<" "; for( int i=0;posting_list[idx][i]!="null";i++ ) { int pos = my_vocab_list[ posting_list[idx][i] ]; if( pos != 0 ) { return_vec[pos] = 1; } } cout<<endl; }//set_of_words_to_vec void print() { cout<<"print the return_vec begin :"<<endl; int len = my_vocab_list.size()+1; cout<<"len = "<<len<<endl; for(int i=0;i<len;i++) { cout<<return_vec[i]<<" "; } cout<<endl; delete [] return_vec; }//print() }; int main() { NaiveBayes nb; nb.create_vocab_list(); nb.set_of_words_to_vec(5); nb.print(); system("pause") ; return 0; }
分析:
·NaiveBayes():构造函数做了两方面的工作,其一,初始化了词条切分后的文档集合list_of_posts,即将posting_list转换为list_of_posts,其中list_of_posts中的每一个分量就是一个文档,这些文档来自斑点犬爱好者留言板;其二,用class_vec去初始化类的私有成员变量list_classes,它是类别标签的集合,分为两类,侮辱性和非侮辱性。这些文本的类别由人工标注,这些标注信息用于训练程序以便自动检测侮辱性留言。
·create_vocab_list():创建一个包含在所有文档中出现的不重复词的列表。定义私有成员变量map<string,int> my_vocab_list; key是代表单词,value则代表单词在my_vocab_list中的位置(下标)。
·set_of_words_to_vec(int idx):该函数的输入参数为某个文档的下标值idx,得到了return_vec,即下标为idx的文档向量,向量的每个元素为1或者0,分别表示词汇表中的单词在输入文档中是否出现。首先根据词表长度获得文档向量长度,并用STL中的fill将其元素都设为0;遍历下标为idx的文档中所有单词,得到在词汇表my_vocab_list中的位置pos,然后根据pos将return_vec中对应的值设置为1。该函数使用词汇表或者想要检查的所有单词作为输入,一旦给定一篇文档(斑点犬网站上的一条留言),该文档就会被转换为词向量。
·print():打印得到的文档向量return_vec。
结果:
4.2:训练算法:从词向量计算概率:
前面介绍了如何将一组单词转换为一组数字,接下来看如何使用这些数字来计算概率。现在已经知道一个词是否出现在一篇文档中,也知道该文档所属的类别。那么我们将对某个文档转换成为文档向量W后进行分类,实际上就是计算在W的条件下,类别为Ci的概率。
p(Ci | W) = p( W | Ci ) / p( W ); W:就是需要分类的词向量;
我们将使用上述公式,对于每个类计算该值,然后比较这两个概率值的大小。如何计算?
p(Ci):首先可以通过类别i(侮辱性留言或非侮辱性留言)中文档数除以总的文档数来计算概率p(Ci)。
p(W|Ci):这里就要用到朴素贝叶斯假设。如果将向量W展开为一个个独立特征,那么就可以将上述概率写作p(W0,W1,W2...Wn)。这里假设所有词都相互独立,该假设也称作条件独立性假设,它意味着可以用P(W0|Ci)P(W1|Ci)P(W2|Ci)...P(P(Wn|Ci))来计算上述概率,这就极大地简化了计算的过程。
该函数的伪代码如下:
计算每个类别中的文档数目: 对每篇训练文档: 对每个类别: 如果词条出现文档中-->增加该词条的计数值 增加所有词条的计数值 对每个类别: 对每个词条: 将该词条的数目除以总词条数目得到条件概率 返回每个类别的条件概率
4-2:朴素贝叶斯分类器训练函数:
/* * code list 4-1 : transfer func from docs list to vocabulary list * add code list 4-2 : training func on Naive Bayes Classifier * */ #include<iostream> #include<map> #include<set> #include<vector> #include<algorithm> #include<numeric> #include<cstring> #include<stdio.h> #include<cstdlib> using namespace std; string posting_list[6][10]={ {"my","dog","has","flea","problems","help","please","null"}, {"maybe","not","take","him","to","dog","park","stupid","null"}, {"my","dalmation","is","so","cute","I","love","him","null"}, {"stop","posting","stupid","worthless","garbage","null"}, {"mr","licks","ate","my","steak","how","to","stop","him","null"}, {"quit","buying","worthless","dog","food","stupid","null"} }; int class_vec[6] = {0,1,0,1,0,1}; //1 is abusive ,0 not class NaiveBayes { private: vector< vector<string> > list_of_posts; vector<int> list_classes; map<string,int> my_vocab_list; int *return_vec; vector< vector<int> > train_mat; public: NaiveBayes() { vector<string> vec; for(int i=0;i<6;i++) { vec.clear(); for(int j=0;posting_list[i][j]!="null";j++) { vec.push_back( posting_list[i][j] ); } list_of_posts.push_back( vec ); } for(int i=0;i<sizeof(class_vec)/sizeof(class_vec[0]);i++) { list_classes.push_back( class_vec[i] ); } } void create_vocab_list() { vector< vector<string> > :: iterator it = list_of_posts.begin(); int index = 1; while( it!=list_of_posts.end() ) { //vector<string> vec( *it.begin(),*it.end() ); vector<string> vec = *it; vector<string> :: iterator tmp_it = vec.begin(); while( tmp_it!=vec.end() ) { //cout<<*tmp_it<<" "; if( my_vocab_list[*tmp_it] == 0 ) { my_vocab_list[*tmp_it] = index++; //index is the location of the vovabulary } tmp_it++; } it++; } }//create_vocab_list //set some one word to vec with 0 and 1. void set_of_words_to_vec(int idx) { cout<<"set of words to vec begin the document id is : "<<idx<<endl; int len = my_vocab_list.size()+1; return_vec = new int[ len ](); //pay attention to the difference between "new int[len]". initalize all the element to zero. fill(return_vec,return_vec+len,0); for(int i=0;i<len;i++) cout<<return_vec[i]<<" "; for( int i=0;posting_list[idx][i]!="null";i++ ) { //cout<<posting_list[idx][i]<<" "; int pos = my_vocab_list[ posting_list[idx][i] ]; if( pos != 0 ) { return_vec[pos] = 1; } } cout<<endl; }//set_of_words_to_vec void get_train_matrix() { cout<<"get train matrix begin : "<<endl; train_mat.clear(); for(int i=0;i<6;i++) { set_of_words_to_vec(i); vector<int> vec( return_vec , return_vec + my_vocab_list.size()+1 ); train_mat.push_back(vec); delete []return_vec; } }//get train matrix void print() { cout<<"print the train matrix begin : "<<endl; vector< vector<int> > :: iterator it = train_mat.begin(); while(it!=train_mat.end()) { vector<int> vec = *it; vector<int> :: iterator itt = vec.begin(); while( itt!=vec.end()) { cout<<*itt<<" "; itt++; } cout<<endl; it++; } }//print() void train_NB0() { int num_train_docs = train_mat.size();//sizeof(posting_lists)/sizeof(posting_lists[0]); cout<<"num_train_docs = "<<num_train_docs<<endl; int num_words = train_mat[0].size() - 1 ; /* calculatr the sum of the abusive classes */ int sum = accumulate(list_classes.begin(),list_classes.end(),0); //C++ STL accumulate() cout<<"sum = "<<sum<<endl; float p_abusive = (float)sum/(float)num_train_docs; cout<<"p_abusive = "<<p_abusive<<endl; vector<float> p0vect(train_mat[0].size(),0); //the frequency of each word in non-absusive docs vector<float> p1vect(train_mat[0].size(),0); //the frequency of each word in abusive docs printf("p0num.size() = %d , p1num.size() = %d\n",p0vect.size(),p1vect.size()); float p0Denom = 0.0; //the total number of words in non-abusive docs float p1Denom = 0.0; //the total number of words in abusive docs /* calculate the p0num,p1num,p0Denom,p1Denom */ for(int i=0;i<list_classes.size();i++) { if(list_classes[i] == 1) //abusive doc { for(int j=0;j<p1vect.size();j++) { p1vect[j] += train_mat[i][j]; if(train_mat[i][j]==1) p1Denom++; } } else //non-abusive doc { for(int j=0;j<p0vect.size();j++) { p0vect[j] += train_mat[i][j]; if(train_mat[i][j]==1) p0Denom++; } } } for(int i=0;i<p1vect.size();i++) { p0vect[i] = p0vect[i]/p0Denom; p1vect[i] = p1vect[i]/p1Denom; } cout<<"print the p0vect values : "; for(int i=0;i<p0vect.size();i++) cout<<p0vect[i]<<" "; cout<<"\nprint the p1vect values : "; for(int i=0;i<p1vect.size();i++) cout<<p1vect[i]<<" "; cout<<endl; } }; int main() { NaiveBayes nb; nb.create_vocab_list(); nb.get_train_matrix(); nb.print(); nb.train_NB0(); system("pause") ; return 0; }
分析:
代码中和4-1的代码相比,4-2增加了私有成员变量train_mat和公有成员函数train_NB0。其中:
train_mat:文档矩阵。是由0,1组成的词向量矩阵,单个向量是一个文档转换成为和my_vocab_list等长的[0,1]数组。
train_NB0:朴素贝叶斯分类器训练函数。
首先,计算文档属于侮辱性文档(class=1)的概率:p_abusive,即P(1)。因为这是一个二类分类问题,所以可以通过1-P(1)得到P(0)。对于多于两类的分类问题,则需要对代码稍加修改。
计算P(Wi | C0)和P(Wi | C1),需要初始化程序中的分子变量p0vect/p1vect和分母变量p0Denom/p1Denom。在for循环中,要遍历训练集train_mat中的所有文档。每次某个词语(侮辱性或非侮辱性)在某一文档中出现,则该词在向量p0vect或p1vect对应的位置数值加一,而且在所有的文档中,该文档的总次数p0Denom或p1Denom也相应加1。对于两个类别都需要进行同样的计算处理。最后,对每个元素除以该类别中的总次数。
结果:
4.3:测试算法:根据现实情况修改分类器
利用贝叶斯分类器对文档进行分类时,要计算多个概率乘积以获得文档属于某个类别的概率,即计算p(W0|1)p(W1|1)p(W2|1)。
问题一:如果其中的一个概率值为0,那么最后的乘积也为0。为降低这种影响,可以将所有词的出现数初始化1,并将分母初始化为2.
p0vect.resize(train_mat[0].size(),1);//the frequency of each word in non-absusive docs p1vect.resize(train_mat[0].size(),1);//the frequency of each word in abusive docs float p0Denom = 2.0; //the total number of words in non-abusive docs float p1Denom = 2.0; //the total number of words in abusive docs
问题二:下溢出。这是由于太多很小的数相乘造成的。当计算乘积p(W0|Ci)p(W1|Ci)p(W2|Ci)....p(Wn|Ci)时,由于大部分因子都非常小,所以程序会下溢出或者得到不正确答案。在代数中有ln(a*b) = ln(a) + ln(b),于是通过对数可以避免下溢出或者浮点数舍入导致的错误。同时,采用自然对数进行处理不会有任何损失。自然ln不会影响函数的单调性。
p0vect[i] = log(p0vect[i]/p0Denom); p1vect[i] = log(p1vect[i]/p1Denom);万事俱备,已经可以开始构建完整的分类器了。
4-3:朴素贝叶斯分类函数:
/* * code list 4-1 : transfer func from docs list to vocabulary list * code list 4-2 : training func on Naive Bayes Classifier * add code list 4-3 : naive bayes classify function * */ #include<iostream> #include<map> #include<set> #include<cmath> #include<vector> #include<algorithm> #include<numeric> #include<cstring> #include<stdio.h> #include<cstdlib> using namespace std; string posting_list[6][10]={ {"my","dog","has","flea","problems","help","please","null"}, {"maybe","not","take","him","to","dog","park","stupid","null"}, {"my","dalmation","is","so","cute","I","love","him","null"}, {"stop","posting","stupid","worthless","garbage","null"}, {"mr","licks","ate","my","steak","how","to","stop","him","null"}, {"quit","buying","worthless","dog","food","stupid","null"} }; int class_vec[6] = {0,1,0,1,0,1}; //1 is abusive ,0 not class NaiveBayes { private: vector< vector<string> > list_of_posts; vector<int> list_classes; map<string,int> my_vocab_list; int *return_vec; vector< vector<int> > train_mat; vector<float> p0vect; vector<float> p1vect; float p_abusive; public: NaiveBayes() { vector<string> vec; for(int i=0;i<6;i++) { vec.clear(); for(int j=0;posting_list[i][j]!="null";j++) { vec.push_back( posting_list[i][j] ); } list_of_posts.push_back( vec ); } for(int i=0;i<sizeof(class_vec)/sizeof(class_vec[0]);i++) { list_classes.push_back( class_vec[i] ); } } void create_vocab_list() { vector< vector<string> > :: iterator it = list_of_posts.begin(); int index = 1; while( it!=list_of_posts.end() ) { //vector<string> vec( *it.begin(),*it.end() ); vector<string> vec = *it; vector<string> :: iterator tmp_it = vec.begin(); while( tmp_it!=vec.end() ) { //cout<<*tmp_it<<" "; if( my_vocab_list[*tmp_it] == 0 ) { my_vocab_list[*tmp_it] = index++; //index is the location of the vovabulary } tmp_it++; } it++; } }//create_vocab_list //set some one word to vec with 0 and 1. void set_of_words_to_vec(int idx) { cout<<"set of words to vec begin the document id is : "<<idx<<endl; int len = my_vocab_list.size()+1; return_vec = new int[ len ](); //pay attention to the difference between "new int[len]". initalize all the element to zero. fill(return_vec,return_vec+len,0); for(int i=0;i<len;i++) cout<<return_vec[i]<<" "; for( int i=0;posting_list[idx][i]!="null";i++ ) { //cout<<posting_list[idx][i]<<" "; int pos = my_vocab_list[ posting_list[idx][i] ]; if( pos != 0 ) { return_vec[pos] = 1; } } cout<<endl; }//set_of_words_to_vec void get_train_matrix() { cout<<"get train matrix begin : "<<endl; train_mat.clear(); for(int i=0;i<6;i++) { set_of_words_to_vec(i); vector<int> vec( return_vec , return_vec + my_vocab_list.size()+1 ); train_mat.push_back(vec); delete []return_vec; } }//get train matrix void print() { cout<<"print the train matrix begin : "<<endl; vector< vector<int> > :: iterator it = train_mat.begin(); while(it!=train_mat.end()) { vector<int> vec = *it; vector<int> :: iterator itt = vec.begin(); while( itt!=vec.end()) { cout<<*itt<<" "; itt++; } cout<<endl; it++; } }//print() void train_NB0() { int num_train_docs = train_mat.size();//sizeof(posting_lists)/sizeof(posting_lists[0]); cout<<"num_train_docs = "<<num_train_docs<<endl; int num_words = train_mat[0].size() - 1 ; /* calculatr the sum of the abusive classes */ int sum = accumulate(list_classes.begin(),list_classes.end(),0); cout<<"sum = "<<sum<<endl; //float p_abusive = (float)sum/(float)num_train_docs; p_abusive = (float)sum/(float)num_train_docs; cout<<"p_abusive = "<<p_abusive<<endl; //vector<float> p0vect(train_mat[0].size(),1); //the frequency of each word in non-absusive docs p0vect.resize(train_mat[0].size(),1); //vector<float> p1vect(train_mat[0].size(),1); //the frequency of each word in abusive docs p1vect.resize(train_mat[0].size(),1); printf("p0num.size() = %d , p1num.size() = %d\n",p0vect.size(),p1vect.size()); float p0Denom = 2.0; //the total number of words in non-abusive docs float p1Denom = 2.0; //the total number of words in abusive docs /* calculate the p0num,p1num,p0Denom,p1Denom */ for(int i=0;i<list_classes.size();i++) { if(list_classes[i] == 1) //abusive doc { for(int j=0;j<p1vect.size();j++) { p1vect[j] += train_mat[i][j]; if(train_mat[i][j]==1) p1Denom++; } } else //non-abusive doc { for(int j=0;j<p0vect.size();j++) { p0vect[j] += train_mat[i][j]; if(train_mat[i][j]==1) p0Denom++; } } } for(int i=0;i<p1vect.size();i++) { p0vect[i] = log(p0vect[i]/p0Denom); p1vect[i] = log(p1vect[i]/p1Denom); } cout<<"print the p0vect values : "<<endl; for(int i=0;i<p0vect.size();i++) cout<<p0vect[i]<<" "; cout<<"\nprint the p1vect values : "<<endl; for(int i=0;i<p1vect.size();i++) cout<<p1vect[i]<<" "; cout<<endl; } int classify_NB( string *doc_to_classify ) { return_vec = new int[ my_vocab_list.size()+1 ](); for(int i=0;doc_to_classify[i]!="null";i++) { int pos = my_vocab_list[ doc_to_classify[i] ]; if( pos!=0 ) { return_vec[ pos ] = 1; } }//for for(int i=0;i<my_vocab_list.size()+1;i++) cout<<return_vec[i]<<" "; cout<<endl; float p1 = inner_product( p1vect.begin()+1,p1vect.end(),return_vec+1,0 ) + log(p_abusive); float p0 = inner_product( p0vect.begin()+1,p0vect.end(),return_vec+1,0 ) + log(1-p_abusive); cout<<"p1 = "<<p1<<endl; cout<<"p0 = "<<p0<<endl; if( p1>p0 ) { return 1; } else { return 0; } } }; int main() { NaiveBayes nb; nb.create_vocab_list(); //nb.set_of_words_to_vec(5); nb.get_train_matrix(); nb.print(); nb.train_NB0(); string doc1_to_classify[] = {"love","my","dalmation","null"}; string doc2_to_classify[] = {"stupid","garbage","null"}; cout<<"doc1 classified as : "<<nb.classify_NB( doc1_to_classify )<<endl; cout<<"doc2 classified as : "<<nb.classify_NB( doc2_to_classify )<<endl; return 0; }
结果:
可以看到doc1:{"love","my","dalmation","null"};被分为0类(not abusive); doc2:{"stupid","garbage","null"}被分为1类(abusive)。分类正确!
五、示例:使用朴素贝叶斯过滤垃圾邮件:
在这个例子中,我们将了解朴素贝叶斯的一个最著名的应用:电子邮件垃圾过滤。首先看一下如何使用通用框架来解决问题:
(1)收集数据:提供文本文件;
(2)准备数据:将文本文件解析成词条向量;
(3)分析数据:检查词条确保解析的正确性;
(4)训练算法:使用我们之前建立的train_NB()函数;
(5)测试算法:使用classifyNB(),并且构建一个新的测试函数来计算文档集的错误率;
(6)使用算法:构建一个完整的程序对一组文档进行分类,将错分的文档输出到屏幕上。
5.1:切分文本:
首先我们将写一个python程序textParse.py来对所有的email文件进行解析,正常的邮件放在/email/ham/下,垃圾邮件放在/email/spam/下,将ham下每个文件解析完成后放在/email/hamParse/下,将spam下每个文件解析完成后放在/email/spamParse/下,email共享文件链接:http://yunpan.cn/Q4fXnTtGudGA9 。
代码textParse.py:
#!/usr/bin/env python def textParse(bigString): import re listOfTokens = re.split(r‘\W*‘,bigString) return [tok.lower() for tok in listOfTokens if len(tok) > 2 ] def spamTest(): for i in range(1,26): wordList = textParse( open(‘./email/ham/%d.txt‘ % i).read() ) fp = open( ‘./email/hamParse/%d.dat‘ % i , ‘w‘) for item in wordList: fp.write(item+‘ ‘) wordList = textParse( open(‘./email/spam/%d.txt‘ % i).read() ) fp = open( ‘./email/spamParse/%d.dat‘ % i , ‘w‘) for item in wordList: fp.write(item+‘ ‘) spamTest()
分析:上面的python代码就是读入文本数据,然后切分,得到词向量,然后将词向量中的词都转换成小写,并把长度大于2的提取出来,写入文本文件中去。文本解析是一个相当复杂的过程,可以根据自己的情况自行修改。
5.2:测试算法:使用朴素贝叶斯进行交叉验证
完整代码NB3.cc:
/* * code list 4-1 : transfer func from docs list to vocabulary list * code list 4-2 : training func on Naive Bayes Classifier * code list 4-3 : naive bayes classify function * add code list 4-4 : naive bayes bag-of-word model * add code list 4-5 : text parse : textParse.py and spam email test function : get_error_rate() * */ #include<iostream> #include<map> #include<set> #include<cmath> #include<vector> #include<algorithm> #include<numeric> #include<cstring> #include<stdio.h> #include<cstdlib> #include<fstream> #include<stdlib.h> #include<unistd.h> #include<string.h> using namespace std; class NaiveBayes { private: vector< vector<string> > list_of_docs; vector<int> list_classes; map<string,int> my_vocab_list; int *return_vec; vector< vector<int> > train_mat; vector<float> p0vect; vector<float> p1vect; float p_abusive; ifstream fin; ofstream fout; int test_data_num; public: NaiveBayes() { cout<<"please input the num of test data which should be less than 24 : "<<endl; cin>>test_data_num; vector<string> vec; string word; string filename; char buf[3]; string buf_str; for(int i=test_data_num+1;i<=25;i++) { sprintf(buf,"%d",i); //convert digit to string vec.clear(); buf_str = buf; filename = "./email/hamParse/"+buf_str+".dat"; //cout<<"filename : "<<filename<<endl; fin.open( filename.c_str() ); if(!fin) { cerr<<"open the file "<<filename<<" error"<<endl; exit(1); } while(fin>>word) { vec.push_back(word); } list_of_docs.push_back( vec ); list_classes.push_back(0); filename.clear(); fin.close(); } for(int i=test_data_num+1;i<=25;i++) { sprintf(buf,"%d",i); vec.clear(); buf_str = buf; filename = "./email/spamParse/"+buf_str+".dat"; //cout<<"filename : "<<filename<<endl; fin.open( filename.c_str() ); if(!fin) { cerr<<"open the file "<<filename<<" error"<<endl; } while(fin>>word) { vec.push_back(word); } list_of_docs.push_back( vec ); list_classes.push_back(1); filename.clear(); fin.close(); } } ~NaiveBayes() { fin.close(); fout.close(); list_of_docs.clear(); list_classes.clear(); my_vocab_list.clear(); train_mat.clear(); //delete [] return_vec; p0vect.clear(); p1vect.clear(); } void create_vocab_list() { vector< vector<string> > :: iterator it = list_of_docs.begin(); int index = 1; while( it!=list_of_docs.end() ) { //vector<string> vec( *it.begin(),*it.end() ); vector<string> vec = *it; vector<string> :: iterator tmp_it = vec.begin(); while( tmp_it!=vec.end() ) { //cout<<*tmp_it<<" "; if( my_vocab_list[*tmp_it] == 0 ) { my_vocab_list[*tmp_it] = index++; //index is the location of the vovabulary } tmp_it++; } it++; } }//create_vocab_list //set some one word to vec with 0 and 1. void beg_of_words_to_vec(int idx) { //cout<<"set of words to vec begin the document id is : "<<idx<<endl; int len = my_vocab_list.size()+1; return_vec = new int[ len ](); //pay attention to the difference between "new int[len]". initalize all the element to zero. fill(return_vec,return_vec+len,0); vector< vector<string> >:: iterator it = list_of_docs.begin() + idx - 1 ; vector<string> vec = *it; vector<string> :: iterator itt = vec.begin(); int pos = 0 ; while( itt!=vec.end() ) { // cout<<*itt<<" "; pos = my_vocab_list[ *itt ]; if(pos!=0) { return_vec[pos] += 1; } itt++; } }//beg_of_words_to_vec void get_train_matrix() { cout<<"get train matrix begin : "<<endl; train_mat.clear(); for(int i=1;i<=list_of_docs.size();i++) { beg_of_words_to_vec(i); vector<int> vec( return_vec , return_vec + my_vocab_list.size()+1 ); train_mat.push_back(vec); delete []return_vec; } }//get train matrix void print() { cout<<"print the train matrix begin : "<<endl; vector< vector<int> > :: iterator it = train_mat.begin(); while(it!=train_mat.end()) { vector<int> vec = *it; vector<int> :: iterator itt = vec.begin(); while( itt!=vec.end()) { cout<<*itt<<" "; itt++; } cout<<endl; it++; } }//print() void train_NB0() { int num_train_docs = train_mat.size();//sizeof(docs_lists)/sizeof(docs_lists[0]); cout<<"num_train_docs = "<<num_train_docs<<endl; int num_words = train_mat[0].size() - 1 ; /* calculatr the sum of the abusive classes */ int sum = accumulate(list_classes.begin(),list_classes.end(),0); cout<<"sum = "<<sum<<endl; //float p_abusive = (float)sum/(float)num_train_docs; p_abusive = (float)sum/(float)num_train_docs; cout<<"p_abusive = "<<p_abusive<<endl; //vector<float> p0vect(train_mat[0].size(),1); //the frequency of each word in non-absusive docs p0vect.resize(train_mat[0].size(),1); //vector<float> p1vect(train_mat[0].size(),1); //the frequency of each word in abusive docs p1vect.resize(train_mat[0].size(),1); printf("p0num.size() = %d , p1num.size() = %d\n",p0vect.size(),p1vect.size()); float p0Denom = 2.0; //the total number of words in non-abusive docs float p1Denom = 2.0; //the total number of words in abusive docs /* calculate the p0num,p1num,p0Denom,p1Denom */ for(int i=0;i<list_classes.size();i++) { if(list_classes[i] == 1) //abusive doc { for(int j=0;j<p1vect.size();j++) { p1vect[j] += train_mat[i][j]; if(train_mat[i][j]==1) p1Denom++; } } else //non-abusive doc { for(int j=0;j<p0vect.size();j++) { p0vect[j] += train_mat[i][j]; if(train_mat[i][j]==1) p0Denom++; } } } for(int i=0;i<p1vect.size();i++) { p0vect[i] = log(p0vect[i]/p0Denom); p1vect[i] = log(p1vect[i]/p1Denom); } cout<<endl; } int classify_NB(const char *filename ) { return_vec = new int[ my_vocab_list.size()+1 ](); fin.open(filename); if(!fin) { cerr<<"fail to open the file "<<filename<<endl; exit(1); } string word; while(fin>>word) { int pos = my_vocab_list[ word ]; if( pos!=0 ) { return_vec[ pos ] += 1; } } fin.close(); cout<<endl; float p1 = inner_product( p1vect.begin()+1,p1vect.end(),return_vec+1,0 ) + log(p_abusive); float p0 = inner_product( p0vect.begin()+1,p0vect.end(),return_vec+1,0 ) + log(1-p_abusive); cout<<"p1 = "<<p1<<" "<<"p0 = "<<p0<<endl; if( p1>p0 ) { return 1; } else { return 0; } } void get_error_rate() { string filename ; char buf[3]; string buf_str; int error_count = 0; for(int i=1;i<=test_data_num;i++) { sprintf(buf,"%d",i); buf_str = buf; filename = "./email/hamParse/"+buf_str+".dat"; if( classify_NB( filename.c_str() ) != 0 ) { error_count++; } filename = "./email/spamParse/"+buf_str+".dat"; if( classify_NB( filename.c_str() ) != 1 ) { error_count++; } } cout<<"the error rate is : "<<(float)error_count/(float)(2*test_data_num)<<endl; } }; int main() { NaiveBayes nb; nb.create_vocab_list(); //nb.beg_of_words_to_vec(5); //nb.beg_of_words_to_vec(30); nb.get_train_matrix(); //nb.print(); nb.train_NB0(); char doc1_to_classify[] = "./email/hamParse/1.dat"; char doc2_to_classify[] = "./email/spamParse/1.dat"; cout<<"doc1 classified as : "<<nb.classify_NB( doc1_to_classify )<<endl; cout<<"doc2 classified as : "<<nb.classify_NB( doc2_to_classify )<<endl; nb.get_error_rate(); return 0; }
makefile:
target: ./textParse.py g++ NB3.cc ./a.out clean: rm ./email/spamParse/* ./email/hamParse/* a.out
代码中增加了get_error_rate()函数测试分类函数的错误率。email中ham和spam下分别有25个文本文件,我们定义了成员变量test_data_num,那么我们就将ham/spam下第1~test_data_num的邮件当做测试集,第test_data_num+1~25的邮件当做训练集。这种随机选择数据的一部分作为训练集,而剩余部分作为测试集的过程称为留存交叉验证。那么在构造函数中就将第test_data_num+1~25的数据来初始化list_of_doc,进一步通过create_vocab_list()和get_train_matrix()得到train_mat,再通过训练函数train_NB0()得到p0vect和p1vect,通过classify_NB()对文本进行分类,get_error_rate()测试分类函数的错误率。
下面展示一个在test_data_num = 7情况下的结果:
错误率在7%左右。经过测试随着test_data_num的增加错误率会减小,知道test_data_num=12的时候降为4%。之后又会随着test_data_num的增加而上升。
如果还有问题可以留言进行交流,谢谢!
注明出处:http://blog.csdn.net/lavorange/article/details/17841383
参考:
《概率论与数理统计第四版》浙大版
《机器学习实战中文版》