I was watching the replay of episode 3 of 《我是歌手》 (I Am a Singer), and Zhang Yu's singing was so awful that I figured I might as well come back and write a blog post instead.
1. Motivation
In my previous two implementations, the first handled only 0-1 features and the second only real-valued features; I haven't tried mixing the two, and I don't plan to this time either. Both earlier implementations dealt with binary classification, but in real life the problems you run into most often are multi-class. There are two ways to turn a binary classifier into a multi-class one: one vs. all, or one vs. one. With the former, for K classes you build K classifiers, each separating one class from all the rest. With the latter, K classes require K(K-1)/2 classifiers, one for each pair of classes. In engineering practice, I doubt many people would go out of their way to choose the latter.
These past couple of days I have been reading Li Hang's 《统计学习方法》 (Statistical Learning Methods) again, flipping through the chapter on logistic regression. The book briefly gives the formula for multi-class logistic regression, so my hands got itchy and I decided to implement one.
2. Principle
For the binary LR model, the probability of the current class is finally computed with the sigmoid function: f(x) = 1.0 / (1.0 + exp(-x)). In actual computation, to improve numerical precision, its equivalent form is often used instead: f(x) = exp(x) / (1.0 + exp(x)).
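To make the two forms concrete, here is a minimal sketch (the function names are mine, not from the accompanying code):

    #include <cmath>

    // textbook form: exp(-x) can overflow for large negative x
    double SigmoidPlain (double x)
    {
        return 1.0 / (1.0 + exp (-x));
    }

    // the variant mentioned above: exp(x) / (1.0 + exp(x));
    // note it instead risks overflow in exp(x) for large positive x
    double SigmoidVariant (double x)
    {
        double e = exp (x);
        return e / (1.0 + e);
    }

Mathematically the two forms are identical; they differ only in which end of the input range is numerically troublesome.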
For the multi-class problem, the probability function given in Li Hang's book is: f_i(x) = exp(x_i) / (1.0 + sum_j exp(x_j)), where x_i is the linear score of class i and the sum over j runs over the K-1 non-default classes; the remaining default class takes the leftover probability 1.0 / (1.0 + sum_j exp(x_j)). Please bear with the plain-text notation: anything carrying '_i' or '_j' denotes a subscript.
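A small illustrative sketch of this computation, assuming the scores for the K-1 non-default classes have already been calculated (this is not the CalcFuncOutByFeaVecForAllClass from the header shown later in this post):

    #include <cmath>
    #include <vector>
    using namespace std;

    // Given the linear scores x_i for the K-1 non-default classes, return
    // the K class probabilities per the formula above; the constant 1.0 in
    // the denominator is exp(0), the score of the implicit default class
    vector<double> CalcClassProbsSketch (const vector<double> & Scores)
    {
        double dDenom = 1.0;                        // exp(0) for the default class
        for (size_t i = 0; i < Scores.size (); ++i)
            dDenom += exp (Scores[i]);

        vector<double> Probs (Scores.size () + 1, 0.0);
        for (size_t i = 0; i < Scores.size (); ++i)
            Probs[i] = exp (Scores[i]) / dDenom;
        Probs[Scores.size ()] = 1.0 / dDenom;       // the default class
        return Probs;
    }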
As for the SGD weight-update formula, I did not derive it in detail; judging from the code, only minor changes to the earlier implementation were needed.
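For completeness, the standard textbook gradient for this model (my own note, not taken from the book or from my code): for each non-default class i and each feature j, theta_i_j += learningRate * (1{y == i} - f_i(x)) * x_j. A hedged sketch follows, with sparse features stored as (id, value) pairs in place of the FeaValNode class defined in the header below:

    #include <vector>
    #include <utility>
    using namespace std;

    // One SGD step for multinomial LR with a default class (a sketch of the
    // standard cross-entropy update, not the actual UpdateThetaMatrix);
    // Theta has K-1 rows, ClassProb comes from the formula above
    void SgdStepSketch (vector< vector<double> > & Theta,
                        const vector< pair<int, double> > & FeaVec,  // (feature id, value)
                        const vector<double> & ClassProb,
                        int iTrueClass, double dLearningRate)
    {
        for (size_t i = 0; i < Theta.size (); ++i)   // non-default classes only
        {
            // 1 if this row is the true class, 0 otherwise,
            // minus the predicted probability for this class
            double dGrad = ((int) i == iTrueClass ? 1.0 : 0.0) - ClassProb[i];
            for (size_t j = 0; j < FeaVec.size (); ++j)
                Theta[i][FeaVec[j].first] += dLearningRate * dGrad * FeaVec[j].second;
        }
    }

When the true class is the default one, the indicator is 0 for every row, which is exactly what the update requires.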
3. Parameters
When handling binary classification, every feature has a single weight, and the weights form a weight vector. When handling K-way classification, every feature has one weight per class, so feature + class together form a weight matrix. With N features, the matrix just described has size K * N. However, just as a 1 * N vector suffices for binary classification, a (K-1) * N matrix suffices for K classes, as long as one default class is added. I'm not sure I've explained this clearly; the code is as follows:
private:
    // the number of target classes
    int iClassNum;
    // the number of features
    int iFeatureNum;
    // the theta matrix of (iClassNum - 1) * iFeatureNum
    // note: for binary classification we need only 1 theta vector; for multi-class,
    // (iClassNum - 1) * iFeatureNum is always enough
    vector< vector<double> > ThetaMatrix;
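For illustration, a plausible way to allocate this matrix (the real InitThetaMatrix lives in LogisticRegression.cpp, covered in the follow-up post):

    #include <vector>
    using namespace std;

    // allocate (iClassNum - 1) rows of iFeatureNum weights, zero-initialized;
    // the K-th class is the implicit default and needs no weight vector
    bool InitThetaMatrixSketch (vector< vector<double> > & ThetaMatrix,
                                int iClassNum, int iFeatureNum)
    {
        if (iClassNum < 2 || iFeatureNum < 1)
            return false;
        ThetaMatrix.assign (iClassNum - 1, vector<double> (iFeatureNum, 0.0));
        return true;
    }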
4. Overall Interface
The overall function interface is defined in the file LogisticRegression.h, as follows:
/***********************************************************************************
 * Logistic Regression classifier version 0.03
 * Implemented by Jinghui Xiao (xiaojinghui@gmail.com or xiaojinghui1978@qq.com)
 * Last updated on 2014-1-17
 ***********************************************************************************/
#pragma once

#include <vector>
#include <string>
#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>
#include <algorithm>
#include <cmath>
using namespace std;

// The representation for a feature and its value, init with '-1'
class FeaValNode
{
public:
    int iFeatureId;
    double dValue;

    FeaValNode (void);
    ~FeaValNode (void);
};

// The representation for a sample
class Sample
{
public:
    // the class index for a sample, init with '-1'
    int iClass;
    vector<FeaValNode> FeaValNodeVec;

    Sample (void);
    ~Sample (void);
};

// the minimal float number for smoothing when scaling the input samples
#define SMOOTHFATOR 1e-100

// The logistic regression classifier for MULTI-classes
class LogisticRegression
{
public:
    LogisticRegression (void);
    ~LogisticRegression (void);

    // scale all of the sample values and put the result into txt
    bool ScaleAllSampleValTxt (const char * sFileIn, int iFeatureNum, const char * sFileOut);
    // train by SGD on the sample file
    bool TrainSGDOnSampleFile (
        const char * sFileName, int iClassNum, int iFeatureNum,   // about the samples
        double dLearningRate,                                     // about the learning
        int iMaxLoop, double dMinImproveRatio                     // about the stop criteria
        );
    // train by SGD on the sample file, decreasing dLearningRate during the loop
    bool TrainSGDOnSampleFileEx (
        const char * sFileName, int iClassNum, int iFeatureNum,   // about the samples
        double dLearningRate,                                     // about the learning
        int iMaxLoop, double dMinImproveRatio                     // about the stop criteria
        );
    // save the model to a txt file: the theta matrix with its size
    bool SaveLRModelTxt (const char * sFileName);
    // load the model from a txt file: the theta matrix with its size
    bool LoadLRModelTxt (const char * sFileName);
    // load the samples from file, predict by the LR model
    bool PredictOnSampleFile (const char * sFileIn, const char * sFileOut, const char * sFileLog);
    // just for test
    void Test (void);

private:
    // read a sample from a line, return false on failure
    bool ReadSampleFrmLine (string & sLine, Sample & theSample);
    // load all of the samples into a sample vector, used for scaling the samples
    bool LoadAllSamples (const char * sFileName, vector<Sample> & SampleVec);
    // initialize the theta matrix with iClassNum and iFeatureNum
    bool InitThetaMatrix (int iClassNum, int iFeatureNum);
    // calculate the model function output for iClassIndex by the feature vector
    double CalcFuncOutByFeaVec (vector<FeaValNode> & FeaValNodeVec, int iClassIndex);
    // calculate the model function output for all the classes, return the class index with max probability
    int CalcFuncOutByFeaVecForAllClass (vector<FeaValNode> & FeaValNodeVec, vector<double> & ClassProbVec);
    // calculate the gradient and update the theta matrix, returning the cost
    double UpdateThetaMatrix (Sample & theSample, vector<double> & ClassProbVec, double dLearningRate);
    // predict the class for one single sample
    int PredictOneSample (Sample & theSample);

private:
    // the number of target classes
    int iClassNum;
    // the number of features
    int iFeatureNum;
    // the theta matrix of (iClassNum - 1) * iFeatureNum
    // note: for binary classification we need only 1 theta vector; for multi-class,
    // (iClassNum - 1) * iFeatureNum is always enough
    vector< vector<double> > ThetaMatrix;
};
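To show how the pieces fit together, here is a hypothetical driver; the file names, class/feature counts, and hyper-parameter values are all made up for illustration, and the sample file format is whatever ReadSampleFrmLine expects:

    #include "LogisticRegression.h"

    int main (void)
    {
        LogisticRegression LR;

        // scale the raw samples first (assumed workflow; see ScaleAllSampleValTxt)
        LR.ScaleAllSampleValTxt ("train_raw.txt", 10000, "train_scaled.txt");

        // train: 5 classes, 10000 features, learning rate 0.1,
        // at most 200 loops, stop when the improvement ratio drops below 1e-6
        LR.TrainSGDOnSampleFileEx ("train_scaled.txt", 5, 10000, 0.1, 200, 1e-6);
        LR.SaveLRModelTxt ("lr_model.txt");

        // predict on a test file with the saved model
        LR.LoadLRModelTxt ("lr_model.txt");
        LR.PredictOnSampleFile ("test_scaled.txt", "test_result.txt", "test_log.txt");

        return 0;
    }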
I added a sample-scaling function for preprocessing the training and test samples, plus a second SGD training function whose difference is that the learning rate decays gradually over the iterations. The function implementations are in LogisticRegression.cpp; see the follow-up post.
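As a sketch of the decaying-rate idea (an assumption on my part; the actual schedule inside TrainSGDOnSampleFileEx is in the .cpp):

    #include <cstdio>

    // hypothetical per-loop decay: the rate shrinks geometrically, so early
    // loops take big steps and later loops fine-tune; 0.9 is a made-up factor
    int main (void)
    {
        double dLearningRate = 0.1;
        int iMaxLoop = 10;
        for (int iLoop = 0; iLoop < iMaxLoop; ++iLoop)
        {
            // ... one SGD pass over the sample file would go here ...
            printf ("loop %d: learning rate = %f\n", iLoop, dLearningRate);
            dLearningRate *= 0.9;   // decrease the rate after each loop
        }
        return 0;
    }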
Please credit the source when reposting: http://blog.csdn.net/xceman1997/article/details/18426073