Implementing Logistic Regression for the Third Time (C++): The Interface

I was watching the replay of episode three of 《我是歌手》 (I Am a Singer); Zhang Yu's singing was so bad that I decided to come back and write a blog post instead.


1. Motivation

Of my first two implementations, the first was limited to 0-1 input features and the second to real-valued features; I have never tried mixing the two, and I don't plan to this time either. Both handled only binary classification, yet the problems one meets most often in practice are multi-class. There are two ways to turn a binary classifier into a multi-class one: one vs. all, or one vs. one. With the former, for K classes you build K classifiers, each separating one class from all the rest. With the latter, K classes require K(K-1)/2 classifiers, one for each pair of classes. In engineering practice, I suspect very few people bother with the latter.

These past two days I have been revisiting Dr. Li Hang's Statistical Learning Methods (《统计学习方法》), rereading the chapter on logistic regression. The book briefly gives the formula for multi-class logistic regression, so my hands got itchy and I decided to implement it.


2. Principle

For the binary LR model, the probability of the current class is computed with the sigmoid function: f(x) = 1.0 / (1.0 + exp(-x)). In practice, to improve numerical behavior, its equivalent form f(x) = exp(x) / (1.0 + exp(x)) is often used instead.

For the multi-class problem, the probability function given in Dr. Li Hang's book is: f_i(x) = exp(x_i) / (1.0 + sum_j exp(x_j)). Please bear with the plain-text notation: anything with '_i' or '_j' denotes a subscript.

As for the SGD weight-update formula, I have not derived it in detail; from the implementation's point of view, only minor changes to the previous code are needed.


3. Parameters

For binary classification, each feature has one weight, and the weights form a weight vector. For K-class classification, each feature has one weight per class, so feature x class, the weights form a weight matrix. With N features, the matrix just described has size K * N. Not quite, though: just as a 1 * N vector handles the binary case, a (K-1) * N matrix can handle K classes, provided one class is treated as the default. I hope that makes sense; the code is as follows:

private:
	// the number of target class
	int iClassNum;
	// the number of features
	int iFeatureNum;
	// the theta matrix of iFeatureNum * (iClassNum - 1)
	// note: for binary class, we need only 1 vector of theta; for multi-class, 
	// iFeatureNum * (iClassNum - 1) is always enough
	vector< vector<double> > ThetaMatrix;

4. The Overall Interface

The overall function interface, defined in LogisticRegression.h, is as follows:

/***********************************************************************************
* Logistic Regression classifier version 0.03
* Implemented by Jinghui Xiao (xiaojinghui@gmail.com or xiaojinghui1978@qq.com)
* Last updated on 2014-1-17
***********************************************************************************/

#pragma once

#include <vector>
#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>
#include <algorithm>
#include <cmath>

using namespace std;

// The representation for a feature and its value, init with '-1'
class FeaValNode
{
public:
	int iFeatureId;
	double dValue;

	FeaValNode (void);
	~FeaValNode (void);
};

// The representation for a sample
class Sample
{
public:
	// the class index for a sample, init with '-1'
	int iClass;
	vector<FeaValNode> FeaValNodeVec;

	Sample (void);
	~Sample (void);
};

// the minimal float value used for smoothing when scaling the input samples
#define SMOOTHFATOR 1e-100

// The logistic regression classifier for MULTI-classes
class LogisticRegression
{
public:
	LogisticRegression(void);
	~LogisticRegression(void);

	// scale all of the sample values and put the result into txt
	bool ScaleAllSampleValTxt (const char * sFileIn, int iFeatureNum, const char * sFileOut);
	// train by SGD on the sample file
	bool TrainSGDOnSampleFile (
				const char * sFileName, int iClassNum, int iFeatureNum,		// about the samples
				double dLearningRate,										// about the learning 
				int iMaxLoop, double dMinImproveRatio						// about the stop criteria
				);
	// train by SGD on the sample file, decreasing dLearningRate during loop
	bool TrainSGDOnSampleFileEx (
				const char * sFileName, int iClassNum, int iFeatureNum,		// about the samples
				double dLearningRate,										// about the learning 
				int iMaxLoop, double dMinImproveRatio						// about the stop criteria
				);
	// save the model to txt file: the theta matrix with its size
	bool SaveLRModelTxt (const char * sFileName);
	// load the model from txt file: the theta matrix with its size
	bool LoadLRModelTxt (const char * sFileName);
	// load the samples from file, predict by the LR model
	bool PredictOnSampleFile (const char * sFileIn, const char * sFileOut, const char * sFileLog);

	// just for test
	void Test (void);

private:
	// read a sample from a line, return false if fail
	bool ReadSampleFrmLine (string & sLine, Sample & theSample);
	// load all of the samples into sample vector, this is for scale samples
	bool LoadAllSamples (const char * sFileName, vector<Sample> & SampleVec);
	// initialize the theta matrix with iClassNum and iFeatureNum
	bool InitThetaMatrix (int iClassNum, int iFeatureNum);
	// calculate the model function output for iClassIndex by feature vector
	double CalcFuncOutByFeaVec (vector<FeaValNode> & FeaValNodeVec, int iClassIndex);
	// calculate the model function output for all the classes, and return the class index with max probability
	int CalcFuncOutByFeaVecForAllClass (vector<FeaValNode> & FeaValNodeVec, vector<double> & ClassProbVec);
	// calculate the gradient and update the theta matrix, it returns the cost
	double UpdateThetaMatrix (Sample & theSample, vector<double> & ClassProbVec, double dLearningRate);
	// predict the class for one single sample
	int PredictOneSample (Sample & theSample);

private:
	// the number of target class
	int iClassNum;
	// the number of features
	int iFeatureNum;
	// the theta matrix of iFeatureNum * (iClassNum - 1)
	// note: for binary class, we need only 1 vector of theta; for multi-class, 
	// iFeatureNum * (iClassNum - 1) is always enough
	vector< vector<double> > ThetaMatrix;
};

This version adds a sample-scaling function for preprocessing training and test samples, plus a second SGD training function that differs in gradually decaying the learning rate over the iterations. The function implementations are in LogisticRegression.cpp; see the follow-up post.

Please credit the source when reposting: http://blog.csdn.net/xceman1997/article/details/18426073

