First, to avoid confusion, I define some notation of my own here:

i indexes over samples, of which there are m
j indexes over features, of which there are n
k indexes over classes, of which there are c

In these notes, we describe the Softmax regression model. This model generalizes logistic regression to classification problems where the class label y can take on more than two possible values. This will be useful for such problems as MNIST digit classification, where the goal is to distinguish between 10 different numerical digits. Softmax regression is a supervised learning algorithm, but we will later be using it in conjunction with our deep learning/unsupervised feature learning methods.
Recall that in logistic regression, we had a training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ of m labeled examples, where the input features are $x^{(i)} \in \mathbb{R}^{n+1}$. (In this set of notes, we will use the notational convention of letting the feature vectors x be n + 1 dimensional, with $x_0 = 1$ corresponding to the intercept term.) With logistic regression, we were in the binary classification setting, so the labels were $y^{(i)} \in \{0, 1\}$. Our hypothesis took the form:
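The hypothesis equation was an image in the original post and did not survive extraction; the standard logistic-regression form being referred to is:

```latex
h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)}
```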
In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label y can take on c different values, rather than only two. Thus, in our training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$, we now have $y^{(i)} \in \{1, 2, \ldots, c\}$.
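The softmax hypothesis (also lost in extraction; restored from the standard formulation these notes describe) outputs a vector of c estimated class probabilities:

```latex
h_\theta(x^{(i)}) =
\begin{bmatrix}
P(y^{(i)} = 1 \mid x^{(i)}; \theta) \\
\vdots \\
P(y^{(i)} = c \mid x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{\sum_{l=1}^{c} e^{\theta_l^T x^{(i)}}}
\begin{bmatrix}
e^{\theta_1^T x^{(i)}} \\
\vdots \\
e^{\theta_c^T x^{(i)}}
\end{bmatrix}
```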
We now describe the cost function that we’ll use for softmax regression. In the equation below, 1{ ⋅} is the indicator function, so that 1{a true statement} = 1, and 1{a false statement} = 0. For example, 1{2 + 2 = 4} evaluates to 1; whereas 1{1 + 1 = 5} evaluates to 0. Our cost function will be:
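The cost function itself was an image in the original; the standard softmax cost it describes is:

```latex
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{k=1}^{c} 1\{y^{(i)} = k\} \log \frac{e^{\theta_k^T x^{(i)}}}{\sum_{l=1}^{c} e^{\theta_l^T x^{(i)}}} \right]
```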
Recall the meaning of the "$\nabla_{\theta_k}$" notation. In particular, $\nabla_{\theta_k} J(\theta)$ is itself a vector, so that its j-th element is $\frac{\partial J(\theta)}{\partial \theta_{kj}}$, the partial derivative of $J(\theta)$ with respect to the j-th element of $\theta_k$.
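Taking derivatives of the cost, the gradient with respect to one parameter vector $\theta_k$ (the formula the next paragraph refers to) is:

```latex
\nabla_{\theta_k} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{y^{(i)} = k\} - P(y^{(i)} = k \mid x^{(i)}; \theta) \right) \right]
```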
Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it minimize J(θ). For example, with the standard implementation of gradient descent, on each iteration we would perform the update $\theta_k := \theta_k - \alpha \nabla_{\theta_k} J(\theta)$ (for each $k = 1, \ldots, c$). When implementing softmax regression, we will typically use a modified version of the cost function described above; specifically, one that incorporates weight decay. We describe the motivation and details below.

Softmax regression has an unusual property: it has a "redundant" set of parameters. To explain what this means, suppose we take each of our parameter vectors $\theta_k$ and subtract some fixed vector ψ from it, so that every $\theta_k$ is replaced with $\theta_k - \psi$ (for every $k = 1, \ldots, c$); the hypothesis is unchanged, because the common factor $e^{-\psi^T x^{(i)}}$ cancels between numerator and denominator.
However, besides weight decay, here we also subtract the maximum value. The reason is as follows.
When the products $\theta_k^T x^{(i)}$ are large, the exponential $e^{\theta_k^T x^{(i)}}$ will become very large and possibly overflow. To prevent overflow, we can simply subtract some large constant value from each of the $\theta_k^T x^{(i)}$ terms before computing the exponential. In practice, for each example, you can use the maximum of the $\theta_k^T x^{(i)}$ terms as the constant.
Here a is a row vector; each of its entries is the largest element of the corresponding column $\theta^T x^{(i)}$ of the matrix $\theta^T x$.
Note that here θ is a $c \times n$ matrix and x is an $n \times m$ matrix (one sample per column).
We will modify the cost function by adding a weight decay term $\frac{\lambda}{2} \sum_{k=1}^{c} \sum_{j=0}^{n} \theta_{kj}^2$, which penalizes large values of the parameters. Our cost function is now
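The regularized cost (reconstructed here, as the display equation was lost in extraction) adds the weight-decay term to the cost above:

```latex
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{k=1}^{c} 1\{y^{(i)} = k\} \log \frac{e^{\theta_k^T x^{(i)}}}{\sum_{l=1}^{c} e^{\theta_l^T x^{(i)}}} \right] + \frac{\lambda}{2} \sum_{k=1}^{c} \sum_{j=0}^{n} \theta_{kj}^2
```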
Differentiating J(θ) gives the result below; the derivation is as follows:
Note that the following identity is used:
$\frac{\partial x^T a}{\partial x} = \frac{\partial a^T x}{\partial x} = a$ (from The Matrix Cookbook, §2.4.1)
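Carrying the weight-decay term through the differentiation, the resulting gradient (the display equation did not survive extraction) is the earlier gradient plus $\lambda \theta_k$:

```latex
\nabla_{\theta_k} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{y^{(i)} = k\} - P(y^{(i)} = k \mid x^{(i)}; \theta) \right) \right] + \lambda \theta_k
```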
By minimizing J(θ) with respect to θ, we will have a working implementation of softmax regression.
The full source code is given below.
// Softmax regression
#include <iostream>
#include <vector>
#include <cstdio>
#include <cmath>
#include <ctime>
#include <armadillo>
using namespace arma;
using namespace std;

#define elif else if
#define MAX_ITER 1000000

double cost = 0.0;
mat grad;
double lrate = 0.1;
double lambda = 0.0;
int nclasses = 2;

colvec vec2colvec(vector<double>& vec) {
    int length = vec.size();
    colvec A(length);
    for (int i = 0; i < length; i++) A(i) = vec[i];
    return A;
}

rowvec vec2rowvec(vector<double>& vec) {
    colvec A = vec2colvec(vec);
    return A.t();
}

mat vec2mat(vector<vector<double> >& vec) {
    int cols = vec.size();
    int rows = vec[0].size();
    mat A(rows, cols);
    for (int i = 0; i < cols; i++)
        for (int j = 0; j < rows; j++)
            A(j, i) = vec[i][j];
    return A;
}

// predict: the index of each column's largest element is the predicted class
// (function name reconstructed; the original was lost in extraction)
rowvec predictResult(mat M) {
    rowvec result(M.n_cols);
    for (int i = 0; i < (int)M.n_cols; i++) {
        double maxele = M(0, i);
        int which = 0;
        for (int j = 1; j < (int)M.n_rows; j++) {
            if (M(j, i) > maxele) {
                maxele = M(j, i);
                which = j;
            }
        }
        result(i) = which;
    }
    return result;
}

void update_CostFunction_and_Gradient(mat x, rowvec y, mat weightsMatrix, double lambda) {
    int nsamples = x.n_cols;
    int nfeatures = x.n_rows;
    // calculate cost function
    mat theta(weightsMatrix);
    mat M = theta * x;
    mat temp = repmat(max(M, 0), nclasses, 1);
    M = M - temp;
    M = arma::exp(M);
    temp = repmat(sum(M, 0), nclasses, 1);
    M = M / temp;
    mat groundTruth = zeros<mat>(nclasses, nsamples);
    for (int i = 0; i < nsamples; i++) {
        groundTruth(y(i), i) = 1;
    }
    cost = -accu(groundTruth % arma::log(M)) / nsamples
           + lambda / 2.0 * accu(arma::square(theta));
    grad = -(groundTruth - M) * x.t() / nsamples + lambda * theta;
}

void softmax(vector<vector<double> >& vecX, vector<double>& vecY,
             vector<vector<double> >& testX, vector<double>& testY) {
    int nsamples = vecX.size();
    int nfeatures = vecX[0].size();
    // change vecX and vecY into matrix or vector.
    rowvec y = vec2rowvec(vecY);
    mat x = vec2mat(vecX);
    double init_epsilon = 0.12;
    mat weightsMatrix = randu<mat>(nclasses, nfeatures);
    weightsMatrix = weightsMatrix * (2 * init_epsilon) - init_epsilon;
    grad = zeros<mat>(nclasses, nfeatures);
    /*
    // Gradient Checking (remember to disable this part after you're sure the
    // cost function and dJ function are correct)
    update_CostFunction_and_Gradient(x, y, weightsMatrix, lambda);
    mat dJ(grad);
    cout << "test!!!!" << endl;
    ... (rest of the gradient-checking block lost in extraction) ...
    */
    int converge = 0;
    double lastcost = 0.0;
    while (converge < MAX_ITER) {
        update_CostFunction_and_Gradient(x, y, weightsMatrix, lambda);
        weightsMatrix -= lrate * grad;
        cout << "learning step: " << converge << ", Cost function value = " << cost << endl;
        if (fabs(cost - lastcost) <= 5e-6 && converge > 0) break;
        lastcost = cost;
        ++converge;
    }
    /* ... evaluation on testX/testY lost in extraction ... */
}

int main(int argc, char** argv) {
    clock_t start, end;
    FILE *streamX, *streamY;
    //read training X from .txt file
    streamX = fopen("trainX.txt", "r");
    int numofX = 30;
    vector<vector<double> > vecX;
    double tpdouble;
    int counter = 0;
    while (1) {
        if (fscanf(streamX, "%lf", &tpdouble) == EOF) break;
        if (counter / numofX >= vecX.size()) {
            vector<double> tpvec;
            vecX.push_back(tpvec);
        }
        vecX[counter / numofX].push_back(tpdouble);
        ++counter;
    }
    fclose(streamX);
    cout << vecX.size() << ", " << vecX[0].size() << endl;

    //read training Y from .txt file
    streamY = fopen("trainY.txt", "r");
    vector<double> vecY;
    while (1) {
        if (fscanf(streamY, "%lf", &tpdouble) == EOF) break;
        vecY.push_back(tpdouble);
    }
    fclose(streamY);

    /* for (int i = 1; i ... (loop body lost in extraction) ... */

    //read test X from .txt file
    streamX = fopen("testX.txt", "r");
    vector<vector<double> > vecTX;
    counter = 0;
    while (1) {
        if (fscanf(streamX, "%lf", &tpdouble) == EOF) break;
        if (counter / numofX >= vecTX.size()) {
            vector<double> tpvec;
            vecTX.push_back(tpvec);
        }
        vecTX[counter / numofX].push_back(tpdouble);
        ++counter;
    }
    fclose(streamX);

    //read test Y from .txt file
    streamY = fopen("testY.txt", "r");
    vector<double> vecTY;
    while (1) {
        if (fscanf(streamY, "%lf", &tpdouble) == EOF) break;
        vecTY.push_back(tpdouble);
    }
    fclose(streamY);

    start = clock();
    softmax(vecX, vecY, vecTX, vecTY);
    end = clock();
    cout << "Training used time: " << ((double)(end - start)) / CLOCKS_PER_SEC << " second" << endl;
    return 0;
}
Next, let's walk through the source code and see how the algorithm is implemented.
    FILE *streamX, *streamY;
    streamX = fopen("trainX.txt", "r");
    int numofX = 30;
    vector<vector<double> > vecX;
    double tpdouble;
    int counter = 0;
    while (1) {
        if (fscanf(streamX, "%lf", &tpdouble) == EOF) break;
        if (counter / numofX >= vecX.size()) {
            vector<double> tpvec;
            vecX.push_back(tpvec);
        }
        vecX[counter / numofX].push_back(tpdouble);
        ++counter;
    }
    fclose(streamX);
This part reads the data in trainX.txt into vecX.

trainX.txt has 30 values per row and 284 rows in total, so the outer size of vecX is 284 and the inner size is 30; represented as a matrix it is 284 × 30. Similarly, let's look at the data in trainY.txt.
    //read training Y from .txt file
    streamY = fopen("trainY.txt", "r");
    vector<double> vecY;
    while (1) {
        if (fscanf(streamY, "%lf", &tpdouble) == EOF) break;
        vecY.push_back(tpdouble);
    }
    fclose(streamY);
trainY.txt has 1 value per row and 284 rows in total, i.e. a 284 × 1 matrix. Similarly, testX and testY have the same shapes as trainX and trainY respectively.
Next, let's look at the softmax function.
    int nsamples = vecX.size();        // 284
    int nfeatures = vecX[0].size();    // 30
    // change vecX and vecY into matrix or vector.
    rowvec y = vec2rowvec(vecY);
    mat x = vec2mat(vecX);             // x: 30 * 284
    double init_epsilon = 0.12;
    // weightsMatrix (2 * 30) is the coefficient matrix θ we want to learn:
    // each class has its own θ vector with n components
    mat weightsMatrix = randu<mat>(nclasses, nfeatures);
    // give θ a random initial value in (-init_epsilon, init_epsilon)
    weightsMatrix = weightsMatrix * (2 * init_epsilon) - init_epsilon;
    grad = zeros<mat>(nclasses, nfeatures);   // grad: 2 * 30
    int converge = 0;
    double lastcost = 0.0;
    while (converge < MAX_ITER) {
        update_CostFunction_and_Gradient(x, y, weightsMatrix, lambda);
        weightsMatrix -= lrate * grad;
        cout << "learning step: " << converge << ", Cost function value = " << cost << endl;
        if (fabs(cost - lastcost) <= 5e-6 && converge > 0) break;
        lastcost = cost;
        ++converge;
    }
Next, let's look at update_CostFunction_and_Gradient.
void update_CostFunction_and_Gradient(mat x, rowvec y, mat weightsMatrix, double lambda) {
    int nsamples = x.n_cols;    // 284
    int nfeatures = x.n_rows;   // 30
    // calculate cost function
    mat theta(weightsMatrix);
    mat M = theta * x;                          // theta: 2*30, x: 30*284, M: 2*284
    mat temp = repmat(max(M, 0), nclasses, 1);  // temp: 2*284; each column filled with that column's max of M
    M = M - temp;                               // subtract a, as described above
    M = arma::exp(M);                           // numerator
    temp = repmat(sum(M, 0), nclasses, 1);      // denominator
    M = M / temp;                               // per-column softmax probabilities
    mat groundTruth = zeros<mat>(nclasses, nsamples);   // groundTruth: 2*284
    for (int i = 0; i < nsamples; i++) {
        groundTruth(y(i), i) = 1;               // one-hot encode the labels
    }
    // (the remainder of the function was truncated in the original post;
    // reconstructed here to match the regularized formulas above)
    cost = -accu(groundTruth % arma::log(M)) / nsamples
           + lambda / 2.0 * accu(arma::square(theta));
    grad = -(groundTruth - M) * x.t() / nsamples + lambda * theta;
}
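In matrix form, what this function computes (writing G for the one-hot groundTruth matrix and P for the matrix of per-column softmax probabilities) follows directly from the regularized cost and gradient formulas given earlier:

```latex
P = \operatorname{softmax}(\theta x), \qquad
J(\theta) = -\frac{1}{m} \sum_{k,i} G_{ki} \log P_{ki} + \frac{\lambda}{2} \lVert \theta \rVert_F^2, \qquad
\nabla_\theta J(\theta) = -\frac{1}{m} (G - P)\, x^T + \lambda \theta
```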
After that, gradient descent simply iterates as in the loop shown earlier; I won't explain the remaining code in detail.