Lesson 4 Part 2 Softmax Regression
Published: 2019-05-24


First, to avoid confusion, here are the index conventions I use in this post:

i indexes the training examples; there are m examples.

j indexes the features; there are n features.
k indexes the classes; there are c classes.

In these notes, we describe the Softmax regression model. This model generalizes logistic regression to classification problems where the class label y can take on more than two possible values. This will be useful for such problems as MNIST digit classification, where the goal is to distinguish between 10 different numerical digits. Softmax regression is a supervised learning algorithm, but we will later be using it in conjunction with our deep learning/unsupervised feature learning methods.

Recall that in logistic regression, we had a training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ of m labeled examples, where the input features are $x^{(i)} \in \mathbb{R}^{n+1}$. (In this set of notes, we will use the notational convention of letting the feature vectors x be n + 1 dimensional, with $x_0 = 1$ corresponding to the intercept term.) With logistic regression, we were in the binary classification setting, so the labels were $y^{(i)} \in \{0, 1\}$. Our hypothesis took the form:

$$h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)},$$

and the model parameters θ were trained to minimize the cost function

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\bigl(1 - h_\theta(x^{(i)})\bigr)\right]$$

In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label y can take on c different values, rather than only two. Thus, in our training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$, we now have that $y^{(i)} \in \{1, 2, \ldots, c\}$. (Note that our convention will be to index the classes starting from 1, rather than from 0.) For example, in the MNIST digit recognition task, we would have c = 10 different classes.

Given a test input x, we want our hypothesis to estimate the probability $p(y = k \mid x)$ for each value of $k = 1, \ldots, c$. I.e., we want to estimate the probability of the class label taking on each of the c different possible values. Thus, our hypothesis will output a c-dimensional vector (whose elements sum to 1) giving us our c estimated probabilities. Concretely, our hypothesis $h_\theta(x)$ takes the form:

$$h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)}=1 \mid x^{(i)}; \theta) \\ p(y^{(i)}=2 \mid x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)}=c \mid x^{(i)}; \theta) \end{bmatrix} = \frac{1}{\sum_{k=1}^{c} e^{\theta_k^T x^{(i)}}} \begin{bmatrix} e^{\theta_1^T x^{(i)}} \\ e^{\theta_2^T x^{(i)}} \\ \vdots \\ e^{\theta_c^T x^{(i)}} \end{bmatrix}$$

Here $\theta_1, \theta_2, \ldots, \theta_c \in \mathbb{R}^{n+1}$ are the parameters of our model. Notice that the term $\frac{1}{\sum_{k=1}^{c} e^{\theta_k^T x^{(i)}}}$ normalizes the distribution, so that it sums to one.

For convenience, we will also write θ to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent θ as a c-by-(n+1) matrix obtained by stacking up $\theta_1, \theta_2, \ldots, \theta_c$ in rows, so that

$$\theta = \begin{bmatrix} \text{---}\; \theta_1^T \;\text{---} \\ \text{---}\; \theta_2^T \;\text{---} \\ \vdots \\ \text{---}\; \theta_c^T \;\text{---} \end{bmatrix}$$
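To make this concrete, here is a minimal sketch of evaluating $h_\theta(x)$ for a single example, written with the same Armadillo library used by the full program later in this post (the function name and arguments here are illustrative, not part of that program):

```cpp
#include <armadillo>
using namespace arma;

// theta: c x (n+1) matrix with theta_k^T as its k-th row;
// x: (n+1) x 1 column vector with x(0) = 1 as the intercept term.
colvec softmax_hypothesis(const mat& theta, const colvec& x) {
    colvec scores = theta * x;   // scores(k) = theta_k^T x
    colvec e = exp(scores);      // element-wise exponential
    return e / accu(e);          // normalize so the c entries sum to 1
}
```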

Cost Function

We now describe the cost function that we’ll use for softmax regression. In the equation below, $1\{\cdot\}$ is the indicator function, so that 1{a true statement} = 1, and 1{a false statement} = 0. For example, 1{2 + 2 = 4} evaluates to 1; whereas 1{1 + 1 = 5} evaluates to 0. Our cost function will be:

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^{c} 1\{y^{(i)}=k\}\log\frac{e^{\theta_k^T x^{(i)}}}{\sum_{l=1}^{c} e^{\theta_l^T x^{(i)}}}\right]$$
Notice that this generalizes the logistic regression cost function, which could also have been written:

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}(1-y^{(i)})\log\bigl(1-h_\theta(x^{(i)})\bigr) + y^{(i)}\log h_\theta(x^{(i)})\right] = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=0}^{1} 1\{y^{(i)}=k\}\log p(y^{(i)}=k \mid x^{(i)};\theta)\right]$$

The softmax cost function is similar, except that we now sum over the c different possible values of the class label. Note also that in softmax regression, we have that $p(y^{(i)}=k \mid x^{(i)};\theta) = \frac{e^{\theta_k^T x^{(i)}}}{\sum_{l=1}^{c} e^{\theta_l^T x^{(i)}}}$.
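As a quick numerical illustration: with c = 3 classes, a single training example (m = 1) whose label is $y^{(1)} = 2$, and model probabilities $(0.2, 0.5, 0.3)$ for the three classes, only the indicator $1\{y^{(1)}=2\}$ is nonzero, so

$$J(\theta) = -\Bigl[1\{y^{(1)}=2\}\,\log p(y^{(1)}=2 \mid x^{(1)};\theta)\Bigr] = -\log 0.5 \approx 0.693.$$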
There is no known closed-form way to solve for the minimum of J(θ), and thus as usual we’ll resort to an iterative optimization algorithm such as gradient descent or L-BFGS. Taking derivatives, one can show that the gradient is:

$$\nabla_{\theta_k} J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Bigl[x^{(i)}\bigl(1\{y^{(i)}=k\} - p(y^{(i)}=k \mid x^{(i)};\theta)\bigr)\Bigr]$$

Recall the meaning of the “$\nabla_{\theta_k}$” notation. In particular, $\nabla_{\theta_k} J(\theta)$ is itself a vector, so that its j-th element is $\frac{\partial J(\theta)}{\partial \theta_{kj}}$, the partial derivative of J(θ) with respect to the j-th element of $\theta_k$.

Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it minimize J(θ). For example, with the standard implementation of gradient descent, on each iteration we would perform the update $\theta_k := \theta_k - \alpha \nabla_{\theta_k} J(\theta)$ (for each $k = 1, \ldots, c$).
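Continuing the sketch above (still Armadillo; here X is assumed to be an (n+1) × m matrix whose columns are the training inputs and groundTruth a c × m matrix of the indicators $1\{y^{(i)}=k\}$ — illustrative names, not necessarily the ones in the program later in this post), one batch gradient-descent step could look like:

```cpp
#include <armadillo>
using namespace arma;

// One batch gradient-descent step on theta (c x (n+1)).
// X: (n+1) x m inputs, groundTruth: c x m indicator matrix, alpha: learning rate.
void gradient_step(mat& theta, const mat& X, const mat& groundTruth, double alpha) {
    double m = X.n_cols;
    mat P = exp(theta * X);                        // (k, i) entry: e^{theta_k^T x^(i)}
    P = P / repmat(sum(P, 0), P.n_rows, 1);        // column-wise softmax: p(y=k | x^(i))
                                                   // (see the max-subtraction trick later
                                                   //  in the post for numerical stability)
    mat gradient = -(groundTruth - P) * X.t() / m; // row k holds grad_{theta_k} J(theta)
    theta -= alpha * gradient;                     // theta_k := theta_k - alpha * grad_k
}
```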
When implementing softmax regression, we will typically use a modified version of the cost function described above; specifically, one that incorporates weight decay. We describe the motivation and details below.

Properties of softmax regression parameterization

Softmax regression has an unusual property that it has a “redundant” set of parameters. To explain what this means, suppose we take each of our parameter vectors $\theta_k$, and subtract some fixed vector ψ from it, so that every $\theta_k$ is now replaced with $\theta_k - \psi$ (for every $k = 1, \ldots, c$). Our hypothesis now estimates the class label probabilities as

$$p(y^{(i)}=k \mid x^{(i)};\theta) = \frac{e^{(\theta_k-\psi)^T x^{(i)}}}{\sum_{l=1}^{c} e^{(\theta_l-\psi)^T x^{(i)}}} = \frac{e^{\theta_k^T x^{(i)}}\, e^{-\psi^T x^{(i)}}}{\sum_{l=1}^{c} e^{\theta_l^T x^{(i)}}\, e^{-\psi^T x^{(i)}}} = \frac{e^{\theta_k^T x^{(i)}}}{\sum_{l=1}^{c} e^{\theta_l^T x^{(i)}}}.$$
In other words, subtracting ψ from every $\theta_k$ does not affect our hypothesis’ predictions at all! This shows that softmax regression’s parameters are “redundant.” More formally, we say that our softmax model is overparameterized, meaning that for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function $h_\theta$ mapping from inputs x to the predictions.

Further, if the cost function J(θ) is minimized by some setting of the parameters $(\theta_1, \theta_2, \ldots, \theta_c)$, then it is also minimized by $(\theta_1 - \psi, \theta_2 - \psi, \ldots, \theta_c - \psi)$ for any value of ψ. Thus, the minimizer of J(θ) is not unique. (Interestingly, J(θ) is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton’s method to run into numerical problems.)
Notice also that by setting $\psi = \theta_1$, one can always replace $\theta_1$ with $\theta_1 - \psi = \vec{0}$ (the vector of all 0’s), without affecting the hypothesis. Thus, one could “eliminate” the vector of parameters $\theta_1$ (or any other $\theta_k$, for any single value of k), without harming the representational power of our hypothesis. Indeed, rather than optimizing over the $c(n+1)$ parameters $(\theta_1, \theta_2, \ldots, \theta_c)$ (where $\theta_k \in \mathbb{R}^{n+1}$), one could instead set $\theta_1 = \vec{0}$ and optimize only with respect to the $(c-1)(n+1)$ remaining parameters, and this would work fine.
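As a concrete check of this reduction, take c = 2 and set $\psi = \theta_1$ (so $\theta_1$ becomes $\vec{0}$ and only $\theta' = \theta_2 - \theta_1$ remains to be learned). The hypothesis then collapses to the logistic regression model recalled at the beginning of these notes:

$$p(y=2 \mid x;\theta) = \frac{e^{\theta_2^T x}}{e^{\theta_1^T x} + e^{\theta_2^T x}} = \frac{1}{1 + e^{-(\theta_2-\theta_1)^T x}} = \frac{1}{1 + e^{-\theta'^T x}},$$

with $p(y=1 \mid x;\theta) = 1 - p(y=2 \mid x;\theta)$.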
In practice, however, it is often cleaner and simpler to implement the version which keeps all the parameters $(\theta_1, \theta_2, \ldots, \theta_c)$, without arbitrarily setting one of them to zero. But we will make one change to the cost function: adding weight decay. This will take care of the numerical problems associated with softmax regression’s overparameterized representation.

However, besides adding weight decay, we also subtract the maximum value here; the reason is as follows.

However, when the products $\theta_k^T x^{(i)}$ are large, the exponentials $e^{\theta_k^T x^{(i)}}$ can become very large and possibly overflow. To prevent overflow, we can simply subtract some large constant value from each of the $\theta_k^T x^{(i)}$ terms before computing the exponential. In practice, for each example, you can use the maximum of the $\theta_k^T x^{(i)}$ terms as the constant.

$$h_\theta(x^{(i)}) = \frac{1}{\sum_{k=1}^{c} e^{\theta_k^T x^{(i)}}}\begin{bmatrix} e^{\theta_1^T x^{(i)}} \\ e^{\theta_2^T x^{(i)}} \\ \vdots \\ e^{\theta_c^T x^{(i)}} \end{bmatrix} = \frac{e^{-a}}{e^{-a}\sum_{k=1}^{c} e^{\theta_k^T x^{(i)}}}\begin{bmatrix} e^{\theta_1^T x^{(i)}} \\ e^{\theta_2^T x^{(i)}} \\ \vdots \\ e^{\theta_c^T x^{(i)}} \end{bmatrix} = \frac{1}{\sum_{k=1}^{c} e^{\theta_k^T x^{(i)} - a}}\begin{bmatrix} e^{\theta_1^T x^{(i)} - a} \\ e^{\theta_2^T x^{(i)} - a} \\ \vdots \\ e^{\theta_c^T x^{(i)} - a} \end{bmatrix}$$

Here a is a row vector: its i-th entry is the largest element of column i of the score matrix, i.e. the largest of the $\theta_k^T x^{(i)}$ for example i.

Note that in the code below θ is stored as a c × n matrix and x as an n × m matrix, so the score matrix (whose (k, i) entry is $\theta_k^T x^{(i)}$) is c × m.
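An Armadillo sketch of this trick, mirroring what the cost-function code later in the post does with repmat (M here is assumed to be the c × m matrix of scores $\theta_k^T x^{(i)}$, and the function name is illustrative):

```cpp
#include <armadillo>
using namespace arma;

// Column-wise softmax with max-subtraction for numerical stability.
// M: c x m matrix whose (k, i) entry is theta_k^T x^(i).
mat stable_softmax(mat M) {
    mat a = repmat(max(M, 0), M.n_rows, 1);    // per-column maximum, copied down each column
    M = exp(M - a);                            // shifted exponentials: exponents are <= 0
    return M / repmat(sum(M, 0), M.n_rows, 1); // divide every column by its sum
}
```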

Weight Decay

We will modify the cost function by adding a weight decay term $\frac{\lambda}{2}\sum_{k=1}^{c}\sum_{j=0}^{n}\theta_{kj}^2$ which penalizes large values of the parameters. Our cost function is now

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^{c} 1\{y^{(i)}=k\}\log\frac{e^{\theta_k^T x^{(i)}}}{\sum_{l=1}^{c} e^{\theta_l^T x^{(i)}}}\right] + \frac{\lambda}{2}\sum_{k=1}^{c}\sum_{j=0}^{n}\theta_{kj}^2$$
With this weight decay term (for any λ > 0), the cost function J(θ) is now strictly convex, and is guaranteed to have a unique solution. The Hessian is now invertible, and because J(θ) is convex, algorithms such as gradient descent, L-BFGS, etc. are guaranteed to converge to the global minimum.

To apply an optimization algorithm, we also need the derivative of this new definition of J(θ). One can show that the derivative is:

$$\nabla_{\theta_k} J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Bigl[x^{(i)}\bigl(1\{y^{(i)}=k\} - p(y^{(i)}=k \mid x^{(i)};\theta)\bigr)\Bigr] + \lambda\theta_k$$
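In code, weight decay only adds one term to the cost and one to the gradient. A minimal sketch (Armadillo again; it assumes cost and grad already hold the unregularized values, and the function name is illustrative):

```cpp
#include <armadillo>
using namespace arma;

// Add the weight-decay (L2) penalty to an already-computed cost and gradient.
// theta, grad: c x (n+1) matrices; lambda: weight-decay strength.
void add_weight_decay(double& cost, mat& grad, const mat& theta, double lambda) {
    cost += 0.5 * lambda * accu(square(theta));  // (lambda/2) * sum_k sum_j theta_kj^2
    grad += lambda * theta;                      // gradient of the penalty w.r.t. theta_k
}
```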

The gradient above is obtained by differentiating J(θ); the derivation goes as follows.


Note that the derivation uses the identity

$$\frac{\partial (x^T a)}{\partial x} = \frac{\partial (a^T x)}{\partial x} = a$$

(cited from the Matrix Cookbook, Section 2.4.1).

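A sketch of that derivation, for the unregularized part of the cost (the weight-decay term simply contributes the extra $\lambda\theta_k$): writing $p_k^{(i)} = p(y^{(i)}=k \mid x^{(i)};\theta)$, the contribution of a single example to the cost is

$$\ell^{(i)}(\theta) = -\sum_{k=1}^{c} 1\{y^{(i)}=k\}\log\frac{e^{\theta_k^T x^{(i)}}}{\sum_{l=1}^{c} e^{\theta_l^T x^{(i)}}} = -\theta_{y^{(i)}}^T x^{(i)} + \log\sum_{l=1}^{c} e^{\theta_l^T x^{(i)}}.$$

Differentiating with respect to $\theta_k$, and using the identity above with $a = x^{(i)}$ (so that $\frac{\partial(\theta_k^T x^{(i)})}{\partial \theta_k} = x^{(i)}$), gives

$$\nabla_{\theta_k}\,\ell^{(i)}(\theta) = -1\{y^{(i)}=k\}\,x^{(i)} + \frac{e^{\theta_k^T x^{(i)}}}{\sum_{l=1}^{c} e^{\theta_l^T x^{(i)}}}\,x^{(i)} = -x^{(i)}\bigl(1\{y^{(i)}=k\} - p_k^{(i)}\bigr),$$

and averaging over the m examples recovers $\nabla_{\theta_k} J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\bigl(1\{y^{(i)}=k\} - p(y^{(i)}=k \mid x^{(i)};\theta)\bigr)$, plus $\lambda\theta_k$ once weight decay is added.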

Note: in particular, $\nabla_{\theta_k} J(\theta)$ is itself a vector, so that its j-th element is $\frac{\partial J(\theta)}{\partial \theta_{kj}}$, the partial derivative of J(θ) with respect to the j-th element of $\theta_k$.

By minimizing J(θ) with respect to θ, we will have a working implementation of softmax regression.

The full source code is given below.

```cpp
// Softmax regression
#include <armadillo>
#include <iostream>
#include <vector>
#include <cstdio>
#include <cmath>
#include <ctime>

using namespace arma;
using namespace std;

#define elif else if
#define MAX_ITER 1000000

double cost = 0.0;
mat grad;
double lrate = 0.1;
double lambda = 0.0;
int nclasses = 2;

// Convert a std::vector of doubles into an Armadillo column vector.
colvec vec2colvec(vector<double>& vec) {
    int length = vec.size();
    colvec A(length);
    for (int i = 0; i < length; i++) A(i) = vec[i];
    return A;
}

// Convert a std::vector of doubles into an Armadillo row vector.
rowvec vec2rowvec(vector<double>& vec) {
    colvec A = vec2colvec(vec);
    return A.t();
}

// Convert a vector of samples (outer index: sample, inner index: feature)
// into an Armadillo matrix with one sample per column.
mat vec2mat(vector<vector<double> >& vec) {
    int cols = vec.size();
    int rows = vec[0].size();
    mat A(rows, cols);
    for (int i = 0; i < cols; i++)
        for (int j = 0; j < rows; j++)
            A(j, i) = vec[i][j];
    return A;
}

// update_CostFunction_and_Gradient(x, y, weightsMatrix, lambda) -- which builds a
// zeros<mat>(nclasses, nsamples) ground-truth matrix and updates the global cost
// and grad (walked through below) -- and a prediction helper that returns, for
// each column i of a probability matrix M, the row index of its largest element
// ("if (M(j, i) > maxele) { maxele = M(j, i); which = j; } ... result(i) = which;")
// are defined here.
// ...

void softmax(vector<vector<double> >& vecX, vector<double>& vecY,
             vector<vector<double> >& testX, vector<double>& testY) {
    int nsamples = vecX.size();
    int nfeatures = vecX[0].size();
    // change vecX and vecY into matrix or vector.
    rowvec y = vec2rowvec(vecY);
    mat x = vec2mat(vecX);
    double init_epsilon = 0.12;
    mat weightsMatrix = randu<mat>(nclasses, nfeatures);
    weightsMatrix = weightsMatrix * (2 * init_epsilon) - init_epsilon;
    grad = zeros<mat>(nclasses, nfeatures);
    /*
    // Gradient Checking (remember to disable this part after you're sure the
    // cost function and dJ function are correct)
    update_CostFunction_and_Gradient(x, y, weightsMatrix, lambda);
    mat dJ(grad);
    cout << "test!!!!" << ...
    */
    int converge = 0;
    double lastcost = 0.0;
    while (converge < MAX_ITER) {
        update_CostFunction_and_Gradient(x, y, weightsMatrix, lambda);
        weightsMatrix -= lrate * grad;
        cout << "learning step: " << converge << ", Cost function value = " << cost << endl;
        if (fabs((cost - lastcost)) <= 5e-6 && converge > 0) break;
        lastcost = cost;
        ++converge;
    }
    // ... (evaluation on testX / testY)
}

int main() {
    clock_t start, end;
    FILE *streamX, *streamY;

    // read training X from .txt file
    streamX = fopen("trainX.txt", "r");
    int numofX = 30;
    vector<vector<double> > vecX;
    double tpdouble;
    int counter = 0;
    while (1) {
        if (fscanf(streamX, "%lf", &tpdouble) == EOF) break;
        if (counter / numofX >= vecX.size()) {
            vector<double> tpvec;
            vecX.push_back(tpvec);
        }
        vecX[counter / numofX].push_back(tpdouble);
        ++counter;
    }
    fclose(streamX);
    cout << vecX.size() << ", " << vecX[0].size() << endl;

    // read training Y from .txt file
    streamY = fopen("trainY.txt", "r");
    vector<double> vecY;
    while (1) {
        if (fscanf(streamY, "%lf", &tpdouble) == EOF) break;
        vecY.push_back(tpdouble);
    }
    fclose(streamY);
    // ... (a short loop over the labels: for (int i = 1; i ...)

    // read test X and test Y the same way
    streamX = fopen("testX.txt", "r");
    vector<vector<double> > vecTX;
    counter = 0;
    while (1) {
        if (fscanf(streamX, "%lf", &tpdouble) == EOF) break;
        if (counter / numofX >= vecTX.size()) {
            vector<double> tpvec;
            vecTX.push_back(tpvec);
        }
        vecTX[counter / numofX].push_back(tpdouble);
        ++counter;
    }
    fclose(streamX);

    streamY = fopen("testY.txt", "r");
    vector<double> vecTY;
    while (1) {
        if (fscanf(streamY, "%lf", &tpdouble) == EOF) break;
        vecTY.push_back(tpdouble);
    }
    fclose(streamY);

    start = clock();
    softmax(vecX, vecY, vecTX, vecTY);
    end = clock();
    cout << "Training used time: " << ((double)(end - start)) / CLOCKS_PER_SEC << " second" << endl;
    return 0;
}
```
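If you want to build this yourself, then assuming Armadillo (with a BLAS/LAPACK backend) is installed and the listing is saved as softmax.cpp, something along the lines of `g++ -O2 softmax.cpp -o softmax -larmadillo` should be enough; the program then looks for the text data files opened in main (trainX.txt, trainY.txt and the test files) in the current working directory.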

Next, let's walk through the source to see how the algorithm is implemented.

```cpp
FILE *streamX, *streamY;
streamX = fopen("trainX.txt", "r");
int numofX = 30;
vector<vector<double> > vecX;
double tpdouble;
int counter = 0;
while (1) {
    if (fscanf(streamX, "%lf", &tpdouble) == EOF) break;
    if (counter / numofX >= vecX.size()) {
        vector<double> tpvec;
        vecX.push_back(tpvec);
    }
    vecX[counter / numofX].push_back(tpdouble);
    ++counter;
}
fclose(streamX);
```

This part reads the data from trainX.txt into vecX.

Each row of trainX.txt holds 30 values and there are 284 rows in total, so the outer size of vecX is 284 and the inner size is 30; viewed as a matrix it is 284 × 30.

Similarly, let's look at the data in trainY.txt:

```cpp
// read training Y from .txt file
streamY = fopen("trainY.txt", "r");
vector<double> vecY;
while (1) {
    if (fscanf(streamY, "%lf", &tpdouble) == EOF) break;
    vecY.push_back(tpdouble);
}
fclose(streamY);
```

Each row of trainY.txt holds a single value and there are 284 rows, so viewed as a matrix it is 284 × 1.

Likewise, testX and testY have the same shapes as trainX and trainY respectively.

Next, let's look at the softmax function:

```cpp
int nsamples = vecX.size();      // 284
int nfeatures = vecX[0].size();  // 30
// change vecX and vecY into matrix or vector.
rowvec y = vec2rowvec(vecY);
mat x = vec2mat(vecX);           // x: 30 * 284
double init_epsilon = 0.12;
mat weightsMatrix = randu<mat>(nclasses, nfeatures);
// weightsMatrix: 2 * 30. This matrix holds the theta coefficients we want to learn:
// each class has its own theta vector with n components.
weightsMatrix = weightsMatrix * (2 * init_epsilon) - init_epsilon; // random initial value for theta
grad = zeros<mat>(nclasses, nfeatures);  // grad: 2 * 30
```
```cpp
int converge = 0;
double lastcost = 0.0;
while (converge < MAX_ITER) {
    update_CostFunction_and_Gradient(x, y, weightsMatrix, lambda);
    weightsMatrix -= lrate * grad;
    cout << "learning step: " << converge << ", Cost function value = " << cost << endl;
    if (fabs((cost - lastcost)) <= 5e-6 && converge > 0) break;
    lastcost = cost;
    ++converge;
}
```

Next, let's look at update_CostFunction_and_Gradient:

```cpp
void update_CostFunction_and_Gradient(mat x, rowvec y, mat weightsMatrix, double lambda) {
    int nsamples = x.n_cols;   // 284
    int nfeatures = x.n_rows;  // 30
    // calculate cost function
    mat theta(weightsMatrix);
    mat M = theta * x;  // theta: 2*30, x: 30*284, M: 2*284
    // temp: 2*284; every entry in column i equals the largest element of column i of M
    mat temp = repmat(max(M, 0), nclasses, 1);
    M = M - temp;       // the subtraction of a described above
    M = arma::exp(M);   // numerators
    temp = repmat(sum(M, 0), nclasses, 1);  // denominators
    M = M / temp;
    mat groundTruth = zeros<mat>(nclasses, nsamples);  // groundTruth: 2*284
    // ...
```
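The listing stops right after groundTruth is allocated. As a sketch of how the remaining cost and gradient computation could be completed from the formulas above (the names match the snippet, the labels in y are assumed to be 0-based class indices, and this is an illustration rather than the author's exact code):

```cpp
    // One-hot encode the labels: groundTruth(k, i) = 1 exactly when example i
    // belongs to class k (assuming 0-based labels in y).
    for (int i = 0; i < nsamples; i++) {
        groundTruth((int)y(i), i) = 1.0;
    }
    // At this point column i of M holds p(y = k | x^(i); theta).
    cost = -accu(groundTruth % arma::log(M)) / nsamples      // cross-entropy term
           + 0.5 * lambda * accu(arma::square(theta));       // weight-decay term
    grad = -(groundTruth - M) * x.t() / nsamples + lambda * theta;  // grad: 2 * 30
}
```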

After that, the program simply iterates with gradient descent as shown in the training loop above; I won't go through the remaining code in detail.
