A few words up front:
I'm polar bear, a newcomer to deep learning and also a newcomer to the CSDN community.
Recently I have been studying CS231n, a well-known computer vision course recommended by my tutor. This is my first post on CSDN. This post and the following ones will be my study notes.
The reason I write in English is that everything I have learnt about deep learning has been in English: the lectures are taught in English, the notes are in English, the tutorials are in English, etc., so I find myself more comfortable with the material in English (lol). Hopefully this will also help me improve my poor English along the way.
Training CNNs (Lecture 7)
§1 Optimizations:
After each iteration, we use an optimization method to update the parameters according to their gradients with respect to the loss, which were computed and stored during the iteration.
We have several optimization methods here.
SGD is the naive approach and comes with a set of problems; SGD+Momentum fixes most of them, which makes it worth trying. RMSProp and AdaGrad are also viable optimizers, while Adam roughly combines SGD+Momentum with RMSProp and works well on the broadest range of datasets.
Use Adam as the default. Try the others if you want; never use plain SGD.
§1.1 SGD (naive):
`W -= Lr * dW`
Problems with SGD:
1. Poorly conditioned loss: the ratio between gradient magnitudes in different dimensions can be huge (very common in high dimensions), which makes SGD zig-zag.
2. Saddle points (and local minima); saddle points, and the flat regions around them, are very common in high dimensions.
3. Noise from mini-batch gradients.
§1.2 SGD + Momentum (SGD plus):
`V = Rho * V + dW`
`W -= Lr * V`
(Rho is typically 0.9 or 0.99)
(SGD+Momentum largely fixes the problems listed above.)
A variation of it is Nesterov momentum:
`V = Rho * V - Lr * d(W + Rho * V)`
`W += V`
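To make the update concrete, here is a minimal runnable NumPy sketch of the SGD+Momentum step. The toy gradient function and the specific values of `Lr` and `Rho` are my own illustration, not from the lecture:

```python
import numpy as np

def grad(W):                      # toy gradient: for f(W) = 0.5 * ||W||^2, dW = W
    return W

W = np.array([5.0, -3.0])
V = np.zeros_like(W)
Lr, Rho = 0.1, 0.9

for _ in range(200):
    dW = grad(W)
    V = Rho * V + dW              # accumulate a running "velocity"
    W -= Lr * V                   # step along the velocity instead of the raw gradient

print(W)                          # ends up close to the minimum at [0, 0]
```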
§1.3 RMSProp:
`Grad_sq = Decay_r * Grad_sq + (1 - Decay_r) * dW * dW`
`W -= Lr * dW / (sqrt(Grad_sq) + Eps)`
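A runnable NumPy version of the RMSProp step on a toy problem; the gradient function and the hyperparameter values are illustrative choices of mine, not from the lecture:

```python
import numpy as np

def grad(W):                      # toy gradient of f(W) = 0.5 * ||W||^2
    return W

W = np.array([5.0, -3.0])
Grad_sq = np.zeros_like(W)
Lr, Decay_r, Eps = 0.01, 0.99, 1e-7

for _ in range(2000):
    dW = grad(W)
    Grad_sq = Decay_r * Grad_sq + (1 - Decay_r) * dW * dW   # running average of squared grads
    W -= Lr * dW / (np.sqrt(Grad_sq) + Eps)                 # per-dimension adaptive step size

print(W)                          # oscillates in a small neighbourhood of [0, 0]
```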
§1.4 Adam: (really good! Use it as the default)
for t in range(1, num_iters + 1):   # t must start at 1 for the bias correction
    Mu = beta1 * Mu + (1 - beta1) * dW
    Var = beta2 * Var + (1 - beta2) * dW * dW
    Mu_t = Mu / (1 - beta1 ** t)    # bias correction
    Var_t = Var / (1 - beta2 ** t)  # bias correction
    W -= Lr * Mu_t / (sqrt(Var_t) + Eps)
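For completeness, the same loop as runnable NumPy code on a toy problem. The hyperparameter values below are the commonly used defaults (beta1 = 0.9, beta2 = 0.999); the gradient function is my own toy example:

```python
import numpy as np

def grad(W):                                  # toy gradient of f(W) = 0.5 * ||W||^2
    return W

W = np.array([5.0, -3.0])
Mu, Var = np.zeros_like(W), np.zeros_like(W)
Lr, beta1, beta2, Eps = 1e-2, 0.9, 0.999, 1e-8

for t in range(1, 2001):                      # t starts at 1 for the bias correction
    dW = grad(W)
    Mu = beta1 * Mu + (1 - beta1) * dW        # first moment (momentum-like term)
    Var = beta2 * Var + (1 - beta2) * dW * dW # second moment (RMSProp-like term)
    Mu_t = Mu / (1 - beta1 ** t)
    Var_t = Var / (1 - beta2 ** t)
    W -= Lr * Mu_t / (np.sqrt(Var_t) + Eps)

print(W)                                      # close to the minimum at [0, 0]
```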
§1.5 Learning rate decay:
Step decay, exponential decay, 1/t decay, …
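A small sketch of what these schedules can look like; the constants (initial rate, drop factor, decay rate k) are illustrative values of mine:

```python
import numpy as np

lr0 = 1e-3                                    # illustrative initial learning rate

def step_decay(epoch, drop=0.5, every=10):    # multiply the rate by `drop` every `every` epochs
    return lr0 * drop ** (epoch // every)

def exponential_decay(epoch, k=0.1):          # lr = lr0 * exp(-k * epoch)
    return lr0 * np.exp(-k * epoch)

def inv_t_decay(epoch, k=0.1):                # the "1/t" decay: lr = lr0 / (1 + k * epoch)
    return lr0 / (1 + k * epoch)

print(step_decay(25), exponential_decay(25), inv_t_decay(25))
```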
§1.6 Second-order optimization:
(The Hessian matrix is too expensive to compute and invert for large networks.)
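For reference, a tiny NumPy example of the second-order (Newton) update being hinted at here. On a quadratic loss it reaches the minimum in one step, which is exactly why it is attractive; the toy Hessian `A` is my own, and for a real network the Hessian is far too large to form and invert:

```python
import numpy as np

# Toy quadratic loss f(W) = 0.5 * W^T A W: the gradient is A @ W and the Hessian is A.
A = np.array([[3.0, 0.0],
              [0.0, 0.5]])                 # toy Hessian (illustrative)
W = np.array([5.0, -3.0])

dW = A @ W                                 # gradient at W
H = A                                      # Hessian at W
W = W - np.linalg.inv(H) @ dW              # Newton update: one step to the minimum

print(W)                                   # [0., 0.]
```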
§2 Regularization
§2.0 Add a regularization term to the loss
`Loss = Data_loss + Reg * L2(W)`
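A minimal sketch of how the L2 term enters both the loss and the gradient during training; the function, the `reg` value, and the variable names are my own illustration:

```python
import numpy as np

def add_l2_regularization(data_loss, dW, W, reg=1e-4):
    loss = data_loss + reg * np.sum(W * W)   # Loss = Data_loss + Reg * L2(W)
    dW = dW + 2 * reg * W                    # the penalty also contributes to dW
    return loss, dW

W = np.random.randn(4, 4)
loss, dW = add_l2_regularization(1.25, np.zeros_like(W), W)
```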
§2.1 Model ensembles: (the models converge at different points) (Q: why does this work?)
1. Train M independent models
2. At test time, use the average of the M models' predictions.
(This typically gains about +2%.)
The M models can also be snapshots of a single model taken during one training run.
Note: the hyperparameters may differ across the models in the ensemble.
(Use a cyclic learning rate for the snapshot approach, as sketched below; this trick makes things much faster since you only train the model once.)
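A minimal sketch of both tricks. `models` is a hypothetical list of trained models that each expose a `predict_proba(X)` method, and the cosine-style cyclic schedule is one common choice for snapshot ensembles, not the only one:

```python
import numpy as np

def ensemble_predict(models, X):
    # Average the class probabilities of the M models, then take the argmax.
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return np.argmax(probs, axis=1)

def cyclic_lr(iteration, cycle_len=1000, lr_max=1e-2):
    # The learning rate repeatedly decays from lr_max towards 0 within each cycle;
    # a snapshot of the model is saved at the end of every cycle.
    t = iteration % cycle_len
    return 0.5 * lr_max * (1 + np.cos(np.pi * t / cycle_len))
```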
§2.2 Dropout
Randomly set some neurons' activations to zero.
In CNNs we usually drop entire channels instead of individual activations.
Intuition:
1. It prevents co-adaptation of features, which reduces overfitting.
2. It trains a large ensemble of models (with shared parameters) simultaneously.
In practice, we use inverted dropout, with p representing the probability that a neuron is kept.
(We want the expected output to be consistent between training and testing. To speed up testing, we do the scaling during training: divide the kept activations by p.)
`U = (np.random.rand(*Activ.shape) < p) / p`
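A runnable sketch of an inverted-dropout forward pass; the function names and shapes are my own, not from the assignment code:

```python
import numpy as np

p = 0.5                                           # probability of keeping a neuron

def dropout_train(activ, p):
    U = (np.random.rand(*activ.shape) < p) / p    # mask scaled by 1/p (inverted dropout)
    return activ * U                              # expected value matches the test-time output

def dropout_test(activ):
    return activ                                  # nothing to do at test time

x = np.random.randn(4, 10)
print(dropout_train(x, p).mean(), dropout_test(x).mean())   # roughly similar in expectation
```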
§2.3 Data Augmentation
Create more training data from the given dataset, e.g. random crops, horizontal flips, and color jitter.
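A small NumPy sketch of two of the most common augmentations, random horizontal flips and random crops; the function names and image sizes are my own:

```python
import numpy as np

def random_flip(img):                             # img has shape (H, W, C)
    return img[:, ::-1, :] if np.random.rand() < 0.5 else img

def random_crop(img, crop_h, crop_w):
    H, W, _ = img.shape
    top = np.random.randint(0, H - crop_h + 1)
    left = np.random.randint(0, W - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w, :]

img = np.random.rand(32, 32, 3)                   # a fake CIFAR-sized image
aug = random_crop(random_flip(img), 28, 28)       # shape (28, 28, 3)
```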
§2.4 Summing up:
The methods above follow a common regularization pattern:
Train: Add random noise
Test: Marginalize over noise
E.g.: dropout, batch normalization, data augmentation, DropConnect, fractional max pooling, stochastic depth.
Most of the time batch normalization alone is good enough!
If you still see overfitting, try dropout or the other methods.
§3 Transfer learning
When we don't have a big enough dataset, we can take a model pre-trained on a sufficiently large dataset, freeze all the convolutional layers, and only train the fully-connected layer(s) on the small dataset that we have. If we have a slightly bigger dataset, we can fine-tune the convolutional layers as well, but with a smaller learning rate, since we don't want to change them too much.
(The intuition is that if the tasks are similar in some way, the convolutional layers have already learned features that are also useful for discriminating between the classes in the current task.)
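A minimal PyTorch/torchvision sketch of this freeze-then-fine-tune recipe. The lecture does not prescribe a framework, and the choice of ResNet-18 and `num_classes = 10` are my own assumptions:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.resnet18(pretrained=True)          # model pre-trained on a large dataset (ImageNet)

for param in model.parameters():                  # freeze all pre-trained (convolutional) layers
    param.requires_grad = False

num_classes = 10                                  # number of classes in our small dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)   # new, trainable classification head

optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)    # only the new head is optimized

# With a slightly bigger dataset: unfreeze the conv layers and fine-tune them
# with a smaller learning rate (e.g. one tenth of the head's rate).
```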
Thanks for reading; I would be glad if this post helps you in some way.
I will be posting more notes soon; they will all be study notes from reviewing the CS231n lectures. Check them out if you like.