Convolutional Neural Networks

local receptive field

[Figure: a local receptive field on the input image]

stride length

If we have a 28×28 input image, and 5×5 local receptive fields, then there will be 24×24 neurons in the hidden layer. This is because we can only move the local receptive field 23 neurons across (or 23 neurons down), before colliding with the right-hand side (or bottom) of the input image.

So far the local receptive field has been moved by one pixel at a time. In fact, sometimes a different stride length is used. For instance, we might move the local receptive field 2 pixels to the right (or down), in which case we'd say a stride length of 2 is used.
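The resulting layer sizes are easy to check directly. The sketch below is not from the text; the function name is mine, and it simply encodes the rule that the receptive field can start at positions 0, stride, 2×stride, ... without running off the image.

```python
def hidden_layer_size(input_size, receptive_field, stride=1):
    """Neurons along one dimension of the hidden layer when a square
    local receptive field is slid across a square input image."""
    # The field may start at 0, stride, 2*stride, ... as long as it
    # does not run past the edge of the input image.
    return (input_size - receptive_field) // stride + 1

print(hidden_layer_size(28, 5, stride=1))  # 24, matching the 24x24 layer above
print(hidden_layer_size(28, 5, stride=2))  # 12
```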

Shared weights and biases

The neurons in the hidden layer use the same coefficients, so that they all detect the same feature (for example, whether there is an eye in the image; neurons at different positions simply check different regions of the image for that eye). By contrast, neurons in the same layer of an ordinary neural network usually have different coefficients, so that they detect different features. In other words, convolutional networks adapt well to the translation invariance of images: shift a picture of a cat (say) slightly, and it is still a picture of a cat.

Note: it is the different neurons within the same hidden layer that share one set of weights; the weights within that set are not all the same, and this is what lets the 5×5 receptive field produce a meaningful feature map in which different pixels are weighted differently.

We're going to use the same weights and bias for each of the 24×24 hidden neurons. In other words, for the j, kth hidden neuron, the output is:
$$\sigma\!\left(b + \sum_{l=0}^{4}\sum_{m=0}^{4} w_{l,m}\, a_{j+l,\,k+m}\right)$$

Here, σ is the neural activation function, b is the shared value for the bias, w_{l,m} is the 5×5 array of shared weights, and a_{x,y} denotes the input activation at position x, y.
This means that all the neurons in the first hidden layer detect exactly the same feature, just at different locations in the input image.

(I haven't precisely defined the notion of a feature. Informally, think of the feature detected by a hidden neuron as the kind of input pattern that will cause the neuron to activate: it might be an edge in the image, for instance, or maybe some other type of shape.)

To see why this makes sense, suppose the weights and bias are such that the hidden neuron can pick out, say, a vertical edge in a particular local receptive field. That ability is also likely to be useful at other places in the image. And so it is useful to apply the same feature detector everywhere in the image.
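To make weight sharing concrete, here is a minimal NumPy sketch (the names `feature_map`, `sigmoid`, `image`, `w`, and `b` are mine, not the book's code) that computes one 24×24 feature map from a 28×28 input by applying the same 5×5 weights and the same bias at every position, as in the equation above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map(a_in, w, b):
    """Apply one shared 5x5 weight matrix w and shared bias b at every
    position of the square input activations a_in."""
    n, f = a_in.shape[0], w.shape[0]           # e.g. 28 and 5
    out = np.zeros((n - f + 1, n - f + 1))     # 24x24 for a 28x28 input
    for j in range(out.shape[0]):
        for k in range(out.shape[1]):
            # The same w and b are reused at every (j, k): weight sharing.
            out[j, k] = sigmoid(b + np.sum(w * a_in[j:j + f, k:k + f]))
    return out

image = np.random.rand(28, 28)   # stand-in for an MNIST image
w = np.random.randn(5, 5)        # shared weights
b = 0.0                          # shared bias
print(feature_map(image, w, b).shape)   # (24, 24)
```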

feature map

A feature map is just the combination of 5×5 pixel blocks shown in the figure below, not the entire hidden layer; it reflects a mapping relationship.

For this reason, we sometimes call the map from the input layer to the hidden layer a feature map.

We call the weights defining the feature map the shared weights. And we call the bias defining the feature map in this way the shared bias. The shared weights and bias are often said to define a kernel or filter. In the literature, people sometimes use these terms in slightly different ways.

[Figure: a convolutional layer with 3 feature maps]

In the example shown, there are 3 feature maps. Each feature map is defined by a set of 5×5 shared weights, and a single shared bias. The result is that the network can detect 3 different kinds of features, with each feature being detectable across the entire image.
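With several feature maps, each map simply has its own 5×5 shared weights and its own bias. A short continuation of the sketch above (again with hypothetical names):

```python
num_maps = 3
weights = np.random.randn(num_maps, 5, 5)   # one 5x5 kernel per feature map
biases = np.random.randn(num_maps)          # one shared bias per feature map

maps = np.stack([feature_map(image, weights[i], biases[i])
                 for i in range(num_maps)])
print(maps.shape)   # (3, 24, 24): three features, each detectable everywhere
```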

One of the early convolutional networks, LeNet-5, used 6 feature maps, each associated with a 5×5 local receptive field, to recognize MNIST digits. So the example illustrated above is actually pretty close to LeNet-5. In the examples we develop later in the chapter we'll use convolutional layers with 20 and 40 feature maps. Let's take a quick peek at some of the features which are learned:

[Figure: the learned feature maps, each shown as a 5×5 block image]
Each map is represented as a 5×5 block image, corresponding to the 5×5 weights in the local receptive field. Whiter blocks mean a smaller (typically, more negative) weight, so the feature map responds less to corresponding input pixels. Darker blocks mean a larger weight, so the feature map responds more to the corresponding input pixels. Very roughly speaking, the images above show the type of features the convolutional layer responds to.

Each small pixel block represents a 5×5 local receptive field: the darker the block, the larger the shared weight for that receptive field. Each large block represents one feature map, and different feature maps detect different features.

convolution

The name convolutional comes from the fact that the operation in the equation above is sometimes known as a convolution. A little more precisely, people sometimes write that equation as $a^1 = \sigma(b + w * a^0)$, where $a^1$ denotes the set of output activations from one feature map, $a^0$ is the set of input activations, and $*$ is called a convolution operation.
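The same computation can be done with a library routine. Strictly speaking, the sum over w_{l,m} a_{j+l,k+m} is a cross-correlation (a convolution with the kernel flipped), so in SciPy it matches `correlate2d` in 'valid' mode. This is only a sketch to verify the equivalence with the loop-based `feature_map` above, not the book's own code:

```python
from scipy.signal import correlate2d

conv_out = sigmoid(b + correlate2d(image, w, mode="valid"))
loop_out = feature_map(image, w, b)
print(np.allclose(conv_out, loop_out))   # True
```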

Pooling layers

Pooling layers are usually used immediately after convolutional layers. What the pooling layers do is simplify the information in the output from the convolutional layer.

max-pooling

For instance, each unit in the pooling layer may summarize a region of (say) 2×2 neurons in the previous layer. As a concrete example, one common procedure for pooling is known as max-pooling. In max-pooling, a pooling unit simply outputs the maximum activation in the 2×2 input region, as illustrated in the following diagram:
[Figure: max-pooling applied to 2×2 regions of the convolutional layer output]
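A minimal NumPy sketch of 2×2 max-pooling over one feature map (the function name is mine, and it reuses the earlier sketch): the map is reshaped into 2×2 blocks and the maximum of each block is kept.

```python
def max_pool_2x2(fmap):
    """Downsample a feature map by taking the maximum of each 2x2 block."""
    rows, cols = fmap.shape                       # e.g. 24 x 24
    blocks = fmap.reshape(rows // 2, 2, cols // 2, 2)
    return blocks.max(axis=(1, 3))                # e.g. 12 x 12

pooled = max_pool_2x2(feature_map(image, w, b))
print(pooled.shape)   # (12, 12)
```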
parameter reduction

We can think of max-pooling as a way for the network to ask whether a given feature is found anywhere in a region of the image. It then throws away the exact positional information. The intuition is that once a feature has been found, its exact location isn’t as important as its rough location relative to other features. A big benefit is that there are many fewer pooled features, and so this helps reduce the number of parameters needed in later layers.

Before pooling, the network effectively knows the precise position of a feature; after pooling, the position is blurred. Omitting the exact positional information is what reduces the number of parameters.
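The reduction is easy to quantify for the sizes used in this section (simple arithmetic, not a quote from the text): 2×2 max-pooling shrinks each 24×24 feature map to 12×12, one quarter of the activations feeding whatever layer comes next.

```python
before = 24 * 24           # activations per feature map before pooling
after = (24 // 2) ** 2     # activations after 2x2 max-pooling
print(before, after)       # 576 144: a 4x reduction per feature map

# With 3 feature maps fully connected to 10 output neurons (as in the
# complete architecture below), the weight count shrinks accordingly:
print(3 * before * 10, 3 * after * 10)   # 17280 4320
```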

L2 pooling

In L2 pooling, instead of taking the maximum activation of a 2×2 region, we take the square root of the sum of the squares of the activations in the region.
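A matching sketch of L2 pooling (reusing NumPy and the reshape trick from the max-pooling sketch; the name is mine): the maximum is replaced by the square root of the sum of squares over each 2×2 block.

```python
def l2_pool_2x2(fmap):
    """Downsample by taking sqrt(sum of squares) over each 2x2 block."""
    rows, cols = fmap.shape
    blocks = fmap.reshape(rows // 2, 2, cols // 2, 2)
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))

print(l2_pool_2x2(feature_map(image, w, b)).shape)   # (12, 12)
```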

If you’re really trying to optimize performance, you may use validation data to compare several different approaches to pooling, and choose the approach which works best. But we’re not going to worry about that kind of detailed optimization.

complete architecture

We can now put all these ideas together to form a complete convolutional neural network. It's similar to the architecture we were just looking at, but has the addition of a layer of 10 output neurons, corresponding to the 10 possible values for MNIST digits ('0', '1', '2', etc.):

[Figure: the complete network: 28×28 input neurons, a convolutional layer of feature maps, a max-pooling layer, and 10 output neurons]

The final layer of connections in the network is a fully-connected layer. That is, this layer connects every neuron from the max-pooled layer to every one of the 10 output neurons.
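Tying the sketches together, here is a hedged shape trace of this complete architecture (3 feature maps as in the earlier example, 2×2 max-pooling, and a fully-connected layer from all pooled activations to the 10 outputs; it reuses the hypothetical helpers defined above, with a sigmoid output purely for simplicity):

```python
pooled_maps = np.stack([max_pool_2x2(feature_map(image, weights[i], biases[i]))
                        for i in range(num_maps)])   # (3, 12, 12)

flat = pooled_maps.reshape(-1)                       # 3 * 12 * 12 = 432 inputs
W_out = np.random.randn(10, flat.size)               # fully-connected weights
b_out = np.random.randn(10)                          # output biases
digits = sigmoid(W_out @ flat + b_out)               # 10 output activations
print(pooled_maps.shape, digits.shape)               # (3, 12, 12) (10,)
```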
