ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
Paszke, A., Chaurasia, A., Kim, S., & Culurciello, E. (2016). ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. ArXiv, abs/1606.02147.
# Initial block of the model:
#
#              Input
#             /     \
#     maxpool2d     conv2d-3x3
#             \     /
#           concatenate
# Upsampling bottleneck:
#
#       Bottleneck Input
#          /        \
#   conv2d-1x1    convTrans2d-1x1
#         |          |  PReLU
#         |       convTrans2d-3x3
#         |          |  PReLU
#         |       convTrans2d-1x1
#         |          |
#   maxunpool2d   Regularizer
#          \        /
#        Summing + PReLU
#
# Params:
#   projection_ratio - ratio between input and output channels
#   relu - if True, ReLU is used as the activation function; otherwise PReLU is used
# Regular|Dilated|Downsampling bottlenecks:
#
#       Bottleneck Input
#          /        \
#   maxpooling2d   conv2d-1x1
#         |          |  PReLU
#         |        conv2d-3x3
#         |          |  PReLU
#         |        conv2d-1x1
#         |          |
#     Padding2d   Regularizer
#          \        /
#        Summing + PReLU
#
# Params:
#   dilation (bool) - if True, a dilated bottleneck is created
#   down_flag (bool) - if True, a downsampling bottleneck is created
#   projection_ratio - ratio between input and output channels
#   relu - if True, ReLU is used as the activation function; otherwise PReLU is used
#   p - dropout ratio
# Asymmetric bottleneck:
#
#       Bottleneck Input
#          /        \
#         |        conv2d-1x1
#         |          |  PReLU
#         |        conv2d-1x5
#         |          |
#         |        conv2d-5x1
#         |          |  PReLU
#         |        conv2d-1x1
#         |          |
#     Padding2d   Regularizer
#          \        /
#        Summing + PReLU
#
# Params:
#   projection_ratio - ratio between input and output channels
Paper structure
-
Abstract
-
Introduction
Current image segmentation networks mostly build on the VGG16 architecture, which brings a large number of parameters and long inference times.
(Worth thinking about how Google's newly proposed EfficientNet architecture would compare.)
-
Related Work
Prior work relies on large architectures with numerous parameters.
-
Network architecture
-
Design choices
Bottleneck:
Downsampling bottleneck:
The main branch consists of three convolutional layers:
first a 2×2 projection with stride 2 for downsampling;
then the main convolution (three possibilities: regular Conv, asymmetric factorized conv, or dilated conv);
then a 1×1 convolution that expands the channels back up. Note that every convolutional layer is followed by Batch Norm and PReLU.
The side branch consists of max pooling and a padding layer:
max pooling extracts context information,
and padding fills the extra channels so the residual fusion matches; the fused output is followed by another PReLU.
Non-downsampling bottleneck:
The main branch again consists of three convolutional layers:
first a 1×1 projection;
then the main convolution (again regular, asymmetric, or dilated);
then a 1×1 expansion. Every convolutional layer is again followed by Batch Norm and PReLU.
The side branch is a plain identity mapping (only downsampling increases the channel count, so no padding layer is needed);
the fused output is followed by PReLU.
Feature map resolution: downsampling loses spatial information and hurts edge detail, yet the output must have the same resolution as the input, so aggressive downsampling demands equally strong upsampling. FCN handles this with skip connections; SegNet with pooling indices. ENet instead compresses the input first and feeds only small feature maps to the network, discarding part of the image's visual redundancy. The upside of downsampling is a larger receptive field and more context, which helps classification.
FCN's solution is to feed encoder-stage feature maps to the decoder to restore spatial information.
SegNet's solution is to keep the indices from the encoder's downsampling and reuse them for upsampling in the decoder.
ENet adopts SegNet's approach, which reduces memory requirements. To capture richer context, it also uses dilated convolutions to enlarge the receptive field. Post-processing modules such as CRFs and RNNs can be added to improve accuracy. (A sketch of index-based unpooling follows.)
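A minimal sketch of SegNet-style pooling indices in TensorFlow, for illustration only: max_pool_with_argmax records where each maximum came from, and the decoder scatters values back to those positions. The helper names are my own, and the Keras port below actually uses plain UpSampling2D rather than true max-unpooling.

import tensorflow as tf

def pool_with_indices(x):
    # Max-pool while recording the flat position of each chosen maximum.
    return tf.nn.max_pool_with_argmax(x, ksize=2, strides=2, padding='SAME',
                                      include_batch_in_index=True)

def max_unpool(pooled, indices, output_shape):
    # Scatter the pooled values back to the recorded positions; every other
    # location in the output stays zero. output_shape is the static
    # (n, h, w, c) shape of the pre-pooling tensor.
    n, h, w, c = output_shape
    flat = tf.reshape(pooled, [-1])
    idx = tf.reshape(indices, [-1, 1])
    out = tf.scatter_nd(idx, flat, [n * h * w * c])
    return tf.reshape(out, output_shape)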
-
stage
ENet is roughly divided into 5 stages:
- initial: the initialization module.
  The left branch is a 3×3/stride-2 convolution, the right branch is MaxPooling; the two results are concatenated along the channel axis. Merging this way significantly reduces memory right from the start.
- stage 1: encoder. Five bottlenecks: the first one downsamples, followed by four regular bottlenecks.
- stages 2-3: encoder. Bottleneck 2.0 downsamples; the later ones interleave dilated convolutions and factorized (asymmetric) convolutions. Stage 3 has no downsampling but is otherwise identical.
- stages 4-5: decoder. Comparatively simple: an upsampling bottleneck paired with one or two regular bottlenecks.
-
Network architecture
Early downsampling: processing high-resolution input in the early layers is computationally very expensive, so ENet's initial module shrinks the input drastically. The rationale is that visual information is highly redundant in space and can be compressed into a more efficient representation.
Decoder size: in contrast to the mirror-symmetric encoder and decoder of SegNet, ENet's encoder and decoder are asymmetric: a large encoder paired with a small decoder.
Nonlinear operations: ReLU is not necessarily helpful; applying ReLU here actually lowered accuracy. The analysis is that the network has too few layers to filter information quickly, so PReLU is used instead.
Factorizing filters: an n×n kernel is split into n×1 and 1×n kernels (as proposed in Inception V3). This effectively reduces the parameter count and enlarges the receptive field (see the sketch after this list).
Dilated convolutions: dilated convolutions effectively enlarge the receptive field and give a solid accuracy boost, raising IoU by about 4%. They are interleaved with other bottleneck types rather than used consecutively.
Regularization: the datasets are small, so the model overfits quickly. L2 weight decay worked poorly; stochastic depth did reasonably well, but since stochastic depth is a special case of Spatial Dropout, Spatial Dropout was chosen in the end and worked slightly better.
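A quick sanity check (my own numbers, not from the paper) for two of the design choices above. Factorized filters: an n×n conv costs n*n*C_in*C_out weights, while n×1 followed by 1×n costs 2*n*C_in*C_out (keeping the middle width at C_out). Dilated convolutions: a 3×3 kernel with dilation d covers an effective (2d+1)×(2d+1) window at no extra parameter cost.

n, c_in, c_out = 5, 32, 32
full = n * n * c_in * c_out          # 5x5 conv: 25,600 weights
factorized = 2 * n * c_in * c_out    # 5x1 + 1x5: 10,240 weights, 2.5x fewer
print(f'full {full} vs factorized {factorized}')

for d in (1, 2, 4, 8, 16):           # the dilation rates used in stages 2-3
    print(f'dilation {d}: effective kernel {2 * d + 1}x{2 * d + 1}')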
Regularization
-
Results
-
Conclusion
Reference: https://github.com/srihari-humbarwadi/ENet-A-Deep-Neural-Network-Architecture-for-Real-Time-Semantic-Segmentation/blob/master/batch_training.py
Code
Imports needed to run the snippets below (assuming tf.keras):

from tensorflow.keras import backend as K
from tensorflow.keras.layers import (Activation, Add, BatchNormalization, Conv2D,
                                     Conv2DTranspose, Input, MaxPooling2D, Permute,
                                     PReLU, ReLU, SpatialDropout2D, UpSampling2D,
                                     ZeroPadding2D, concatenate)
from tensorflow.keras.losses import binary_crossentropy
from tensorflow.keras.models import Model
def initial_block(tensor):
    # 13 conv filters + the 3 input channels kept by pooling = 16 channels after concat.
    conv = Conv2D(filters=13, kernel_size=(3, 3), strides=(2, 2), padding='same',
                  name='initial_block_conv', kernel_initializer='he_normal')(tensor)
    pool = MaxPooling2D(pool_size=(2, 2), name='initial_block_pool')(tensor)
    concat = concatenate([conv, pool], axis=-1, name='initial_block_concat')
    return concat
def bottleneck_encoder(tensor, nfilters, downsampling=False, dilated=False, asymmetric=False, normal=False, drate=0.1,
name=''):
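    # Note: `dilated` doubles as the dilation rate (e.g. dilated=2), so any
    # nonzero value both selects the dilated branch and sets its rate.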
y = tensor
skip = tensor
stride = 1
ksize = 1
if downsampling:
stride = 2
ksize = 2
skip = MaxPooling2D(pool_size=(2, 2), name=f'max_pool_{name}')(skip)
skip = Permute((1, 3, 2), name=f'permute_1_{name}')(skip) # (B, H, W, C) -> (B, H, C, W)
ch_pad = nfilters - K.int_shape(tensor)[-1]
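        # e.g. a 16-channel input with nfilters=64 pads 48 zero channels onto the skip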
skip = ZeroPadding2D(padding=((0, 0), (0, ch_pad)), name=f'zeropadding_{name}')(skip)
skip = Permute((1, 3, 2), name=f'permute_2_{name}')(skip) # (B, H, C, W) -> (B, H, W, C)
y = Conv2D(filters=nfilters // 4, kernel_size=(ksize, ksize), kernel_initializer='he_normal',
strides=(stride, stride), padding='same', use_bias=False, name=f'1x1_conv_{name}')(y)
y = BatchNormalization(momentum=0.1, name=f'bn_1x1_{name}')(y)
y = PReLU(shared_axes=[1, 2], name=f'prelu_1x1_{name}')(y)
if normal:
y = Conv2D(filters=nfilters // 4, kernel_size=(3, 3), kernel_initializer='he_normal', padding='same',
name=f'3x3_conv_{name}')(y)
elif asymmetric:
y = Conv2D(filters=nfilters // 4, kernel_size=(5, 1), kernel_initializer='he_normal', padding='same',
use_bias=False, name=f'5x1_conv_{name}')(y)
y = Conv2D(filters=nfilters // 4, kernel_size=(1, 5), kernel_initializer='he_normal', padding='same',
name=f'1x5_conv_{name}')(y)
elif dilated:
y = Conv2D(filters=nfilters // 4, kernel_size=(3, 3), kernel_initializer='he_normal',
dilation_rate=(dilated, dilated), padding='same', name=f'dilated_conv_{name}')(y)
y = BatchNormalization(momentum=0.1, name=f'bn_main_{name}')(y)
y = PReLU(shared_axes=[1, 2], name=f'prelu_{name}')(y)
y = Conv2D(filters=nfilters, kernel_size=(1, 1), kernel_initializer='he_normal', use_bias=False,
name=f'final_1x1_{name}')(y)
y = BatchNormalization(momentum=0.1, name=f'bn_final_{name}')(y)
y = SpatialDropout2D(rate=drate, name=f'spatial_dropout_final_{name}')(y)
y = Add(name=f'add_{name}')([y, skip])
y = PReLU(shared_axes=[1, 2], name=f'prelu_out_{name}')(y)
return y
def bottleneck_decoder(tensor, nfilters, upsampling=False, normal=False, name=''):
y = tensor
skip = tensor
if upsampling:
skip = Conv2D(filters=nfilters, kernel_size=(1, 1), kernel_initializer='he_normal', strides=(1, 1),
padding='same', use_bias=False, name=f'1x1_conv_skip_{name}')(skip)
skip = UpSampling2D(size=(2, 2), name=f'upsample_skip_{name}')(skip)
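        # Unlike the paper's max-unpooling with saved indices, this port simply
        # upsamples the skip branch with nearest-neighbor UpSampling2D.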
y = Conv2D(filters=nfilters // 4, kernel_size=(1, 1), kernel_initializer='he_normal', strides=(1, 1),
padding='same', use_bias=False, name=f'1x1_conv_{name}')(y)
y = BatchNormalization(momentum=0.1, name=f'bn_1x1_{name}')(y)
y = PReLU(shared_axes=[1, 2], name=f'prelu_1x1_{name}')(y)
if upsampling:
y = Conv2DTranspose(filters=nfilters // 4, kernel_size=(3, 3), kernel_initializer='he_normal', strides=(2, 2),
padding='same', name=f'3x3_deconv_{name}')(y)
    elif normal:
        y = Conv2D(filters=nfilters // 4, kernel_size=(3, 3), strides=(1, 1), kernel_initializer='he_normal',
                   padding='same', name=f'3x3_conv_{name}')(y)
y = BatchNormalization(momentum=0.1, name=f'bn_main_{name}')(y)
y = PReLU(shared_axes=[1, 2], name=f'prelu_{name}')(y)
y = Conv2D(filters=nfilters, kernel_size=(1, 1), kernel_initializer='he_normal', use_bias=False,
name=f'final_1x1_{name}')(y)
y = BatchNormalization(momentum=0.1, name=f'bn_final_{name}')(y)
y = Add(name=f'add_{name}')([y, skip])
y = ReLU(name=f'relu_out_{name}')(y)
return y
def ENET(input_shape=(None, None, 3), nclasses=11):
print('. . . . .Building ENet. . . . .')
img_input = Input(input_shape)
x = initial_block(img_input)
x = bottleneck_encoder(x, 64, downsampling=True, normal=True, name='1.0', drate=0.01)
    for i in range(1, 5):
        x = bottleneck_encoder(x, 64, normal=True, name=f'1.{i}', drate=0.01)
    x = bottleneck_encoder(x, 128, downsampling=True, normal=True, name='2.0')
    x = bottleneck_encoder(x, 128, normal=True, name='2.1')
    x = bottleneck_encoder(x, 128, dilated=2, name='2.2')
    x = bottleneck_encoder(x, 128, asymmetric=True, name='2.3')
    x = bottleneck_encoder(x, 128, dilated=4, name='2.4')
    x = bottleneck_encoder(x, 128, normal=True, name='2.5')
    x = bottleneck_encoder(x, 128, dilated=8, name='2.6')
    x = bottleneck_encoder(x, 128, asymmetric=True, name='2.7')
    x = bottleneck_encoder(x, 128, dilated=16, name='2.8')
    x = bottleneck_encoder(x, 128, normal=True, name='3.0')
    x = bottleneck_encoder(x, 128, dilated=2, name='3.1')
    x = bottleneck_encoder(x, 128, asymmetric=True, name='3.2')
    x = bottleneck_encoder(x, 128, dilated=4, name='3.3')
    x = bottleneck_encoder(x, 128, normal=True, name='3.4')
    x = bottleneck_encoder(x, 128, dilated=8, name='3.5')
    x = bottleneck_encoder(x, 128, asymmetric=True, name='3.6')
    x = bottleneck_encoder(x, 128, dilated=16, name='3.7')
x = bottleneck_decoder(x, 64, upsampling=True, name='4.0')
x = bottleneck_decoder(x, 64, normal=True, name='4.1')
x = bottleneck_decoder(x, 64, normal=True, name='4.2')
x = bottleneck_decoder(x, 16, upsampling=True, name='5.0')
x = bottleneck_decoder(x, 16, normal=True, name='5.1')
img_output = Conv2DTranspose(nclasses, kernel_size=(2, 2), strides=(2, 2), kernel_initializer='he_normal',
padding='same', name='image_output')(x)
img_output = Activation('softmax')(img_output)
model = Model(inputs=img_input, outputs=img_output, name='ENET')
    print('. . . . .Build Completed. . . . .')
return model
Loss definition:
def dice_coeff(y_true, y_pred):
smooth = 1.
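    # `smooth` avoids division by zero and stabilizes the score on empty masks.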
y_true_f = K.flatten(y_true)
y_pred_f = K.flatten(y_pred)
intersection = K.sum(y_true_f * y_pred_f)
score = (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
return score
def dice_loss(y_true, y_pred):
loss = 1 - dice_coeff(y_true, y_pred)
return loss
def total_loss(y_true, y_pred):
loss = binary_crossentropy(y_true, y_pred) + (3*dice_loss(y_true, y_pred))
return loss
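A minimal usage sketch (my own, not from the referenced repo): build the model and compile it with the combined loss above. The optimizer and metric here are assumptions, not the repo's settings.

model = ENET(input_shape=(512, 512, 3), nclasses=11)
model.compile(optimizer='adam', loss=total_loss, metrics=[dice_coeff])
model.summary()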