[Semantic Segmentation Series] Related work on semantic segmentation: the ENet network

2023-11-28 20:48:52

ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

Paszke, A., Chaurasia, A., Kim, S., & Culurciello, E. (2016). ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. ArXiv, abs/1606.02147.

  # Initial block of the model:
  #         Input
  #        /     \
  #       /       \
  #maxpool2d    conv2d-3x3
  #       \       /  
  #        \     /
  #      concatenate


 # Upsampling bottleneck:
  #     Bottleneck Input
  #        /        \
  #       /          \
  # conv2d-1x1     convTrans2d-1x1
  #      |             | PReLU
  #      |         convTrans2d-3x3
  #      |             | PReLU
  #      |         convTrans2d-1x1
  #      |             |
  # maxunpool2d    Regularizer
  #       \           /  
  #        \         /
  #      Summing + PReLU
  #
  #  Params: 
  #  projection_ratio - ratio between input and output channels
  #  relu - if True: ReLU is used as the activation function, else: PReLU is used



  # Regular|Dilated|Downsampling bottlenecks:
  #
  #     Bottleneck Input
  #        /        \
  #       /          \
  # maxpooling2d   conv2d-1x1
  #      |             | PReLU
  #      |         conv2d-3x3
  #      |             | PReLU
  #      |         conv2d-1x1
  #      |             |
  #  Padding2d     Regularizer
  #       \           /  
  #        \         /
  #      Summing + PReLU
  #
  # Params: 
  #  dilation (bool) - if True: creating dilation bottleneck
  #  down_flag (bool) - if True: creating downsampling bottleneck
  #  projection_ratio - ratio between input and output channels
  #  relu - if True: ReLU is used as the activation function, else: PReLU is used
  #  p - dropout ratio



  # Asymmetric bottleneck:
  #
  #     Bottleneck Input
  #        /        \
  #       /          \
  #      |         conv2d-1x1
  #      |             | PReLU
  #      |         conv2d-1x5
  #      |             |
  #      |         conv2d-5x1
  #      |             | PReLU
  #      |         conv2d-1x1
  #      |             |
  #  Padding2d     Regularizer
  #       \           /  
  #        \         /
  #      Summing + PReLU
  #
  # Params:    
  #  projection_ratio - ratio between input and output channels

Paper structure

Abstract
Introduction

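Current image segmentation networks are mostly built on the VGG16 architecture, which brings a large number of parameters and long inference times.

Google's more recent EfficientNet architecture is also worth a careful look in this context.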
Related Work

large architectures and numerous parameters
Network architecture
Design choices

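bottleneck:
Downsampling bottleneck:
The main branch has three convolutional layers:
first a 2×2 projection with stride 2, which performs the downsampling;
then the main convolution (one of three kinds: regular convolution, asymmetric (factorized) convolution, or dilated convolution);
finally a 1×1 convolution that expands the channels again.
Batch Norm and PReLU are placed between the convolutions.
The skip branch has a max pooling layer and a padding layer:
max pooling reduces the resolution to match the main branch and carries context information forward;
padding zero-pads the channels so that the two branches can be summed as a residual.
After the merge, a PReLU is applied.
Non-downsampling bottleneck:
The main branch has three convolutional layers:
first a 1×1 projection;
then the main convolution (one of the same three kinds: regular, asymmetric, or dilated);
finally a 1×1 convolution that expands the channels again.
Batch Norm and PReLU are placed between the convolutions.
The skip branch is an identity mapping (only downsampling increases the channel count, so no padding layer is needed here).
After the merge, a PReLU is applied.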
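
To make the channel bookkeeping of the bottleneck concrete, the following minimal sketch counts the weights in the main branch of a non-downsampling bottleneck and compares them with a single plain 3×3 convolution on the same number of channels. The function names are hypothetical, the reduction factor of 4 is an assumption taken from `nfilters // 4` in the code listing of this post, and biases and Batch Norm parameters are ignored.

```python
def conv_weights(k_h, k_w, c_in, c_out):
    """Number of weights in a k_h x k_w convolution, ignoring biases."""
    return k_h * k_w * c_in * c_out


def bottleneck_weights(channels, reduction=4):
    """Weights in the main branch: 1x1 reduce -> 3x3 conv -> 1x1 expand."""
    mid = channels // reduction  # assumed reduction factor, see lead-in
    reduce_part = conv_weights(1, 1, channels, mid)
    middle_part = conv_weights(3, 3, mid, mid)
    expand_part = conv_weights(1, 1, mid, channels)
    return reduce_part + middle_part + expand_part


channels = 128
plain = conv_weights(3, 3, channels, channels)
bottleneck = bottleneck_weights(channels)
print(plain, bottleneck)
```

With 128 channels, the plain 3×3 convolution has 147,456 weights while the bottleneck main branch has 17,408, which is one way to see why the design is cheap.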

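Feature map resolution: Downsampling causes a loss of spatial information, which hurts edge detail in particular. Segmentation also requires the output to have the same resolution as the input, so strong downsampling calls for equally strong upsampling. FCN handles this with skip connections and SegNet with pooling indices. ENet compresses the input early and feeds only small feature maps to the rest of the network, which removes part of the visual redundancy in the image. On the other hand, downsampling has a benefit: it gives a larger receptive field and therefore more context, which helps classification.
FCN's solution is to pass feature maps from the encoder to the decoder in order to add spatial information.
SegNet's solution is to keep the indices selected by the encoder's downsampling (max pooling) layers and reuse them for upsampling in the decoder.
ENet follows the SegNet approach, which reduces memory requirements. To gather more context, it additionally uses dilated convolutions.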
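
The pooling-indices idea can be sketched in a few lines of plain Python. This is only an illustration on a single 2D map with hypothetical helper names, not the implementation used by SegNet or ENet: max pooling records where each maximum came from, and unpooling writes the values back to those positions and leaves zeros elsewhere.

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D list; also returns argmax positions."""
    pooled, indices = [], []
    for i in range(0, len(x), 2):
        row, idx_row = [], []
        for j in range(0, len(x[0]), 2):
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in range(2) for dj in range(2)]
            value, pos = max(window, key=lambda item: item[0])
            row.append(value)
            idx_row.append(pos)
        pooled.append(row)
        indices.append(idx_row)
    return pooled, indices


def max_unpool_2x2(pooled, indices, height, width):
    """Put each pooled value back at its recorded position; zeros elsewhere."""
    out = [[0] * width for _ in range(height)]
    for row, idx_row in zip(pooled, indices):
        for value, (i, j) in zip(row, idx_row):
            out[i][j] = value
    return out


x = [[1, 3, 2, 0],
     [4, 2, 1, 5],
     [0, 1, 7, 2],
     [6, 2, 3, 1]]
pooled, indices = max_pool_2x2(x)
restored = max_unpool_2x2(pooled, indices, 4, 4)
print(pooled)
print(restored)
```

Only the positions need to be stored, not the full encoder feature map, which is where the memory saving comes from. Note that the Keras listing later in this post uses `UpSampling2D` on the skip branch rather than index-based unpooling.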

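Post-processing modules: both CRFs and RNNs can be used to improve accuracy.
Stages

The ENet model is roughly divided into an initial block followed by 5 stages:
- initial: the initial block
One branch applies a 3×3 convolution with stride 2 and the other applies max pooling; the two results are concatenated along the channel axis. This sharply reduces memory use right from the start.
- stage1: encoder. It has 5 bottlenecks: the first one downsamples, followed by 4 repeated regular bottlenecks.
- stage2-3: encoder. In stage 2, bottleneck2.0 downsamples, and the bottlenecks after it alternate between regular, dilated, and asymmetric (factorized) convolutions. Stage 3 has no downsampling bottleneck and is otherwise the same.
- stage4-5: decoder, which is comparatively simple. Each stage is one upsampling bottleneck followed by regular bottlenecks (two in stage 4, one in stage 5).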
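
As a rough aid for following the stages, this sketch traces the feature map size after each stage. The channel counts (16, 64, 128, 64, 16) are taken from the code listing in this post; the function name is hypothetical, and the input height and width are assumed to be divisible by 8.

```python
def enet_stage_shapes(height, width):
    """Return (stage, height, width, channels) after each ENet stage."""
    shapes = []
    # initial block: stride-2 conv with 13 filters, concatenated with the pooled 3-channel input
    h, w, c = height // 2, width // 2, 13 + 3
    shapes.append(("initial", h, w, c))
    # stage 1: the first bottleneck downsamples, output has 64 channels
    h, w, c = h // 2, w // 2, 64
    shapes.append(("stage1", h, w, c))
    # stage 2: the first bottleneck downsamples, output has 128 channels
    h, w, c = h // 2, w // 2, 128
    shapes.append(("stage2", h, w, c))
    # stage 3: no downsampling, same shape as stage 2
    shapes.append(("stage3", h, w, c))
    # stage 4: upsampling bottleneck, 64 channels
    h, w, c = h * 2, w * 2, 64
    shapes.append(("stage4", h, w, c))
    # stage 5: upsampling bottleneck, 16 channels
    h, w, c = h * 2, w * 2, 16
    shapes.append(("stage5", h, w, c))
    return shapes


for stage in enet_stage_shapes(512, 512):
    print(stage)
```

For a 512×512 input, the encoder works at 64×64 in stages 2 and 3, and stage 5 ends at 256×256; in the listing a final transposed convolution with stride 2 restores the full resolution.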
Network architecture

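Early downsampling: Processing high-resolution input in the early layers consumes a lot of computation, so ENet's initial block sharply reduces the input size. The reasoning is that visual information is highly redundant spatially and can be compressed into a more efficient representation.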
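
A back-of-the-envelope calculation shows why resolution matters so much for cost. The sketch below counts multiply-accumulate operations for one stride-1 convolution with 'same' padding; the function name and the layer sizes are illustrative assumptions, not figures from the paper.

```python
def conv_macs(height, width, k, c_in, c_out):
    """Multiply-accumulates of a k x k convolution, stride 1, 'same' padding."""
    return height * width * k * k * c_in * c_out


full_res = conv_macs(512, 512, 3, 16, 16)   # applied to the raw input size
eighth_res = conv_macs(64, 64, 3, 16, 16)   # same layer after 8x downsampling
print(full_res // eighth_res)
```

The same layer is 64 times cheaper at one eighth of the resolution, because the cost scales with the number of spatial positions.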

Decoder size: Unlike SegNet, whose encoder and decoder mirror each other, ENet is asymmetric: it consists of a larger encoder and a smaller decoder.

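Nonlinear operations: ReLU is not necessarily helpful. The authors found that applying ReLU lowered accuracy; their explanation is that the network has too few layers and is not deep enough to filter information quickly. PReLU is used instead.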
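
For reference, the difference between the two activations is small enough to write out directly. This is a scalar sketch with hypothetical function names; in a real network the slope `alpha` is a learned parameter, and the value 0.25 below is only an example.

```python
def relu(x):
    """ReLU: negative inputs are clipped to zero."""
    return max(0.0, x)


def prelu(x, alpha):
    """PReLU: identity for positive inputs, slope `alpha` for negative inputs."""
    return x if x > 0 else alpha * x


for value in (-2.0, 0.5):
    print(value, relu(value), prelu(value, 0.25))
```

With `alpha = 0` PReLU behaves like ReLU, and with `alpha = 1` it is the identity, so the network can learn per layer how much of the negative range to keep.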

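Factorizing filters: An n×n kernel is split into an n×1 kernel followed by a 1×n kernel (as proposed in Inception V3). This effectively reduces the number of parameters and enlarges the model's receptive field.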
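
The saving from factorization can be checked with simple arithmetic. The sketch below uses hypothetical function names, ignores biases, and assumes the intermediate feature map keeps `c_out` channels.

```python
def full_kernel_weights(n, c_in, c_out):
    """Weights of a single n x n convolution."""
    return n * n * c_in * c_out


def factorized_kernel_weights(n, c_in, c_out):
    """Weights of an n x 1 convolution followed by a 1 x n convolution."""
    return n * c_in * c_out + n * c_out * c_out


c = 32
print(full_kernel_weights(5, c, c))        # full 5x5
print(factorized_kernel_weights(5, c, c))  # 5x1 followed by 1x5
print(full_kernel_weights(3, c, c))        # full 3x3, for comparison
```

With 32 channels, the factorized 5×5 needs 10,240 weights against 25,600 for the full 5×5, which is close to the 9,216 weights of a plain 3×3 while covering a 5×5 window.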

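Dilated convolutions: Dilated convolutions give a large accuracy gain because they effectively enlarge the receptive field. Using them raised IoU by about 4 percentage points. They are interleaved with other bottlenecks rather than applied consecutively.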
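
How much a dilation rate widens a kernel follows from a standard formula: a k×k kernel with dilation d spans k + (k - 1)(d - 1) pixels per side. The sketch below applies it to the rates 2, 4, 8 and 16 that appear in the code listing of this post; the function name is hypothetical.

```python
def effective_kernel_size(k, dilation):
    """Side length covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


rates = [2, 4, 8, 16]
sizes = [effective_kernel_size(3, d) for d in rates]
print(sizes)
```

A 3×3 kernel with rate 16 spans 33 pixels per side while still using only 9 weights per channel pair.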

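Regularization: Because the datasets themselves are small, the network overfits quickly. L2 weight decay did not work well. Stochastic depth helped, but on reflection it is a special case of Spatial Dropout, so Spatial Dropout was chosen in the end and worked somewhat better.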
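
Spatial Dropout differs from ordinary dropout in that it drops whole feature maps instead of individual activations. The following is a minimal pure-Python sketch of that idea with a hypothetical function name; it uses the common inverted-dropout scaling of 1 / (1 - rate) and is not the Keras implementation.

```python
import random


def spatial_dropout(feature_maps, rate, rng):
    """Zero entire channels with probability `rate`; rescale the kept ones.

    feature_maps is a list of channels, each channel a 2D list of floats.
    """
    keep = 1.0 - rate
    out = []
    for channel in feature_maps:
        if rng.random() < rate:
            # the whole channel is dropped at once
            out.append([[0.0 for _ in row] for row in channel])
        else:
            out.append([[v / keep for v in row] for row in channel])
    return out


rng = random.Random(0)
maps = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(4)]
dropped = spatial_dropout(maps, 0.5, rng)
for channel in dropped:
    print(channel)
```

Every output channel is either all zeros or a scaled copy of its input; no channel is ever partially dropped.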

Regularization
Results
Conclusion

Reference link: https://github.com/srihari-humbarwadi/ENet-A-Deep-Neural-Network-Architecture-for-Real-Time-Semantic-Segmentation/blob/master/batch_training.py

Code

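# Imports needed by the listings below. Assumption: TensorFlow 2.x with its bundled Keras 2 API;
# the original repository may import these names from `keras` directly.
from tensorflow.keras import backend as K
from tensorflow.keras.layers import (Activation, Add, BatchNormalization, Conv2D, Conv2DTranspose,
                                     Input, MaxPooling2D, Permute, PReLU, ReLU, SpatialDropout2D,
                                     UpSampling2D, ZeroPadding2D, concatenate)
from tensorflow.keras.losses import binary_crossentropy
from tensorflow.keras.models import Model

def initial_block(tensor):
    # 13 conv filters + 3 max-pooled input channels = 16 output channels at half resolution
    conv = Conv2D(filters=13,kernel_size=(3,3),strides=(2,2),padding="same",name="initial_block_conv",kernel_initializer="he_normal")(tensor)
    pool = MaxPooling2D(pool_size=(2,2),name="initial_block_pool")(tensor)
    concat = concatenate([conv,pool],axis=-1,name="initial_block_concat")
    return concat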

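def bottleneck_encoder(tensor, nfilters, downsampling=False, dilated=False, asymmetric=False, normal=False, drate=0.1,
                       name=''):
    # `dilated` doubles as the dilation rate: pass an int such as 2, 4, 8 or 16 to select the dilated branch.
    # Callers are expected to set exactly one of `normal`, `asymmetric` and `dilated`.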
    y = tensor
    skip = tensor
    stride = 1
    ksize = 1
    if downsampling:
        stride = 2
        ksize = 2
        skip = MaxPooling2D(pool_size=(2, 2), name=f'max_pool_{name}')(skip)
        skip = Permute((1, 3, 2), name=f'permute_1_{name}')(skip)  # (B, H, W, C) -> (B, H, C, W)
        ch_pad = nfilters - K.int_shape(tensor)[-1]
        skip = ZeroPadding2D(padding=((0, 0), (0, ch_pad)), name=f'zeropadding_{name}')(skip)
        skip = Permute((1, 3, 2), name=f'permute_2_{name}')(skip)  # (B, H, C, W) -> (B, H, W, C)

    y = Conv2D(filters=nfilters // 4, kernel_size=(ksize, ksize), kernel_initializer='he_normal',
               strides=(stride, stride), padding='same', use_bias=False, name=f'1x1_conv_{name}')(y)
    y = BatchNormalization(momentum=0.1, name=f'bn_1x1_{name}')(y)
    y = PReLU(shared_axes=[1, 2], name=f'prelu_1x1_{name}')(y)

    if normal:
        y = Conv2D(filters=nfilters // 4, kernel_size=(3, 3), kernel_initializer='he_normal', padding='same',
                   name=f'3x3_conv_{name}')(y)
    elif asymmetric:
        y = Conv2D(filters=nfilters // 4, kernel_size=(5, 1), kernel_initializer='he_normal', padding='same',
                   use_bias=False, name=f'5x1_conv_{name}')(y)
        y = Conv2D(filters=nfilters // 4, kernel_size=(1, 5), kernel_initializer='he_normal', padding='same',
                   name=f'1x5_conv_{name}')(y)
    elif dilated:
        y = Conv2D(filters=nfilters // 4, kernel_size=(3, 3), kernel_initializer='he_normal',
                   dilation_rate=(dilated, dilated), padding='same', name=f'dilated_conv_{name}')(y)
    y = BatchNormalization(momentum=0.1, name=f'bn_main_{name}')(y)
    y = PReLU(shared_axes=[1, 2], name=f'prelu_{name}')(y)

    y = Conv2D(filters=nfilters, kernel_size=(1, 1), kernel_initializer='he_normal', use_bias=False,
               name=f'final_1x1_{name}')(y)
    y = BatchNormalization(momentum=0.1, name=f'bn_final_{name}')(y)
    y = SpatialDropout2D(rate=drate, name=f'spatial_dropout_final_{name}')(y)

    y = Add(name=f'add_{name}')([y, skip])
    y = PReLU(shared_axes=[1, 2], name=f'prelu_out_{name}')(y)

    return y

def bottleneck_decoder(tensor, nfilters, upsampling=False, normal=False, name=''):
    y = tensor
    skip = tensor
    if upsampling:
        skip = Conv2D(filters=nfilters, kernel_size=(1, 1), kernel_initializer='he_normal', strides=(1, 1),
                      padding='same', use_bias=False, name=f'1x1_conv_skip_{name}')(skip)
        skip = UpSampling2D(size=(2, 2), name=f'upsample_skip_{name}')(skip)

    y = Conv2D(filters=nfilters // 4, kernel_size=(1, 1), kernel_initializer='he_normal', strides=(1, 1),
               padding='same', use_bias=False, name=f'1x1_conv_{name}')(y)
    y = BatchNormalization(momentum=0.1, name=f'bn_1x1_{name}')(y)
    y = PReLU(shared_axes=[1, 2], name=f'prelu_1x1_{name}')(y)

    if upsampling:
        y = Conv2DTranspose(filters=nfilters // 4, kernel_size=(3, 3), kernel_initializer='he_normal', strides=(2, 2),
                            padding='same', name=f'3x3_deconv_{name}')(y)
    elif normal:
        # The result must be assigned back to y, otherwise the 3x3 convolution is silently skipped.
        y = Conv2D(filters=nfilters // 4, kernel_size=(3, 3), strides=(1, 1), kernel_initializer='he_normal',
                   padding='same', name=f'3x3_conv_{name}')(y)
    y = BatchNormalization(momentum=0.1, name=f'bn_main_{name}')(y)
    y = PReLU(shared_axes=[1, 2], name=f'prelu_{name}')(y)

    y = Conv2D(filters=nfilters, kernel_size=(1, 1), kernel_initializer='he_normal', use_bias=False,
               name=f'final_1x1_{name}')(y)
    y = BatchNormalization(momentum=0.1, name=f'bn_final_{name}')(y)

    y = Add(name=f'add_{name}')([y, skip])
    y = ReLU(name=f'relu_out_{name}')(y)

    return y

def ENET(input_shape=(None, None, 3), nclasses=11):
    print('. . . . .Building ENet. . . . .')
    img_input = Input(input_shape)

    x = initial_block(img_input)

    x = bottleneck_encoder(x, 64, downsampling=True, normal=True, name='1.0', drate=0.01)
    for i in range(1, 5):
        x = bottleneck_encoder(x, 64, normal=True, name=f'1.{i}', drate=0.01)

    x = bottleneck_encoder(x, 128, downsampling=True, normal=True, name=f'2.0')
    x = bottleneck_encoder(x, 128, normal=True, name=f'2.1')
    x = bottleneck_encoder(x, 128, dilated=2, name=f'2.2')
    x = bottleneck_encoder(x, 128, asymmetric=True, name=f'2.3')
    x = bottleneck_encoder(x, 128, dilated=4, name=f'2.4')
    x = bottleneck_encoder(x, 128, normal=True, name=f'2.5')
    x = bottleneck_encoder(x, 128, dilated=8, name=f'2.6')
    x = bottleneck_encoder(x, 128, asymmetric=True, name=f'2.7')
    x = bottleneck_encoder(x, 128, dilated=16, name=f'2.8')

    x = bottleneck_encoder(x, 128, normal=True, name=f'3.0')
    x = bottleneck_encoder(x, 128, dilated=2, name=f'3.1')
    x = bottleneck_encoder(x, 128, asymmetric=True, name=f'3.2')
    x = bottleneck_encoder(x, 128, dilated=4, name=f'3.3')
    x = bottleneck_encoder(x, 128, normal=True, name=f'3.4')
    x = bottleneck_encoder(x, 128, dilated=8, name=f'3.5')
    x = bottleneck_encoder(x, 128, asymmetric=True, name=f'3.6')
    x = bottleneck_encoder(x, 128, dilated=16, name=f'3.7')

    x = bottleneck_decoder(x, 64, upsampling=True, name='4.0')
    x = bottleneck_decoder(x, 64, normal=True, name='4.1')
    x = bottleneck_decoder(x, 64, normal=True, name='4.2')

    x = bottleneck_decoder(x, 16, upsampling=True, name='5.0')
    x = bottleneck_decoder(x, 16, normal=True, name='5.1')

    img_output = Conv2DTranspose(nclasses, kernel_size=(2, 2), strides=(2, 2), kernel_initializer='he_normal',
                                 padding='same', name='image_output')(x)
    img_output = Activation('softmax')(img_output)

    model = Model(inputs=img_input, outputs=img_output, name='ENET')
    print('. . . . .Build Completed. . . . .')
    return model

The loss functions used there are defined as follows

def dice_coeff(y_true, y_pred):
    smooth = 1.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    score = (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
    return score

def dice_loss(y_true, y_pred):
    loss = 1 - dice_coeff(y_true, y_pred)
    return loss

def total_loss(y_true, y_pred):
    loss = binary_crossentropy(y_true, y_pred) + (3*dice_loss(y_true, y_pred))
    return loss
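
To get a feel for the Dice coefficient defined above, the same formula can be evaluated by hand on tiny inputs. The function below is a pure-Python mirror of `dice_coeff` on flat lists; its name is hypothetical and it is meant only as a worked example, not as a replacement for the Keras version.

```python
def dice_coeff_flat(y_true, y_pred, smooth=1.0):
    """Dice coefficient on flat lists, same formula as dice_coeff above."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    return (2.0 * intersection + smooth) / (sum(y_true) + sum(y_pred) + smooth)


y_true = [1, 1, 0, 0]
perfect = dice_coeff_flat(y_true, [1, 1, 0, 0])  # (2*2 + 1) / (2 + 2 + 1)
partial = dice_coeff_flat(y_true, [1, 0, 0, 0])  # (2*1 + 1) / (2 + 1 + 1)
disjoint = dice_coeff_flat(y_true, [0, 0, 1, 1])  # (0 + 1) / (2 + 2 + 1)
print(perfect, partial, disjoint)
```

The score is 1.0 for a perfect match, 0.75 when half of the foreground is found, and 0.2 for a fully disjoint prediction; the `smooth` term keeps the ratio defined when both masks are empty, which is also why the disjoint case does not reach 0.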
