Related links:
Paper: https://arxiv.org/pdf/1905.02244.pdf
Another blogger's MobileNet V3 post (with links to PyTorch and TensorFlow implementations): https://blog.csdn.net/DL_wly/article/details/90168883
Other implementation links: MobileNet V3 TensorFlow implementation; MobileNet V3 PyTorch implementation
一、Why read this paper:
I have already read V1 and V2 of the MobileNet series and want to see what structural innovations V3 adds, and whether its small building blocks can be reused in my own architectures.
While reading this paper I will not focus on the experiments or the related-work survey; the key points are the V3 network structure, how it differs from the previous two versions, and its concrete code implementation.
二、Authors and research background:
[1] A. Howard et al., "Searching for MobileNetV3," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 1314-1324, doi: 10.1109/ICCV.2019.00140.
Once again an all-Google author team. The paper was published at ICCV 2019, and I am curious what innovations won over the reviewers. Digging further into and improving one's own existing work is genuinely challenging, and from 2017 to 2019 they delivered one paper per year, which is impressive.
三、Citation count at the time of reading
130 citations as of 2021-02-19 (from a search on IEEE Xplore)
四、Keywords
hardware-aware network architecture search (NAS)
NetAdapt algorithm
五、Detailed reading
0 Abstract
We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm and then subsequently improved through novel architecture advances.
This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small which are targeted for high and low resource use cases.
1、Introduction
This paper describes the approach we took to develop MobileNetV3 Large and Small models in order to deliver the next generation of high accuracy efficient neural network models to power on-device computer vision. The new networks push the state of the art forward and demonstrate how to blend automated search with novel architecture advances to build effective models.
2、Related work
SqueezeNet[22] extensively uses 1x1 convolutions with squeeze and expand modules primarily focusing on reducing the number of parameters. More recent works shifts the focus from reducing parameters to reducing the number of operations (MAdds) and the actual measured latency. MobileNetV1[19] employs depthwise separable convolution to substantially improve computation efficiency. MobileNetV2[39] expands on this by introducing a resource-efficient block with inverted residuals and linear bottlenecks. ShuffleNet[49] utilizes group convolution and channel shuffle operations to further reduce the MAdds. CondenseNet[21] learns group convolutions at the training stage to keep useful dense connections between layers for feature re-use. ShiftNet[46] proposes the shift operation interleaved with point-wise convolutions to replace expensive spatial convolutions.
[21] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Condensenet: An efficient densenet using learned group convolutions. CoRR, abs/1711.09224, 2017. 2
[46] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter H. Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero flop, zero parameter alternative to spatial convolutions. CoRR, abs/1711.08141, 2017. 2
3、Efficient Mobile Building Blocks
Mobile models have been built on increasingly more efficient building blocks. MobileNetV1 [19] introduced depthwise separable convolutions as an efficient replacement for traditional convolution layers. Depthwise separable convolutions effectively factorize traditional convolution by separating spatial filtering from the feature generation mechanism. Depthwise separable convolutions are defined by two separate layers: light weight depthwise convolution for spatial filtering and heavier 1x1 pointwise convolutions for feature generation.
Summary: MobileNetV1 replaces standard convolutions with depthwise separable convolutions.
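To make the factorization concrete, here is a minimal PyTorch sketch of a depthwise separable convolution; the class name, channel sizes, and BN/ReLU placement are my illustrative assumptions, not code from the paper:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv (spatial filtering) + 1x1 pointwise conv (feature generation)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # groups=in_ch makes each 3x3 filter act on a single input channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))   # spatial filtering
        x = self.act(self.bn2(self.pointwise(x)))   # channel mixing / feature generation
        return x

# Example: 32 -> 64 channels on a 112x112 feature map
x = torch.randn(1, 32, 112, 112)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 112, 112])
```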
MobileNetV2 [39] introduced the linear bottleneck and inverted residual structure in order to make even more efficient layer structures by leveraging the low rank nature of the problem. This structure is shown on Figure 3 and is defined by a 1x1 expansion convolution followed by depthwise convolutions and a 1x1 projection layer. The input and output are connected with a residual connection if and only if they have the same number of channels. This structure maintains a compact representation at the input and the output while expanding to a higher-dimensional feature space internally to increase the expressiveness of nonlinear perchannel transformations.
Summary: a recap of the V2 bottleneck block structure.
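A hedged PyTorch sketch of the V2 inverted residual block described above; the expansion ratio of 6 and the ReLU6 activations are common choices in MobileNetV2 implementations, and the class is simplified for illustration:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """1x1 expand -> 3x3 depthwise -> 1x1 linear project, with an optional skip."""
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        # the residual is used only when spatial size and channel count are preserved
        self.use_res = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),                              # expansion
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),   # depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),                              # linear bottleneck
            nn.BatchNorm2d(out_ch),                                                # no activation here
        )

    def forward(self, x):
        return x + self.block(x) if self.use_res else self.block(x)
```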
MnasNet [43] built upon the MobileNetV2 structure by introducing lightweight attention modules based on squeeze and excitation into the bottleneck structure. Note that the squeeze and excitation module are integrated in a different location than ResNet based modules proposed in [20]. The module is placed after the depthwise filters in the expansion in order for attention to be applied on the largest representation as shown on Figure 4. (Figure 4 shows the MobileNetV3 block structure.)
[43] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. Mnasnet: Platform-aware neural architecture search for mobile. CoRR, abs/1807.11626, 2018. 2, 3, 5, 6
[20] J. Hu, L. Shen, and G. Sun. Squeeze-and-Excitation Networks. ArXiv e-prints, Sept. 2017. 2, 3, 7
For MobileNetV3, we use a combination of these layers as building blocks in order to build the most effective models. Layers are also upgraded with modified swish nonlinearities [36, 13, 16]. Both squeeze and excitation as well as the swish nonlinearity use the sigmoid which can be inefficient to compute as well challenging to maintain accuracy in fixed point arithmetic so we replace this with the hard sigmoid [2, 11] as discussed in section 5.2.
PS: V3 adopts the SE module from SENet. For an explanation of SENet, see the previous post: 论文阅读|CVPR2018&CVPR2020|Squeeze-and-Excitation Networks
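A rough PyTorch sketch of the SE module as used here. The reduction ratio of 4 and the hard-sigmoid gate follow MobileNetV3's choices; the rest is a generic SE implementation and not the authors' code:

```python
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """SE attention: global average pool -> 1x1 reduce -> 1x1 expand -> channel-wise gate."""
    def __init__(self, channels, reduction=4):  # V3 squeezes to 1/4 of the expansion channels
        super().__init__()
        squeezed = max(1, channels // reduction)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, squeezed, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(squeezed, channels, 1),
            nn.Hardsigmoid(),  # V3 replaces the sigmoid gate with hard-sigmoid
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))
```

In the V3 bottleneck this module sits between the depthwise convolution and the 1x1 projection, so the attention is applied to the expanded (largest) representation, as the quoted passage explains.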
4、Network Search
Here is a well-written blog post on NAS: https://blog.csdn.net/jinzhuojun/article/details/84698471
Borrowing from that post and combining it with my own understanding, a brief introduction to NAS:
In deep learning, hyperparameter tuning has become unavoidable: how well a hand-built model performs depends to a large extent on how its hyperparameters are set.
Anyone working in deep learning knows how hard tuning is. Feature extraction, once a major pain point, has already been automated, so the new demand is to hand hyperparameter search and optimization over to the machine as well.
Hyperparameters fall into two categories:
training parameters (learning rate, batch size, and so on) and network-structure parameters (how many layers, how many filters per layer, whether to add residual or other non-sequential connections, and how these components are arranged).
The NAS framing: hyperparameter selection is an optimal-parameter search problem in a high-dimensional space. How should the search be organized?
一、Define the search space:
The definition of the search space has largely tracked the development of DNNs. Early CNNs were mostly chain-structured, so early NAS work mainly considered that structure: how many layers in total, the type of each layer, and the hyperparameters of that type. Later, with multi-branch designs such as ResNet, DenseNet, and skip connections, NAS also began to consider multi-branch topologies, which increases the degrees of freedom and allows more diverse structures to be composed. In addition, many DNNs contain repeated sub-structures (as in Inception, DenseNet, ResNet), called cells or blocks, so NAS introduced cell-based search: only the cell structure is searched, and the full network is built by stacking these cells (how to stack them can itself be learned). This reduces the degrees of freedom and hence the search space, and also gives the discovered structures better transferability across datasets. For the search model to handle network structures, they must be encoded: an architecture in the search space can be represented as a string or a vector describing its structure.
二、Design the search strategy
1. Reinforcement learning: architecture search is modeled as a Markov decision process, and an RL method (concretely, Q-learning) generates CNN architectures. The evaluation accuracy obtained after training the generated network serves as the reward.
2. Evolutionary algorithms: during the search, two models are sampled at random; the worse one is eliminated, and the better one becomes a parent (this is tournament selection). A child is produced from the parent by mutation, i.e., by randomly picking from a set of predefined structural modification operations.
3. Gradient-based methods: network structures are first embedded into a continuous space in which every point corresponds to an architecture. An accuracy-prediction function defined on this space is optimized with gradients to find the embedding of a better architecture, which is then mapped back to a concrete network structure.
三、Acceleration
NAS consumes enormous resources and is still far from affordable for ordinary practitioners.
1. Hierarchical representation: assume the whole network is built by repeating cells, so the search space shrinks to two kinds of cells (a normal cell and a reduction cell), which greatly reduces the search space.
2. Weight sharing: reuse already-trained weights as much as possible.
3. Performance prediction: to estimate a model's accuracy without spending much time training it, proxy measures are used, e.g., accuracy when trained on a small subset of data, at low resolution, or for only a few epochs.
Since I will not use NAS in my own projects, a basic grasp of the concepts is enough here. Next, the innovations of MobileNetV3 in network structure design.
5、Network Improvements
5.1 Redesigning Expensive Layers
Through NAS, the authors found that the layers at the very beginning and the very end of the network require more computation than the rest.
The first modification reworks how the last few layers of the network interact in order to produce the final features more efficiently. Current models based on MobileNetV2’s inverted bottleneck structure and variants use 1x1 convolution as a final layer in order to expand to a higher-dimensional feature space. This layer is critically important in order to have rich features for prediction. However, this comes at a cost of extra latency.
In MobileNetV2, there is a 1x1 convolution before the average pooling; it expands the feature map to a higher dimension but adds a noticeable amount of computation.
To reduce latency and preserve the high dimensional features, we move this layer past the final average pooling. This final set of features is now computed at 1x1 spatial resolution instead of 7x7 spatial resolution. The outcome of this design choice is that the computation of the features becomes nearly free in terms of computation and latency.
Move this 1x1 layer to after the final average pooling. The last set of features is then computed on a 1x1 map instead of a 7x7 map.
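A hedged sketch of this "efficient last stage" idea, assuming 7x7x160 features entering the head; the 160/960/1280/1000 channel counts follow the common MobileNetV3-Large head, and the layer arrangement here is my simplification, not the authors' code:

```python
import torch
import torch.nn as nn

# 1280-d feature layer moved after global average pooling, so it runs on a 1x1
# map instead of a 7x7 map (~49x fewer multiply-adds for that layer).
efficient_head = nn.Sequential(
    nn.Conv2d(160, 960, 1, bias=False),
    nn.BatchNorm2d(960),
    nn.Hardswish(),
    nn.AdaptiveAvgPool2d(1),      # 7x7x960 -> 1x1x960
    nn.Conv2d(960, 1280, 1),      # now computed at 1x1 spatial resolution
    nn.Hardswish(),
    nn.Conv2d(1280, 1000, 1),     # classifier
)

x = torch.randn(1, 160, 7, 7)
print(efficient_head(x).flatten(1).shape)  # torch.Size([1, 1000])
```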
We were able to reduce the number of filters to 16 while maintaining the same accuracy as 32 filters using either ReLU or swish. This saves an additional 2 milliseconds and 10 million MAdds.
Modify the number of channels in the stem convolution: MobileNetV2 uses 32 filters of 3x3 here; the authors found that 32 can be reduced further, so it becomes 16.
5.2 Nonlinearities
Replace swish with h-swish.
In [36, 13, 16] a nonlinearity called swish was introduced that when used as a drop-in replacement for ReLU, that significantly improves the accuracy of neural networks.
While this nonlinearity improves accuracy, it comes with non-zero cost in embedded environments as the sigmoid function is much more expensive to compute on mobile devices. We deal with this problem in two ways.
swish improves accuracy but brings extra computation.
We replace sigmoid function with its piece-wise linear hard analog: ReLU6(x+3)/6. The minor difference is we use ReLU6 rather than a custom clipping constant.
The authors use ReLU6(x+3)/6 to approximate the sigmoid; plotting the two curves shows they differ very little. Building on ReLU6 has several advantages: 1. it can be computed on virtually any software and hardware platform; 2. under quantization it removes the potential loss of numerical precision. Replacing swish with h-swish improves efficiency by roughly 15% in quantized mode, and the benefit of h-swish is more pronounced in the deeper layers of the network. (This passage is adapted from https://blog.csdn.net/Chunfengyanyulove/article/details/91358187)
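The formulas are easy to express directly; here is a small sketch (recent PyTorch versions also ship nn.Hardsigmoid / nn.Hardswish with the same definitions):

```python
import torch
import torch.nn.functional as F

def hard_sigmoid(x):
    # piece-wise linear approximation of sigmoid: ReLU6(x + 3) / 6
    return F.relu6(x + 3.0) / 6.0

def hard_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6, approximating swish(x) = x * sigmoid(x)
    return x * hard_sigmoid(x)

x = torch.linspace(-6, 6, 7)
print(hard_swish(x))  # close to x * torch.sigmoid(x) away from the clipping points
```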
[36] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. CoRR, abs/1710.05941, 2017.
[13] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoidweighted linear units for neural network function approximation in reinforcement learning. CoRR, abs/1702.03118, 2017.
[16] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415, 2016.
6、Experiments
6.1 Classification
ImageNet is used as the dataset to evaluate classification performance.
Training setup:
| Setting | Value |
| --- | --- |
| Hardware | 4x4 TPU Pod |
| Optimizer | RMSPropOptimizer |
| Momentum | 0.9 |
| Initial learning rate | 0.1 |
| Batch size | 4096 (that is huge!) |
| Learning rate decay rate | 0.01 every 3 epochs |
| Dropout | 0.8 |
| Regularization | L2 |
| Weight decay | 1e-5 |
| Batch-normalization moving-average decay | 0.99 |
| Exponential moving average of weights, decay | 0.9999 |
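For reference, a hedged PyTorch sketch of these settings. The authors trained with TensorFlow on TPUs, so the placeholder model, the StepLR interpretation of "0.01 every 3 epochs", and the omitted pieces (dropout 0.8, BN decay, weight EMA) are my assumptions, not their training code:

```python
import torch
from torch import nn, optim

# Placeholder model; substitute a real MobileNetV3-Large/Small implementation.
model = nn.Linear(10, 10)

optimizer = optim.RMSprop(
    model.parameters(),
    lr=0.1,             # initial learning rate
    momentum=0.9,
    weight_decay=1e-5,  # L2 regularization
)
# "decay rate 0.01 every 3 epochs" taken literally; the exact semantics of the
# authors' TensorFlow schedule may differ.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.01)
# Dropout 0.8, BN moving-average decay 0.99, and a weight EMA with decay 0.9999
# would be configured in the model / training loop and are omitted here.
```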
六、Afterthoughts
Structurally, this does not feel like a big leap over V1 and V2. The paper's title is "Searching for MobileNetV3", which highlights the importance of the NAS techniques it describes; however, that NAS pipeline is hard to reproduce in practice, and the experiments do not analyze this part any further. There is not much in the architecture to borrow directly, but the SE module and h-swish are worth trying in my own networks later.
The authors' line of improvement: first pin down which parts of the network can be made leaner, e.g., where the number of filters can be reduced without hurting accuracy.