https://arxiv.org/format/2010.04879
Table of Contents
- abstract
- Introduction
- Background and Related Works
- Multi-Dimension Pruning
- Model Scaling
- Algorithm
- Experiments
abstract
To deploy a pre-trained deep CNN on resource-constrained mobile devices, neural network pruning is often used to cut down the model’s computational cost.
For example, filter-level pruning (reducing the model’s width) or layer-level pruning (reducing the model’s depth) can both save computations with some sacrifice of accuracy. Besides, reducing the resolution of input images can also reach the same goal.
Most previous methods focus on reducing one or two of these dimensions (i.e., depth, width, and image resolution) for acceleration. However, excessive reduction of any single dimension will lead to unacceptable accuracy loss, and we have to prune these three dimensions comprehensively to yield the best result.
In this paper, a simple yet effective pruning framework is proposed to comprehensively consider these three dimensions.
Our framework falls into two steps: \textit{1) Determining the optimal depth ($d^\star$), width ($w^\star$), and image resolution ($r^\star$) for the model. 2) Pruning the model in terms of $(d^\star, w^\star, r^\star)$.} Specifically, in the first step, we formulate model acceleration as an optimization problem. It takes depth ($d$), width ($w$), and image resolution ($r$) as variables, and the model's accuracy as the optimization objective. Although it is hard to determine the expression of the objective function, approximating it with polynomials is still feasible, during which several properties of the objective function are utilized to ease and speed up the fitting process. Then the optimal $d^\star$, $w^\star$, and $r^\star$ are attained by maximizing the objective function with the Lagrange multiplier theorem and KKT conditions.
Extensive experiments are done on several popular architectures and datasets. The results show that we have outperformed state-of-the-art pruning methods. The code will be published soon.
Introduction
CNNs have achieved great success in many computer vision tasks such as image classification~\cite{DBLP:conf/cvpr/HeZRS16,DBLP:conf/cvpr/HuangLMW17}, object detection~\cite{DBLP:conf/eccv/LiuAESRFB16,DBLP:conf/cvpr/RedmonDGF16}, and semantic segmentation~\cite{DBLP:journals/corr/ChenPSA17,DBLP:conf/iccv/HeGDG17}. However, the enormous computational cost of CNNs prevents them from being deployed on resource-constrained mobile devices such as smartphones. This has given rise to model acceleration~\cite{he2019filter,DBLP:conf/aaai/ZhouFCBZG18,DBLP:conf/nips/BaC14,DBLP:conf/ijcai/HeKDFY18}, a research field that focuses on reducing the computational cost of models while preserving their performance for mobile deployment.
Neural network pruning is one of the most popular methods for model acceleration. It prunes redundant components of CNNs to cut down unessential computations. For example, filter-level pruning~\cite{he2019filter,DBLP:conf/iclr/MolchanovTKAK17,wang2019cop,DBLP:journals/pami/LuoZZXWL19,DBLP:conf/iccv/LiuLSHYZ17} removes redundant filters and reduces the width of CNNs; layer-level pruning~\cite{DBLP:journals/corr/abs-1912-10178} removes unimportant layers and reduces the depth of CNNs. Besides, some methods~\cite{DBLP:journals/corr/HowardZCKWWAA17}, though less common, resize input images to a lower resolution to reduce computation\footnote{Though images are actually resized for acceleration, we will use the term ``pruning images'' for simplicity in the following.}. Each of these three kinds of methods focuses on one dimension (i.e., depth, width, or image resolution) that determines the computational cost of a model.
Naturally, we raise a question: given a pre-trained model, which dimension should we prune to minimize the model's accuracy loss? In practice, users empirically choose the most redundant dimension to prune, which often leads to a sub-optimal pruned model because of an inappropriate dimension choice. Worse, excessive pruning of any of these three dimensions will lead to unacceptable loss, as shown in Fig.~\ref{fig:intro}. Instead, comprehensively pruning these dimensions yields better results. However, there has not been any effective method to achieve this. Though some previous methods can reduce the depth and width of the model simultaneously, the same methods are not applicable when considering image resolution\footnote{Please see Background and Related Works for the reason}. Motivated by~\cite{DBLP:conf/icml/TanL19}, the optimal values for these dimensions can also be found in a brute-force manner, i.e., grid search. However, grid search is costly (over 500 GPU\footnote{GTX 1080Ti} hours for pruning ResNet for CIFAR-10) and often sub-optimal (if grid size $> 1$).
In this paper, we propose a pruning framework that prunes three dimensions comprehensively with acceptable time-consumption. Our framework falls into two steps:
\paragraph{Step 1: Determining Optimal $d^\star, w^\star, r^\star$.} Given a pre-trained model, we formulate model acceleration as an optimization problem: it takes depth ($d$), width ($w$), and resolution ($r$) as variables and maximizes the model's accuracy ($a$) under the constraint of the computational cost; formally:
\begin{equation}
\begin{aligned}
\max_{d, w, r}\ a=f(d,w,r),\ s.t.\ g(d,w,r) \le \tau,
\end{aligned}
\end{equation}
where $f$ is a model accuracy predictor (MAP), $g$ represents the computational cost of the model, and $\tau$ is a constant threshold. ~\cite{DBLP:conf/icml/TanL19} has provided a reasonable form of $g$. Though the expression of the MAP is unknown, it can be approximated with polynomials according to Taylor's theorem, i.e., taking $N$ ($N \sim 10^2$) groups of $(d_i, w_i, r_i, a_i), i \in [1, N]$ and fitting the MAP with polynomial regression. Then the optimal $d^\star, w^\star, r^\star$ can be attained by maximizing the MAP with the Lagrange multiplier theorem and KKT conditions. However, attaining $(d_i, w_i, r_i, a_i)$ is a time-consuming process of training hundreds of models. To save time and cost, we speed up the process from two aspects: \textbf{1)} We take some prior information of the MAP (e.g., non-negativity, monotonicity, separability, etc.) as restrictions when fitting it. This prior information restricts the feasible region of solutions, allowing us to approximate the MAP accurately with few groups of data ($N=13$). \textbf{2)} Given a pre-trained model and its configuration $(d_0, w_0, r_0, a_0)$, we can prune the model and fine-tune it to get other groups of $(d_i, w_i, r_i, a_i)$ quickly. Generally, getting a group of $(d_i, w_i, r_i, a_i)$ by pruning and fine-tuning takes only about $\frac{1}{4}$ as much time as training a new model from scratch.
\paragraph{Step 2: Pruning and Fine-tuning with $d^\star, w^\star, r^\star$.} We prune the model to the target depth ($d^\star$) and width ($w^\star$) with layer-level and filter-level pruning algorithms. Then the model is fine-tuned with images of size $r^\star$. At this step, the simplest magnitude-based filter-level pruning~\cite{DBLP:conf/iccv/LiuLSHYZ17} and discrimination-based layer-level pruning~\cite{DBLP:journals/corr/abs-1912-10178} are used for simplicity, though other pruning algorithms are still applicable.
It is worth highlighting our contributions as follows:
\begin{itemize}
\item We propose an effective pruning framework that maximizes the pruned model’s accuracy by comprehensively pruning three dimensions (depth, width, and image resolution) with acceptable time-consumption.
\item We explore the relationships between depth, width, image resolution, and model’s accuracy. Further, an effective method is proposed to fit their relationships in a fast way without training too many models.
\item Extensive experiments are done on different architectures and datasets. The results show that our algorithm performs better than state-of-the-art filter-level pruning and layer-level pruning algorithms.
\end{itemize}
Background and Related Works
Neural Network Pruning
Neural network pruning compresses CNNs by pruning unimportant weights, filters, or layers in models, and the model after pruning is fine-tuned on the dataset to recover its accuracy. Most neural network pruning methods fall into three categories: weight-level, filter-level, and layer-level pruning. Weight-level pruning~\cite{DBLP:journals/corr/HanMD15} has been receiving less attention because it needs specific libraries for sparse matrices (e.g., cuSPARSE) to accelerate inference, and the support for these libraries on mobile devices is limited. Instead, the most dominant pruning method, filter-level pruning~\cite{he2019filter,DBLP:conf/ijcai/HeKDFY18,wang2019cop}, compresses models by removing unimportant filters in CNNs, which directly reduces the number of parameters and computations on all devices. Most research focuses on how to find unimportant filters. For example, magnitude-based methods~\cite{DBLP:conf/iccv/LiuLSHYZ17,DBLP:conf/iclr/0022KDSG17,DBLP:conf/ijcai/HeKDFY18,DBLP:conf/iclr/MolchanovTKAK17} take the magnitude of weights or feature maps from some layers as the importance criterion and prune those with small magnitude. Other methods identify unimportant filters through their geometric median~\cite{he2019filter}, correlation~\cite{wang2019cop}, reconstruction loss~\cite{DBLP:journals/pami/LuoZZXWL19}, Taylor expansion~\cite{DBLP:conf/iclr/MolchanovTKAK17}, and so on.
Multi-Dimension Pruning
To the best of our knowledge, there are two methods~\cite{DBLP:conf/cvpr/LinJYZCYHD19,DBLP:conf/nips/WenWWCL16} that prune models at both the filter level and the layer level. Both of them train base models with extra regularization terms to induce sparsity in the models. Then the units (filters or layers) with high sparsity are pruned with slight loss. However, the same method cannot be used for balancing image size, because images do not contain trainable parameters and there is no way to induce sparsity in images.
Model Scaling
Model scaling is a task for neural network architecture search (NAS), which scales up a searched small network for higher accuracy. EfficientNet~\cite{DBLP:conf/icml/TanL19} proposes to scale up three dimensions (model's depth, model's width, and image size) simultaneously for best performance. Further, a grid-search-like method is proposed in that paper. Though the grid search method can also be used for pruning, it either takes a very long time to get the pruned model or yields a sub-optimal result (if grid size $> 1$).
Algorithm
Definitions and Notations
We define the depth of a model $M$ as the number of basic blocks (e.g., Conv-BN-ReLU blocks, residual blocks, etc.) that $M$ contains. For example, ResNet32 contains 15 residual blocks, so its depth is 15 ($\text{depth}(M)=15$). The width represents the number of filters of a certain layer $l$ ($\text{width}(M[l])$), and $\text{resolution}(M)$ indicates the resolution of $M$'s input images. For simplicity, we define $(d, w, r)$ of a model ($M_i$) as the normalized depth, width, and image resolution:
\newcommand{\myfont}{\fontsize{8.5pt}{\baselineskip}\selectfont}
\begin{myfont}
\begin{equation}
\label{equ:def}
d_i=\frac{\text{depth}(M_i)}{\text{depth}(M_0)},\ w_i=\frac{\text{width}(M_i[l])}{\text{width}(M_0[l])},\ r_i = \frac{\text{resolution}(M_i)}{\text{resolution}(M_0)},
\end{equation}
\end{myfont}
where $M_0$ is the baseline model. For filter pruning, we restrict all layers to be pruned with the same ratio, so the $w_i$ of a model does not depend on the choice of $l$. Further, for a pruning task, it follows naturally that $d_i, w_i, r_i \in (0, 1]$ and $d_0 = w_0 = r_0 = 1$.
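As a minimal sketch of the normalization defined above, the following Python snippet computes $(d_i, w_i, r_i)$ from raw counts. The `ModelSpec` record, its field names, and all numbers except the depth of 15 for ResNet32 are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical summary of a model: block count, filters in one reference layer, input size."""
    depth: int
    width: int
    resolution: int


def normalize(model: ModelSpec, base: ModelSpec) -> Tuple[float, float, float]:
    """Return the normalized (d, w, r) of `model` relative to the baseline `base`."""
    return (model.depth / base.depth,
            model.width / base.width,
            model.resolution / base.resolution)


# ResNet32-like baseline: depth 15 comes from the text, width and resolution are assumed.
base = ModelSpec(depth=15, width=64, resolution=32)
# Illustrative pruned configuration (made-up numbers).
pruned = ModelSpec(depth=12, width=48, resolution=28)

d, w, r = normalize(pruned, base)
print(d, w, r)
```

Because every layer is pruned with the same ratio, a single reference layer is enough to define the width ratio in this sketch.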
Overview
We propose a comprehensive pruning framework that prunes the model along three dimensions (i.e., depth, width, and image resolution) simultaneously. Our framework falls into two steps: given a pre-trained model, in the first step we determine the optimal $(d^\star, w^\star, r^\star)$ by solving an optimization problem. In the second step, the model is pruned to the optimal $(d^\star, w^\star)$ and fine-tuned with images of size $r^\star$. The overall pipeline of our algorithm is summarized in Fig.~\ref{fig:alg}.
Model Acceleration as Optimization
Given a CNN architecture, the model’s depth, width, and image resolution are three key aspects that affect both the model’s accuracy and its computational cost. Thus, we formulate model acceleration as the following problem:
\begin{equation}
\label{equ:method1}
\begin{aligned}
d^\star, w^\star, r^\star =& \mathop{\arg\max}\limits_{d, w, r} f(d, w, r;\Omega) \\
& s.t.\ g(d, w, r) \le T \times g(d_0, w_0, r_0), % 0\le d \le d_0; 0 \le w \le w, 0 \le r
\end{aligned}
\end{equation}
where $f$ is a \textbf{model accuracy predictor (MAP)}, which takes depth, width, and resolution as input and outputs the model's accuracy. The MAP's parameters, $\Omega$, differ for different architectures or datasets; ($d_0, w_0, r_0$) is the base model's configuration. $g$ represents the computational cost (i.e., FLOPs) of a model. $T \in (0, 1)$ is a manually set parameter, meaning that the pruned model's computational cost is a fraction $T$ of the original model's. Motivated by \cite{DBLP:conf/icml/TanL19}, which observes that the computational cost of a model is proportional to $d$, $w^2$, and $r^2$ ($g \propto d w^2 r^2$), we re-define the computational cost constraint as follows:
\begin{equation}
\label{equ:method1_11}
\begin{aligned}
dw^2r^2 \le T \times d_0w_0^2r_0^2.
\end{aligned}
\end{equation}
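To make the cost constraint concrete, here is a small sketch that evaluates the relative cost of a configuration under the proportionality $g \propto d w^2 r^2$ with $d_0 = w_0 = r_0 = 1$. The function names and the sample configuration are assumptions for illustration only.

```python
def relative_cost(d: float, w: float, r: float) -> float:
    """Cost relative to the baseline (d0 = w0 = r0 = 1), assuming g is proportional to d * w^2 * r^2."""
    return d * w ** 2 * r ** 2


def satisfies_budget(d: float, w: float, r: float, T: float) -> bool:
    """True if the configuration uses at most a fraction T of the baseline cost."""
    return relative_cost(d, w, r) <= T


# Illustrative configuration: mild pruning along all three dimensions.
print(relative_cost(0.8, 0.75, 0.875))
print(satisfies_budget(0.8, 0.75, 0.875, T=0.5))
```

Under this cost model, width and resolution enter quadratically, so a given reduction in either saves more computation than the same reduction in depth.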
The optimal $d^\star, w^\star, r^\star$ can be found with the Lagrange multiplier theorem and KKT conditions once the MAP's expression is known. Though it is hard to derive the exact expression of the MAP, approximating it with polynomials is still feasible according to Taylor's theorem. Specifically, we can train $N$ models with different $(d, w, r)$, obtain their accuracy ($a$), and take these groups of $(d_i, w_i, r_i, a_i), i \in [1, N]$ to fit the MAP with polynomials. Generally speaking, the polynomials fit better as $N$ increases. However, training so many models is time-consuming. Thus, in the next section we propose a fast MAP fitting method to complete this step quickly.
Fast MAP Fitting
We provide two basic ideas to speed up the fitting process: 1) Provide some prior information about the MAP's properties so that we can fit it well even with few data points (a smaller $N$). 2) Accelerate the process of fetching each data point (i.e., $(d_i, w_i, r_i, a_i)$) by taking advantage of pruning and fine-tuning. The details are as follows:
According to its definition, some of the MAP's properties are evident:
\paragraph{Non-negativity \& Boundedness.} The accuracy of the model ranges from zero to one, i.e., $0 \le f(d, w, r) \le 1$.
\paragraph{Monotonicity.} Models with smaller images or width or depth yield smaller accuracy\footnote{Degradation problem~\cite{DBLP:conf/cvpr/HeZRS16} is not considered for simplicity because nowadays models have solved this problem to some extent}, i.e.,
\begin{equation}
\label{equ:inequal}
\begin{aligned}
d_1 \le d_2 &\iff f(d_1, w_1, r_1)\le f(d_2, w_1, r_1) \\
w_1 \le w_2 &\iff f(d_1, w_1, r_1) \le f(d_1, w_2, r_1) \\
r_1 \le r_2 &\iff f(d_1, w_1, r_1) \le f(d_1, w_1, r_2) \\
%d_0 \le d_1 &\iff h_1(d_0) \le h_1(d_1) \\
%w_0 \le w_1 &\iff h_2(w_0) \le h_2(w_1) \\
%r_0 \le r_1 &\iff h_3(r_0) \le h_3(r_1) \\
\end{aligned}
\end{equation}
Further, we design experiments in which two dimensions (e.g., depth and resolution) are frozen and explore how the accuracy varies with the remaining dimension (e.g., width). As a result, we find that the influences of these three dimensions on accuracy are independent of each other. Formally,
\begin{equation}
\label{equ:method1_1}
\begin{aligned}
\frac{f(d_2, w_1, r_1)}{f(d_2, w_2, r_1)} = \frac{f(d_2, w_1, r_2)}{f(d_2, w_2, r_2)} \\
\frac{f(d_1, w_1, r_1)}{f(d_2, w_1, r_1)} = \frac{f(d_1, w_1, r_2)}{f(d_2, w_1, r_2)} \\
\frac{f(d_1, w_1, r_2 )}{f(d_2, w_1, r_2)} = \frac{f(d_1, w_2, r_2)}{f(d_2, w_2, r_2)} \\
\end{aligned}
\end{equation}
holds for all $\{(d, w, r) \mid d, w, r \ne 0\}$, which leads us to another property of $f(d, w, r)$\footnote{Please see our supplementary materials for experiments and derivation of all properties.}:
\paragraph{Separability.} The three variables in $f(d, w, r)$ are separable, i.e., $f(d, w, r)$ can be expressed in the form of:
\begin{equation}
\label{equ:method2}
\begin{aligned}
f(d, w, r) = h_1(d) \times h_2(w) \times h_3(r).
\end{aligned}
\end{equation}
Thanks to its separability, we can construct a complex $f(d, w, r)$ by stacking unary polynomials $h_i(x;\Omega_i), i \in \{1, 2, 3\}$. Combined with the MAP's non-negativity and boundedness, we restrict $0 \le h_i(x) \le 1, i \in \{1, 2, 3\}$ without loss of generality. Moreover, it is easy to deduce that $h_i(x), i \in \{1, 2, 3\}$ are also monotonically increasing.
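The ratio identities that motivate separability can be checked numerically on a small grid of configurations. The sketch below is one possible check, not the authors' experiment: the two toy accuracy functions, the grid values, and the tolerance are all assumptions.

```python
import itertools


def ratio_invariant(f, ds, ws, rs, tol=1e-9):
    """True if the three ratio identities hold for every pair of values on the grid."""
    pairs = itertools.product(itertools.combinations(ds, 2),
                              itertools.combinations(ws, 2),
                              itertools.combinations(rs, 2))
    for (d1, d2), (w1, w2), (r1, r2) in pairs:
        checks = [
            (f(d2, w1, r1) / f(d2, w2, r1), f(d2, w1, r2) / f(d2, w2, r2)),
            (f(d1, w1, r1) / f(d2, w1, r1), f(d1, w1, r2) / f(d2, w1, r2)),
            (f(d1, w1, r2) / f(d2, w1, r2), f(d1, w2, r2) / f(d2, w2, r2)),
        ]
        if any(abs(lhs - rhs) > tol for lhs, rhs in checks):
            return False
    return True


def separable(d, w, r):
    """Toy accuracy function of the form h1(d) * h2(w) * h3(r)."""
    return (0.5 + 0.5 * d) * (0.4 + 0.6 * w) * (0.7 + 0.3 * r)


def coupled(d, w, r):
    """Toy accuracy function that is not a product of unary terms."""
    return 0.2 + 0.8 * d * w * r


grid = [0.25, 0.5, 1.0]
print(ratio_invariant(separable, grid, grid, grid))
print(ratio_invariant(coupled, grid, grid, grid))
```

With measured accuracies the identities will only hold approximately, so a looser tolerance would be needed in practice.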
With this prior information, the MAP’s fitting is re-formulated as:
\begin{equation}
\label{equ:re-form}
\begin{aligned}
\mathop{\arg\min}\limits_{\Omega}\sum_{n=1}^N(h_1(d_n) \times h_2(w_n) \times h_3(&r_n)-a_n)^2 \\
s.t.~0 \le h_i(x) \le 1,\ i \in \{1, 2, 3\},\ &x \in [0, 1] \\
h_i'(x) \ge 0,\ i \in \{1, 2, 3\},\ &x \in [0, 1], \\
\end{aligned}
\end{equation}
where h i ( x ) h_i(x) hi(x) is a unary polynomial:
\begin{equation}
\label{equ:re-form1}
\begin{aligned}
h_i(x) = \sum_{j=0}^{k}\Omega_{ij}x^j,\ \Omega_{ij} \in \mathbb{R} \\
\end{aligned}
\end{equation}
and $k$ is the degree of $h_i$. The restrictions (prior information about the MAP's properties) in Eq.~\ref{equ:re-form} shrink the feasible region of solutions, allowing us to fit the MAP with fewer data points (a smaller $N$). Eq.~\ref{equ:re-form} can be transformed into a semi-definite programming problem and solved easily.
\paragraph{Pruning for Fast MAP Fitting.} We also take advantage of iterative pruning and fine-tuning to speed up fetching each group $(d_i, w_i, r_i, a_i)$. Precisely, given a pre-trained model $M_0$ and its configuration $(d_0, w_0, r_0, a_0)$, pruning and fine-tuning $M_0$ is a faster way to get the accuracy of models with different configurations than training models from scratch. Here, we take pruning along width as an example: pruning redundant filters in $M_0$ from $w_0$ to $w_1$ ($w_1 < w_0$) and fine-tuning the model yields $M_1$ as well as its configuration $(d_0, w_1, r_0, a_1)$. Similarly, pruning $M_1$ along width to $w_2$ ($w_2 < w_1$) yields $(d_0, w_2, r_0, a_2)$. By the same rule, we can also prune the model along depth or fine-tune it with smaller images to attain other groups $(d_i, w_i, r_i, a_i)$.
Iterative pruning, rather than one-pass pruning, is used for two reasons: 1) Besides the final pruned model, iterative pruning yields several intermediate models as well as their configurations $(d_i, w_i, r_i, a_i)$, which can also be used for the MAP's fitting. 2) Compared with one-pass pruning, iterative pruning prunes fewer filters at each round, so it is possible to recover the model's accuracy with fewer fine-tuning epochs.
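The text solves the constrained fit as a semi-definite program. As a rough illustration only, the sketch below swaps in a simpler technique: it enforces the bound and monotonicity constraints on a sampled grid of $x$ values and hands the problem to SciPy's SLSQP solver. The degree `K`, the grid, the starting point, and the synthetic 13-point data set are all assumptions, and this is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

K = 2                               # assumed polynomial degree k
GRID = np.linspace(0.0, 1.0, 21)    # points in [0, 1] where the constraints are checked


def poly(coef, x):
    """h(x) = sum_j coef[j] * x**j."""
    return sum(c * x ** j for j, c in enumerate(coef))


def dpoly(coef, x):
    """h'(x) = sum_j j * coef[j] * x**(j - 1)."""
    return sum(j * c * x ** (j - 1) for j, c in enumerate(coef) if j > 0)


def predict(omega, d, w, r):
    """Separable MAP: f(d, w, r) = h1(d) * h2(w) * h3(r)."""
    h1, h2, h3 = np.asarray(omega).reshape(3, K + 1)
    return poly(h1, d) * poly(h2, w) * poly(h3, r)


def sse(omega, d, w, r, a):
    """Sum of squared errors between predicted and observed accuracy."""
    return float(np.sum((predict(omega, d, w, r) - a) ** 2))


def initial_guess():
    """Feasible start: h_i(x) = 0.5 + 0.5 x for every i."""
    return np.tile([0.5, 0.5] + [0.0] * (K - 1), 3)


def fit_map(d, w, r, a):
    def feasibility(omega):
        parts = []
        for coef in np.asarray(omega).reshape(3, K + 1):
            h = poly(coef, GRID)
            parts += [h, 1.0 - h, dpoly(coef, GRID)]  # 0 <= h, h <= 1, h' >= 0
        return np.concatenate(parts)

    result = minimize(sse, initial_guess(), args=(d, w, r, a), method="SLSQP",
                      constraints=[{"type": "ineq", "fun": feasibility}])
    return result.x


# Synthetic data: the baseline plus four pruned levels along each dimension (N = 13).
levels = [0.85, 0.70, 0.55, 0.40]
points = ([(1.0, 1.0, 1.0)] + [(x, 1.0, 1.0) for x in levels]
          + [(1.0, x, 1.0) for x in levels] + [(1.0, 1.0, x) for x in levels])
d, w, r = np.array(points).T
a = 0.93 * (1 - 0.5 * (1 - d) ** 2) * (1 - 0.6 * (1 - w) ** 2) * (1 - 0.4 * (1 - r) ** 2)

omega = fit_map(d, w, r, a)
print("SSE before:", sse(initial_guess(), d, w, r, a), "after:", sse(omega, d, w, r, a))
```

Checking the constraints only on a grid does not guarantee they hold between grid points, which is one reason an exact reformulation such as the semi-definite program mentioned above can be preferable.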
With the prior information for the MAP and the help of iterative pruning and fine-tuning, we can approach MAP with polynomials in a fast way.
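The iterative data-collection procedure along the width dimension can be sketched as a simple loop. The callbacks `prune_width`, `fine_tune`, and `evaluate` are hypothetical placeholders for a real pruning and training pipeline, and the toy stand-ins below exist only so that the loop runs.

```python
from typing import Callable, List, Tuple

Config = Tuple[float, float, float, float]  # (d, w, r, accuracy)


def collect_along_width(model, widths: List[float],
                        prune_width: Callable, fine_tune: Callable, evaluate: Callable,
                        d: float = 1.0, r: float = 1.0) -> List[Config]:
    """Prune step by step to each target width, fine-tune, and record (d, w, r, a)."""
    records = []
    for w in sorted(widths, reverse=True):   # each round starts from the previous pruned model
        model = prune_width(model, w)
        model = fine_tune(model)
        records.append((d, w, r, evaluate(model)))
    return records


# Toy stand-ins (assumptions): a "model" is just a dict holding its normalized width.
def toy_prune(model, w):
    return {"w": w}


def toy_fine_tune(model):
    return model


def toy_evaluate(model):
    return 0.93 * (1 - 0.6 * (1 - model["w"]) ** 2)


records = collect_along_width({"w": 1.0}, [0.55, 0.85, 0.7],
                              toy_prune, toy_fine_tune, toy_evaluate)
for rec in records:
    print(rec)
```

Every intermediate model contributes one data point, which is the first of the two reasons given above for preferring iterative pruning.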
Optimal Values for Three Dimensions
After the MAP is obtained, the optimal $(d^\star, w^\star, r^\star)$ in Eq.~\ref{equ:method1} can be solved for easily. Specifically, according to the KKT conditions, the inequality constraint of Eq.~\ref{equ:method1} can be replaced by an equality constraint, because the MAP is monotonically non-decreasing in each variable and the maximum is therefore attained on the boundary of the cost budget:
\begin{equation}
\begin{aligned}
dw^2r^2 - T \times d_0w_0^2r_0^2 = 0,
\end{aligned}
\tag{10}
\end{equation}
The solution then follows from Eq.~(11) by the Lagrange multiplier theorem, where $\lambda$ is the Lagrange multiplier:
\begin{equation}
\begin{aligned}
dw^2r^2 - T \times d_0w_0^2r_0^2 &= 0 \\
h_1'(d)h_2(w)h_3(r)+\lambda w^2r^2 &= 0 \\
h_1(d)h_2'(w)h_3(r)+2\lambda dwr^2 &= 0 \\
h_1(d)h_2(w)h_3'(r)+2\lambda dw^2r &= 0 \\
\end{aligned}
\tag{11}
\end{equation}
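Instead of solving the stationarity system by hand, one can pass the equality-constrained problem to a numerical solver that searches for a KKT point. The sketch below uses SciPy's SLSQP for this. The placeholder polynomial $h(x) = 2x - x^2$ is an assumption standing in for all three fitted $h_i$, and $d_0 = w_0 = r_0 = 1$ is taken as in the definitions.

```python
import numpy as np
from scipy.optimize import minimize


def h(x):
    """Placeholder for a fitted h_i: increasing on [0, 1] with h(1) = 1."""
    return 2.0 * x - x ** 2


def solve_optimal(T):
    """Maximize h(d) * h(w) * h(r) subject to d * w^2 * r^2 = T."""
    def neg_map(v):
        d, w, r = v
        return -(h(d) * h(w) * h(r))

    budget = {"type": "eq", "fun": lambda v: v[0] * v[1] ** 2 * v[2] ** 2 - T}
    start = np.full(3, T ** 0.2)          # feasible start: d = w = r = T^(1/5)
    result = minimize(neg_map, start, method="SLSQP",
                      bounds=[(1e-3, 1.0)] * 3, constraints=[budget])
    return result.x


d_opt, w_opt, r_opt = solve_optimal(T=0.5)
print(d_opt, w_opt, r_opt, d_opt * w_opt ** 2 * r_opt ** 2)
```

In this toy setting the solver keeps more depth than width or resolution, which is consistent with the cost being linear in $d$ but quadratic in $w$ and $r$; with real fitted polynomials the balance may differ.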
Comprehensive Pruning and Fine-tuning
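After obtaining the optimal $(d^\star, w^\star, r^\star)$, we prune the model to the target $d^\star$ and $w^\star$ with filter-level and layer-level pruning methods, and then fine-tune the model with images of size $r^\star$. During pruning, either layer pruning or filter pruning can be performed first, and both orders yield the same pruned model. Without loss of generality, we describe the process with layer pruning performed first. The steps are as follows:
\paragraph{Pruning Layers.} Following DBP~\cite{DBLP:journals/corr/abs-1912-10178}, we attach a linear classifier after each layer of the model $M_0$ and test its accuracy on the evaluation dataset. The accuracy of each linear classifier reflects the discrimination of the corresponding layer's features. Further, the discrimination enhancement of each layer over its preceding layer is taken as the importance of that layer. With this importance metric, we pick out the least important $(1-d^\star/d_0) \times 100\%$ of the layers and remove them from $M_0$, yielding a model $M_{p1}$.
\paragraph{Pruning Filters.} Filter-level pruning is then performed on $M_{p1}$. In particular, we use the scaling factors of the BN layers as the importance metric, as in Network Slimming~\cite{DBLP:conf/iccv/LiuLSHYZ17}. However, instead of comparing the importance of all filters globally, we only compare the importance of filters within the same layer and prune the least important $(1-w^\star/w_0) \times 100\%$ of the filters in each layer. With this modification, the pruning ratio is the same for all layers. We denote the model after filter pruning as $M_{p2}$.
\paragraph{Fine-tuning with Smaller Images.} After pruning, the pruned model $M_{p2}$ is fine-tuned with images of size $r^\star$. The images are resized to the target size with bilinear down-sampling, the most common down-sampling method for images. The model is fine-tuned with a small learning rate until convergence, finally yielding the pruned model $M_p$.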
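The two selection rules described above (layers ranked by their gain in linear-probe accuracy, filters ranked within each layer by BN scaling factor) can be sketched in plain Python. The function names, the sample numbers, and the choice to treat the first layer's gain as its own probe accuracy are assumptions; the sketch only selects indices and does not modify a real network.

```python
from typing import List, Sequence


def least_important_layers(probe_acc: Sequence[float], d_star: float) -> List[int]:
    """probe_acc[l] is the linear-probe accuracy after layer l.

    The importance of layer l is its gain over layer l - 1 (for l = 0, assumed to be
    its own accuracy). Returns the indices of the (1 - d_star) fraction to remove.
    """
    n = len(probe_acc)
    gains = [probe_acc[0]] + [probe_acc[l] - probe_acc[l - 1] for l in range(1, n)]
    n_remove = round((1.0 - d_star) * n)
    order = sorted(range(n), key=lambda l: gains[l])
    return sorted(order[:n_remove])


def filters_to_prune(bn_scales: List[List[float]], w_star: float) -> List[List[int]]:
    """For every layer, return the indices of the filters with the smallest |gamma|.

    The same ratio (1 - w_star) is removed from each layer, so filters are only
    compared within their own layer, never globally.
    """
    pruned = []
    for gammas in bn_scales:
        n_remove = round((1.0 - w_star) * len(gammas))
        order = sorted(range(len(gammas)), key=lambda i: abs(gammas[i]))
        pruned.append(sorted(order[:n_remove]))
    return pruned


# Made-up probe accuracies for a 5-layer model and BN scales for two 4-filter layers.
layers = least_important_layers([0.30, 0.45, 0.47, 0.70, 0.71], d_star=0.6)
filters = filters_to_prune([[0.9, 0.1, 0.5, 0.05], [0.2, 0.8, 0.3, 0.7]], w_star=0.5)
print(layers)
print(filters)
```

Ranking filters per layer rather than globally is what keeps the width ratio identical across layers, matching the definition of $w_i$ used throughout.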
Experiments
……