Neural Network Compression Framework for fast model inference

Paper Background

Paper link
Code link

  • Alexander Kozlov, Ivan Lazarevich, Vasily Shamporov, Nikolay Lyalyushkin, Yury Gorbachev
    Intel

The names all look Russian.

  • Journal/Conference: CVPR 2020

Abstract

Built on the PyTorch framework, NNCF provides compression techniques such as quantization, sparsity, filter pruning, and binarization. It can be used standalone or integrated into existing training code.

Features

  • Support of quantization, binarization, sparsity and filter pruning algorithms with fine-tuning.
  • Automatic model graph transformation in PyTorch – the model is wrapped and additional layers are inserted in the model graph.
  • Ability to stack compression methods and apply several of them at the same time.
  • Training samples for image classification, object detection and semantic segmentation tasks as well as configuration files to compress a range of models.
  • Ability to integrate compression-aware training into third-party repositories with minimal modifications of the existing training pipelines, which allows integrating NNCF into large-scale model/pipeline aggregation repositories such as MMDetection or Transformers.
  • Hardware-accelerated layers for fast model fine-tuning and multi-GPU training support.
  • Compatibility with the OpenVINO™ Toolkit for model inference.

A few caveats and Framework Architecture

  • NNCF does not perform additional network graph transformations during the quantization process, such as batch normalization folding.
  • The sparsity algorithms implemented in NNCF constitute non-structured network sparsification approaches. Another approach is the so-called structured sparsity, which aims to prune away whole neurons or convolutional filters.
  • Each compression method acts on this wrapper by defining the following basic components:
    • Compression Algorithm Builder
    • Compression Algorithm Controller
    • Compression Loss
    • Compression Scheduler
  • Another important novelty of NNCF is the support of algorithm stacking, where users can build custom compression pipelines by combining several compression methods (e.g., a model can be made both sparse and quantized in a single training run).
  • Usage steps (sketched in code after this list):
    • the model is wrapped by the transparent NNCFNetwork wrapper
    • one or more particular compression algorithm builders are instantiated and applied to the wrapped model.
    • The wrapped model can then be fine-tuned on the target dataset using either an original training pipeline, or a slightly modified pipeline.
    • After the compressed model is trained, we can export it to ONNX format for further usage in the OpenVINO™ inference toolkit.
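
A minimal sketch of these four steps, assuming the public NNCF API (`NNCFConfig`, `create_compressed_model`, `export_model`); exact import paths and option names vary between NNCF releases:

```python
# Sketch of the NNCF usage steps; API names follow the public NNCF repo
# and may differ between releases.
import torch.nn as nn
from nncf import NNCFConfig
from nncf.torch import create_compressed_model

# An ordinary PyTorch model to be compressed.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 30 * 30, 10),
)

# A JSON-style config selecting the compression algorithm(s); several
# algorithms can be listed here to stack them in one training run.
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 32, 32]},
    "compression": [{"algorithm": "quantization"}],
})

# Steps 1-2: wrap the model (NNCFNetwork) and apply the algorithm builders.
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)

# Step 3: fine-tune with an (almost) unchanged training pipeline, i.e. an
# ordinary forward/backward/optimizer loop over the target dataset.

# Step 4: export to ONNX for inference with the OpenVINO toolkit.
compression_ctrl.export_model("compressed_model.onnx")
```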

Compression Methods Overview

Quantization

The quantization approach borrows ideas from the following methods:

  • QAT
  • PACT
  • TQT
|                      | $q_{min}$       | $q_{max}$      |
| -------------------- | --------------- | -------------- |
| Weights              | $-2^{bits-1}+1$ | $2^{bits-1}-1$ |
| Signed activations   | $-2^{bits-1}$   | $2^{bits-1}-1$ |
| Unsigned activations | $0$             | $2^{bits}-1$   |

Symmetric quantization

The scale is learned during training and represents the actual float range.
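The formula here was an image in the original post; a standard symmetric fake-quantization form consistent with a learned scale $s$ and a zero-point fixed at 0 (my reconstruction, not necessarily the paper's exact equation) is:

$$
\hat{x} = \Delta \cdot \mathrm{clamp}\left(\mathrm{round}\left(\frac{x}{\Delta}\right),\; q_{min},\; q_{max}\right), \qquad \Delta = \frac{s}{q_{max}}
$$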

Asymmetric quantization

The float range is optimized during training, with the zero point placed at the range minimum.
The float zero-point must map to an integer inside the quantization range; this constraint allows layers with padding to be computed efficiently.
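The asymmetric formula was likewise an image; a standard form with a learned float range $[low, high]$ and an integer zero point $z$ (my reconstruction) is:

$$
\Delta = \frac{high - low}{2^{bits}-1}, \qquad z = \mathrm{round}\left(\frac{-low}{\Delta}\right), \qquad \hat{x} = \Delta \cdot \left(\mathrm{clamp}\left(\mathrm{round}\left(\frac{x}{\Delta}\right) + z,\; 0,\; 2^{bits}-1\right) - z\right)
$$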

Training and inference

Unlike QAT and TQT, the method in this paper does not fold batch normalization; to keep the statistics consistent between training and inference, a large batch size (>256) must be used.

Mixed-precision quantization

HAWQ-v2 is used to select per-layer bit widths.
The sensitivity is computed as follows:
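The sensitivity formula was an image in the original post; in HAWQ-v2 the per-layer sensitivity is the average Hessian trace weighted by the squared quantization perturbation:

$$
\Omega_i = \overline{\mathrm{Tr}}(H_i) \cdot \left\| Q(W_i) - W_i \right\|_2^2
$$

where $\overline{\mathrm{Tr}}(H_i)$ is the average Hessian trace of layer $i$ (estimated stochastically, e.g. via Hutchinson's method) and $Q(W_i) - W_i$ is the perturbation introduced by quantizing that layer's weights.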

Compression-rate computation: the complexity of the fully int8 model divided by the complexity of the mixed-precision model,
where complexity = FLOPs × bit-width.

Mixed-precision search then means finding the bit-width configuration with the smallest total sensitivity that still satisfies the compression-rate threshold.
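
Putting the two definitions together (notation mine), with $b_i$ the bit width assigned to layer $i$:

$$
\text{ratio} = \frac{\sum_i \mathrm{FLOPs}_i \cdot 8}{\sum_i \mathrm{FLOPs}_i \cdot b_i}
$$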

Binarization

Weights are binarized following the XNOR-Net and DoReFa-Net schemes.
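
For reference, the standard scaling in these two schemes (textbook forms, not quoted from the paper): XNOR-Net binarizes each filter as $W \approx \alpha \cdot \mathrm{sign}(W)$ with a per-filter scale $\alpha = \frac{1}{n}\|W\|_1$, while DoReFa uses a single layer-wide scale $\alpha = \mathbb{E}[|W|]$.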

  • Stage 1: the network is trained without any binarization,
  • Stage 2: the training continues with binarization enabled for activations only,
  • Stage 3: binarization is enabled both for activations and weights,
  • Stage 4: the optimizer learning rate, which had been kept constant during the previous stages, is decreased according to a polynomial law, while the optimizer's weight decay parameter is set to 0.
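
A hedged sketch of this staged schedule in training-loop form; the stage boundaries (epochs 5, 10, 30) and the polynomial power are illustrative assumptions, not values from the paper:

```python
# Staged binarization schedule (epoch boundaries are illustrative).
def binarization_stage_params(epoch, total_epochs, base_lr):
    binarize_activations = epoch >= 5        # Stage 2 onward
    binarize_weights = epoch >= 10           # Stage 3 onward
    if epoch < 30:                           # Stages 1-3: constant learning rate
        lr, weight_decay = base_lr, 1e-4
    else:                                    # Stage 4: polynomial LR decay, weight decay off
        progress = (epoch - 30) / max(1, total_epochs - 30)
        lr = base_lr * (1.0 - progress) ** 2
        weight_decay = 0.0
    return binarize_activations, binarize_weights, lr, weight_decay
```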

Sparsity

NNCF supports two sparsity methods:

1. Magnitude-based sparsity, trained by pruning the smallest-magnitude weights (sketched below)
2. Training with L0 regularization
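
A minimal sketch of the magnitude criterion for unstructured sparsity; the helper name is mine, not NNCF's API:

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Zero out the smallest-magnitude weights to hit a target sparsity level.
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()  # 1 = keep, 0 = prune

# Usage: apply the mask to the weights on every forward pass during training.
w = torch.randn(64, 3, 3, 3)
w_sparse = w * magnitude_mask(w, sparsity=0.5)
```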

Filter pruning

NNCF implements three different criteria for filter importance:

  • L1-norm,
  • L2-norm,
  • geometric median (sketched below).
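
A minimal sketch of the geometric-median criterion (in the spirit of FPGM); filters close to all the others carry redundant information, so a low total distance means low importance:

```python
import torch

def geometric_median_importance(conv_weight: torch.Tensor) -> torch.Tensor:
    # conv_weight: (num_filters, in_channels, kH, kW) -> one row per filter
    flat = conv_weight.flatten(1)
    pairwise = torch.cdist(flat, flat)   # Euclidean distances between filters
    return pairwise.sum(dim=1)           # small value => prune first
```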