TVM Quantization Roadmap
INT8 Quantization Scheme
This document gives an overview of how the quantization process works and proposes how to implement it in TVM.
- Background on quantization
- INT8 quantization: backend code generation
- This thread only…
Quantization Development
Search-based Automatic Quantization
A new quantization framework is proposed that combines hardware awareness with the training method.
Drawing on ideas from existing quantization frameworks, a three-phase design is adopted: annotation, calibration, and realization.
- Annotation:
The annotation pass rewrites the graph according to each operator's rewrite function and inserts simulated-quantize operations.
A simulated-quantize operation mimics the rounding and saturation errors of quantizing from floating point to integer.
- Calibration:
The calibration pass adjusts the thresholds of the simulated-quantize operations to reduce the accuracy drop.
- Realization:
The realization pass transforms the simulated graph, which still computes in float32, into a real low-precision integer graph.
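The three passes can be sketched in miniature in plain Python. The helper names below are illustrative stand-ins for the concepts, not TVM's actual `relay.quantize` API:

```python
# Minimal sketch of the three quantization phases described above
# (annotation -> calibration -> realization), for a symmetric int8 scheme.
# All names here are illustrative, not TVM APIs.

def simulated_quantize(x, scale, bits=8):
    """Annotation phase: mimic int8 rounding/saturation error, in float."""
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    q = round(x / scale)                 # rounding error
    q = max(-qmax - 1, min(qmax, q))     # saturation error
    return q * scale                     # result is still a float value

def calibrate(samples, bits=8):
    """Calibration phase: pick a scale (threshold) from observed data."""
    threshold = max(abs(v) for v in samples)
    return threshold / (2 ** (bits - 1) - 1)

def realize(x, scale, bits=8):
    """Realization phase: emit the actual integer value."""
    qmax = 2 ** (bits - 1) - 1
    return max(-qmax - 1, min(qmax, round(x / scale)))

data = [0.5, -1.2, 3.3, -0.7]
scale = calibrate(data)
simulated = [simulated_quantize(v, scale) for v in data]  # float graph
integers = [realize(v, scale) for v in data]              # integer graph
```

Calibration here simply uses the max-abs threshold; TVM can also tune these thresholds per layer to minimize the accuracy drop.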
Quantization Frameworks Supported by TVM
TF QUANTIZATION RELATED
TVM supports all pre-quantized TFLite hosted models
- Performance was evaluated on a C5.12xlarge Cascade Lake machine with Intel VNNI support
- The models have not yet been auto-tuned
PYTORCH QUANTIZATION RELATED
How to convert a model into a quantized model through relay?
How to set the qconfig for torch.quantization.get_default_qconfig('fbgemm')
Accuracy benchmark of quantized models: PyTorch vs TVM
How to convert a quantized PyTorch model to a TVM model
Comparing the accuracy and speed of resnet18, resnet50, mobilenet-v2, mobilenet-v3, inception_v3, and googlenet.
PyTorch's quantization tutorial, covering static quantization and eager mode in PyTorch.
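As background for the qconfig question above, here is a pure-Python sketch of per-tensor affine (asymmetric) quantization, the scheme PyTorch's default 'fbgemm' qconfig uses for quint8 activations. The helper names are assumptions for illustration, not PyTorch APIs:

```python
# Illustrative per-tensor affine (asymmetric) quantization to quint8.
# Helper names are made up for this sketch; they are not PyTorch APIs.

def choose_qparams(xmin, xmax, qmin=0, qmax=255):
    """Pick scale/zero_point mapping [xmin, xmax] onto [qmin, qmax]."""
    xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)  # range must contain 0
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = int(round(qmin - xmin / scale))
    return scale, max(qmin, min(qmax, zero_point))

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))              # clamp to quint8 range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

scale, zp = choose_qparams(-1.0, 3.0)   # observed activation range
q = quantize(0.5, scale, zp)
x = dequantize(q, scale, zp)            # close to 0.5, within one scale step
```

The observed min/max range would come from calibration batches run through observers, which is what the qconfig configures.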
- GAP8 quantization
- GAP8 export from PyTorch and a PyTorch quantization module
- Includes quantized files for squeezenet-v1.1
MXNET RELATED
Model quantization for production-level neural network inference
- The following CPU performance numbers are from an AWS EC2 C5.24xlarge instance with custom 2nd-generation Intel Xeon Scalable Processors (Cascade Lake).
- Model quantization delivers stable speedups across all models, e.g. 3.66x for ResNet 50 v1, 3.82x for ResNet 101 v1, and 3.77x for SSD-VGG16, which is very close to the theoretical 4x speedup of INT8.
- The accuracy of the Apache/MXNet quantization solution is very close to that of the FP32 models, with no retraining required. As shown in Figure 8, MXNet ensures only a marginal accuracy drop, less than 0.5%.
TENSOR CORE RELATED
- [RFC][Tensor Core] Optimization of CNNs on Tensor Core
- [Perf] Enhance cudnn and cublas backend and enable TensorCore
RELATED COMMITS
- [OPT] Low-bit Quantization #2116
- Benchmarking Quantization on Intel CPU
- [RFC][Quantization] Support quantized models from TensorflowLite #2351
- After initial investigation and effort, in the Mobilenet V1 model, INT8 can achieve a speedup of about 30% compared with FP32 on ARM CPU.
- [TFLite] Support TFLite FP32 Relay frontend. #2365
- This is the first PR of #2351, supporting import of existing quantized int8 TFLite models. The base version of Tensorflow / TFLite is 1.12.
- The recently introduced op strategy currently has some issues with task extraction in AutoTVM. This PR fixes them for x86/CUDA.
- [Torch, QNN] Add support for quantized models via QNN #4977
- [QNN][Legalize] Specialize for Platforms w/o fast Int8 support #4307
- The inference time is longer after int8 quantization
- TVM-relay.quantize vs quantization of other frameworks
- TVM FP32, TVM int8, TVM int8 quantization + AutoTVM, MXNet
SPEED UP COMPARISON
AUTOMATIC INTEGER QUANTIZATION
Quantization int8 slower than int16 on Skylake CPU
- int8 is always slower than int16, both before and after auto-tuning
- Target: llvm -mcpu=skylake-avx512
- The problem is solved by creating the int8 task explicitly:
- create the task topi_x86_conv2d_NCHWc_int8
- set the output dtype to int32, the input dtype to uint8, and the weight dtype to int8
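The dtype choices in the fix above are not arbitrary: a single uint8 x int8 product already nearly fills the int16 range, so the accumulator must be int32. A small arithmetic check (illustrative only, no TVM involved):

```python
# Why output dtype must be int32 when inputs are uint8 x int8:
# one product can reach 255 * 128 = 32640, which almost fills the int16
# range [-32768, 32767], so summing over even a small reduction axis
# overflows int16. Illustrative worst-case arithmetic only.

INT16_MAX = 2**15 - 1           # 32767
INT32_MAX = 2**31 - 1

max_product = 255 * 128         # max |uint8 * int8| product
k = 64                          # a small conv reduction size, e.g. 64 channels
worst_case_acc = max_product * k

assert max_product <= INT16_MAX      # a single product still fits in int16
assert worst_case_acc > INT16_MAX    # the accumulated sum does not
assert worst_case_acc <= INT32_MAX   # an int32 accumulator is safe
```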
- TVM FP32, TVM int8, TVM int8 quantization, MXNet, TF1.13
- Includes test code
8bit@Cuda: AutoTVM vs TensorRT vs MXNet
- In this post, we show how to use TVM to automatically optimize quantized deep learning models on CUDA.
ACCEPTING PRE-QUANTIZED INTEGER MODELS
- Is there any speed comparison of quantization on CPU
- discusses speed comparisons among torch-fp32, torch-int8, tvm-fp32, tvm-int16, tvm-int8
SPEED PROFILE TOOLS
- How to profile speed in each layer with RPC?
- the debug runtime will give you some profiling information from the embedded device, e.g.:
Node Name Ops Time(us) Time(%) Start Time End Time Shape Inputs Outputs
--------- --- -------- ------- ---------- -------- ----- ------ -------
1_NCHW1c fuse___layout_transform___4 56.52 0.02 15:24:44.177475 15:24:44.177534 (1, 1, 224, 224) 1 1
_contrib_conv2d_nchwc0 fuse__contrib_conv2d_NCHWc 12436.11 3.4 15:24:44.177549 15:24:44.189993 (1, 1, 224, 224, 1) 2 1
relu0_NCHW8c fuse___layout_transform___broadcast_add_relu___layout_transform__ 4375.43 1.2 15:24:44.190027 15:24:44.194410 (8, 1, 5, 5, 1, 8) 2 1
_contrib_conv2d_nchwc1 fuse__contrib_conv2d_NCHWc_1 213108.6 58.28 15:24:44.194440 15:24:44.407558 (1, 8, 224, 224, 8) 2 1
relu1_NCHW8c fuse___layout_transform___broadcast_add_relu___layout_transform__ 2265.57 0.62 15:24:44.407600 15:24:44.409874 (64, 1, 1) 2 1
_contrib_conv2d_nchwc2 fuse__contrib_conv2d_NCHWc_2 104623.15 28.61 15:24:44.409905 15:24:44.514535 (1, 8, 224, 224, 8) 2 1
relu2_NCHW2c fuse___layout_transform___broadcast_add_relu___layout_transform___1 2004.77 0.55 15:24:44.514567 15:24:44.516582 (8, 8, 3, 3, 8, 8) 2 1
_contrib_conv2d_nchwc3 fuse__contrib_conv2d_NCHWc_3 25218.4 6.9 15:24:44.516628 15:24:44.541856 (1, 8, 224, 224, 8) 2 1
reshape1 fuse___layout_transform___broadcast_add_reshape_transpose_reshape 1554.25 0.43 15:24:44.541893 15:24:44.543452 (64, 1, 1) 2 1
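A small helper can make output like the above easier to read by ranking the hottest ops. This is an illustrative sketch that assumes the whitespace-separated layout shown above, with Time(us) in the third column:

```python
# Rank the most expensive nodes in a debug-runtime profile dump.
# Assumes each row is whitespace-separated with Time(us) as column 3.

def hottest_ops(profile_lines, top=3):
    rows = []
    for line in profile_lines:
        parts = line.split()
        if len(parts) < 4:
            continue
        try:
            time_us = float(parts[2])    # third column is Time(us)
        except ValueError:
            continue                      # skip header/separator rows
        rows.append((parts[0], time_us))
    return sorted(rows, key=lambda r: r[1], reverse=True)[:top]

profile = [
    "1_NCHW1c fuse___layout_transform___4 56.52 0.02",
    "_contrib_conv2d_nchwc1 fuse__contrib_conv2d_NCHWc_1 213108.6 58.28",
    "_contrib_conv2d_nchwc2 fuse__contrib_conv2d_NCHWc_2 104623.15 28.61",
]
print(hottest_ops(profile, top=2))
# [('_contrib_conv2d_nchwc1', 213108.6), ('_contrib_conv2d_nchwc2', 104623.15)]
```

In the profile above this immediately surfaces the two conv2d nodes that account for roughly 87% of the total time.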
Reference link:
https://www.freesion.com/article/3155559638/