TVM Performance Evaluation and Analysis (Part 5)
Figure 3. A further speedup with operator fusion
Table 1. Performance issue of cuBLAS’ batch matmul
Table 2. Finding the best combination of number_thread. The results were obtained on an NVIDIA M40 GPU with CUDA 8.0.
Figure 4. DLPack provides an intermediate wrapper that is shared between frameworks and TVM
Figure 5. The OpenGL/WebGL Backend
Figure 6. TVM uses a unified AST to define kernels and compiles it to code for different platforms.
Figure 7. The benchmark is run in 4 different settings
Figure 8. Inference Speed of Different Backends on ImageNet
Figure 9. Mali T860 and T880
Figure 10. Inference Speed of Different Backends on ImageNet
Table 3. Inference Speed of FP16 on ImageNet