[Model Acceleration] PointPillars Model Acceleration Experiments (1)

2024-01-16 19:43:52

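This series assumes you already have a working understanding of the PointPillars algorithm and a usable PyTorch model. The acceleration experiments follow the "PyTorch --> ONNX --> TensorRT engine --> inference" route; building the network by hand with the TensorRT API is out of scope for now.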

Experimental Environment

OS: Ubuntu 16.04
Kernel: Linux version 4.15.0-112-generic
CUDA: 10.2
Python: 3.6
PyTorch: 1.4.0
TensorRT: 7.1.3.4
cmake: 3.13.2
GPU: 4 x Titan RTX

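Algorithm Pipeline

The algorithm consists of three stages: data preprocessing, the neural network, and post-processing. The original paper describes the network itself as three parts: PFN (Pillar Feature Net), Backbone (2D CNN), and Detection Head (SSD). For deployment the split differs slightly from the paper: the network is divided into PFN, MFN, and RPN. MFN turns the pillar-level deep features extracted by PFN into a pseudo-image of the point cloud. RPN is the Backbone, and part of the detection head's work is moved into the post-processing logic.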

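Input Data Preparation

PFN takes 8 inputs:

pillar_x: x coordinates of the points after pillarization, shape (1,1,P,100)
pillar_y: y coordinates of the points after pillarization, shape (1,1,P,100)
pillar_z: z coordinates of the points after pillarization, shape (1,1,P,100)
pillar_i: intensity values of the points after pillarization, shape (1,1,P,100)
num_points: the actual number of points in each pillar, shape (1,P)
x_sub_shaped: x coordinate of each pillar's center, shape (1,1,P,100)
y_sub_shaped: y coordinate of each pillar's center, shape (1,1,P,100)
mask: point mask for each pillar, shape (1,1,P,100)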
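
As a quick sanity check on the shapes listed above, here is a minimal sketch that builds zero-filled placeholders for the eight PFN inputs. The helper name make_dummy_pfn_inputs and the float32 dtype are assumptions made for illustration; P is the number of pillars and 100 is the per-pillar point cap.

```python
import numpy as np

def make_dummy_pfn_inputs(num_pillars, max_points=100):
    """Hypothetical helper: zero-filled arrays with the PFN input shapes."""
    per_point = (1, 1, num_pillars, max_points)
    names = ["pillar_x", "pillar_y", "pillar_z", "pillar_i",
             "x_sub_shaped", "y_sub_shaped", "mask"]
    inputs = {name: np.zeros(per_point, dtype=np.float32) for name in names}
    # num_points is the only input without the per-point axis.
    inputs["num_points"] = np.zeros((1, num_pillars), dtype=np.float32)
    return inputs

dummy = make_dummy_pfn_inputs(num_pillars=8)
for name, arr in dummy.items():
    print(name, arr.shape)
```

Placeholders like these are handy when exporting or profiling the PFN before the real preprocessing is wired in.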

MFN takes 2 inputs:

voxel_features: the output of PFN, shape (1,64,P,1)
coords: the coordinates of each pillar in the x-y grid, shape (P,4)
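
To make the MFN step concrete, the following NumPy sketch scatters per-pillar features into a dense pseudo-image. It is an illustration, not the deployed implementation: the function name scatter_to_pseudo_image, the grid dimensions, and the assumption that the columns of coords are (batch, z, y, x) are all hypothetical.

```python
import numpy as np

def scatter_to_pseudo_image(voxel_features, coords, ny, nx):
    """Scatter (1, C, P, 1) pillar features onto a (1, C, ny, nx) canvas.

    Assumes coords has shape (P, 4) with columns (batch, z, y, x).
    """
    channels = voxel_features.shape[1]
    feats = voxel_features[0, :, :, 0]             # (C, P)
    canvas = np.zeros((channels, ny * nx), dtype=voxel_features.dtype)
    flat_idx = coords[:, 2] * nx + coords[:, 3]    # row-major cell index
    canvas[:, flat_idx] = feats                    # empty cells stay zero
    return canvas.reshape(1, channels, ny, nx)

# Two pillars with 64-channel features on a small 4 x 6 grid.
feats = np.random.rand(1, 64, 2, 1).astype(np.float32)
coords = np.array([[0, 0, 1, 2], [0, 0, 3, 5]], dtype=np.int32)
image = scatter_to_pseudo_image(feats, coords, ny=4, nx=6)
print(image.shape)
```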

I build both sets of inputs in a single make_input function.

    def make_input(self, points):
        pillars,coors,num_points_per_pillar = self.points_to_pillar(points,
                                                self.pillar_size,
                                                self.point_cloud_range,
                                                self.max_num_points_per_pillar,
                                                self.reverse_index,
                                                self.max_num_pillars)
        pillar_x = pillars[0][np.newaxis, np.newaxis, :] #(1,1,N,100)
        pillar_y = pillars[1][np.newaxis, np.newaxis, :]
        pillar_z = pillars[2][np.newaxis, np.newaxis, :]
        pillar_i = pillars[3][np.newaxis, np.newaxis, :]
        
        x_sub = coors[:, 2][:, np.newaxis].astype(np.float32) * 0.16 + 0.1
        y_sub = coors[:, 1][:, np.newaxis].astype(np.float32) * 0.16 + -39.9
        
        ones_array = np.ones(shape=[1, self.max_num_points_per_pillar], dtype=np.float32)
        x_sub_shaped = np.dot(x_sub, ones_array) #(N,1)x(1,100)=> (N,100)
        y_sub_shaped = np.dot(y_sub, ones_array)
        x_sub_shaped = x_sub_shaped.reshape(1, 1, *x_sub_shaped.shape)
        y_sub_shaped = y_sub_shaped.reshape(1, 1, *y_sub_shaped.shape)

        #num_points_a_pillar = pillar_x_d.shape[3]
        num_points_per_pillar = num_points_per_pillar[np.newaxis, :].astype(np.float32)
        mask = self.get_paddings_indicator(num_points_per_pillar, self.max_num_points_per_pillar)
        mask = mask.astype(pillar_x.dtype) #bool->float32,i have tested,not necessary
        anchors = self.anchors_cache['anchors']
        anchors = anchors[np.newaxis,:]

        anchor_area_threshold = self.anchors_cache['anchor_area_threshold']
        anchors_mask = None  # stays None when the area threshold is disabled (< 0)
        if anchor_area_threshold >= 0:
            dense_pillar_map = box_np_ops.sparse_sum_for_anchors_mask(coors, tuple(self.grid_size[::-1][1:])) 
            dense_pillar_map = dense_pillar_map.cumsum(0)
            dense_pillar_map = dense_pillar_map.cumsum(1)
            anchors_area = box_np_ops.fused_get_anchors_area(
                dense_map = dense_pillar_map,
                anchors_bv = self.anchors_cache['anchors_bv'],
                stride = self.pillar_size,
                offset = self.point_cloud_range,
                grid_size = self.grid_size)
            #Use torch.bool/torch.uint8 as mask type
            anchors_mask = anchors_area > anchor_area_threshold
            #anchors_mask = anchors_mask.astype(np.uint8) #deprecated,use torch.bool instead
            anchors_mask = anchors_mask[np.newaxis,:]

        return (pillar_x,pillar_y,pillar_z,pillar_i,
            num_points_per_pillar,x_sub_shaped,y_sub_shaped,mask,anchors,anchors_mask,coors)
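
make_input relies on get_paddings_indicator, which is not shown in this post. A minimal NumPy sketch of what such a helper might look like is given below, assuming it marks the first num_points slots of each pillar as valid; the (1, 1, P, 100) output shape is also an assumption, chosen to match the mask shape listed earlier. The sketch additionally checks that the np.dot tiling used for x_sub_shaped gives the same result as plain broadcasting.

```python
import numpy as np

def get_paddings_indicator(num_points, max_points):
    """Assumed behaviour: True for real points, False for zero padding.

    num_points: (1, P) count of valid points per pillar.
    returns:    (1, 1, P, max_points) boolean mask.
    """
    slots = np.arange(max_points).reshape(1, 1, 1, max_points)
    counts = num_points.reshape(1, 1, -1, 1)
    return slots < counts

num_points = np.array([[3, 0, 100]], dtype=np.float32)
mask = get_paddings_indicator(num_points, 100)
print(mask.shape, mask.sum(axis=-1))

# np.dot with a row of ones tiles each pillar centre across the point axis,
# which is the same as broadcasting a (P, 1) column to (P, 100).
x_sub = np.array([[0.1], [0.26]], dtype=np.float32)    # (P, 1)
ones_row = np.ones((1, 100), dtype=np.float32)
tiled = np.dot(x_sub, ones_row)                        # (P, 100)
assert np.array_equal(tiled, np.broadcast_to(x_sub, (2, 100)))
```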

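The points_to_pillar function converts the raw point cloud points, of shape (?,4), into pillars of shape (D,P,N) (D features per point, P pillars, N points per pillar), and also returns coords and num_points_per_pillar. The most important piece is the conversion kernel _points_to_voxel_reverse_kernel, which appears below under the name g_points_to_pillar_reverse_kernel.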

import numba
import numpy as np

@numba.jit(nopython=True)
def g_points_to_pillar_reverse_kernel(points,
                            pillar_size,
                            point_cloud_range,
                            num_points_per_pillar,
                            coor_to_pillar_idx,
                            pillars,
                            coors,
                            max_num_points_per_pillar=100,
                            max_num_pillars=12000):
    ndim = 3
    N = points.shape[0]
    ndim_minus_1 = ndim - 1
    grid_size = (point_cloud_range[3:] - point_cloud_range[:3]) / pillar_size
    grid_size = np.round(grid_size, 0, grid_size).astype(np.int32)
    coor = np.zeros(shape=(3, ), dtype=np.int32)
    pillar_num = 0
    failed = False
    for i in range(N):
        failed = False
        for j in range(ndim):
            c = np.floor((points[i, j] - point_cloud_range[j]) / pillar_size[j])
            if c < 0 or c >= grid_size[j]:
                failed = True
                break
            coor[ndim_minus_1 - j] = c
        if failed:
            continue
        pillar_idx = coor_to_pillar_idx[coor[0], coor[1], coor[2]]
        if pillar_idx == -1:
            pillar_idx = pillar_num
            if pillar_num >= max_num_pillars:  # pillar budget exhausted: stop immediately
                break
            pillar_num += 1
            coor_to_pillar_idx[coor[0], coor[1], coor[2]] = pillar_idx
            coors[pillar_idx] = coor
        num = num_points_per_pillar[pillar_idx]
        if num < max_num_points_per_pillar:
            pillars[:, pillar_idx, num] = points[i]
            num_points_per_pillar[pillar_idx] += 1
    return pillar_num
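
The index arithmetic inside the loop can be checked in isolation. The sketch below applies the same floor((p - range_min) / pillar_size) rule to a couple of points with NumPy. The pillar size and range values are illustrative assumptions (the 0.16 step matches the constant used in make_input), and the helper name points_to_grid_coords is hypothetical.

```python
import numpy as np

def points_to_grid_coords(points, pillar_size, point_cloud_range):
    """Return (z, y, x) grid indices and an in-range flag for each point."""
    mins = point_cloud_range[:3]
    maxs = point_cloud_range[3:]
    grid_size = np.round((maxs - mins) / pillar_size).astype(np.int32)
    idx = np.floor((points[:, :3] - mins) / pillar_size).astype(np.int32)
    in_range = np.all((idx >= 0) & (idx < grid_size), axis=1)
    return idx[:, ::-1], in_range   # reversed to (z, y, x), as in the kernel

pillar_size = np.array([0.16, 0.16, 4.0])
pc_range = np.array([0.0, -39.68, -3.0, 69.12, 39.68, 1.0])
points = np.array([[1.0, 0.1, -1.0, 0.5],     # inside the range
                   [-5.0, 0.1, -1.0, 0.5]])   # x lies outside the range
coords, valid = points_to_grid_coords(points, pillar_size, pc_range)
print(coords[valid], valid)
```

Points flagged as out of range here correspond to the ones the kernel skips through its failed flag.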

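The kernel is wrapped in numba's jit decorator. The nopython=True argument asks numba to compile the function in nopython mode: the whole function is compiled to machine code through LLVM and runs without involving the Python interpreter, and if that is not possible compilation fails instead of quietly falling back to a slower mode. numba is a JIT compiler that turns Python functions into machine code; for code dominated by array operations and loops, numba-compiled Python can approach the speed of C or FORTRAN. It is worth using whenever you do heavy numerical work on numpy arrays or write explicit for loops, and the gain is often substantial.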
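
As a small illustration of the decorator on a loop-heavy function, separate from the PointPillars code, the sketch below counts points per 1-D bin. The function and its inputs are made up for this example; if numba is not installed the snippet falls back to plain Python so it still runs.

```python
import numpy as np

try:
    import numba
    jit = numba.jit(nopython=True)
except ImportError:          # fallback so the sketch runs without numba
    def jit(func):
        return func

@jit
def count_points_per_bin(xs, bin_size, num_bins):
    counts = np.zeros(num_bins, dtype=np.int64)
    for i in range(xs.shape[0]):
        b = int(np.floor(xs[i] / bin_size))
        if 0 <= b < num_bins:    # drop points that fall outside the bins
            counts[b] += 1
    return counts

xs = np.array([0.05, 0.2, 0.21, 0.9, 5.0])
print(count_points_per_bin(xs, 0.16, 4))
```

The function body is ordinary Python; only the decorator changes how it is executed.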

Assembly code is code written in human-readable assembly-language mnemonics.
Machine code is code expressed in binary that the hardware can execute.
Hexadecimal code is code expressed in human-readable hexadecimal.

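I ran a quick comparison on my server to see what numba buys. That single decorator line works like magic: without changing the function body at all, the kernel ran about 322 times faster.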

numba enabled	g_points_to_pillar_reverse_kernel
Yes	2.45 ms
No	790 ms
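
How the timings in the table were collected is not stated. One reasonable way to reproduce such a comparison, sketched here under that caveat, is to call the function once as a warm-up (with numba the first call includes JIT compilation) and then average several timed runs. The helper name time_call is hypothetical.

```python
import time

def time_call(func, *args, repeats=10):
    """Average wall-clock time of func(*args) in milliseconds."""
    func(*args)                       # warm-up; keeps JIT compile time out
    start = time.perf_counter()
    for _ in range(repeats):
        func(*args)
    return (time.perf_counter() - start) * 1000.0 / repeats

# Speedup implied by the table: 790 ms without numba vs 2.45 ms with it.
speedup = 790.0 / 2.45
print(f"speedup is about {speedup:.0f}x")

# Example use on a cheap built-in call.
print(f"sum over 1e5 ints: {time_call(sum, range(100000)):.3f} ms")
```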

[References]

https://blog.csdn.net/Small_Munich/article/details/101559424

https://github.com/nutonomy/second.pytorch

https://forums.developer.nvidia.com/t/6-assertion-failed-convertdtype-onnxtype-dtype-unsupported-cast/179605/2

https://zhuanlan.zhihu.com/p/78882641
