Faster_RCNN 1. Preparation

Summarized from the Faster R-CNN paper and its PyTorch implementation.

Code structure (simple-faster-rcnn-pytorch):

  • data
    • __init__.py
    • dataset.py
    • util.py
    • voc_dataset.py  
  • misc
    • convert_caffe_pretrain.py
    • train_fast.py  
  • model
    • utils
      • nms
        • __init__.py
        • _nms_gpu_post.py
        • build.py
        • non_maximum_suppression.py  
      • __init__.py
      • bbox_tools.py
      • creator_tool.py
      • roi_cupy.py  
    • __init__.py
    • faster_rcnn.py
    • faster_rcnn_vgg16.py
    • region_proposal_network.py
    • roi_module.py  
  • utils
    • __init__.py
    • array_tool.py
    • config.py
    • eval_tool.py
    • vis_tool.py
  • demo.ipynb
  • train.py
  • trainer.py

The code contains four packages: data, misc, model, and utils. The core lives in model, which contains NMS (non-maximum suppression), the RPN implementation, the model definitions, and so on. train.py and trainer.py are the training scripts.

This post covers the first part of the code: the data package and the utils package.

I. The data package

First, download the VOC2007 dataset:

wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCdevkit_08-Jun-2007.tar

Then extract the three archives into a single directory (named VOCdevkit):

tar xvf VOCtrainval_06-Nov-2007.tar
tar xvf VOCtest_06-Nov-2007.tar
tar xvf VOCdevkit_08-Jun-2007.tar

1.  util.py

import numpy as np
from PIL import Image
import random


def read_image(path, dtype=np.float32, color=True):
    """Read an image from a file.

    This function reads an image from given file. The image is CHW format and
    the range of its value is :math:`[0, 255]`. If :obj:`color = True`, the
    order of the channels is RGB.

    Args:
        path (str): A path of image file.
        dtype: The type of array. The default value is :obj:`~numpy.float32`.
        color (bool): This option determines the number of channels.
            If :obj:`True`, the number of channels is three. In this case,
            the order of the channels is RGB. This is the default behaviour.
            If :obj:`False`, this function returns a grayscale image.

    Returns:
        ~numpy.ndarray: An image.
    """
    f = Image.open(path)
    try:
        if color:
            img = f.convert('RGB')
        else:
            img = f.convert('P')
        img = np.asarray(img, dtype=dtype)
    finally:
        if hasattr(f, 'close'):
            f.close()

    if img.ndim == 2:
        # reshape (H, W) -> (1, H, W)
        return img[np.newaxis]
    else:
        # transpose (H, W, C) -> (C, H, W)
        return img.transpose((2, 0, 1))


def resize_bbox(bbox, in_size, out_size):
    """Resize bounding boxes according to image resize.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): An array whose shape is :math:`(R, 4)`.
            :math:`R` is the number of bounding boxes.
        in_size (tuple): A tuple of length 2. The height and the width
            of the image before resized.
        out_size (tuple): A tuple of length 2. The height and the width
            of the image after resized.

    Returns:
        ~numpy.ndarray:
        Bounding boxes rescaled according to the given image shapes.
    """
    bbox = bbox.copy()
    y_scale = float(out_size[0]) / in_size[0]
    x_scale = float(out_size[1]) / in_size[1]
    bbox[:, 0] = y_scale * bbox[:, 0]
    bbox[:, 2] = y_scale * bbox[:, 2]
    bbox[:, 1] = x_scale * bbox[:, 1]
    bbox[:, 3] = x_scale * bbox[:, 3]
    return bbox


def flip_bbox(bbox, size, y_flip=False, x_flip=False):
    """Flip bounding boxes accordingly.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): An array whose shape is :math:`(R, 4)`.
            :math:`R` is the number of bounding boxes.
        size (tuple): A tuple of length 2. The height and the width
            of the image before resized.
        y_flip (bool): Flip bounding box according to a vertical flip of
            an image.
        x_flip (bool): Flip bounding box according to a horizontal flip of
            an image.

    Returns:
        ~numpy.ndarray:
        Bounding boxes flipped according to the given flips.
    """
    H, W = size
    bbox = bbox.copy()
    if y_flip:
        y_max = H - bbox[:, 0]
        y_min = H - bbox[:, 2]
        bbox[:, 0] = y_min
        bbox[:, 2] = y_max
    if x_flip:
        x_max = W - bbox[:, 1]
        x_min = W - bbox[:, 3]
        bbox[:, 1] = x_min
        bbox[:, 3] = x_max
    return bbox


def crop_bbox(
        bbox, y_slice=None, x_slice=None,
        allow_outside_center=True, return_param=False):
    """Translate bounding boxes to fit within the cropped area of an image.

    This method is mainly used together with image cropping.
    This method translates the coordinates of bounding boxes like
    :func:`data.util.translate_bbox`. In addition,
    this function truncates the bounding boxes to fit within the cropped area.
    If a bounding box does not overlap with the cropped area,
    this bounding box will be removed.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): Bounding boxes to be transformed. The shape is
            :math:`(R, 4)`. :math:`R` is the number of bounding boxes.
        y_slice (slice): The slice of y axis.
        x_slice (slice): The slice of x axis.
        allow_outside_center (bool): If this argument is :obj:`False`,
            bounding boxes whose centers are outside of the cropped area
            are removed. The default value is :obj:`True`.
        return_param (bool): If :obj:`True`, this function returns
            indices of kept bounding boxes.

    Returns:
        ~numpy.ndarray or (~numpy.ndarray, dict):

        If :obj:`return_param = False`, returns an array :obj:`bbox`.

        If :obj:`return_param = True`, returns a tuple whose elements are
        :obj:`bbox, param`. :obj:`param` is a dictionary of intermediate
        parameters whose contents are listed below with key, value-type and
        the description of the value.

        * **index** (*numpy.ndarray*): An array holding indices of used \
            bounding boxes.
    """
    t, b = _slice_to_bounds(y_slice)
    l, r = _slice_to_bounds(x_slice)
    crop_bb = np.array((t, l, b, r))

    if allow_outside_center:
        mask = np.ones(bbox.shape[0], dtype=bool)
    else:
        center = (bbox[:, :2] + bbox[:, 2:]) / 2
        mask = np.logical_and(crop_bb[:2] <= center, center < crop_bb[2:]) \
            .all(axis=1)

    bbox = bbox.copy()
    bbox[:, :2] = np.maximum(bbox[:, :2], crop_bb[:2])
    bbox[:, 2:] = np.minimum(bbox[:, 2:], crop_bb[2:])
    bbox[:, :2] -= crop_bb[:2]
    bbox[:, 2:] -= crop_bb[:2]

    mask = np.logical_and(mask, (bbox[:, :2] < bbox[:, 2:]).all(axis=1))
    bbox = bbox[mask]

    if return_param:
        return bbox, {'index': np.flatnonzero(mask)}
    else:
        return bbox


def _slice_to_bounds(slice_):
    if slice_ is None:
        return 0, np.inf

    if slice_.start is None:
        l = 0
    else:
        l = slice_.start

    if slice_.stop is None:
        u = np.inf
    else:
        u = slice_.stop

    return l, u


def translate_bbox(bbox, y_offset=0, x_offset=0):
    """Translate bounding boxes.

    This method is mainly used together with image transforms, such as padding
    and cropping, which translates the left top point of the image from
    coordinate :math:`(0, 0)` to coordinate
    :math:`(y, x) = (y_{offset}, x_{offset})`.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): Bounding boxes to be transformed. The shape is
            :math:`(R, 4)`. :math:`R` is the number of bounding boxes.
        y_offset (int or float): The offset along y axis.
        x_offset (int or float): The offset along x axis.

    Returns:
        ~numpy.ndarray:
        Bounding boxes translated according to the given offsets.
    """
    out_bbox = bbox.copy()
    out_bbox[:, :2] += (y_offset, x_offset)
    out_bbox[:, 2:] += (y_offset, x_offset)

    return out_bbox


def random_flip(img, y_random=False, x_random=False,
                return_param=False, copy=False):
    """Randomly flip an image in vertical or horizontal direction.

    Args:
        img (~numpy.ndarray): An array that gets flipped. This is in
            CHW format.
        y_random (bool): Randomly flip in vertical direction.
        x_random (bool): Randomly flip in horizontal direction.
        return_param (bool): Returns information of flip.
        copy (bool): If False, a view of :obj:`img` will be returned.

    Returns:
        ~numpy.ndarray or (~numpy.ndarray, dict):

        If :obj:`return_param = False`, returns an array :obj:`out_img`
        that is the result of flipping.

        If :obj:`return_param = True`, returns a tuple whose elements are
        :obj:`out_img, param`. :obj:`param` is a dictionary of intermediate
        parameters whose contents are listed below with key, value-type and
        the description of the value.

        * **y_flip** (*bool*): Whether the image was flipped in the \
            vertical direction or not.
        * **x_flip** (*bool*): Whether the image was flipped in the \
            horizontal direction or not.
    """
    y_flip, x_flip = False, False
    if y_random:
        y_flip = random.choice([True, False])
    if x_random:
        x_flip = random.choice([True, False])

    if y_flip:
        img = img[:, ::-1, :]
    if x_flip:
        img = img[:, :, ::-1]

    if copy:
        img = img.copy()

    if return_param:
        return img, {'y_flip': y_flip, 'x_flip': x_flip}
    else:
        return img

Utility functions:

The function read_image first uses PIL to read the image as an RGB (or single-channel) image, then converts it to C×H×W (or 1×H×W) format. Pixel values are in [0, 255].

The function resize_bbox rescales bboxes of shape (R, 4) according to the input and output height and width.

The function flip_bbox flips the input bboxes horizontally and/or vertically, depending on which flips were applied to the image.

The function crop_bbox fits bboxes to the cropped region of an image.

The function translate_bbox shifts bboxes horizontally or vertically by the given offsets.

The function random_flip randomly flips an image (CHW format) horizontally or vertically:

  • img = img[:, ::-1, :]     vertical flip
  • img = img[:, :, ::-1]     horizontal flip
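
A minimal usage sketch of these utilities (the image path and the box values are made up for illustration; any VOC image and any (R, 4) box array work the same way):

import numpy as np
from data.util import read_image, resize_bbox, random_flip, flip_bbox

# hypothetical path; read_image returns (3, H, W) float32 RGB in [0, 255]
img = read_image('VOCdevkit/VOC2007/JPEGImages/000001.jpg')
_, H, W = img.shape

# one box in (ymin, xmin, ymax, xmax) order
bbox = np.array([[240., 48., 371., 195.]], dtype=np.float32)

# rescale the box the same way the image would be resized to an arbitrary (600, 424)
bbox_resized = resize_bbox(bbox, in_size=(H, W), out_size=(600, 424))

# randomly flip the image horizontally and flip the box consistently
img_flipped, params = random_flip(img, x_random=True, return_param=True)
bbox_flipped = flip_bbox(bbox, (H, W), x_flip=params['x_flip'])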

2.  voc_dataset.py

import os
import xml.etree.ElementTree as ET

import numpy as np

from .util import read_image


class VOCBboxDataset:
    """Bounding box dataset for PASCAL `VOC`_.

    .. _`VOC`: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/

    The index corresponds to each image.

    When queried by an index, if :obj:`return_difficult == False`,
    this dataset returns a corresponding
    :obj:`img, bbox, label`, a tuple of an image, bounding boxes and labels.
    This is the default behaviour.
    If :obj:`return_difficult == True`, this dataset returns corresponding
    :obj:`img, bbox, label, difficult`. :obj:`difficult` is a boolean array
    that indicates whether bounding boxes are labeled as difficult or not.

    The bounding boxes are packed into a two dimensional tensor of shape
    :math:`(R, 4)`, where :math:`R` is the number of bounding boxes in
    the image. The second axis represents attributes of the bounding box.
    They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`, where the
    four attributes are coordinates of the top left and the bottom right
    vertices.

    The labels are packed into a one dimensional tensor of shape :math:`(R,)`.
    :math:`R` is the number of bounding boxes in the image.
    The class name of the label :math:`l` is :math:`l` th element of
    :obj:`VOC_BBOX_LABEL_NAMES`.

    The array :obj:`difficult` is a one dimensional boolean array of shape
    :math:`(R,)`. :math:`R` is the number of bounding boxes in the image.
    If :obj:`use_difficult` is :obj:`False`, this array is
    a boolean array with all :obj:`False`.

    The type of the image, the bounding boxes and the labels are as follows.

    * :obj:`img.dtype == numpy.float32`
    * :obj:`bbox.dtype == numpy.float32`
    * :obj:`label.dtype == numpy.int32`
    * :obj:`difficult.dtype == numpy.bool`

    Args:
        data_dir (string): Path to the root of the training data.
            i.e. "/data/image/voc/VOCdevkit/VOC2007/"
        split ({'train', 'val', 'trainval', 'test'}): Select a split of the
            dataset. :obj:`test` split is only available for
            2007 dataset.
        year ({'2007', '2012'}): Use a dataset prepared for a challenge
            held in :obj:`year`.
        use_difficult (bool): If :obj:`True`, use images that are labeled as
            difficult in the original annotation.
        return_difficult (bool): If :obj:`True`, this dataset returns
            a boolean array
            that indicates whether bounding boxes are labeled as difficult
            or not. The default value is :obj:`False`.
    """

    def __init__(self, data_dir, split='trainval',
                 use_difficult=False, return_difficult=False,
                 ):
        # if split not in ['train', 'trainval', 'val']:
        #     if not (split == 'test' and year == '2007'):
        #         warnings.warn(
        #             'please pick split from \'train\', \'trainval\', \'val\''
        #             'for 2012 dataset. For 2007 dataset, you can pick \'test\''
        #             ' in addition to the above mentioned splits.'
        #         )
        id_list_file = os.path.join(
            data_dir, 'ImageSets/Main/{0}.txt'.format(split))

        self.ids = [id_.strip() for id_ in open(id_list_file)]
        self.data_dir = data_dir
        self.use_difficult = use_difficult
        self.return_difficult = return_difficult
        self.label_names = VOC_BBOX_LABEL_NAMES

    def __len__(self):
        return len(self.ids)

    def get_example(self, i):
        """Returns the i-th example.

        Returns a color image and bounding boxes. The image is in CHW format.
        The returned image is RGB.

        Args:
            i (int): The index of the example.

        Returns:
            tuple of an image and bounding boxes
        """
        id_ = self.ids[i]
        anno = ET.parse(
            os.path.join(self.data_dir, 'Annotations', id_ + '.xml'))
        bbox = list()
        label = list()
        difficult = list()
        for obj in anno.findall('object'):
            # when not using the difficult split and the object is
            # difficult, skip it.
            if not self.use_difficult and int(obj.find('difficult').text) == 1:
                continue

            difficult.append(int(obj.find('difficult').text))
            bndbox_anno = obj.find('bndbox')
            # subtract 1 to make pixel indexes 0-based
            bbox.append([
                int(bndbox_anno.find(tag).text) - 1
                for tag in ('ymin', 'xmin', 'ymax', 'xmax')])
            name = obj.find('name').text.lower().strip()
            label.append(VOC_BBOX_LABEL_NAMES.index(name))
        bbox = np.stack(bbox).astype(np.float32)
        label = np.stack(label).astype(np.int32)
        # When `use_difficult==False`, all elements in `difficult` are False.
        difficult = np.array(difficult, dtype=np.bool).astype(np.uint8)  # PyTorch doesn't support np.bool

        # Load the image
        img_file = os.path.join(self.data_dir, 'JPEGImages', id_ + '.jpg')
        img = read_image(img_file, color=True)

        # if self.return_difficult:
        #     return img, bbox, label, difficult
        return img, bbox, label, difficult

    __getitem__ = get_example


VOC_BBOX_LABEL_NAMES = (
    'aeroplane',
    'bicycle',
    'bird',
    'boat',
    'bottle',
    'bus',
    'car',
    'cat',
    'chair',
    'cow',
    'diningtable',
    'dog',
    'horse',
    'motorbike',
    'person',
    'pottedplant',
    'sheep',
    'sofa',
    'train',
    'tvmonitor')

Implements the VOC2007 dataset class: 9,963 images in total.

VOC2007 provides the splits {'train', 'val', 'trainval', 'test'} and contains 20 object classes (21 including background). The four splits hold 2501, 2510, 5011, and 4952 images respectively (trainval = train + val). VOC2012 has no test split.

Training uses the trainval split; testing uses the test split.

Each image's annotations are stored in an XML file:

<annotation>
    <folder>VOC2007</folder>
    <filename>000001.jpg</filename>
    <source>
        <database>The VOC2007 Database</database>
        <annotation>PASCAL VOC2007</annotation>
        <image>flickr</image>
        <flickrid>341012865</flickrid>
    </source>
    <owner>
        <flickrid>Fried Camels</flickrid>
        <name>Jinky the Fruit Bat</name>
    </owner>
    <size>
        <width>353</width>
        <height>500</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>dog</name>
        <pose>Left</pose>
        <truncated>1</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>48</xmin>
            <ymin>240</ymin>
            <xmax>195</xmax>
            <ymax>371</ymax>
        </bndbox>
    </object>
    <object>
        <name>person</name>
        <pose>Left</pose>
        <truncated>1</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>8</xmin>
            <ymin>12</ymin>
            <xmax>352</xmax>
            <ymax>498</ymax>
        </bndbox>
    </object>
</annotation>

Each XML file gives the image size and, for every object, the bbox coordinates, the bbox's label, and whether it is marked as difficult.

The class VOCBboxDataset inherits from the base object class; to instantiate it you only need to supply the path to the VOC dataset.

VOCBboxDataset has essentially one method, get_example (aliased to __getitem__), which returns the information of the i-th image (image, bbox, label, difficult).
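
A short usage sketch (the data_dir path is an assumption; point it at your extracted VOCdevkit/VOC2007):

from data.voc_dataset import VOCBboxDataset, VOC_BBOX_LABEL_NAMES

dataset = VOCBboxDataset('VOCdevkit/VOC2007/', split='trainval')
print(len(dataset))                        # 5011 images in trainval

img, bbox, label, difficult = dataset[0]   # __getitem__ == get_example
print(img.shape, img.dtype)                # (3, H, W) float32 RGB in [0, 255]
print(bbox.shape, label.shape)             # (R, 4) float32, (R,) int32
print([VOC_BBOX_LABEL_NAMES[l] for l in label])   # class names of the boxes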

3.  dataset.py

import torch as t
from skimage import transform as sktsf
from torchvision import transforms as tvtsf
import numpy as np

from .voc_dataset import VOCBboxDataset
from . import util
from utils.config import opt


def inverse_normalize(img):
    if opt.caffe_pretrain:
        img = img + (np.array([122.7717, 115.9465, 102.9801]).reshape(3, 1, 1))
        return img[::-1, :, :]
    # approximate un-normalize for visualize
    return (img * 0.225 + 0.45).clip(min=0, max=1) * 255


def pytorch_normalze(img):
    """
    https://github.com/pytorch/vision/issues/223
    return appr -1~1 RGB
    """
    normalize = tvtsf.Normalize(mean=[0.485, 0.456, 0.406],
                                std=[0.229, 0.224, 0.225])
    img = normalize(t.from_numpy(img))
    return img.numpy()


def caffe_normalize(img):
    """
    return appr -125-125 BGR
    """
    img = img[[2, 1, 0], :, :]  # RGB -> BGR
    img = img * 255
    mean = np.array([122.7717, 115.9465, 102.9801]).reshape(3, 1, 1)
    img = (img - mean).astype(np.float32, copy=True)
    return img


def preprocess(img, min_size=600, max_size=1000):
    """Preprocess an image for feature extraction.

    The length of the shorter edge is scaled to :obj:`self.min_size`.
    After the scaling, if the length of the longer edge is longer than
    :obj:`self.max_size`, the image is scaled to fit the longer edge
    to :obj:`self.max_size`.

    After resizing the image, the image is subtracted by a mean image value
    :obj:`self.mean`.

    Args:
        img (~numpy.ndarray): An image. This is in CHW and RGB format.
            The range of its value is :math:`[0, 255]`.

    Returns:
        ~numpy.ndarray: A preprocessed image.
    """
    C, H, W = img.shape
    scale1 = min_size / min(H, W)
    scale2 = max_size / max(H, W)
    scale = min(scale1, scale2)
    img = img / 255.
    img = sktsf.resize(img, (C, H * scale, W * scale), mode='reflect')
    # both the longer and shorter should be less than
    # max_size and min_size
    if opt.caffe_pretrain:
        normalize = caffe_normalize
    else:
        normalize = pytorch_normalze
    return normalize(img)


class Transform(object):

    def __init__(self, min_size=600, max_size=1000):
        self.min_size = min_size
        self.max_size = max_size

    def __call__(self, in_data):
        img, bbox, label = in_data
        _, H, W = img.shape
        img = preprocess(img, self.min_size, self.max_size)
        _, o_H, o_W = img.shape
        scale = o_H / H
        bbox = util.resize_bbox(bbox, (H, W), (o_H, o_W))

        # horizontally flip
        img, params = util.random_flip(
            img, x_random=True, return_param=True)
        bbox = util.flip_bbox(
            bbox, (o_H, o_W), x_flip=params['x_flip'])

        return img, bbox, label, scale


class Dataset:
    def __init__(self, opt):
        self.opt = opt
        self.db = VOCBboxDataset(opt.voc_data_dir)
        self.tsf = Transform(opt.min_size, opt.max_size)

    def __getitem__(self, idx):
        ori_img, bbox, label, difficult = self.db.get_example(idx)

        img, bbox, label, scale = self.tsf((ori_img, bbox, label))
        # TODO: check whose stride is negative to fix this instead copy all
        # some of the strides of a given numpy array are negative.
        return img.copy(), bbox.copy(), label.copy(), scale

    def __len__(self):
        return len(self.db)


class TestDataset:
    def __init__(self, opt, split='test', use_difficult=True):
        self.opt = opt
        self.db = VOCBboxDataset(opt.voc_data_dir, split=split, use_difficult=use_difficult)

    def __getitem__(self, idx):
        ori_img, bbox, label, difficult = self.db.get_example(idx)
        img = preprocess(ori_img)
        return img, ori_img.shape[1:], bbox, label, difficult

    def __len__(self):
        return len(self.db)

Building the dataset:

The function inverse_normalize undoes the normalization for both the caffe and torchvision variants, since either the caffe VGG pretrained weights or the torchvision ones can be used (the latter gives slightly worse results than the former).

The function pytorch_normalze normalizes the input image for the PyTorch model: RGB in [0, 255] is converted to RGB in [0, 1] and then normalized to roughly [-1, 1] RGB.

The function caffe_normalize normalizes the input image for the caffe model: the [0, 1] RGB image is converted to BGR, scaled back to [0, 255], and the per-channel mean is subtracted, giving roughly [-125, 125] BGR.

The function preprocess implements the image preprocessing: the image returned by read_image is CHW in [0, 255]; it is first divided by 255 and then resized so that, as in the paper, the longer edge is at most 1000 and the shorter edge is at most 600 while keeping the aspect ratio. Finally pytorch_normalze or caffe_normalize is applied.
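
To make the scaling rule concrete, here is the arithmetic for a hypothetical 500×353 (H×W) image, using the same formula as preprocess:

H, W = 500, 353
scale1 = 600 / min(H, W)     # 600 / 353 ≈ 1.700 (short edge -> 600)
scale2 = 1000 / max(H, W)    # 1000 / 500 = 2.0  (long edge -> 1000)
scale = min(scale1, scale2)  # ≈ 1.700, the more restrictive factor wins
print(round(H * scale), round(W * scale))   # ≈ 850 x 600: short edge is 600, long edge stays ≤ 1000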

Transform implements the preprocessing pipeline through its __call__ method: it applies preprocess to the image, rescales the bbox by the same factor as the image, and then randomly flips image and bbox horizontally together.

Dataset generates the training samples, i.e. the trainval split. Its __getitem__ method uses VOCBboxDataset to load one training image and processes it with Transform, returning the processed image, bbox, label, and scale.

TestDataset generates the test samples, i.e. the test split. Its __getitem__ method uses VOCBboxDataset to load one test image; unlike training it only calls preprocess, so the bbox is not resized and the image's original size (before preprocessing) is returned alongside.
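A rough sketch of how these datasets are consumed (this mirrors the way train.py wires them into DataLoaders; batch_size is fixed to 1 in this codebase, see note 2 at the end of this post):

from torch.utils import data as data_
from data.dataset import Dataset, TestDataset
from utils.config import opt

train_set = Dataset(opt)
train_loader = data_.DataLoader(train_set, batch_size=1, shuffle=True,
                                num_workers=opt.num_workers)

test_set = TestDataset(opt)
test_loader = data_.DataLoader(test_set, batch_size=1, shuffle=False,
                               num_workers=opt.test_num_workers)

for img, bbox, label, scale in train_loader:
    # img: (1, 3, o_H, o_W) normalized; bbox: (1, R, 4); label: (1, R); scale: float
    break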

II. The utils package

1.  array_tool.py

"""
tools to convert specified type
"""
import torch as t
import numpy as np def tonumpy(data):
if isinstance(data, np.ndarray):
return data
if isinstance(data, t._TensorBase):
return data.cpu().numpy()
if isinstance(data, t.autograd.Variable):
return tonumpy(data.data) def totensor(data, cuda=True):
if isinstance(data, np.ndarray):
tensor = t.from_numpy(data)
if isinstance(data, t._TensorBase):
tensor = data
if isinstance(data, t.autograd.Variable):
tensor = data.data
if cuda:
tensor = tensor.cuda()
return tensor def tovariable(data):
if isinstance(data, np.ndarray):
return tovariable(totensor(data))
if isinstance(data, t._TensorBase):
return t.autograd.Variable(data)
if isinstance(data, t.autograd.Variable):
return data
else:
raise ValueError("UnKnow data type: %s, input should be {np.ndarray,Tensor,Variable}" %type(data)) def scalar(data):
if isinstance(data, np.ndarray):
return data.reshape(1)[0]
if isinstance(data, t._TensorBase):
return data.view(1)[0]
if isinstance(data, t.autograd.Variable):
return data.data.view(1)[0]

Type-conversion helpers for converting between torch Tensors, numpy arrays, and Variables. (Note that t._TensorBase and t.autograd.Variable come from the pre-0.4 PyTorch API this repo was written against.)
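
A small usage sketch (assuming a PyTorch version old enough to expose t._TensorBase, as the module requires):

import numpy as np
from utils import array_tool as at

a = np.zeros((2, 3), dtype=np.float32)
tensor = at.totensor(a, cuda=False)        # np.ndarray -> torch Tensor (optionally moved to GPU)
back = at.tonumpy(tensor)                  # Tensor / Variable -> np.ndarray
loss_value = at.scalar(np.array([0.5]))    # pull a Python scalar out of a 1-element array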

2.  config.py

from pprint import pprint


# Default Configs for training
# NOTE that, config items could be overwriten by passing argument through command line.
# e.g. --voc-data-dir='./data/'

class Config:
    # data
    voc_data_dir = '/home/cy/.chainer/dataset/pfnet/chainercv/voc/VOCdevkit/VOC2007/'
    min_size = 600   # image resize
    max_size = 1000  # image resize
    num_workers = 8
    test_num_workers = 8

    # sigma for l1_smooth_loss
    rpn_sigma = 3.
    roi_sigma = 1.

    # param for optimizer
    # 0.0005 in origin paper but 0.0001 in tf-faster-rcnn
    weight_decay = 0.0005
    lr_decay = 0.1  # 1e-3 -> 1e-4
    lr = 1e-3

    # visualization
    env = 'faster-rcnn'  # visdom env
    port = 8097
    plot_every = 40  # vis every N iter

    # preset
    data = 'voc'
    pretrained_model = 'vgg16'

    # training
    epoch = 14

    use_adam = False     # Use Adam optimizer
    use_chainer = False  # try match everything as chainer
    use_drop = False     # use dropout in RoIHead
    # debug
    debug_file = '/tmp/debugf'

    test_num = 10000
    # model
    load_path = None

    caffe_pretrain = False  # use caffe pretrained model instead of torchvision
    caffe_pretrain_path = 'checkpoints/vgg16-caffe.pth'

    def _parse(self, kwargs):
        state_dict = self._state_dict()
        for k, v in kwargs.items():
            if k not in state_dict:
                raise ValueError('UnKnown Option: "--%s"' % k)
            setattr(self, k, v)

        print('======user config========')
        pprint(self._state_dict())
        print('==========end============')

    def _state_dict(self):
        return {k: getattr(self, k) for k, _ in Config.__dict__.items()
                if not k.startswith('_')}


opt = Config()

Configuration file. It covers the dataset path, the visdom environment, image sizes, the type of pretrained weights, the learning rate, and the other hyperparameters.
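
Defaults can be overridden programmatically through opt._parse (train.py forwards its command-line keyword arguments to this method, e.g. python train.py train --voc-data-dir=/path/to/VOCdevkit/VOC2007/); the path below is a placeholder:

from utils.config import opt

opt._parse({'voc_data_dir': '/path/to/VOCdevkit/VOC2007/',
            'caffe_pretrain': True,
            'plot_every': 100})
print(opt.voc_data_dir)   # the overridden value; _parse also pretty-prints the full config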

3.  vis_tool.py

import time

import numpy as np
import matplotlib
import torch as t
import visdom

matplotlib.use('Agg')
from matplotlib import pyplot as plot

# from data.voc_dataset import VOC_BBOX_LABEL_NAMES

VOC_BBOX_LABEL_NAMES = (
    'fly',
    'bike',
    'bird',
    'boat',
    'pin',
    'bus',
    'c',
    'cat',
    'chair',
    'cow',
    'table',
    'dog',
    'horse',
    'moto',
    'p',
    'plant',
    'shep',
    'sofa',
    'train',
    'tv',
)


def vis_image(img, ax=None):
    """Visualize a color image.

    Args:
        img (~numpy.ndarray): An array of shape :math:`(3, height, width)`.
            This is in RGB format and the range of its value is
            :math:`[0, 255]`.
        ax (matplotlib.axes.Axis): The visualization is displayed on this
            axis. If this is :obj:`None` (default), a new axis is created.

    Returns:
        ~matploblib.axes.Axes:
        Returns the Axes object with the plot for further tweaking.
    """
    if ax is None:
        fig = plot.figure()
        ax = fig.add_subplot(1, 1, 1)
    # CHW -> HWC
    img = img.transpose((1, 2, 0))
    ax.imshow(img.astype(np.uint8))
    return ax


def vis_bbox(img, bbox, label=None, score=None, ax=None):
    """Visualize bounding boxes inside image.

    Args:
        img (~numpy.ndarray): An array of shape :math:`(3, height, width)`.
            This is in RGB format and the range of its value is
            :math:`[0, 255]`.
        bbox (~numpy.ndarray): An array of shape :math:`(R, 4)`, where
            :math:`R` is the number of bounding boxes in the image.
            Each element is organized
            by :math:`(y_{min}, x_{min}, y_{max}, x_{max})` in the second axis.
        label (~numpy.ndarray): An integer array of shape :math:`(R,)`.
            The values correspond to id for label names stored in
            :obj:`label_names`. This is optional.
        score (~numpy.ndarray): A float array of shape :math:`(R,)`.
            Each value indicates how confident the prediction is.
            This is optional.
        label_names (iterable of strings): Name of labels ordered according
            to label ids. If this is :obj:`None`, labels will be skipped.
        ax (matplotlib.axes.Axis): The visualization is displayed on this
            axis. If this is :obj:`None` (default), a new axis is created.

    Returns:
        ~matploblib.axes.Axes:
        Returns the Axes object with the plot for further tweaking.
    """
    label_names = list(VOC_BBOX_LABEL_NAMES) + ['bg']
    # add for index `-1`
    if label is not None and not len(bbox) == len(label):
        raise ValueError('The length of label must be same as that of bbox')
    if score is not None and not len(bbox) == len(score):
        raise ValueError('The length of score must be same as that of bbox')

    # Returns newly instantiated matplotlib.axes.Axes object if ax is None
    ax = vis_image(img, ax=ax)

    # If there is no bounding box to display, visualize the image and exit.
    if len(bbox) == 0:
        return ax

    for i, bb in enumerate(bbox):
        xy = (bb[1], bb[0])
        height = bb[2] - bb[0]
        width = bb[3] - bb[1]
        ax.add_patch(plot.Rectangle(
            xy, width, height, fill=False, edgecolor='red', linewidth=2))

        caption = list()

        if label is not None and label_names is not None:
            lb = label[i]
            if not (-1 <= lb < len(label_names)):  # modified here to add background
                raise ValueError('No corresponding name is given')
            caption.append(label_names[lb])
        if score is not None:
            sc = score[i]
            caption.append('{:.2f}'.format(sc))

        if len(caption) > 0:
            ax.text(bb[1], bb[0],
                    ': '.join(caption),
                    style='italic',
                    bbox={'facecolor': 'white', 'alpha': 0.5, 'pad': 0})
    return ax


def fig2data(fig):
    """
    brief Convert a Matplotlib figure to a 4D numpy array with RGBA
    channels and return it

    @param fig: a matplotlib figure
    @return a numpy 3D array of RGBA values
    """
    # draw the renderer
    fig.canvas.draw()

    # Get the RGBA buffer from the figure
    w, h = fig.canvas.get_width_height()
    buf = np.fromstring(fig.canvas.tostring_argb(), dtype=np.uint8)
    buf.shape = (w, h, 4)

    # canvas.tostring_argb gives a pixmap in ARGB mode. Roll the ALPHA channel to have it in RGBA mode
    buf = np.roll(buf, 3, axis=2)
    return buf.reshape(h, w, 4)


def fig4vis(fig):
    """
    convert figure to ndarray
    """
    ax = fig.get_figure()
    img_data = fig2data(ax).astype(np.int32)
    plot.close()
    # HWC -> CHW
    return img_data[:, :, :3].transpose((2, 0, 1)) / 255.


def visdom_bbox(*args, **kwargs):
    fig = vis_bbox(*args, **kwargs)
    data = fig4vis(fig)
    return data


class Visualizer(object):
    """
    wrapper for visdom
    you can still access native visdom functions such as
    self.line, self.scatter, self._send, etc.
    due to the implementation of `__getattr__`
    """

    def __init__(self, env='default', **kwargs):
        self.vis = visdom.Visdom(env=env, **kwargs)
        self._vis_kw = kwargs

        # e.g. ('loss', 23): the 23rd value of loss
        self.index = {}
        self.log_text = ''

    def reinit(self, env='default', **kwargs):
        """
        change the config of visdom
        """
        self.vis = visdom.Visdom(env=env, **kwargs)
        return self

    def plot_many(self, d):
        """
        plot multiple values
        @params d: dict (name, value) i.e. ('loss', 0.11)
        """
        for k, v in d.items():
            if v is not None:
                self.plot(k, v)

    def img_many(self, d):
        for k, v in d.items():
            self.img(k, v)

    def plot(self, name, y, **kwargs):
        """
        self.plot('loss', 1.00)
        """
        x = self.index.get(name, 0)
        self.vis.line(Y=np.array([y]), X=np.array([x]),
                      win=name,
                      opts=dict(title=name),
                      update=None if x == 0 else 'append',
                      **kwargs
                      )
        self.index[name] = x + 1

    def img(self, name, img_, **kwargs):
        """
        self.img('input_img', t.Tensor(64, 64))
        self.img('input_imgs', t.Tensor(3, 64, 64))
        self.img('input_imgs', t.Tensor(100, 1, 64, 64))
        self.img('input_imgs', t.Tensor(100, 3, 64, 64), nrows=10)
        !!! don't ~~self.img('input_imgs', t.Tensor(100, 64, 64), nrows=10)~~ !!!
        """
        self.vis.images(t.Tensor(img_).cpu().numpy(),
                        win=name,
                        opts=dict(title=name),
                        **kwargs
                        )

    def log(self, info, win='log_text'):
        """
        self.log({'loss': 1, 'lr': 0.0001})
        """
        self.log_text += ('[{time}] {info} <br>'.format(
            time=time.strftime('%m%d_%H%M%S'),
            info=info))
        self.vis.text(self.log_text, win)

    def __getattr__(self, name):
        return getattr(self.vis, name)

    def state_dict(self):
        return {
            'index': self.index,
            'vis_kw': self._vis_kw,
            'log_text': self.log_text,
            'env': self.vis.env
        }

    def load_state_dict(self, d):
        # restore the visdom connection and plotting state from a saved dict
        self.vis = visdom.Visdom(env=d.get('env', self.vis.env), **(d.get('vis_kw', dict())))
        self.log_text = d.get('log_text', '')
        self.index = d.get('index', dict())
        return self

The function vis_image takes a (3, H, W) RGB image and displays it.

The function vis_bbox displays an image together with its bboxes and each bbox's label and score.

The function visdom_bbox calls vis_bbox and then fig4vis / fig2data to convert the rendered figure back into an image array.

Visualizer wraps everything that is to be displayed in visdom.
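
A short usage sketch (it needs a running visdom server, started with python -m visdom.server, default port 8097; the commented lines mirror how trainer.py shows ground-truth boxes):

from utils.vis_tool import Visualizer, visdom_bbox
from utils.config import opt

vis = Visualizer(env=opt.env)

vis.plot('total_loss', 0.87)           # append one point to the 'total_loss' curve
vis.log({'lr': opt.lr, 'epoch': 0})    # timestamped text log

# draw ground-truth boxes on an un-normalized image and push it to visdom
# gt_img = visdom_bbox(ori_img_, bbox_, label_)
# vis.img('gt_img', gt_img)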

4.  eval_tool.py

Evaluation of detection results.

The function calc_detection_voc_prec_rec computes per-class precision and recall.

The function calc_detection_voc_ap uses the precision/recall from the first function to compute each class's average precision (AP).

The function eval_detection_voc calls the two functions above and returns the per-class AP and the mAP.
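
The eval_tool.py source is not listed in this post; assuming the ChainerCV-style interface this repo follows (lists of per-image arrays, a use_07_metric flag, and a dict with 'ap' and 'map' keys), a toy call might look like this:

import numpy as np
from utils.eval_tool import eval_detection_voc

# one image, one predicted box exactly matching one ground-truth box of class 11 ('dog')
pred_bboxes = [np.array([[100., 100., 200., 200.]], dtype=np.float32)]
pred_labels = [np.array([11], dtype=np.int32)]
pred_scores = [np.array([0.9], dtype=np.float32)]
gt_bboxes = [np.array([[100., 100., 200., 200.]], dtype=np.float32)]
gt_labels = [np.array([11], dtype=np.int32)]
gt_difficults = [np.array([False])]

result = eval_detection_voc(pred_bboxes, pred_labels, pred_scores,
                            gt_bboxes, gt_labels, gt_difficults,
                            use_07_metric=True)   # VOC2007 11-point AP metric
print(result['ap'])    # per-class AP (NaN for classes with no annotations)
print(result['map'])   # mean AP over the annotated classes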

Notes:

1. bbox coordinates always appear with shape (R, 4). During bounding-box regression the corner coordinates are converted to center coordinates (x, y) plus height and width, because the regression learns offsets and scales and therefore needs that representation (see the sketch after these notes). Everywhere else the bbox is stored as corner coordinates, i.e. (y_min, x_min, y_max, x_max).

2. The implementation uses batch_size = 1, and most of the remaining code also assumes a batch size of 1, so a single image is fed in at a time, which keeps the processing simple.
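
A minimal sketch of the corner-to-center conversion mentioned in note 1 (the same arithmetic that model/utils/bbox_tools.py applies when computing regression targets; the helper name here is illustrative):

import numpy as np

def corner_to_center(bbox):
    # bbox: (R, 4) as (ymin, xmin, ymax, xmax)
    height = bbox[:, 2] - bbox[:, 0]
    width = bbox[:, 3] - bbox[:, 1]
    ctr_y = bbox[:, 0] + 0.5 * height
    ctr_x = bbox[:, 1] + 0.5 * width
    return np.stack([ctr_y, ctr_x, height, width], axis=1)

bbox = np.array([[240., 48., 371., 195.]])   # the dog box from the XML example above
print(corner_to_center(bbox))                # [[305.5  121.5  131.   147. ]]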

Reference:

从编程实现角度学习Faster R-CNN(附极简实现) (Learning Faster R-CNN from an implementation perspective, with a minimal implementation)

Precision, Recall, and AP concepts
