Faster_RCNN 1. Preparation

Summarized from the Faster R-CNN paper and its PyTorch implementation.

Code structure (simple-faster-rcnn-pytorch):

  • data
    • __init__.py
    • dataset.py
    • util.py
    • voc_dataset.py  
  • misc
    • convert_caffe_pretrain.py
    • train_fast.py  
  • model
    • utils
      • nms
        • __init__.py
        • _nms_gpu_post.py
        • build.py
        • non_maximum_suppression.py  
      • __init__.py
      • bbox_tools.py
      • creator_tool.py
      • roi_cupy.py  
    • __init__.py
    • faster_rcnn.py
    • faster_rcnn_vgg16.py
    • region_proposal_network.py
    • roi_module.py  
  • utils
    • __init__.py
    • array_tool.py
    • config.py
    • eval_tool.py
    • vis_tool.py
  • demo.ipynb
  • train.py
  • trainer.py

The code contains four packages: data, misc, model, and utils. The core lives in model, which contains NMS (non-maximum suppression), the RPN implementation, the model definitions, and so on. train.py and trainer.py are the training scripts.

This post covers the first part of the code: the data package and the utils package.

I. The data package

First, download the VOC2007 dataset:

wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCdevkit_08-Jun-2007.tar

Then extract the three archives into a single directory (named VOCdevkit):

tar xvf VOCtrainval_06-Nov-2007.tar
tar xvf VOCtest_06-Nov-2007.tar
tar xvf VOCdevkit_08-Jun-2007.tar

1.  util.py

import numpy as np
from PIL import Image
import random


def read_image(path, dtype=np.float32, color=True):
    """Read an image from a file.

    This function reads an image from given file. The image is CHW format and
    the range of its value is :math:`[0, 255]`. If :obj:`color = True`, the
    order of the channels is RGB.

    Args:
        path (str): A path of image file.
        dtype: The type of array. The default value is :obj:`~numpy.float32`.
        color (bool): This option determines the number of channels.
            If :obj:`True`, the number of channels is three. In this case,
            the order of the channels is RGB. This is the default behaviour.
            If :obj:`False`, this function returns a grayscale image.

    Returns:
        ~numpy.ndarray: An image.
    """
    f = Image.open(path)
    try:
        if color:
            img = f.convert('RGB')
        else:
            img = f.convert('P')
        img = np.asarray(img, dtype=dtype)
    finally:
        if hasattr(f, 'close'):
            f.close()

    if img.ndim == 2:
        # reshape (H, W) -> (1, H, W)
        return img[np.newaxis]
    else:
        # transpose (H, W, C) -> (C, H, W)
        return img.transpose((2, 0, 1))


def resize_bbox(bbox, in_size, out_size):
    """Resize bounding boxes according to image resize.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): An array whose shape is :math:`(R, 4)`.
            :math:`R` is the number of bounding boxes.
        in_size (tuple): A tuple of length 2. The height and the width
            of the image before resized.
        out_size (tuple): A tuple of length 2. The height and the width
            of the image after resized.

    Returns:
        ~numpy.ndarray:
        Bounding boxes rescaled according to the given image shapes.
    """
    bbox = bbox.copy()
    y_scale = float(out_size[0]) / in_size[0]
    x_scale = float(out_size[1]) / in_size[1]
    bbox[:, 0] = y_scale * bbox[:, 0]
    bbox[:, 2] = y_scale * bbox[:, 2]
    bbox[:, 1] = x_scale * bbox[:, 1]
    bbox[:, 3] = x_scale * bbox[:, 3]
    return bbox


def flip_bbox(bbox, size, y_flip=False, x_flip=False):
    """Flip bounding boxes accordingly.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): An array whose shape is :math:`(R, 4)`.
            :math:`R` is the number of bounding boxes.
        size (tuple): A tuple of length 2. The height and the width
            of the image before resized.
        y_flip (bool): Flip bounding box according to a vertical flip of
            an image.
        x_flip (bool): Flip bounding box according to a horizontal flip of
            an image.

    Returns:
        ~numpy.ndarray:
        Bounding boxes flipped according to the given flips.
    """
    H, W = size
    bbox = bbox.copy()
    if y_flip:
        y_max = H - bbox[:, 0]
        y_min = H - bbox[:, 2]
        bbox[:, 0] = y_min
        bbox[:, 2] = y_max
    if x_flip:
        x_max = W - bbox[:, 1]
        x_min = W - bbox[:, 3]
        bbox[:, 1] = x_min
        bbox[:, 3] = x_max
    return bbox


def crop_bbox(
        bbox, y_slice=None, x_slice=None,
        allow_outside_center=True, return_param=False):
    """Translate bounding boxes to fit within the cropped area of an image.

    This method is mainly used together with image cropping.
    This method translates the coordinates of bounding boxes like
    :func:`data.util.translate_bbox`. In addition,
    this function truncates the bounding boxes to fit within the cropped area.
    If a bounding box does not overlap with the cropped area,
    this bounding box will be removed.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): Bounding boxes to be transformed. The shape is
            :math:`(R, 4)`. :math:`R` is the number of bounding boxes.
        y_slice (slice): The slice of y axis.
        x_slice (slice): The slice of x axis.
        allow_outside_center (bool): If this argument is :obj:`False`,
            bounding boxes whose centers are outside of the cropped area
            are removed. The default value is :obj:`True`.
        return_param (bool): If :obj:`True`, this function returns
            indices of kept bounding boxes.

    Returns:
        ~numpy.ndarray or (~numpy.ndarray, dict):

        If :obj:`return_param = False`, returns an array :obj:`bbox`.

        If :obj:`return_param = True`, returns a tuple whose elements are
        :obj:`bbox, param`. :obj:`param` is a dictionary of intermediate
        parameters whose contents are listed below with key, value-type and
        the description of the value.

        * **index** (*numpy.ndarray*): An array holding indices of used \
            bounding boxes.
    """
    t, b = _slice_to_bounds(y_slice)
    l, r = _slice_to_bounds(x_slice)
    crop_bb = np.array((t, l, b, r))

    if allow_outside_center:
        mask = np.ones(bbox.shape[0], dtype=bool)
    else:
        center = (bbox[:, :2] + bbox[:, 2:]) / 2
        mask = np.logical_and(crop_bb[:2] <= center, center < crop_bb[2:]) \
            .all(axis=1)

    bbox = bbox.copy()
    bbox[:, :2] = np.maximum(bbox[:, :2], crop_bb[:2])
    bbox[:, 2:] = np.minimum(bbox[:, 2:], crop_bb[2:])
    bbox[:, :2] -= crop_bb[:2]
    bbox[:, 2:] -= crop_bb[:2]

    mask = np.logical_and(mask, (bbox[:, :2] < bbox[:, 2:]).all(axis=1))
    bbox = bbox[mask]

    if return_param:
        return bbox, {'index': np.flatnonzero(mask)}
    else:
        return bbox


def _slice_to_bounds(slice_):
    if slice_ is None:
        return 0, np.inf

    if slice_.start is None:
        l = 0
    else:
        l = slice_.start

    if slice_.stop is None:
        u = np.inf
    else:
        u = slice_.stop

    return l, u


def translate_bbox(bbox, y_offset=0, x_offset=0):
    """Translate bounding boxes.

    This method is mainly used together with image transforms, such as padding
    and cropping, which translates the left top point of the image from
    coordinate :math:`(0, 0)` to coordinate
    :math:`(y, x) = (y_{offset}, x_{offset})`.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): Bounding boxes to be transformed. The shape is
            :math:`(R, 4)`. :math:`R` is the number of bounding boxes.
        y_offset (int or float): The offset along y axis.
        x_offset (int or float): The offset along x axis.

    Returns:
        ~numpy.ndarray:
        Bounding boxes translated according to the given offsets.
    """
    out_bbox = bbox.copy()
    out_bbox[:, :2] += (y_offset, x_offset)
    out_bbox[:, 2:] += (y_offset, x_offset)

    return out_bbox


def random_flip(img, y_random=False, x_random=False,
                return_param=False, copy=False):
    """Randomly flip an image in vertical or horizontal direction.

    Args:
        img (~numpy.ndarray): An array that gets flipped. This is in
            CHW format.
        y_random (bool): Randomly flip in vertical direction.
        x_random (bool): Randomly flip in horizontal direction.
        return_param (bool): Returns information of flip.
        copy (bool): If False, a view of :obj:`img` will be returned.

    Returns:
        ~numpy.ndarray or (~numpy.ndarray, dict):

        If :obj:`return_param = False`, returns an array :obj:`out_img`
        that is the result of flipping.

        If :obj:`return_param = True`, returns a tuple whose elements are
        :obj:`out_img, param`. :obj:`param` is a dictionary of intermediate
        parameters whose contents are listed below with key, value-type and
        the description of the value.

        * **y_flip** (*bool*): Whether the image was flipped in the \
            vertical direction or not.
        * **x_flip** (*bool*): Whether the image was flipped in the \
            horizontal direction or not.
    """
    y_flip, x_flip = False, False
    if y_random:
        y_flip = random.choice([True, False])
    if x_random:
        x_flip = random.choice([True, False])

    if y_flip:
        img = img[:, ::-1, :]
    if x_flip:
        img = img[:, :, ::-1]

    if copy:
        img = img.copy()

    if return_param:
        return img, {'y_flip': y_flip, 'x_flip': x_flip}
    else:
        return img

Utility functions:

The function read_image first uses PIL to read the image as an RGB (or single-channel) image, then converts it to C×H×W (or 1×H×W) format. Pixel values are in [0, 255].

The function resize_bbox rescales bboxes of shape (R, 4) according to the input and output height and width.

The function flip_bbox flips the input bboxes horizontally and/or vertically, depending on which flips were applied to the image.

The function crop_bbox fits bboxes to the cropped region of an image.

The function translate_bbox shifts bboxes horizontally or vertically by the given offsets.

The function random_flip randomly flips an image (CHW format) horizontally or vertically:

  • img = img[:, ::-1, :]     vertical flip
  • img = img[:, :, ::-1]     horizontal flip
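
A minimal usage sketch of these utilities (the image path and the box values are made up for illustration; any VOC image and any (R, 4) box array work the same way):

import numpy as np
from data.util import read_image, resize_bbox, random_flip, flip_bbox

# hypothetical path; read_image returns (3, H, W) float32 RGB in [0, 255]
img = read_image('VOCdevkit/VOC2007/JPEGImages/000001.jpg')
_, H, W = img.shape

# one box in (ymin, xmin, ymax, xmax) order
bbox = np.array([[240., 48., 371., 195.]], dtype=np.float32)

# rescale the box the same way the image would be resized to an arbitrary (600, 424)
bbox_resized = resize_bbox(bbox, in_size=(H, W), out_size=(600, 424))

# randomly flip the image horizontally and flip the box consistently
img_flipped, params = random_flip(img, x_random=True, return_param=True)
bbox_flipped = flip_bbox(bbox, (H, W), x_flip=params['x_flip'])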

2.  voc_dataset.py

import os
import xml.etree.ElementTree as ET

import numpy as np

from .util import read_image


class VOCBboxDataset:
    """Bounding box dataset for PASCAL `VOC`_.

    .. _`VOC`: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/

    The index corresponds to each image.

    When queried by an index, if :obj:`return_difficult == False`,
    this dataset returns a corresponding
    :obj:`img, bbox, label`, a tuple of an image, bounding boxes and labels.
    This is the default behaviour.
    If :obj:`return_difficult == True`, this dataset returns corresponding
    :obj:`img, bbox, label, difficult`. :obj:`difficult` is a boolean array
    that indicates whether bounding boxes are labeled as difficult or not.

    The bounding boxes are packed into a two dimensional tensor of shape
    :math:`(R, 4)`, where :math:`R` is the number of bounding boxes in
    the image. The second axis represents attributes of the bounding box.
    They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`, where the
    four attributes are coordinates of the top left and the bottom right
    vertices.

    The labels are packed into a one dimensional tensor of shape :math:`(R,)`.
    :math:`R` is the number of bounding boxes in the image.
    The class name of the label :math:`l` is :math:`l` th element of
    :obj:`VOC_BBOX_LABEL_NAMES`.

    The array :obj:`difficult` is a one dimensional boolean array of shape
    :math:`(R,)`. :math:`R` is the number of bounding boxes in the image.
    If :obj:`use_difficult` is :obj:`False`, this array is
    a boolean array with all :obj:`False`.

    The type of the image, the bounding boxes and the labels are as follows.

    * :obj:`img.dtype == numpy.float32`
    * :obj:`bbox.dtype == numpy.float32`
    * :obj:`label.dtype == numpy.int32`
    * :obj:`difficult.dtype == numpy.bool`

    Args:
        data_dir (string): Path to the root of the training data.
            i.e. "/data/image/voc/VOCdevkit/VOC2007/"
        split ({'train', 'val', 'trainval', 'test'}): Select a split of the
            dataset. :obj:`test` split is only available for
            2007 dataset.
        year ({'2007', '2012'}): Use a dataset prepared for a challenge
            held in :obj:`year`.
        use_difficult (bool): If :obj:`True`, use images that are labeled as
            difficult in the original annotation.
        return_difficult (bool): If :obj:`True`, this dataset returns
            a boolean array
            that indicates whether bounding boxes are labeled as difficult
            or not. The default value is :obj:`False`.
    """

    def __init__(self, data_dir, split='trainval',
                 use_difficult=False, return_difficult=False,
                 ):
        # if split not in ['train', 'trainval', 'val']:
        #     if not (split == 'test' and year == '2007'):
        #         warnings.warn(
        #             'please pick split from \'train\', \'trainval\', \'val\''
        #             'for 2012 dataset. For 2007 dataset, you can pick \'test\''
        #             ' in addition to the above mentioned splits.'
        #         )
        id_list_file = os.path.join(
            data_dir, 'ImageSets/Main/{0}.txt'.format(split))

        self.ids = [id_.strip() for id_ in open(id_list_file)]
        self.data_dir = data_dir
        self.use_difficult = use_difficult
        self.return_difficult = return_difficult
        self.label_names = VOC_BBOX_LABEL_NAMES

    def __len__(self):
        return len(self.ids)

    def get_example(self, i):
        """Returns the i-th example.

        Returns a color image and bounding boxes. The image is in CHW format.
        The returned image is RGB.

        Args:
            i (int): The index of the example.

        Returns:
            tuple of an image and bounding boxes
        """
        id_ = self.ids[i]
        anno = ET.parse(
            os.path.join(self.data_dir, 'Annotations', id_ + '.xml'))
        bbox = list()
        label = list()
        difficult = list()
        for obj in anno.findall('object'):
            # when not using the difficult split and the object is
            # difficult, skip it.
            if not self.use_difficult and int(obj.find('difficult').text) == 1:
                continue

            difficult.append(int(obj.find('difficult').text))
            bndbox_anno = obj.find('bndbox')
            # subtract 1 to make pixel indexes 0-based
            bbox.append([
                int(bndbox_anno.find(tag).text) - 1
                for tag in ('ymin', 'xmin', 'ymax', 'xmax')])
            name = obj.find('name').text.lower().strip()
            label.append(VOC_BBOX_LABEL_NAMES.index(name))
        bbox = np.stack(bbox).astype(np.float32)
        label = np.stack(label).astype(np.int32)
        # When `use_difficult==False`, all elements in `difficult` are False.
        difficult = np.array(difficult, dtype=np.bool).astype(np.uint8)  # PyTorch doesn't support np.bool

        # Load the image
        img_file = os.path.join(self.data_dir, 'JPEGImages', id_ + '.jpg')
        img = read_image(img_file, color=True)

        # if self.return_difficult:
        #     return img, bbox, label, difficult
        return img, bbox, label, difficult

    __getitem__ = get_example


VOC_BBOX_LABEL_NAMES = (
    'aeroplane',
    'bicycle',
    'bird',
    'boat',
    'bottle',
    'bus',
    'car',
    'cat',
    'chair',
    'cow',
    'diningtable',
    'dog',
    'horse',
    'motorbike',
    'person',
    'pottedplant',
    'sheep',
    'sofa',
    'train',
    'tvmonitor')

Implements the VOC2007 dataset class: 9,963 images in total.

VOC2007 provides the splits {'train', 'val', 'trainval', 'test'} and contains 20 object classes (21 including background). The four splits hold 2501, 2510, 5011, and 4952 images respectively (trainval = train + val). VOC2012 has no test split.

Training uses the trainval split; testing uses the test split.

Each image's annotations are stored in an XML file:

<annotation>
    <folder>VOC2007</folder>
    <filename>000001.jpg</filename>
    <source>
        <database>The VOC2007 Database</database>
        <annotation>PASCAL VOC2007</annotation>
        <image>flickr</image>
        <flickrid>341012865</flickrid>
    </source>
    <owner>
        <flickrid>Fried Camels</flickrid>
        <name>Jinky the Fruit Bat</name>
    </owner>
    <size>
        <width>353</width>
        <height>500</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>dog</name>
        <pose>Left</pose>
        <truncated>1</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>48</xmin>
            <ymin>240</ymin>
            <xmax>195</xmax>
            <ymax>371</ymax>
        </bndbox>
    </object>
    <object>
        <name>person</name>
        <pose>Left</pose>
        <truncated>1</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>8</xmin>
            <ymin>12</ymin>
            <xmax>352</xmax>
            <ymax>498</ymax>
        </bndbox>
    </object>
</annotation>

Each XML file gives the image size and, for every object, the bbox coordinates, the bbox's label, and whether it is marked as difficult.

The class VOCBboxDataset inherits from the base object class; to instantiate it you only need to supply the path to the VOC dataset.

VOCBboxDataset has essentially one method, get_example (aliased to __getitem__), which returns the information of the i-th image (image, bbox, label, difficult).
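
A short usage sketch (the data_dir path is an assumption; point it at your extracted VOCdevkit/VOC2007):

from data.voc_dataset import VOCBboxDataset, VOC_BBOX_LABEL_NAMES

dataset = VOCBboxDataset('VOCdevkit/VOC2007/', split='trainval')
print(len(dataset))                        # 5011 images in trainval

img, bbox, label, difficult = dataset[0]   # __getitem__ == get_example
print(img.shape, img.dtype)                # (3, H, W) float32 RGB in [0, 255]
print(bbox.shape, label.shape)             # (R, 4) float32, (R,) int32
print([VOC_BBOX_LABEL_NAMES[l] for l in label])   # class names of the boxes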

3.  dataset.py

import torch as t
from skimage import transform as sktsf
from torchvision import transforms as tvtsf
import numpy as np

from .voc_dataset import VOCBboxDataset
from . import util
from utils.config import opt


def inverse_normalize(img):
    if opt.caffe_pretrain:
        img = img + (np.array([122.7717, 115.9465, 102.9801]).reshape(3, 1, 1))
        return img[::-1, :, :]
    # approximate un-normalize for visualize
    return (img * 0.225 + 0.45).clip(min=0, max=1) * 255


def pytorch_normalze(img):
    """
    https://github.com/pytorch/vision/issues/223
    return appr -1~1 RGB
    """
    normalize = tvtsf.Normalize(mean=[0.485, 0.456, 0.406],
                                std=[0.229, 0.224, 0.225])
    img = normalize(t.from_numpy(img))
    return img.numpy()


def caffe_normalize(img):
    """
    return appr -125-125 BGR
    """
    img = img[[2, 1, 0], :, :]  # RGB -> BGR
    img = img * 255
    mean = np.array([122.7717, 115.9465, 102.9801]).reshape(3, 1, 1)
    img = (img - mean).astype(np.float32, copy=True)
    return img


def preprocess(img, min_size=600, max_size=1000):
    """Preprocess an image for feature extraction.

    The length of the shorter edge is scaled to :obj:`self.min_size`.
    After the scaling, if the length of the longer edge is longer than
    :obj:`self.max_size`, the image is scaled to fit the longer edge
    to :obj:`self.max_size`.

    After resizing the image, the image is subtracted by a mean image value
    :obj:`self.mean`.

    Args:
        img (~numpy.ndarray): An image. This is in CHW and RGB format.
            The range of its value is :math:`[0, 255]`.

    Returns:
        ~numpy.ndarray: A preprocessed image.
    """
    C, H, W = img.shape
    scale1 = min_size / min(H, W)
    scale2 = max_size / max(H, W)
    scale = min(scale1, scale2)
    img = img / 255.
    img = sktsf.resize(img, (C, H * scale, W * scale), mode='reflect')
    # both the longer and shorter should be less than
    # max_size and min_size
    if opt.caffe_pretrain:
        normalize = caffe_normalize
    else:
        normalize = pytorch_normalze
    return normalize(img)


class Transform(object):

    def __init__(self, min_size=600, max_size=1000):
        self.min_size = min_size
        self.max_size = max_size

    def __call__(self, in_data):
        img, bbox, label = in_data
        _, H, W = img.shape
        img = preprocess(img, self.min_size, self.max_size)
        _, o_H, o_W = img.shape
        scale = o_H / H
        bbox = util.resize_bbox(bbox, (H, W), (o_H, o_W))

        # horizontally flip
        img, params = util.random_flip(
            img, x_random=True, return_param=True)
        bbox = util.flip_bbox(
            bbox, (o_H, o_W), x_flip=params['x_flip'])

        return img, bbox, label, scale


class Dataset:
    def __init__(self, opt):
        self.opt = opt
        self.db = VOCBboxDataset(opt.voc_data_dir)
        self.tsf = Transform(opt.min_size, opt.max_size)

    def __getitem__(self, idx):
        ori_img, bbox, label, difficult = self.db.get_example(idx)

        img, bbox, label, scale = self.tsf((ori_img, bbox, label))
        # TODO: check whose stride is negative to fix this instead copy all
        # some of the strides of a given numpy array are negative.
        return img.copy(), bbox.copy(), label.copy(), scale

    def __len__(self):
        return len(self.db)


class TestDataset:
    def __init__(self, opt, split='test', use_difficult=True):
        self.opt = opt
        self.db = VOCBboxDataset(opt.voc_data_dir, split=split, use_difficult=use_difficult)

    def __getitem__(self, idx):
        ori_img, bbox, label, difficult = self.db.get_example(idx)
        img = preprocess(ori_img)
        return img, ori_img.shape[1:], bbox, label, difficult

    def __len__(self):
        return len(self.db)

Building the dataset:

The function inverse_normalize undoes the normalization for both the caffe and torchvision variants, since either the caffe VGG pretrained weights or the torchvision ones can be used (the latter gives slightly worse results than the former).

The function pytorch_normalze normalizes the input image for the PyTorch model: RGB in [0, 255] is converted to RGB in [0, 1] and then normalized to roughly [-1, 1] RGB.

The function caffe_normalize normalizes the input image for the caffe model: the [0, 1] RGB image is converted to BGR, scaled back to [0, 255], and the per-channel mean is subtracted, giving roughly [-125, 125] BGR.

The function preprocess implements the image preprocessing: the image returned by read_image is CHW in [0, 255]; it is first divided by 255 and then resized so that, as in the paper, the longer edge is at most 1000 and the shorter edge is at most 600 while keeping the aspect ratio. Finally pytorch_normalze or caffe_normalize is applied.
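
To make the scaling rule concrete, here is the arithmetic for a hypothetical 500×353 (H×W) image, using the same formula as preprocess:

H, W = 500, 353
scale1 = 600 / min(H, W)     # 600 / 353 ≈ 1.700 (short edge -> 600)
scale2 = 1000 / max(H, W)    # 1000 / 500 = 2.0  (long edge -> 1000)
scale = min(scale1, scale2)  # ≈ 1.700, the more restrictive factor wins
print(round(H * scale), round(W * scale))   # ≈ 850 x 600: short edge is 600, long edge stays ≤ 1000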

Transform implements the preprocessing pipeline through its __call__ method: it applies preprocess to the image, rescales the bbox by the same factor as the image, and then randomly flips image and bbox horizontally together.

Dataset generates the training samples, i.e. the trainval split. Its __getitem__ method uses VOCBboxDataset to load one training image and processes it with Transform, returning the processed image, bbox, label, and scale.

TestDataset generates the test samples, i.e. the test split. Its __getitem__ method uses VOCBboxDataset to load one test image; unlike training it only calls preprocess, so the bbox is not resized and the image's original size (before preprocessing) is returned alongside.
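A rough sketch of how these datasets are consumed (this mirrors the way train.py wires them into DataLoaders; batch_size is fixed to 1 in this codebase, see note 2 at the end of this post):

from torch.utils import data as data_
from data.dataset import Dataset, TestDataset
from utils.config import opt

train_set = Dataset(opt)
train_loader = data_.DataLoader(train_set, batch_size=1, shuffle=True,
                                num_workers=opt.num_workers)

test_set = TestDataset(opt)
test_loader = data_.DataLoader(test_set, batch_size=1, shuffle=False,
                               num_workers=opt.test_num_workers)

for img, bbox, label, scale in train_loader:
    # img: (1, 3, o_H, o_W) normalized; bbox: (1, R, 4); label: (1, R); scale: float
    break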

II. The utils package

1.  array_tool.py

"""
tools to convert specified type
"""
import torch as t
import numpy as np def tonumpy(data):
if isinstance(data, np.ndarray):
return data
if isinstance(data, t._TensorBase):
return data.cpu().numpy()
if isinstance(data, t.autograd.Variable):
return tonumpy(data.data) def totensor(data, cuda=True):
if isinstance(data, np.ndarray):
tensor = t.from_numpy(data)
if isinstance(data, t._TensorBase):
tensor = data
if isinstance(data, t.autograd.Variable):
tensor = data.data
if cuda:
tensor = tensor.cuda()
return tensor def tovariable(data):
if isinstance(data, np.ndarray):
return tovariable(totensor(data))
if isinstance(data, t._TensorBase):
return t.autograd.Variable(data)
if isinstance(data, t.autograd.Variable):
return data
else:
raise ValueError("UnKnow data type: %s, input should be {np.ndarray,Tensor,Variable}" %type(data)) def scalar(data):
if isinstance(data, np.ndarray):
return data.reshape(1)[0]
if isinstance(data, t._TensorBase):
return data.view(1)[0]
if isinstance(data, t.autograd.Variable):
return data.data.view(1)[0]

Type-conversion helpers for converting between torch Tensors, numpy arrays, and Variables. (Note that t._TensorBase and t.autograd.Variable come from the pre-0.4 PyTorch API this repo was written against.)
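
A small usage sketch (assuming a PyTorch version old enough to expose t._TensorBase, as the module requires):

import numpy as np
from utils import array_tool as at

a = np.zeros((2, 3), dtype=np.float32)
tensor = at.totensor(a, cuda=False)        # np.ndarray -> torch Tensor (optionally moved to GPU)
back = at.tonumpy(tensor)                  # Tensor / Variable -> np.ndarray
loss_value = at.scalar(np.array([0.5]))    # pull a Python scalar out of a 1-element array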

2.  config.py

from pprint import pprint


# Default Configs for training
# NOTE that, config items could be overwriten by passing argument through command line.
# e.g. --voc-data-dir='./data/'

class Config:
    # data
    voc_data_dir = '/home/cy/.chainer/dataset/pfnet/chainercv/voc/VOCdevkit/VOC2007/'
    min_size = 600   # image resize
    max_size = 1000  # image resize
    num_workers = 8
    test_num_workers = 8

    # sigma for l1_smooth_loss
    rpn_sigma = 3.
    roi_sigma = 1.

    # param for optimizer
    # 0.0005 in origin paper but 0.0001 in tf-faster-rcnn
    weight_decay = 0.0005
    lr_decay = 0.1  # 1e-3 -> 1e-4
    lr = 1e-3

    # visualization
    env = 'faster-rcnn'  # visdom env
    port = 8097
    plot_every = 40  # vis every N iter

    # preset
    data = 'voc'
    pretrained_model = 'vgg16'

    # training
    epoch = 14

    use_adam = False     # Use Adam optimizer
    use_chainer = False  # try match everything as chainer
    use_drop = False     # use dropout in RoIHead
    # debug
    debug_file = '/tmp/debugf'

    test_num = 10000
    # model
    load_path = None

    caffe_pretrain = False  # use caffe pretrained model instead of torchvision
    caffe_pretrain_path = 'checkpoints/vgg16-caffe.pth'

    def _parse(self, kwargs):
        state_dict = self._state_dict()
        for k, v in kwargs.items():
            if k not in state_dict:
                raise ValueError('UnKnown Option: "--%s"' % k)
            setattr(self, k, v)

        print('======user config========')
        pprint(self._state_dict())
        print('==========end============')

    def _state_dict(self):
        return {k: getattr(self, k) for k, _ in Config.__dict__.items()
                if not k.startswith('_')}


opt = Config()

Configuration file. It covers the dataset path, the visdom environment, image sizes, the type of pretrained weights, the learning rate, and the other hyperparameters.
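
Defaults can be overridden programmatically through opt._parse (train.py forwards its command-line keyword arguments to this method, e.g. python train.py train --voc-data-dir=/path/to/VOCdevkit/VOC2007/); the path below is a placeholder:

from utils.config import opt

opt._parse({'voc_data_dir': '/path/to/VOCdevkit/VOC2007/',
            'caffe_pretrain': True,
            'plot_every': 100})
print(opt.voc_data_dir)   # the overridden value; _parse also pretty-prints the full config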

3.  vis_tool.py

import time

import numpy as np
import matplotlib
import torch as t
import visdom

matplotlib.use('Agg')
from matplotlib import pyplot as plot

# from data.voc_dataset import VOC_BBOX_LABEL_NAMES

VOC_BBOX_LABEL_NAMES = (
    'fly',
    'bike',
    'bird',
    'boat',
    'pin',
    'bus',
    'c',
    'cat',
    'chair',
    'cow',
    'table',
    'dog',
    'horse',
    'moto',
    'p',
    'plant',
    'shep',
    'sofa',
    'train',
    'tv',
)


def vis_image(img, ax=None):
    """Visualize a color image.

    Args:
        img (~numpy.ndarray): An array of shape :math:`(3, height, width)`.
            This is in RGB format and the range of its value is
            :math:`[0, 255]`.
        ax (matplotlib.axes.Axis): The visualization is displayed on this
            axis. If this is :obj:`None` (default), a new axis is created.

    Returns:
        ~matploblib.axes.Axes:
        Returns the Axes object with the plot for further tweaking.
    """
    if ax is None:
        fig = plot.figure()
        ax = fig.add_subplot(1, 1, 1)
    # CHW -> HWC
    img = img.transpose((1, 2, 0))
    ax.imshow(img.astype(np.uint8))
    return ax


def vis_bbox(img, bbox, label=None, score=None, ax=None):
    """Visualize bounding boxes inside image.

    Args:
        img (~numpy.ndarray): An array of shape :math:`(3, height, width)`.
            This is in RGB format and the range of its value is
            :math:`[0, 255]`.
        bbox (~numpy.ndarray): An array of shape :math:`(R, 4)`, where
            :math:`R` is the number of bounding boxes in the image.
            Each element is organized
            by :math:`(y_{min}, x_{min}, y_{max}, x_{max})` in the second axis.
        label (~numpy.ndarray): An integer array of shape :math:`(R,)`.
            The values correspond to id for label names stored in
            :obj:`label_names`. This is optional.
        score (~numpy.ndarray): A float array of shape :math:`(R,)`.
            Each value indicates how confident the prediction is.
            This is optional.
        label_names (iterable of strings): Name of labels ordered according
            to label ids. If this is :obj:`None`, labels will be skipped.
        ax (matplotlib.axes.Axis): The visualization is displayed on this
            axis. If this is :obj:`None` (default), a new axis is created.

    Returns:
        ~matploblib.axes.Axes:
        Returns the Axes object with the plot for further tweaking.
    """
    label_names = list(VOC_BBOX_LABEL_NAMES) + ['bg']
    # add for index `-1`
    if label is not None and not len(bbox) == len(label):
        raise ValueError('The length of label must be same as that of bbox')
    if score is not None and not len(bbox) == len(score):
        raise ValueError('The length of score must be same as that of bbox')

    # Returns newly instantiated matplotlib.axes.Axes object if ax is None
    ax = vis_image(img, ax=ax)

    # If there is no bounding box to display, visualize the image and exit.
    if len(bbox) == 0:
        return ax

    for i, bb in enumerate(bbox):
        xy = (bb[1], bb[0])
        height = bb[2] - bb[0]
        width = bb[3] - bb[1]
        ax.add_patch(plot.Rectangle(
            xy, width, height, fill=False, edgecolor='red', linewidth=2))

        caption = list()

        if label is not None and label_names is not None:
            lb = label[i]
            if not (-1 <= lb < len(label_names)):  # modified here to add background
                raise ValueError('No corresponding name is given')
            caption.append(label_names[lb])
        if score is not None:
            sc = score[i]
            caption.append('{:.2f}'.format(sc))

        if len(caption) > 0:
            ax.text(bb[1], bb[0],
                    ': '.join(caption),
                    style='italic',
                    bbox={'facecolor': 'white', 'alpha': 0.5, 'pad': 0})
    return ax


def fig2data(fig):
    """
    brief Convert a Matplotlib figure to a 4D numpy array with RGBA
    channels and return it

    @param fig: a matplotlib figure
    @return a numpy 3D array of RGBA values
    """
    # draw the renderer
    fig.canvas.draw()

    # Get the RGBA buffer from the figure
    w, h = fig.canvas.get_width_height()
    buf = np.fromstring(fig.canvas.tostring_argb(), dtype=np.uint8)
    buf.shape = (w, h, 4)

    # canvas.tostring_argb gives a pixmap in ARGB mode. Roll the ALPHA channel to have it in RGBA mode
    buf = np.roll(buf, 3, axis=2)
    return buf.reshape(h, w, 4)


def fig4vis(fig):
    """
    convert figure to ndarray
    """
    ax = fig.get_figure()
    img_data = fig2data(ax).astype(np.int32)
    plot.close()
    # HWC -> CHW
    return img_data[:, :, :3].transpose((2, 0, 1)) / 255.


def visdom_bbox(*args, **kwargs):
    fig = vis_bbox(*args, **kwargs)
    data = fig4vis(fig)
    return data


class Visualizer(object):
    """
    wrapper for visdom
    you can still access native visdom functions such as
    self.line, self.scatter, self._send, etc.
    due to the implementation of `__getattr__`
    """

    def __init__(self, env='default', **kwargs):
        self.vis = visdom.Visdom(env=env, **kwargs)
        self._vis_kw = kwargs

        # e.g. ('loss', 23): the 23rd value of loss
        self.index = {}
        self.log_text = ''

    def reinit(self, env='default', **kwargs):
        """
        change the config of visdom
        """
        self.vis = visdom.Visdom(env=env, **kwargs)
        return self

    def plot_many(self, d):
        """
        plot multiple values
        @params d: dict (name, value) i.e. ('loss', 0.11)
        """
        for k, v in d.items():
            if v is not None:
                self.plot(k, v)

    def img_many(self, d):
        for k, v in d.items():
            self.img(k, v)

    def plot(self, name, y, **kwargs):
        """
        self.plot('loss', 1.00)
        """
        x = self.index.get(name, 0)
        self.vis.line(Y=np.array([y]), X=np.array([x]),
                      win=name,
                      opts=dict(title=name),
                      update=None if x == 0 else 'append',
                      **kwargs
                      )
        self.index[name] = x + 1

    def img(self, name, img_, **kwargs):
        """
        self.img('input_img', t.Tensor(64, 64))
        self.img('input_imgs', t.Tensor(3, 64, 64))
        self.img('input_imgs', t.Tensor(100, 1, 64, 64))
        self.img('input_imgs', t.Tensor(100, 3, 64, 64), nrows=10)
        !!! don't ~~self.img('input_imgs', t.Tensor(100, 64, 64), nrows=10)~~ !!!
        """
        self.vis.images(t.Tensor(img_).cpu().numpy(),
                        win=name,
                        opts=dict(title=name),
                        **kwargs
                        )

    def log(self, info, win='log_text'):
        """
        self.log({'loss': 1, 'lr': 0.0001})
        """
        self.log_text += ('[{time}] {info} <br>'.format(
            time=time.strftime('%m%d_%H%M%S'),
            info=info))
        self.vis.text(self.log_text, win)

    def __getattr__(self, name):
        return getattr(self.vis, name)

    def state_dict(self):
        return {
            'index': self.index,
            'vis_kw': self._vis_kw,
            'log_text': self.log_text,
            'env': self.vis.env
        }

    def load_state_dict(self, d):
        # restore the visdom connection and plotting state from a saved dict
        self.vis = visdom.Visdom(env=d.get('env', self.vis.env), **(d.get('vis_kw', dict())))
        self.log_text = d.get('log_text', '')
        self.index = d.get('index', dict())
        return self

The function vis_image takes a (3, H, W) RGB image and displays it.

The function vis_bbox displays an image together with its bboxes and each bbox's label and score.

The function visdom_bbox calls vis_bbox and then fig4vis / fig2data to convert the rendered figure back into an image array.

Visualizer wraps everything that is to be displayed in visdom.
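
A short usage sketch (it needs a running visdom server, started with python -m visdom.server, default port 8097; the commented lines mirror how trainer.py shows ground-truth boxes):

from utils.vis_tool import Visualizer, visdom_bbox
from utils.config import opt

vis = Visualizer(env=opt.env)

vis.plot('total_loss', 0.87)           # append one point to the 'total_loss' curve
vis.log({'lr': opt.lr, 'epoch': 0})    # timestamped text log

# draw ground-truth boxes on an un-normalized image and push it to visdom
# gt_img = visdom_bbox(ori_img_, bbox_, label_)
# vis.img('gt_img', gt_img)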

4.  eval_tool.py

Evaluation of detection results.

The function calc_detection_voc_prec_rec computes per-class precision and recall.

The function calc_detection_voc_ap uses the precision/recall from the first function to compute each class's average precision (AP).

The function eval_detection_voc calls the two functions above and returns the per-class AP and the mAP.
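
The eval_tool.py source is not listed in this post; assuming the ChainerCV-style interface this repo follows (lists of per-image arrays, a use_07_metric flag, and a dict with 'ap' and 'map' keys), a toy call might look like this:

import numpy as np
from utils.eval_tool import eval_detection_voc

# one image, one predicted box exactly matching one ground-truth box of class 11 ('dog')
pred_bboxes = [np.array([[100., 100., 200., 200.]], dtype=np.float32)]
pred_labels = [np.array([11], dtype=np.int32)]
pred_scores = [np.array([0.9], dtype=np.float32)]
gt_bboxes = [np.array([[100., 100., 200., 200.]], dtype=np.float32)]
gt_labels = [np.array([11], dtype=np.int32)]
gt_difficults = [np.array([False])]

result = eval_detection_voc(pred_bboxes, pred_labels, pred_scores,
                            gt_bboxes, gt_labels, gt_difficults,
                            use_07_metric=True)   # VOC2007 11-point AP metric
print(result['ap'])    # per-class AP (NaN for classes with no annotations)
print(result['map'])   # mean AP over the annotated classes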

Notes:

1. bbox coordinates always appear with shape (R, 4). During bounding-box regression the corner coordinates are converted to center coordinates (x, y) plus height and width, because the regression learns offsets and scales and therefore needs that representation (see the sketch after these notes). Everywhere else the bbox is stored as corner coordinates, i.e. (y_min, x_min, y_max, x_max).

2. The implementation uses batch_size = 1, and most of the remaining code also assumes a batch size of 1, so a single image is fed in at a time, which keeps the processing simple.
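
A minimal sketch of the corner-to-center conversion mentioned in note 1 (the same arithmetic that model/utils/bbox_tools.py applies when computing regression targets; the helper name here is illustrative):

import numpy as np

def corner_to_center(bbox):
    # bbox: (R, 4) as (ymin, xmin, ymax, xmax)
    height = bbox[:, 2] - bbox[:, 0]
    width = bbox[:, 3] - bbox[:, 1]
    ctr_y = bbox[:, 0] + 0.5 * height
    ctr_x = bbox[:, 1] + 0.5 * width
    return np.stack([ctr_y, ctr_x, height, width], axis=1)

bbox = np.array([[240., 48., 371., 195.]])   # the dog box from the XML example above
print(corner_to_center(bbox))                # [[305.5  121.5  131.   147. ]]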

Reference:

从编程实现角度学习Faster R-CNN(附极简实现) (Learning Faster R-CNN from an implementation perspective, with a minimal implementation)

Precision, Recall, and AP concepts
