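Previous post: Mask R-CNN: a very detailed code walkthrough (Part 1)
(Quietly: I got busy with other things and this update slipped by a month... From here on I will try to post daily until everything is written up.)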
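1 Summary of the network structure from Part 1 (skip this if you have just read Part 1)
Part 1 covered three components: the Resnet Graph, the Region Proposal Network (RPN), and the Proposal Layer. (The MaskRCNN class ties them all together.)
The Resnet Graph is a stack of convolutions whose purpose is feature extraction. An image fed into the network first goes through the Resnet Graph, which yields [C1, C2, C3, C4, C5]; these features are the foundation for everything that follows.
As the later walkthrough of the MaskRCNN class will show, the C-series features are turned into P-series features: C2 to C5 go through lateral 1x1 convolutions and top-down upsampling, followed by a 3x3 convolution, giving [P2, P3, P4, P5]; P6 is then obtained from P5 by max pooling. [P2, P3, P4, P5, P6] are fed as feature maps into the Region Proposal Network (RPN), which produces rpn_class_logits, rpn_class, and rpn_bbox.
These outputs are then fed into the Proposal Layer to obtain the proposals.
That is how the three components from Part 1 fit together. This post continues with the structure of the ROIAlign Layer, the Detection Target Layer, and the Feature Pyramid Network Heads.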
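2 Continuing through the training code
2.1 ROIAlign Layer
ROIAlign is the hardest part to understand. In the code, ROIAlign consists of two pieces:
- The function def log2_graph(x): surprisingly, TensorFlow has no built-in way to compute $\log_2 x$, so the code defines its own, which simply returns tf.log(x) / tf.log(2.0)
- The class class PyramidROIAlign(KE.Layer): like the other layers it inherits from KE.Layer, so that data processed by TensorFlow ops can be picked up again by Keras; this was explained in Part 1 and is not repeated here.
The rest of this section walks through the PyramidROIAlign class.
First, the __init__ method: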
def __init__(self, pool_shape, **kwargs):
    super(PyramidROIAlign, self).__init__(**kwargs)
    self.pool_shape = tuple(pool_shape)
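As the source shows, instantiating PyramidROIAlign requires a pool_shape argument. This argument is very important: it determines the shape of the features that the ROIAlign layer outputs. Typically pool_shape=(7, 7), which means that whatever the size of the input features, the output features are always 7x7 (ignoring the channel dimension).
This matters because Mask R-CNN is designed to accept images of arbitrary size. For a convolutional layer, the number of parameters = kernel height x kernel width x input channels x number of kernels (output channels), plus one bias per kernel. The kernel size and the number of kernels are hyperparameters you set, so the input image size does not affect the parameter count; it only changes the size of the output features. A convolution can therefore process an image of any size without raising an error.
A dense layer is different: its parameter count depends on the size of its input. For classification the network ends in dense layers, so the features fed into them must have a fixed size. But the images fed into Mask R-CNN do not have a fixed size, so what can be done???
This is where PyramidROIAlign comes in: whatever the size of the features entering the layer, they leave it with a fixed size (pool_shape, usually 7x7). The core trick is a call to tf.image.crop_and_resize.
(Note that it is not the features of the whole input image that are turned into 7x7; if it were, there would only be a resize and no crop. What PyramidROIAlign does is use each ROI's bbox coordinates, together with the ROI's size relative to the whole image, to cut that ROI's features out of feature maps of different scales. The code below makes this process easier to follow.)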
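To make the parameter-count argument above concrete, here is a minimal sketch in plain Python. The helper names conv2d_params and dense_params, the 256-channel width, and the 1024-unit dense layer are illustrative assumptions, not values taken from the Mask R-CNN source.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Parameter count of a 2D convolution (hypothetical helper).

    Note that the height and width of the input feature map do not appear.
    """
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def dense_params(in_features, out_features, bias=True):
    """Parameter count of a dense layer (hypothetical helper)."""
    return in_features * out_features + (out_features if bias else 0)


# A 3x3 convolution with 256 input and 256 output channels:
# the count is the same for a 32x32 input and for a 512x512 input.
print(conv2d_params(3, 3, 256, 256))

# A dense layer with 1024 units on a flattened 7x7x256 feature
# versus a flattened 14x14x256 feature: the counts differ.
print(dense_params(7 * 7 * 256, 1024))
print(dense_params(14 * 14 * 256, 1024))
```

The dense layer's weight matrix changes shape with the input size, which is why the features must be brought to a fixed pool_shape before they reach it.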
Now let's look at how PyramidROIAlign actually does this, i.e. the call(self, inputs) method. The flow of the code is:
(1) Initialization: get boxes, image_meta, and feature_maps from inputs:
def call(self, inputs):
    # num_boxes is the number of proposals
    # Loop over the feature levels, find the proposals that belong to each one, and apply ROIAlign
    # Crop boxes [batch, num_boxes, (y1, x1, y2, x2)] in normalized coords
    boxes = inputs[0]
    print('boxes:',boxes)
    # Image meta
    # Holds details about the image. See compose_image_meta()
    image_meta = inputs[1]
    # Feature Maps. List of feature maps from different level of the
    # feature pyramid. Each is [batch, height, width, channels]
    feature_maps = inputs[2:]
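Where:
- boxes: shape = [batch, num_boxes, (y1, x1, y2, x2)]; the coordinates are normalized.
- image_meta: holds various details about the image, including the size of the original input image, the image id, and so on (although only image_shape is actually used here...). It is produced by compose_image_meta, and its fields can be read back with parse_image_meta(meta); both functions were covered in Part 1.
- feature_maps: the feature maps built from the Resnet Graph features (the pyramid levels); each has shape [batch, height, width, channels]
And how are these arguments passed in? The three inputs are packed into one list, with the feature maps appended one by one:
layer = PyramidROIAlign([7, 7])([boxes, image_meta] + feature_maps)
(2) Using each ROI's area and the original image area carried in image_meta, work out which feature map each ROI should be pooled from.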
def call(self, inputs):
    # (1) Initialization: get boxes, image_meta, feature_maps from `inputs`
    ... # initialization code omitted
    # Assign each ROI to a level in the pyramid based on the ROI area.
    # boxes holds the ROI boxes; they are used to compute each ROI's area
    y1, x1, y2, x2 = tf.split(boxes, 4, axis=2)
    h = y2 - y1  # h.shape=[batch,num_boxes,1]
    w = x2 - x1
    # Use shape of first image. Images in a batch must have the same size.
    # Get the size of the original image so that its area can be computed
    image_shape = parse_image_meta_graph(image_meta)['image_shape'][0]
    # Equation 1 in the Feature Pyramid Networks paper. Account for
    # the fact that our coordinates are normalized here.
    # e.g. a 224x224 ROI (in pixels) maps to P4
    # Area of the original image
    image_area = tf.cast(image_shape[0] * image_shape[1], tf.float32)
    # Two steps to find the pyramid level each ROI should be pooled from
    roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area)))  # h and w are already normalized
    roi_level = tf.minimum(5, tf.maximum(
        2, 4 + tf.cast(tf.round(roi_level), tf.int32)))  # clamp the value to the range 2-5
    roi_level = tf.squeeze(roi_level, 2)  # after the squeeze: roi_level.shape=[batch,num_boxes]
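A few extra notes:
- Why compute roi_level? roi_level (call it $k$) is computed as
$$k = k_0 + \log_2\left(\frac{\sqrt{w \cdot h}}{224}\right)$$
Here $w$ and $h$ are the width and height, in pixels, of the ROI's bounding box, so $\sqrt{w \cdot h}$ measures the size of the ROI. 224 is the input size used for ImageNet pre-training. For example, with $k_0 = 4$, an ROI with $\sqrt{w \cdot h} = 224$ gives $k = 4$, so its features are cropped from P4 in the feature pyramid. Because the boxes in the code are normalized, the code divides by 224.0 / tf.sqrt(image_area) instead of by 224.
If the ROI covers a large part of the original image, it is cropped from a "deeper" feature map (one that has gone through more convolutions, such as P5). If the ROI is a small, inconspicuous object, say $k_0 = 4$ and $\sqrt{w \cdot h} = 112$, then $k = 3$, and the small ROI is cropped from a "shallower" feature map (such as P3). This helps with detecting objects of different sizes.
- The pyramid level assigned to each ROI is stored in roi_level; after the final tf.squeeze, roi_level.shape=[batch,num_boxes]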
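The level assignment described above can be checked by hand with a small sketch in plain Python. The function name roi_level_for and the 1024x1024 image size are assumptions made for illustration; the arithmetic follows the two roi_level lines of the source, with math.log2 standing in for log2_graph.

```python
import math


def roi_level_for(h, w, image_h, image_w, k0=4, min_level=2, max_level=5):
    """Pyramid level for one ROI (hypothetical helper).

    h and w are the normalized height and width of the ROI, in [0, 1].
    """
    image_area = float(image_h * image_w)
    # Same expression as in the source: the 224-pixel reference size is
    # rescaled because h and w are normalized coordinates.
    k = math.log2(math.sqrt(h * w) / (224.0 / math.sqrt(image_area)))
    # Round, shift by k0, then clamp to the available levels P2..P5.
    return min(max_level, max(min_level, k0 + int(round(k))))


# ROIs of different pixel sizes inside an assumed 1024x1024 image.
for side in (32, 112, 224, 448, 1024):
    n = side / 1024.0
    print(side, "px ->", "P%d" % roi_level_for(n, n, 1024, 1024))
```

A 224-pixel ROI lands on P4, halving the side moves it one level down, doubling it moves it one level up, and very small or very large ROIs are clamped to P2 and P5.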
(3) Loop over feature_maps, apply tf.image.crop_and_resize on each level to get the pooled features, and collect them in a list:
def call(self, inputs):
    # (1) Initialization: get boxes, image_meta, feature_maps from `inputs`
    ... # initialization code omitted
    # (2) Work out which feature map each ROI should be pooled from
    ... # code omitted
    # Loop through levels and apply ROI pooling to each. P2 to P5.
    # Uses the four fused pyramid feature maps (P2 to P5)
    pooled = []
    box_to_level = []  # box_to_level[i, 0] is the image index of feature i, box_to_level[i, 1] is its box index
    for i, level in enumerate(range(2, 6)):  # only the four feature maps P2-P5 are used
        # First find the ROIs that must be pooled at this level
        # tf.where returns [coord1, coord2, ...]
        # np.where returns [[coord1.x, coord2.x, ...], [coord1.y, coord2.y, ...]]
        # Each row (n, i) means proposal i of image n (n indexes batch, i indexes num_boxes)
        ix = tf.where(tf.equal(roi_level, level))  # roi_level was squeezed to [batch, num_boxes], so each row of ix holds two numbers
        # level_boxes: the coordinates of every box assigned to this level
        # box_indices: for each of those boxes, the index of its image within the batch
        level_boxes = tf.gather_nd(boxes, ix)  # [number of proposals at this level, 4]
        # Box indices for crop_and_resize.
        box_indices = tf.cast(ix[:, 0], tf.int32)  # image index of each proposal
        # ^ ix[:, 0] is taken because tf.image.crop_and_resize needs it as an argument
        # Keep track of which box is mapped to which level
        box_to_level.append(ix)
        # Stop gradient propogation to ROI proposals
        # level_boxes and box_indices are results computed by the RPN,
        # but the tensor obtained by applying them to the features is the input of the RCNN heads.
        # Gradients must not flow between the two parts, so tf.stop_gradient() cuts them off.
        level_boxes = tf.stop_gradient(level_boxes)
        box_indices = tf.stop_gradient(box_indices)
        # Crop and Resize
        # From Mask R-CNN paper: "We sample four regular locations, so
        # that we can evaluate either max or average pooling. In fact,
        # interpolating only a single value at each bin center (without
        # pooling) is nearly as effective."
        #
        # Here we use the simplified approach of a single value per bin,
        # which is how it's done in tf.crop_and_resize()
        # Result: [batch * num_boxes, pool_height, pool_width, channels]
        # Call the API, which uses bilinear interpolation
        # Arguments of tf.image.crop_and_resize:
        # - image: the feature map
        # - boxes: the regions to cut out, given as normalized [ymin, xmin, ymax, xmax]
        # - box_ind: maps boxes to images; a 1-D tensor of shape [num_boxes], where box_ind[i] is the image that box i refers to
        # - crop_size: the size after RoiAlign
        pooled.append(tf.image.crop_and_resize(
            feature_maps[i], level_boxes, box_indices, self.pool_shape,
            method="bilinear"))
        # Shapes of the arguments:
        # [batch, image_height, image_width, channels]
        # [this_level_num_boxes, 4]
        # [this_level_num_boxes]
        # [pool_height, pool_width]
    # Pack pooled features into one tensor
    # Each box was pooled from the one level it was assigned to; stack the results of all levels into one tensor
    pooled = tf.concat(pooled, axis=0)
    # Pack box_to_level mapping into one array and add another
    # column representing the order of pooled boxes
    box_to_level = tf.concat(box_to_level, axis=0)
    box_range = tf.expand_dims(tf.range(tf.shape(box_to_level)[0]), 1)
    box_to_level = tf.concat([tf.cast(box_to_level, tf.int32), box_range],
                             axis=1)
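A further note on the key function tf.image.crop_and_resize: it first cuts a region out of the feature map using the normalized [ymin, xmin, ymax, xmax] it is given, and then resizes that region to the size you ask for.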
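The crop-then-resize behaviour can be sketched in plain Python for a single-channel image. This is a simplified illustration, not the TensorFlow implementation: the names bilinear and crop_and_resize_2d are hypothetical, the box is assumed to lie inside [0, 1], the crop size is assumed to be larger than 1 in each dimension, and sample positions are assumed to be spread evenly from the box's first row/column to its last, which is my understanding of how tf.image.crop_and_resize places them.

```python
def bilinear(img, y, x):
    """Bilinearly interpolate img (a list of rows) at the float position (y, x)."""
    height, width = len(img), len(img[0])
    y0, x0 = int(y), int(x)  # floor, valid because y and x are non-negative
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
    bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def crop_and_resize_2d(img, box, crop_h, crop_w):
    """Cut the normalized box (y1, x1, y2, x2) out of img and resize it.

    One bilinear sample is taken per output cell.
    """
    y1, x1, y2, x2 = box
    height, width = len(img), len(img[0])
    out = []
    for i in range(crop_h):
        y = y1 * (height - 1) + i * (y2 - y1) * (height - 1) / (crop_h - 1)
        row = []
        for j in range(crop_w):
            x = x1 * (width - 1) + j * (x2 - x1) * (width - 1) / (crop_w - 1)
            row.append(bilinear(img, y, x))
        out.append(row)
    return out


# A made-up 4x4 "feature map" whose value at (r, c) is 4*r + c.
image = [[4 * r + c for c in range(4)] for r in range(4)]
print(crop_and_resize_2d(image, (0.0, 0.0, 1.0, 1.0), 2, 2))  # whole map -> 2x2
print(crop_and_resize_2d(image, (0.0, 0.0, 0.5, 0.5), 2, 2))  # top-left region -> 2x2
```

Whatever the size of the box, the output always has crop_h x crop_w cells, which is exactly the property PyramidROIAlign relies on.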
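Also, the indexing code (the part involving ix) is hard to follow; see Example 1 in Section 3 of this post, which explains the indexing in detail. (Honestly, it is fine not to understand it; that will not get in the way of following the overall idea of the Mask R-CNN code. But understanding it will help when you write your own indexing code later.)
(4) Restore the original box order and reshape, giving an output of shape [batch, num_boxes, pool_height, pool_width, channels]: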
def call(self, inputs):
    ... # code for (1), (2), (3) omitted
    # At this point we have the tensor pooled, which holds all ROIAlign features,
    # and the tensor box_to_level, which records where each of those features came from.
    # Because of the way they were extracted (level by level), the features are not in
    # their original order (sorted by batch, then by box index).
    # Below we restore that order, so that each feature lines up with its proposal in its image.
    # Rearrange pooled features to match the order of the original boxes
    # Sort box_to_level by batch then box index
    # TF doesn't have a way to sort by two columns, so merge them and sort.
    # box_to_level[i, 0] is the image index of feature i, box_to_level[i, 1] is its box index
    sorting_tensor = box_to_level[:, 0] * 100000 + box_to_level[:, 1]
    ix = tf.nn.top_k(sorting_tensor, k=tf.shape(
        box_to_level)[0]).indices[::-1]
    ix = tf.gather(box_to_level[:, 2], ix)
    pooled = tf.gather(pooled, ix)
    # Re-add the batch dimension
    shape = tf.concat([tf.shape(boxes)[:2], tf.shape(pooled)[1:]], axis=0)
    pooled = tf.reshape(pooled, shape)
    return pooled
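The reordering trick (merge the two sort columns into one key, then sort) can be tried on made-up data in plain Python. The function restore_order and the string placeholders standing in for pooled features are illustrative assumptions; the key batch * 100000 + box index is the one used in the source.

```python
def restore_order(box_to_level, pooled):
    """Put pooled features back into (batch, box index) order (hypothetical helper).

    box_to_level: list of (batch_index, box_index), in the order in which
                  the features were pooled (level by level).
    pooled:       list of features in that same order.
    """
    # Merge the two columns into one sortable key, as the source does.
    keys = [batch * 100000 + box for batch, box in box_to_level]
    # Positions of the pooled features, sorted by ascending key.
    order = sorted(range(len(keys)), key=lambda n: keys[n])
    return [pooled[n] for n in order]


# One image with five boxes: P2 got box 3, P3 got boxes 1 and 2,
# P4 got box 0, and P5 got box 4 (made-up assignment).
box_to_level = [(0, 3), (0, 1), (0, 2), (0, 0), (0, 4)]
pooled = ["feat_box3", "feat_box1", "feat_box2", "feat_box0", "feat_box4"]
print(restore_order(box_to_level, pooled))
```

The merged key only sorts correctly while every image has fewer than 100000 boxes, which is a limit built into the constant used by the source.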
2.2 Detection Target Layer
Inputs to the Detection Target Layer (gt stands for ground truth):
- proposals: [POST_NMS_ROIS_TRAINING, (y1, x1, y2, x2)], in normalized coordinates. If the image yields fewer proposals than this, they are zero-padded up to the fixed size
- gt_class_ids: [MAX_GT_INSTANCES] int class IDs
- gt_boxes: [MAX_GT_INSTANCES, (y1, x1, y2, x2)], in normalized coordinates
- gt_masks: [height, width, MAX_GT_INSTANCES] of boolean type.
Returns:
- rois: [TRAIN_ROIS_PER_IMAGE, (y1, x1, y2, x2)], in normalized coordinates
- class_ids: [TRAIN_ROIS_PER_IMAGE]. Integer class IDs, zero-padded up to the fixed size if there are too few.
- deltas: [TRAIN_ROIS_PER_IMAGE, (dy, dx, log(dh), log(dw))]
- masks: [TRAIN_ROIS_PER_IMAGE, height, width]. The masks are cropped to the corresponding bbox and resized to the network's output size.
The code has three parts:
- The overlaps_graph(boxes1, boxes2) function: computes the overlap between two sets of boxes, i.e. the IoU values. This code is simple and is skipped here.
- The detection_targets_graph function: the main detection-target processing flow
- The DetectionTargetLayer class
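Since overlaps_graph is skipped in the list above, here is a minimal plain-Python sketch of the IoU computation it performs. The names iou and overlaps are illustrative; the source implements the same idea with TensorFlow ops on whole tensors.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (y1, x1, y2, x2)."""
    # Corners of the intersection rectangle.
    y1 = max(box_a[0], box_b[0])
    x1 = max(box_a[1], box_b[1])
    y2 = min(box_a[2], box_b[2])
    x2 = min(box_a[3], box_b[3])
    # Width and height are clipped at 0 for boxes that do not overlap.
    intersection = max(y2 - y1, 0.0) * max(x2 - x1, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)


def overlaps(boxes1, boxes2):
    """IoU matrix of shape [len(boxes1), len(boxes2)]."""
    return [[iou(a, b) for b in boxes2] for a in boxes1]


proposals = [(0, 0, 2, 2), (5, 5, 6, 6)]
gt_boxes = [(1, 1, 3, 3), (0, 0, 2, 2)]
for row in overlaps(proposals, gt_boxes):
    print(row)
```

Row i of the matrix holds the IoU of proposal i with every gt box, which is the layout that the later steps reduce over with axis=1.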
The detection_targets_graph implementation proceeds as follows:
(1) Remove zero padding: strip the zero-padded entries from proposals, gt_boxes, gt_class_ids, and gt_masks (gt is short for ground truth)
def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    """Generates detection targets for one image. Subsamples proposals and
    generates target class IDs, bounding box deltas, and masks for each.
    Inputs:
    proposals: [POST_NMS_ROIS_TRAINING, (y1, x1, y2, x2)] in normalized coordinates. Might
               be zero padded if there are not enough proposals.
    gt_class_ids: [MAX_GT_INSTANCES] int class IDs
    gt_boxes: [MAX_GT_INSTANCES, (y1, x1, y2, x2)] in normalized coordinates.
    gt_masks: [height, width, MAX_GT_INSTANCES] of boolean type.
    Returns: Target ROIs and corresponding class IDs, bounding box shifts,
    and masks.
    rois: [TRAIN_ROIS_PER_IMAGE, (y1, x1, y2, x2)] in normalized coordinates
    class_ids: [TRAIN_ROIS_PER_IMAGE]. Integer class IDs. Zero padded.
    deltas: [TRAIN_ROIS_PER_IMAGE, (dy, dx, log(dh), log(dw))]
    masks: [TRAIN_ROIS_PER_IMAGE, height, width]. Masks cropped to bbox
           boundaries and resized to neural network output size.
    Note: Returned arrays might be zero padded if not enough target ROIs.
    """
    # Assertions
    asserts = [
        tf.Assert(tf.greater(tf.shape(proposals)[0], 0), [proposals],
                  name="roi_assertion"),
    ]
    with tf.control_dependencies(asserts):
        proposals = tf.identity(proposals)
    # Remove zero padding
    proposals, _ = trim_zeros_graph(proposals, name="trim_proposals")
    gt_boxes, non_zeros = trim_zeros_graph(gt_boxes, name="trim_gt_boxes")
    gt_class_ids = tf.boolean_mask(gt_class_ids, non_zeros,
                                   name="trim_gt_class_ids")
    gt_masks = tf.gather(gt_masks, tf.where(non_zeros)[:, 0], axis=2,
                         name="trim_gt_masks")
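The zero-padding removal can be illustrated in plain Python. This sketch assumes, based on how trim_zeros_graph is used above, that a box counts as padding when all four of its coordinates are zero; the function name trim_zeros and the sample data are made up for illustration.

```python
def trim_zeros(boxes):
    """Drop all-zero rows (hypothetical stand-in for trim_zeros_graph).

    Returns the kept boxes and a boolean mask marking the rows that were kept.
    """
    non_zeros = [sum(abs(v) for v in box) > 0 for box in boxes]
    kept = [box for box, keep in zip(boxes, non_zeros) if keep]
    return kept, non_zeros


# Two real gt boxes followed by two rows of zero padding (made-up data).
gt_boxes = [
    [0.1, 0.2, 0.5, 0.6],
    [0.3, 0.3, 0.9, 0.8],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
gt_class_ids = [3, 7, 0, 0]

gt_boxes, non_zeros = trim_zeros(gt_boxes)
# The same mask is reused so that class ids stay aligned with the boxes.
gt_class_ids = [c for c, keep in zip(gt_class_ids, non_zeros) if keep]
print(gt_boxes, non_zeros, gt_class_ids)
```

Reusing one mask for boxes, class ids, and masks is what keeps the three ground-truth arrays aligned after trimming.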
(2) Handle crowds (a crowd refers to a bounding box around several instances): use tf.where to get crowd_ix and non_crowd_ix, use tf.gather with crowd_ix to get crowd_boxes, and use non_crowd_ix to keep only the non-crowd gt_class_ids, gt_boxes, and gt_masks
def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    ... # code omitted
    # Handle COCO crowds
    # A crowd box in COCO is a bounding box around several instances. Exclude
    # them from training. A crowd box is given a negative class ID.
    crowd_ix = tf.where(gt_class_ids < 0)[:, 0]
    non_crowd_ix = tf.where(gt_class_ids > 0)[:, 0]
    crowd_boxes = tf.gather(gt_boxes, crowd_ix)
    gt_class_ids = tf.gather(gt_class_ids, non_crowd_ix)
    gt_boxes = tf.gather(gt_boxes, non_crowd_ix)
    gt_masks = tf.gather(gt_masks, non_crowd_ix, axis=2)
(3) Compute the IoU between proposals and gt_boxes and store it in overlaps
def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    ... # code omitted
    # Compute overlaps matrix [proposals, gt_boxes]
    overlaps = overlaps_graph(proposals, gt_boxes)
(4) Compute crowd_overlaps = IoU(proposals, crowd_boxes) and take the maximum for each proposal. A proposal is treated as not overlapping any crowd when crowd_iou_max < 0.001
def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    # Compute overlaps matrix [proposals, gt_boxes]
    ... # code omitted
    # Compute overlaps with crowd boxes [proposals, crowd_boxes]
    crowd_overlaps = overlaps_graph(proposals, crowd_boxes)
    crowd_iou_max = tf.reduce_max(crowd_overlaps, axis=1)
    no_crowd_bool = (crowd_iou_max < 0.001)
(5) Determine positive/negative ROIs: ① a positive ROI is one whose maximum IoU with the gt_boxes is >= 0.5; ② a negative ROI is one whose maximum IoU with the gt_boxes is < 0.5 and which does not overlap a crowd (crowd_iou_max < 0.001)
def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    # Compute overlaps matrix [proposals, gt_boxes]
    # Compute overlaps with crowd boxes [proposals, crowd_boxes]
    ... # code omitted
    # Determine positive and negative ROIs
    roi_iou_max = tf.reduce_max(overlaps, axis=1)
    # 1. Positive ROIs are those with >= 0.5 IoU with a GT box
    positive_roi_bool = (roi_iou_max >= 0.5)
    positive_indices = tf.where(positive_roi_bool)[:, 0]
    # 2. Negative ROIs are those with < 0.5 with every GT box. Skip crowds.
    negative_indices = tf.where(tf.logical_and(roi_iou_max < 0.5, no_crowd_bool))[:, 0]
(6) Using the configured number of ROIs per image and the positive ratio, subsample the proposals so that the positive/negative ratio is kept, giving positive_rois and negative_rois
def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    # Compute overlaps matrix [proposals, gt_boxes]
    # Compute overlaps with crowd boxes [proposals, crowd_boxes]
    # Determine positive and negative ROIs
    ... # code omitted
    # Subsample ROIs. Aim for 33% positive
    # Positive ROIs
    positive_count = int(config.TRAIN_ROIS_PER_IMAGE *
                         config.ROI_POSITIVE_RATIO)
    positive_indices = tf.random_shuffle(positive_indices)[:positive_count]
    positive_count = tf.shape(positive_indices)[0]
    # Negative ROIs. Add enough to maintain positive:negative ratio.
    r = 1.0 / config.ROI_POSITIVE_RATIO
    negative_count = tf.cast(r * tf.cast(positive_count, tf.float32), tf.int32) - positive_count
    negative_indices = tf.random_shuffle(negative_indices)[:negative_count]
    # Gather selected ROIs
    positive_rois = tf.gather(proposals, positive_indices)
    negative_rois = tf.gather(proposals, negative_indices)
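Steps (5) and (6) can be traced on a handful of made-up IoU values with plain Python. The function name subsample_rois is hypothetical, and the defaults 200 and 0.33 are assumptions standing in for config.TRAIN_ROIS_PER_IMAGE and config.ROI_POSITIVE_RATIO; check your own config for the actual values.

```python
import random


def subsample_rois(roi_iou_max, no_crowd, train_rois_per_image=200,
                   positive_ratio=0.33, seed=0):
    """Return (positive_indices, negative_indices) after subsampling."""
    rng = random.Random(seed)
    # Step (5): split the proposals by their best IoU with any gt box.
    positive = [i for i, v in enumerate(roi_iou_max) if v >= 0.5]
    negative = [i for i, (v, ok) in enumerate(zip(roi_iou_max, no_crowd))
                if v < 0.5 and ok]
    # Step (6): shuffle, cap the positives, then size the negatives so that
    # positives make up roughly positive_ratio of the selection.
    rng.shuffle(positive)
    rng.shuffle(negative)
    positive = positive[:int(train_rois_per_image * positive_ratio)]
    negative_count = int((1.0 / positive_ratio) * len(positive)) - len(positive)
    negative = negative[:negative_count]
    return positive, negative


# Eight proposals; proposal 3 overlaps a crowd box and is therefore skipped.
roi_iou_max = [0.9, 0.1, 0.6, 0.2, 0.3, 0.0, 0.4, 0.05]
no_crowd = [True, True, True, False, True, True, True, True]
positive, negative = subsample_rois(roi_iou_max, no_crowd)
print(sorted(positive), sorted(negative))
```

With two positives and a ratio of 0.33, the negative count is int(2 / 0.33) - 2 = 4, so four of the five eligible negatives are kept.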
(7) Assign positive ROIs to GT boxes
roi_gt_box_assignment holds, for each positive ROI, the index of the gt box it overlaps most (the argmax of positive_overlaps along axis 1). This index is then used to gather roi_gt_boxes and roi_gt_class_ids
def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    # Compute overlaps matrix [proposals, gt_boxes]
    # Compute overlaps with crowd boxes [proposals, crowd_boxes]
    # Determine positive and negative ROIs
    # Subsample ROIs. Aim for 33% positive
    ... # code omitted
    # Assign positive ROIs to GT boxes.
    positive_overlaps = tf.gather(overlaps, positive_indices)
    roi_gt_box_assignment = tf.cond(
        tf.greater(tf.shape(positive_overlaps)[1], 0),
        true_fn = lambda: tf.argmax(positive_overlaps, axis=1),
        false_fn = lambda: tf.cast(tf.constant([]),tf.int64)
    )
    roi_gt_boxes = tf.gather(gt_boxes, roi_gt_box_assignment)
    roi_gt_class_ids = tf.gather(gt_class_ids, roi_gt_box_assignment)
(8) Compute the deltas between positive_rois and roi_gt_boxes (both are box coordinates)
def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    # Compute overlaps matrix [proposals, gt_boxes]
    # Compute overlaps with crowd boxes [proposals, crowd_boxes]
    # Determine positive and negative ROIs
    # Subsample ROIs. Aim for 33% positive
    # Assign positive ROIs to GT boxes.
    ... # code omitted
    # Compute bbox refinement for positive ROIs
    deltas = utils.box_refinement_graph(positive_rois, roi_gt_boxes)
    deltas /= config.BBOX_STD_DEV
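The source does not show utils.box_refinement_graph, so the following plain-Python sketch is an assumption about what it computes: the standard R-CNN box parameterization, which matches the (dy, dx, log(dh), log(dw)) layout stated in the docstring. The function name box_refinement and the std_dev value (0.1, 0.1, 0.2, 0.2), standing in for config.BBOX_STD_DEV, are likewise assumptions.

```python
import math


def box_refinement(box, gt_box, std_dev=(0.1, 0.1, 0.2, 0.2)):
    """Deltas that move `box` onto `gt_box`; both are (y1, x1, y2, x2)."""
    height = box[2] - box[0]
    width = box[3] - box[1]
    center_y = box[0] + 0.5 * height
    center_x = box[1] + 0.5 * width

    gt_height = gt_box[2] - gt_box[0]
    gt_width = gt_box[3] - gt_box[1]
    gt_center_y = gt_box[0] + 0.5 * gt_height
    gt_center_x = gt_box[1] + 0.5 * gt_width

    # Center shifts are measured in units of the box size;
    # size changes are measured as log ratios.
    dy = (gt_center_y - center_y) / height
    dx = (gt_center_x - center_x) / width
    dh = math.log(gt_height / height)
    dw = math.log(gt_width / width)
    # Dividing by the standard deviations mirrors `deltas /= config.BBOX_STD_DEV`.
    return [d / s for d, s in zip((dy, dx, dh, dw), std_dev)]


roi = (0.0, 0.0, 0.5, 0.5)
gt = (0.25, 0.25, 0.75, 0.75)
print(box_refinement(roi, gt))   # same size, shifted by half a box
print(box_refinement(roi, roi))  # identical boxes give all-zero deltas
```

Expressing the shift relative to the box size makes the regression target independent of the absolute scale of the ROI.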
(9) Assign positive ROIs to GT masks
Use roi_gt_box_assignment to pick the right roi_masks
def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    # Compute overlaps matrix [proposals, gt_boxes]
    # Compute overlaps with crowd boxes [proposals, crowd_boxes]
    # Determine positive and negative ROIs
    # Subsample ROIs. Aim for 33% positive
    # Assign positive ROIs to GT boxes.
    # Compute bbox refinement for positive ROIs
    ... # code omitted
    # Assign positive ROIs to GT masks
    # Permute masks to [N, height, width, 1]
    transposed_masks = tf.expand_dims(tf.transpose(gt_masks, [2, 0, 1]), -1)
    # Pick the right mask for each ROI
    roi_masks = tf.gather(transposed_masks, roi_gt_box_assignment)
    # Compute mask targets
    boxes = positive_rois
    if config.USE_MINI_MASK:
        # Transform ROI coordinates from normalized image space
        # to normalized mini-mask space.
        # If enabled, resizes instance masks to a smaller size to reduce
        # memory load. Recommended when using high-resolution images.
        y1, x1, y2, x2 = tf.split(positive_rois, 4, axis=1)
        gt_y1, gt_x1, gt_y2, gt_x2 = tf.split(roi_gt_boxes, 4, axis=1)
        gt_h = gt_y2 - gt_y1
        gt_w = gt_x2 - gt_x1
        y1 = (y1 - gt_y1) / gt_h
        x1 = (x1 - gt_x1) / gt_w
        y2 = (y2 - gt_y1) / gt_h
        x2 = (x2 - gt_x1) / gt_w
        boxes = tf.concat([y1, x1, y2, x2], 1)
    box_ids = tf.range(0, tf.shape(roi_masks)[0])
    masks = tf.image.crop_and_resize(tf.cast(roi_masks, tf.float32), boxes,
                                     box_ids,
                                     config.MASK_SHAPE)
    # Remove the extra dimension from masks.
    masks = tf.squeeze(masks, axis=3)
    # Threshold mask pixels at 0.5 to have GT masks be 0 or 1 to use with
    # binary cross entropy loss.
    masks = tf.round(masks)
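The USE_MINI_MASK branch re-expresses each ROI relative to its gt box, because a mini mask only covers the gt box rather than the whole image. A plain-Python sketch of that coordinate change follows; the function name to_mini_mask_coords and the sample boxes are made up for illustration.

```python
def to_mini_mask_coords(roi, gt_box):
    """Map an ROI from normalized image space to normalized mini-mask space.

    Both boxes are (y1, x1, y2, x2). In mini-mask space the gt box itself
    is (0, 0, 1, 1).
    """
    y1, x1, y2, x2 = roi
    gt_y1, gt_x1, gt_y2, gt_x2 = gt_box
    gt_h = gt_y2 - gt_y1
    gt_w = gt_x2 - gt_x1
    return ((y1 - gt_y1) / gt_h, (x1 - gt_x1) / gt_w,
            (y2 - gt_y1) / gt_h, (x2 - gt_x1) / gt_w)


gt = (0.25, 0.25, 0.75, 0.75)
print(to_mini_mask_coords(gt, gt))                      # the gt box maps to the full mini mask
print(to_mini_mask_coords((0.25, 0.25, 0.5, 0.5), gt))  # its top-left quarter
print(to_mini_mask_coords((0.0, 0.0, 0.5, 0.5), gt))    # an ROI sticking out past the gt box
```

An ROI that extends beyond its gt box gets coordinates outside [0, 1]; the subsequent crop_and_resize call then samples outside the stored mini mask for that part of the ROI.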
(10) Concatenate the positive and negative ROIs into rois, and zero-pad roi_gt_class_ids, deltas, and masks
def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    # Remove zero padding
    # Handle COCO crowds
    # Compute overlaps matrix [proposals, gt_boxes]
    # Compute overlaps with crowd boxes [proposals, crowd_boxes]
    # Determine positive and negative ROIs
    # Subsample ROIs. Aim for 33% positive
    # Assign positive ROIs to GT boxes.
    # Compute bbox refinement for positive ROIs
    # Assign positive ROIs to GT masks
    # Compute mask targets
    ... # code omitted
    # Append negative ROIs and pad bbox deltas and masks that
    # are not used for negative ROIs with zeros.
    rois = tf.concat([positive_rois, negative_rois], axis=0)
    N = tf.shape(negative_rois)[0]
    P = tf.maximum(config.TRAIN_ROIS_PER_IMAGE - tf.shape(rois)[0], 0)
    rois = tf.pad(rois, [(0, P), (0, 0)])
    roi_gt_boxes = tf.pad(roi_gt_boxes, [(0, N + P), (0, 0)])
    roi_gt_class_ids = tf.pad(roi_gt_class_ids, [(0, N + P)])
    deltas = tf.pad(deltas, [(0, N + P), (0, 0)])
    masks = tf.pad(masks, [[0, N + P], (0, 0), (0, 0)])
    return rois, roi_gt_class_ids, deltas, masks
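The two padding amounts in step (10) are easy to mix up: rois already contains the negatives and only needs P extra rows, while the per-positive targets need N + P. A plain-Python sketch with made-up sizes (the function pad_targets and the value 8 for TRAIN_ROIS_PER_IMAGE are illustrative assumptions):

```python
def pad_targets(positive_rois, negative_rois, roi_gt_class_ids, train_rois_per_image):
    """Concatenate the ROIs and zero-pad to a fixed length (hypothetical helper)."""
    rois = positive_rois + negative_rois
    n = len(negative_rois)                        # N in the source
    p = max(train_rois_per_image - len(rois), 0)  # P in the source
    # rois already includes the negatives, so only P rows are added.
    rois = rois + [[0.0, 0.0, 0.0, 0.0] for _ in range(p)]
    # Targets exist only for the positives, so N + P zeros are added;
    # class id 0 marks negatives and padding alike.
    roi_gt_class_ids = roi_gt_class_ids + [0] * (n + p)
    return rois, roi_gt_class_ids


positive_rois = [[0.1, 0.1, 0.4, 0.4], [0.5, 0.5, 0.9, 0.9]]
negative_rois = [[0.0, 0.6, 0.2, 0.8], [0.7, 0.0, 0.9, 0.2], [0.3, 0.6, 0.4, 0.7]]
rois, class_ids = pad_targets(positive_rois, negative_rois, [5, 7], 8)
print(len(rois), class_ids)
```

After padding, every returned array has TRAIN_ROIS_PER_IMAGE rows, so the targets of different images can be stacked into one batch.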
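3 About the indexing used in the code
The indexing can be confusing at times, and it is hard to see how the code transforms the data. One trick that helps: make up some fake data, pull the indexing code out on its own, run it, and watch how the values change.
Here are two examples:
Example 1
The source is part of the call method of class PyramidROIAlign(KE.Layer):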
# Uses the four fused pyramid feature maps (P2 to P5)
pooled = []
box_to_level = []  # box_to_level[i, 0] is the image index of feature i, box_to_level[i, 1] is its box index
for i, level in enumerate(range(2, 6)):  # only the four feature maps P2-P5 are used
    # First find the ROIs that must be pooled at this level
    # tf.where returns [coord1, coord2, ...]
    # np.where returns [[coord1.x, coord2.x, ...], [coord1.y, coord2.y, ...]]
    # Each row (n, i) means proposal i of image n (n indexes batch, i indexes num_boxes)
    ix = tf.where(tf.equal(roi_level, level))  # roi_level was squeezed to [batch, num_boxes], so each row of ix holds two numbers
    # level_boxes: the coordinates of every box assigned to this level
    # box_indices: for each of those boxes, the index of its image within the batch
    level_boxes = tf.gather_nd(boxes, ix)  # [number of proposals at this level, 4]
    # Box indices for crop_and_resize.
    box_indices = tf.cast(ix[:, 0], tf.int32)  # image index of each proposal
    # ^ ix[:, 0] is taken because tf.image.crop_and_resize needs it as an argument
    # Keep track of which box is mapped to which level
    box_to_level.append(ix)
This indexing is hard to follow, so let's make up some data and write a bit of code to see exactly how the values change:
import numpy as np
import tensorflow as tf
# A small demo of slicing and indexing.
# Note: this demo keeps the trailing dimension of roi_level (shape [batch, num_boxes, 1]),
# whereas the real layer squeezes it away. Each row of ix therefore has three numbers here,
# and tf.gather_nd(boxes, ix) returns only the first coordinate (y1) of each selected box.
def test():
    box_to_level = []
    # Assume batch=1 and num_boxes=5, and make up some data on that basis:
    roi_level = [[
        [4],
        [3],
        [3],
        [2],
        [5]
    ]]  # roi_level.shape=[batch,num_boxes,1]
    roi_level = np.array(roi_level)
    print('roi_level.shape=', roi_level.shape)
    boxes = [[
        [0.1, 0.3, 0.13, 0.34],
        [0.5, 0.66, 0.67, 0.89],
        [0.4, 0.61, 0.7, 0.8],
        [0.2, 0.3, 0.4, 0.5],
        [0.23, 0.13, 0.43, 0.54]
    ]]  # [batch, num_boxes, (y1, x1, y2, x2)]
    boxes = np.array(boxes)
    print('boxes.shape=', boxes.shape)
    # ------------ run the hard-to-follow code --------------
    for i, level in enumerate(range(2, 6)):
        ix = tf.where(tf.equal(roi_level, level))
        level_boxes = tf.gather_nd(boxes, ix)
        box_indices = tf.cast(ix[:, 0], tf.int32)
        print('i=',i,' level=',level,' ---------------')
        with tf.Session() as sess:
            print('ix:', sess.run(ix))
            print('level_boxes:', sess.run(level_boxes))
            print('box_indices:', sess.run(box_indices))
        box_to_level.append(ix)
    print("box_to_level:",)
    with tf.Session() as sess:
        for i in box_to_level:
            print(sess.run(i))
if __name__ == '__main__':
    test()
Output (because roi_level keeps its trailing dimension in this demo, each row of ix has three numbers and level_boxes contains only the y1 of each selected box):
roi_level.shape= (1, 5, 1)
boxes.shape= (1, 5, 4)
roi_level = [[
[4],
[3],
[3],
[2],
[5]
]]
boxes = [[
[0.1, 0.3, 0.13, 0.34],
[0.5, 0.66, 0.67, 0.89],
[0.4, 0.61, 0.7, 0.8],
[0.2, 0.3, 0.4, 0.5],
[0.23, 0.13, 0.43, 0.54]
]]
# i= 0 level= 2 ---------------
ix: [[0 3 0]]
level_boxes: [0.2]
box_indices: [0]
# i= 1 level= 3 ---------------
ix: [[0 1 0]
[0 2 0]]
level_boxes: [0.5 0.4]
box_indices: [0 0]
# i= 2 level= 4 ---------------
ix: [[0 0 0]]
level_boxes: [0.1]
box_indices: [0]
# i= 3 level= 5 ---------------
ix: [[0 4 0]]
level_boxes: [0.23]
box_indices: [0]
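For comparison, here is a plain-Python sketch of the same loop with roi_level in the squeezed shape [batch, num_boxes] that the real layer uses. The helpers where_equal and gather_nd_2d are hypothetical stand-ins for tf.where(tf.equal(...)) and tf.gather_nd, written only to show the shapes involved.

```python
roi_level = [[4, 3, 3, 2, 5]]  # [batch, num_boxes], already squeezed
boxes = [[
    [0.1, 0.3, 0.13, 0.34],
    [0.5, 0.66, 0.67, 0.89],
    [0.4, 0.61, 0.7, 0.8],
    [0.2, 0.3, 0.4, 0.5],
    [0.23, 0.13, 0.43, 0.54],
]]  # [batch, num_boxes, (y1, x1, y2, x2)]


def where_equal(levels, level):
    """(batch_index, box_index) of every entry equal to `level`, in row-major order."""
    return [(b, i) for b, row in enumerate(levels)
            for i, value in enumerate(row) if value == level]


def gather_nd_2d(params, indices):
    """With two-number indices into a 3-D list, each index selects a whole box."""
    return [params[b][i] for b, i in indices]


for level in range(2, 6):
    ix = where_equal(roi_level, level)
    level_boxes = gather_nd_2d(boxes, ix)  # [num boxes at this level, 4]
    box_indices = [b for b, _ in ix]       # image index of each box
    print("level", level, "ix:", ix)
    print("  level_boxes:", level_boxes)
    print("  box_indices:", box_indices)
```

With two-number indices each selected entry is a full (y1, x1, y2, x2) box, which is what crop_and_resize needs.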
Example 2
The source is the first few lines of refine_detections_graph in the Detection Layer. The Mask R-CNN code is:
# ----------- Get the score of the top-scoring class for each proposed region -----------
# Class IDs per ROI
class_ids = tf.argmax(probs, axis=1, output_type=tf.int32)  # [N], top-scoring class of each ROI
# Class probability of the top class of each ROI
indices = tf.stack([tf.range(probs.shape[0]), class_ids], axis=1)  # [N, (ROI index, top class index)]
class_scores = tf.gather_nd(probs, indices)  # [N], score of the top-scoring class of each ROI
Again, make up some data and write a bit of code to see exactly how the values change:
import numpy as np
import tensorflow as tf
# A small demo of slicing and indexing.
# For one image, probs.shape=(N, num_classes)
# N is the number of ROIs detected in this image; the demo assumes N=6, i.e. 6 objects were detected
# num_classes is the number of columns: the demo uses 9 (column 0 plus 8 object classes)
def test():
    probs = np.array([
        [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.3],
        [0, 0.5, 0.2, 0.3, 0.4, 0.1, 0.6, 0.2, 0.3],
        [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.1, 0.3],
        [0, 0.9, 0.2, 0.3, 0.4, 0.5, 0.6, 0.4, 0.3],
        [0, 0.1, 0.2, 0.9, 0.4, 0.5, 0.6, 0.2, 0.3],
        [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.9, 0.6, 0.3],
    ])
    print(probs)
    class_ids = tf.argmax(probs, axis=1, output_type=tf.int32)  # [N], top-scoring class of each ROI
    # Class probability of the top class of each ROI
    indices = tf.stack([tf.range(probs.shape[0]), class_ids], axis=1)  # [N, (ROI index, top class index)]
    class_scores = tf.gather_nd(probs, indices)  # [N], score of the top-scoring class of each ROI
    with tf.Session() as sess:
        print(sess.run(indices))
        print(sess.run(class_scores))
if __name__ == '__main__':
    test()
Output:
# probs =
[[0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.3]
[0. 0.5 0.2 0.3 0.4 0.1 0.6 0.2 0.3]
[0. 0.1 0.2 0.3 0.4 0.5 0.6 0.1 0.3]
[0. 0.9 0.2 0.3 0.4 0.5 0.6 0.4 0.3]
[0. 0.1 0.2 0.9 0.4 0.5 0.6 0.2 0.3]
[0. 0.1 0.2 0.3 0.4 0.5 0.9 0.6 0.3]]
# class_ids =
[7 6 6 1 3 6]
# indices =
[[0 7]
[1 6]
[2 6]
[3 1]
[4 3]
[5 6]]
# class_scores =
[0.7 0.6 0.6 0.9 0.9 0.9]
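So, does this make the following clearer?
- how tf.gather_nd is used and how it differs from tf.gather
- how indices is built and what it is for
The thinking behind the code:
First pin down the variable shapes and the goal: we want the highest score in each row of probs. Since probs is a 2D tensor, each index must also have two components (row, column), and tf.stack builds exactly that. The column index of each row's maximum comes from tf.argmax.
Once indices is available, tf.gather_nd pulls out the actual values, and the goal is reached!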
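The difference between the two gather operations can also be seen without TensorFlow. In this plain-Python sketch, gather, gather_nd_2d, and argmax are hypothetical stand-ins for tf.gather, tf.gather_nd, and tf.argmax on a 2D list; the data is the probs matrix from Example 2.

```python
def gather(params, indices):
    """Like tf.gather on axis 0: each index selects a whole row."""
    return [params[i] for i in indices]


def gather_nd_2d(params, indices):
    """Like tf.gather_nd with (row, column) indices: each index selects one element."""
    return [params[r][c] for r, c in indices]


def argmax(row):
    """Column of the first maximum in a row."""
    return max(range(len(row)), key=lambda i: row[i])


probs = [
    [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.3],
    [0, 0.5, 0.2, 0.3, 0.4, 0.1, 0.6, 0.2, 0.3],
    [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.1, 0.3],
    [0, 0.9, 0.2, 0.3, 0.4, 0.5, 0.6, 0.4, 0.3],
    [0, 0.1, 0.2, 0.9, 0.4, 0.5, 0.6, 0.2, 0.3],
    [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.9, 0.6, 0.3],
]

class_ids = [argmax(row) for row in probs]
indices = [(row, col) for row, col in enumerate(class_ids)]  # the tf.stack step
class_scores = gather_nd_2d(probs, indices)

print(class_ids)
print(class_scores)
print(gather(probs, [0, 3]))  # gather returns entire rows instead
```

gather picks whole slices along one axis, while gather_nd follows a full coordinate per output element, which is why the (row, column) pairs have to be stacked first.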