A Survey of Deep Learning-based Object Detection
2021/12/15
the purpose of object detection: locating instances of semantic objects of a certain class
*scope: generic object detection and domain-specific object detection
most state-of-the-art object detectors use deep learning networks as both the backbone, which extracts features from input images (or videos), and the detection network, which performs classification and localization
well-researched domains of object detection include multi-category detection, edge detection, salient object detection, pose detection, scene text detection, face detection, pedestrian detection, etc.
*benchmark: a reference standard recognized across a field, in practice the datasets and evaluation metrics that papers in the field consistently use
-
two kinds of object detectors
two-stage: Faster R-CNN; one-stage: YOLO
two-stage detectors have high localization and object recognition accuracy, whereas the one-stage detectors achieve high inference speed
most backbone networks for detection are classification networks with the last FC layer removed
2021/12/16
Two-stage Detectors
- R-CNN (first deep learning-based detector)
- Fast R-CNN (use of RoI Pooling)
- Faster R-CNN (use of region proposal network/RPN, the use of multi-scale anchors)
- Mask R-CNN (for instance segmentation task, use of feature pyramid network/FPN, use of RoIAlign)
*N+1-way classification layer, N for object classes and 1 for background
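A minimal sketch of the N+1-way classification idea, assuming the layer ends in a softmax over N object classes plus one background slot. Placing background at index 0 and the names `softmax` and `classify_roi` are illustrative choices, not something these notes specify.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_roi(logits, class_names):
    """Pick the most likely label among N object classes plus background.

    Assumption: index 0 is the background slot (hypothetical convention).
    """
    labels = ["background"] + list(class_names)
    if len(logits) != len(labels):
        raise ValueError("expected N + 1 logits")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# N = 2 object classes, so the layer outputs 3 scores per region.
label, prob = classify_roi([0.2, 3.1, -1.0], ["cat", "dog"])
```

A region whose highest score lands in the background slot is simply dropped as "no object".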
One-stage Detectors
- YOLO (real-time detection of full images and webcam)
- YOLOv2 (adopt a series of design decisions from past works with novel concepts, new backbone)
- YOLOv3 (an improved version of YOLOv2)
- SSD (a single-shot detector for multiple categories)
- DSSD (a modified version of SSD)
- RetinaNet (use of focal loss)
- M2Det (have no idea about this)
- RefineDet (have no idea about this)
detecting an object means both stating that it belongs to a specified class and locating it in the image
the localization of an object is typically represented by a bounding box
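A small sketch of how a bounding box might be stored and compared. The `(x_min, y_min, x_max, y_max)` corner format is an assumption (other layouts such as center plus width and height are also common), and intersection over union (IoU) is not mentioned in these notes; it is the usual measure of how well a predicted box overlaps a ground-truth box.

```python
def box_area(box):
    """Area of a box given as (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(box_a, box_b):
    """Intersection over union of two corner-format boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(box_a) + box_area(box_b) - inter
    return inter / union if union > 0 else 0.0


predicted = (1, 1, 3, 3)
ground_truth = (0, 0, 2, 2)
overlap = iou(predicted, ground_truth)
```

An IoU of 1 means the boxes coincide and 0 means they do not overlap at all.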
benchmarks (datasets and evaluation metrics)
- PASCAL VOC dataset (basic)
- MS COCO benchmark (many images per class)
- ImageNet (many classes)
- VisDrone2018 (have no idea about this)
- OpenImages V5 (have no idea about this)
- Recall
- Precision
- Average Precision (AP)
- mean Average Precision (mAP)
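A rough sketch of how the four metrics above relate. It assumes detections are already sorted by descending confidence and already marked as true or false positives, and it uses all-point interpolation for AP; the exact AP protocol differs between benchmarks, so treat this as one possible variant. All function names are illustrative.

```python
def precision_recall(tp_flags, num_gt):
    """Cumulative precision and recall down a ranked detection list.

    tp_flags: True for a true positive, False for a false positive,
              ordered by descending confidence.
    num_gt:   number of ground-truth objects of this class.
    """
    precisions, recalls = [], []
    tp = 0
    for rank, flag in enumerate(tp_flags, start=1):
        tp += int(flag)
        precisions.append(tp / rank)   # TP / (TP + FP)
        recalls.append(tp / num_gt)    # TP / (TP + FN)
    return precisions, recalls


def average_precision(tp_flags, num_gt):
    """Area under the precision-recall curve (all-point interpolation)."""
    precisions, recalls = precision_recall(tp_flags, num_gt)
    # Make precision non-increasing when read from left to right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """mAP: the mean of per-class AP values."""
    aps = [average_precision(flags, n) for flags, n in per_class.values()]
    return sum(aps) / len(aps)


results = {
    "cat": ([True, False, True], 2),
    "dog": ([True, True], 2),
}
map_value = mean_average_precision(results)
```

Precision and recall are computed at every rank, AP summarizes one class, and mAP averages AP over classes.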
deep neural network based object detection pipelines:
- image pre-processing: resize raw data and perform data augmentation
- feature extraction: a key step for further detection
- classification and localization: producing classification scores and bounding box coordinates
- post-processing: remove weak or redundant detection results (e.g. with NMS)
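The post-processing step can be sketched with greedy non-maximum suppression (NMS). This is a minimal version under stated assumptions: boxes are `(x_min, y_min, x_max, y_max)` tuples, and the 0.5 overlap threshold is an illustrative default rather than a value taken from these notes.

```python
def iou(a, b):
    """Intersection over union of two corner-format boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept by greedy NMS, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first heavily and has a lower score, so it is suppressed, while the distant third box survives.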
to obtain precise detection results, several methods can be used alone or in combination:
- Enhanced features: for extracting effective features from input images (like FPN, Attention)
- Increasing localization accuracy: design a novel loss function
- Solving the negative-positive imbalance issue: for one-stage detectors, e.g. hard example mining or adding a term to the classification loss
- Improving post-processing NMS methods
- Combining one-stage and two-stage detectors to get the strengths of both
- Complicated scene solutions (have no idea about this)
- Anchor-free: still a novel direction for further research
- Training from scratch: some datasets need training from scratch to ensure stability and accuracy
- Designing new architecture
- Speeding up detection
- Achieving Fast and Accurate Detections
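As one concrete instance of the negative-positive imbalance item in the list above, here is a sketch of the binary focal loss used by RetinaNet, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The defaults alpha = 0.25 and gamma = 2 are the values reported in the RetinaNet paper; the function name and the single-prediction interface are illustrative simplifications.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one binary prediction.

    p: predicted probability of the positive (object) class.
    y: 1 for a positive sample, 0 for a negative (background) sample.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class balancing weight
    p_t = min(max(p_t, eps), 1.0)               # avoid log(0)
    # (1 - p_t)^gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_negative = focal_loss(0.01, 0)  # background predicted confidently
hard_negative = focal_loss(0.90, 0)  # background mistaken for an object
```

Easy negatives, which dominate one-stage training, contribute almost nothing, so the gradient is driven by the hard examples. With gamma = 0 the expression reduces to alpha-weighted cross-entropy.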
typical application areas:
- Security field: Face detection, Pedestrian detection, Anomaly detection
- Military field: Remote sensing OD
- Transportation field
- Medical field: Computer Aided Diagnosis (CAD) systems
- Life field: Pattern detection, Image caption generation