State-of-the-Art Real-Time Object Detection: Research review

Shreya Verma
10 min read · Apr 1, 2022
Object detection from self-driving cars. (ref. link)

Machine vision is one of the most rapidly researched and fastest-developing fields today, and object detection is a significant part of it.

Object detection is an ongoing, heavily researched area of interest that deals with detecting and classifying instances of objects in images or videos. It has a wide variety of applications, from self-driving cars to Amazon Go and from face recognition to augmented reality. Its applicability in the real world, however, comes with a few caveats: for practical applications, especially those running in real time, model accuracy must be balanced against model size and prediction time (latency).

While extensive research is being performed in this field, models are still far from human-level performance, especially in terms of speed. In this article, I review the research that has gone into making such models simpler and faster.

Research Work

The two papers investigated in this article are:

  1. Non-Deep Networks; Ankit Goyal, Alexey Bochkovskiy, Jia Deng, Vladlen Koltun.
  2. CaTDet: Cascaded Tracked Detector for Efficient Object Detection from Video; Huizi Mao, Taeyoung Kong, William J. Dally; Proceedings of the 2nd SysML Conference.

Non-Deep Networks

‘Depth’ is the defining characteristic of deep neural networks, especially in large-scale visual recognition, where models up to 1,000 layers deep have dominated state-of-the-art benchmarks. This paper, however, shows that depth is not the backbone of performance: ‘Non-Deep Networks’ offers the first empirical evidence that non-deep networks can match or even beat deep ones. ParNet, the proposed non-deep architecture, achieves over 80% top-1 accuracy on ImageNet, 96% on CIFAR-10, and 81% on CIFAR-100 with just 12 layers!

Top-1 accuracy vs depth of various models on ImageNet dataset. (ref. paper)

ParNet does not just open the door to a new scientific understanding of depth versus performance; it also provides practical advantages. Its parallel architecture lets the network be parallelized across multiple processors, and even with the extra latency incurred by communication between processors, the model outperforms ResNets in both speed and accuracy. With better hardware to further limit communication latency, ParNet-style architectures could be one step toward fast facial recognition systems. The architecture also allows the model to scale while keeping its depth constant.

Architecture

Contrary to conventional beliefs and prior proofs, this paper shows that shallow networks can also perform well when they incorporate parallel substructures. Earlier results explored only sequential model structures; by leveraging parallel substructures, this work breaks the presumed link between depth and performance.

Schematic representation of the 12-layer ParNet, as described in the paper.

The ParNet model, as depicted above, consists of parallel substructures that each process the input and extract different features. The extracted features are fused later in the network and then used for the downstream task. The parts of the model are described below:

  1. ParNet Block

The authors find empirically that VGG-style blocks suit non-deep networks better than ResNet blocks, and that such blocks can be trained effectively using a technique called structural re-parameterization. The ParNet block design includes 3 parallel branches:

a. 1x1 Convolutions

b. 3x3 Convolutions

c. Skip-Squeeze-Excitation (SSE) layer: increases the model's receptive field without increasing its depth. (Ideally, each output pixel is computed from a large receptive field so that as much relevant input information as possible is taken into account.)

Schematic representation of the ParNet block, as described in the paper.
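To make the block concrete, here is a minimal PyTorch sketch of the three parallel branches fused by summation. The SSE gate (global average pooling followed by a single 1x1 convolution and a sigmoid) and the SiLU activation follow the paper's description, but normalization placement and other details are simplified; the official GitHub repository (linked in the references) has the exact implementation.

```python
import torch.nn as nn

class SSE(nn.Module):
    """Skip-Squeeze-Excitation: a skip path re-weighted channel-wise by a
    global-pooling gate, enlarging the effective receptive field cheaply."""
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),           # squeeze: global average pool
            nn.Conv2d(channels, channels, 1),  # single 1x1 conv, no bottleneck
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.bn(x) * self.gate(x)       # excite: per-channel scaling

class ParNetBlock(nn.Module):
    """VGG-style block with three parallel branches fused by summation."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels))
        self.conv3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.sse = SSE(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.conv1(x) + self.conv3(x) + self.sse(x))
```

At inference time, structural re-parameterization fuses the 1x1 and 3x3 branches into a single 3x3 convolution, so the trained block runs as a plain VGG-style layer.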

  2. Downsampling and Fusion

The downsampling block reduces the input resolution and increases the width to allow for multi-scale processing, whereas the fusion block combines the information extracted at different resolutions; a simplified sketch of both blocks follows.
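As a rough illustration only (not the paper's exact design, which also places pooling and SE branches in parallel), a downsampling block can be sketched as a strided convolution that widens the channel count, and a fusion block as channel concatenation followed by downsampling:

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Halve the spatial resolution and increase the width (channel count);
    a simplified stand-in for the paper's block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Fusion(nn.Module):
    """Combine two streams by channel concatenation, then downsample."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = Downsample(2 * in_ch, out_ch)

    def forward(self, a, b):
        return self.down(torch.cat([a, b], dim=1))
```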

Scaling ParNets

In the past, scaling a deep learning model has chiefly meant scaling its depth. This paper shows that ParNets can instead be scaled by increasing width, input resolution, and the number of streams, all while keeping depth constant. The table below shows that ParNets outperform ResNets:

This table, as per the paper, compares the speed and performance of ParNets and ResNets. Despite communication overheads, ParNet remains faster than ResNet.

Results

Strategies used to boost ParNet performance as per the paper.

The above table shows the strategies the authors used to boost model performance, such as longer training, higher image resolution, and 10-crop testing. Together these raise top-1 accuracy from 78.55% to 80.72% and top-5 accuracy from 94.13% to 95.38%.

ParNet even outperforms non-deep ResNet variants, as seen below:

Depiction of ParNet outperforming ResNet variants, as per the paper.

Finally, the performance of ParNets increases with resolution, number of streams, and convolution width, all while depth stays constant. The plots also make apparent that the best-scaled ParNet uses 3 streams and a high input resolution. Notably, no saturation in performance is observed when scaling ParNets without varying their depth.

The left plot shows the effect of varying resolution, convolution width, and number of streams. The right plot shows the impact on performance of changing only one of these scalable features. (ref. paper)

Conclusion

The authors successfully provide empirical proof that non-deep networks can perform comparably to deep networks, showing that depth is not essential to strong performance. The paper also shows that parallel substructures let non-deep networks perform well on benchmark datasets, and proposes its own architecture, ParNet, a 12-layer neural network that outperforms ResNets on benchmark image datasets. The parallel substructures also show good promise for parallelization on the multi-chip processors of the future. This research paves the way for neural networks that are both fast and highly accurate.

CaTDet

Object detection in videos is a highly compute-intensive task. This paper proposes CaTDet, a new video detection system that speeds up object detection by introducing a “tracker”. The authors show that the cascade, combined with the tracker, reduces the overall system workload. Further, while many object detection algorithms are ranked by their Average Precision (AP), the authors propose a new “delay” metric that tracks the time taken to detect an object from when it first appears in a video. The work targets autonomous driving and is accordingly evaluated on the KITTI and CityPersons datasets.

Architecture

Illustration of the CaTDet system. (ref. CaTDet paper)

The implementation of the CaTDet system can be described based on the illustration above. The 3 main components of the system are as follows:

Tracker

The tracker follows objects' appearances across previous frames and predicts their locations in the next frame. These predicted ‘regions of interest’ are fed into the refinement network.

The algorithm is inspired by SORT (Simple Online and Realtime Tracking). Its two major components, run for every frame, are object association and motion prediction.

  1. Object association: Based on the Hungarian algorithm, this step matches objects across two adjacent frames. It yields matched objects, lost objects (present in the previous frame but unmatched in the current one), and new objects (unmatched in the current frame). The matching runs once for every class; a sketch of this step follows the figure below.
  2. Motion prediction: Uses a simple exponential decay model, which is robust to different frame rates and resolutions.

Illustration of the tracker process. (ref. CaTDet paper)
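Here is a minimal sketch of the association step, assuming boxes are given as [x1, y1, x2, y2] arrays, the matching cost is negative IoU, and `iou_thresh` is an illustrative cutoff not specified in the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(prev_boxes, curr_boxes, iou_thresh=0.3):
    """Hungarian matching of detections in adjacent frames. Returns matched
    index pairs, lost objects (unmatched previous), and new objects
    (unmatched current). Run once per class, as in the paper."""
    cost = np.zeros((len(prev_boxes), len(curr_boxes)))
    for i, p in enumerate(prev_boxes):
        for j, c in enumerate(curr_boxes):
            cost[i, j] = -box_iou(p, c)        # minimize cost = maximize IoU
    rows, cols = linear_sum_assignment(cost)
    matches = [(i, j) for i, j in zip(rows, cols) if -cost[i, j] >= iou_thresh]
    lost = set(range(len(prev_boxes))) - {i for i, _ in matches}
    new = set(range(len(curr_boxes))) - {j for _, j in matches}
    return matches, lost, new
```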

The state of an object is represented by two vectors, x = [x, y, s] and ẋ = [ẋ, ẏ, ṡ], plus a scalar r. Here x and y are the center coordinates, s is the width of the bounding box, and r is the height-to-width ratio. The velocity estimate decays with a factor η, and the tracker is robust to a wide range of η values (η = 0.7 in this implementation). The bounding box is updated as follows:

Tracker update rules. (ref. CaTDet paper)

The bounding box formed by the predicted values x’ and r’ is passed on to the refinement network.
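The exact update rules live in the figure above; below is a plausible sketch of an exponential-decay motion model consistent with the description, where the velocity blending is my assumption rather than the paper's verbatim formula.

```python
ETA = 0.7  # decay factor η; the paper reports robustness to a wide range

def predict(x, v):
    """Extrapolate the state one frame ahead: x' = x + ẋ."""
    return [xi + vi for xi, vi in zip(x, v)]

def update(x_prev, v_prev, x_det):
    """Blend the old velocity with the newly observed displacement using
    exponential decay: ẋ ← η·ẋ + (1 − η)·(x_det − x_prev). The blending
    direction is an assumption; see the paper's figure for the exact rule."""
    v_new = [ETA * v + (1.0 - ETA) * (xd - xp)
             for v, xp, xd in zip(v_prev, x_prev, x_det)]
    return x_det, v_new
```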

Proposal Network

This object detection deep neural network is based on Faster R-CNN. First, image features are extracted; then a Region Proposal Network (RPN) predicts anchors of 3 aspect ratios and 4 scales at each location; lastly, Non-Maximum Suppression (NMS) is applied and the top 300 proposals are selected and fed into the classifier.
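For illustration, the NMS step maps directly onto torchvision's built-in operator. The boxes and scores below are placeholders, and the 0.7 IoU threshold is a common Faster R-CNN default rather than a value given in the paper.

```python
import torch
from torchvision.ops import nms

# Hypothetical RPN outputs: candidate boxes in [x1, y1, x2, y2] format with
# objectness scores. The values here are placeholders.
boxes = torch.tensor([[10., 10., 60., 60.],
                      [12., 12., 62., 62.],
                      [100., 40., 160., 120.]])
scores = torch.tensor([0.90, 0.85, 0.75])

keep = nms(boxes, scores, iou_threshold=0.7)  # suppress overlapping candidates
proposals = boxes[keep][:300]                 # keep the top 300, as in the paper
```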

Refinement Network

This object detection stage runs a modified Faster R-CNN, which differs in the following two respects:

  1. Filtered by selected regions: Since the tracker and proposal network supply only a high-recall subset of regions as the ‘regions of interest’, the refinement network processes only the feature maps corresponding to those regions. To pass enough context on to the convolutional network, a margin of 30 pixels is appended around each region of interest (see the sketch after this list).
  2. Reduced proposals: While the RPN proposes 300 regions, the refinement network considers a much smaller set, since it handles only the proposals from the proposal network and the tracker. This is very useful in reducing computation.
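A sketch of the margin step from point 1, assuming boxes in [x1, y1, x2, y2] pixel coordinates; the image size shown matches a typical KITTI frame and is only an example.

```python
def expand_roi(box, margin=30, img_w=1242, img_h=375):
    """Append a fixed pixel margin around a region of interest and clip to
    the image bounds, so the refinement network sees enough surrounding
    context. 1242x375 matches KITTI frames, purely as an example."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(img_w, x2 + margin), min(img_h, y2 + margin))
```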

Evaluation Metrics

The two metrics used to evaluate detection results are mean Average Precision (mAP) and mean Delay (mD).

Average Precision: It is defined as the area under the precision-recall curve and is commonly used to measure the quality of object detection in images.

Delay: It is defined as the number of frames from when an object first appears to when it is first detected by the algorithm (entry delay). Mean delay should be minimized to ensure early detection of instances.

Because the delay metric penalizes only false negatives, one could always reduce delay by simply detecting more objects; applying the AP metric alongside it enforces a trade-off between false negatives and false positives.

Therefore, to compute mean delay fairly, delay should be measured at the same precision level for every model (take the target precision level to be β).

Computation of mean Delay (mD). (ref. CaTDet paper)
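In code, the core of the metric is a simple average over per-object entry delays, assuming the detector's output threshold has already been tuned so that precision equals the target β; a penalty for never-detected objects is omitted here for brevity.

```python
def mean_delay(appear_frames, detect_frames):
    """Mean entry delay in frames: for each ground-truth object, count the
    frames between its first appearance and its first detection, then
    average over all objects."""
    delays = [d - a for a, d in zip(appear_frames, detect_frames)]
    return sum(delays) / len(delays)

# Example: three objects first appear at frames 5, 12, 40 and are first
# detected at frames 7, 12, 45 → mean delay of (2 + 0 + 5) / 3 ≈ 2.33 frames.
print(mean_delay([5, 12, 40], [7, 12, 45]))
```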

Results

The table below lets us analyze the performance of the proposal network. As the base network varies, standalone Faster R-CNN models give mAP values ranging from 0.542 to 0.687, whereas CaTDet's proposal models hold a more or less stable mAP of 0.74. CaTDet's proposal models also perform substantially better with most base ResNets (except ResNet-18) in both mD and number of operations, although those two measures remain sensitive to the choice of base ResNet (the proposal network).

Comparison of single Faster R-CNNs against the proposal network in CaTDet, tested on Hard mode of the KITTI dataset. The refinement network is based on ResNet-50. (ref. CaTDet paper)

The refinement network's performance contributes most to the overall performance of the system. The table below shows that the base models' mAP (and, in most cases, mD) does not vary with the ‘Setting’; however, CaTDet's refinement network reduces the number of computations substantially.

Comparison of single Faster R-CNNs against the refinement network in CaTDet, tested on Hard mode of the KITTI dataset. The proposal network is based on ResNet-10b. (ref. CaTDet paper)

The figure below shows that, with a tracker, mAP varies little across the cascaded model configurations and C-thresh values (the output threshold of the proposal network); models with a tracker, however, perform significantly better than their counterparts without one.

Illustration of the mean average precision and mean delay (for precision of 0.8) of models with and without a tracker. (ref. CaTDet paper)

Since a higher C-thresh means that fewer region proposals are fed into the refinement network, it also leads to a gradual rise in average delay.

Conclusion

The proposed CaTDet system significantly reduces the computation required for object detection: by 5–8 times on the KITTI dataset with no mAP loss, and by 13 times on the CityPersons dataset with a 0.8% mAP loss. Further, the proposed mean delay metric should prove very useful for delay-critical video applications.

My Thoughts

ParNets exemplify how experimenting against established assumptions can be rewarding. The future of deep neural networks lies not just in making them more accurate but also in making them fast (and reproducible). ParNet opens a unique avenue for researchers to develop networks that need not be massive yet still deliver fast, accurate predictions by running simultaneously on multiple processors.

Similarly, the research behind CaTDet expands the boundaries of video detection systems by exploiting temporal correlation (tracking the historical presence of objects) and by proposing a new metric for ranking models: mean delay, or mean entry delay.

With the constant research in this field, and tweaking of models in accordance with state-of-the-art research in architectures and other strategies, we might eventually reach human level detection performance!

Collaborator: Vaibhav Bagri

P.S. Also refer to the ‘Looking at Research Work in Real Time Object Detection’ article by Vaibhav Bagri to learn about other papers in state-of-the-art object detection.

References:

Non-Deep Networks paper: https://arxiv.org/pdf/2110.07641v1.pdf

Github → Non-Deep Networks: https://github.com/imankgoyal/NonDeepNetworks

CaTDet paper: https://mlsys.org/Conferences/2019/doc/2019/111.pdf
