Which is best for localizing objects: R-CNN, Fast R-CNN, Faster R-CNN, or YOLO?

What is the difference between R-CNN, Fast R-CNN, Faster R-CNN, and YOLO in terms of the following: (1) accuracy on the same set of images, (2) runtime for the SAME IMAGE SIZE, and (3) support for porting to Android?

Given these three criteria, which is the best technique for localizing an object?

+4




3 answers


If you are interested in these algorithms, you should watch this tutorial, which walks through the algorithms you named: https://www.youtube.com/watch?v=GxZrEKZfW2o



P.S. There is also a Fast YOLO, if I remember correctly, haha!

+1




R-CNN is the granddaddy of all the algorithms mentioned; it gave researchers the foundation to build more complex and better algorithms on top of it.

R-CNN, or Region-based Convolutional Neural Network

R-CNN consists of 3 simple steps (sketched in pseudocode below):

  • Scan the input image for possible objects using the Selective Search algorithm, which generates ~2,000 region proposals
  • Run a Convolutional Neural Network (CNN) on top of each of these region proposals
  • Take the output of each CNN and feed it to (a) an SVM to classify the region and (b) a linear regressor to tighten the object's bounding box, if such an object exists

A pictorial description of R-CNN
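
Here is a minimal pseudocode sketch of those three steps. Note that selective_search, warp, cnn_features, svm_classify, and bbox_regress are hypothetical helpers used purely for illustration, not a real library API:

```python
# Hypothetical sketch of the 3-step R-CNN pipeline (not a real API).
def rcnn_detect(image):
    proposals = selective_search(image)              # step 1: ~2,000 region proposals
    detections = []
    for region in proposals:
        crop = warp(image, region, size=(224, 224))  # each region warped to a fixed size
        feats = cnn_features(crop)                   # step 2: one CNN forward pass PER region
        label, score = svm_classify(feats)           # step 3a: SVM classifies the region
        if label != "background":
            box = bbox_regress(feats, region)        # step 3b: regressor tightens the box
            detections.append((label, score, box))
    return detections
```

The per-region CNN pass in step 2 is exactly the bottleneck that Fast R-CNN removes, as described next.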

Fast R-CNN:

Fast R-CNN immediately followed R-CNN. Fast R-CNN is faster and better thanks to the following points:

  • Performing feature extraction over the whole image before proposing regions, so only one CNN runs over the entire image instead of 2,000 CNNs over 2,000 overlapping regions
  • Replacing the SVM with a softmax layer, thereby extending the neural network to make the predictions instead of training a separate model

Intuitively, it makes sense to eliminate the 2,000 separate conv passes and instead run the convolution once over the whole image, building the region proposals on top of the resulting feature map, as the sketch below illustrates.

A pictorial description of Fast R-CNN
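
As a rough sketch of that shared-feature-map idea, here is what per-proposal pooling over a single feature map can look like using torchvision's RoI pooling op. The feature-map shape, backbone stride, and proposal boxes below are made up for illustration:

```python
import torch
import torchvision

# Feature map computed ONCE for the whole image by the backbone
features = torch.randn(1, 512, 50, 50)                   # e.g. a VGG conv5 output
# Proposals from selective search, as (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0., 10., 10., 200., 200.],
                     [0., 50., 40., 300., 250.]])
# Crop a fixed-size window from the shared feature map for each proposal
pooled = torchvision.ops.roi_pool(features, rois, output_size=(7, 7),
                                  spatial_scale=1 / 16)  # 16 = backbone stride
print(pooled.shape)  # (2, 512, 7, 7) -> flattened, then FC layers + softmax head
```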

Faster R-CNN:

One of the shortcomings of Fast R-CNN was the slow Selective Search algorithm, so Faster R-CNN introduced what is called the Region Proposal Network (RPN).

Here's how the RPN works:

At the final layer of the initial CNN, a 3x3 sliding window moves across the feature map and maps each position to a lower dimension (for example, 256-d). For each sliding-window location, the network generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes).
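
As a rough NumPy sketch of that anchor scheme (the stride, scales, and aspect ratios below are typical Faster R-CNN defaults, chosen here only for illustration):

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchor boxes (x1, y1, x2, y2)
    centered at every feature-map location, in input-image coordinates."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx = x * stride + stride / 2   # anchor center in the input image
            cy = y * stride + stride / 2
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)  # fixed-ratio box
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)              # shape: (feat_h * feat_w * k, 4)
```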

Each region proposal consists of:

  • an "objectness" score for that region, and
  • 4 coordinates representing the region's bounding box

In other words, we look at each location in our final feature map and consider k different anchor boxes centered around it: a tall box, a wide box, a large box, etc. For each of these boxes, we output whether we think it contains an object and what the coordinates of that box are. Here's what it looks like at one sliding-window location:



Region Proposal Network

The 2k scores represent the softmax probability of each of the k bounding boxes containing an "object". Note that although the RPN outputs bounding-box coordinates, it does not attempt to classify any potential objects: its only job is to propose object regions. If an anchor box has an "objectness" score above a certain threshold, that box's coordinates are passed on as a region proposal.
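
Here is a minimal PyTorch sketch of the RPN head just described, assuming a VGG-style backbone with 512 output channels, a 256-d intermediate layer, and k = 9 anchors per location:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, mid_channels=256, k=9):
        super().__init__()
        # the 3x3 "sliding window" that maps each location to a lower dimension
        self.conv = nn.Conv2d(in_channels, mid_channels, 3, padding=1)
        self.cls = nn.Conv2d(mid_channels, 2 * k, 1)  # 2k objectness scores
        self.reg = nn.Conv2d(mid_channels, 4 * k, 1)  # 4k box coordinates

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)
```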

Once we have region proposals, we feed them straight into what is essentially a Fast R-CNN: an RoI pooling layer, several fully connected layers, and finally a softmax classification layer and a bounding-box regressor. In a sense, Faster R-CNN = RPN + Fast R-CNN.

Faster R-CNN
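
If you just want to see the combined model (RPN + Fast R-CNN head) in action, torchvision ships an off-the-shelf implementation. A minimal sketch with a dummy input image (depending on your torchvision version, you may need the weights= argument instead of pretrained=True):

```python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()
image = torch.rand(3, 480, 640)        # dummy RGB image with values in [0, 1]
with torch.no_grad():
    pred = model([image])[0]           # dict with "boxes", "labels", "scores"
print(pred["boxes"].shape, pred["scores"][:5])
```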

YOLO:

YOLO uses a single unified CNN to both classify and localize objects with bounding boxes. This is the YOLO architecture:

YOLO

As a result, you get an output tensor of length 1470, i.e. 7 × 7 × 30, and the structure of the CNN output is:

Structure of CNN output of YOLO

The 1470-dimensional output vector is divided into three parts, giving the class probabilities, box confidences, and box coordinates. Each of these three parts is further subdivided into 49 small regions, corresponding to the predictions for the 49 grid cells that tile the original image.

In the post-processing steps, we take this 1470-dimensional vector output by the network and keep the boxes whose scores are above a certain threshold.
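
As a rough illustration of that post-processing, here is how the 1470-dimensional vector can be split and thresholded, assuming the classic YOLOv1 layout (S = 7 grid cells per side, B = 2 boxes per cell, C = 20 classes); real implementations may order the parts differently, and the random vector below just stands in for actual network output:

```python
import numpy as np

S, B, C = 7, 2, 20                            # grid size, boxes per cell, classes
out = np.random.rand(S * S * (B * 5 + C))     # stand-in for the 1470-d output

# split into the three parts described above
class_probs = out[:S*S*C].reshape(S, S, C)            # 980 class probabilities
confidences = out[S*S*C:S*S*(C+B)].reshape(S, S, B)   # 98 box confidences
boxes = out[S*S*(C+B):].reshape(S, S, B, 4)           # 392 coords (x, y, w, h)

# class-specific score per box, then keep boxes above a threshold
scores = confidences[:, :, :, None] * class_probs[:, :, None, :]  # (7, 7, 2, 20)
keep = scores > 0.2
print("boxes kept:", keep.sum())
```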


I hope this gives you enough of an idea of these networks to answer your question about how their performance differs:

  • On the same dataset: you can be fairly confident that the performance of these networks follows the order in which they were mentioned, with YOLO being the best and R-CNN the worst.
  • Considering the SAME IMAGE SIZE, runtime: Faster R-CNN achieved much better speed together with state-of-the-art accuracy. It is worth noting that although later models have done a lot to improve detection speed, few have managed to significantly outperform Faster R-CNN. In other words, Faster R-CNN may not be the simplest or fastest method for object detection, but it is still one of the most accurate. However, researchers have used YOLO for video segmentation, and there it is by far the best and fastest tool.

  • Android porting support: As far as I know, TensorFlow provides APIs for deploying models to Android, but I'm not sure how these networks would perform there, or whether you can port them at all (see the TensorFlow Lite sketch after this list). That again depends on the hardware and the data size. Can you provide the hardware specs and data size so that I can answer more precisely?
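
For reference, a minimal sketch of converting a trained detector to TensorFlow Lite for Android; saved_model_dir is a placeholder path, and whether the conversion succeeds depends on which ops your particular network uses:

```python
import tensorflow as tf

# Convert a SavedModel to a .tflite file that an Android app can bundle
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional quantization
tflite_model = converter.convert()

with open("detector.tflite", "wb") as f:
    f.write(tflite_model)
# the .tflite model is then run on-device via the TF Lite Interpreter API
```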

The YouTube video posted by @A_Piro also provides a good explanation.

P.S. I borrowed a lot of this material from Joyce Xu's Medium post.

+6




I've worked a lot with YOLO and Faster R-CNN (FRCNN). For me, YOLO has the best accuracy and speed, but if you want to do research in image processing, I would suggest FRCNN, since a lot of prior work has been done with it, and for research you really want to be consistent.

0








