Research/Blog
Object Detection with EfficientNet and EfficientDet
- March 27, 2020
- Posted by: vsinghal
- Category: Computer Vision Deep Learning
#CellStratAILab #disrupt4.0 #WeCreateAISuperstars
Minutes from the Sunday 22nd March 2020 AI Intern Workshop at BLR :-
Session Presenter : Niraj Kale, AI Researcher, CellStrat AI Lab
Last Sunday our AI Lab researcher Niraj Kale presented an excellent workshop on Object Detection with EfficientNet and EfficientDet, state-of-the-art models published in 2019 by the Google Brain team.
EfficientNet :-
EfficientNet is about developing an efficient, principled scaling method for ConvNets (Convolutional Neural Networks).
Traditionally, one can scale a ConvNet by depth (number of layers), width (number of channels), or input resolution.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Model-Scaling.png)
Image Credit : https://arxiv.org/abs/1905.11946
In research so far, scientists have typically scaled just one of these dimensions; even when two dimensions were scaled, it was done arbitrarily.
A ConvNet layer i can be defined as Yi = Fi(Xi), where Xi is the input tensor with shape <Hi, Wi, Ci>. A ConvNet N can then be represented as :-
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Conv-formula.png)
In practice, ConvNet layers are often partitioned into multiple stages, and all layers in each stage share the same architecture: for example, ResNet (He et al., 2016) has five stages, and all layers in each stage have the same convolutional type, except that the first layer performs down-sampling.
Therefore a ConvNet can be defined as :-
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Conv-formula-2.png)
Here Fi^Li denotes that layer Fi is repeated Li times in stage i, and <Hi, Wi, Ci> denotes the shape of the input tensor X of layer i.
Regular ConvNet design tries to find the best layer architecture Fi; model scaling instead tries to expand the network depth (Li), width (Ci) or resolution (Hi, Wi) without changing the Fi of the baseline design.
To further reduce the design space, the authors restrict all layers to be scaled uniformly with a constant ratio. The optimization problem then becomes maximizing model accuracy under the scheme below :-
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Model-accuracy-optimization.png)
where w, d, r are coefficients for scaling network width, depth, and resolution, and Fi, Li, Hi, Wi, Ci are predefined parameters of the baseline network (Table 1 below).
The main difficulty of this optimization problem is that the optimal d, w, r depend on each other, and their values change under different resource constraints.
Pre-trained models such as ResNet can be scaled up from ResNet-50 to ResNet-200, and can likewise be scaled down from ResNet-50 to ResNet-18. The intuition is that a deeper network (depth scaling) can capture richer and more complex features and generalizes well to new tasks. However, vanishing gradients are one of the most common problems that arise as we go deeper. Even if you avoid vanishing gradients, or use techniques to smooth the training, adding more layers doesn't always help: for example, ResNet-1000 has accuracy similar to ResNet-101.
Width scaling is commonly used when we want to keep our model small. Wider networks tend to be able to capture more fine-grained features. Also, smaller models are easier to train. The problem is that even though you can make your network extremely wide, with shallow models (less deep but wider) accuracy saturates quickly with larger width.
Next we come to resolution scaling. Intuitively, in a high-resolution image the features are more fine-grained, so high-res images should work better. This is also one of the reasons that in complex tasks, like object detection, we use image resolutions such as 300×300, 512×512, or 600×600. But accuracy doesn't scale linearly with resolution; the gain diminishes very quickly. For example, increasing resolution from 500×500 to 560×560 doesn't yield significant improvements.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Scaling-up-baseline-model.png)
Image Credit : https://arxiv.org/abs/1905.11946
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Scaling-network.png)
Image Credit : https://arxiv.org/abs/1905.11946
Note : Top-1 accuracy is the conventional accuracy: the model answer (the one with highest probability) must be exactly the expected answer. Top-5 accuracy means that any of your model 5 highest probability answers must match the expected answer.
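The two definitions above can be sketched as a short, framework-free Python helper; the function name and the toy predictions below are illustrative, not from the paper:

```python
def top_k_accuracy(probs, labels, k):
    """Fraction of samples whose true label is among the k highest-probability classes."""
    hits = 0
    for row, label in zip(probs, labels):
        # class indices sorted by descending predicted probability
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Toy predictions over 4 classes for 3 samples:
probs = [
    [0.1, 0.6, 0.2, 0.1],    # top-1 prediction = class 1
    [0.3, 0.1, 0.4, 0.2],    # top-1 prediction = class 2
    [0.25, 0.25, 0.3, 0.2],  # top-1 prediction = class 2
]
labels = [1, 0, 2]

print(top_k_accuracy(probs, labels, k=1))  # 2 of 3 correct -> 2/3
print(top_k_accuracy(probs, labels, k=2))  # all labels within top-2 -> 1.0
```

Note that top-5 accuracy is always at least as high as top-1, which is why the paper reports both.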
Hence we resort to combined scaling. Though it is possible to scale two or three dimensions arbitrarily, arbitrary scaling is a tedious task, and most of the time manual scaling results in sub-optimal accuracy and efficiency.
It is critical to balance all dimensions of a network (width, depth, and resolution) during CNNs scaling for getting improved accuracy and efficiency.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Compound-scaling.png)
φ is a user-specified coefficient that controls how many resources are available whereas α, β, and γ specify how to assign these resources to network depth, width, and resolution respectively.
The baseline architecture for scaling can be built from existing ConvNets, but the authors created a new mobile-size baseline, called EfficientNet-B0.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/B0-Network.png)
The authors obtained their base network by doing a Neural Architecture Search (NAS) that optimizes both accuracy and FLOPS.
From the baseline network EfficientNet-B0, compound scaling is applied using a two-step method :-
- Fix φ = 1, assuming twice as many resources are available, and do a small grid search for α, β, and γ based on Equations 2 and 3. For the baseline network B0, the optimal values turned out to be α = 1.2, β = 1.1, and γ = 1.15, such that α·β²·γ² ≈ 2.
- Then fix α, β, and γ as constants and scale up with different values of φ using Equation 3, obtaining EfficientNet-B1 to B7.
To keep the search space small and the search inexpensive, these parameters are found once on the small baseline network (Step 1 above), and the same scaling coefficients are then reused for all the larger models (Step 2 above).
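The two-step recipe can be sketched in a few lines of Python. The α, β, γ values are the ones reported for B0; the helper name and the integer φ sweep are illustrative (the released B1 to B7 models also round widths/depths, so actual dimensions differ slightly):

```python
# Coefficients from the paper's grid search on EfficientNet-B0:
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# FLOPS scale roughly with d * w^2 * r^2, so the constraint
# alpha * beta^2 * gamma^2 ~= 2 doubles FLOPS per unit increase of phi:
print(ALPHA * BETA ** 2 * GAMMA ** 2)  # ~1.92, close to 2

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```

This makes the appeal of compound scaling concrete: one scalar φ fixes all three dimensions at once instead of a per-model manual search.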
The results of EfficientNet scaling are indicated below for EfficientNet-B0 to B7 :-
![](http://www.cellstrat.com/wp-content/uploads/2020/03/EfficientNet-performance.png)
Source Credit : https://arxiv.org/abs/1905.11946
EfficientNet models use an order of magnitude fewer parameters and FLOPS than other ConvNets with similar accuracy. In particular, EfficientNet-B7 achieves 84.4% top-1 / 97.1% top-5 accuracy with 66M parameters and 37B FLOPS, being more accurate yet 8.4x smaller than the previous best model, GPipe (Huang et al., 2018).
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Class-activation-map.png)
Image Credit : https://arxiv.org/abs/1905.11946
Feature Pyramid Network (FPN) :-
Object Detection can be depicted by these images.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Object-Detection.png)
Image Credit : https://towardsdatascience.com/object-detection-simplified-e07aa3830954
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Image-Segmentation.png)
Image Credit : https://towardsdatascience.com/object-detection-simplified-e07aa3830954
Object detection before deep learning was a multi-step process, starting with edge detection and feature extraction using techniques like SIFT, HOG etc. The resulting features were then compared with existing object templates, usually at multiple scales, to detect and localize the objects present in the image.
A common metric used to measure accuracy of object detection models in images is the IoU or Intersection over Union.
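IoU is straightforward to compute for axis-aligned boxes. A minimal sketch follows; the `(x1, y1, x2, y2)` corner convention is an assumption here, as other codebases use `(x, y, w, h)`:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # 0 if boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes overlapping in a 5x10 strip:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / 150 = 0.333...
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.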
Detecting objects at different scales is challenging, in particular for small objects. We can use a pyramid of the same image at different scales to detect objects (the left diagram below).
However, processing multiple scale images is time consuming and the memory demand is too high to be trained end-to-end simultaneously. Hence, we may only use it in inference to push accuracy as high as possible, in particular for competitions, when speed is not a concern.
Alternatively, we can create a pyramid of features and use them for object detection. However, the feature maps closer to the image are composed of low-level structures that are not effective for accurate object detection.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/FPN-1.png)
Feature Pyramid Network (FPN) is a feature extractor designed around this pyramid concept with both accuracy and speed in mind. It replaces the feature extractor of detectors like Faster R-CNN and generates multiple feature map layers (multi-scale feature maps) with better-quality information than a regular feature pyramid for object detection.
FPN is composed of a bottom-up and a top-down pathway. The bottom-up pathway is the usual convolutional network for feature extraction. As we go up, the spatial resolution decreases and, with more high-level structures detected, the semantic value of each layer increases.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/FPN-2.png)
The bottom-up pathway uses ResNet as its backbone. It is composed of many convolution modules (convi for i = 1 to 5), each of which has many convolution layers. As we move up, the spatial dimension is halved (i.e. the stride doubles). The output of each convolution module is labeled Ci and is later used in the top-down pathway.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/ResNet.png)
We apply a 1 × 1 convolution filter to reduce C5 channel depth to 256-d to create M5. This becomes the first feature map layer used for object prediction.
As we go down the top-down pathway, we upsample the previous layer by 2 using nearest-neighbor upsampling. We apply a 1 × 1 convolution to the corresponding feature map in the bottom-up pathway and add the two element-wise. We then apply a 3 × 3 convolution to each merged map; this filter reduces the aliasing effect of merging with the upsampled layer.
We repeat the same process for P3 and P2; however, we stop at P2 because the spatial dimension of C1 is too large and would slow down the process too much. Because we share the same classifier and box regressor across all output feature maps, every pyramid feature map (P5, P4, P3 and P2) has 256 output channels.
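The upsample-then-add merge step can be illustrated with a toy, framework-free sketch. Real FPNs operate on multi-channel tensors with learned 1 × 1 and 3 × 3 convolutions; those are elided here, so the merge reduces to nearest-neighbor upsampling plus an element-wise add:

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D feature map (list of lists)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out

def merge(top, lateral):
    """Element-wise add of the upsampled top-down map and the lateral map.
    (We assume the lateral map has already passed through its 1x1 conv.)"""
    up = upsample2x(top)
    return [[u + l for u, l in zip(ur, lr)] for ur, lr in zip(up, lateral)]

# Toy maps: spatial dims double going down the pyramid.
M5 = [[1.0]]                # 1x1 map from the 1x1 conv on C5
C4_lateral = [[0.5, 0.5],
              [0.5, 0.5]]   # 2x2 lateral map from C4
M4 = merge(M5, C4_lateral)
print(M4)  # [[1.5, 1.5], [1.5, 1.5]]
```

In the actual network a 3 × 3 convolution over each merged map then produces the output features P5 to P2.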
![](http://www.cellstrat.com/wp-content/uploads/2020/03/FPN-with-faster-RCNN.png)
The formula to pick the feature maps is based on the width w and height h of the ROI.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/FPN-Formula-1-1.png)
![](http://www.cellstrat.com/wp-content/uploads/2020/03/FPN-Formula-2-1.png)
So if k = 3, we select P3 as our feature maps. We apply the ROI pooling and feed the result to the Fast R-CNN head (Fast R-CNN and Faster R-CNN have the same head) to finish the prediction.
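The level-assignment formula, k = floor(k0 + log2(sqrt(w·h) / 224)), is easy to implement. A sketch, assuming the paper's default k0 = 4 (the level for a 224 × 224 ROI, 224 being the canonical ImageNet pretraining size) and clamping to the available levels P2 to P5:

```python
import math

def fpn_level(w, h, k0=4, k_min=2, k_max=5):
    """Pick the pyramid level P_k for an ROI of width w and height h."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))  # clamp to existing levels P2..P5

print(fpn_level(224, 224))  # 4 -> P4
print(fpn_level(112, 112))  # 3 -> P3 (half the scale, one level lower)
print(fpn_level(40, 40))    # clamped to 2 -> P2
```

Intuitively, smaller ROIs are routed to higher-resolution feature maps, and larger ROIs to coarser ones.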
(for additional info on Fast RCNN, Faster RCNN and Mask RCNN, click here –
http://www.cellstrat.com/2019/11/13/aibyte-by-cellstrat-object-detection-with-mask-r-cnn/).
Just like Mask R-CNN, FPN is also good at extracting masks for image segmentation. A 5 × 5 window is slid over the feature maps to generate 14 × 14 mask segments; masks at different scales are later merged to form the final mask predictions.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/FPN-Segmentation-1024x364.png)
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Various-architectures-1024x347.png)
EfficientDet :-
The creators of EfficientDet wanted to see whether it is possible to build a scalable detection architecture with both higher accuracy and better efficiency across a wide spectrum of resource constraints (e.g., from 3B to 300B FLOPs). Their paper tackles this problem by systematically studying the design choices of detector architectures. It examines the backbone, feature fusion, and class/box networks, and seeks to solve these two challenges :-
Efficient multi-scale feature fusion : the Feature Pyramid Network (FPN) has become the de facto standard for fusing multi-scale features, and detectors such as RetinaNet, PANet and NAS-FPN all use one. However, most of the fusion strategies adopted in these networks don't consider the relative importance of the input features while fusing; they simply sum them up without distinction. Intuitively, not all input features contribute equally to the output features, so a better strategy for multi-scale fusion is required.
Model scaling : most previous works make the backbone network bigger to improve accuracy. The authors observed that scaling up the feature network and the box/class prediction network is also critical when taking both accuracy and efficiency into account. Inspired by the compound scaling of EfficientNets, they proposed a compound scaling method for object detectors which jointly scales up the resolution, depth and width of the backbone, feature network, and box/class prediction network.
EfficientDet uses a BiFPN (bidirectional FPN) architecture for multi-scale feature fusion, which aims to aggregate features at different resolutions.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Feature-Network-Design.png)
Image Credit : https://arxiv.org/abs/1911.09070
The conventional FPN aggregates multi-scale features in a top-down manner:
![](http://www.cellstrat.com/wp-content/uploads/2020/03/FPN-equation.png)
Cross-scale connections :
The problem with conventional FPN as shown in Figure (a) is that it is limited by the one-way (top-down) information flow. To address this issue, PANet adds an extra bottom-up path aggregation network, as shown in Figure (b) above.
Many other papers, e.g. NAS-FPN, have also studied cross-scale connections for capturing better semantics. In short, the game is all about the connections linking low-level features to high-level features and vice versa. On top of PANet, BiFPN makes three further optimizations :-
- Remove Nodes that only have one input edge. If a node has only one input edge with no feature fusion, then it will have less contribution to the feature network that aims at fusing different features.
- Add an extra edge from the original input to output node if they are at the same level, in order to fuse more features without adding much cost.
- Treat each bidirectional (top-down & bottom-up) path as one feature network layer, and repeat the same layer multiple times to enable more high-level feature fusion.
Weighted Feature Fusion :
Previous feature fusion methods treat all input features equally, without distinction. However, the authors observe that since different input features are at different resolutions, they usually contribute to the output feature unequally. To address this, they propose adding an additional weight for each input during feature fusion and letting the network learn the importance of each input feature. Based on this idea, three weighted-fusion approaches are considered:
Unbounded fusion :
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Unbounded-fusion.png)
where wi is a learnable weight that can be a scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel). Since unbounded weights can make training unstable, weight normalization is used to bound the value range of each weight.
Softmax-based fusion :
An intuitive idea is to apply softmax to the weights, so that they are normalized into probabilities in the range 0 to 1 representing the importance of each input. However, the extra softmax leads to a significant slowdown on GPU hardware.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Softmax-based-fusion.png)
Fast normalized fusion :
To minimize the extra latency cost, we further propose a fast fusion approach.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Fast-normalized-fusion.png)
where wi >= 0 is ensured by applying a ReLU after each wi, and ε = 0.0001 is a small value to avoid numerical instability. Each normalized weight again falls between 0 and 1, but since there is no softmax operation here, it is much more efficient. The authors' ablation study shows this fast fusion approach achieves learning behavior and accuracy very similar to softmax-based fusion, while running up to 30% faster on GPUs.
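The two fusion schemes can be compared on a scalar toy example. In the real network the inputs are feature tensors and the weights are learned per input; the arithmetic, however, is the same:

```python
import math

def softmax_fusion(weights, features):
    """Softmax-based fusion: weights normalized into probabilities via softmax."""
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    return sum(e / total * f for e, f in zip(exps, features))

def fast_normalized_fusion(weights, features, eps=1e-4):
    """Fast normalized fusion: ReLU each weight, then divide by their sum + eps."""
    relu_w = [max(0.0, w) for w in weights]
    total = sum(relu_w) + eps
    return sum(w / total * f for w, f in zip(relu_w, features))

# Two scalar "features" fused with learnable weights:
feats, w = [1.0, 3.0], [0.5, 1.5]
print(softmax_fusion(w, feats))          # ~2.46
print(fast_normalized_fusion(w, feats))  # ~2.50, no exp() needed
```

Both keep the effective weights in [0, 1]; fast normalized fusion simply avoids the exponentials, which is where the GPU speedup comes from.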
The final BiFPN integrates both the bidirectional cross-scale connections and the fast normalized fusion.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/BiFPN-two-fused-features-1.png)
where P6td is the intermediate feature at level 6 on the top-down pathway, and P6out is the output feature at level 6 on the bottom-up pathway.
Now let's look at the EfficientDet architecture.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/EfficientDet-architecture.png)
Image Credit : https://arxiv.org/abs/1911.09070
EfficientDet Architecture :
EfficientDet detectors are single-shot detectors much like SSD and RetinaNet. The backbone networks are ImageNet pretrained EfficientNets. The proposed BiFPN serves as the feature network, which takes level 3–7 features {P3, P4, P5, P6, P7} from the backbone network and repeatedly applies top-down and bottom-up bidirectional feature fusion. These fused features are fed to a class and box network to produce object class and bounding box predictions respectively. The class and box network weights are shared across all levels of features.
We have already seen with EfficientNets that scaling all dimensions yields much better performance, and we would like to do the same for the EfficientDet family. Previous works in object detection scale only the backbone network or the FPN layers to improve accuracy, which is limiting because it focuses on a single dimension of the detector. The authors therefore proposed a new compound scaling method for object detection, which uses a simple compound coefficient φ to jointly scale up all dimensions of the backbone network, BiFPN network, class/box network, and resolution.
Object detectors have many more scaling dimensions than image classification models, so a grid search over all of them would be prohibitively expensive. Therefore, the authors used a heuristic-based scaling approach, while still following the main idea of jointly scaling up all dimensions.
- Backbone network: the same width/depth scaling coefficients as EfficientNet-B0 to B6 are used, so that the ImageNet-pretrained checkpoints can be reused.
- BiFPN network: the width (#channels) grows exponentially, as in EfficientNets, while the depth (#layers) increases linearly, since depth must be rounded to small integers. A grid search found 1.35 to be the best width scale factor.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/BiFPN-Equation-1.png)
- Box/class prediction network: the width is kept the same as the BiFPN, but the depth (#layers) is increased linearly.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/BiFPN-Equation-2.png)
- Input image resolution: since feature levels 3–7 are used in the BiFPN, the input resolution must be divisible by 2^7 = 128, so resolutions are increased linearly using the equation:
![](http://www.cellstrat.com/wp-content/uploads/2020/03/BiFPN-Equation-3.png)
Now, using equations (1), (2), and (3) with different values of φ, we can go from EfficientDet-D0 (φ = 0) to EfficientDet-D6 (φ = 6), as shown in Table 1 below. Models scaled up with φ >= 7 could not fit in memory without changing the batch size or other settings, so the authors expanded D6 to D7 by enlarging only the input size while keeping all other dimensions the same, allowing the same training settings to be used for all models. Here is a table summarizing all these configs:
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Table-1-1.png)
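The four scaling rules above can be combined into a small helper that reproduces the structure of Table 1. Note this is a sketch: the paper additionally rounds the BiFPN channel counts to hardware-friendly values (e.g. 88 rather than 86 for D1), so the widths below differ slightly from the published table:

```python
def efficientdet_config(phi):
    """BiFPN width/depth, box-class depth and input resolution for coefficient phi."""
    bifpn_width = int(round(64 * (1.35 ** phi)))  # channels grow exponentially
    bifpn_depth = 3 + phi                          # layers grow linearly
    box_class_depth = 3 + phi // 3                 # D_class = 3 + floor(phi / 3)
    resolution = 512 + 128 * phi                   # stays divisible by 2^7 = 128
    return bifpn_width, bifpn_depth, box_class_depth, resolution

for phi in range(7):  # D0 .. D6
    w, d, dc, r = efficientdet_config(phi)
    print(f"D{phi}: input {r}, BiFPN {w} ch x {d} layers, box/class depth {dc}")
```

For D0 this gives input 512, BiFPN width 64, depth 3, and box/class depth 3, matching the table's starting point.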
The chart below shows how EfficientDet outperforms other SOTA models :-
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Model-FLOPs-vs-COCO-accuracy.png)
Source Credit :https://arxiv.org/abs/1911.09070
The results of various EfficientDet models :-
![](http://www.cellstrat.com/wp-content/uploads/2020/03/EfficientDet-performance-1024x506.png)
Interested in learning AI/ML from India's No. 1 AI Lab ? Then attend our AI Lab meetup this Saturday 28th Mar 2020 in BLR. Please RSVP below to attend :-
BLR AI Lab :
Register : https://bit.ly/33QJhpB
Topic : DenseNet, Deep Learning Inference Accelerators
Date : Saturday 28th Mar 2020, 10:30 AM – 5:00 PM
Presenters : Darshan C G, Abdus Samad
See you this Saturday for the AI Lab meetup in BLR ! Let's disrupt the world with AI !
Questions ? Call me at +91-9742800566 !
Best Regards,
Vivek Singhal
Co-Founder & Chief Data Scientist, CellStrat
+91-9742800566