Research/Blog
Object Detection with EfficientNet and EfficientDet
- March 27, 2020
- Posted by: vsinghal
- Category: Computer Vision Deep Learning
#CellStratAILab #disrupt4.0 #WeCreateAISuperstars
Minutes from the Sunday 22nd March 2020 AI Intern Workshop at BLR :-
Session Presenter : Niraj Kale, AI Researcher, CellStrat AI Lab
Last Sunday our AI Lab researcher Niraj Kale presented an excellent workshop on Object Detection with EfficientNet and EfficientDet, state-of-the-art models published in 2019 by the Google Brain team.
EfficientNet :-
EfficientNet is about developing an efficient, principled scaling method for ConvNets (Convolutional Neural Networks).
Traditionally, one can scale a ConvNet by depth (number of layers), width (number of channels), or input resolution.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Model-Scaling.png)
Image Credit : https://arxiv.org/abs/1905.11946
In research so far, scientists have typically scaled just one of these dimensions; even when two dimensions were scaled, it was done arbitrarily.
A ConvNet layer i can be defined as Yi = Fi(Xi), where Xi is the input tensor with shape <Hi, Wi, Ci>. A ConvNet N can then be represented as :-
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Conv-formula.png)
In practice, ConvNet layers are often partitioned into multiple stages, and all layers in each stage share the same architecture: for example, ResNet (He et al., 2016) has five stages, and all layers in each stage have the same convolutional type, except that the first layer performs down-sampling.
Therefore a ConvNet can be defined as :-
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Conv-formula-2.png)
Here Fi^Li denotes that layer Fi is repeated Li times in stage i, and <Hi, Wi, Ci> denotes the shape of the input tensor X of layer i.
Regular ConvNet design tries to find the best layer architecture Fi; model scaling instead tries to expand the network depth (Li), width (Ci) or resolution (Hi, Wi) without changing the Fi of the baseline design.
To further reduce the design space, the authors restrict all layers to be scaled uniformly with a constant ratio. The optimization problem then becomes maximizing model accuracy under the scheme below :-
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Model-accuracy-optimization.png)
where w, d, r are coefficients for scaling network width, depth, and resolution, and Fi, Li, Hi, Wi, Ci are predefined parameters of the baseline network (Table 1 below).
The main difficulty of this optimization problem is that the optimal d, w, r depend on each other, and their values change under different resource constraints.
Pre-trained models such as ResNet can be scaled up from ResNet-50 to ResNet-200, and can likewise be scaled down from ResNet-50 to ResNet-18. The intuition is that a deeper network (depth scaling) can capture richer and more complex features and generalizes well to new tasks. However, vanishing gradients are one of the most common problems that arise as we go deeper. Even if you avoid vanishing gradients, or use techniques to smooth the training, adding more layers doesn't always help: for example, ResNet-1000 has accuracy similar to ResNet-101.
Width scaling is commonly used when we want to keep our model small. Wider networks tend to be able to capture more fine-grained features. Also, smaller models are easier to train. The problem is that even though you can make your network extremely wide, with shallow models (less deep but wider) accuracy saturates quickly with larger width.
Next we come to resolution scaling. Intuitively, in a high-resolution image the features are more fine-grained, so high-res images should work better. This is also one of the reasons that in complex tasks, like object detection, we use image resolutions such as 300×300, 512×512, or 600×600. But accuracy doesn't scale linearly with resolution; the gain diminishes very quickly. For example, increasing resolution from 500×500 to 560×560 doesn't yield significant improvements.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Scaling-up-baseline-model.png)
Image Credit : https://arxiv.org/abs/1905.11946
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Scaling-network.png)
Image Credit : https://arxiv.org/abs/1905.11946
Note : Top-1 accuracy is the conventional accuracy: the model answer (the one with highest probability) must be exactly the expected answer. Top-5 accuracy means that any of your model 5 highest probability answers must match the expected answer.
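The two definitions above can be sketched as a short, framework-free Python helper; the function name and the toy predictions below are illustrative, not from the paper:

```python
def top_k_accuracy(probs, labels, k):
    """Fraction of samples whose true label is among the k highest-probability classes."""
    hits = 0
    for row, label in zip(probs, labels):
        # class indices sorted by descending predicted probability
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Toy predictions over 4 classes for 3 samples:
probs = [
    [0.1, 0.6, 0.2, 0.1],    # top-1 prediction = class 1
    [0.3, 0.1, 0.4, 0.2],    # top-1 prediction = class 2
    [0.25, 0.25, 0.3, 0.2],  # top-1 prediction = class 2
]
labels = [1, 0, 2]

print(top_k_accuracy(probs, labels, k=1))  # 2 of 3 correct -> 2/3
print(top_k_accuracy(probs, labels, k=2))  # all labels within top-2 -> 1.0
```

Note that top-5 accuracy is always at least as high as top-1, which is why the paper reports both.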
Hence we resort to combined scaling. Though it is possible to scale two or three dimensions arbitrarily, arbitrary scaling is a tedious task, and most of the time manual scaling results in sub-optimal accuracy and efficiency.
It is critical to balance all dimensions of a network (width, depth, and resolution) during CNNs scaling for getting improved accuracy and efficiency.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Compound-scaling.png)
φ is a user-specified coefficient that controls how many resources are available whereas α, β, and γ specify how to assign these resources to network depth, width, and resolution respectively.
The baseline architecture for scaling can be built from existing ConvNets, but the authors created a new mobile-size baseline, called EfficientNet-B0.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/B0-Network.png)
The authors obtained their base network by doing a Neural Architecture Search (NAS) that optimizes both accuracy and FLOPS.
From the baseline network EfficientNet-B0, compound scaling is applied using a two-step method :-
- Fix φ = 1, assuming twice as many resources are available, and do a small grid search for α, β, and γ based on Equations 2 and 3. For the baseline network B0, the optimal values turned out to be α = 1.2, β = 1.1, and γ = 1.15, such that α·β²·γ² ≈ 2.
- Then fix α, β, and γ as constants and scale up with different values of φ using Equation 3, obtaining EfficientNet-B1 to B7.
To keep the search space small and the search inexpensive, these parameters are found once on the small baseline network (Step 1 above), and the same scaling coefficients are then reused for all the larger models (Step 2 above).
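The two-step recipe can be sketched in a few lines of Python. The α, β, γ values are the ones reported for B0; the helper name and the integer φ sweep are illustrative (the released B1 to B7 models also round widths/depths, so actual dimensions differ slightly):

```python
# Coefficients from the paper's grid search on EfficientNet-B0:
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# FLOPS scale roughly with d * w^2 * r^2, so the constraint
# alpha * beta^2 * gamma^2 ~= 2 doubles FLOPS per unit increase of phi:
print(ALPHA * BETA ** 2 * GAMMA ** 2)  # ~1.92, close to 2

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```

This makes the appeal of compound scaling concrete: one scalar φ fixes all three dimensions at once instead of a per-model manual search.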
The results of EfficientNet scaling are indicated below for EfficientNet-B0 to B7 :-
![](http://www.cellstrat.com/wp-content/uploads/2020/03/EfficientNet-performance.png)
Source Credit : https://arxiv.org/abs/1905.11946
EfficientNet models use an order of magnitude fewer parameters and FLOPS than other ConvNets with similar accuracy. In particular, EfficientNet-B7 achieves 84.4% top-1 / 97.1% top-5 accuracy with 66M parameters and 37B FLOPS, being more accurate yet 8.4x smaller than the previous best model, GPipe (Huang et al., 2018).
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Class-activation-map.png)
Image Credit : https://arxiv.org/abs/1905.11946
Feature Pyramid Network (FPN) :-
Object Detection can be depicted by these images.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Object-Detection.png)
Image Credit : https://towardsdatascience.com/object-detection-simplified-e07aa3830954
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Image-Segmentation.png)
Image Credit : https://towardsdatascience.com/object-detection-simplified-e07aa3830954
Object detection before deep learning was a multi-step process, starting with edge detection and feature extraction using techniques like SIFT, HOG etc. The resulting features were then compared with existing object templates, usually at multiple scales, to detect and localize the objects present in the image.
A common metric used to measure accuracy of object detection models in images is the IoU or Intersection over Union.
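IoU is straightforward to compute for axis-aligned boxes. A minimal sketch follows; the `(x1, y1, x2, y2)` corner convention is an assumption here, as other codebases use `(x, y, w, h)`:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # 0 if boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes overlapping in a 5x10 strip:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / 150 = 0.333...
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.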
Detecting objects at different scales is challenging, in particular for small objects. We can use a pyramid of the same image at different scales to detect objects (the left diagram below).
However, processing multiple scale images is time consuming and the memory demand is too high to be trained end-to-end simultaneously. Hence, we may only use it in inference to push accuracy as high as possible, in particular for competitions, when speed is not a concern.
Alternatively, we can create a pyramid of features and use them for object detection. However, the feature maps closer to the image are composed of low-level structures that are not effective for accurate object detection.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/FPN-1.png)
Feature Pyramid Network (FPN) is a feature extractor designed around this pyramid concept with both accuracy and speed in mind. It replaces the feature extractor of detectors like Faster R-CNN and generates multiple feature map layers (multi-scale feature maps) with better-quality information than a regular feature pyramid for object detection.
FPN is composed of a bottom-up and a top-down pathway. The bottom-up pathway is the usual convolutional network for feature extraction. As we go up, the spatial resolution decreases and, with more high-level structures detected, the semantic value of each layer increases.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/FPN-2.png)
The bottom-up pathway uses ResNet as its backbone. It is composed of many convolution modules (convi for i = 1 to 5), each of which has many convolution layers. As we move up, the spatial dimension is halved (i.e. the stride doubles). The output of each convolution module is labeled Ci and is later used in the top-down pathway.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/ResNet.png)
We apply a 1 × 1 convolution filter to reduce C5 channel depth to 256-d to create M5. This becomes the first feature map layer used for object prediction.
As we go down the top-down pathway, we upsample the previous layer by 2 using nearest-neighbor upsampling. We apply a 1 × 1 convolution to the corresponding feature map in the bottom-up pathway and add the two element-wise. We then apply a 3 × 3 convolution to each merged map; this filter reduces the aliasing effect of merging with the upsampled layer.
We repeat the same process for P3 and P2; however, we stop at P2 because the spatial dimension of C1 is too large and would slow down the process too much. Because we share the same classifier and box regressor across all output feature maps, every pyramid feature map (P5, P4, P3 and P2) has 256 output channels.
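The upsample-then-add merge step can be illustrated with a toy, framework-free sketch. Real FPNs operate on multi-channel tensors with learned 1 × 1 and 3 × 3 convolutions; those are elided here, so the merge reduces to nearest-neighbor upsampling plus an element-wise add:

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D feature map (list of lists)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out

def merge(top, lateral):
    """Element-wise add of the upsampled top-down map and the lateral map.
    (We assume the lateral map has already passed through its 1x1 conv.)"""
    up = upsample2x(top)
    return [[u + l for u, l in zip(ur, lr)] for ur, lr in zip(up, lateral)]

# Toy maps: spatial dims double going down the pyramid.
M5 = [[1.0]]                # 1x1 map from the 1x1 conv on C5
C4_lateral = [[0.5, 0.5],
              [0.5, 0.5]]   # 2x2 lateral map from C4
M4 = merge(M5, C4_lateral)
print(M4)  # [[1.5, 1.5], [1.5, 1.5]]
```

In the actual network a 3 × 3 convolution over each merged map then produces the output features P5 to P2.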
![](http://www.cellstrat.com/wp-content/uploads/2020/03/FPN-with-faster-RCNN.png)
The formula to pick the feature maps is based on the width w and height h of the ROI.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/FPN-Formula-1-1.png)
![](http://www.cellstrat.com/wp-content/uploads/2020/03/FPN-Formula-2-1.png)
So if k = 3, we select P3 as our feature maps. We apply the ROI pooling and feed the result to the Fast R-CNN head (Fast R-CNN and Faster R-CNN have the same head) to finish the prediction.
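The level-assignment formula, k = floor(k0 + log2(sqrt(w·h) / 224)), is easy to implement. A sketch, assuming the paper's default k0 = 4 (the level for a 224 × 224 ROI, 224 being the canonical ImageNet pretraining size) and clamping to the available levels P2 to P5:

```python
import math

def fpn_level(w, h, k0=4, k_min=2, k_max=5):
    """Pick the pyramid level P_k for an ROI of width w and height h."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))  # clamp to existing levels P2..P5

print(fpn_level(224, 224))  # 4 -> P4
print(fpn_level(112, 112))  # 3 -> P3 (half the scale, one level lower)
print(fpn_level(40, 40))    # clamped to 2 -> P2
```

Intuitively, smaller ROIs are routed to higher-resolution feature maps, and larger ROIs to coarser ones.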
(for additional info on Fast RCNN, Faster RCNN and Mask RCNN, click here –
http://www.cellstrat.com/2019/11/13/aibyte-by-cellstrat-object-detection-with-mask-r-cnn/).
Just like Mask R-CNN, FPN is also good at extracting masks for image segmentation. A 5 × 5 window is slid over the feature maps to generate 14 × 14 mask segments; masks at different scales are later merged to form the final mask predictions.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/FPN-Segmentation-1024x364.png)
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Various-architectures-1024x347.png)
EfficientDet :-
The creators of EfficientDet wanted to see whether it is possible to build a scalable detection architecture with both higher accuracy and better efficiency across a wide spectrum of resource constraints (e.g., from 3B to 300B FLOPs). Their paper tackles this problem by systematically studying the design choices of detector architectures. It examines the backbone, feature fusion, and class/box networks, and seeks to solve these two challenges :-
Efficient multi-scale feature fusion : the Feature Pyramid Network (FPN) has become the de facto standard for fusing multi-scale features, and detectors such as RetinaNet, PANet and NAS-FPN all use one. However, most of the fusion strategies adopted in these networks don't consider the relative importance of the input features while fusing; they simply sum them up without distinction. Intuitively, not all input features contribute equally to the output features, so a better strategy for multi-scale fusion is required.
Model scaling : most previous works make the backbone network bigger to improve accuracy. The authors observed that scaling up the feature network and the box/class prediction network is also critical when taking both accuracy and efficiency into account. Inspired by the compound scaling of EfficientNets, they proposed a compound scaling method for object detectors which jointly scales up the resolution, depth and width of the backbone, feature network, and box/class prediction network.
EfficientDet uses a BiFPN (bidirectional FPN) architecture for multi-scale feature fusion, which aims to aggregate features at different resolutions.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Feature-Network-Design.png)
Image Credit : https://arxiv.org/abs/1911.09070
The conventional FPN aggregates multi-scale features in a top-down manner:
![](http://www.cellstrat.com/wp-content/uploads/2020/03/FPN-equation.png)
Cross-scale connections :
The problem with conventional FPN as shown in Figure (a) is that it is limited by the one-way (top-down) information flow. To address this issue, PANet adds an extra bottom-up path aggregation network, as shown in Figure (b) above.
Many other papers, e.g. NAS-FPN, have also studied cross-scale connections for capturing better semantics. In short, the game is all about the connections linking low-level features to high-level features and vice versa. On top of PANet, BiFPN makes three further optimizations :-
- Remove Nodes that only have one input edge. If a node has only one input edge with no feature fusion, then it will have less contribution to the feature network that aims at fusing different features.
- Add an extra edge from the original input to output node if they are at the same level, in order to fuse more features without adding much cost.
- Treat each bidirectional (top-down & bottom-up) path as one feature network layer, and repeat the same layer multiple times to enable more high-level feature fusion.
Weighted Feature Fusion :
Previous feature fusion methods treat all input features equally, without distinction. However, the authors observe that since different input features are at different resolutions, they usually contribute to the output feature unequally. To address this, they propose adding an additional weight for each input during feature fusion and letting the network learn the importance of each input feature. Based on this idea, three weighted-fusion approaches are considered:
Unbounded fusion :
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Unbounded-fusion.png)
where wi is a learnable weight that can be a scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel). Since unbounded weights can make training unstable, weight normalization is used to bound the value range of each weight.
Softmax-based fusion :
An intuitive idea is to apply softmax to the weights, so that they are normalized into probabilities in the range 0 to 1 representing the importance of each input. However, the extra softmax leads to a significant slowdown on GPU hardware.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Softmax-based-fusion.png)
Fast normalized fusion :
To minimize the extra latency cost, we further propose a fast fusion approach.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Fast-normalized-fusion.png)
where wi >= 0 is ensured by applying a ReLU after each wi, and ε = 0.0001 is a small value to avoid numerical instability. Each normalized weight again falls between 0 and 1, but since there is no softmax operation here, it is much more efficient. The authors' ablation study shows this fast fusion approach achieves learning behavior and accuracy very similar to softmax-based fusion, while running up to 30% faster on GPUs.
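The two fusion schemes can be compared on a scalar toy example. In the real network the inputs are feature tensors and the weights are learned per input; the arithmetic, however, is the same:

```python
import math

def softmax_fusion(weights, features):
    """Softmax-based fusion: weights normalized into probabilities via softmax."""
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    return sum(e / total * f for e, f in zip(exps, features))

def fast_normalized_fusion(weights, features, eps=1e-4):
    """Fast normalized fusion: ReLU each weight, then divide by their sum + eps."""
    relu_w = [max(0.0, w) for w in weights]
    total = sum(relu_w) + eps
    return sum(w / total * f for w, f in zip(relu_w, features))

# Two scalar "features" fused with learnable weights:
feats, w = [1.0, 3.0], [0.5, 1.5]
print(softmax_fusion(w, feats))          # ~2.46
print(fast_normalized_fusion(w, feats))  # ~2.50, no exp() needed
```

Both keep the effective weights in [0, 1]; fast normalized fusion simply avoids the exponentials, which is where the GPU speedup comes from.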
The final BiFPN integrates both the bidirectional cross-scale connections and the fast normalized fusion.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/BiFPN-two-fused-features-1.png)
where P6td is the intermediate feature at level 6 on the top-down pathway, and P6out is the output feature at level 6 on the bottom-up pathway.
Now let's look at the EfficientDet architecture.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/EfficientDet-architecture.png)
Image Credit : https://arxiv.org/abs/1911.09070
EfficientDet Architecture :
EfficientDet detectors are single-shot detectors much like SSD and RetinaNet. The backbone networks are ImageNet pretrained EfficientNets. The proposed BiFPN serves as the feature network, which takes level 3–7 features {P3, P4, P5, P6, P7} from the backbone network and repeatedly applies top-down and bottom-up bidirectional feature fusion. These fused features are fed to a class and box network to produce object class and bounding box predictions respectively. The class and box network weights are shared across all levels of features.
We have already seen with EfficientNets that scaling all dimensions yields much better performance, and we would like to do the same for the EfficientDet family. Previous works in object detection scale only the backbone network or the FPN layers to improve accuracy, which is limiting because it focuses on a single dimension of the detector. The authors therefore proposed a new compound scaling method for object detection, which uses a simple compound coefficient φ to jointly scale up all dimensions of the backbone network, BiFPN network, class/box network, and resolution.
Object detectors have many more scaling dimensions than image classification models, so a grid search over all of them would be prohibitively expensive. Therefore, the authors used a heuristic-based scaling approach, while still following the main idea of jointly scaling up all dimensions.
- Backbone network: the same width/depth scaling coefficients as EfficientNet-B0 to B6 are used, so that the ImageNet-pretrained checkpoints can be reused.
- BiFPN network: the width (#channels) grows exponentially, as in EfficientNets, while the depth (#layers) increases linearly, since depth must be rounded to small integers. A grid search found 1.35 to be the best width scale factor.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/BiFPN-Equation-1.png)
- Box/class prediction network: the width is kept the same as the BiFPN, but the depth (#layers) is increased linearly.
![](http://www.cellstrat.com/wp-content/uploads/2020/03/BiFPN-Equation-2.png)
- Input image resolution: since feature levels 3–7 are used in the BiFPN, the input resolution must be divisible by 2^7 = 128, so resolutions are increased linearly using the equation:
![](http://www.cellstrat.com/wp-content/uploads/2020/03/BiFPN-Equation-3.png)
Now, using equations (1), (2), and (3) with different values of φ, we can go from EfficientDet-D0 (φ = 0) to EfficientDet-D6 (φ = 6), as shown in Table 1 below. Models scaled up with φ >= 7 could not fit in memory without changing the batch size or other settings, so the authors expanded D6 to D7 by enlarging only the input size while keeping all other dimensions the same, allowing the same training settings to be used for all models. Here is a table summarizing all these configs:
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Table-1-1.png)
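The four scaling rules above can be combined into a small helper that reproduces the structure of Table 1. Note this is a sketch: the paper additionally rounds the BiFPN channel counts to hardware-friendly values (e.g. 88 rather than 86 for D1), so the widths below differ slightly from the published table:

```python
def efficientdet_config(phi):
    """BiFPN width/depth, box-class depth and input resolution for coefficient phi."""
    bifpn_width = int(round(64 * (1.35 ** phi)))  # channels grow exponentially
    bifpn_depth = 3 + phi                          # layers grow linearly
    box_class_depth = 3 + phi // 3                 # D_class = 3 + floor(phi / 3)
    resolution = 512 + 128 * phi                   # stays divisible by 2^7 = 128
    return bifpn_width, bifpn_depth, box_class_depth, resolution

for phi in range(7):  # D0 .. D6
    w, d, dc, r = efficientdet_config(phi)
    print(f"D{phi}: input {r}, BiFPN {w} ch x {d} layers, box/class depth {dc}")
```

For D0 this gives input 512, BiFPN width 64, depth 3, and box/class depth 3, matching the table's starting point.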
The chart below shows how EfficientDet outperforms other SOTA models :-
![](http://www.cellstrat.com/wp-content/uploads/2020/03/Model-FLOPs-vs-COCO-accuracy.png)
Source Credit :https://arxiv.org/abs/1911.09070
The results of various EfficientDet models :-
![](http://www.cellstrat.com/wp-content/uploads/2020/03/EfficientDet-performance-1024x506.png)
Interested in learning AI/ML from India's No. 1 AI Lab ? Then attend our AI Lab meetup this Saturday 28th Mar 2020 in BLR. Please RSVP below to attend :-
BLR AI Lab :
Register : https://bit.ly/33QJhpB
Topic : DenseNet, Deep Learning Inference Accelerators
Date : Saturday 28th Mar 2020, 10:30 AM – 5:00 PM
Presenters : Darshan C G, Abdus Samad
See you this Saturday for the AI Lab meetup in BLR ! Let's disrupt the world with AI !
Questions ? Call me at +91-9742800566 !
Best Regards,
Vivek Singhal
Co-Founder & Chief Data Scientist, CellStrat
+91-9742800566