Research/Blog
Pose Estimation with OpenPose
- June 9, 2020
- Posted by: vsinghal
- Category: Computer Vision Retail
#CellStratAILab #disrupt4.0 #WeCreateAISuperstars #WhereLearningNeverStops
Last Saturday, our AI Lab Researcher Niraj Kale presented an excellent session on OpenPose, an algorithm to efficiently detect the 2D pose of multiple people in an image.
A pose skeleton represents the orientation of a person in a graphical format. Each coordinate in the skeleton is known as a keypoint, and a valid connection between two keypoints is known as a limb.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Sample-keypoints.png)
Image Source : https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b
![](https://www.cellstrat.com/wp-content/uploads/2020/06/multi-purpose-pose.png)
Image Source : https://arxiv.org/pdf/1611.08050.pdf
Applications of Pose Estimation :-
- Activity Recognition
- Detect if person has fallen
- Teach workout regimes, sports techniques, dance activities
- Understanding full body sign language
- Security and surveillance
- Motion capture and Augmented reality
- CGI applications in movies
- Training robots
- Robots can be made to follow the trajectory of a human pose skeleton performing an action
- Motion tracking in gaming consoles
Approaches for Multi-person Pose Estimation :-
- Top Down approach
- Detect person first
- Estimate the parts
- Calculate the pose for each person
- Bottom up approach
- Detect all parts in an image
- Group parts belonging to each person
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Man-and-Woman.png)
Image Source : https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b
Drawbacks of Top down approach :-
- If the person detector fails (e.g., for partially visible people or people in close proximity), there is no way to recover from the failure
- Runtime is proportional to the number of people in the image
Architecture of OpenPose :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Architectue.png)
Each stage in the first branch predicts confidence maps St, and each stage in the second branch predicts PAFs Lt. After each stage, the predictions from the two branches, along with the image features, are concatenated for the next stage.
Image Source : https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b
Initially, the feature map F is extracted by a VGG-19 network. F is then input to two parallel branches, B1 and B2.
The first branch predicts a set of confidence maps, with each map representing a particular part of the human pose skeleton.
The second branch predicts a set of PAF (part affinity fields) which represents the degree of association between parts.
Simultaneously inferring these bottom-up representations of detection and association encodes enough global context for a greedy parse to achieve high-quality results.
Steps involved in human pose estimation using OpenPose :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Steps-involved.png)
Image Source : https://arxiv.org/pdf/1611.08050.pdf
The figure above illustrates the overall pipeline of the OpenPose method.
The system takes, as input, a color image of size w × h (Fig. a) and produces the 2D locations of anatomical keypoints for each person in the image (Fig. e).
First, a feedforward network predicts a set of 2D confidence maps S of body part locations (Fig. b) and a set of 2D vector fields L of part affinity fields (PAFs), which encode the degree of association between parts (Fig. c).
The set S = (S1, S2, …, SJ) has J confidence maps, one per part, where Sj ∈ ℝw×h, j ∈ {1…J}. The set L = (L1, L2, …, LC) has C vector fields, one per limb, where Lc ∈ ℝw×h×2, c ∈ {1…C}. Part pairs are referred to as limbs for clarity, even though some pairs are not human limbs (e.g., the face). Each image location in Lc encodes a 2D vector.
Finally, the confidence maps and the PAFs are parsed by greedy inference (Fig. d) to output the 2D keypoints for all people in the image.
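The input/output contract of this pipeline can be sketched with placeholder shapes. This is an illustrative sketch only, not the real network: the `predict` function is a stand-in, and the resolution and the values J = 19, C = 19 for the COCO model are assumptions taken from the discussion below.

```python
import numpy as np

# Assumed: J = 19 confidence maps and C = 19 limb PAFs, as in the COCO model.
J, C = 19, 19
w, h = 368, 368  # a typical input resolution (assumption)

image = np.zeros((h, w, 3), dtype=np.float32)  # input color image (Fig. a)

# Stand-in for the feed-forward network: the real model produces
# J confidence maps S and C two-channel part affinity fields L.
def predict(image):
    S = np.zeros((h, w, J), dtype=np.float32)      # confidence maps (Fig. b)
    L = np.zeros((h, w, 2 * C), dtype=np.float32)  # PAFs, (x, y) per limb (Fig. c)
    return S, L

S, L = predict(image)
print(S.shape, L.shape)  # (368, 368, 19) (368, 368, 38)
```

The PAF tensor has 2C channels because each limb's field stores a 2D vector (x and y components) at every pixel.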
New Architecture :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/New-architecture.png)
The figure above shows the Architecture of the multi-stage CNN. The first set of stages predicts PAFs Lt, while the last set predicts confidence maps St. The predictions of each stage and their corresponding image features are concatenated for each subsequent stage. Convolutions of kernel size 7 from the original approach (original architecture above) are replaced with 3 layers of convolutions of kernel 3 which are concatenated at their end.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Right-forearm.png)
Image Source : https://arxiv.org/pdf/1812.08008.pdf
Confidence Maps :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Confidence-maps.png)
Keypoints ID for COCO dataset :-
![](http://www.cellstrat.com/wp-content/uploads/2020/06/Keypoints-IDs-1.png)
Example from COCO dataset :-
S will have elements S1, S2, S3, …, S19. S1 corresponds to the confidence map for keypoint ID 0, which refers to the nose. Such a confidence map might look as follows.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Example-COCO-dataset.png)
The figure above is a very simplified diagram of a single confidence map, where each cell in the table corresponds to a pixel in the original image of dimensions w x h. The value in each cell represents the confidence that a nose is present at that pixel.
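Reading a keypoint location out of such a map amounts to finding its maximum. A minimal sketch with a toy 5x5 map (the values are made up for illustration):

```python
import numpy as np

# Toy 5x5 confidence map for the nose (keypoint ID 0); values are confidences.
conf_map = np.array([
    [0.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.5, 0.2, 0.0],
    [0.1, 0.5, 0.9, 0.5, 0.1],
    [0.0, 0.2, 0.5, 0.2, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.0],
])

# The detected keypoint is the location of the maximum confidence.
y, x = np.unravel_index(np.argmax(conf_map), conf_map.shape)
print((x, y), conf_map[y, x])  # (2, 2) 0.9
```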
Part Affinity Field (PAF) Maps :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/PAF-Maps.png)
Mathematical expression for the set L
C, the total number of limbs, depends on the dataset that OpenPose is trained on. For the COCO dataset, C = 19. The figure below shows the different part pairs.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/CocoPairs.png)
An array of tuples, where each tuple represents a pair of body-part IDs.
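For concreteness, here is one common definition of this array as used in popular OpenPose reimplementations (the exact pair list and grouping comments are taken from those reimplementations and should be checked against the figure above):

```python
# One common definition of the COCO part pairs (19 limbs); each tuple is a
# (part_id, part_id) connection between two of the 18 COCO keypoints.
COCO_PAIRS = [
    (1, 2), (1, 5), (2, 3), (3, 4), (5, 6), (6, 7),        # arms and shoulders
    (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13),  # torso and legs
    (1, 0), (0, 14), (14, 15), (0, 16), (16, 17),          # head
    (2, 16), (5, 17),                                      # ears to shoulders
]
print(len(COCO_PAIRS))  # 19
```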
Simultaneous detection and association :-
- The initial stage is a fine-tuned VGG-19 network.
- This network generates feature maps F that are input to the first stage.
- The next stage generates part affinity fields (PAFs) L1 = φ1(F), where φ1 refers to the CNN for inference at Stage 1.
- In each subsequent stage, the predictions from the previous stage and the original image features F are concatenated and used to produce refined predictions,
![](https://www.cellstrat.com/wp-content/uploads/2020/06/refined-predictions.png)
where φt refers to the CNNs for inference at Stage t, and TP to the number of total PAF stages.
- After TP iterations, the process is repeated for confidence map detection, starting from the most recent PAF prediction,
![](https://www.cellstrat.com/wp-content/uploads/2020/06/updated-PAF-prediction.png)
where ρt refers to the CNNs for inference at Stage t, and TC to the number of total confidence map stages.
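The stage-wise refinement described above can be sketched as a loop. This is a shape-level sketch, assuming feature and output resolutions for illustration; `phi` and `rho` are stand-ins for the stage CNNs named in the text, not real models.

```python
import numpy as np

h, w, J, C = 46, 46, 19, 19  # assumed output resolution and COCO sizes

F = np.zeros((h, w, 128), dtype=np.float32)  # VGG feature maps (channel count assumed)

def phi(x):   # stand-in for the PAF CNN at each stage
    return np.zeros((h, w, 2 * C), dtype=np.float32)

def rho(x):   # stand-in for the confidence-map CNN at each stage
    return np.zeros((h, w, J), dtype=np.float32)

T_P, T_C = 4, 2                                  # number of PAF / confidence stages
L = phi(F)                                       # L1 = phi1(F)
for t in range(2, T_P + 1):
    L = phi(np.concatenate([F, L], axis=-1))     # Lt = phit(F, L(t-1))
S = rho(np.concatenate([F, L], axis=-1))         # first confidence stage uses final PAF
for t in range(2, T_C + 1):
    S = rho(np.concatenate([F, L, S], axis=-1))  # refine with previous S as well
print(L.shape, S.shape)  # (46, 46, 38) (46, 46, 19)
```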
Loss function :-
An L2 loss is computed between the estimated predictions and the ground-truth maps and fields.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Loss-function.png)
- The notation p represents a single pixel location in a w x h image.
- The * notation next to S and L denotes the ground truth.
- The output of S(p) is a 1-dimensional vector consisting of the confidence score for a particular body part j at image location p.
- The output of L(p) is a 2-dimensional vector consisting of the directional vector for a particular limb c at image location p.
- In the OpenPose paper, J, the total number of body parts, is 19. Likewise, C, the total number of "limbs" or part-to-part connections, is 19.
- W(p) represents the weighting function mentioned previously: W(p) = 0 when the annotation is missing at image location p. This mask avoids penalizing true-positive predictions during training.
The overall Loss function looks like this :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Overall-Loss.png)
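A minimal sketch of this masked L2 loss for one stage, assuming predicted and ground-truth confidence maps of shape (h, w, J) and a binary mask W of shape (h, w) that zeroes out unannotated locations (the function name and toy shapes are illustrative):

```python
import numpy as np

def stage_loss(S, S_star, W):
    sq = (S - S_star) ** 2            # per-pixel squared error
    return np.sum(W[..., None] * sq)  # masked sum over pixels and parts

h, w, J = 4, 4, 2
S = np.ones((h, w, J))                # toy prediction
S_star = np.zeros((h, w, J))          # toy ground truth
W = np.ones((h, w))
W[0, 0] = 0                           # annotation missing at pixel (0, 0)
print(stage_loss(S, S_star, W))  # 30.0 : 15 unmasked pixels * 2 parts * 1.0
```

The same form applies to the PAF branch, with the squared error summed over the two vector components of each limb field.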
Multi-person :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Part-association-strategies.png)
(a) The body part detection candidates (red and blue dots) for two body part types and all connection candidates (grey lines).
(b) The connection results using the midpoint (yellow dots) representation: correct connections (black lines) and incorrect connections (green lines) that also satisfy the incidence constraint.
(c) The results using PAFs (yellow arrows). By encoding position and orientation over the support of the limb, PAFs eliminate false associations.
Image Source : https://arxiv.org/pdf/1812.08008.pdf
Confidence Maps for part detection :-
We first generate individual confidence maps S*j,k for each person k. Let xj,k ∈ R2 be the groundtruth position of body part j for person k in the image. The value at location p ∈ R2 in S*j,k is defined as,
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Sjk-formula.png)
where σ controls the spread of the peak. The groundtruth confidence map that the network is trained to predict is an aggregation of the individual confidence maps via a max operator,
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Sj-formula.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Gaussian-curve.png)
- The maximum of the confidence maps is used (rather than the average)
- This keeps the peaks of nearby keypoints distinct
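The Gaussian peak and the max aggregation can be sketched directly from the formulas above (the grid size, σ, and keypoint locations are illustrative):

```python
import numpy as np

# Ground-truth confidence map for one part: a Gaussian peak per person,
# aggregated with a max (not a mean) so nearby peaks stay distinct.
def gaussian_map(h, w, center, sigma):
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return np.exp(-d2 / sigma ** 2)          # S*_{j,k}(p)

h, w, sigma = 20, 20, 2.0
persons = [(5, 10), (8, 10)]                 # two people, same part, close together
S_j = np.max([gaussian_map(h, w, c, sigma) for c in persons], axis=0)

# Both peaks survive at full height; averaging would have flattened them.
print(S_j[10, 5], S_j[10, 8])  # 1.0 1.0
```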
Part Affinity field for part association :-
Consider a single limb shown in the figure below. Let xj1,k and xj2,k be the groundtruth positions of body parts j1 and j2 from the limb c for person k in the image. If a point p lies on the limb, the value at L*c,k(p) is a unit vector that points from j1 to j2; for all other points, the vector is zero-valued.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/PAF-for-part-association.png)
To evaluate fL in the Overall Loss equation above during training, we define the groundtruth PAF, L*c,k, at an image point p as
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Lck-formula.png)
Here
![](https://www.cellstrat.com/wp-content/uploads/2020/06/v-vector.png)
is the unit vector in the direction of the limb.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Equation-3.png)
The groundtruth part affinity field averages the affinity fields of all people in the image,
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Equation-4.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Equation-5.png)
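The association score between two detected parts is the line integral of the PAF along the segment joining them, approximated in practice by sampling points along the segment. A sketch of that approximation (the field, points, and sample count are illustrative; real implementations also interpolate the field):

```python
import numpy as np

# PAF association score between candidate parts d1 and d2: sample points
# along the segment and average the dot product of the PAF with the unit
# vector from d1 to d2 (a discrete approximation of the line integral).
def paf_score(L_c, d1, d2, n_samples=10):
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    v = d2 - d1
    norm = np.linalg.norm(v)
    if norm == 0:
        return 0.0
    v = v / norm                                  # unit vector d1 -> d2
    score = 0.0
    for u in np.linspace(0.0, 1.0, n_samples):
        p = (1 - u) * d1 + u * d2                 # point on the segment
        x, y = int(round(p[0])), int(round(p[1]))
        score += np.dot(L_c[y, x], v)             # L_c(p(u)) . v
    return score / n_samples

h, w = 10, 10
L_c = np.zeros((h, w, 2))
L_c[5, 2:8] = [1.0, 0.0]               # PAF points in +x along row y = 5
print(paf_score(L_c, (2, 5), (7, 5)))  # 1.0 : perfectly aligned limb
```

A candidate pair aligned with the field scores near 1, while a pair cutting across or against the field scores low or negative, which is what lets the parser reject false associations.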
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Graph-matching.png)
(a) Original image with part detections.
(b) K-partite graph.
(c) Tree structure. (d) A set of
bipartite graphs.
Multi-person parsing using PAFs :-
Initially, the set of body part detection candidates DJ is obtained:
DJ = { djm : for j ∈ {1…J}, m ∈ {1…Nj} }
where Nj is the number of candidates for part j, and djm ∈ ℝ2 is the location of the m-th detection candidate for body part j.
These body parts need to be associated with the body parts of the same person.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Equation-7.png)
where Ec is the overall weight of the matching for limb type c, Zc is the subset of Z for limb type c, and Emn is the part affinity between parts dj1m and dj2n defined in Eq. 10 of the paper. The two constraints above enforce that no two edges share a node, i.e., no two limbs of the same type (e.g., left forearm) share a part. The Hungarian algorithm can be used to obtain the optimal matching.
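A minimal sketch of one such per-limb bipartite matching, using SciPy's implementation of the Hungarian algorithm (the toy score matrix is made up for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: candidates for part j1 (e.g., necks); columns: candidates for
# part j2 (e.g., right shoulders); entries: PAF association scores E_mn.
E = np.array([
    [0.9, 0.1],  # neck 0 pairs best with shoulder 0
    [0.2, 0.8],  # neck 1 pairs best with shoulder 1
])

# linear_sum_assignment minimizes cost, so negate E to maximize total
# affinity; the result shares no rows or columns (no shared parts).
rows, cols = linear_sum_assignment(-E)
pairs = [(int(r), int(c)) for r, c in zip(rows, cols)]
print(pairs)  # [(0, 0), (1, 1)]
```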
Optimization of body part detection :-
Initially, a minimal number of edges is chosen to obtain a spanning-tree skeleton, as shown in the graph matching figure above (c).
The matching problem is then decomposed into a set of bipartite matching subproblems, with the matching in adjacent tree nodes determined independently, as shown in the graph matching figure above (d).
This greedy inference gives a good approximation of the global solution at a fraction of the computational cost.
With these two relaxations, the optimization is decomposed simply as:
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Equation-8.png)
Common failure cases :-
OpenPose fails in these examples :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Failure-cases.png)
(a) rare pose or appearance,
(b) missing or false parts detection,
(c) overlapping parts, i.e., part detections shared by two persons,
(d) wrong connection associating parts from two persons,
(e-f) false positives on statues or animals.
Image Source : https://arxiv.org/pdf/1812.08008.pdf