Research/Blog
Pose Estimation with OpenPose
- June 9, 2020
- Posted by: vsinghal
- Category: Computer Vision Retail
#CellStratAILab #disrupt4.0 #WeCreateAISuperstars #WhereLearningNeverStops
Last Saturday, our AI Lab Researcher Niraj Kale presented an excellent session on OpenPose, an algorithm to efficiently detect the 2D pose of multiple people in an image.
A pose skeleton represents the orientation of a person in a graphical format. Each coordinate in the skeleton is known as a keypoint, and a valid connection between two keypoints is known as a limb.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Sample-keypoints.png)
Image Source : https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b
![](https://www.cellstrat.com/wp-content/uploads/2020/06/multi-purpose-pose.png)
Image Source : https://arxiv.org/pdf/1611.08050.pdf
Applications of Pose Estimation :-
- Activity Recognition
- Detect if person has fallen
- Teach workout regimes, sports techniques, dance activities
- Understanding full body sign language
- Security and surveillance
- Motion capture and Augmented reality
- CGI applications in movies
- Training robots
- Robots can be made to follow the trajectory of a human pose skeleton performing an action
- Motion tracking in gaming consoles
Approaches for Multi-person Pose Estimation :-
- Top Down approach
- Detect person first
- Estimate the parts
- Calculate the pose for each person
- Bottom up approach
- Detect all parts in an image
- Group parts belonging to each person
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Man-and-Woman.png)
Image Source : https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b
Drawbacks of Top down approach :-
- If the person detector fails (e.g., for partially visible people or people in close proximity), there is no way to recover from the failure
- Runtime is proportional to the number of people in the image
Architecture of OpenPose :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Architectue.png)
Each stage in the first branch predicts confidence maps St, and each stage in the second branch predicts PAFs Lt. After each stage, the predictions from the two branches, along with the image features, are concatenated for the next stage.
Image Source : https://medium.com/beyondminds/an-overview-of-human-pose-estimation-with-deep-learning-d49eb656739b
Initially, the feature map F is extracted by a VGG-19 network. F is then input to two parallel branches, B1 and B2.
The first branch predicts a set of confidence maps, with each map representing a particular part of the human pose skeleton.
The second branch predicts a set of PAF (part affinity fields) which represents the degree of association between parts.
Simultaneously inferring these bottom-up representations of detection and association encodes enough global context for a greedy parse to achieve high-quality results.
Steps involved in human pose estimation using OpenPose :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Steps-involved.png)
Image Source : https://arxiv.org/pdf/1611.08050.pdf
The figure above illustrates the overall pipeline of the OpenPose method.
The system takes, as input, a color image of size w × h (Fig. a) and produces the 2D locations of anatomical keypoints for each person in the image (Fig. e).
First, a feedforward network predicts a set of 2D confidence maps S of body part locations (Fig. b) and a set of 2D vector fields L of part affinity fields (PAFs), which encode the degree of association between parts (Fig. c).
The set S = (S1, S2, …, SJ) has J confidence maps, one per part, where Sj ∈ ℝw×h, j ∈ {1…J}. The set L = (L1, L2, …, LC) has C vector fields, one per limb, where Lc ∈ ℝw×h×2, c ∈ {1…C}. Part pairs are referred to as limbs for clarity, even though some pairs are not human limbs (e.g., the face). Each image location in Lc encodes a 2D vector.
Finally, the confidence maps and the PAFs are parsed by greedy inference (Fig. d) to output the 2D keypoints for all people in the image.
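The input/output contract of this pipeline can be sketched with placeholder shapes. This is an illustrative sketch only, not the real network: the `predict` function is a stand-in, and the resolution and the values J = 19, C = 19 for the COCO model are assumptions taken from the discussion below.

```python
import numpy as np

# Assumed: J = 19 confidence maps and C = 19 limb PAFs, as in the COCO model.
J, C = 19, 19
w, h = 368, 368  # a typical input resolution (assumption)

image = np.zeros((h, w, 3), dtype=np.float32)  # input color image (Fig. a)

# Stand-in for the feed-forward network: the real model produces
# J confidence maps S and C two-channel part affinity fields L.
def predict(image):
    S = np.zeros((h, w, J), dtype=np.float32)      # confidence maps (Fig. b)
    L = np.zeros((h, w, 2 * C), dtype=np.float32)  # PAFs, (x, y) per limb (Fig. c)
    return S, L

S, L = predict(image)
print(S.shape, L.shape)  # (368, 368, 19) (368, 368, 38)
```

The PAF tensor has 2C channels because each limb's field stores a 2D vector (x and y components) at every pixel.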
New Architecture :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/New-architecture.png)
The figure above shows the Architecture of the multi-stage CNN. The first set of stages predicts PAFs Lt, while the last set predicts confidence maps St. The predictions of each stage and their corresponding image features are concatenated for each subsequent stage. Convolutions of kernel size 7 from the original approach (original architecture above) are replaced with 3 layers of convolutions of kernel 3 which are concatenated at their end.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Right-forearm.png)
Image Source : https://arxiv.org/pdf/1812.08008.pdf
Confidence Maps :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Confidence-maps.png)
Keypoints ID for COCO dataset :-
![](http://www.cellstrat.com/wp-content/uploads/2020/06/Keypoints-IDs-1.png)
Example from COCO dataset :-
S will have elements S1, S2, S3, …, S19. S1 corresponds to the confidence map for keypoint ID 0, which refers to the nose. Such a confidence map might look as follows.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Example-COCO-dataset.png)
The figure above is a very simplified diagram of a single confidence map, where each cell in the table corresponds to a pixel in the original image of dimensions w x h. The value in each cell represents the confidence that a nose is present at that pixel.
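Reading a keypoint location out of such a map amounts to finding its maximum. A minimal sketch with a toy 5x5 map (the values are made up for illustration):

```python
import numpy as np

# Toy 5x5 confidence map for the nose (keypoint ID 0); values are confidences.
conf_map = np.array([
    [0.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.5, 0.2, 0.0],
    [0.1, 0.5, 0.9, 0.5, 0.1],
    [0.0, 0.2, 0.5, 0.2, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.0],
])

# The detected keypoint is the location of the maximum confidence.
y, x = np.unravel_index(np.argmax(conf_map), conf_map.shape)
print((x, y), conf_map[y, x])  # (2, 2) 0.9
```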
Part Affinity Field (PAF) Maps :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/PAF-Maps.png)
Mathematical expression for the set L
C, the total number of limbs, depends on the dataset that OpenPose is trained on. For the COCO dataset, C = 19. The figure below shows the different part pairs.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/CocoPairs.png)
An array of tuples, where each tuple represents a pair of body-part IDs.
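For concreteness, here is one common definition of this array as used in popular OpenPose reimplementations (the exact pair list and grouping comments are taken from those reimplementations and should be checked against the figure above):

```python
# One common definition of the COCO part pairs (19 limbs); each tuple is a
# (part_id, part_id) connection between two of the 18 COCO keypoints.
COCO_PAIRS = [
    (1, 2), (1, 5), (2, 3), (3, 4), (5, 6), (6, 7),        # arms and shoulders
    (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13),  # torso and legs
    (1, 0), (0, 14), (14, 15), (0, 16), (16, 17),          # head
    (2, 16), (5, 17),                                      # ears to shoulders
]
print(len(COCO_PAIRS))  # 19
```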
Simultaneous detection and association :-
- The initial stage is a fine-tuned VGG-19 network.
- This network generates feature maps F that are input to the first stage.
- The next stage generates part affinity fields (PAFs) L1 = φ1(F), where φ1 refers to the CNN for inference at Stage 1.
- In each subsequent stage, the predictions from the previous stage and the original image features F are concatenated and used to produce refined predictions,
![](https://www.cellstrat.com/wp-content/uploads/2020/06/refined-predictions.png)
where φt refers to the CNNs for inference at Stage t, and TP to the number of total PAF stages.
- After TP iterations, the process is repeated for confidence map detection, starting from the most recent PAF prediction,
![](https://www.cellstrat.com/wp-content/uploads/2020/06/updated-PAF-prediction.png)
where ρt refers to the CNNs for inference at Stage t, and TC to the number of total confidence map stages.
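The stage-wise refinement described above can be sketched as a loop. This is a shape-level sketch, assuming feature and output resolutions for illustration; `phi` and `rho` are stand-ins for the stage CNNs named in the text, not real models.

```python
import numpy as np

h, w, J, C = 46, 46, 19, 19  # assumed output resolution and COCO sizes

F = np.zeros((h, w, 128), dtype=np.float32)  # VGG feature maps (channel count assumed)

def phi(x):   # stand-in for the PAF CNN at each stage
    return np.zeros((h, w, 2 * C), dtype=np.float32)

def rho(x):   # stand-in for the confidence-map CNN at each stage
    return np.zeros((h, w, J), dtype=np.float32)

T_P, T_C = 4, 2                                  # number of PAF / confidence stages
L = phi(F)                                       # L1 = phi1(F)
for t in range(2, T_P + 1):
    L = phi(np.concatenate([F, L], axis=-1))     # Lt = phit(F, L(t-1))
S = rho(np.concatenate([F, L], axis=-1))         # first confidence stage uses final PAF
for t in range(2, T_C + 1):
    S = rho(np.concatenate([F, L, S], axis=-1))  # refine with previous S as well
print(L.shape, S.shape)  # (46, 46, 38) (46, 46, 19)
```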
Loss function :-
An L2 loss is computed between the estimated predictions and the ground-truth maps and fields.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Loss-function.png)
- The notation p represents a single pixel location in a w x h image.
- The * notation next to S and L denotes the ground truth.
- The output of S(p) is a 1-dimensional vector consisting of the confidence score for a particular body part j at image location p.
- The output of L(p) is a 2-dimensional vector consisting of the directional vector for a particular limb c at image location p.
- In the OpenPose paper, J, the total number of body parts, is 19. Likewise, C, the total number of "limbs" or part-to-part connections, is 19.
- W(p) represents the weighting function mentioned previously: W(p) = 0 when the annotation is missing at image location p. This mask avoids penalizing true-positive predictions during training.
The overall Loss function looks like this :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Overall-Loss.png)
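A minimal sketch of this masked L2 loss for one stage, assuming predicted and ground-truth confidence maps of shape (h, w, J) and a binary mask W of shape (h, w) that zeroes out unannotated locations (the function name and toy shapes are illustrative):

```python
import numpy as np

def stage_loss(S, S_star, W):
    sq = (S - S_star) ** 2            # per-pixel squared error
    return np.sum(W[..., None] * sq)  # masked sum over pixels and parts

h, w, J = 4, 4, 2
S = np.ones((h, w, J))                # toy prediction
S_star = np.zeros((h, w, J))          # toy ground truth
W = np.ones((h, w))
W[0, 0] = 0                           # annotation missing at pixel (0, 0)
print(stage_loss(S, S_star, W))  # 30.0 : 15 unmasked pixels * 2 parts * 1.0
```

The same form applies to the PAF branch, with the squared error summed over the two vector components of each limb field.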
Multi-person :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Part-association-strategies.png)
(a) The body part detection candidates (red and blue dots) for two body part types and all connection candidates (grey lines).
(b) The connection results using the midpoint (yellow dots) representation: correct connections (black lines) and incorrect connections (green lines) that also satisfy the incidence constraint.
(c) The results using PAFs (yellow arrows). By encoding position and orientation over the support of the limb, PAFs eliminate false associations.
Image Source : https://arxiv.org/pdf/1812.08008.pdf
Confidence Maps for part detection :-
We first generate individual confidence maps S*j,k for each person k. Let xj,k ∈ R2 be the groundtruth position of body part j for person k in the image. The value at location p ∈ R2 in S*j,k is defined as,
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Sjk-formula.png)
where σ controls the spread of the peak. The groundtruth confidence map that the network is trained to predict is an aggregation of the individual confidence maps via a max operator,
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Sj-formula.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Gaussian-curve.png)
- The maximum of the confidence maps is used (rather than the average)
- This keeps the peaks of nearby keypoints distinct
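The Gaussian peak and the max aggregation can be sketched directly from the formulas above (the grid size, σ, and keypoint locations are illustrative):

```python
import numpy as np

# Ground-truth confidence map for one part: a Gaussian peak per person,
# aggregated with a max (not a mean) so nearby peaks stay distinct.
def gaussian_map(h, w, center, sigma):
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return np.exp(-d2 / sigma ** 2)          # S*_{j,k}(p)

h, w, sigma = 20, 20, 2.0
persons = [(5, 10), (8, 10)]                 # two people, same part, close together
S_j = np.max([gaussian_map(h, w, c, sigma) for c in persons], axis=0)

# Both peaks survive at full height; averaging would have flattened them.
print(S_j[10, 5], S_j[10, 8])  # 1.0 1.0
```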
Part Affinity field for part association :-
Consider a single limb shown in the figure below. Let xj1,k and xj2,k be the groundtruth positions of body parts j1 and j2 from the limb c for person k in the image. If a point p lies on the limb, the value at L*c,k(p) is a unit vector that points from j1 to j2; for all other points, the vector is zero-valued.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/PAF-for-part-association.png)
To evaluate fL in the Overall Loss equation above during training, we define the groundtruth PAF, L*c,k, at an image point p as
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Lck-formula.png)
Here
![](https://www.cellstrat.com/wp-content/uploads/2020/06/v-vector.png)
is the unit vector in the direction of the limb.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Equation-3.png)
The groundtruth part affinity field averages the affinity fields of all people in the image,
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Equation-4.png)
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Equation-5.png)
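The association score between two detected parts is the line integral of the PAF along the segment joining them, approximated in practice by sampling points along the segment. A sketch of that approximation (the field, points, and sample count are illustrative; real implementations also interpolate the field):

```python
import numpy as np

# PAF association score between candidate parts d1 and d2: sample points
# along the segment and average the dot product of the PAF with the unit
# vector from d1 to d2 (a discrete approximation of the line integral).
def paf_score(L_c, d1, d2, n_samples=10):
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    v = d2 - d1
    norm = np.linalg.norm(v)
    if norm == 0:
        return 0.0
    v = v / norm                                  # unit vector d1 -> d2
    score = 0.0
    for u in np.linspace(0.0, 1.0, n_samples):
        p = (1 - u) * d1 + u * d2                 # point on the segment
        x, y = int(round(p[0])), int(round(p[1]))
        score += np.dot(L_c[y, x], v)             # L_c(p(u)) . v
    return score / n_samples

h, w = 10, 10
L_c = np.zeros((h, w, 2))
L_c[5, 2:8] = [1.0, 0.0]               # PAF points in +x along row y = 5
print(paf_score(L_c, (2, 5), (7, 5)))  # 1.0 : perfectly aligned limb
```

A candidate pair aligned with the field scores near 1, while a pair cutting across or against the field scores low or negative, which is what lets the parser reject false associations.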
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Graph-matching.png)
(a) Original image with part detections.
(b) K-partite graph.
(c) Tree structure. (d) A set of
bipartite graphs.
Multi-person parsing using PAFs :-
Initially, the set of body part detection candidates DJ is obtained:
DJ = { djm : for j ∈ {1…J}, m ∈ {1…Nj} }
where Nj is the number of candidates for part j, and djm ∈ ℝ2 is the location of the m-th detection candidate for body part j.
These body parts need to be associated with the body parts of the same person.
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Equation-7.png)
where Ec is the overall weight of the matching for limb type c, Zc is the subset of Z for limb type c, and Emn is the part affinity between parts dj1m and dj2n defined in Eq. 10 of the paper. The two constraints above enforce that no two edges share a node, i.e., no two limbs of the same type (e.g., left forearm) share a part. The Hungarian algorithm can be used to obtain the optimal matching.
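A minimal sketch of one such per-limb bipartite matching, using SciPy's implementation of the Hungarian algorithm (the toy score matrix is made up for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: candidates for part j1 (e.g., necks); columns: candidates for
# part j2 (e.g., right shoulders); entries: PAF association scores E_mn.
E = np.array([
    [0.9, 0.1],  # neck 0 pairs best with shoulder 0
    [0.2, 0.8],  # neck 1 pairs best with shoulder 1
])

# linear_sum_assignment minimizes cost, so negate E to maximize total
# affinity; the result shares no rows or columns (no shared parts).
rows, cols = linear_sum_assignment(-E)
pairs = [(int(r), int(c)) for r, c in zip(rows, cols)]
print(pairs)  # [(0, 0), (1, 1)]
```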
Optimization of body part detection :-
Initially, a minimal number of edges is chosen to obtain a spanning-tree skeleton, as shown in the graph matching figure above (c).
The matching problem is then decomposed into a set of bipartite matching subproblems, with the matching in adjacent tree nodes determined independently, as shown in the graph matching figure above (d).
This greedy inference gives a good approximation of the global solution at a fraction of the computational cost.
With these two relaxations, the optimization is decomposed simply as:
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Equation-8.png)
Common failure cases :-
OpenPose fails in these examples :-
![](https://www.cellstrat.com/wp-content/uploads/2020/06/Failure-cases.png)
(a) rare pose or appearance,
(b) missing or false parts detection,
(c) overlapping parts, i.e., part detections shared by two persons,
(d) wrong connection associating parts from two persons,
(e-f) false positives on statues or animals.
Image Source : https://arxiv.org/pdf/1812.08008.pdf