Deep AnimalPose

A Generative Parametric Representation for 3D Quadrupedal Animal Pose Estimation

DU Yinwei, ZHANG Ziyan


Rotational views of the reconstructed animals: Giraffe, Koala, Cub, Alpaca, Lion, Panda.

Abstract

Videography provides valuable data for studying animal behavior, and noninvasive tracking of animals is the essential first step. This paper introduces a parametric approach for generating a massive synthetic dataset for quadrupeds, containing more than 10 million ground-truth 3D poses with corresponding 2D landmarks. The generated poses are guaranteed to be valid because they are constrained by a hierarchical skeletal model. Our dataset for single-image 3D pose estimation can be readily extended to videos. We propose two novel network architectures for 3D pose estimation, one for single images and the other for videos. We demonstrate strong, previously unseen results on tracking quadrupedal animal movements in video sequences while estimating the corresponding 3D pose sequences. We present quantitative evaluations and ablation studies, and further demonstrate our approach's generalizability by estimating 3D poses for significantly different body configurations and for vastly different postures where not all four legs are on the ground.

Hierarchical Model

Our hierarchical model contains 23 joints (some may be omitted depending on the species) and is shown in the following figure:
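To make the hierarchy concrete, below is a minimal sketch of forward kinematics over a parent-indexed skeleton of this kind. The joint names, parent indices, and joint count shown are illustrative placeholders, not the exact hierarchy in the figure.

```python
import numpy as np

# Illustrative parent indices for a quadruped skeleton (placeholders; the
# real 23-joint hierarchy is the one defined in the figure above).
PARENTS = [-1, 0, 1, 2,     # backbone chain; joint 0 is the root
           1, 4,            # front-left limb
           1, 6,            # front-right limb
           0, 8,            # hind-left limb
           0, 10,           # hind-right limb
           3]               # head, attached to the end of the backbone
# ... extend with more entries up to 23 joints for the full model

def forward_kinematics(rotations, offsets, parents=PARENTS):
    """Compose per-joint local rotations down the hierarchy to obtain
    global 3D joint positions.

    rotations: (J, 3, 3) local rotation matrix per joint
    offsets:   (J, 3) rest-pose offset of each joint from its parent
    """
    J = len(parents)
    glob_rot = [None] * J
    pos = np.zeros((J, 3))
    for j, p in enumerate(parents):
        if p < 0:                                   # root joint
            glob_rot[j] = rotations[j]
        else:
            glob_rot[j] = glob_rot[p] @ rotations[j]
            pos[j] = pos[p] + glob_rot[p] @ offsets[j]
    return pos
```

Because every joint position is derived from its parent, any pose produced this way respects the skeleton's connectivity, which is what makes the generated poses valid by construction.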

Range and Distribution of our Synthetic Dataset

The following figures show the range and distribution of our synthetic dataset for each joint connection. The left column shows the distribution, and the right column shows the corresponding position in the skeleton (marked by a red arrow). A sampling sketch follows the list of connections below.

Backbone
Forward Limb Left Joint 1
Forward Limb Left Joint 2
Forward Limb Right Joint 3
Forward Limb Right Joint 4
Head Joint
Left Eye
Right Eye
Tail
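As promised above, here is a minimal sketch of how such per-connection ranges can drive pose sampling; the joint names and numeric ranges are hypothetical stand-ins for the empirical distributions plotted in the figures.

```python
import numpy as np

# Hypothetical per-connection rotation ranges in radians (placeholders for
# the measured distributions shown above).
ANGLE_RANGES = {
    "backbone":           (-0.3, 0.3),
    "front_limb_left_1":  (-0.8, 0.8),
    "front_limb_left_2":  (-0.6, 0.6),
    "head":               (-0.5, 0.5),
    # ... one entry per joint connection in the skeleton
}

def sample_pose(rng=None):
    """Draw one pose by sampling every joint angle inside its valid range,
    so each sample stays within the anatomically plausible region."""
    if rng is None:
        rng = np.random.default_rng()
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in ANGLE_RANGES.items()}
```

Converting the sampled angles to rotation matrices and running them through the forward-kinematics sketch above yields a 3D pose; projecting it into an image gives the paired 2D landmarks, and repeating the process is one way a dataset at this scale can be generated.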

Framework Architecture

The following figure illustrates our framework architecture. In short, we take the landmarks of an animal in a video sequence as input, pass them through two separate deep neural networks, and output the predicted 3D skeletons.
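As a rough illustration of that two-stage flow, here is a sketch in PyTorch; the function and network names are introduced here for illustration, and the exact interfaces are assumptions.

```python
import torch

def predict_sequence(landmarks_2d, single_image_net, temporal_net):
    """Two-stage sketch: per-frame 2D-to-3D lifting, then temporal refinement.

    landmarks_2d: (T, K, 2) tensor of tracked 2D landmarks over T frames
    returns:      (T, J, 3) predicted 3D joint positions
    """
    T = landmarks_2d.shape[0]
    # Stage 1: lift each frame independently to a coarse 3D skeleton.
    per_frame = torch.stack([single_image_net(landmarks_2d[t].flatten())
                             for t in range(T)])               # (T, J*3)
    # Stage 2: refine the whole sequence for temporal coherence.
    refined = temporal_net(per_frame.unsqueeze(0)).squeeze(0)  # (T, J*3)
    return refined.view(T, -1, 3)
```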

Detailed architecture for our single-image network:
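The figure carries the exact layer specification; as a generic stand-in, a fully connected lifting network of the following shape is one plausible reading (the widths, depth, and activation here are assumptions, not the figure's values).

```python
import torch.nn as nn

class SingleImageLifter(nn.Module):
    """Fully connected network that lifts flattened 2D landmarks to 3D.
    Layer sizes are illustrative placeholders."""
    def __init__(self, num_kp=23, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_kp * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_kp * 3),   # x, y, z per joint
        )

    def forward(self, x):  # x: (..., num_kp * 2) flattened 2D landmarks
        return self.net(x)
```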

Detailed architecture for our temporal coherence network:
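Again the figure is authoritative; one common way to realize temporal coherence, shown purely as a sketch, is a stack of 1D convolutions over the time axis (the kernel size and width below are assumptions).

```python
import torch.nn as nn

class TemporalCoherenceNet(nn.Module):
    """Smooths per-frame 3D predictions with temporal 1D convolutions.
    Hyperparameters are illustrative placeholders."""
    def __init__(self, dim, hidden=256, kernel=5):
        super().__init__()
        pad = kernel // 2                      # keep sequence length fixed
        self.net = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel, padding=pad), nn.ReLU(),
            nn.Conv1d(hidden, dim, kernel, padding=pad),
        )

    def forward(self, x):                      # x: (B, T, dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)
```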

Single Frame Results

Here are the results of our framework on single-frame images (i.e., without the temporal network). The three columns show: the original image with the skeleton overlaid, the 2D projection of the skeleton, and the 3D skeleton (from a different viewing angle). All images in the following two sections are from the TigDog dataset.
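For reference, the middle column can be produced by projecting the predicted 3D skeleton back into the image plane; the weak-perspective camera below is an assumption for illustration, not necessarily the projection our renderer uses.

```python
import numpy as np

def project_weak_perspective(joints_3d, scale=1.0, trans=(0.0, 0.0)):
    """Weak-perspective projection: drop depth, then scale and translate
    the x-y coordinates into image space.

    joints_3d: (J, 3) predicted 3D joint positions
    returns:   (J, 2) 2D landmark positions for overlay
    """
    return scale * joints_3d[:, :2] + np.asarray(trans)
```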

Video Sequence Results