Rotational views of the above animals: Giraffe, Koala, Cub, Alpaca, Lion, Panda
Abstract
Videography provides valuable data for studying animal behaviors, where noninvasive tracking of animals is the essential first step. This paper introduces a parametric approach for generating a massive synthetic dataset for quadrupeds containing more than 10 million ground-truth 3D poses with corresponding 2D landmarks. The generated poses are valid because they are constrained by a hierarchical skeletal model. Our dataset for single-image 3D pose estimation can be readily extended to videos. We propose two novel network architectures for 3D pose estimation, one for single images and the other for videos. We demonstrate strong results, not previously shown, on tracking quadrupedal animal movements in video sequences while estimating the corresponding 3D pose sequences. We present quantitative evaluations and ablation studies, and further demonstrate our approach's generalizability by estimating 3D poses for significantly different body configurations and for vastly different postures in which not all four legs are on the ground.
Hierarchical Model
Our hierarchical model contains 23 joints (some joints may be omitted depending on the species) and is shown in the following figure:
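For illustration, here is a minimal sketch of how such a hierarchical skeleton can be represented and posed with forward kinematics. The parent indices, bone offsets, and rotation parameterization are assumptions for exposition, not the exact layout of our 23-joint model.

```python
# A minimal sketch of a hierarchical skeletal model: a parent-index tree plus
# fixed bone offsets, posed by composing per-joint local rotations down the tree.
import numpy as np

class Skeleton:
    def __init__(self, parents, offsets):
        # parents[i]: index of joint i's parent (-1 for the root)
        # offsets[i]: bone vector from the parent to joint i in the rest pose
        self.parents = parents
        self.offsets = np.asarray(offsets, dtype=np.float64)

    def forward_kinematics(self, rotations, root_position):
        """Compose per-joint local rotations (3x3 matrices) into global 3D joint positions."""
        n = len(self.parents)
        global_rot = [None] * n
        positions = np.zeros((n, 3))
        for i in range(n):
            if self.parents[i] == -1:
                global_rot[i] = rotations[i]
                positions[i] = root_position
            else:
                p = self.parents[i]
                global_rot[i] = global_rot[p] @ rotations[i]
                positions[i] = positions[p] + global_rot[p] @ self.offsets[i]
        return positions
```

Because every joint's position is derived from its parent through the tree, any pose produced this way keeps the bone lengths and connectivity of the skeleton, which is what makes the sampled poses valid.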
Range and Distribution of our Synthetic Dataset
The following figures show the range and distribution of our synthetic dataset for each joint connection. The left column shows the distribution and the right column shows the corresponding position in the skeleton (marked by a red arrow).
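As a rough sketch of how such per-joint ranges can drive pose generation, the snippet below samples Euler angles uniformly inside assumed joint limits and runs them through the Skeleton sketch above; the placeholder limits and uniform sampling stand in for the actual per-joint distributions shown in the figures.

```python
# Constrained pose sampling: draw each joint's Euler angles inside its allowed
# range, convert to rotation matrices, and pose the skeleton via forward kinematics.
import numpy as np
from scipy.spatial.transform import Rotation as R

def sample_pose(skeleton, angle_limits, rng=np.random.default_rng()):
    """angle_limits: one (lo, hi) pair of degree triples per joint."""
    rotations = []
    for lo, hi in angle_limits:
        euler = rng.uniform(lo, hi)          # stay inside the joint's allowed range
        rotations.append(R.from_euler('xyz', euler, degrees=True).as_matrix())
    return skeleton.forward_kinematics(rotations, root_position=np.zeros(3))
```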
Backbone
Framework Architecture
The following figure illustrates our framework architecture. In short, we take the landmarks of an animal in a video sequence as input, pass them through two separate deep neural networks, and output the predicted 3D skeletons (a code sketch follows the two architecture figures below).
Detailed architecture for our single-image network:
Detailed architecture for our temporal coherence network:
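The snippet below is a minimal PyTorch sketch of this two-stage design, assuming a fully connected lifting network for single frames and a 1D temporal convolutional network for sequence refinement; the layer widths, depths, and kernel sizes are placeholders, not the exact architectures shown in the figures.

```python
# Two-stage pipeline sketch: per-frame 2D->3D lifting, then temporal refinement
# over the whole sequence; sizes below are illustrative placeholders.
import torch
import torch.nn as nn

class SingleImageLifter(nn.Module):
    """Per-frame lifting: 2D landmarks (J x 2) -> 3D joints (J x 3)."""
    def __init__(self, num_joints=23, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_joints * 3),
        )

    def forward(self, x2d):                      # x2d: (B, J, 2)
        b, j, _ = x2d.shape
        return self.net(x2d.reshape(b, -1)).reshape(b, j, 3)

class TemporalCoherenceNet(nn.Module):
    """Sequence refinement: per-frame 3D estimates -> temporally smoothed 3D poses."""
    def __init__(self, num_joints=23, channels=256, window=3):
        super().__init__()
        c = num_joints * 3
        self.net = nn.Sequential(
            nn.Conv1d(c, channels, window, padding=window // 2), nn.ReLU(),
            nn.Conv1d(channels, channels, window, padding=window // 2), nn.ReLU(),
            nn.Conv1d(channels, c, 1),
        )

    def forward(self, x3d):                      # x3d: (B, T, J, 3)
        b, t, j, _ = x3d.shape
        y = self.net(x3d.reshape(b, t, -1).transpose(1, 2))
        return x3d + y.transpose(1, 2).reshape(b, t, j, 3)   # residual refinement

# Usage: lift each frame independently, then refine the whole sequence.
landmarks = torch.randn(1, 16, 23, 2)            # (batch, frames, joints, xy)
lifter, temporal = SingleImageLifter(), TemporalCoherenceNet()
per_frame = lifter(landmarks.flatten(0, 1)).reshape(1, 16, 23, 3)
refined = temporal(per_frame)
```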
Single-Frame Results
Here are the results of our framework on single frames (i.e. without the temporal network). The three columns are: the original image with the skeleton overlaid, the 2D projection of the skeleton, and the 3D skeleton (from a different view angle). All images in the following two sections are from the TigDog dataset.
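To make the column layout concrete, the following matplotlib sketch draws the three panels from a predicted 3D skeleton; the orthographic projection and the bones edge list passed in are simplifying assumptions for illustration, not the camera model used in our pipeline.

```python
# Three-panel visualization sketch: image overlay, 2D projection, and the 3D
# skeleton rendered from a different viewpoint.
import numpy as np
import matplotlib.pyplot as plt

def show_result(image, joints3d, bones, scale=100.0, offset=(320, 240)):
    joints2d = joints3d[:, :2] * scale + np.asarray(offset)   # simple orthographic projection
    fig = plt.figure(figsize=(12, 4))

    ax = fig.add_subplot(1, 3, 1); ax.set_title('image + skeleton overlay')
    ax.imshow(image)
    for i, j in bones:
        ax.plot(*zip(joints2d[i], joints2d[j]), 'r-')

    ax = fig.add_subplot(1, 3, 2); ax.set_title('2D projection')
    for i, j in bones:
        ax.plot(*zip(joints2d[i], joints2d[j]), 'b-')
    ax.invert_yaxis(); ax.set_aspect('equal')

    ax = fig.add_subplot(1, 3, 3, projection='3d'); ax.set_title('3D skeleton')
    for i, j in bones:
        ax.plot(*zip(joints3d[i], joints3d[j]), 'g-')
    ax.view_init(elev=20, azim=60)                            # a different view angle
    plt.show()
```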
Video Sequence Results