Spatial-Temporal Hierarchical Model for Joint Learning and Inference of Human Action and Pose

Abstract

In the computer vision community, human pose estimation and human action recognition are two classic and particularly important tasks. They often serve as basic preprocessing steps for higher-level tasks such as group activity analysis, visual search and human identification, and they are widely used as key components in many real applications such as intelligent surveillance systems and human-computer interaction systems. Although the two tasks are closely related for understanding human motion, most methods learn separate models and combine them sequentially.

In this dissertation, we build systems that pursue a unified framework integrating the training and inference of human pose estimation and action recognition in a spatial-temporal And-Or Graph (ST-AOG) representation. In particular, we study different ways to achieve this goal.

First, a two-level And-Or Tree structure is used to represent an action as an animated pose template. Each action is a sequence of moving pose templates with transition probabilities. Each pose template consists of a shape template, represented by an And-node capturing part appearance, and a motion template, represented by an Or-node capturing part motions. The transitions between moving pose templates are governed by a Hidden Markov Model. The part locations, pose types and action labels are estimated jointly during inference.

Second, to handle actions from unknown and unseen views, we present a multi-view spatial-temporal And-Or Graph (MST-AOG) for cross-view action recognition. As a compositional model, the MST-AOG compactly represents the hierarchical combinatorial structures of cross-view actions by explicitly modeling geometry, appearance and motion variations. Model training takes advantage of 3D human skeleton data obtained from Kinect cameras to avoid annotating video frames, and efficient inference enables action recognition from novel views. A new Multi-view Action3D dataset has been created and released.

Third, to represent parts, poses and actions jointly and further improve performance, we represent each action at three scales with an ST-AOG model. Each action is decomposed into poses, which are further divided into mid-level spatial-temporal parts (ST-parts) and then parts. The hierarchical model structure captures the geometric and appearance variations of the pose at each frame, while lateral connections between ST-parts at adjacent frames capture action-specific motions. The model parameters at all three scales are learned discriminatively, and dynamic programming is used for efficient inference. The experiments demonstrate the large benefit of jointly modeling the two tasks.

Last but not least, we study a novel framework for full-body 3D human pose estimation, an essential task for human attention recognition, robot-based human action prediction and interaction. We build a two-level hierarchy of tree-structured Long Short-Term Memory (LSTM) networks to predict the depth of 2D human joints and then reconstruct the 3D pose. Our two-level model uses two cues for depth prediction: 1) global features from the 2D skeleton, and 2) local features from image patches of body parts.
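The first contribution models each action as a sequence of moving pose templates whose transitions follow a Hidden Markov Model, with pose types and action labels inferred jointly by dynamic programming. The sketch below illustrates that kind of Viterbi-style joint inference; it is a minimal illustration only, and the per-frame template scores, the score_fn helper and the per_action_models container are hypothetical placeholders rather than the dissertation's actual implementation.

```python
import numpy as np

def infer_action_and_poses(frame_scores, log_trans, log_prior):
    """Viterbi-style dynamic programming over moving pose templates.

    frame_scores : (T, K) array; frame_scores[t, k] is a hypothetical
                   log-score of pose template k at frame t (in the
                   dissertation this would come from matching the shape
                   and motion templates; here it is just an input).
    log_trans    : (K, K) log transition probabilities between templates.
    log_prior    : (K,) log initial probabilities.
    Returns the best total log-score and the pose-template sequence.
    """
    T, K = frame_scores.shape
    dp = np.full((T, K), -np.inf)        # best log-score ending in template k at frame t
    back = np.zeros((T, K), dtype=int)   # backpointers to recover the pose sequence
    dp[0] = log_prior + frame_scores[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + log_trans        # cand[i, j]: come from i, go to j
        back[t] = np.argmax(cand, axis=0)
        dp[t] = cand[back[t], np.arange(K)] + frame_scores[t]
    best_last = int(np.argmax(dp[-1]))
    poses = [best_last]
    for t in range(T - 1, 0, -1):
        poses.append(int(back[t, poses[-1]]))
    poses.reverse()
    return float(dp[-1, best_last]), poses

def recognize_action(per_action_models, frame_features, score_fn):
    """Pick the action whose pose-template HMM best explains the frames.

    per_action_models : dict mapping action label -> (templates, log_trans, log_prior);
    score_fn          : hypothetical function scoring a template against a frame.
    """
    best = None
    for action, (templates, log_trans, log_prior) in per_action_models.items():
        scores = np.array([[score_fn(f, tpl) for tpl in templates]
                           for f in frame_features])      # (T, K) per-frame scores
        s, poses = infer_action_and_poses(scores, log_trans, log_prior)
        if best is None or s > best[1]:
            best = (action, s, poses)
    return best  # (action label, score, pose-type sequence)
```

In this sketch the action label is chosen by comparing the best dynamic-programming score across per-action models, which mirrors the abstract's description of estimating pose types and action labels together rather than in separate sequential stages.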


Similar books and articles

Understanding mirror neurons. Giorgio Metta, Giulio Sandini, Lorenzo Natale, Laila Craighero & Luciano Fadiga - 2006 - Interaction Studies: Social Behaviour and Communication in Biological and Artificial Systems 7 (2):197-232.
