Learning to Understand Video

Cascaded Pictorial Structures Code




Version 1.0, September 20, 2010.

OR, check out the more up-to-date code (of which CPS is a subset) here: Stretchable Models for Parsing Human Motion.


To run on a new image, simply use cps_demo.m, setting opts.imgfile and opts.outputdir as desired.

The input images are intended to be 333x370-pixel images of people roughly localized by an upper-body detector. We use the bounding boxes provided by Eichner et al.'s BMVC09 paper, "Better Appearance Models for Pictorial Structures". More recently, they have developed a more robust upper-body person detector, which can be found here: http://www.vision.ee.ethz.ch/~calvin/calvin_upperbody_detector/

To convert from a Calvin upper-body detection to the correct 333x370 crop, you can use the following code snippet (which uses some functions included in the code zip):

 ubbox = <upper body box returned by your upper body detector>;
 [bh,bw] = boxsize(ubbox);          % detection box height and width
 bctr = boxcenter(ubbox);           % detection box center
 want_size = [333 370];
 hwr = want_size(2)/want_size(1);   % target aspect ratio
 % crop box: twice the detection height, centered horizontally on the
 % detection and anchored vertically at its bottom edge
 cropbox = box_from_dims(2*hwr*bh, 2*bh, [bctr(1); ubbox(4)]);

 img = imread(oldfile);
 % crop out the box:
 patch = extractWindow(img, box2rhull(round(cropbox)));
 % resize the cropped patch to the canonical size:
 patchsmall = imresize(patch, want_size);
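For clarity, here is the same crop geometry as a standalone Python sketch. This is not part of the release; it mirrors the MATLAB snippet above under the assumptions that boxes are [x1, y1, x2, y2] and that box_from_dims takes (height, width, center):

```python
def crop_box_from_ubbox(ubbox, want_size=(333, 370)):
    """Crop window for an upper-body detection box [x1, y1, x2, y2].

    The crop is twice the detection height wide, twice the detection
    height (scaled by the target aspect ratio) tall, centered
    horizontally on the detection and vertically on its bottom edge.
    """
    x1, y1, x2, y2 = ubbox
    bh = y2 - y1                        # detection height
    cx = 0.5 * (x1 + x2)                # horizontal center
    hwr = want_size[1] / want_size[0]   # target aspect ratio
    crop_h = 2.0 * hwr * bh
    crop_w = 2.0 * bh
    cy = y2                             # anchor at the bottom edge
    return (cx - crop_w / 2.0, cy - crop_h / 2.0,
            cx + crop_w / 2.0, cy + crop_h / 2.0)
```

The resulting box would then be rounded, cropped out of the image, and resized to the canonical size, as in the MATLAB snippet.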


A common mex/opencv issue has been addressed on the Stretchable Models code page.


This reference implementation produces slightly different numbers from those originally reported in the paper; however, it is much cleaner and significantly faster. You should be able to reproduce the following numbers out of the box:

Buffy v2.1: 99.57 / 98.72 / 93.62 / 90.21 / 65.11 / 63.83 / 85.18

For the lazy, you can just use the final predictions in the full and cropped frames contained here:


Processing Pipeline & Timing

For a single 333x370-pixel input image, here is a breakdown of the pipeline steps and the time each one takes:

  • computing HoG part detector detmaps: ~1m20s
  • applying the cascade of coarse-to-fine models: <5 seconds
  • computing richer features
    • pb+ncut: ~2m45s
    • color models: ~7 seconds
    • rest of feature computation: ~25 seconds
    • evaluating the gentleboost pairwise potential model: ~30 seconds
  • final prediction inference: <1 second
  • total: ~5m15s

This was evaluated on this machine:
Linux x86_64 GNU/Linux
Intel(R) Xeon(R) CPU E5450 @ 3.00GHz (w/ 8 cpus)

Differences from paper implementation

This implementation differs slightly from the paper version:

  • Coarse-to-fine pruning: Rather than a fixed alpha = 0 for every part (i.e., pruning at the mean max-marginal), we now learn a separate alpha for each part. During training, we set all alphas to 0 (learning to push the groundtruth score above the average), and set the run-time alphas via cross-validation to keep 95% of the groundtruth. This modification maintains the convexity of the learning formulation and allows more flexible pruning: more aggressive on easier parts.
  • Features: We also include an ncut embedding distance feature in the spirit of our pairwise color chi^2 distance feature. This feature is the cosine distance between the average embedding vectors over the rectangular support of each limb (the embedding space is 30-dimensional, corresponding to the top 30 eigenvectors). Also, the HoG part detectors provided here are faster (they evaluate fewer rounds of boosting) but less accurate.
  • Jackknifing: Because the models are trained sequentially, it is easy to overfit successive cascade models that all see the same data. To remedy this, we employ simple 2-fold jackknifing (a.k.a. "stacking"): split the training data into two halves, and at each level of the cascade alternate which half is used for training and which for validation.
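The per-part alpha selection described above can be sketched as follows. This is a hedged sketch, not the release code: the function and variable names are hypothetical, and the threshold is written as a convex combination of the max and mean max-marginal scores, matching the pruning rule described in the bullet (alpha = 0 prunes at the mean):

```python
def prune_threshold(alpha, mm_max, mm_mean):
    # Pruning threshold as a convex combination of the max and mean
    # max-marginal; alpha = 0 recovers "prune below the mean".
    return alpha * mm_max + (1.0 - alpha) * mm_mean

def pick_alpha(val_examples, keep=0.95):
    """Most aggressive alpha that retains at least `keep` of the
    groundtruth states on held-out data.

    val_examples: list of (gt_score, mm_max, mm_mean) tuples, one per
    validation image for a given part.
    """
    best = 0.0
    for step in range(101):
        a = step / 100.0
        kept = sum(gt >= prune_threshold(a, mx, mn)
                   for gt, mx, mn in val_examples)
        if kept >= keep * len(val_examples):
            best = a  # survival is monotone in alpha; keep the largest
    return best
```

Since raising alpha only raises the threshold, the largest alpha that still keeps the desired fraction of groundtruth is the most aggressive valid choice for that part.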