|Abstract or Summary
- Given a video, we would like to recognize group activities, localize the video parts where these activities occur, and detect the actors involved in them. To this end, we propose a novel mid-level feature, called a control point, for representing group activities. Control points are aimed at summarizing visual cues, lifting from the noisy low-level features, and jointly providing visual evidence of actors and their group activity to higher-level inference algorithms. We formulate a generative model, called the chains model, that organizes a large number of video features into an ensemble of chains of control points representing a group activity. The chains may have arbitrary length, ideally starting and ending at the beginning and end of the time interval occupied by the activity. We derive an efficient MAP inference algorithm, a new EM-like procedure that iterates two steps: it warps the chains of control points to their expected locations so that they better summarize visual cues, and then maximizes their posterior probability. Our evaluation on the benchmark UT-Human Interaction and Collective Activities datasets demonstrates that we outperform the state of the art with reasonable running times.