A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video


framework overview

Description

We present an approach to simultaneously reasoning about a video clip and an entire natural-language sentence. The compositional nature of language is exploited to construct models which represent the meanings of entire sentences composed out of the meanings of the words in those sentences mediated by a grammar that encodes the predicate-argument relations. We demonstrate that these models faithfully represent the meanings of sentences and are sensitive to how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions) affect the meaning of a sentence and how it is grounded in video. We exploit this methodology in three ways. In the first, a video clip along with a sentence are taken as input and the participants in the event described by the sentence are highlighted, even when the clip depicts multiple similar simultaneous events. In the second, a video clip is taken as input without a sentence and a sentence is generated that describes an event in that clip. In the third, a corpus of video clips is paired with sentences which describe some of the events in those clips and the meanings of the words in those sentences are learned. We learn these meanings without needing to specify which attribute of the video clips each word in a given sentence refers to. The learned meaning representations are shown to be intelligible to humans.
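As a rough illustration of the compositional idea and of the first task (highlighting the participants a sentence refers to), the sketch below scores a sentence against candidate participant assignments. It is a toy under stated assumptions, not the authors' model: the track format, the binary word predicates in `lexicon`, the role-index encoding of the sentence, and the helper names `sentence_score` and `best_assignment` are all hypothetical and chosen only for illustration.

```python
from itertools import permutations

# Toy tracks (hypothetical format): an object class and an x-position per frame.
tracks = {
    "t0": {"cls": "person", "x": [0, 2, 4, 6]},      # walks toward the chair
    "t1": {"cls": "chair", "x": [8, 8, 8, 8]},       # stationary
    "t2": {"cls": "person", "x": [20, 20, 20, 20]},  # a second, idle person
}

def is_class(name):
    """Noun meaning: a one-argument predicate over a track."""
    return lambda tr: 1.0 if tr["cls"] == name else 0.0

def approached(a, b):
    """Verb meaning: a two-argument predicate; distance shrinks over the clip."""
    start = abs(a["x"][0] - b["x"][0])
    end = abs(a["x"][-1] - b["x"][-1])
    return 1.0 if end < start else 0.0

lexicon = {
    "person": is_class("person"),
    "chair": is_class("chair"),
    "approached": approached,
}

# "The person approached the chair", with predicate-argument relations
# written as role indices: role 0 is the agent, role 1 is the patient.
sentence = [("person", (0,)), ("chair", (1,)), ("approached", (0, 1))]

def sentence_score(sentence, assignment):
    """Sentence meaning as the product of its word scores under one assignment."""
    score = 1.0
    for word, roles in sentence:
        args = [tracks[assignment[r]] for r in roles]
        score *= lexicon[word](*args)
    return score

def best_assignment(sentence, n_roles):
    """Pick the tracks that best fill the sentence's roles."""
    return max(permutations(tracks, n_roles),
               key=lambda a: sentence_score(sentence, a))

print(best_assignment(sentence, 2))
```

In this toy clip two people are present, but only one of them moves toward the chair, so the role assignment selects that person rather than the idle one.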


Example results

language inference

The following video clips are tested on hand-crafted models. Note that the two clips in each row are identical.

language generation

The following video clips are tested on hand-crafted models. Given a clip, a sentence describing that clip is produced by searching through the entire lexicon.
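Generation as a search over the lexicon can be sketched as follows. This is a minimal toy, assuming a fixed subject-verb-object template, a tiny hypothetical lexicon, and a hand-written `score` function over a simplified clip summary; none of these names or structures come from the paper, and exhaustive enumeration stands in for whatever search procedure the authors actually use.

```python
from itertools import product

# Hypothetical clip summary: who acts on what, and their distance per frame.
clip = {"agent": "person", "patient": "chair", "distance": [8, 6, 4, 2]}

# Toy lexicon (an assumption for illustration only).
nouns = ["person", "chair", "backpack"]
verbs = {
    "approached": lambda d: 1.0 if d[-1] < d[0] else 0.0,
    "left": lambda d: 1.0 if d[-1] > d[0] else 0.0,
}

def score(subject, verb, obj, clip):
    """Score one candidate sentence against the clip (1.0 = fully consistent)."""
    s = 1.0 if subject == clip["agent"] else 0.0
    o = 1.0 if obj == clip["patient"] else 0.0
    return s * o * verbs[verb](clip["distance"])

def generate(clip):
    """Enumerate every sentence the template allows and keep the best-scoring one."""
    candidates = product(nouns, verbs, nouns)
    best = max(candidates, key=lambda c: score(*c, clip))
    return "The {} {} the {}.".format(*best)

print(generate(clip))
```

Exhaustive enumeration is only workable here because the toy lexicon is tiny; the number of candidate sentences grows quickly with lexicon size and sentence length.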

language acquisition

The following unseen video clips are tested on sentences that do not appear in the training set. The meanings of the words in the test sentences are learned from the training samples.


Reference

Haonan Yu, Siddharth N., Andrei Barbu, and Jeffrey Mark Siskind. ‘A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video,’ Journal of Artificial Intelligence Research, 52, 601-713, 2015.
