Sentence Tracker — Language Learning



People learn language through exposure to a rich perceptual context. Language is grounded by mapping words, phrases, and sentences to meaning representations referring to the world. It has been shown that even with referential uncertainty and noise, a system based on cross-situational learning can robustly acquire a lexicon, mapping words to word-level meanings from sentences paired with sentence-level meanings. We present a method that learns representations for word meanings from short video clips paired with sentences. Unlike prior work on learning language from symbolic input, our input consists of video of people interacting with multiple complex objects in outdoor environments. Unlike prior computer-vision approaches that learn from videos with verb labels or images with noun labels, our labels are sentences containing nouns, verbs, prepositions, adjectives, and adverbs. The correspondence between words and concepts in the video is learned in an unsupervised fashion, even when the video depicts simultaneous events described by multiple sentences or when different aspects of a single event are described with multiple sentences. The learned word meanings can subsequently be used to automatically generate descriptions of new video.
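The cross-situational learning idea mentioned above can be illustrated with a toy symbolic sketch. This is not the method of the paper, which learns from video rather than symbolic meanings; it only shows how word meanings can be narrowed down across several sentence/meaning pairs without word-level labels. The function name `learn_lexicon`, the concept symbols, and the tiny corpus are all hypothetical, made up for illustration.

```python
def learn_lexicon(pairs):
    """Toy cross-situational learner over symbolic input (illustrative only).

    pairs: iterable of (list of words, set of concept symbols) where the
    concept set is the sentence-level meaning, with no word alignment given.
    Returns a dict mapping each word to its remaining candidate concepts.
    """
    candidates = {}
    # Step 1: a word's meaning must be present in every situation where the
    # word is used, so intersect the concept sets across its occurrences.
    for words, concepts in pairs:
        for word in words:
            candidates.setdefault(word, set(concepts)).intersection_update(concepts)

    # Step 2 (assumed simplification): mutual exclusivity. Once a word is
    # resolved to a single concept, remove that concept from the candidate
    # sets of words that are still ambiguous. Repeat until nothing changes.
    changed = True
    while changed:
        changed = False
        resolved = {next(iter(c)) for c in candidates.values() if len(c) == 1}
        for cands in candidates.values():
            if len(cands) > 1 and cands & resolved:
                cands -= resolved
                changed = True
    return candidates


# Hypothetical corpus: each sentence is paired with an unordered set of
# concepts; nothing says which word goes with which concept.
corpus = [
    (["person", "carried", "backpack"], {"PERSON", "CARRY", "BACKPACK"}),
    (["person", "approached", "chair"], {"PERSON", "APPROACH", "CHAIR"}),
    (["person", "carried", "chair"], {"PERSON", "CARRY", "CHAIR"}),
    (["person", "approached", "backpack"], {"PERSON", "APPROACH", "BACKPACK"}),
]

lexicon = learn_lexicon(corpus)
for word, meanings in sorted(lexicon.items()):
    print(word, sorted(meanings))
```

In this toy corpus, intersection alone leaves `PERSON` as a candidate for every word, because "person" occurs in every sentence; the mutual-exclusivity pass is what removes it from the other words. Real input is noisier, which is why the abstract stresses robustness to referential uncertainty and noise.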


H. Yu and J. M. Siskind. 'Grounded Language Learning from Video Described with Sentences', in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, 2013. Best Paper Award.

pdf | bibtex | source code (Scheme, C, C++) | dataset | talk file (Beamer) | talk script (text)
