Jan. 16, 2018

Room 202 in building 37

Much of current computer vision research is focused on labeling visual objects, and impressive results have been achieved for this task. However, human image understanding is much richer, involving understanding both below the object level (e.g., localizing and labeling an object's parts) and above it (e.g., categorizing the interactions between two or more 'person' objects). In particular, human understanding involves structure: identifying objects and their parts, together with a rich set of semantic relations between them. This mapping of sensory input (pixels) to the semantic structure perceived by humans is termed here 'full interpretation'.

In this talk I will describe a set of studies towards a full interpretation process that operates both below and above the object level. Full interpretation is approached by dividing the image into multiple so-called 'minimal images': reduced local image regions that are minimal in the sense that any further reduction renders them unrecognizable and uninterpretable for humans. Minimal images make the interpretation task tractable, and also provide valuable insights into the computational mechanisms underlying interpretation in the human visual system. I will show results of incorporating such mechanisms in a structured prediction algorithm for full interpretation, and discuss how the interpretation processes are combined with convolutional networks.