Much of the current
computer vision research is focused on labeling visual objects, and impressive
results have been achieved for this task. However, human image understanding is
much richer, involving understanding both below the object level (e.g.,
localizing and labeling the object’s parts) and above it (e.g., categorizing
the interactions between two or more ‘person’
objects). In particular, human understanding involves structure: identifying
objects and their parts, together with a rich set of semantic relations between
them. This mapping of sensory input (pixels) to the semantic structure that is
perceived by humans is termed here ‘full interpretation’.
In this talk I will describe a set of studies towards a full interpretation
process, which is both below and above the object level. Full interpretation is
approached by dividing the image into multiple so-called ‘minimal images’,
namely reduced local image regions that are minimal in the sense that any
further reduction renders them unrecognizable and uninterpretable to humans.
Minimal
images make the interpretation task tractable, and also provide valuable
insights into the computational mechanisms underlying interpretation in the
human visual system. I will show results of incorporating such mechanisms in a
structured prediction algorithm for full interpretation, and discuss how the interpretation
processes are combined with convolutional networks.
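As a toy illustration only (not the actual procedure from these studies), the search for a minimal image can be sketched as a greedy reduction loop: shrink a patch by cropping or downscaling as long as it stays recognizable, and stop when every further reduction fails. Here `is_recognizable` is a hypothetical stand-in for the human (or classifier) recognition test, and the specific reduction operators are assumptions for the sketch:

```python
import numpy as np

def children(patch):
    """Generate candidate reductions of a 2D patch: crop one row or
    column from each side, or downscale by ~20% (nearest-neighbor)."""
    h, w = patch.shape[:2]
    if h > 2 and w > 2:
        yield patch[1:, :]    # crop top row
        yield patch[:-1, :]   # crop bottom row
        yield patch[:, 1:]    # crop left column
        yield patch[:, :-1]   # crop right column
    nh, nw = int(h * 0.8), int(w * 0.8)
    if nh >= 2 and nw >= 2:
        rows = np.arange(nh) * h // nh
        cols = np.arange(nw) * w // nw
        yield patch[np.ix_(rows, cols)]  # nearest-neighbor downscale

def find_minimal_image(patch, is_recognizable):
    """Greedily reduce `patch` while it remains recognizable; return it
    once no further reduction passes the test (a local minimal image)."""
    while True:
        for child in children(patch):
            if is_recognizable(child):
                patch = child
                break
        else:  # no reduction is recognizable: patch is minimal
            return patch
```

The sharp recognizable-to-unrecognizable transition at the output of this loop is what makes minimal images useful probes: every immediate reduction of the returned patch fails the recognition test.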