The courses
3D Localization of Audiovisual Events with Binaural Hearing and Binocular Vision
Radu Horaud, INRIA Grenoble Rhone-Alpes
In this talk we will address the problem of spatial audio-visual integration. Strong single-modality inputs do not necessarily require multi-sensory integration. Nevertheless, in most natural situations the sensory stimuli are highly ambiguous and it is not straightforward to extract meaningful interpretations from them. In these circumstances, strong evidence for a given stimulus can be obtained from an appropriate combination of weak cues. This is the case in many common configurations, such as the analysis of a scene containing several speakers. We will consider a human-centered sensor setup that consists of binaural hearing and binocular vision. First, we will briefly describe the geometric principles underlying the computation of auditory cues from interaural time differences (ITD) and of a visual depth map from binocular disparities. We will stress the fact that algorithms based on these geometric models alone cannot recover the depth of audiovisual objects reliably. Second, we will describe a probabilistic model that associates joint 2-D visual and 1-D audio observations with speakers through the recovery of the speakers' 3-D locations. We cast the problem within the framework of maximum likelihood with missing variables. The parameters associated with the spatial locations of multiple events are estimated with a variant of expectation-maximization (EM), namely generalized EM (GEM). We formally derive the GEM clustering algorithm, which considers two sets of unknown quantities: the association variables between audio-visual data and speakers, as well as the 3-D location parameters of the speakers. We describe experiments performed with the POP project's computational audio-visual analysis (CAVA) data sets, available at http://perception.inrialpes.fr/CAVA_Dataset.
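The alternation at the heart of this approach can be sketched with a small toy model. Everything below is an illustrative assumption, not the talk's actual model: two speakers, visual observations reduced to noisy 2-D cues (x, y), and the audio cue replaced by a linear surrogate c·s standing in for the nonlinear ITD function. (That nonlinearity is precisely what forces the generalized M-step of GEM; with a linear surrogate the M-step is closed-form, so plain EM suffices, but the E/M alternation over association variables and 3-D locations has the same structure.) Note that neither modality alone determines a 3-D location here: vision constrains (x, y) and audio constrains one linear combination involving depth, so only their combination recovers z.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy ground truth: two speakers at unknown 3-D locations (x, y, z).
# (Hypothetical values chosen for illustration only.)
S_true = np.array([[ 1.0,  0.5, 2.0],
                   [-1.0, -0.5, 3.0]])
K = len(S_true)

# Surrogate observation models (assumptions, not the paper's models):
# vision yields a noisy 2-D cue (x, y); audio yields one noisy scalar
# c . s, a linear stand-in for the nonlinear ITD function.
c = np.array([1.0, 0.0, 0.3])
sig_v, sig_a = 0.05, 0.05

nv, na = 80, 80
zv = rng.integers(0, K, nv)            # hidden visual associations
za = rng.integers(0, K, na)            # hidden audio associations
V = S_true[zv, :2] + sig_v * rng.standard_normal((nv, 2))
A = S_true[za] @ c + sig_a * rng.standard_normal(na)

# EM: alternate soft association (E) and location update (M).
S = np.array([[ 0.9,  0.4, 1.0],       # init: rough visual guesses,
              [-0.9, -0.4, 1.0]])      # arbitrary depth
for _ in range(30):
    # E-step: responsibilities of each speaker for each observation,
    # computed per modality but tied through the shared locations S.
    dv = ((V[:, None, :] - S[None, :, :2]) ** 2).sum(-1) / (2 * sig_v**2)
    rv = np.exp(dv.min(1, keepdims=True) - dv)   # stabilized softmax
    rv /= rv.sum(1, keepdims=True)
    da = (A[:, None] - S @ c) ** 2 / (2 * sig_a**2)
    ra = np.exp(da.min(1, keepdims=True) - da)
    ra /= ra.sum(1, keepdims=True)
    # M-step: z enters only the audio cue, so the joint quadratic
    # problem decouples: (x, y) from vision, then z from audio given x.
    S[:, :2] = (rv[:, :, None] * V[:, None, :]).sum(0) / rv.sum(0)[:, None]
    a_bar = (ra * A[:, None]).sum(0) / ra.sum(0)  # weighted audio means
    S[:, 2] = (a_bar - S[:, 0]) / c[2]

print(np.round(S, 2))  # close to S_true, up to a permutation of speakers
```

In the real setting the disparity and ITD mappings are nonlinear in s, so the M-step has no closed form and GEM instead takes a likelihood-increasing (rather than maximizing) step for the locations at each iteration.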
Coming soon: descriptions of the other courses.