The visual microphone recovers audio by measuring an object's vibrations and converting them back into a sound signal. The process begins when an input sound (i.e., the sound of interest) causes air pressure fluctuations at the surface of an object. These fluctuations make the object move with a particular pattern of displacement over time. A high-speed camera records this movement as video at frame rates of 1 kHz to 20 kHz, and the resulting recording is processed by the Inventors’ algorithm to recover an output sound.
The input to the method is the video of the object. The method assumes that the observed relative motion of the object is due to vibrations caused by a sound signal. The algorithm first decomposes the video into spatial subbands corresponding to different scales and orientations. Local motion signals are then computed at every pixel, orientation and scale. To recover the audio, these local motion signals are weighted by their amplitudes and combined across all scales and orientations into a single global motion signal for the object. Finally, audio denoising and filtering techniques are applied to this global motion signal to obtain the recovered audio signal.
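
The combining step described above can be illustrated with a minimal Python sketch. It is not the Inventors’ implementation: the function names (`combine_motion_signals`, `center_and_normalize`), the use of a plain amplitude-weighted average, and the synthetic 440 Hz test data are all assumptions made for illustration. The subband decomposition is skipped entirely, and mean removal plus peak normalization stands in for the real denoising and filtering stage.

```python
import math


def combine_motion_signals(local_signals, amplitudes):
    """Amplitude-weighted average of local motion signals (illustrative only).

    local_signals: list of equal-length lists, one per (pixel, orientation, scale).
    amplitudes:    one non-negative weight per local signal (assumed weighting).
    """
    if len(local_signals) != len(amplitudes):
        raise ValueError("need one amplitude per local signal")
    total = sum(amplitudes)
    if total <= 0:
        raise ValueError("amplitudes must sum to a positive value")
    n_frames = len(local_signals[0])
    return [
        sum(a * sig[t] for sig, a in zip(local_signals, amplitudes)) / total
        for t in range(n_frames)
    ]


def center_and_normalize(signal):
    """Remove the mean and scale the peak to 1 (a stand-in for real denoising)."""
    mean = sum(signal) / len(signal)
    centered = [x - mean for x in signal]
    peak = max(abs(x) for x in centered)
    return centered if peak == 0 else [x / peak for x in centered]


# Synthetic demo (hypothetical data): a 440 Hz tone sampled at 2200 frames/s.
frame_rate = 2200
tone = [math.sin(2 * math.pi * 440 * t / frame_rate) for t in range(220)]

local_signals = [
    [0.9 * x + 0.05 for x in tone],          # high-amplitude channel, small offset
    [1.1 * x - 0.02 for x in tone],          # high-amplitude channel, small offset
    [(-1) ** t * 0.5 for t in range(220)],   # low-amplitude channel, mostly noise
]
amplitudes = [5.0, 4.0, 0.1]

global_motion = combine_motion_signals(local_signals, amplitudes)
recovered = center_and_normalize(global_motion)
```

Because the noisy channel carries a small amplitude, it contributes little to the weighted average, so `recovered` closely follows the original tone. This is the motivation for amplitude weighting: regions with weak local amplitude tend to give less reliable motion estimates.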