The Visual Microphone


The inventors have developed a method that effectively turns any object into a visual microphone, enabling the detection of sound from afar. Sound signals produce air pressure fluctuations that cause nearby objects to vibrate. The method works by analyzing video recordings of these vibrations and converting them back into a corresponding sound signal. This technology is mainly applicable to military, surveillance, and espionage purposes.

Problem Addressed

Remote sound acquisition has been of great interest in recent years for surveillance and security purposes. Existing approaches that reconstruct sound from the vibration of objects at a distance are active in nature, requiring a laser beam or pattern to be projected onto the vibrating surface. The Inventors have proposed a method of passive audio recovery that uses high-speed cameras to obtain vibration data from video footage. This approach is desirable because cameras are cheaper and more accessible than lasers, and it is easier to capture surround sound with a single video camera than with multiple laser readings. Moreover, this method does not require the vibrating surface to be retroreflective and can therefore work with virtually any everyday object (a potted plant, a bag of chips, etc.).

Technology

The visual microphone works by collecting vibration data and converting it back into audio signals. This process begins when an input sound (i.e., the sound of interest) causes air pressure fluctuations at the surface of some object. These fluctuations cause the object to move with a certain pattern of displacement over time. High-speed cameras capture this movement at frame rates of 1 kHz to 20 kHz, and the resulting video recording is processed by the Inventors' algorithm to recover an output sound.
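Because the video itself serves as the audio sample stream, the camera's frame rate bounds the recoverable audio bandwidth by the Nyquist criterion. A minimal illustration (the helper name here is ours, not part of the Inventors' method):

```python
def max_recoverable_frequency(frame_rate_hz):
    """By the Nyquist criterion, video sampled at F frames per second
    can only capture vibration frequencies up to F/2 Hz."""
    return frame_rate_hz / 2.0

# A 20 kHz capture rate covers audio content up to 10 kHz,
# spanning most of the frequency range of human speech.
print(max_recoverable_frequency(20000.0))  # -> 10000.0
```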

The input to the method is the video of the object. It is assumed that the observed relative motion of the object is due to vibrations caused by a sound signal. The algorithm decomposes the video input into spatial subbands corresponding to different scales and orientations. Local motion signals are computed at every pixel, orientation, and scale. To recover the audio signal, these motion signals are averaged across all scales and orientations, weighted by their amplitudes, to produce a single global 1D motion signal for the object. Various audio denoising and filtering techniques are then applied to this motion signal to obtain the recovered audio signal.
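The amplitude-weighted phase-averaging idea above can be sketched as follows. This is a simplification under assumed parameters: it uses a single complex Gabor subband (one scale, one orientation), whereas the actual method uses a full complex steerable pyramid over many scales and orientations, plus denoising.

```python
import numpy as np

def recover_motion_signal(frames):
    """Sketch: recover a 1D motion signal from grayscale video.

    frames: array of shape (T, H, W).
    For each frame, compute a complex subband, take each pixel's phase
    shift relative to a reference frame (a proxy for local motion), and
    average those shifts weighted by squared subband amplitude.
    """
    T, H, W = frames.shape
    # Complex Gabor filter built in the frequency domain: a Gaussian
    # bump centred on one spatial frequency (f0, 0). Values assumed.
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    f0, sigma = 0.1, 0.05
    gabor = np.exp(-((fx - f0) ** 2 + fy ** 2) / (2 * sigma ** 2))

    # Complex subband of the reference (first) frame.
    ref = np.fft.ifft2(np.fft.fft2(frames[0]) * gabor)
    signal = np.empty(T)
    for t in range(T):
        sub = np.fft.ifft2(np.fft.fft2(frames[t]) * gabor)
        # Local phase shift relative to the reference frame.
        dphi = np.angle(sub * np.conj(ref))
        # Amplitude-weighted average over all pixels -> one sample.
        w = np.abs(sub) ** 2
        signal[t] = (w * dphi).sum() / w.sum()
    # Audio has no DC component, so remove the mean.
    return signal - signal.mean()
```

On a synthetic video of a sinusoidal texture oscillating horizontally, the recovered signal tracks the displacement over time up to a scale factor.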

Advantages

  • More cost-effective than other existing methods of remote sound acquisition
  • Does not require sensors or detection modules beyond a high-speed camera
  • Capable of producing an accurate and efficient representation of surround-sound acoustics
  • Can be used to intercept transmissions over highly secure wireless standards, e.g., Bluetooth, by analyzing vibrations in a headset