The Inventors train a deep multimodal embedding network on a vast repository of spoken captions and paired image data to map entire image frames and entire spoken captions into a shared embedding space. The trained network can then be used to localize patterns corresponding to words and phrases within a spectrogram (a visual representation of the spoken caption's spectrum of frequencies over time), as well as visual objects within the image, by applying it to small sub-regions of the spectrogram and image, respectively. The model comprises two branches: one takes images as input, and the other takes spectrograms.
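The shared embedding space described above can be sketched as follows. This is a minimal NumPy illustration, not the Inventors' implementation: each branch is reduced to a single linear projection (the described network would use deep branches), and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the source): flattened
# image and spectrogram inputs projected into a shared space of size 128.
d_img, d_spec, d_embed = 2048, 1024, 128

# Each branch is reduced here to one linear projection; in the described
# model these would be deep network branches trained jointly.
W_img = rng.standard_normal((d_img, d_embed)) / np.sqrt(d_img)
W_spec = rng.standard_normal((d_spec, d_embed)) / np.sqrt(d_spec)

def embed(x, W):
    """Project an input into the shared space and L2-normalize it."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def similarity(image_feat, spectrogram_feat):
    """Dot-product similarity between an image and a spoken caption."""
    return float(embed(image_feat, W_img) @ embed(spectrogram_feat, W_spec))

image = rng.standard_normal(d_img)     # stand-in for an image frame
caption = rng.standard_normal(d_spec)  # stand-in for a caption spectrogram
score = similarity(image, caption)
```

Because both embeddings are L2-normalized, the similarity score is bounded in [-1, 1]; the same scoring function can be applied to small sub-regions of the image and spectrogram to localize objects and words.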
Given an image and its corresponding spoken audio caption, the Inventors apply grounding techniques to extract meaningful segments from the caption and associate each with an appropriate subregion of the image (i.e., a bounding box around the area of interest). Once a set of proposed visual bounding boxes and acoustic segments for a given image/caption pair is discovered, the multimodal network is used to compute a similarity score between each unique image crop/acoustic segment pair. The resulting regions of interest are separated out, and a clustering analysis is performed to establish affinity scores between each image cluster and each acoustic cluster. This model is used to associate images with the waveforms representing their spoken audio captions in untranscribed audio data.
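The pairwise scoring and cluster-affinity step can be sketched as follows. This is a hypothetical NumPy illustration under stated assumptions: embeddings, cluster assignments, and the mean-similarity affinity measure are stand-ins, not the Inventors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(x):
    """L2-normalize each row so dot products behave as cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Hypothetical shared-space embeddings: 6 proposed bounding-box crops
# and 4 proposed acoustic segments from one image/caption pair.
crops = normalize(rng.standard_normal((6, 128)))
segments = normalize(rng.standard_normal((4, 128)))

# Similarity score for every unique image crop / acoustic segment pair.
sim = crops @ segments.T  # shape (6, 4)

# Toy cluster assignments; in practice these come from a clustering
# analysis over the regions of interest.
crop_cluster = np.array([0, 0, 1, 1, 2, 2])
seg_cluster = np.array([0, 0, 1, 1])

def affinity(ci, cj):
    """Affinity between an image cluster and an acoustic cluster,
    taken here as the mean similarity over all member pairs."""
    return float(sim[np.ix_(crop_cluster == ci, seg_cluster == cj)].mean())

affinities = {(int(ci), int(cj)): affinity(ci, cj)
              for ci in np.unique(crop_cluster)
              for cj in np.unique(seg_cluster)}
```

High-affinity cluster pairs link a recurring visual pattern (e.g., a type of object) with a recurring acoustic pattern (e.g., a spoken word), which is what allows the model to associate images with untranscribed audio.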