Given a collection of images and spoken audio captions, the Inventors present a method for discovering word-like acoustic units in the continuous speech signal and grounding them to semantically relevant image regions. For example, if the model is given an utterance containing spoken instances of the word “lighthouse,” it can associate those instances with images containing lighthouses. This technique is relevant to many applications in spoken language processing.
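The grounding step can be illustrated with a minimal sketch. It assumes that audio segments and image regions have already been mapped into a shared embedding space and that cosine similarity is the matching score. The source states neither assumption, and the function names, segment labels, region labels, and vectors below are hypothetical toy values, not the Inventors' implementation.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ground_segments(segment_embeddings, region_embeddings):
    """Pair each audio segment with its most similar image region.

    Both arguments are dicts mapping a label to an embedding vector.
    Returns {segment_label: (region_label, similarity)}.
    """
    grounding = {}
    for seg_label, seg_vec in segment_embeddings.items():
        best_region = max(
            region_embeddings,
            key=lambda r: cosine(seg_vec, region_embeddings[r]),
        )
        score = cosine(seg_vec, region_embeddings[best_region])
        grounding[seg_label] = (best_region, score)
    return grounding


# Hypothetical toy embeddings (assumed outputs of audio and image encoders).
segments = {
    "audio_0.8s-1.4s": [0.9, 0.1, 0.0],  # e.g. a spoken "lighthouse"
    "audio_2.0s-2.5s": [0.1, 0.2, 0.9],
}
regions = {
    "region_top_left": [1.0, 0.0, 0.1],  # e.g. pixels showing a lighthouse
    "region_bottom": [0.0, 0.1, 1.0],
}

result = ground_segments(segments, regions)
for seg, (region, score) in result.items():
    print(f"{seg} -> {region} (similarity {score:.2f})")
```

In this sketch a word-like unit is simply a time span of audio whose embedding aligns strongly with one image region. How the segments are discovered and how the embeddings are learned is left open here.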