Humans are capable of discovering words and other elements of linguistic structure in continuous speech at a very early age, a process that has proven difficult for computers to emulate. It is inherently a joint segmentation and clustering problem, made difficult by many sources of variability. While conventional, supervised automatic speech recognition (ASR) systems have recently made strides due to deep neural networks (DNNs), their application is limited because they require vast amounts of costly text transcription. The Inventors’ method of acoustic pattern discovery can discover word and phrase categories from continuous speech at the raw signal level, with no transcriptions or conventional speech recognition, and can jointly learn the semantics of those categories via visual associations. The method runs in linear time, a marked improvement over previous approaches.
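The visual-association idea above can be sketched as follows. This is a hypothetical illustration, not the Inventors’ implementation: it assumes that discovered acoustic segments and candidate image regions have already been mapped into a shared embedding space by learned encoders (the random vectors below are stand-ins for those encoder outputs), and it associates each segment with its most similar region by cosine similarity.

```python
import numpy as np

# Random stand-ins for the outputs of hypothetical audio and image
# encoders that share an embedding space.
rng = np.random.default_rng(0)
dim = 8
audio_segment_embeddings = rng.standard_normal((3, dim))  # 3 discovered segments
image_region_embeddings = rng.standard_normal((4, dim))   # 4 candidate regions

def normalize(x):
    """Scale each row to unit length so dot products give cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between every (segment, region) pair.
sim = normalize(audio_segment_embeddings) @ normalize(image_region_embeddings).T

# Each acoustic segment is grounded in its best-matching visual region,
# giving the segment's category a visual "meaning" without any transcription.
best_region = sim.argmax(axis=1)
print(best_region)  # one region index per audio segment
```

In practice such embeddings would be trained so that co-occurring speech and images land near each other; the sketch only shows the matching step once embeddings exist.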