This technology is a small, low-power speech recognizer that enables speech interfaces on devices where an internet connection is unavailable, unreliable, slow, or power-constrained, or where low latency is desired. This is useful to small embedded systems, such as wearable devices, that can benefit from speech interfaces but are highly power-constrained and cannot adopt hardware automatic speech recognition (ASR).
Many consumer electronic devices and computer applications make use of speech interfaces. For small embedded systems such as wearable devices, it is desirable to shrink the size and power footprint of speech recognition capabilities below what is possible with current computing technology. Existing methods for ASR in embedded systems are either implemented on low-power processors that do not allow for sophisticated speech models, or use cloud services that rely on internet connectivity. The Inventors have developed a special-purpose architecture for speech processing that applies algorithmic techniques and integrated circuit (IC) technology to produce small, low-power speech recognizers that enable sophisticated speech interfaces on devices where an internet connection is unavailable.
At a high level, the algorithms used by this ASR engine are similar to those of a typical software implementation. The engine searches a hidden Markov model (HMM) specified by a weighted finite-state transducer (WFST) and deep neural network (DNN) statistical models. An audio waveform is split into frames, each of which is transformed into a high-dimensional feature space. The likelihoods of these feature vectors with respect to the acoustic models are evaluated by a fully connected, feed-forward DNN. A Viterbi search maintains a set of active hypotheses and combines these likelihoods with transition probabilities from the WFST. At the end of an utterance, the most likely hypothesis is converted to a sequence of words.
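The search step described above can be illustrated with a short software sketch. This is a toy example under stated assumptions, not the chip's implementation: the arc format, the beam-pruning threshold, the names `viterbi_decode`, `toy_arcs`, and `toy_frames`, and all probabilities are hypothetical, and the per-frame log-likelihoods stand in for the output of the DNN.

```python
import math

def viterbi_decode(arcs, frame_loglikes, start=0, finals=frozenset(), beam=10.0):
    """Frame-synchronous Viterbi search over a toy WFST-like graph.

    arcs: dict mapping state -> list of (next_state, class_id, word, log_prob),
          where word is None for arcs that emit no word (assumed format).
    frame_loglikes: one list per audio frame of acoustic log-likelihoods,
          indexed by class_id (a stand-in for the DNN outputs).
    """
    # Active hypotheses: state -> (accumulated log score, words so far)
    active = {start: (0.0, ())}
    for loglikes in frame_loglikes:
        nxt = {}
        for state, (score, words) in active.items():
            for dst, cls, word, logp in arcs.get(state, []):
                # Combine transition probability with acoustic likelihood.
                new_score = score + logp + loglikes[cls]
                new_words = words + (word,) if word is not None else words
                # Viterbi recombination: keep only the best path into each state.
                if dst not in nxt or new_score > nxt[dst][0]:
                    nxt[dst] = (new_score, new_words)
        if not nxt:
            return []
        # Beam pruning (an assumption here) bounds the active set.
        best = max(s for s, _ in nxt.values())
        active = {st: h for st, h in nxt.items() if h[0] >= best - beam}
    # Prefer hypotheses ending in a final state, if any survived.
    candidates = {st: h for st, h in active.items() if st in finals} or active
    return list(max(candidates.values(), key=lambda h: h[0])[1])

# Hypothetical two-word graph: class 0 models "yes", class 1 models "no".
toy_arcs = {
    0: [(1, 0, "yes", math.log(0.5)), (2, 1, "no", math.log(0.5))],
    1: [(1, 0, None, 0.0)],
    2: [(2, 1, None, 0.0)],
}
toy_frames = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.8), math.log(0.2)],
]
result = viterbi_decode(toy_arcs, toy_frames, finals=frozenset({1, 2}))
```

Each iteration of the outer loop reads arcs for every active state, which is why the search graph contributes heavily to memory traffic in a real recognizer.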
Both the acoustic modeling and search are typically memory-intensive processes that become a performance and power bottleneck. The Inventors mitigate this bottleneck by reducing off-chip bandwidth from three sources: 1) The acoustic model (DNN) parameters, which they train with all layers having a comparable number of nodes to maximize on-chip memory utilization; 2) The search graph (WFST) parameters, for which they use on-chip caching to reduce bandwidth and compress the WFST with application-specific encodings to maximize hit rates; and 3) The static lattice snapshots constructed by the Viterbi algorithm to represent word hypotheses, which are stored on-chip rather than in external memory to create a much smaller data structure.
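To make the caching idea in item 2 concrete, the sketch below models a small cache in front of a slow external memory and counts how many state lookups avoid an off-chip read. It is only an illustration: the least-recently-used eviction policy, the class name `StateCache`, and the `fetch_offchip` callback are assumptions, and the source does not describe the actual cache organization or the WFST encoding.

```python
from collections import OrderedDict

class StateCache:
    """Toy LRU cache for search-graph states (eviction policy is an assumption)."""

    def __init__(self, capacity, fetch_offchip):
        self.capacity = capacity            # number of states that fit on-chip
        self.fetch_offchip = fetch_offchip  # hypothetical slow external read
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, state_id):
        if state_id in self.entries:
            # Hit: served from on-chip memory, no external bandwidth used.
            self.entries.move_to_end(state_id)
            self.hits += 1
            return self.entries[state_id]
        # Miss: one off-chip read, then cache the result.
        self.misses += 1
        value = self.fetch_offchip(state_id)
        self.entries[state_id] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Usage with a stand-in for external memory.
cache = StateCache(capacity=2, fetch_offchip=lambda s: ("arcs-for", s))
for state in [1, 2, 1, 3, 2]:
    cache.get(state)
```

In this model, a more compact encoding lets more states fit in the same on-chip capacity, which is one way compression can raise the hit rate.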
- Focus on memory bandwidth reduction allows ASR chip use with slower, non-volatile memories, reducing time-averaged system power consumption when speech input is presented with a low duty cycle
- Technology achieves core power consumption about 10 times lower than otherwise comparable designs
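The duty-cycle argument in the first point above can be written as simple arithmetic: time-averaged power is the active power weighted by the fraction of time speech is present, plus the idle power weighted by the remainder. The sketch below assumes this two-state model; the function name `time_averaged_power` and the example figures are hypothetical and are not measurements of this technology.

```python
def time_averaged_power(duty_cycle, active_mw, idle_mw):
    """Average power in mW for a system that is active a fraction of the time.

    duty_cycle: fraction of time speech input is present (0.0 to 1.0).
    active_mw, idle_mw: hypothetical power draw while decoding and while idle.
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return duty_cycle * active_mw + (1.0 - duty_cycle) * idle_mw

# Illustrative numbers only: at a low duty cycle the idle term dominates,
# so a memory that needs little or no power while idle lowers the average.
with_idle_draw = time_averaged_power(0.05, active_mw=100.0, idle_mw=20.0)
without_idle_draw = time_averaged_power(0.05, active_mw=100.0, idle_mw=0.0)
```

Under these assumed figures, removing the idle draw matters far more than the active power because the system spends 95% of its time idle.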