At a high level, the algorithms used by this ASR engine are similar to a typical software implementation. It searches a hidden Markov model (HMM) specified by weighted finite-state transducer (WFST) and deep neural network (DNN) statistical models. An audio waveform is split into frames, each of which is transformed into a high-dimensional feature space. The likelihoods of these feature vectors with respect to the acoustic models are evaluated by a fully connected, feed-forward DNN. A Viterbi search maintains a set of active hypotheses and combines these likelihoods with transition probabilities from the WFST. At the end of an utterance, the most likely hypothesis is converted to a sequence of words.
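The frame-synchronous search described above can be sketched in a few lines. This is a minimal illustration only, assuming a toy WFST represented as a dictionary of `state -> [(next_state, label, weight)]` arcs and per-frame acoustic log-likelihoods indexed by label; the engine's actual data structures and pruning strategy are not shown.

```python
import math

def viterbi_search(wfst, start_state, acoustic_scores):
    """Return the best label sequence over all frames.

    wfst: {state: [(next_state, label, log_weight), ...]} (illustrative format)
    acoustic_scores: one {label: log_likelihood} dict per frame,
                     standing in for the DNN acoustic-model outputs.
    """
    # Each active hypothesis: state -> (cumulative log score, label history)
    hyps = {start_state: (0.0, [])}
    for frame_scores in acoustic_scores:
        new_hyps = {}
        for state, (score, hist) in hyps.items():
            for nxt, label, weight in wfst.get(state, []):
                # Combine the acoustic likelihood with the WFST transition weight
                s = score + frame_scores.get(label, -math.inf) + weight
                if nxt not in new_hyps or s > new_hyps[nxt][0]:
                    new_hyps[nxt] = (s, hist + [label])
        hyps = new_hyps
    best_score, best_hist = max(hyps.values(), key=lambda t: t[0])
    return best_hist
```

A real decoder would also apply beam pruning to the hypothesis set each frame, which is what keeps the active set small enough to bound memory traffic.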
Both the acoustic modeling and search are typically memory-intensive processes that become a performance and power bottleneck. The Inventors mitigate this bottleneck by reducing off-chip bandwidth from three sources: 1) The acoustic model (DNN) parameters, which they train with all layers having a comparable number of nodes to maximize on-chip memory utilization; 2) The search graph (WFST) parameters, for which they use on-chip caching to reduce bandwidth and compress the WFST with application-specific encodings to maximize hit rates; and 3) The static lattice snapshots constructed by the Viterbi algorithm to represent word hypotheses, which are condensed into a much smaller data structure that can be stored on-chip rather than in external memory.
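To make the second point concrete, the following sketch shows one kind of application-specific arc encoding that shrinks a WFST's memory footprint: arcs leaving a state are sorted by destination, destinations are delta-encoded as variable-length integers, and weights are quantized to 8 bits. The actual encoding used by the engine is not specified here; all names and parameters below (`varint`, `encode_arcs`, `weight_scale`) are hypothetical.

```python
def varint(n):
    """Encode a non-negative integer as a variable-length byte string,
    7 bits per byte with a continuation flag in the high bit."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_arcs(arcs, weight_scale=16):
    """arcs: list of (dest_state, label_id, weight). Returns compact bytes.

    Sorting by destination makes the deltas small, so most destinations
    fit in a single byte; weights are quantized to 8 bits (illustrative).
    """
    out = bytearray()
    prev = 0
    for dest, label, weight in sorted(arcs):
        out += varint(dest - prev)  # delta-encoded destination state
        prev = dest
        out += varint(label)        # label id as a varint
        q = max(0, min(255, int(round(weight * weight_scale))))
        out.append(q)               # 8-bit quantized weight
    return bytes(out)
```

Packing each arc into a few bytes rather than, say, three 32-bit words means more of the graph fits in a fixed-size on-chip cache, which is how a denser encoding translates into higher hit rates and lower off-chip bandwidth.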