Textbook in PDF format
This book introduces the theory, algorithms, and implementation techniques for efficient decoding in speech recognition, focusing mainly on the Weighted Finite-State Transducer (WFST) approach. The decoding process for speech recognition is viewed as a search problem whose goal is to find the sequence of words that best matches an input speech signal. Since this process becomes computationally more expensive as the system vocabulary grows, research has long been devoted to reducing the computational cost. Recently, the WFST approach has become an important state-of-the-art speech recognition technology because it offers improved decoding speed with fewer recognition errors than conventional methods. However, it is not easy to understand all the algorithms used in this framework, and for many people they remain a black box. In this book, we review the WFST approach and aim to provide comprehensive interpretations of WFST operations and decoding algorithms to help anyone who wants to understand, develop, and study WFST-based speech recognizers. We also discuss recent advances in this framework and its applications to spoken language processing. Table of Contents: Introduction / Brief Overview of Speech Recognition / Introduction to Weighted Finite-State Transducers / Speech Recognition by Weighted Finite-State Transducers / Dynamic Decoders with On-the-fly WFST Operations / Summary and Perspective
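For reference, the search problem described above is conventionally formalized as finding the word sequence W that maximizes the posterior probability of the words given the acoustic observations X. This is the standard statistical decoding criterion treated in the chapter on the statistical framework of speech recognition; the notation below is the conventional one, not a quotation from the book:

\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid X) = \operatorname*{arg\,max}_{W} P(X \mid W)\, P(W)

Here P(X | W) is supplied by the acoustic model and P(W) by the language model, and the decoding algorithms surveyed in the book concern carrying out this maximization efficiently over very large search spaces.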
Preface
Introduction
    Speech Recognition and Computation
    Why WFST?
    Purpose of this Book
    Book Organization
Brief Overview of Speech Recognition
    Statistical Framework of Speech Recognition
    Speech Analysis
    Acoustic Model
    Hidden Markov Model
    Computation of Acoustic Likelihood
    Output Probability Distribution
    Subword Models and Pronunciation Lexicon
    Context-dependent Phone Models
    Language Model
    Finite-State Grammar
    N-gram Model
    Back-off Smoothing
    Decoder
    Viterbi Algorithm for Continuous Speech Recognition
    Time-Synchronous Viterbi Beam Search
    Practical Techniques for LVCSR
    Context-dependent Phone Search Network
    Lattice Generation and N-Best Search
Introduction to Weighted Finite-State Transducers
    Finite Automata
    Basic Properties of Finite Automata
    Semiring
    Basic Operations
    Transducer Composition
    Optimization
    Determinization
    Weight Pushing
    Minimization
    Epsilon Removal
Speech Recognition by Weighted Finite-State Transducers
    Overview of WFST-based Speech Recognition
    Construction of Component WFSTs
    Acoustic Models
    Phone Context Dependency
    Pronunciation Lexicon
    Language Models
    Composition and Optimization
    Decoding Algorithm Using a Single WFST
    Decoding Performance
Dynamic Decoders with On-the-fly WFST Operations
    Problems in the Native WFST Approach
    On-the-fly Composition and Optimization
    Known Problems of On-the-fly Composition Approach
    Look-ahead Composition
    How to Obtain Prospective Output Labels
    Basic Principle of Look-ahead Composition
    Realization of Look-ahead Composition Using a Filter Transducer
    Look-ahead Composition with Weight Pushing
    Generalized Composition
    Interval Representation of Label Sets
    On-the-fly Rescoring Approach
    Construction of Component WFSTs for On-the-fly Rescoring
    Concept
    Algorithm
    Approximation in Decoding
    Comparison with Look-ahead Composition
Summary and Perspective
    Realization of Advanced Speech Recognition Techniques Using WFSTs
    WFSTs for Extended Language Models
    Dynamic Grammars Based on WFSTs
    Wide-context-dependent HMMs
    Extension of WFSTs for Multi-modal Inputs
    Use of WFSTs for Learning
    Integration of Speech and Language Processing
    Other Speech Applications Using WFSTs
    Conclusion
Bibliography
Authors' Biographies