Speech Research Proposal

Use of Symbolic Processing for Fluent Speech Recognition

As speech research progresses in this decade, connected speech recognition will become more tractable as better pattern-matching techniques are devised, and lexical access will become simpler as parallel search strategies are better understood. The basic obstacle that will remain in the task of robust, automatic recognition of unrestricted speech is understanding and compensating for the sources of variability in fluent speech. These sources range from acoustic factors due to the speech environment to dialect- and style-specific phonological factors, and any one of them can introduce unacceptable error into the best recognition system.

The thrust of our proposed research is to shift more of the burden of low-level stimulus identification to an early, abstract classification of acoustic distinctions, and to exploit language-particular phonological constraints in the actual categorization task. We have developed algorithms for abstract cluster analysis in response to the phenomenon that certain dissimilar sounds "count" as the same in phonetic terms, while other, acoustically similar sounds must remain distinct for recognition purposes. The same algorithms allow us to separate speech sounds from certain kinds of noise. We have also developed methods to explore higher-level sources of variability, making use of separate phonetic lexicons for different languages (since some easily perceived sounds are meaningful in some languages but are extraneous in others) as well as language-particular phonetic and phonological grammars. Our formalization of this kind of linguistic knowledge permits the integration of constraints from domains, such as prosody, that have been difficult to incorporate in the past.

This approach to fluent speech recognition, based upon modular, symbolic processing, is designed to cope with the many sources of variability in the signal. We feel that our perspective lends itself best to research directed at the difficulties posed by: (1) timing variation due to rate and prosodic factors; (2) the presence of certain kinds of noise in the signal; (3) morphological productivity; and (4) phonological variation in continuous speech.

1. Overview

Approaches to fluent speech recognition differ according to where they locate the boundary between the acoustic processing that produces an initial discrete characterization of the signal and the further symbolic processing of that characterization into morphemes or words. Our strategic shift pushes this acoustic/symbolic boundary below the standard level of phonetic features that many other systems depend on, down to a level at which it is possible to reliably determine non-linguistic, qualitative, physical descriptors of an unsegmented acoustic wave. In contrast to other approaches, we compensate for the lack of phonetic invariance in the signal and for the well-known segment-boundary problem by shifting more of the computational burden from acoustic processing to symbolic processing of these sub-segmental qualitative labels. In our approach, the interpretation of the spoken input as a sequence of morphemes or words emerges from these sub-segmental labels through the interaction of a series of novel transduction processes that obey explicit constraints encoded in distinct allophonic, lexical, and metrical system modules.
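To fix ideas, this division of labor can be sketched in a few lines of code. The module names, label strings, and data shapes below are hypothetical illustrations only, not components of the actual system; the point of the sketch is that each symbolic stage narrows the set of interpretations licensed by the stage before it.

```python
# A minimal sketch of the proposed pipeline. All names, labels, and
# data shapes are hypothetical illustrations, not system components.

def label_signal(samples):
    """Acoustic stage: recode the sampled waveform as qualitative,
    language-independent labels (see section 3)."""
    return ["silence", "long-burst", "high-F2"]      # placeholder output

def transduce(labels):
    """Transduction stage: map qualitative labels to sets of candidate
    phonetic categories (aspiration, frication, closure, ...)."""
    return [{"closure"}, {"aspiration", "frication"}, {"vowel"}]

def recognize(samples, modules):
    """Symbolic stage: allophonic, lexical, and metrical modules each
    prune the interpretations inconsistent with their constraints."""
    candidates = transduce(label_signal(samples))
    for constrain in modules:    # each module maps candidates -> candidates
        candidates = constrain(candidates)
    return candidates
```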
We have formulated mathematically precise descriptions of such acoustic, segmental, syllabic, and suprasyllabic constraints for English. We have also developed new computational techniques that now make feasible a modular recognition system based on these constraints:

• Constraint specification: the structure-mapping mechanisms of Lexical-Functional Grammar (Kaplan & Bresnan, 1982), operating on formalisms for explicitly representing these constraints, based on several studies in metrical and segmental phonological theory (Withgott, 1982; Kiparsky, 1981).

• Debugging facilities: a collection of tools, such as the Grammar Writer's Workbench for Lexical-Functional Grammar (Kaplan & Kiparsky, 1985), that enable us to easily write and rapidly test new descriptions in these formalisms and in related experimental formalisms.

• Constraint application: compile- and run-time methods for processing such constraints. These include multi-processing active chart parsers (Kaplan, 1973) and finite-state transducer models for allophonic rules (Kaplan & Kay, 1981; Kaplan, 1984), as well as solution algorithms, including assumption-based truth-maintenance procedures (de Kleer, 1985), that permit the constraints from the separate modules to be efficiently integrated and enable the interactions between them to be monitored and explicitly controlled.

We have also developed new qualitative labeling methods for abstract cluster analysis, based on dynamic attractors (Huberman & Hogg, 1984). These provide a family of simple, fault-tolerant pattern-classification architectures that appear to lend themselves to VLSI implementation.

We believe that only a system emphasizing the explicit interactions among such symbolic processing components can behave robustly in the face of the productive variability inherent in speech. We also believe that our existing technologies provide a sound foundation for exploring this new division of labor between acoustic and symbolic components in speech recognition. We propose a three-year research project, building on these technologies, to devise enough of the linguistic and acoustic descriptions and enough of the system components to permit a thorough evaluation of this new approach to the design of robust recognition systems for unrestricted, fluent speech.

In the remainder of this document, we briefly discuss previous approaches and then describe our theory and technologies in greater detail, explaining how they combine to handle the variability and productivity that characterize ordinary continuous speech.

2. Previous recognition efforts

Previous speech recognition systems have varied in the richness of their signal-processing and linguistic-processing capabilities. In systems using dynamic time warping (e.g. Rabiner & Levinson, 1981), nearly all the work is done at the signal-processing level, with almost no further processing. Whole-word prototypes derived from training data are stored as reference templates, and inputs are matched using a prechosen distortion measure for time alignment. While these systems, along with those based on probabilistic hidden Markov models (e.g. Rabiner et al., 1983; Makhoul et al., 1984), have been quite successful at identifying words from a fixed list occurring in a small set of phonological or morphological contexts, an impractically large number of reference templates would have to be specified, stored, and matched for such systems to operate on the productive contextual variations of unlimited continuous speech.
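The time-alignment step underlying such template-matching systems can be sketched as follows. This is the textbook dynamic-time-warping recurrence with an illustrative distortion measure, not the exact algorithm of any cited system; it also makes plain why the template inventory must grow with every contextual variant.

```python
# Textbook dynamic-time-warping (DTW) alignment between an input
# feature sequence and a stored whole-word template. The distortion
# measure here (absolute difference per frame) is an illustrative
# stand-in for whatever measure a given system prechooses.

def dtw_distance(input_seq, template):
    n, m = len(input_seq), len(template)
    INF = float("inf")
    # cost[i][j] = least accumulated distortion aligning the first i
    # input frames with the first j template frames.
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(input_seq[i - 1] - template[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch the input
                                 cost[i][j - 1],      # stretch the template
                                 cost[i - 1][j - 1])  # match frame to frame
    return cost[n][m]

# Recognition is then a nearest-template search over the whole
# inventory, e.g.:
#   best = min(templates, key=lambda t: dtw_distance(frames, t))
# which is why every contextual variant needs its own stored template.
```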
There are other classes of models that employ slightly more symbolic processing. In the HARPY system (Lowerre, 1976) and in more recent approaches (e.g. Bush et al., 1984), recognition is accomplished by finding a path through a finite-state pronunciation network, and various distance metrics have been proposed to rate the fit of the match. These systems also perform well only on inputs of restricted variability, achieved by limiting vocabulary size (often to lexicons of fewer than thirty items), by requiring speakers to pause between words, or by limiting recognition to the speakers on whom the system was trained.

Other systems do much more symbolic processing of phonetic features (e.g. Stevens, 1982; Shipman & Zue, 1983; Huttenlocher & Zue, 1984; Church, 1983), and if they were extended to integrate metrical and sub-segmental acoustic constraints more fully, they might in principle be capable of dealing with the combinatorics of contextual variation. But the symbolic components of these systems have been constructed on the assumption that they can be coupled to methods for reliably segmenting the spoken input into allophonic or demi-syllabic units marked with clearly identified phonetic cues, and such methods have been difficult to discover. This is because the acoustic realization of a given utterance may vary not only according to the linguistic context but also according to the dialect and vocal tract of the speaker, the rate and register of speech, and the presence of environmental and speaker-produced noise.

3. Qualitative speech analysis and phonological constraints

Our model cuts up the recognition problem in a novel way. We label the sampled signal in a descriptive, language-independent fashion. This shift of course demands added power from other parts of the model, power that comes from phonetic and phonological grammars and processors. Figure 1 shows the flow of information in our model. In this section we describe the discrete analysis that produces a finite label set and enables us to ignore noise, the language-specific constraints used to build alternative interpretations of the initial symbolic labels, and the way in which ambiguities are resolved by taking various phonological constraints into account.

In our approach, the speech sequence is neither warped to match a stored prototype nor submitted to a best-fit procedure using language-specific templates. Instead, the sampled signal is labeled in a qualitative, language-independent way and then analyzed according to extremely low-level but language-specific constraints, as well as higher-level phonological constraints. Phonetic transductions and phonological constraints together provide stable representations for spoken words even though their associated acoustic labels vary across utterance situations and even though information in the signal may be incomplete or partially obscured by noise.

The signal-analysis component in our model exploits the concept of attractive fixed points, drawn from the theory of dynamical systems in physics. This concept has been applied to discrete processes to produce computing architectures capable of fault-tolerant pattern classification (Huberman & Hogg, 1984). These architectures currently consist of arrays of simple local units that operate on integer data received from their neighbors. Input to the machine takes place only along the edges, and the computation is advanced systolically from row to row in step with a global clock.
Each processor has an internal state, represented by an integer that can take only a small set of values depending on a given adaptive rule. The unit determines its local output based on its inputs and its internal state. At each time step, every element receives data values from the adjoining units in one direction and computes its output, which is then passed along to its next neighbors. Huberman and Hogg have recently shown that these architectures can be made to compute in a distributed, deterministic, self-repairing fashion.

In our application, these machines achieve a discrete recoding of the continuous input into relative, qualitative labels such as high second formant, silence, long burst, etc. In forming these labels the mechanism clusters its inputs, but its resemblance to current front ends stops there. The array is initially "trained" by providing a sequence of input pairs together with an indication of whether the items in each pair are to be assigned the same label. If some of the initial clusters prove incorrect, the array can be automatically modified during recognition to pull clusters apart or combine them. The dynamic attractors provide an efficient means of determining which cluster, and hence which label, a given input corresponds to. Our mechanism also exhibits invariance to small input- or device-induced distortions and rapid adaptation to new training samples (as it processes inputs from new speakers). A further, practical point is that the front-end analyzer is easily implemented in VLSI, an advantage we intend to pursue with other groups at PARC.

Once a symbolic representation of the signal is produced, it is translated into phonetic categories such as aspiration, frication, closure, etc., possibly by a two-tape finite-state transducer (Kaplan & Kay, 1981; Kaplan, 1984; Koskenniemi, 1983) derived from language-independent specifications. The phonetic categories are then interpreted as phonological structures according to the constraints of a number of interacting, language-particular grammars. There are allophonic, metrical, and lexical (word-level) constraints on how phonetic and phonological units may combine and on what words they represent. The specification of deletion and allophonic-reduction sites is also contained in this component. Beyond this, lexical rules further constrain possible sound sequences according to the morphemes listed in the dictionary and the ways the language allows words to be composed from those morphemes.

Although the phonological description is strictly hierarchical in nature, this does not preclude mixed-grain representations. A given level might include a fully specified allophone followed by a vague category like "nasal." Other approaches (e.g. Shipman & Zue, 1983) have used only dictionary constraints to pin down likely interpretations for such underspecified structures, but in our approach other phonological constraints can also reduce the uncertainty. If we find, for instance, that some kind of nasal precedes a [g] in the middle of a foot, metrical foot-structure constraints tell us that the nasal must be velar, because that is the only possible foot-medial nasal preceding a [g] (as in the word finger) (Withgott, 1982; Kiparsky, 1981). These phonological constraints are particularly important for fluent speech recognition since, unlike lexical constraints, they can be applied whether or not word boundaries have been reliably identified, and even when the input is a novel word.
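A simplified simulation conveys the style of computation in the labeling array described at the beginning of this section. The update rule below is an illustrative stand-in for the Huberman-Hogg adaptive dynamics, chosen only to exhibit local communication, small integer states, and row-by-row systolic advance; the essential property is that inputs settling into the same final pattern receive the same qualitative label.

```python
# A simplified simulation of a systolic labeling array. Each unit keeps
# a small-integer internal state and combines the two values arriving
# from the previous row; after the last row, the surviving pattern
# indexes a cluster label. The update rule is an illustrative stand-in,
# not the Huberman-Hogg adaptive rule.

NUM_STATES = 4  # each unit's internal state ranges over a small set

def unit_output(left, right, state):
    """One processing element: output depends only on its local inputs
    and its internal state."""
    return (left + right + state) % NUM_STATES

def run_array(edge_inputs, states):
    """Advance the computation row by row in step with a global clock.
    states[r][c] is the internal state of the unit at row r, column c."""
    row = edge_inputs                      # data enters only along one edge
    for state_row in states:
        row = [unit_output(row[c], row[(c + 1) % len(row)], s)
               for c, s in enumerate(state_row)]
    return row

def label_of(edge_inputs, states):
    """Inputs that settle into the same final pattern (attractor)
    are assigned the same qualitative label."""
    return tuple(run_array(edge_inputs, states))
```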
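The finger example can be rendered as a single constraint check. The encoding below is a hypothetical illustration, but the pruning logic is exactly the one just described: an underspecified "nasal" before [g] inside a foot is forced to be velar.

```python
# Resolving an underspecified 'nasal' label with a metrical constraint:
# foot-medially before [g], the only possible English nasal is the
# velar (as in "finger"). The feature encoding here is a hypothetical
# illustration of that single constraint.

NASALS = {"m", "n", "N"}   # bilabial, alveolar, velar (N for engma)

def resolve_nasal(candidates, following, foot_medial):
    """Prune nasal candidates using the foot-structure constraint."""
    assert candidates <= NASALS
    if foot_medial and following == "g":
        return candidates & {"N"}    # only the velar nasal survives
    return candidates

# An underspecified nasal before [g] in the middle of a foot:
print(resolve_nasal(NASALS, following="g", foot_medial=True))  # {'N'}
```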
Our current phonological grammars are expressed in formalisms that can be processed by the active-chart technology originally implemented for natural-language syntax, although the theory behind its development was not restricted to that domain (Kaplan, 1973). The design of the Kaplan system reflects the idea of a collection of asynchronous communicating processes that handle various aspects of the recognition problem. Abandoning the familiar linear control strategy, the system effectively reduces the computation space by keeping track of a collection of constraints that minimizes dead ends in the parse. The phonological grammar becomes a set of recursive transition networks in which labels on directed arcs correspond to sound constituents of various sizes. These transitions can also be conditioned by non-local dependencies of the sort that can be expressed in the lexical-functional grammatical formalism (Kaplan & Bresnan, 1982). These processing algorithms have been coupled to a sophisticated display-based user interface that allows for rapid writing and easy debugging of phonological specifications. We are now also exploring other experimental grammar formalisms, constraint languages, and solution algorithms for them.

By virtue of this modular organization of constraints, our model naturally deals with the sources of variability that have proved difficult for other recognition approaches to handle:

• Presence of "noise". One source of variation in the speech signal is the presence of both environmental and speaker-produced noise. Setting aside ambient or channel noise that can be filtered out with standard techniques, we cope with language-like noise as well as extraneous sounds by means of a mapping to a phonetic lexicon and through the particular design of the front end. Some labels prove distinctive in some languages but are ignored as extra-linguistic in others, being treated as noise outside the language-specific phonetic lexicon. Lip smacks and clicks are good examples of acoustic events that have no match in the phonetic lexicon of English but must be processed for certain African languages (e.g. Hottentot, Zulu, Xhosa). The language-independent identification of potential linguistic categories in this modular treatment also provides a ready-made foundation for assigning categories for new languages.

• Timing variation. Rhythmic, metrical factors, along with rate, create a serious problem of timing variation. It is obviously difficult to recover temporal information in dynamic-time-warping systems, and it is hard to capture duration generalizations in systems that use finite hidden Markov models. Yet it is well known that duration is helpful in segment identification, both for distinguishing similar segments (e.g. stops) and for determining how many allophones an utterance contains. We can exploit information of this type by defining the transduction from acoustic labels to linguistic categories with constraints that are explicitly sensitive to the contextual factors affecting timing, including speech rate and metrical environment.

• Location of boundaries. Words spoken in isolation bear only limited resemblance to words in fluent sentences and may be hard to segment; similarly, the location of allophone "boundaries" is rarely the same in two pronunciations of the same word. Parsing with phonetic and phonological grammars permits segments, metrical constituents, and words to emerge in the well-formed parse, bypassing the search for segment boundaries.
• Language productivity. Another reason for including phonological grammars is apparent in cross-linguistic studies. We must expect novel words; the inventory of lexical items is not the "bin of parts" we find in robotic vision tasks. For languages with rich morphologies (e.g. Finnish), where a single verb may exhibit several thousand variants created by processes of inflection, it is impractical or impossible to build a system based on the usual design of simple lexical look-up on a finite word list. In contrast, we have at hand technology enabling us to include a morphological parser as a part of the model (Koskenniemi, 1983).

• Phonological variation. The parts of the model described so far are designed to handle some aspects of the problem of speaker variation, including certain segmental differences and changes in speaking rate; other aspects involve setting parameters at the front end (e.g., whether the speaker is an adult male, an adult female, or a child). The reason we wish to explore speaker differences is to test the adequacy of three levels of symbolic representation: our radically low-level one, the more traditional level that is essentially syllabic, and the higher level consisting of strings of syllables. We observe that although whole syllables can drop out in ordinary, casual speech in some dialects and speech styles, entire feet (which span several syllables) do not, thus giving us a way to recover the intended word.

These are the sources of variability in the signal that various approaches try to overcome by means of normalization, searches for invariant acoustic cues, or successively more complex attempts at template matching. Current systems treat variability in the shape of the speech signal as distortion, regardless of its source. What we believe to be the way around the problem of variability, and what is unique in our model, is its reformulation of the recognition task, its incorporation of constraints from phonological constituents ranging from the foot down to the sub-segment, its particular use of symbolic representations of physical signals, and the way in which we have compartmentalized the knowledge into language-independent and language-specific modules.

4. Proposal

We believe we are uniquely positioned to engage in this exploration. We are investigating optimal parsing design with researchers of the Center for the Study of Language and Information at Stanford, as well as general phonological and speech recognition issues with members of the local research community. In addition, we are able to benefit from Kaplan's and Kay's current work on lexical access and encoding. Finally, our computational environment includes access to simulations of the Huberman-Hogg machines and to the LFG system, in which both preliminary phonetic and high-level phonological components have been successfully implemented.

We propose a three-year effort in order to make substantial progress on these issues. While we now have designs and simulations of components that address some of the difficulties associated with variability and productivity, we must systematically investigate how to improve them, both from the point of view of the scientific model and from an efficiency perspective. In addition, we want to implement our software in a special-purpose chip to enable far greater processing speed.
The support we seek is for the investigation of scientific questions including how to develop better adaptive structures for time-varying inputs, whether a two-tape finite-state automaton will suffice to map the array outputs to the parser, how to develop the metrical constraints already in place and to add further phonological constraints, and, finally, how best to exploit the machines we have developed for lexical look-up.

NEEDED

Withgott, to work on theory, phonological and phonetic investigation, and grammar design.

Hogg, to work on the abstract cluster-analysis architecture.

Possibly less than full time:

An expert Lisp hacker with a strong theoretical background, to collaborate on theory development and to work on tool development.

A speech researcher, to work on compatibility with research elsewhere, statistics, and lexical investigation, as well as to help in implementation.

A secretarial/administrative assistant, to perform back-ups, be in charge of purchase orders, etc.

Travel funds: $1000 per year per person.

A Symbolics 3670 with a 474-Mbyte disk, a 10-Mbyte memory board, a floating-point accelerator, and Gbus and Ethernet connections; maintenance contract.

REFERENCES

Bush, M., G. Kopec, and M. Hamilton (Fairchild). Network-based Isolated Digit Recognition Using Vector Quantization. Presentation at ASA, October 1984.

Church, K. Phrase-Structure Parsing: A Method for Taking Advantage of Allophonic Constraints. MIT Ph.D. dissertation, 1983 (available from RLE Publications, MIT, Cambridge, MA).

de Kleer, J. An Assumption-based Truth Maintenance System. Xerox PARC, 1985.

Hogg, T. and B. A. Huberman. Understanding Biological Computation: Reliable Learning and Recognition. Proc. Natl. Acad. Sci. USA 81, 6871-6875, 1984.

Huberman, B. A. and T. Hogg. Adaptation and Self-Repair in Parallel Computing Structures. Phys. Rev. Lett. 52, 1048-1051, 1984.

Huttenlocher, D. and V. Zue (MIT). Presentation at ICASSP, June 1984.

Kaplan, R. A multi-processing approach to natural language. Proc. of the First National Computer Conference, AFIPS Press, 1973, 435-440.

Kaplan, R. Finite-state models of phonological rule systems. Paper presented at the Mathematics of Language Conference, University of Michigan, October 1984.

Kaplan, R. and M. Kay. Phonological rules as finite-state transducers. Paper presented at the Winter Meeting of the Linguistic Society of America, New York, 1981.

Kaplan, R. and C. Kiparsky. The LFG Grammar Writer's Workbench. Xerox PARC, 1985.

Kiparsky, P. Metrical Structure Assignment is Cyclic. Linguistic Inquiry 10, 421-441.

Koskenniemi, K. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. University of Helsinki Publications, No. 11, 1983.

Lowerre, B. The HARPY Speech Recognition System. CMU Ph.D. dissertation, 1976.

Makhoul, J., R. Schwartz, Yen-Lu Chow, O. Kimball, S. Roucos, and M. Krasner (BBN). Continuous phonetic speech recognition. Presentation at ASA, October 1984.

Rabiner, L. and S. Levinson. Isolated and connected word recognition -- theory and selected applications. IEEE Trans. on Communications, COM-29, 5, May 1981.

Rabiner, L., S. Levinson, and M. Sondhi. On the application of vector quantization and hidden Markov models to speaker-independent, isolated word recognition. BSTJ 62(4), April 1983.

Shipman, D. and V. Zue. Properties of large lexicons: implications for advanced word recognition systems. 1983 IEEE International Conference on Acoustics, Speech and Signal Processing, Paris, France.

Stevens, K. N. Toward a feature-based model of speech perception. IPO Annu. Prog. Rep. 17, 36-37, 1982.
Withgott, M. Segmental Evidence for Phonological Constituents. UT Austin Ph.D. dissertation, 1982 (available from University Microfilms, Ann Arbor).