Pre-proposal Draft

Symbolic Analysis for Continuous Speech Recognition

Approaches to continuous speech recognition differ according to where they locate the boundary between the acoustic processing that produces an initial discrete characterization of the signal and the further symbolic processing of that characterization into morphemes or words. Our strategy pushes the acoustic/symbolic boundary below the standard level of phonetic features on which many other systems depend, down to a level at which it is possible to reliably determine non-linguistic, qualitative, physical descriptors of an unsegmented acoustic wave. These qualitative labels vary whenever the speech signal varies, even when a human listener would perceive the same words being spoken. In contrast to other approaches, we compensate for the lack of phonetic invariance in the signal and for the well-known problem of segmentation by shifting more of the computational burden from acoustic processing to symbolic processing of these sub-segmental qualitative labels.

In our approach, the interpretation of the spoken input as a sequence of morphemes or words emerges from these sub-segmental labels through the interaction of a series of novel transduction processes that obey explicit constraints encoded in distinct allophonic, lexical, and metrical system modules. We have formulated mathematically precise descriptions of such acoustic, segmental, syllabic, and suprasyllabic constraints for English. We have also developed new computational techniques that now make feasible a modular recognition system based on these constraints:

• Constraint specification: the structure-mapping mechanisms of Lexical-Functional Grammar (Kaplan & Bresnan, 1982), operating on formalisms for explicitly representing these constraints, based on several studies in metrical and segmental phonological theory (Withgott, 1982; Kiparsky, 1981).

• Debugging facilities: a collection of tools, such as the Grammar Writer's Workbench for Lexical-Functional Grammar (Kaplan & Kiparsky, 1985), that enable us to write new descriptions in these formalisms easily and to test them rapidly.

• Constraint application: compile- and run-time methods for processing such constraints, including multi-processing active chart parsers (Kaplan, 1973), finite-state transducer models for allophonic rules (Kaplan & Kay, 1981; Kaplan, 1984), and assumption-based truth-maintenance procedures (de Kleer, 1985) that permit the constraints from the separate modules to be efficiently integrated and enable the interactions between them to be monitored and explicitly controlled.

We have also developed new methods for deriving qualitative labels from acoustic waves, based on dynamic attractors (Huberman & Hogg, 1984). These provide a family of simple, fault-tolerant pattern-classification architectures that appear to lend themselves to VLSI implementation.

We believe that only a system emphasizing the explicit interactions among such symbolic processing components can behave robustly in the face of the productive variability inherent in speech. We also believe that our existing technologies provide a sound foundation for exploring this new division of labor between acoustic and symbolic components in speech recognition.
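To summarize this division of labor in skeletal form, the Python sketch below shows the three processing stages as bare signatures. It is purely illustrative: the function names, label strings, and category strings are placeholders of our own, not components of the implemented system.

    from typing import List

    def qualitative_labels(samples: List[float]) -> List[str]:
        """Language-independent, qualitative descriptors of the raw,
        unsegmented wave, e.g. 'burst', 'silence', 'high-F2'
        (the label set shown here is illustrative)."""
        ...

    def phonetic_categories(labels: List[str]) -> List[str]:
        """Two-tape finite-state transduction of qualitative labels into
        categories such as 'aspiration', 'frication', 'closure'."""
        ...

    def parse_words(categories: List[str]) -> List[str]:
        """Chart parsing under allophonic, metrical, and lexical
        constraints; words emerge from well-formed parses rather than
        from prior segmentation of the signal."""
        ...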
We propose a three-year research project, building on these technologies, to devise enough of the linguistic and acoustic descriptions and enough of the system components to permit a thorough evaluation of this new approach to the design of robust recognition systems for non-restricted, continuous speech. In the remainder of this document, we briefly discuss previous approaches and then describe our theory and technologies in greater detail, explaining how they combine to handle the variability and productivity that characterize ordinary continuous speech.

1. Previous recognition efforts

Previous speech recognition systems have varied in the richness of their signal processing and linguistic processing capabilities. In systems using dynamic time warping (e.g. Rabiner & Levinson, 1981), nearly all the work is done at the signal processing level, with almost no further symbolic processing. Whole-word prototypes derived from training data are stored as reference templates, and inputs are matched using a prechosen distortion measure for time-alignment. Systems based on probabilistic hidden Markov models (e.g. Rabiner et al., 1983; Makhoul et al., 1984) similarly rely on non-symbolic training data, coupled to a minimal symbolic component consisting of an extensional finite-state network characterization of all recognizable speech inputs. Each path through this network has an associated probability, and words are identified by matching against paths with maximum probability. While dynamic time warping and Markov-model-based systems have been quite successful at identifying words from a fixed list occurring in a small set of phonological or morphological contexts, an impractically large number of reference templates would have to be specified, stored, and matched for these systems to operate on the productive contextual variations of unlimited continuous speech.

There are other network models that utilize slightly more symbolic processing. In the HARPY system (Lowerre, 1976) and in more recent approaches (e.g. Bush et al., 1984), recognition is accomplished by finding a path through a finite-state pronunciation network whose arcs may model acoustic-phonetic segments. A set of acoustic pattern matchers, which can be based on vector-quantized LPC spectra, operates at the level of individual spectral analysis frames or over groups of these. Various distance metrics have been proposed to rate the fit of the match. These systems also perform well only on inputs of restricted variability, achieved by limiting vocabulary size (often to lexicons of fewer than thirty items), requiring speakers to pause between words, or limiting recognition to the speakers on which the system was trained.

Other systems do much more symbolic processing of phonetic features (e.g. Stevens, 1982; Shipman & Zue, 1983; Huttenlocher & Zue, 1984; Church, 1983), and if they were extended to integrate metrical and sub-segmental acoustic constraints more fully, they might in principle be capable of dealing with the combinatorics of contextual variation. But the symbolic components of these systems have been constructed on the assumption that they can be coupled to methods for reliably segmenting the spoken input into allophonic or demi-syllabic units marked with clearly identified phonetic cues, and such methods have been difficult to discover. This is because the acoustic realization of a given utterance may vary not only according to the linguistic context but also according to the dialect and vocal tract of the speaker, the rate and register of speech, and the presence of environmental and speaker-produced noise.
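For concreteness, the core of the template-matching computation described above is the dynamic time warping alignment sketched below. This is the generic textbook formulation in Python; the scalar frames and the absolute-difference distortion measure are illustrative stand-ins for the spectral vectors and prechosen distortion measures of the cited systems.

    def dtw_distance(template, frames, dist=lambda a, b: abs(a - b)):
        """Cost of the best monotonic time-alignment of the input frames
        against a stored reference template."""
        n, m = len(template), len(frames)
        INF = float("inf")
        D = [[INF] * (m + 1) for _ in range(n + 1)]
        D[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                # extend the cheapest of the three admissible
                # predecessor alignments
                D[i][j] = dist(template[i - 1], frames[j - 1]) + min(
                    D[i - 1][j],      # template advances, input frame held
                    D[i][j - 1],      # input advances, template frame held
                    D[i - 1][j - 1],  # both advance
                )
        return D[n][m]

    # recognition picks the reference word whose template warps most cheaply:
    # best = min(templates, key=lambda w: dtw_distance(templates[w], frames))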
2. Qualitative speech analysis and phonological constraints

Our model cuts up the recognition problem in a novel way. We label the sampled signal in a descriptive, language-independent fashion, a strategic shift that provides a new mechanism for handling variability. This shift of course demands added power from other parts of the model, power that comes from phonetic and phonological grammars and processors. Figure 1 shows the flow of information in our model. In this section we describe the discrete analysis that produces a finite label set and enables us to ignore noise, the language-specific constraints used to build alternative interpretations of the initial symbolic labels, and the way in which ambiguities are resolved by taking various phonological constraints into account.

In our approach, the speech sequence is neither warped to match a stored prototype nor submitted to a best-fit procedure using language-specific templates. Instead, the sampled signal is labeled in a qualitative, language-independent way, and then analyzed according to extremely low-level but language-specific constraints, as well as higher-level phonological constraints. Phonetic transductions and phonological constraints together provide stable representations for spoken words even though their associated acoustic labels vary across utterance situations and even though information in the signal may be incomplete or partially obscured by noise.

The signal-processing front end in our model exploits the concept of attractive fixed points, drawn from dynamical systems theory in physics. This concept has been applied to discrete processes to produce computing architectures capable of fault-tolerant pattern classification (Huberman & Hogg, 1984). These architectures currently consist of arrays of simple local units that operate on integer data received from their neighbors. Input to the machine takes place only along the edges, and the computation is advanced systolically from row to row in step with a global clock. Each processor has an internal state, represented by an integer that can take only a small set of values determined by a given adaptive rule. At each time step, every element receives data values from the adjoining units in one direction and computes its output from those inputs and its internal state; the output is then passed along to its next neighbors. Huberman and Hogg have recently shown that these architectures can be made to compute in a distributed, deterministic, self-repairing fashion.

In our application, these machines achieve a discrete recoding of the continuous input into qualitative labels such as burst, silence, high second formant, and so on. In order to form these labels, the mechanism clusters its inputs, but its resemblance to current front ends stops there. The array is initially "trained" by providing a sequence of input pairs together with an indication of whether the items in each pair are to be assigned the same label. If some of the initial clusters prove incorrect, the array can be automatically modified during recognition to pull clusters apart or to combine them.
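The Python fragment below sketches one reading of such an array: a ring of simple units, each holding a small integer state, whose joint dynamics are iterated until they settle on a fixed point, and the fixed point reached serves as the qualitative label. The local update rule is invented for illustration, and the adaptive training of the unit states from paired examples is omitted.

    NUM_STATES = 4  # each unit's internal state takes a small set of integer values

    def step(values, states):
        """One clocked step: each unit combines the values of two
        neighboring units with its own internal state (illustrative rule)."""
        n = len(values)
        return [(values[i] + values[(i + 1) % n] + states[i]) % NUM_STATES
                for i in range(n)]

    def label(edge_input, states, max_iter=32):
        """Iterate the array dynamics from the edge input; inputs lying in
        the same basin of attraction receive the same label, with no
        explicit distance computation against stored clusters."""
        x = list(edge_input)
        for _ in range(max_iter):
            nxt = step(x, states)
            if nxt == x:          # an attractive fixed point: stop
                break
            x = nxt
        return tuple(x)           # the attractor, used as a discrete label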
Unlike more familiar vector-quantization models, the speed of labeling is not adversely affected by enlarging the set of possible labels, because it is not necessary to compare the inputs to the clusters or to compute explicit distance metrics. Instead, the dynamic attractors provide an efficient means of determining which cluster or label a given input corresponds to. Our mechanism also exhibits invariance to small input- or device-induced distortions and rapid adaptation to new training samples (as it processes inputs from new speakers). A further, practical point is that the front end is easily implemented in VLSI, an advantage we intend to pursue with other groups at PARC.

Once a symbolic representation of the signal is produced, it is translated into phonetic categories such as aspiration, frication, and closure by a two-tape finite-state transducer (Kaplan & Kay, 1981; Kaplan, 1984; Koskenniemi, 1983) derived from language-independent specifications. The phonetic categories are then parsed into phonological structures according to the constraints of a number of interacting, language-particular rule systems, or grammars. Allophonic, metrical, and lexical (word-level) rules determine how phonetic and phonological units may combine and what words they represent. The English allophonic grammar states, for example, that the phoneme /p/ must be aspirated in foot-initial position, where a foot is a metrical unit composed of a sequence of stressed and unstressed syllables. Foot grammars have not been exploited outside the theoretical phonological literature and, to our knowledge, have never been explicitly implemented outside our system. Lexical rules further constrain possible sound sequences according to the morphemes listed in the dictionary and the ways the language allows words to be composed from those morphemes.

Although phonological structures are hierarchical in nature, mixed categories appear by design on the same level. A given level might include a fully specified allophone followed by a vague category like "nasal." Other approaches (e.g. Shipman & Zue, 1983) have used only dictionary constraints to pin down likely interpretations for such underspecified structures, but in our approach other phonological constraints can also reduce the uncertainty. If we find, for instance, that some kind of nasal precedes a [g] in the middle of a foot, metrical foot-structure constraints tell us that the nasal must be velar, because that is the only possible foot-medial nasal preceding [g] (as in the word finger) (Withgott, 1982; Kiparsky, 1981). These phonological constraints are particularly important for continuous speech recognition since, unlike lexical constraints, they apply whether or not word boundaries have been reliably identified, and even when the input is a novel word.

Our phonological rules are expressed in formalisms that can be processed by the active-chart technology originally implemented for natural language syntax, although the theory behind its development was not restricted to that domain (Kaplan, 1973). The design of the system reflects the idea of a collection of asynchronous communicating processes that handle various aspects of the recognition problem. Abandoning the familiar linear control strategy, the system effectively reduces the computation space by keeping track of a collection of constraints that minimizes dead ends in the parse.
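The translation from qualitative labels into phonetic categories described above can be pictured with a toy two-tape machine. In the Python sketch below, the states, the input labels, and the rules (e.g. that silence followed by a burst signals a closure) are hypothetical simplifications of the Kaplan-Kay/Koskenniemi machinery, which also treats timing and context far more richly.

    # transitions: (state, input label) -> (next state, output category or None)
    TRANSITIONS = {
        ("start",    "silence"): ("closure?", None),          # withhold output until context resolves
        ("closure?", "burst"):   ("release",  "closure"),     # silence before a burst was a closure
        ("release",  "noise"):   ("start",    "aspiration"),  # burst plus noise: aspirated release
        ("release",  "voicing"): ("start",    "plain-release"),
        ("start",    "noise"):   ("start",    "frication"),
    }

    def transduce(labels):
        """Read qualitative labels on one tape and write phonetic
        categories on the other; a missing transition rejects the reading."""
        state, out = "start", []
        for lab in labels:
            if (state, lab) not in TRANSITIONS:
                return None          # no analysis for this label sequence
            state, category = TRANSITIONS[(state, lab)]
            if category is not None:
                out.append(category)
        return out

    # transduce(["silence", "burst", "noise"]) == ["closure", "aspiration"]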
The phonological grammar becomes a set of recursive transition networks in which labels on directed arcs correspond to sound constituents of various sizes (a minimal sketch of such networks appears after the list below). These transitions can also be conditioned by non-local dependencies of the sort that can be expressed in the lexical-functional grammatical formalism (Kaplan & Bresnan, 1982). These processing algorithms have been coupled to a sophisticated display-based user interface that allows for rapid writing and easy debugging of phonological specifications.

By virtue of this modular organization of constraints, our model naturally deals with the sources of variability that have proved difficult for other recognition approaches to handle:

• Presence of "noise": One source of variation in the speech signal is the presence of both environmental and speaker-produced noise. Setting aside ambient or channel noise that can be filtered out with standard techniques, we cope with language-like noise and extraneous sounds by means of a mapping to a phonetic lexicon and by the particular design of the front end. Some labels are distinctive in certain languages but are ignored as extra-linguistic in others, being treated as noise outside the language-specific phonetic lexicon. Lip smacks and clicks are good examples of acoustic events that have no match in the phonetic lexicon of English but must be processed for certain African languages (e.g. Hottentot, Zulu, Xhosa). The language-independent identification of potential linguistic categories in this modular treatment also provides a ready-made foundation for assigning categories for new languages.

• Timing variation: Rhythmic, metrical factors, along with rate, create a serious problem of timing variation. It is obviously difficult to recover temporal information in dynamic time warping systems, and it is hard to capture duration generalizations in systems that use finite-state hidden Markov models. But it is also well known that duration is helpful in segment identification, both for distinguishing similar segments (e.g. stops) and for determining how many allophones an utterance contains. We can exploit information of this type by defining the transduction from acoustic labels to linguistic categories as a two-tape Kaplan-Kay or Koskenniemi machine that is explicitly sensitive to contextual factors affecting timing, including speech rate and metrical constraints.

• Location of boundaries: Words spoken in isolation bear only limited resemblance to words in fluent sentences and may be hard to segment; similarly, the location of allophone "boundaries" is rarely the same in two pronunciations of the same word. Parsing with phonetic and phonological grammars permits segments, metrical constituents, and words to emerge in the well-formed parse, and bypasses the search for segment boundaries.

• Language productivity: Another reason for including phonological grammars is apparent in cross-linguistic studies. We must expect novel words; the inventory of lexical items is not the "bin of parts" we find in robotic vision tasks. For languages with rich morphologies (e.g. Finnish), where a single verb may exhibit several thousand variations created by processes of inflection, it is impractical or impossible to build a system based on the usual design of simple lexical look-up on a finite word list. In contrast, we have at hand technology enabling us to include a morphological parser as a part of the model (Koskenniemi, 1983).

• Speaker variation: The parts of the model described so far are designed to handle some aspects of the problem of speaker variation, including some segmental differences and changes in speaker rate; other aspects involve setting parameters at the front end (e.g. whether the speaker is an adult male, an adult female, or a child). The reason we wish to explore speaker differences is to test the adequacy of our radically low level of symbolic representation, of the more traditional level that is essentially syllabic, and of the higher level consisting of strings of syllables. We observe that although whole syllables can drop out in ordinary, casual speech in some dialects and speech styles, entire feet do not, giving us a way to recover the intended word.
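As promised before the list above, the recursive transition networks can be sketched as follows. The FOOT and SYLLABLE networks and their arc labels are toy stand-ins for the real metrical grammar: in this Python sketch, an arc labeled with the name of another network invokes that network recursively, and empty arcs encode optionality.

    # each network maps a state to its outgoing arcs (label, next state);
    # a label naming another network is traversed recursively, None is an
    # empty (jump) arc, and anything else must match the next input token
    NETWORKS = {
        "FOOT":     {0: [("SYLLABLE", 1)],              # obligatory stressed syllable
                     1: [("SYLLABLE", 1), (None, 2)]},  # optional unstressed syllables
        "SYLLABLE": {0: [("onset", 1), ("vowel", 2)],
                     1: [("vowel", 2)],
                     2: [("coda", 3), (None, 3)]},
    }
    FINAL = {"FOOT": {2}, "SYLLABLE": {3}}

    def traverse(net, state, tokens, pos):
        """Yield every input position at which `net` can finish accepting
        a constituent that begins at `pos`."""
        if state in FINAL[net]:
            yield pos
        for arc_label, nxt in NETWORKS[net].get(state, []):
            if arc_label is None:                      # empty arc
                yield from traverse(net, nxt, tokens, pos)
            elif arc_label in NETWORKS:                # recursive subnetwork
                for end in traverse(arc_label, 0, tokens, pos):
                    yield from traverse(net, nxt, tokens, end)
            elif pos < len(tokens) and tokens[pos] == arc_label:
                yield from traverse(net, nxt, tokens, pos + 1)

    # list(traverse("FOOT", 0, ["onset", "vowel", "vowel", "coda"], 0))
    # enumerates the positions at which a well-formed foot could end.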
These are the sources of variability in the signal that various approaches try to overcome by means of normalization, searches for invariant acoustic cues, or successively more complex attempts at template matching. Current systems treat variability in the shape of the speech signal as distortion, regardless of its source. What we believe to be the way around the problem of variability, and what is unique in our model, is its reformulation of the recognition task, its incorporation of constraints from phonological constituents ranging from the foot to the sub-segment, its particular use of symbolic representations of physical signals, and the way in which we have compartmentalized the knowledge into language-independent and language-specific modules.

3. Proposal

At this moment, we believe we are uniquely positioned to engage in this exploration. We are investigating optimal parsing design with researchers at the Center for the Study of Language and Information at Stanford, as well as general phonological and speech recognition issues with members of the local research community. In addition, we are able to benefit from Kaplan's and Kay's current work on lexical access and encoding. Finally, our computational environment includes access to simulations of the Huberman-Hogg machines and to the LFG system, in which both preliminary phonetic and high-level phonological components have been successfully implemented.

We propose a three-year effort in order to make substantial progress on these issues. While we now have designs and simulations of components that solve some of the difficulties associated with variability and productivity, we must systematically investigate how to improve them, both from the point of view of the scientific model and from an efficiency perspective. In addition, we want to implement our software in a special-purpose chip to enable far greater processing speed. The support we seek is for the investigation of scientific questions including how to develop better adaptive structures for time-varying inputs, whether a two-tape finite-state automaton will suffice to map the array outputs to the parser, how to develop the metrical constraints already in place and to add further phonological constraints, and how best to exploit the machines we have developed for lexical look-up.

REFERENCES

Bush, M., G. Kopec, and M. Hamilton (Fairchild). Network-based Isolated Digit Recognition Using Vector Quantization. Presentation at ASA, October 1984.

Church, K. Phrase-Structure Parsing: A Method for Taking Advantage of Allophonic Constraints. MIT Ph.D. dissertation, 1983 (available from RLE Publications, MIT, Cambridge, MA).

de Kleer, J. An Assumption-based Truth Maintenance System. Xerox PARC, 1985.
Hogg, T. and B. A. Huberman. Understanding Biological Computation: Reliable Learning and Recognition. Proc. Natl. Acad. Sci. USA 81, 6871-6875, 1984.

Huberman, B. A. and T. Hogg. Adaptation and Self-Repair in Parallel Computing Structures. Phys. Rev. Lett. 52, 1048-1051, 1984.

Huttenlocher, D. and V. Zue (MIT). Presentation at ICASSP, June 1984.

Kaplan, R. A multi-processing approach to natural language. Proc. of the First National Computer Conference, AFIPS Press, 1973, 435-440.

Kaplan, R. Finite-state models of phonological rule systems. Paper presented at the Mathematics of Language Conference, University of Michigan, October 1984.

Kaplan, R. and M. Kay. Phonological rules as finite-state transducers. Paper presented at the Winter Meeting of the Linguistic Society of America, New York, 1981.

Kaplan, R. and C. Kiparsky. The LFG Grammar Writer's Workbench. Xerox PARC, 1985.

Kiparsky, P. Metrical Structure Assignment is Cyclic. Linguistic Inquiry 10, 421-441.

Koskenniemi, K. Two-level Morphology: A General Computational Model for Word-form Recognition and Production. University of Helsinki Publications, No. 11, 1983.

Lowerre, B. The HARPY Speech Recognition System. CMU Ph.D. dissertation, 1976.

Makhoul, J., R. Schwartz, Y.-L. Chow, O. Kimball, S. Roucos, and M. Krasner (BBN). Continuous phonetic speech recognition. Presentation at ASA, October 1984.

Rabiner, L. and S. Levinson. Isolated and connected word recognition: theory and selected applications. IEEE Trans. on Communications, COM-29(5), May 1981.

Rabiner, L., S. Levinson, and M. Sondhi. On the application of vector quantization and hidden Markov models to speaker-independent, isolated word recognition. BSTJ 62(4), April 1983.

Shipman, D. and V. Zue. Properties of large lexicons: implications for advanced word recognition systems. 1983 IEEE International Conference on Acoustics, Speech and Signal Processing, Paris, France.

Stevens, K. N. Toward a feature-based model of speech perception. IPO Annual Progress Report 17, 36-37, 1982.

Withgott, M. Segmental Evidence for Phonological Constituents. University of Texas Ph.D. dissertation, Austin, TX, 1982 (available from University Microfilms, Ann Arbor).