An Alternative Model in the Design of Pattern Recognizers: the Signal Parser

This document is intended as a short comparison between a general recognition model we are developing and recent speech recognition systems. Our design is influenced by our belief that recognizing ordinary spoken language -- i.e. multi-speaker, continuous speech in arbitrary settings -- requires accessing knowledge in several domains, and thus the specificational theory must reflect the compartmentalization of knowledge we observe in linguistic behavior. A sub-goal of our research is pulling apart low-level, phonetic parsing from high-level, phonological parsing, so that the same model can be used for recognizing many natural languages.

The primary stumbling block to the development of recognizers for even a single natural language is the variability of human speech. The shape of the speech signal depends on (a) dialect and register (speech style); (b) utterance context (e.g. words spoken in isolation bear only limited resemblance to words in fluent sentences); (c) utterance rate; (d) the vocal tract of the speaker; and (e) the presence of noise. These are usually referred to as the problems of speaker adaptation, locating word boundaries, and filtering out noise. A difficulty that is not usually addressed is the productivity of language. This problem is particularly striking in languages with rich morphological components, wherein, for example, a given noun may have as many as fifty possible realizations. While the problem is less severe in languages like English, one may still observe active word-formation processes such as compounding, derivation, and inflection.

We will first outline six general classes of previous speech recognition systems; point out the sources of their extreme limitations; and then sketch how our model is designed to overcome each of these shortcomings.

1. Overview of previous recognition efforts

The continuum of speech recognition systems spans work that is almost entirely non-linguistic to efforts that display little signal processing capability and are primarily linguistic. Representative points on this scale are:

- Engineering and statistical approaches:
    DTW: Dynamic time warping of stored speech [1].
    HMM: Probabilistic hidden Markov models [2], [3].
- Mixed phonetic/engineering approaches:
    NBM: Network-based models [4], [5].
    FBM: Feature-based models [6].
- Linguistic approaches:
    DEM: Dictionary encoding models [7], [8].
    LPM: Linguistic parsing models [9].

DTW and HMM systems only operate on restricted inputs. These are limited either in the number of isolated words recognized or in the specified sequence of words permitted in continuous speech. On the evidence of the best current results, unrestricted speech recognition with these methods does not seem to be computationally feasible in the near future. Recognition of ordinary utterances in such systems also awaits better signal processing and enhanced models to handle morphological productivity and semi-grammatical phrases. Network models (NBM) based on cluster analysis seem in practice to be more flexible than the purely statistical and engineering approaches, but it is clear that current models would also require massive amounts of training and many enhancements to handle ordinary, unrestricted continuous speech.

DTW algorithms are the most straightforward to describe. Whole word prototypes are stored, and inputs are matched using a prechosen distortion measure to resolve questions of scaling. When the unknown input is time-aligned with a reference prototype, the ensuing discrimination may be thought of as recognition by least distortion; a minimal sketch of this idea follows.
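The sketch below, in Python, is meant only to make "recognition by least distortion" concrete, not to depict any particular system. It assumes one-dimensional feature sequences, squared difference as the prechosen distortion measure, and invented word prototypes.

```python
# A minimal sketch of DTW recognition by least distortion, assuming
# one-dimensional feature sequences and squared difference as the
# distortion measure.  Real systems of the period used LPC-derived
# spectral frames and more elaborate local path constraints; the
# prototypes here are purely illustrative.

def dtw_distance(unknown, prototype):
    """Return the minimum cumulative distortion aligning two sequences."""
    n, m = len(unknown), len(prototype)
    INF = float("inf")
    # cost[i][j]: best distortion aligning unknown[:i] with prototype[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (unknown[i - 1] - prototype[j - 1]) ** 2
            # allow match, insertion, or deletion steps (time warping)
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    return cost[n][m]

def recognize(unknown, prototypes):
    """Pick the stored word prototype with the least distortion."""
    return min(prototypes, key=lambda w: dtw_distance(unknown, prototypes[w]))

# Hypothetical whole-word prototypes (e.g. smoothed energy contours):
prototypes = {"yes": [0.1, 0.9, 0.8, 0.2],
              "no":  [0.2, 0.4, 0.9, 0.9, 0.3]}
print(recognize([0.1, 0.8, 0.9, 0.7, 0.2], prototypes))   # -> "yes"
```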
Finite-state HMM's usually do not involve time-alignment. They are constructed primarily out of training data, although their states are usually specified initially by linguists and may consist of allophones, diphones, or triphones (one-, two-, or three-segment clusters). An HMM consists of a state transition probability matrix along with an initial probability vector. A system might have model parameters that depend on the left and right phonetic context for each allophone. Each path through a graph representation of such a system has an associated probability, and decoding an observation sequence can be thought of as recognition by maximum probability. It is perhaps interesting to note that the highest computational cost in an HMM is not the process of finding the maximum likelihood path; rather, it is in constructing all possible states and finding all transition probabilities at the outset. For either DTW or HMM's, extraordinary measures would have to be undertaken to recognize unrestricted, continuous speech, e.g. dynamic time warping against minimally hundreds of thousands of word prototypes, or massive training in Markov models. This is due to all sources of variability in the speech signal, from noise and rate to dialect differences.

NBM's are based on finding the minimum-cost path through a finite-state pronunciation network whose arcs may model some kind of acoustic-phonetic segment. A set of acoustic pattern matchers, which can be based on vector-quantized LPC spectra, operate at the level of individual spectral analysis frames, or over groups of these. Various distance metrics have been proposed to rate the fit of the match. A dynamic programming search algorithm finds templates to match against in one or more codebooks. Again in these systems, performance depends on limiting variability. This is usually done either by limiting vocabulary size (often to lexicons of fewer than thirty items) or by requiring speakers to pause between each word.

FBM, DEM, and LPM's are better evaluated as theories than as actual systems. While those who work on the systems described above publish recognition results of 90% or better for recognition of under fifteen words in continuous speech, or for a large set of words pronounced in isolation, FBM, DEM, and LPM researchers generally do not report recognition statistics. An exception to this is certain work on feature-based models, for which valuable statistics are published on two- or three-way distinctions (distinguishing p/b pairs, for example).

DEM's exploit the concept of underspecification. One application is lexical encoding with broad phonetic categories. The idea is basically one of delayed binding, and stems from the observation that there is no reason to demand complete identification of a sound (which may be obscured anyway) if partial identification of a string of sounds yields a word candidate. DEM is not a model of speech recognition per se, but a method for reducing the search space in the recognition problem; a sketch of broad-class encoding appears below. Another way to cut down the search is LPM. In most recognition systems, the explicit source of constraint is at the level of the word. Many have advocated the use of prosodic generalizations to augment these constraints. The motivation for seeking additional constraints is the word-boundary problem, plus the non-finite nature of language (productivity).
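The following sketch makes the DEM idea of broad-class lexical encoding concrete. The class inventory and the four-word lexicon are hypothetical; the point is only that a partially identified input (stop-vowel-nasal, say) already narrows the lexicon to a handful of candidates, deferring fine distinctions such as p/b.

```python
# A minimal sketch of dictionary encoding with broad phonetic
# categories ("delayed binding").  The class inventory and the tiny
# lexicon are hypothetical; real DEM work used large lexicons to
# measure how far partial identification narrows the word list.

# Map each phone to a broad class: Stop, Fricative, Nasal, Liquid, Vowel.
BROAD = {"p": "S", "b": "S", "t": "S", "d": "S", "k": "S", "g": "S",
         "f": "F", "v": "F", "s": "F", "z": "F",
         "m": "N", "n": "N", "ng": "N",
         "l": "L", "r": "L",
         "a": "V", "e": "V", "i": "V", "o": "V", "u": "V"}

LEXICON = {"pan": ["p", "a", "n"], "ban": ["b", "a", "n"],
           "sun": ["s", "u", "n"], "run": ["r", "u", "n"]}

def encode(phones):
    """Collapse a phone string into its broad-class signature."""
    return "".join(BROAD[p] for p in phones)

# Precompute signatures once; recognition then needs only the broad
# classes of the input, not complete segment identification.
SIGNATURES = {word: encode(phones) for word, phones in LEXICON.items()}

def candidates(broad_string):
    """All words whose signature matches the partially identified input."""
    return [w for w, sig in SIGNATURES.items() if sig == broad_string]

print(candidates("SVN"))   # -> ['pan', 'ban']: p/b left unresolved
```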
One example of the use of such additional regularities is work on syllable-based constraints in an implementation which parses an already-labelled allophonic string into words [9]. In contrast to recent LPM's, early NBM's explored syntactic constraints [4]. Such constraints over-determine ordinary, casual speech, which is characterized by the presence of semi-grammatical sentences (with false starts, unfinished phrases, etc.).

FBM, DEM, and LPM's are incomplete as recognition models since they do not provide a mapping from the signal to the word. FBM's do not address the question of accessing higher units of structure, while DEM and LPM's abstract away from the analysis of the actual waveform, operating either on partial or complete allophones taken from sentences transcribed by human listeners.

In summary, the more abstract models await a signal processing interface, and the template/statistical models await methods that permit them to incorporate more knowledge regarding the possible utterances of a language. While there are literally hundreds of critiques of existing speech recognition systems, the most telling indictment is the fact that no current system exists which can recognize naturally produced dialogues from any spoken language. Nonetheless, we will employ some aspects of each of the types of systems we have described in the model we are developing. What we believe to be the way around the problem of variability, and what is unique in our model, lies in its overall organization, its incorporation of certain knowledge sources, and the way in which we have compartmentalized knowledge into language-independent and language-specific modules.

2. Intelligent pattern recognizers

Signal parsing takes variability in human speech to be the norm. The processed waveform is labelled in a qualitative, language-independent way. It is then parsed by an extremely low-level language-specific parser, as well as a higher-level parser. This decomposition of the problem permits a representation that is stable over utterance situations and provides constraints that handle some of the difficulties associated with partially obscured or "incomplete" information. The organization of the model reflects the various sources of variability:

- Presence of "noise"

The processed waveform is labelled with a fixed set of acoustic labels. Presently one label is given to an arbitrary time slice of the waveform. In the simplest version, configurations of energy that do not match an acoustic model are discarded as extra-linguistic sound. In more interesting versions, they are categorized as unknown noise and stored until the system learns to partition them into further categories. The labels form a symbolic representation of the speech events. It is language-independent, and consists of strings of fairly obvious labels such as "silence/second-formant-dropped/no-match" etc. In many cases, what people hear as noise can be meaningfully classified as an acoustic event. Bjorn Lindblom has often pointed out that if you reverse a recording of a vowel-consonant-vowel word, you will hear [h]-vowel-consonant-vowel. The reason you don't hear the phantom "h" in the normal direction is that you expect a "winding-down," which in this case is a cessation of voicing at the end of the vowel. One way to handle the problem of contextual noise in our model would be a context-sensitive labeller. However, we find that keeping this level as "objective" as possible facilitates moving between language models. Our labeller is therefore context-free, and a language-specific parser is responsible for the examination of strings of labels; a schematic sketch of such a labeller follows.
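The sketch below renders the context-free labeller schematically. The measurement fields (energy, periodicity, second formant) and the thresholds are hypothetical stand-ins; the point is that each time slice receives one qualitative, language-independent label with no reference to its neighbours, and that non-matching configurations are retained rather than discarded.

```python
# A minimal sketch of a context-free, language-independent acoustic
# labeller, assuming the waveform has already been processed into
# per-slice measurements.  The label set, field names, and thresholds
# are hypothetical stand-ins for the qualitative labels in the text
# (e.g. "silence", "second-formant-dropped", "no-match").

def label_slice(s):
    """Assign one qualitative label to one time slice, with no
    reference to neighbouring slices (context-free by design)."""
    if s["energy"] < 0.05:
        return "silence"
    if s["periodic"] and s["f2"] is None:
        return "second-formant-dropped"
    if s["periodic"]:
        return "voiced"
    if s["energy"] > 0.3:
        return "aperiodic-energy"
    return "no-match"          # stored as unknown noise, not discarded

def label_waveform(slices):
    """Map processed waveform slices to a symbolic label string."""
    return [label_slice(s) for s in slices]

# Hypothetical processed input: a vowel fading into silence.
slices = [{"energy": 0.80, "periodic": True,  "f2": 1800},
          {"energy": 0.40, "periodic": True,  "f2": None},
          {"energy": 0.01, "periodic": False, "f2": None}]
print(label_waveform(slices))
# -> ['voiced', 'second-formant-dropped', 'silence']
```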
In signal parsing, the symbolic representation is pushed down to the level of the signal itself.

- Time normalization

Language-particular information processors evaluate the discrete output of the labeller. Some outputs may be discarded as non-meaningful in a given language, or as meaningful only in a fixed context. This step is extremely powerful. At present we go directly into the parse; e.g. if we receive a sequence of identical labels z, the parser can accept z* (Kleene star). We wish to augment the machine at this level so that it can perform a rudimentary kind of counting, and thereby gain the ability to use duration cues over variable speech rates. It is hard to imagine how to capture simple duration generalizations (e.g. the fact that speakers do not usually change rate in the middle of a short utterance) in systems that use hidden Markov models. Since duration is helpful in segment identification, keeping track of the number of labels in a possible segment provides a source of constraint. We wish to explore the labeller-parser interface as a two-tape Kaplan-Kay/Koskenniemi [12] machine. The idea is to reproduce the listener's labelling behavior modulo speech rate, so that what counts as a "long" input is relative.

- Location of boundaries

The acoustic labels are parsed by a segmental parser into linguistic objects. The initial labelling over each pair of vertices vi and vi' is parsed into higher-level objects. Unlike the acoustic labelling, this parsing is language-specific. (For instance, a short silence plus aperiodic energy can be parsed by an English segmental parser as the affricate "ch" (as in "choose"), but as "t" followed by a "sh" sound in French (e.g. cette chose).) One important feature is that the waveform is not pre-segmented. Allophone boundaries are never searched for; the segment boundary problem is solved by allowing segments to emerge in the well-formed parse. Syllabic grammars also impose language-specific constraints. However, these are not as stringent as one might expect, owing to the various permissible lenition processes such as vowel deletion in unstressed syllables. On the other hand, additional phonological constraints allow us to hypothesize words, in particular language-specific sequencing constraints on syllable strings (foot grammars). Constraints of this type have not been reported outside the theoretical phonological literature, and have never been implemented, to our knowledge, outside our system. Although the sets of grammars are hierarchical in nature, mixed categories appear by design on the same level. A given level might include, for example, a fully specified allophone followed by a vague description like "nasal." Phonological constraints allow us to hypothesize words from underspecified tree structures. If we find, for instance, that some kind of nasal precedes a [g] in certain positions, foot-structure constraints tell us that the nasal must be velar, because it is the only possible foot-medial nasal that precedes a [g] (as in the word "finger"). This is the concept of underspecification brought into the grammar itself. We wish to underscore the importance of the prosodic knowledge source, and not just rely on mapping to the word to provide information regarding obscured segments. A minimal sketch of this kind of constraint-based resolution follows.
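Here is a minimal rendering of the "finger" example, assuming a hypothetical constraint table in place of a full foot grammar: the vague category "nasal" is narrowed to the velar nasal by its foot-medial position before [g].

```python
# A minimal sketch of resolving an underspecified segment with a
# foot-structure constraint, following the "finger" example: a
# foot-medial nasal immediately preceding [g] must be velar.  The
# constraint table is a hypothetical simplification of a foot grammar.

# Allophones compatible with the vague label "nasal":
NASALS = {"m", "n", "ng"}

# Hypothetical language-specific constraint: which nasals may occur
# foot-medially before a given following segment.
FOOT_MEDIAL_NASAL_BEFORE = {
    "g": {"ng"},                 # "finger": only the velar nasal
    "b": {"m"},                  # e.g. only the labial before a labial stop
}

def resolve(underspecified, following, position):
    """Narrow an underspecified category using foot-level context."""
    if underspecified == "nasal" and position == "foot-medial":
        allowed = FOOT_MEDIAL_NASAL_BEFORE.get(following, NASALS)
        return sorted(NASALS & allowed)
    return sorted(NASALS) if underspecified == "nasal" else [underspecified]

print(resolve("nasal", "g", "foot-medial"))   # -> ['ng']: must be velar
```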
There are three reasons for relying on phonological constituents as an augmentation to the information available in the dictionary component. The first is the boundary problem; the second is that phonological categories provide interesting sequencing constraints. The third is the fact that the appearance of an acoustic event depends on where a segment falls in a phonological constituent. For example, a [p] will have a burst or not depending on its position in a foot. The formulation of our phonological grammar addresses the problem of obscured sound through parsing techniques and the concept of underspecification. Facts like these highlight the advantage of incorporating many levels of constraints to handle the many sources of obscured information in ordinary, continuous speech. This also permits segments, prosodic constituents, and words to emerge in the well-formed parse, and bypasses the search for boundaries.

- Language productivity

Another reason for the inclusion of phonological grammars is apparent in cross-linguistic studies. We must expect novel words; the inventory of lexical items is not a "bin of parts." In morphologically rich languages like Finnish we would not directly match our phonological constituents against a finite word list; rather, we would include a morphological parser as a part of the model [12].

- Speaker variation

We employ a radically low level of symbolic representation in this model, a more traditional one that is essentially syllabic, and a higher level consisting of strings of syllables. The incorporation of the latter reflects our belief that listeners attend to the rhythmic structure of speech in finding word boundaries, and reflects our findings that this level provides some interesting constraints in decoding obscured information. One further practical consideration is that although whole syllables can drop out in ordinary, casual speech in some dialects, entire feet do not. The parts of the model described so far will handle some aspects of the problem of speaker variation, including some segmental differences and changes in speaker rate, but other aspects involve setting parameters at the front end (e.g. is the speaker an adult male, an adult female, or a child?). The signal parser can be spliced onto any number of good signal-processing front ends. However, our experience in analyzing DFT spectrograms and other processed waveforms leads us to conclude that a rethinking in this domain will lead to interesting results as well. At present, human beings are better at reading spectrograms than machines are [10], [11]. Front ends have not been designed to extract the information that is most valuable from the point of view of acoustic parsing.

We wish to couple the model to a front end that is better tailored to the qualitative labelling task, and which exhibits plasticity, fault-tolerance, and the possibility for learning. Huberman's and Hogg's machines, based on a regular array of locally connected elements, exhibit interesting and germane properties. For instance, these machines exhibit the property of plasticity, or the ability to respond to changed inputs. A certain amount of plasticity is evident in more traditional systems. However, a key difference is that plasticity can be built into the targets when the concept of dynamic attractors is built into the computation. There is no distinction in principle between the training data and the recognition data. The targets shift over time in response to the actual input. This extra power has exciting possibilities for domains where inputs vary systematically (as is the case when moving from speaker to speaker or dialect to dialect); a toy illustration of shifting targets follows.
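The toy below is emphatically not the Huberman-Hogg machine; it only illustrates, in a few lines, what it means for targets to shift in response to the actual input, with no principled distinction between training and recognition data. All names and numbers are invented.

```python
# A toy illustration of "targets that shift over time in response to
# the actual input": an online prototype classifier whose stored
# target drifts toward each input it recognizes.  This is only a
# one-dimensional rendering of the plasticity idea, not an array of
# locally connected elements with dynamic attractors.

def classify_and_adapt(targets, x, rate=0.2):
    """Label x with the nearest target, then pull that target slightly
    toward x, so that systematic drift (a new speaker, a new dialect)
    is tracked rather than treated as error."""
    label = min(targets, key=lambda k: abs(targets[k] - x))
    targets[label] += rate * (x - targets[label])   # plastic update
    return label

# Hypothetical one-dimensional targets (e.g. a formant position):
targets = {"A": 1.0, "B": 3.0}
stream = [1.1, 1.3, 1.5, 1.7, 1.9]   # category A drifting upward
print([classify_and_adapt(targets, x) for x in stream])   # all 'A'
print(round(targets["A"], 2))   # the target has followed the drift
```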
We wish not only to build a robust recognizer, but also to understand, first, how to extract appropriate representations of physical signals and, second, how to compute with these representations to extract linguistic meaning, using task-particular constraints and on-line learning from previous experience. The result should be a general model for intelligent pattern recognition in any domain. A subtle aspect of the model is that we use concurrent processing with dynamic attractors for the non-task-specific portion of the recognition problem, and a constraint-based system for the task-specific labelling of the output from the arrays. A side-effect of this design is the ability to decouple the components in order to move smoothly between recognition problems, e.g. from recognizing language a to recognizing language b.

At this moment, we believe we are uniquely positioned to engage in this exploration. We are investigating optimal parsing design with members of CSLI, and general speech recognition issues with members of SPAR. In addition, it should be possible to benefit from Kaplan and Kay's current work on lexical access and encoding. Finally, our computational environment includes access to simulations of the Huberman-Hogg machines, and to the LFG system, where acoustic and high-level phonological parsers have been successfully implemented. While we have started to publish our results individually on specific aspects of the general problem, we feel that a collaborative investigation at this point would result in useful and influential work on intelligent pattern recognition.

REFERENCES

[1] Rabiner, L. and S. Levinson. Isolated and connected word recognition -- theory and selected applications. IEEE Transactions on Communications, COM-29(5), May 1981.
[2] Rabiner, L., S. Levinson, and M. Sondhi. On the application of vector quantization and hidden Markov models to speaker-independent, isolated word recognition. Bell System Technical Journal, 62(4), April 1983.
[3] Makhoul, J., R. Schwartz, Y.-L. Chow, O. Kimball, S. Roucos, and M. Krasner (BBN). Continuous phonetic speech recognition. Presentation at ASA, October 1984.
[4] Lowerre, B. The HARPY Speech Recognition System. Ph.D. dissertation, CMU, 1976.
[5] Bush, M., G. Kopec, and M. Hamilton (Fairchild). Network-based isolated digit recognition using vector quantization. Presentation at ASA, October 1984.
[6] Stevens, K. N. Toward a feature-based model of speech perception. IPO Annual Progress Report 17, 36-37, 1982.
[7] Shipman, D. and V. Zue. Properties of large lexicons: implications for advanced word recognition systems. 1983 IEEE International Conference on Acoustics, Speech and Signal Processing, Paris, France.
[8] Huttenlocher, D. and V. Zue. ICASSP 84.
[9] Church, K. Phrase-Structure Parsing: A Method for Taking Advantage of Allophonic Constraints. Ph.D. dissertation, MIT, 1983 (available from RLE Publications, MIT, Cambridge, MA).
[10] Greene, B. G., D. B. Pisoni, and T. D. Carrell. Recognition of speech spectrograms. JASA 76(1), 1984.
[11] Shockey, L. and R. Reddy. Quantitative analysis of speech perception: results from transcription of connected speech from unfamiliar languages. Paper presented at the Speech Communication Seminar, Stockholm, Sweden, 1974.
[12] Koskenniemi, K. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. University of Helsinki Publications, No. 11, 1983.