An alternative model in the design of intelligent pattern recognizers

There is a class of recalcitrant recognition problems which are best characterized by the kinds of systems they are associated with. Speech recognition is one of a set of problems which presupposes not only low-level stimulus identification, but the kind of high-level, cognitive categorization found in intelligent systems. We recast the recognition problem as one of qualitative labelling and parsing (derived categorization) of the signal. We wish to explicate how we can decode an arbitrary utterance, using language-specific constraints and on-line learning from previous experience.

Until now, most approaches to the problem of speech recognition have been based on what might be thought of as a signal-processing model of stimulus-response, with accompanying distance metrics to determine the best response in ambiguous cases. The pattern-matching procedure may involve stored, time-warped patterns (usually isolated words), probabilistic matchers based on training data, or various processes of cluster analysis with least-distortion algorithms. Projects with this orientation do not consider the questions that are foundational in our alternative approach.

Our perspective on the problem entails novel formulations of each component in the recognition process. The issues outlined below will be described in an order that reflects the interactions between the components of the model. The important components are:

-- an active chart parser, which parses an initial acoustic labelling over each pair of vertices vi and vi' into higher-level objects if appropriate constraints are satisfied;

-- a machine based on a regular array of locally connected elements which perform a qualitative labelling of the continuous signal;

-- a machine that maps the output of the array to the input of the parser.

This is a unique configuration based on a theoretical approach to intelligent pattern recognition.
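The first component, a chart parser that builds higher-level edges over pairs of vertices when constraints are satisfied, can be sketched in a few lines. This is a minimal illustration only, not the system itself: the edge representation, the rule format, and the category names ("stop", "vowel", "syllable") are all invented for the example.

```python
# Minimal sketch of a bottom-up chart parser over an acoustic-label lattice.
# An edge is a triple (start_vertex, end_vertex, category); rules map a
# higher-level category to alternative right-hand-side sequences.

def chart_parse(edges, rules):
    """Repeatedly build higher-level edges until no new edge can be added."""
    chart = set(edges)
    added = True
    while added:
        added = False
        for lhs, expansions in rules.items():
            for rhs in expansions:
                for new_edge in _matches(chart, rhs, lhs):
                    if new_edge not in chart:
                        chart.add(new_edge)
                        added = True
    return chart

def _matches(chart, rhs, lhs):
    """Find every contiguous path through the chart labelled rhs[0..n]."""
    spans = [(s, e) for (s, e, c) in chart if c == rhs[0]]
    for cat in rhs[1:]:
        spans = [(s, e2) for (s, e1) in spans
                 for (s2, e2, c) in chart if s2 == e1 and c == cat]
    return [(s, e, lhs) for (s, e) in spans]

# Hypothetical usage: two initial acoustic labels between vertices 0-1 and
# 1-2 combine into one higher-level edge spanning vertices 0-2.
edges = {(0, 1, "stop"), (1, 2, "vowel")}
rules = {"syllable": [["stop", "vowel"]]}
chart = chart_parse(edges, rules)
```

Note that the initial labels are never discarded: the higher-level object simply emerges as an additional edge over the same vertices, which is what allows ambiguities to be carried upward and resolved later.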
We analyze a sound without first considering it as part of a linguistic object, by assigning categories based on the physical signal itself. Since we have made the discovery that we can formulate an explicit, language-specific grammar of acoustic events, we are able to use natural-language techniques to go from the represented signal to higher linguistic levels. We will discuss the kinds of constraints needed for the parser, the form of the grammar which drives the parser, the front-end, and the mapping from the front-end to the language-specific parser.

I. LANGUAGE-SPECIFIC KNOWLEDGE

In present speech recognition systems, the most explicit source of constraint is at the level of the word. Many have advocated the use of prosodic generalizations to augment these constraints. The motivation for seeking additional constraints is that (1) word boundaries are hard to find in unrestricted, connected speech, and (2) one must expect novel combinations of sub-pieces of words. One example of the use of additional regularities is Church's thesis, which explored syllable-based constraints in an implementation that parsed an already-labelled allophonic string into words [1]. The same motivation underlies our seeking further constraints on syllables and, more interestingly, sequencing constraints on syllable strings. Constraints of this type have not been reported outside the theoretical phonological literature, and have never been implemented, to our knowledge, outside our system.

We also formulate language-specific grammars at the level of acoustic events. The introduction of formal grammars at this level forces an interesting division of language-particular and language-independent knowledge that is mirrored in the division of labor among the components in the model. We will return to this point below. Earlier approaches explored syntactic constraints [2].
Such constraints over-determine ordinary, casual speech, which is characterized by the presence of semi-grammatical sentences (with false starts, unfinished phrases, etc.). An approach using such constraints is nonetheless useful from the point of view of special applications.

II. PARSING TECHNOLOGY AND UNDERSPECIFICATION

The concept of underspecification is explored in depth in structuralist linguistics. A practical application is Zue and Shipman's lexical encoding with broad phonetic categories [3]. The idea is basically one of delayed binding, and stems from the observation that there is no reason to demand complete identification of a sound (which may be obscured anyway) if partial identification of a string of sounds yields a word candidate. This concept enters into the form of the phonological grammar. Rules are permitted of the type:

X --> x [vague-category q] z

where x, q, or z may be specified as optional and where alternative expansions are permitted. Such rules are a hybrid of the underspecification concept and the rule formalism in the LFG system. The left-hand side of the rule can be a high-level category, such as a syllable, or as low-level as an allophone (like "[k]"). There is a wealth of language-specific generalizations at this level which apparently has not been captured previously in an explicit, formal grammar. Again, since some information may be obscured at this level, the system is designed so that ambiguities in parses can be resolved at higher levels. One interesting feature is that the waveform is not pre-segmented. Allophone boundaries are never searched for -- the segment boundary "problem" is "solved" by allowing segments to emerge in the well-formed parse.

III. FRONT-ENDS THAT ARE MORE CONGENIAL TO EXPLORATION AS EXPERT SYSTEMS, AND WHICH EXHIBIT SELF-REPAIR, LEARNING, AND PLASTICITY

Human beings can extract astonishing amounts of information from visual inspection of processed waveforms, especially from speech spectrograms [4].
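The underspecified rule format described in Section II -- optional elements and vague categories on the right-hand side -- can be sketched as a simple matcher. This is a hedged illustration under invented assumptions: the broad-class contents and the recursive matching strategy are not taken from the system described here.

```python
# Sketch of underspecified rules: each right-hand-side element is a
# (category, optional) pair, and a "vague" category names a broad class
# covering several specific labels.  Class contents are invented examples.

BROAD = {
    "STOP":  {"p", "t", "k", "b", "d", "g"},   # hypothetical broad classes
    "VOWEL": {"a", "i", "u"},
}

def covers(category, label):
    """A vague category matches any label in its class; a specific
    category must match exactly."""
    return label in BROAD.get(category, {category})

def rule_matches(rhs, labels):
    """rhs: list of (category, optional) pairs.  True if the label string
    can be parsed by the rule, skipping optional elements -- the 'delayed
    binding' idea: partial identification of the string is enough."""
    if not rhs:
        return not labels
    (cat, optional), rest = rhs[0], rhs[1:]
    if labels and covers(cat, labels[0]) and rule_matches(rest, labels[1:]):
        return True
    return optional and rule_matches(rest, labels)

# Hypothetical rule: X --> [STOP] [VOWEL] (t), with the final "t" optional.
rhs = [("STOP", False), ("VOWEL", False), ("t", True)]
```

Under this scheme the input need never be fully identified: a label only has to fall somewhere inside a vague category for the rule to apply, and disambiguation is deferred to higher levels of the parse.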
Signal processing in the area of speech does not yet match this kind of performance, although, for example, formant trackers are steadily improving. Instead of being limited to the kinds of information currently extracted by engineers, we find that by replacing traditional signal-processing front-ends with concurrent processing arrays of the type developed by Huberman and Hogg, we can tailor the basins of the attractors to match what seems to be optimal from the point of view of the parser.

Even more interesting is the fact that these machines exhibit the property of plasticity, or the ability to respond to changed inputs. A certain amount of plasticity is evident in more traditional paradigms which exhibit good design of matchers based on vector-quantized LPC spectra. Of course, since the best match always depends on how many targets there are, and how different they are from one another, inputs which differ radically from the training data may be successfully identified. However, a key difference is that plasticity can be built into the targets when the concept of dynamic attractors is built into the computation. There is no distinction in principle between the training data and the recognition data. The targets shift over time in response to the actual input. This extra power has exciting possibilities for domains where inputs vary systematically (as is the case when moving from speaker to speaker or dialect to dialect).

IV. THE MAPPING BETWEEN THE COMPUTING ARRAYS AND THE PARSERS

Language-particular information processors evaluate the discrete output of the arrays. Some outputs of the attractors may be discarded as non-meaningful in a given language, or only meaningful in a fixed context. This step is extremely powerful. We have not yet determined the kind of machine which will best map the output of the arrays to the input labels. It is possible to go directly into the current system; e.g.
if we receive a sequence of, say, five identical labels z, the parser can accept z* (Kleene star). But if we augment the machine at this level so that it can perform a rudimentary kind of counting, we gain a possibly unique ability: the use of duration cues over variable speech rate. In systems that use hidden Markov models [5], for example, it is hard to imagine how to capture the fact that speakers don't usually change rate in the middle of a short utterance. If one drags out the "b" in "butane", the "u" is not going to be rapidly articulated. Since duration is helpful in segment identification, this generalization provides another source of constraint.

In contrast to an inflexible finite-state hidden Markov model, or even a context-sensitive parser, it might be useful to think of the array-output/parser-input mapper as a two-tape Kaplan-Kay/Koskenniemi machine controlled by sets of grammars. The idea is simply to reproduce the listener's labelling behavior modulo speech rate, so that what counts as a "long" input is relative. This mapping is the least explored at the moment, but something like it, together with the fault-tolerance built into the parsers (due to a specialized use of linguistic knowledge at several levels) and the plasticity inherent in the computing arrays (due to their novel architecture), provides a way of thinking about the problem that takes change to be the norm. Viewing change as distortion in ordinary utterances leads to extraordinary measures in current systems: dynamic time warping [6] of minimally tens of thousands of word prototypes, massive training in Markov models, or abandoning constraints in multiple vector-quantization codebooks [7].

Returning to the motivation behind this work, we wish not only to build a robust recognizer but to understand how to extract appropriate representations of physical signals and how to compute with these representations to extract linguistic meaning.
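The rudimentary counting proposed for the array-to-parser mapping can be sketched as follows: runs of identical labels are collapsed to one symbol plus a count, and "long" is judged relative to the utterance's own typical run length, so the duration cue survives changes in speech rate. This is a sketch under stated assumptions -- the median-based normalization and the LONG_FACTOR threshold are invented for the illustration, not part of the two-tape machine described above.

```python
# Sketch of the array-output / parser-input mapping with counting.
from itertools import groupby
from statistics import median

LONG_FACTOR = 1.5  # a run this many times the median counts as "long"

def collapse(labels):
    """'zzzzz' -> [('z', 5)]: the parser sees z once, with a duration."""
    return [(lab, len(list(run))) for lab, run in groupby(labels)]

def mark_length(runs):
    """Attach a rate-relative 'long'/'short' tag to each collapsed run.

    Normalizing by the utterance's own median run length encodes the
    observation that speakers rarely change rate mid-utterance: a slow
    speaker's runs are all long in absolute terms, but only the runs
    that are long *relative to their neighbors* are tagged."""
    m = median(n for _, n in runs)
    return [(lab, n, "long" if n >= LONG_FACTOR * m else "short")
            for lab, n in runs]

# Hypothetical label stream: a drawn-out "b" followed by ordinary segments.
marked = mark_length(collapse("bbbbbbbbbbuuuutttt"))
```

A plain z* acceptance would discard the counts entirely; keeping them as rate-relative tags is what lets duration serve as an extra constraint on segment identity.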
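The plasticity claimed for the computing arrays in Section III -- targets that shift over time in response to the actual input, with no principled distinction between training and recognition data -- can be illustrated with a simple online update. This is an analogy only: the sketch uses nearest-neighbour adaptation rather than the Huberman-Hogg arrays themselves, and the learning rate ETA is an invented parameter.

```python
# Minimal illustration of "plastic" targets: classification and adaptation
# are a single step, so the targets drift toward the inputs they attract.

ETA = 0.1  # how fast a target shifts toward the inputs it attracts

def dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def classify_and_adapt(targets, x):
    """targets: dict name -> feature vector; x: input feature vector.
    Returns the best-matching name after nudging that target toward x --
    there is no separate training phase."""
    best = min(targets, key=lambda name: dist(targets[name], x))
    targets[best] = [t + ETA * (xi - t) for t, xi in zip(targets[best], x)]
    return best

# Hypothetical one-dimensional targets for two vowel categories: feeding a
# new speaker's slightly shifted vowels gradually moves the targets.
targets = {"ee": [0.0], "ah": [1.0]}
```

With fixed targets, a speaker whose vowels sit between the prototypes is misclassified forever; here the relevant target migrates toward the new speaker after a few inputs, which is the systematic variation (speaker to speaker, dialect to dialect) that the text singles out.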
A subtle aspect of the model is that we use concurrent processing for the non-language-specific portion of the recognition problem, and a constraint-based system for the language-specific labelling of the output from the arrays. A side-effect of this design is the capability, in theory, to decouple the components in order to move smoothly from recognizing language A to recognizing language B.

REFERENCES

[1] Church, K. Phrase-Structure Parsing: A Method for Taking Advantage of Allophonic Constraints, MIT Ph.D. dissertation, 1983 (available from RLE publications, MIT, Cambridge, MA).

[2] Lowerre, B. The HARPY Speech Recognition System, CMU Ph.D. dissertation, 1976.

[3] Shipman, D. and V. Zue. Properties of large lexicons: implications for advanced word recognition systems, 1983 IEEE International Conference on Acoustics, Speech and Signal Processing, Paris, France.

[4] Reddy et al. --

[5] Rabiner, L., S. Levinson and M. Sondhi. On the application of vector quantization and hidden Markov models to speaker-independent, isolated word recognition. BSTJ 62(4) (Apr. 1983).

[6] Rabiner, L. and S. Levinson. Isolated and connected word recognition -- theory and selected applications. IEEE Trans. on Communications, COM-29, 5, May 1981.

[7] Kopec, G. and M. Bush. Network-based Isolated Digit Recognition Using Vector Quantization. Submitted to IEEE Trans.
on Acoustics, Speech and Signal Processing, June 1984.