(part 1 of 3)

Prolegomena to a theory of speech recognition: notes

i. Introduction

It has recently been demonstrated that broad phonetic information can be used to partition a large lexicon into relatively small sub-classes [ShipmanZue] [HuttenlocherZue]. It has further been proposed that this technique be used for hypothesizing word candidates in isolated word recognition systems [HuttenlocherZue] [Huttenlocher]. In this paper we observe a serious inherent limitation of this general approach when it is extended to unrestricted speech recognition. The limitation stems from restricting recognition to a predefined set of coarse-grain classes: such a restriction does not necessarily make maximally efficient use of the fine-grain information in the signal. We present a proposal intended to overcome this limitation.

Terms such as fine- and coarse-grain specification, or "underspecification", refer to the amount of information in a representation. This notion is only remotely related to the quantitative notion of information in the well-known work on information theory (Shannon, Wiener, etc.). It is closer in spirit to the concept of segmental specification in the Prague School of linguistics, or in work on the theory of lexical phonology [Kiparsky, Withgott, Mohanan & Mohanan 1984, Archangeli 1985]. In an underspecified representation, one consonant might be indicated simply by a "C" with the associated distinctive feature [voice], while another might be unspecified for that feature. In the more recent models of phonology, the lexical representation grows richer as lexical derivation proceeds. We posit the same sort of enrichment during recognition, through a judicious exploitation of the cooccurrence constraints of each given language.

First we summarize previous work on the use of broad phonetic information for partitioning a large lexicon. Broad phonetic sequences reflect a certain amount of phonotactic constraint while being insensitive to within-class phonetic variability (e.g., merging /s/ and /S/), and the coarse-grain descriptive level is amenable to the creation of acoustic discrimination algorithms. Together, these properties make broad phonetic sequences attractive for pruning the space of possible word candidates in recognition. We point out, however, that the simplest algorithms employing broad phonetic sequences cannot cope with frequent phonological processes involving deletions, insertions, and cross-category substitutions. We present results of an investigation into the use of metrical information to specify the environments in which phonological rules are likely to apply. We suggest that our deletion-site rules are important for recognizing continuous speech in general, regardless of the grain size of the phonetic representation.

One proposed method for coping with phonological variability in isolated word recognition is to rely primarily on the sounds in the stressed syllable, since they are in general both more reliable and more distinct. For continuous speech we argue that this method is less tractable because there is more variability. On the basis of variability, we also argue against expanding the lexicon to contain alternate pronunciations, with the exception of a principled class of cases. Finally, we present our ** model, which is capable of bypassing unnecessary broad-category labelling and of coping with phonological modification.
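To make the notion of underspecification concrete, the following is a minimal sketch (in Python, used here purely for exposition) of a segment represented as a partial bundle of distinctive features that can be enriched as analysis proceeds. The Segment class, its feature names, and the class labels are illustrative assumptions for this note, not the representation of any particular system discussed here.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Segment:
    broad_class: str                                  # e.g. "C" (consonant) or "V" (vowel)
    features: Dict[str, Optional[bool]] = field(default_factory=dict)

    def specify(self, feature, value):
        """Enrich the representation, as in lexical derivation or recognition."""
        current = self.features.get(feature)
        if current is not None and current != value:
            raise ValueError(f"conflict on [{feature}]")
        self.features[feature] = value

    def matches(self, other):
        """Two segments are compatible if no specified feature conflicts."""
        return all(other.features.get(f) in (None, v)
                   for f, v in self.features.items() if v is not None)

# A consonant specified only as [+voice]; all other features remain unspecified.
c1 = Segment("C", {"voice": True})
# A consonant unspecified even for [voice]; initially compatible with c1.
c2 = Segment("C")
print(c1.matches(c2), c2.matches(c1))   # True True
c2.specify("voice", False)
print(c1.matches(c2))                    # False: the two now conflict on [voice]

The point of the sketch is simply that an unspecified feature is compatible with either value, and that specification can be added monotonically as evidence or cooccurrence constraints become available.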
In our proposed recognition procedure, there is a distinction between offline lexicon development, which can be relatively expensive as long as it is extensible, and online recognition, in which lexical access is guided by the structure of the lexicon. Our hierarchical lexicon is constructed once, and a constraint-based well-formed sub-string hypothesization procedure serves to greatly limit the number of online probes into that lexicon. The use of metrical information allows us to specify where phonological modifications are likely to occur. It also allows us to move away from the use of words as the unit of lexical access. This is important for continuous speech recognition, because words do not have strong acoustic correlates in the speech signal.

I. Broad phonetic classification

The surprising amount of lexical discrimination provided by a broad phonetic encoding was demonstrated by a set of studies examining the phonemic distribution of words in the 20,000-word Merriam-Webster Pocket Dictionary [ShipmanZue]. In these studies, the lexicon was partitioned into equivalence classes of words sharing the same broad phonetic sequence. By examining the distribution of words across classes, various broad phonetic representations could be compared. In one of their studies, Shipman and Zue mapped each phoneme of a word into one of six broad manner-of-articulation classes: vocalic, stop, nasal, liquid or glide, strong fricative, and weak fricative. For example, the word "speak", with the phoneme string /spik/, was represented by the broad phonetic sequence [STRONG-FRIC][STOP][VOCALIC][STOP]. It was found that approximately one third of the words in the 20,000-word lexicon were in equivalence classes of size one --- and hence were uniquely specified. The average number of words in the same equivalence class was approximately two, and the maximum was 223. In other words, even in the worst case this broad phonetic representation reduces the number of possible word candidates to about one percent of the 20,000-word lexicon. Shipman and Zue examined several smaller lexicons and found this result to be stable for lexicons of about 2,000 or more words; for smaller lexicons the specific choice of words can of course greatly influence the distribution.

[FOOTNOTE Speech sounds can be characterized according to both their manner and their place of articulation. Manner classes tend to have more reliable acoustic correlates than place differences, which is why Shipman and Zue chose manner classes as a broad phonetic representation. Given that manner classes appear more reliable, it is interesting to ask whether they also provide more lexical constraint. Huttenlocher [DPH] investigated this question by partitioning the lexicon using the broad place-of-articulation sequence, mapping each phoneme into one of: palatal, labial, velar, dental, glottal, and vocalic. The expected equivalence class size was approximately 90 words, and the maximum class size was 336 words. Since this expected class size is substantially larger than that for manner sequences, these results indicate that manner information provides more constraint than place information.]

On the basis of these results, Zue and Shipman proposed a multi-pass recognition model, in which an initial broad phonetic classification prunes the lexical candidate space to a small subset of the lexicon, and detailed verification then differentiates among the remaining word candidates.
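The partitioning just described can be summarized in a few lines of code. The sketch below (Python, again purely for exposition) maps phoneme strings to manner-class sequences and groups a toy lexicon into equivalence classes. The partial phoneme-to-class table and the miniature lexicon are illustrative assumptions, not the inventory or dictionary used by Shipman and Zue.

from collections import defaultdict

# Illustrative (partial) mapping from phonemes to the six manner classes.
MANNER = {
    "s": "STRONG-FRIC", "z": "STRONG-FRIC", "S": "STRONG-FRIC",
    "f": "WEAK-FRIC", "v": "WEAK-FRIC", "T": "WEAK-FRIC",
    "p": "STOP", "b": "STOP", "t": "STOP", "d": "STOP", "k": "STOP", "g": "STOP",
    "m": "NASAL", "n": "NASAL",
    "l": "LIQUID-GLIDE", "r": "LIQUID-GLIDE", "w": "LIQUID-GLIDE", "y": "LIQUID-GLIDE",
    "i": "VOCALIC", "I": "VOCALIC", "e": "VOCALIC", "E": "VOCALIC",
    "a": "VOCALIC", "o": "VOCALIC", "u": "VOCALIC",
}

def broad_sequence(phonemes):
    """Map a phoneme string to its broad (manner-class) sequence."""
    return tuple(MANNER[p] for p in phonemes)

def partition(lexicon):
    """Group words into equivalence classes sharing the same broad sequence."""
    classes = defaultdict(list)
    for word, phonemes in lexicon.items():
        classes[broad_sequence(phonemes)].append(word)
    return classes

if __name__ == "__main__":
    lexicon = {"speak": "spik", "spit": "spIt", "steak": "stek",  # one class
               "beat": "bit", "deed": "did",                      # another class
               "seem": "sim"}                                     # uniquely specified
    for seq, words in partition(lexicon).items():
        print("[" + "][".join(seq) + "]", "->", words)

Run on this toy lexicon, "speak", "spit", and "steak" fall into one class, "beat" and "deed" into another, and "seem" is uniquely specified, mirroring on a small scale the kind of discrimination reported above for the 20,000-word lexicon.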
It was later pointed out that the average equivalence class size is an overly optimistic measure of the extent to which a given representation differentiates between words; a better measure is the expected equivalence class size, the size of the class containing a randomly chosen word [Huttenlocher]. Consider partitioning 100 words into two classes of 50 words each, versus two classes of 99 and 1 words. The former is clearly a better partitioning, yet the average class size is 50 in both cases; the expected class size, by contrast, is 50 in the former case but (0.99 x 99) + (0.01 x 1) = 98.02 in the latter. In addition, a simple lexical partitioning does not take word frequency into account, so we weighted each word according to its frequency of occurrence in the Brown Corpus of Written English [ref]. For Shipman and Zue's study of the 20,000-word lexicon, the frequency-weighted expected equivalence class size is approximately 34 words.

One attractive feature of this model is that it delays the need for fine phonetic distinctions, since only broad phonetic decisions are made before lexical access. As pointed out in [DPH], this pruning can be done either by explicit lexical access, or by using the broad phonetic sequence to guide more detailed bottom-up analysis before lexical access. We will argue in section x. that salient information in the signal is a better guide for the more detailed bottom-up analysis, but that broad phonetic categories still play an important role in lexical access.

II. Suprasegmental Constraints: Syllable Stress

Broad phonetic sequences preserve only linear precedence information about phonetic features; in particular, they carry no suprasegmental information. Huttenlocher [] investigated the use of syllable stress as another source of constraint in differentiating words from one another. First he considered augmenting the broad phonetic sequence representation with syllable stress information; under this scheme, each syllable is classified as either stressed or unstressed. Using this representation, the frequency-weighted expected equivalence class size for the 20,000-word lexicon is 28 words, and the maximum class size is 209 words. Thus adding stress information helps only slightly overall. However, stress information turns out to be especially useful for differentiating between high-frequency polysyllabic words.

With respect to the problem of phonological variation, results from a related study [] demonstrated that stressed syllables provide overwhelmingly more constraint than unstressed syllables [by several orders of magnitude. NUMBERS].

[GOT TO DISCUSS THIS: It was therefore proposed that lexical access for isolated word recognition can be performed based only on the information in the stressed syllables. While throwing away the unstressed syllables is extreme, it is a reasonable first pass for isolated word recognition because the unstressed syllables provide so little constraint at the broad phonetic level.]

[The approach of ignoring the unstressed syllables is not realistic for large lexicons or for continuous speech. First, there are relatively long strings of unstressed syllables in continuous speech. Second, there is more variability in continuous speech, and the assumption that stressed syllables are relatively invariant is no longer valid.]

[Assumed that there is more variability in the unstressed portions. This actually is unclear. Can say that there is more DELETION in unstressed portions -- let's look at this in "ice cream".]

Three-level stress identification in isolated words can be done reliably [Aull].
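For concreteness, the sketch below computes the expected equivalence class size that figures in the discussion above, optionally weighted by word frequency. The function, its arguments, and the toy 50/50 versus 99/1 classes are illustrative assumptions; the frequencies are placeholders rather than Brown Corpus counts, and the classes are not those of the 20,000-word lexicon.

def expected_class_size(classes, freq=None):
    """Expected size of the equivalence class containing a randomly drawn word.

    classes: list of lists of words (the equivalence classes).
    freq:    optional dict mapping word -> frequency (e.g., corpus counts);
             if omitted, every word is weighted equally.
    """
    words = [w for c in classes for w in c]
    if freq is None:
        freq = {w: 1.0 for w in words}
    total = sum(freq[w] for w in words)
    return sum(freq[w] * len(c) for c in classes for w in c) / total

if __name__ == "__main__":
    # 100 words split 50/50 versus 99/1: the average class size is 50 in both
    # cases, but the expected class size is 50.0 versus 0.99*99 + 0.01*1 = 98.02.
    even = [[f"w{i}" for i in range(50)], [f"w{i}" for i in range(50, 100)]]
    skewed = [[f"w{i}" for i in range(99)], ["w99"]]
    print(expected_class_size(even))    # 50.0
    print(expected_class_size(skewed))  # 98.02

Supplying a frequency table down-weights classes made up of rare words, which is how the frequency-weighted figures quoted above (34 words, and 28 words with stress information) differ from the unweighted measure.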