Very Large Scale Integrated Notes

i. Introduction

Partitioning the lexicon into equivalence classes.

Dealing with deletions, which is a problem most recognizers don't handle at all -- those that do use a multiple-pronunciations dictionary [which is ass backwards. Is it even a closed set?].

Dealing with uncertainty: the broad phonetic technique -- emphasize what is reliable, instead of making everything probabilistic. Principle of least commitment (DPH:54), contra HEARSAY. Advantage: we don't have to set probabilities. Potential disadvantage: by restricting recognition to a predefined set of classes, we are not necessarily making maximal use of the information that is there. E.g., a whopping /t/ may be very reliably recognizable, but we just call it a stop.

There is a distinction between offline lexicon development, where we can afford relatively expensive computation as long as the result is extensible (i.e., not HARPY), and online recognition, where we want to probe the lexicon only where necessary. Our hierarchical lexicon is constructed once, and the WFSS identification serves to greatly constrain the number of online probes into that lexicon.

(1) What is the broad phonetic (BPHON) technique suited for? [Backpointer to SZ/DPH.] What are its inherent limitations? Clearly we are throwing away a lot of information, for one thing. Isolated word recognition (IWR) vs. the real thing. (NOTE WELL: nobody else does the real thing! Connected vs. continuous speech.)

(2) How can we build on this and impose more structure on the lexicon? This paper builds on the broad phonetic representation by imposing metrical constraints on the lexicon, thereby extending the approach to the problem of lexical access in continuous speech.

(3) How can we use this approach for recognition? We propose a control structure for using this lexical representation in a recognition system.

I. Broad phonetic classification

The power of broad phonetic constraints was demonstrated by a set of studies examining the phonemic distribution of words in the 20,000-word Merriam-Webster Pocket Dictionary [ShipmanZue]. In these studies, the lexicon was partitioned into equivalence classes of words sharing the same broad phonetic sequence. By examining the distribution of words across classes, different broad phonetic representations can be compared.

In one of their studies, Shipman and Zue mapped the phonemes of each word into one of six broad manner-of-articulation classes: vocalic, stop, nasal, liquid or glide, strong fricative, and weak fricative. For example, the word "speak", with the phoneme string /spik/, was represented by the broad phonetic sequence

    [STRONG-FRIC][STOP][VOCALIC][STOP]

They found that, at this broad phonetic level, approximately one third of the words in the 20,000-word lexicon were in equivalence classes of size one -- and hence were uniquely specified. The average number of words in the same equivalence class was approximately two, and the maximum was 223. In other words, even in the worst case this broad phonetic representation reduces the number of possible word candidates to about one percent of the 20,000-word lexicon. Shipman and Zue examined several smaller lexicons and found these results to be stable for lexicons of about 2,000 or more words; for smaller lexicons, the specific choice of words can make a large difference in the distribution.

It was later pointed out that the average equivalence class size is an overly optimistic measure of the extent to which a given representation differentiates between words.
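To make the partitioning concrete, here is a minimal sketch in Python. The phoneme-to-manner table and the tiny word list are toy stand-ins, not the Pocket Dictionary data:

```python
from collections import defaultdict

# Toy phoneme -> broad manner class table (illustration only; the
# real study used the full phoneme inventory of the 20k-word lexicon).
MANNER = {
    "s": "SF", "z": "SF", "sh": "SF",           # strong fricatives
    "f": "WF", "th": "WF", "v": "WF",           # weak fricatives
    "p": "ST", "t": "ST", "k": "ST", "b": "ST", "d": "ST", "g": "ST",
    "m": "N", "n": "N",                         # nasals
    "l": "LG", "r": "LG", "w": "LG", "y": "LG"  # liquids/glides
}

def broad_sequence(phonemes):
    """Map a phoneme string to its broad manner-class sequence.
    Anything not in the table is treated as vocalic."""
    return tuple(MANNER.get(p, "V") for p in phonemes)

def partition(lexicon):
    """Partition {word: phonemes} into broad phonetic equivalence classes."""
    classes = defaultdict(list)
    for word, phonemes in lexicon.items():
        classes[broad_sequence(phonemes)].append(word)
    return classes

lexicon = {"speak": ("s", "p", "i", "k"),
           "spike": ("s", "p", "ai", "k"),
           "steak": ("s", "t", "ei", "k")}
for seq, words in partition(lexicon).items():
    print(seq, words)   # all three fall in the class (SF, ST, V, ST)
```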
A better measure is the expected equivalence class size, which takes into account the size variation among the equivalence classes [DPH]. Consider partitioning 100 words into two classes of size 50 each, versus two classes of sizes 99 and 1. The former is clearly a better partitioning, even though the average class size is the same in both cases. In addition, lexical partitioning by itself counts each word equally, regardless of how frequently it occurs in English. This is unrealistic, because some words occur much more frequently than others. We therefore weight each word according to its frequency of occurrence in the Brown Corpus of Written English. For Shipman and Zue's study, the frequency-weighted expected equivalence class size is approximately 34 words.

On the basis of these results, Zue and Shipman proposed a multi-pass recognition model, in which an initial broad phonetic classification prunes the lexical candidate space to a small subset of the lexicon, and detailed verification then differentiates among the remaining word candidates. One attractive feature of this model is that it delays the need for fine phonetic distinctions, requiring only broad phonetic decisions before lexical access.

2. The lexical partitioning paradigm allows us to compare different broad phonetic representations. Coupled with acoustic considerations about which sound classes are more reliably recognizable, it gives us a means of determining which representations are potentially good for a recognition system.

Speech sounds can be characterized according to both their manner and their place of articulation. Manner classes tend to have more reliable acoustic correlates than place differences, which is why Shipman and Zue chose manner classes as a broad phonetic representation. Given that manner classes appear more reliable, it is interesting to ask whether they also provide more lexical constraint. Huttenlocher [DPH] investigated this question by partitioning the lexicon using broad place-of-articulation sequences, mapping each phoneme into one of: palatal, labial, velar, dental, glottal, and vocalic. The expected equivalence class size was approximately 90 words, and the maximum class size was 336 words. Since this expected class size is substantially larger than that for manner sequences, these results indicate that manner information provides more constraint than place information.

While manner-of-articulation information is more acoustically salient than place information, speaker-independent automatic recognition of manner classes is still somewhat beyond the state of the art. Perhaps a more acoustically motivated set of classes is: vocalic, noise, voiced, and stop-gap. Using these four classes, the expected equivalence class size is xx and the maximum class size is xx.

[Figure xx: the 4-way classification -- a spectrogram labelled with the 4 classes, plus a schematic diagram of the 4 classes.]

3. What kind of information is there in broad phonetic sequences?

Broad phonetic sequences preserve only linear precedence information about phonetic features. In particular, there is no suprasegmental information, which is probably quite important in recognition. For example, in the next section we summarize Huttenlocher's finding that stress is highly important in differentiating isolated words from one another. This is probably even more true in the case of continuous speech.
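The distinction between the two measures is easy to compute. A worked sketch of the 100-word example above; the frequency-weighting hook is included for completeness (the weights would come from corpus counts):

```python
def average_class_size(sizes):
    # Each class counts once, no matter how many words it holds.
    return sum(sizes) / len(sizes)

def expected_class_size(sizes, class_freq=None):
    # Expected size of the class containing a randomly drawn word;
    # class_freq[i] (the summed corpus frequency of class i's words)
    # replaces the uniform draw in the frequency-weighted version.
    if class_freq is None:
        class_freq = sizes                 # uniform draw over words
    total = sum(class_freq)
    return sum(f * s for f, s in zip(class_freq, sizes)) / total

print(average_class_size([50, 50]), expected_class_size([50, 50]))  # 50.0 50.0
print(average_class_size([99, 1]), expected_class_size([99, 1]))    # 50.0 98.02
```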
The notion of lexical equivalence class presented thus far is tailored to isolated word recognition, because lexical access requires knowing the word boundaries. It is thus not obviously extensible to continuous speech, where finding word boundaries is difficult. Starting in section III, we propose several ways to extend this model to the recognition of words in continuous speech. This finally leads us to a hierarchical lexical representation and an iterative-refinement control structure for lexical item recognition.

A broad phonetic sequence representation is potentially sensitive to the deletion, insertion, or modification of phonemes, because these processes can change the broad phonetic sequence. Since phonemic modification tends to occur within rather than across manner-of-articulation classes, the broad phonetic sequence should be relatively unaffected by modifications. In addition, phoneme insertions in speech are relatively rare. Thus the major problem for the representation is that of deleted phonemes. In the next section we discuss a proposal by Huttenlocher for handling this problem using stress information.

II. Suprasegmentals: syllable stress

Huttenlocher augmented the broad phonetic manner-of-articulation sequences with information about the syllable stress pattern of a word. Stress information is especially useful for high-frequency polysyllabic words. One way to take advantage of the fact that the information in stressed syllables appears more reliable is to incorporate confusion matrices in unstressed contexts (i.e., collapse weak/strong fricatives, medial nasals, liquids). This addresses the problem of deletion/insertion/modification of phonemes to the extent that these occur more in unstressed syllables.

Deletions: storing all alternate pronunciations in the lexicon is unrealistic; not only is it computationally impractical, not all possible pronunciations are even known. (Closed set? -- from above. Also, scientific vs. taxonomic: is listing all possible pronunciations interesting?) It is assumed that there is more variability in the unstressed portions of words. (This actually is unclear. We can say that there is more DELETION in unstressed portions -- let's look at this in the ice cream data.) Stress identification in isolated words is very reliable [Aull].

III. What about continuous speech with broad phonetic classification, with or without stress?

Words appear to be a relatively poor choice of unit, because determining where words are in continuous speech is very difficult; moreover, there is no acoustic correlate of the unit "word". Syllables, on the other hand, do have an acoustic correlate -- each vowel nucleus corresponds to a syllable. Note, however, that we are talking about the problem of identifying the regions where potential syllables must be in the signal, as opposed to identifying exactly where the boundaries between successive syllables lie. The latter problem is substantially more difficult than the former, and is potentially an ill-formed problem. Consider the case of ambisyllabic phonemes [Kahn], phonemes which are believed to belong to two syllables, such as the /t/ in /b@tX/. Here the assignment of /t/ to one syllable or the other cannot be determined on the basis of acoustic evidence. Furthermore, productive sandhi processes (footnote: systematic coarticulatory effects at boundaries), such as degemination, pose a similar problem. For example, the /s/ in /DIsYn/ cannot be said to belong to either the word "this" or the word "sign".
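The weaker problem -- locating the regions where syllables must be, anchored at vocalic nuclei, without committing to boundary assignments -- has a direct computational reading. A minimal sketch over broad-class label sequences (the labels and the maximal-run heuristic are illustrative assumptions, not a claim about the front end):

```python
def syllable_nuclei(labels):
    """Return (start, end) index ranges of maximal vocalic runs in a
    broad-class label sequence. Each run anchors exactly one potential
    syllable; flanking consonants are deliberately left unassigned."""
    nuclei, i = [], 0
    while i < len(labels):
        if labels[i] == "V":
            j = i
            while j < len(labels) and labels[j] == "V":
                j += 1
            nuclei.append((i, j))
            i = j
        else:
            i += 1
    return nuclei

# "this sign" /DIsYn/ as broad classes: WF V SF V N. Two nuclei, hence
# two syllables, but the medial strong fricative stays unassigned --
# matching the degemination point above.
print(syllable_nuclei(["WF", "V", "SF", "V", "N"]))  # [(1, 2), (3, 4)]
```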
If we are going to explore the use of syllables as a recognition unit, one of the first questions is how well broad phonetic syllables serve to partition our lexicon. In the 20k-word lexicon there are only around 250 broad phonetic syllables (?? times 2 with stress), and the expected equivalence class size is approximately xx words. For larger lexicons the picture is even worse, because the number of broad phonetic syllables in English is not appreciably more than 250. The 20k-word lexicon contains approximately 6,000 syllables, which is about the total number in English [Fujimura&Lovins]. Therefore the 20k-word lexicon already contains almost all the possible syllables of English, and larger lexicons will result only in larger classes, not in more classes. (Footnote: this is in contrast with the Shipman-Zue and Huttenlocher studies, where the number of classes grows as new (long) words are added.) Since syllables alone will result in large cohort sizes with big dictionaries, additional strategies are necessary if syllables are to be a viable unit for lexical access. (Footnote: for a limited-size lexicon, say under 5k words, this approach is still quite viable.) [Look at KWC's thesis on this.]

IV. In order to use syllable-based recognition in continuous speech, we must do more than identify potential syllable nuclei: potential syllables must be hypothesized from a sequence of speech sounds in order to do lexical access. The naive approach is to hypothesize potential syllables of all possible lengths starting at each phoneme in the sequence. Since a syllable can be anywhere from 1 to 7 sounds long, this yields on the order of 7n candidate syllables for an utterance of n phonemes, and the number of candidate syllable sequences over them grows exponentially with n. It therefore seems wise to exploit whatever syllable-structure constraints exist in the language.

Examination of the 20k-word lexicon reveals strong constraints on which broad phonetic sound sequences can occur at the beginning of a syllable: there are only 14 different broad phonetic syllable onsets, out of the xx broad phonetic consonant sequences which occur in the Pocket lexicon. Codas can be characterized similarly, although they are less constrained than onsets. (Note: deletions seem to happen overwhelmingly in codas, a nice parallel to the fact that there seems to be less constraint in codas.) To evaluate how much constraint is provided by modelling syllables in this fashion, we parsed the broad phonetic transcriptions of 1,000 sentences (N repetitions of the Harvard List sentences by 100 speakers) into potential syllables. For this corpus, there are O(x) syllables for sentences of length x phonemes. [Investigate: do deletions preserve broad syllabicity? I think so, at least in this sample of fairly careful but still natural speech.]

One as yet unresolved problem with this approach is that deleted, inserted, or cross-class modified (low incidence?) phonemes change the broad phonetic representation of a given syllable. Therefore, in order to do lexical access based on syllables, deletions and insertions will have to be taken into account in some fashion. One way to at least partially handle this problem is to do lexical access based only on the stressed syllables of words, analogously to the confusion matrices above.
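A minimal sketch of the onset-constrained hypothesization just described, compared against the naive enumeration. The onset table is a toy stand-in for the 14 broad onsets; the class labels follow section I:

```python
LEGAL_ONSETS = {(), ("ST",), ("SF",), ("SF", "ST"), ("ST", "LG")}
# ^ toy subset standing in for the 14 licensed broad onsets

def naive_syllables(seq, max_len=7):
    """Hypothesize a syllable of every length 1..7 at every position."""
    return [(i, i + k) for i in range(len(seq))
            for k in range(1, max_len + 1) if i + k <= len(seq)]

def onset_constrained(seq, max_len=7):
    """Keep only spans containing a nucleus whose pre-vocalic material
    is a licensed onset."""
    out = []
    for i, j in naive_syllables(seq, max_len):
        span = seq[i:j]
        if "V" not in span:
            continue                          # no nucleus, no syllable
        onset = tuple(span[:span.index("V")])
        if onset in LEGAL_ONSETS:
            out.append((i, j))
    return out

seq = ["SF", "ST", "V", "ST", "V", "N"]       # e.g. "speaking" /spikIN/
print(len(naive_syllables(seq)), len(onset_constrained(seq)))  # 21 16
```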
V. Back to the problem of equivalence class sizes growing with the size of the lexicon under syllable-based access: we need other sources of constraint for larger lexicons.

One way is to use longer units than syllables, so that there are more than 250 divisions -- for example, modelling which sequences of syllables actually occur in the lexicon, and using this as a source of constraint in lexical access. However, there are two problems with this approach. First, it ignores the constraints imposed by prosodics. Second, deleted phonemes will affect the syllable sequence corresponding to a given word. We still need to identify syllables in the input in order to determine what the syllable sequence was, and this approach ignores the facts that (1) not all syllables are easy to identify, because of deleted and modified segments, and (2) not all syllables provide as much constraint.

VI. There is, however, a principled manner in which syllables can be grouped into supra-syllabic units. The metrical foot (as defined by [Kiparsky]) is a stressed syllable together with the unstressed syllables which follow it. The problem with using metrical feet to partition the lexicon into equivalence classes is that some words begin with an unstressed syllable, which violates the foot model. So we defined the "lexical foot": a single optional unstressed syllable followed by a metrical foot, i.e. (UNS) STR UNS*. Using this definition of the foot, we found that the 20k-word lexicon is partitioned into equivalence classes with an expected class size of approximately 100 words. This partitioning has the property that the number of broad phonetic feet grows with the size of the lexicon, analogously to the word-based studies and unlike the syllable-based studies. [Can we justify this with some mumbo jumbo?] Therefore we can expect the equivalence class sizes not to grow substantially for larger lexicons.

At first glance there is a strong parallel between the role of the stressed syllable in a foot and the role of the vowel nucleus in a syllable -- both are acoustic "islands of reliability". However, in unstressed syllables the vowel nucleus can often be deleted (or devoiced to the extent that it is unrecognizable as a vowel), whereas the presence of a stressed syllable is very reliable; thus the acoustic correlate of a foot is hard (if not impossible) to delete. (Footnote: function words -- closed-class lexical items such as determiners, auxiliaries, and conjunctions -- are a degenerate case in the foot model because they have no lexically stressed syllable. We handle these specially below.) There are also fewer possible boundaries to hypothesize than in the syllable case.

***** The Fubar Point: in what region there must be an x, versus where the x's in a region specifically are. Granularity <=> reliable recognition cue. Idea: maximize constraint & maximize acoustic reliability. *****
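To make the grouping concrete, here is a minimal sketch that builds metrical feet from a stress-annotated syllable sequence; the lexical foot differs only in optionally claiming one leading unstressed syllable, which is the source of the assignment ambiguity discussed below. The toy syllabification of "think about" is illustrative:

```python
def metrical_feet(syllables):
    """Group (syllable, stressed) pairs into metrical feet: each
    stressed syllable opens a foot and collects the unstressed
    syllables that follow it [Kiparsky]. Unstressed syllables before
    the first stress are returned as an unattached prefix -- exactly
    the material the lexical foot's optional (UNS) slot may claim."""
    prefix, feet = [], []
    for syl, stressed in syllables:
        if stressed:
            feet.append([syl])
        elif feet:
            feet[-1].append(syl)
        else:
            prefix.append(syl)
    return prefix, feet

# "think about" resyllabified as thin-ka-bout (toy): the metrical
# parse yields ^thinka^ ^bout^, not ^think^ ^about^.
print(metrical_feet([("thin", True), ("ka", False), ("bout", True)]))
# -> ([], [['thin', 'ka'], ['bout']])
```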
VII. A more detailed description of deletion sites

In this section we sketch a foot template in which more detailed phonetic information is associated with certain positions in the template. We then investigate the effects of this kind of mixed representation on the equivalence-class construction process in the lexicon. We will show that this encoding scheme predicts an expected equivalence class size for the 20,000-word lexicon of about 130 words, which is not substantially larger than when a simple broad phonetic encoding of the foot is used. That is, the number of words matching a given sequence does not grow much, even though we are both deleting some information from the sequences and multiply encoding words in the lexicon.

This is intriguing because it suggests that by underspecifying selected information in the foot, we do not in general conflate equivalence classes. Nevertheless, we will not advocate a process of lexical access based *directly* on the foot, since in continuous speech feet span word boundaries. In the next section we show how to employ a process of iterative refinement which can make use of mixed-grain foot-based constraints during the word-matching process.

Recall that the main problems associated with a recognition scheme dependent on arbitrary suprasyllabic units were that certain syllables appeared highly prone to phonetic modification and that prosodic constraints remained unexploited. The move to a foot-based representation provides a simple mechanism for making use of stress information, and permits a specification of those syllables which exhibit a high degree of variability. Sounds in unstressed syllables are often modified to the extent that they change broad manner class membership. A frequently occurring and extreme case of phonetic variability is segment deletion. When we used words as the unit of lexical access, we handled this problem by simply ignoring the sounds in unstressed syllables. That is of course unsatisfactory, since the unstressed syllables do provide important information in recognition.

To begin a preliminary formulation of deletion/modification environments in the foot, we must distinguish phonemes in unstressed and stressed syllables. We classify the sounds in unstressed syllables only as consonantal or vocalic [?? syllabic consonants ??]. If a foot (in a word) is a maximum of five syllables and contains only one (either primary or secondary) stressed syllable, a reasonable template might be the following:

    Foot:      (UNS0)  STR     (UNS1)  (UNS2)  (UNS3)
    examples:  a       bout    -       -       -
               -       stress  -       -       -
               -       bright  ly      -       -
               -       pos     si      bly     -
               im      pos     si      bly     -
               -       prac    ti      cal     ly

At this level of description, we permit a maximum of one initial unstressed syllable in the template. The unstressed syllable is often, but not always, an affix. We actually leave such a syllable "unassociated" in our initial lexical probe (to be described below), but for the purposes of constraint checking we will eventually want a phrase like "think about" to contain the substrings ^think^ ^about^, not just ^thinka^ ^bout^.

We will also permit well-formed sub-strings which exhibit no acoustic correlate of stress. In a phrase containing a word with multiple short prefixes, we will merely hypothesize stress based on its actual occurrence elsewhere. For example, an acoustic stress-finding algorithm running on the phrase "he's uninformed" may return stress only in association with the syllable [fcrmd]. We then blindly permit a WFSS corresponding to a foot which encompasses, for example, the two preceding syllables ^unin^. That is, a mixed top-down/bottom-up approach will hypothesize foot templates over sequences of unstressed syllables.

In order to evaluate the use of a variable-specificity encoding scheme, we partitioned the 20k-word lexicon into equivalence classes by representing each word in terms of each of its possible partial phonetic feet. For example, the word "probably" is represented as:

    ST GL V C V C C V   as in /prabxbli/
    ST GL V C C C V     as in /prabbli/
    ST GL V C C V       as in /prxbli/
    ST GL V C V         as in /prali/

This representation set is generated by fleshing out the template as follows:

    UNS0: C* [V] [C]
    STR:  N-way classification; modulo deletion of final phoneme
    UNS1: {C$ ! [C*]} [V*] [C*]
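The multiple encodings of "probably" can be generated mechanically once the deletion sites are listed. A minimal sketch; the index sets are read off the example above, not derived from the template:

```python
# The four attested partial feet arise from specific deletion combos,
# not from free deletion of every unstressed segment.
FULL = ["ST", "GL", "V", "C", "V", "C", "C", "V"]    # /prabxbli/
DELETION_COMBOS = [set(), {4}, {3, 4}, {3, 4, 5}]    # indices that may drop

def partial_feet(full, combos):
    """Multiply encode a word: one broad sequence per deletion combo."""
    return [tuple(seg for i, seg in enumerate(full) if i not in combo)
            for combo in combos]

for seq in partial_feet(FULL, DELETION_COMBOS):
    print(" ".join(seq))
# ST GL V C V C C V   /prabxbli/
# ST GL V C C C V     /prabbli/
# ST GL V C C V       /prxbli/
# ST GL V C V         /prali/
```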
[Recall "raucous", where /xs/ is not deletable, and "reply", where /r/ is like a vowel because it becomes syllabic when reduced; the final vowel is not deletable in "probably".] [Note: we need to examine the ice cream data to verify our foot rule, whatever it may be -- and then say we did that.] How much confusion results from this less specific encoding?

VIII. How do we use this in lexical access?

While feet provide a reasonable means of partitioning a large lexicon into relatively small equivalence classes, we have alluded to several problems in directly accessing words in continuous speech solely on their basis. First, because of the optional unstressed syllable at the beginning of a lexical foot, there is ambiguity in the assignment of lexical feet to a broad phonetic class sequence. The assignment of metrical feet, on the other hand, is uniquely determined, because each stressed syllable begins a new metrical foot. But if metrical feet are used for lexical access, a given foot does not necessarily correspond to a word: "think about", for example, contains the two metrical feet "thinka" and "bout". Second, because of the existence of function words -- which are completely unstressed -- a given foot may actually encompass more than one lexical entry. This causes the same problem as the use of metrical feet, namely that a foot no longer corresponds to a word.

The conclusion we draw from this is that matching and lexical encoding are separate problems. What do we mean by this? Feet provide a useful framework for matching a given phoneme sequence against an underlying lexical entry, because the foot model allows phoneme deletion sites to be specified. However, feet are not a particularly good unit for accessing words in the lexicon. One way to deal with this is to split word hypothesization into two stages: lexical access and matching. One possible solution is to make an initial, errorful probe into the lexicon based on the two-way distinction REDUCED vs. NONREDUCED vowel (plus unattached consonants), followed by further matching using foot-based rules. Such an approach is particularly attractive for a relatively small lexicon (say, less than 5k words), where the initial syllable-based probe will not return extremely large equivalence classes.

This approach is really a special case of a more general framework in which lexical access is based on partial descriptions. In this framework, the recognition problem is viewed as one of iterative refinement, where the space of possible candidates is successively reduced by the addition of more detailed information. A further important aspect of this model is that the candidate space is not necessarily limited to the words in any finite lexicon. This is especially important for the recognition of languages (or speech modes) which employ productive morphology.
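A minimal sketch of the two-stage scheme, assuming a hypothetical lexicon indexed by coarse REDUCED/NONREDUCED vowel patterns, with the foot-based rules reduced here to a simple shape-matching stub:

```python
from collections import defaultdict

VOWELS = {"x", "aw", "i", "a"}     # toy vowel symbols (illustrative)

# Toy lexicon: word -> (phonemes, reduced/nonreduced vowel pattern).
WORDS = {"about": (("x", "b", "aw", "t"), "rN"),
         "allow": (("x", "l", "aw"), "rN")}

# Stage-1 index, built offline on the coarse key alone.
INDEX = defaultdict(list)
for w, (phones, key) in WORDS.items():
    INDEX[key].append(w)

def coarse_probe(vowel_pattern):
    """Stage 1: errorful probe on the REDUCED/NONREDUCED pattern."""
    return INDEX.get(vowel_pattern, [])

def cv_shape(phones):
    return tuple("V" if p in VOWELS else "C" for p in phones)

def foot_match(word, observed):
    """Stage 2 stub: keep a candidate only if its C/V shape matches the
    observed one -- a stand-in for the foot-based deletion-site rules."""
    return cv_shape(WORDS[word][0]) == tuple(observed)

cohort = coarse_probe("rN")                          # ['about', 'allow']
print([w for w in cohort if foot_match(w, "VCVC")])  # ['about']
```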
IX. The iterative refinement model

We have been discussing how partial descriptions of sound sequences can be used as sources of constraint in recognition. In order to make use of such constraints, we need a control structure for combining information from different sources and at different levels of specificity. We propose a control structure based on an iterative refinement model.

We have demonstrated that it is problematic to simply divide the lexicon into equivalence classes along a single dimension and then use a single probe to obtain the cohort corresponding to, for example, a given foot or syllable; we have also suggested that it is impractical to continuously repartition the lexicon into new equivalence classes during the recognition process. The iterative refinement model makes a specific commitment to a particular lexical representation and a given lexical access algorithm, both of which overcome the shortcomings of a single probe into a partitioned lexicon. The lexical access algorithm parallels the structure of the lexicon in an interesting way, particularly in view of the fact that the lexicon is conceived of not as a finite word list, but as the space of possible well-formed words in a given language.

In order to discuss this conception of the lexicon and the corresponding process of lexical access, let us first review the concept of information granularity. Terms such as "coarse" and "fine" specification and "underspecification" all refer to the amount of information in a representation. Yet this notion is only remotely related to the quantitative notion of information in the well-known work in information theory (Shannon, Wiener, etc.). It is closer in spirit to the concept of segmental specification in the Prague School of linguistics, or in work on the theory of lexical phonology (Kiparsky, Withgott, Mohanan, Archangeli). Unlike classical generative phonology, analyses in lexical phonology permit radically underspecified representations, whereby a consonant might be indicated simply by a "C" at one level of analysis. Certain consonants might be specified for the feature [voice], and others might be unspecified for that feature (analogous to archiphonemes in classical structuralism). In this conception of phonology, the lexical representation grows richer as lexical derivation proceeds. The same sort of process can be used in lexical access.

The iterative refinement model places certain demands on the signal processing: it must result in a multi-tiered analysis, where the organization of the tiers mirrors the competence of the signal-processing front end. The top tiers return reliable information. Here, the presence of syllable nuclei, the reduced/nonreduced vowel distinction, and the identification of segment regions into the rough categories of consonantal and vocalic elements corresponds to the first cut in the lexical space. [Back pointer to the 4-way cut?] A schematic of the tiers:

    syllabic nuclei | other                  (parameter: stress)
    consonants | vowels | other              (diphthong / non-diphthong)
    stops | glides/liquids/nasals | fricatives | other
    stops | aspirated stops/weak fricatives | glides/liquids | nasals | strong fricatives | other
    rounded/back vowels | front vowels | other
    high rounded/back vowels | low rounded/back vowels | high front vowels | low front vowels | other
    . . .
    frication with labial tail | nasal bar in vowel | etc.

Successive tiers provide more detailed phonetic information -- information which may prove important for lexical access, but which is also susceptible to transformation under noise or mislabelling on the part of the front end. The use of an iterative refinement control structure preserves the principle of least commitment regardless of the processing mode. This permits the recognizer to be tuned in a natural way.
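A minimal sketch of the funnel control loop, assuming a hypothetical ordered list of labelling tiers, each contributing a filter over the cohort:

```python
def refine(cohort, signal, tiers):
    """Iterative refinement: descend through increasingly fine (and
    increasingly fragile) labelling tiers only while the cohort is
    still ambiguous. Each tier is a (name, keep) pair, where
    keep(word, signal) consults that tier's labelling of the input."""
    for name, keep in tiers:
        if len(cohort) <= 1:
            break                       # least commitment: stop early
        cohort = [w for w in cohort if keep(w, signal)]
    return cohort

# Illustrative tiers: a syllable-count cut, then a C/V-shape cut.
tiers = [
    ("syllable count", lambda w, s: w.count("-") + 1 == s["nuclei"]),
    ("CV shape",       lambda w, s: len(w.replace("-", "")) == len(s["cv"])),
]
signal = {"nuclei": 2, "cv": "VCVC"}
print(refine(["a-bout", "a-bode-s", "dog"], signal, tiers))  # ['a-bout']
```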
If the task is to distinguish a few acoustically distinct lexical items, the full "funnel" will not be traversed; most likely the first cut, yielding a syllable count and vowel/consonant distinctions, will adequately discriminate the inputs. If the cohort does not yet contain a unique candidate, however, the recognizer drops down to successive tiers until a match can be made. This funnel design is associated with parameters which are set by the processing *mode*. The operation of lexical access is sensitive to the following parameters:

- speech inputs restricted to a list of lexical items with no function words (e.g., digits)
- speech inputs restricted to a small set of known lexical items
- speech inputs restricted to a large set of known lexical items
- speech inputs with pauses between lexical items
- speech inputs which are continuous
- speech inputs consisting of an open set of lexical items (expects new words)
- speech inputs restricted to a few, expected, utterances
- speech inputs which are unrestricted

That is, the operation of the word matcher is expectation-driven in a global sense. The default mode reflects the expectation that utterances will not be subject to any special restrictions. The parameter set can of course be augmented: the processing mode might vary if one encounters a foreign accent, is in a noisy environment, processes synthetic speech, etc. While it is obvious that listeners are capable of making use of various kinds of knowledge depending on the discourse situation, it is not common for recognition models to permit the variable use of heterogeneous constraints in a natural fashion.

Preliminary partition hierarchy and funnel access (to be revised):

    BPHON SEQ --MATCH--> COHORT --AddFine-grainInfo--MATCH--> COHORT ---->
        ^--------------------------------------------------------------'

It is important to note that the inputs are not reprocessed by the front end at every tier (i.e., the model need not be multiple-pass). The fine-grained signal-processing information is always available but is not always utilized -- only when the processing mode is unrestricted, or when there is sufficient ambiguity in a more restricted mode.

Thus one dimension in the organization of the lexicon is the amount of phonetic specification. Another dimension is constituent organization. Since the model must be flexible enough to capture morphological productivity, we cannot rely on the word unit alone to provide linear precedence constraints. Other sources of cooccurrence constraints include the units of syllables, feet, and the set of morphemes already in the language -- prefixes and suffixes, as well as stems, which may be equivalent to the word unit. All of these can be drawn upon in the process of well-formed sub-string (WFSS) hypothesization. The recognition algorithm separates matching from WFSS hypothesization -- matching takes place as a subpart of AddInfo (in those cases where AddInfo actually does a lexical probe or some other operation requiring a match [which gets back to the issue of direct versus indirect -- matched -- information]). Once the speech inputs are labelled at a given level of specificity, one of a set of simple deterministic finite automata groups them into sub-strings, using constituency information appropriate for that level of specificity.
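A minimal sketch of such a grouping step, written here as a brute-force scanner rather than a compiled DFA; the onset/coda tables are toy stand-ins, and the C*V+C* shape stands in for the real constituency constraints of section IV:

```python
ONSETS = {(), ("C",), ("C", "C")}   # toy tables; the 20k-word lexicon
CODAS  = {(), ("C",), ("C", "C")}   # licenses only 14 broad onsets

def wfss_lattice(labels):
    """Emit every span of a coarse C/V label sequence that parses as a
    well-formed broad syllable: licensed onset + V+ nucleus + licensed
    coda. The result is a lattice of (start, end) arcs, not a single
    segmentation -- boundaries are left to emerge later."""
    arcs, n = [], len(labels)
    for i in range(n):
        for j in range(i + 1, n + 1):
            span = labels[i:j]
            a = 0                                 # split span as C* V+ C*
            while a < len(span) and span[a] == "C":
                a += 1
            b = len(span)
            while b > a and span[b - 1] == "C":
                b -= 1
            nucleus = span[a:b]
            if (nucleus and all(x == "V" for x in nucleus)
                    and tuple(span[:a]) in ONSETS
                    and tuple(span[b:]) in CODAS):
                arcs.append((i, j))
    return arcs

# Overlapping arcs over C V C V: [(0,2), (0,3), (1,2), (1,3), (2,4), (3,4)]
print(wfss_lattice(["C", "V", "C", "V"]))
```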
Partition hierarchy:

    BPHON SEQ --DFA--> WFSS LAT --GC--> WFSS LAT --AddInfo--> SS LAT ---->
        ^--------------------------------------------------------------'

Multiple well-formed sub-strings provide a way to bypass the search for word boundaries in continuous speech. Boundaries emerge when one or more paths of WFSSs can be found, and when such paths match up with lexical entries. If the language to be recognized makes productive use of morphological rules, the paths of WFSSs must be consistent with the morphological grammar rules. Such an operation need not be computationally expensive, as demonstrated by [Kaplan&Kay, Koskenniemi].

Let us in fact consider the worst-case scenario, where we cannot assume that we have stored all the lexical items the speaker will produce, and where the utterance may consist of lists as well as syntactically well-formed sentences. In such a situation, we start by using broad phonetic information coupled with syllable constituency constraints. The DFA will produce a lattice of well-formed sub-strings which will be very deep indeed. However, we will be able to garbage-collect hypothesized strings which do not lie on any path spanning the utterance. Depending on the language, we must then make use of morpheme or higher-level prosodic constraints to form a new WFSS lattice, in conjunction with an augmentation of the amount of detail in the phonetic labelling. We consider this step for English in the next section.

An interesting question that arises in this fully unrestricted scenario is: when should the computation halt? If lexical items are found in an utterance of a language with productive compounding, such as German, Swedish, or Malayalam, should those items be put together to form a word? Should one be content to recognize three words in (Ger.) *cooking-hot-wasser* (lit. 'cooking-hot-water', "boiling water"), or two in *blackbird*? If the goal is language understanding, the answer is clearly no (a *blackbird* is clearly different from a *black bird*, yet only discourse will distinguish them if the latter is heard with contrastive stress on the adjective). A more modest goal is word recognition plus phonological identification of unknown items. While much speech recognition research in the past has assumed that syntactic, semantic, and pragmatic knowledge sources are needed to recognize utterances, we would rather concentrate on a model capable simply of recognizing a finite set of stems, including their allomorphs, and the set of affixes they can be associated with, plus of course a phonological analysis of any unmatched speech inputs. Such a range of data is broad enough to be scientifically challenging, yet does not depend upon understanding discourse.

??[Scale-space WFSS as a mechanism: parsing, matching, connectionism; ambiguity, combinatorics, error]

X. Function words

Promise: we [read Meg] will show how we can deal with uncertainty, deletions, modifications, and insertions -- both phonologically motivated ones and some cases of spurious insertions. We found in read but relaxed speech that... (ice cream data)

- The vowel in the primary-stressed syllable of a word will not delete if it is a content word. (Other stuff here.)
- Word-final consonants delete regularly, even in order to preserve the initial consonant of a following function word: "Coming at you fast" [Withgott 85].
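The "no point" patterns listed below can be compiled into a cheap guard applied before any lexical probe. A minimal sketch; the regular-expression encoding of two of the patterns is our illustrative assumption:

```python
import re

# Toy encodings of two of the hopeless patterns below, over a flat
# string of broad labels ("f" = fricative, "V:" = long vowel).
HOPELESS = [re.compile(p) for p in (r"^t y u f V:", r"^C x f V:")]

def worth_probing(broad_string):
    """Skip cohort gathering when the hypothesized sequence matches a
    pattern that no word (or word + function word) can realize."""
    return not any(p.match(broad_string) for p in HOPELESS)

print(worth_probing("t y u f V:"))   # False -- don't gather a cohort
print(worth_probing("s t V: n"))     # True
```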
"Coming at you fast" (withgott 85) No point in gathering the cohort for certain patterns: no *tyu - f V: no *Cx - f V: no *tx - f V: no *t/C x - Fric V: no *t/CFric V: (one syllable) no *?ty x- Fric V: no *y x- Fric V: no *?ty x- Fric V: (one syllable) no *y x- Fric V: (one syllable) No point in gathering the cohort for (?)(t)y (x) FRIC Stress V cuts back on combinatorics, even with function words XI. Stress in the signal .. how reliable? what about stress shift? This is not so bad because easily detectable stress (by current algorithms) is only primary, unreduced and reduced anyway, so let us only rely on it at the onset artificial intelligence --> ARti'ficial inTELligence look up all artx all fISl -- art won't be in lexicon as reduced xrtx fISl wont reduce all the way down, so tractable XII. Conclusion 1(DEFAULTFONT 1 (GACHA 10) (GACHA 8) (TERMINAL 8)) TIMESROMAN AlIû‹gz¹