V. A more detailed description of deletion sites

In this section we sketch a foot template in which more detailed phonetic information is associated with certain positions in the template. We then investigate the effects of this kind of mixed representation on the equivalence class construction process in the lexicon. We will show that this encoding scheme predicts an expected equivalence class size for the 20,000 word lexicon of about 130 words, which is not substantially larger than when a simple broad phonetic encoding scheme for the foot is used. That is, the number of words matching a given sequence does not grow much even though we are both deleting some information from the sequences and multiply encoding words in the lexicon. This is intriguing because it suggests that by underspecifying selected information in the foot, we do not in general conflate equivalence classes.

Nevertheless, we will not advocate a process of speech recognition based *directly* on the foot. In this work the foot is viewed as an abstract, emergent unit which does work in helping to decode phonological modification, in serving as a unit upon which to state phonotactic constraints, and in providing an efficient way to partition the English lexicon for the purposes of lexical access. In the next section we will show how to employ a process of iterative refinement which can make use of mixed-grain foot-based constraints during word matching.

Recall that the main problems associated with a recognition scheme dependent on arbitrary suprasyllabic units were that certain syllables appeared highly prone to phonetic modification and that prosodic constraints remained unexploited. The move to a foot-based representation provides a simple mechanism for making use of stress information, and permits a specification of those syllables which exhibit a high degree of variability.
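The expected-class-size figure above can be computed mechanically. A minimal sketch follows, assuming a toy lexicon in a crude phonemic spelling and an illustrative broad-class inventory (neither is the paper's actual data): words are grouped by their broad-class sequence, and the expected size of the class containing a randomly chosen word is the sum of squared class sizes divided by the lexicon size.

```python
from collections import defaultdict

# Broad phonetic classes (an assumed inventory for illustration, not the
# paper's exact cut): ST = stop, GL = glide/liquid, N = nasal,
# FR = fricative, V = vowel; x = schwa.
BROAD = {
    'p': 'ST', 'b': 'ST', 't': 'ST', 'd': 'ST', 'k': 'ST', 'g': 'ST',
    'r': 'GL', 'l': 'GL', 'w': 'GL', 'y': 'GL',
    'm': 'N', 'n': 'N',
    's': 'FR', 'f': 'FR', 'v': 'FR', 'z': 'FR',
    'a': 'V', 'e': 'V', 'i': 'V', 'o': 'V', 'u': 'V', 'x': 'V',
}

def encode(phonemes):
    """Map a phonemic string to its broad-class sequence."""
    return tuple(BROAD[p] for p in phonemes)

def expected_class_size(lexicon):
    """Expected size of the equivalence class containing a randomly chosen
    word: sum over classes of size^2, divided by the lexicon size."""
    classes = defaultdict(list)
    for word, phonemes in lexicon.items():
        classes[encode(phonemes)].append(word)
    n = sum(len(v) for v in classes.values())
    return sum(len(v) ** 2 for v in classes.values()) / n

# A six-word toy lexicon (illustrative phonemic spellings only).
lexicon = {
    'probably': 'prabxbli',
    'parable':  'parxbxl',
    'terrible': 'terxbxl',
    'pin':      'pin',
    'bin':      'bin',
    'ten':      'ten',
}
print(expected_class_size(lexicon))
```

Run over the real 20k lexicon with the paper's encoding, the same computation yields the figure of about 130 words.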
Sounds in unstressed syllables are often modified to the extent that they change broad manner class membership. A frequently occurring and extreme case of phonetic variability is segment deletion. When we used words as the unit of lexical access, we handled this problem by simply ignoring the sounds in unstressed syllables. This is of course unsatisfactory, since the unstressed syllables do provide important information in recognition. To begin a preliminary formulation of deletion/modification environments in the foot, we must distinguish phonemes in unstressed and stressed syllables. We classify the sounds in unstressed syllables only as consonantal or vocalic [?? syllabic consonants ??]. If a foot (in a word) is a maximum of five syllables and contains only one (either primary or secondary) stressed syllable, a reasonable template might be the following:

Foot:      (UNS0)  STR     (UNS1)  (UNS2)  (UNS3)

examples:  a       bout    -       -       -
           -       stress  -       -       -
           -       bright  ly      -       -
           -       pos     si      bly     -
           im      pos     si      bly     -
           -       prac    ti      cal     ly

At this level of description, we permit a maximum of one initial unstressed syllable in the template. The initial unstressed syllable is often, but not always, an affix. We actually leave such a syllable "unassociated" in our initial lexical probe (to be described below), but for the purposes of constraint checking we will eventually want a phrase like "think about" to contain the substrings ^think^ ^about^, not just ^thinka^ ^bout^. We will also permit well-formed sub-strings (wfss's) which exhibit no acoustic correlate of stress. In a phrase containing a word with multiple short prefixes we will merely hypothesize stress based on its actual occurrence elsewhere. For example, an acoustic stress-finding algorithm running on the phrase "he's uninformed" may only return stress in association with the syllable [fcrmd]. We then blindly permit a wfss corresponding to a foot which will encompass, for example, the two preceding syllables ^unin^.
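The template above amounts to a simple well-formedness check on a syllable sequence: at most one initial unstressed syllable, exactly one stressed syllable, and at most three trailing unstressed syllables. A sketch of that check (the syllable/stress segmentation is taken as given input here):

```python
def match_foot(syllables, stresses):
    """Check a syllable sequence against the foot template
    (UNS0) STR (UNS1) (UNS2) (UNS3): at most five syllables, exactly one
    stressed, at most one unstressed syllable before the stress and at
    most three after it.  `stresses` is a parallel list of booleans."""
    if len(syllables) != len(stresses) or not 1 <= len(syllables) <= 5:
        return False
    if sum(stresses) != 1:          # exactly one (primary or secondary) stress
        return False
    i = stresses.index(True)
    return i <= 1 and (len(stresses) - 1 - i) <= 3

print(match_foot(['a', 'bout'], [False, True]))
print(match_foot(['im', 'pos', 'si', 'bly'], [False, True, False, False]))
print(match_foot(['un', 'in', 'formed'], [False, False, True]))
```

The last call fails because two unstressed syllables precede the stress; this is exactly the "he's uninformed" situation, where the template alone cannot associate ^unin^ and a wfss must be hypothesized over the unstressed residue.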
That is, a mixed top-down/bottom-up approach will hypothesize foot templates over sequences of unstressed syllables.

In order to evaluate the use of a variable-specificity encoding scheme, we partitioned the 20k word lexicon into equivalence classes by representing each word in terms of each of its possible partial phonetic feet. For example, the word "probably" is represented as:

ST GL V C V C C V   as in /prabxbli/
ST GL V C C C V     as in /prabbli/
ST GL V C C V       as in /prxbli/
ST GL V C V         as in /prali/

This representation set is generated by fleshing out the template as follows:

UNS0: C* [V] [C]
STR:  N-way classification; modulo deletion of final phoneme
UNS1: {C$ ! [C*]} [V*] [C*]

[Recall "raucous", where xs is not deletable, and "reply", where r behaves like a vowel because it is syllabified when reduced; the final vowel is not deletable in "probably". Note: we need to examine "ice cream" to verify our foot rule, whatever it may be -- and then say we did that.] How much confusion results from this less specific encoding?

VIII. How do we use this in lexical access?

While feet provide a reasonable means of partitioning a large lexicon into relatively small equivalence classes, we have alluded to several problems in directly accessing words in continuous speech solely on their basis. First, due to the optional unstressed syllable at the beginning of a lexical foot, there is ambiguity in the assignment of lexical feet to a broad phonetic class sequence. On the other hand, the assignment of metrical feet is uniquely determined, because each stressed syllable begins a new metrical foot. However, if metrical feet are used for lexical access, a given foot does not necessarily correspond to a word. For example, "think about" contains the two metrical feet "thinka bout". Second, because of the existence of function words -- which are completely unstressed -- a given foot may actually encompass more than one lexical entry.
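Multiply encoding a word by its partial phonetic feet can be sketched as enumerating deletions at marked sites. The fragment below is a sketch under assumptions: the set of deletable positions for /prabxbli/ is a hypothetical marking chosen for illustration, not the paper's actual deletion rule (which conditions on foot position).

```python
from itertools import product

def variants(phonemes, deletable):
    """Enumerate the surface forms obtained by optionally deleting each
    phoneme whose index appears in `deletable`."""
    out = set()
    for keep in product([True, False], repeat=len(deletable)):
        drop = {i for i, k in zip(deletable, keep) if not k}
        out.add(''.join(p for i, p in enumerate(phonemes) if i not in drop))
    return out

# /prabxbli/ with the medial /b x b/ region marked deletable
# (indices 3, 4, 5 -- an illustrative marking only).
forms = variants('prabxbli', deletable=[3, 4, 5])
print(sorted(forms))
```

Each resulting surface form is then broad-class encoded and entered into the lexicon's equivalence-class partition, which is how one word comes to occupy several classes at once.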
This causes the same problem as the use of metrical feet, namely that a foot no longer corresponds to a word. The conclusion we draw from this is that matching and lexical encoding are separate problems. What do we mean by this? Feet provide a useful framework for matching a given phoneme sequence to an underlying lexical entry, because the foot model allows phoneme deletion sites to be specified. However, feet are not a particularly good unit for accessing words from the lexicon.

One way to deal with this problem is to split word hypothesization into two stages: lexical access and matching. One possible solution is to make an initial, errorful probe into the lexicon based on the two-way distinction REDUCED vs. NONREDUCED VOWEL (plus unattached consonants), followed by further matching using foot-based rules. Such an approach is particularly attractive for a relatively small lexicon (say, less than 5k words), where the initial syllable-based probe will not return extremely large equivalence classes.

This approach is really a special case of a more general framework for viewing the problem of lexical access based on partial descriptions. In this framework, the recognition problem is viewed as one of iterative refinement -- the space of possible candidates is successively reduced by the addition of more detailed information. An additional important aspect of this model is that the candidate space is not necessarily limited to the words in any finite lexicon. This is especially important for the recognition of languages (or speech modes) which employ productive morphology.

IX. The iterative refinement model

We have been discussing how partial descriptions of sound sequences can be used as sources of constraint in recognition. In order to make use of such constraints we need a control structure for combining information from different sources and at different levels of specificity.
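The two-stage scheme can be sketched as a coarse index probe followed by a finer filter. In this sketch the coarse key and the stage-2 predicate are both stand-ins: a real stage 2 would apply the foot-based rules, and the reduced/nonreduced encoding here is a deliberately crude assumption.

```python
from collections import defaultdict

VOWELS = set('aeioux')                 # x = schwa (the reduced vowel)

def coarse_key(phonemes):
    """Stage-1 encoding: R for a reduced vowel, N for a nonreduced vowel,
    C for any (unattached) consonant -- an errorful first cut."""
    return ''.join('R' if p == 'x' else 'N' if p in VOWELS else 'C'
                   for p in phonemes)

def build_index(lexicon):
    """Partition the lexicon once, offline, by the coarse key."""
    idx = defaultdict(list)
    for word, ph in lexicon.items():
        idx[coarse_key(ph)].append(word)
    return idx

def access(ph, idx, lexicon, fine_match):
    """Stage 1: probe the coarse index for a cohort.
    Stage 2: filter the cohort with a finer matching predicate."""
    cohort = idx.get(coarse_key(ph), [])
    return [w for w in cohort if fine_match(ph, lexicon[w])]

lexicon = {'pin': 'pin', 'bin': 'bin', 'ten': 'ten'}
idx = build_index(lexicon)
# A trivial stand-in for foot-based stage-2 rules: initial phonemes agree.
print(access('pin', idx, lexicon, lambda a, b: a[0] == b[0]))
```

The point of the split is visible even in the toy case: all three words share the coarse key CNC, so stage 1 returns a three-word cohort that only stage 2 resolves.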
We propose a control structure based on an iterative refinement model. We have demonstrated that it is problematic to simply divide the lexicon into equivalence classes along a single dimension and then use a single probe to obtain the cohort corresponding to, for example, a given foot or syllable; we have also suggested that it is impractical to continuously repartition the lexicon into new equivalence classes during the recognition process. The iterative refinement model makes a specific commitment to a particular lexical representation and a given lexical access algorithm, both of which overcome the shortcomings of a single probe into a partitioned lexicon. The lexical access algorithm parallels the structure of the lexicon in an interesting way, particularly in view of the fact that the lexicon is conceived of not as a finite word list but as the space of possible well-formed words in a given language.

The iterative refinement model places certain demands on the signal processing: it must result in a multi-tiered analysis. The organization of the tiers mirrors the competence of the signal processing front end; the top tiers return reliable information. Here, the presence of syllable nuclei, the reduced/nonreduced vowel distinction, and the identification of segment regions into the rough categories of consonantal and vocalic elements corresponds to the first cut in the lexical space. [back pointer to 4-way cut?]

parameter: stress

syllabic nuclei ..... other
    consonants ..... vowels ..... other
        stops ..... glides/liquids/nasals ..... fricatives ..... other
            stops ..... aspirated stops/weak fricatives ..... glides/liquids ..... nasals ..... strong fricatives ..... other
        diphthong/non-diphthong
            rounded/back vowels ..... front vowels ..... other
                high rounded/back vowels ..... low rounded/back vowels ..... high front vowels ..... low front vowels ..... other
            .
            .
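One way to make the tier organization concrete is as a list of label maps of increasing specificity, each strictly refining the last. The class names below follow the hierarchy sketched above; the phoneme membership is an illustrative assumption, not the front end's actual output.

```python
# Tiers of increasing specificity; each maps a segment to a label.
TIERS = [
    # Tier 1: syllabic nuclei vs. other
    {'a': 'nucleus', 'i': 'nucleus', 'x': 'nucleus',
     'p': 'other', 'b': 'other', 's': 'other', 'l': 'other', 'n': 'other'},
    # Tier 2: consonants vs. vowels
    {'a': 'vowel', 'i': 'vowel', 'x': 'vowel',
     'p': 'consonant', 'b': 'consonant', 's': 'consonant',
     'l': 'consonant', 'n': 'consonant'},
    # Tier 3: stop / glide-liquid-nasal / fricative; back vs. front vowels
    {'p': 'stop', 'b': 'stop', 's': 'fricative',
     'l': 'glide/liquid/nasal', 'n': 'glide/liquid/nasal',
     'a': 'back vowel', 'x': 'back vowel', 'i': 'front vowel'},
]

def label(segment, tier):
    """Label a segment at the given tier of specificity."""
    return TIERS[tier][segment]

print(label('p', 0), '->', label('p', 1), '->', label('p', 2))
```

The invariant worth preserving in any real implementation is that two segments distinguished at tier k remain distinguished at every tier below k.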
                frication with labial tail ..... nasal bar in vowel ..... etc.

Successive tiers provide more detailed phonetic information -- information which may prove important for lexical access but which is also susceptible to transformation under noise or mislabelling on the part of the front end. The use of an iterative refinement control structure preserves the principle of least commitment regardless of the processing mode. This permits the recognizer to be tuned in a natural way. If the task is to distinguish a few acoustically distinct lexical items, the full "funnel" will not be traversed; most likely the first cut, yielding a syllable count and vowel/consonant distinctions, will adequately discriminate the inputs. If the cohort does not yet contain a unique candidate, however, the recognizer will drop down through successive tiers until a match can be made.

This funnel design is associated with parameters which are set by the processing *mode*. The operation of lexical access is sensitive to the following parameters:

    speech inputs restricted to a list of lexical items with no function words (e.g. digits)
    speech inputs restricted to a small set of known lexical items
    speech inputs restricted to a large set of known lexical items
    speech inputs with pauses between lexical items
    speech inputs which are continuous
    speech inputs consisting of an open set of lexical items (expects new words)
    speech inputs restricted to a few, expected, utterances
    speech inputs which are unrestricted

That is, the operation of the word matcher is expectation-driven in a global sense. The default mode reflects the expectation that utterances will not be subject to any special restrictions. The parameter set can of course be augmented: the processing mode might vary if one encounters a foreign accent, is in a noisy environment, processes synthetic speech, etc.
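The funnel traversal itself is a short loop: encode the input and the candidates at the coarsest tier, and descend only while the cohort remains ambiguous. A sketch, with three toy tiers whose label sets are assumed for illustration:

```python
def funnel(segments, lexicon, tiers):
    """Iterative-refinement ("funnel") lexical access: filter the cohort
    tier by tier, stopping as soon as at most one candidate remains
    (principle of least commitment)."""
    cohort = list(lexicon)
    for tier in tiers:
        key = [tier[s] for s in segments]
        cohort = [w for w in cohort if [tier[s] for s in lexicon[w]] == key]
        if len(cohort) <= 1:
            break
    return cohort

# Toy tiers of increasing specificity (illustrative labels only).
TIER1 = {p: ('V' if p in 'aeiou' else 'C') for p in 'pbtdineaou'}
TIER2 = dict(TIER1, p='labial stop', b='labial stop',
             t='coronal stop', d='coronal stop', n='nasal')
TIER3 = {p: p for p in 'pbtdineaou'}   # full segment identity

lexicon = {'pin': 'pin', 'bin': 'bin', 'ten': 'ten'}
print(funnel('ten', lexicon, [TIER1, TIER2, TIER3]))
print(funnel('pin', lexicon, [TIER1, TIER2, TIER3]))
```

"ten" resolves at the second tier (its coronal stop separates it from "pin" and "bin"), while "pin" must descend to full segment identity -- the funnel is traversed only as far as the ambiguity requires.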
While it is obvious that listeners are capable of making use of various kinds of knowledge dependent upon the discourse situation, it is not common for recognition models to permit the variable use of heterogeneous constraints in a natural fashion.

Preliminary partition hierarchy and funnel access (to be revised):

BPHON SEQ --MATCH--> COHORT --AddFine-grainInfo--MATCH--> COHORT ---->
              ^-------------------------------------------------'

It is important to note that the inputs are not reprocessed by the front end at every tier (i.e. the model need not be multiple-pass). The signal processing information regarding fine details is always available but is not always utilized -- that is, not unless the processing mode is unrestricted or there is sufficient ambiguity present in a more restricted mode.

Thus one dimension in the organization of the lexicon is the amount of phonetic specification. Another dimension is constituent organization. Since the model must be flexible enough to capture morphological productivity, we cannot rely only on the word unit to provide us with linear precedence constraints. Other sources of cooccurrence constraints include the units of syllables and feet, and the set of morphemes already in the language, such as prefixes and suffixes, as well as stems, which might be equivalent to the word unit. All of these can be drawn upon in the process of well-formed sub-string hypothesization. The recognition algorithm separates matching from wfss hypothesization -- matching takes place as a subpart of AddInfo (in those cases when AddInfo actually performs a lexical probe or some other operation requiring a match [which gets back to the issue of direct versus indirect -- matched -- information]). Once the speech inputs are labelled at a given level of specificity, one of a set of simple deterministic finite automata groups them into sub-strings using constituency information that is appropriate for that level of specificity.
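The substring-grouping step can be sketched at the broad phonetic level. Here the syllable constituency constraint is reduced to the crude template C*VC* -- a regular language, so it is recognizable by a DFA; the sketch below runs a compiled pattern over every span instead of a hand-built automaton, which is equivalent for illustration.

```python
import re

SYL = re.compile(r'C*VC*')   # crude broad-class syllable template (assumed)

def syllable_wfss(broad):
    """Enumerate every span (i, j) of a broad-class label string that is a
    well-formed syllable under the template C*VC*.  A stand-in for the
    family of per-tier DFAs described in the text."""
    n = len(broad)
    return [(i, j) for i in range(n) for j in range(i + 1, n + 1)
            if SYL.fullmatch(broad[i:j])]

print(syllable_wfss('CVCV'))
```

Note how many overlapping spans even a four-label input yields; this is the "very deep lattice" that the subsequent garbage-collection step must prune.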
partition hierarchy:

BPHON SEQ --DFA--> WFSS LAT --GC--> WFSS LAT --AddInfo--> SS LAT ---->
              ^----------------------------------------------'

Multiple well-formed sub-strings provide a way to bypass the search for word boundaries in continuous speech. Boundaries emerge when one or more paths of wfss's can be found, and when such paths match up with lexical entries. If the language to be recognized makes productive use of morphological rules, the paths of wfss's must be consistent with the morphological grammar rules. Such an operation need not be computationally expensive, as demonstrated by [Kaplan & Kay; Koskenniemi].

Let us in fact consider the worst-case scenario, where we cannot assume that we have stored all the lexical items the speaker will produce, and where the utterance may consist of lists as well as syntactically well-formed sentences. In such a situation, we start by using broad phonetic information coupled with syllable constituency constraints. The DFA will produce a lattice of well-formed sub-strings which will be very deep indeed. However, we will be able to garbage collect some hypothesized strings which do not span the utterance. Depending on the language, we must then make use of morpheme or higher-level prosodic constraints to form a new wfss lattice, in conjunction with an augmentation of the amount of detail in the phonetic labelling. We will consider this step in the next section, specifically for English.

An interesting question that arises in this fully unrestricted scenario is: when should the computation halt? If lexical items are found in an utterance of a language with productive compounding, such as German, Swedish, or Malayalam, should those items be put together to form a word? Should one be content to recognize three words in (Ger.) *cooking-hot-vasser* (lit. 'cooking-hot-water', "boiling water"), or two in *blackbird*?
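The GC step above can be sketched directly: a hypothesized sub-string survives only if it lies on some path of abutting sub-strings that spans the whole utterance. Since every span (i, j) has i < j, a single forward pass and a single backward pass over the sorted spans suffice.

```python
def garbage_collect(spans, n):
    """Keep only those sub-strings (i, j) that lie on some path of abutting
    spans covering the whole utterance [0, n)."""
    # Positions reachable from 0 by chaining spans left to right.
    forward = {0}
    for i, j in sorted(spans):
        if i in forward:
            forward.add(j)
    # Positions from which n is reachable by chaining spans rightward.
    backward = {n}
    for i, j in sorted(spans, reverse=True):
        if j in backward:
            backward.add(i)
    return [(i, j) for (i, j) in spans if i in forward and j in backward]

# (0,3) and (3,4) are well-formed locally but sit on no spanning path.
print(garbage_collect([(0, 2), (2, 5), (0, 3), (3, 4)], n=5))
```

This is the sense in which boundaries "emerge": in the pruned lattice, the endpoints shared by surviving spans are exactly the viable boundary hypotheses.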
If the goal is language understanding, the answer is clearly no (a *blackbird* is clearly different from a *black bird*, yet only discourse will distinguish them if the latter is heard with contrastive stress on the adjective). A more modest goal is word recognition plus phonological identification of unknown items. While much speech recognition research in the past has assumed that syntactic, semantic and pragmatic knowledge sources are needed to recognize utterances, we would rather concentrate on a model capable simply of recognizing a finite set of stems, including their allomorphs, and the set of affixes they can be associated with, plus of course a phonological analysis of any unmatched speech inputs. Such a range of data is broad enough to be scientifically challenging, yet does not depend upon understanding discourse. ??[Scale-space wfss as a mechanism: parsing, matching, connectionism; ambiguity, combinatorics, error]

X. Continuous Speech

We have examined a number of linear precedence constraints stated on syllables and feet at the broad phonetic level. These are useful for discriminating among a fairly small set of lexical items, and they are useful as an entry device to recognition in the continuous case [notion of getting the computation started here]. We have seen that a foot algorithm stated at the broad phonetic level serves to loosely divide wfss's into regions that correspond to possible words. However, the ensuing representation is highly ambiguous due to [above formulation]. The disambiguation process depends on the existence of constraints at a finer level of phonetic detail. There are such constraints at the foot level, and they prove especially helpful in separating out unstressed function words from adjacent content words, and in finding word boundaries in function word sequences. Let us take an example using a frequently occurring preposition.
It happens that English contains no monomorphemic feet beginning with the unstressed syllable [tx] followed by any stressed syllable beginning with a fricative. This is presumably an accidental gap, due to restricting our lexical search space to half a million words. In such a restricted task, however, accidental gaps involving phonetic sequences which match function words are numerous and beneficial. Suppose such a sequence were embedded in a phrase such as "...at you for...". We can garbage collect a number of sub-strings, cutting back on the combinatorics of lexical hypothesization. No content word will contain a foot with any of the following sequences:

*tyu - f V:
*Cx - f V:
*tx - f V:
*t/C x - Fric V:
*t/C Fric V:          (one syllable)
*?ty x - Fric V:
*y x - Fric V:
*?ty x - Fric V:      (one syllable)
*y x - Fric V:        (one syllable)

(where : = nonreduced vowel, ? = glottalization, C = affricate, and - = syllable boundary)

[note about the common allophonic variants of "at you for"]

The natural question to ask at this point is: what if the front end cannot detect [c] or [ty] reliably? The iterative refinement model is still beneficial. If the front-end labeller produces a *set* of finer-grain labels, we will still make use of sequencing constraints; we will merely have to consider a larger set of sub-strings. This seems preferable to a "best-match" algorithm that settles upon a fine-grain label at the expense of a remote, but possibly correct, candidate label.

Thus the funnel design reduces the combinatorics of processing continuous speech by a number of largely independent operations. Broad phonetic labelling and wfss formation provide a discrete representation [ww] amenable to lattice pruning. The addition of finer-grain labels again increases the computational burden, but with the pay-off that a new set of sequencing constraints may be used for further garbage collection of non-spanning strings.
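Constraints of this kind are easy to apply mechanically once the gaps are tabulated. A sketch, assuming an illustrative subset of the gap list above, with each gap written as a pair of adjacent syllable transcriptions (x = reduced vowel, V: = nonreduced vowel, per the key above):

```python
# An illustrative subset of the accidental-gap constraints (pairs of
# adjacent syllables that no content-word foot may contain).
GAPS = {('tyu', 'fV:'), ('Cx', 'fV:'), ('tx', 'fV:')}

def foot_internal_ok(syllables):
    """Reject a hypothesized foot if any adjacent syllable pair
    instantiates a gap constraint; a rejection forces a word boundary
    between the offending syllables."""
    return not any((a, b) in GAPS
                   for a, b in zip(syllables, syllables[1:]))

# [tx] + fricative-initial stressed syllable cannot be foot-internal,
# so a boundary must fall between them (e.g. "to" + a following word).
print(foot_internal_ok(['tx', 'fV:']))
print(foot_internal_ok(['tx', 'bV:']))
```

When the front end returns a *set* of labels per segment, the same check is simply run over every transcription in the cross-product, pruning whichever hypothesized feet violate a gap.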
The scheduling of the addition of progressively more finely detailed information is determined by the confusability error of the front end. These advantages stem from a modular design with respect to parsing and lexical access, and from a hierarchical partitioning of the lexicon itself.

Promise: we [read Meg] will show how we can deal with uncertainty, deletions, modifications and insertions -- both phonologically motivated and some cases of spurious insertion. We found in read, but relaxed, speech that... (ice cream)

    the vowel in the primary-stressed syllable of a content word will not delete (other stuff here)
    word-final consonants delete regularly, even in order to preserve the initial consonant of a following function word

XI. Stress in the signal

How reliable is stress detection, and what about stress shift? This is not so bad, because easily detectable stress (by current algorithms) distinguishes only primary, unreduced and reduced anyway, so let us rely on it only at the onset:

artificial intelligence --> ARti'ficial inTELligence

Look up all /artx/ and all /fISl/ -- "art" won't be in the lexicon as reduced /xrtx/, and /fISl/ won't reduce all the way down, so the problem remains tractable.

XII. Conclusion