LANGUAGE-SPECIFIC PARSING OF ACOUSTIC EVENTS

Abstract

This paper reports a technique for transforming acoustic events derived from a processed speech signal into meaningful units for a given natural language. This decomposition of the problem permits a representation that is stable across utterance situations and provides constraints that handle some of the difficulties associated with partially obscured or "incomplete" information. A system will be described which contains a grammar for parsing higher-level (phonological) events as well as an explicit grammar for low-level acoustic events. It will be shown that the same techniques used for parsing syntactic strings apply in this domain. The system thus provides a formal representation for physical signals and a way to parse them as part of the larger task of extracting meaning from sound.

1. Introduction/Motivation

Knowledge-based systems get around the uncertainty of information in various ways, typically through a process of "compartmentalization" [McDermott83], which may involve access to task-specific information, the general domain, or the agents involved [Lowerre--]. We follow this general paradigm at a microscopic level by developing a representation of physical signals that lets us use both non-linguistic, situation-independent knowledge and higher-level language-specific constraints in the information extraction process. The proposal is that there is a set of ordinary grammars that map all the way down to the level of sound. Given this representation of the spoken utterance, we can compute with it in interesting ways.

Current research into spoken language generally splits into two kinds of activities. One is primarily linguistic and involves the formulation of grammars and descriptive systems for living and extinct natural languages. The other has a more direct connection to the analysis of speech signals, whether for the purpose of synthesis or automatic recognition. There are two reasons for incorporating methods from the first area of study into the second domain:

1. Understanding "speech" is a problem that involves various knowledge sources, including fairly abstract language-particular information. Formal cross-linguistic studies yield a ready-made modularization of such knowledge.

2. The principles of word formation for a given language can guide recognition of novel strings. Linguistic knowledge allows for the fact that the inventory of items in a specific language is a moving target, as opposed to a "bin of parts."

In our implementation we have been most concerned with the first observation. We have limited the sources of knowledge to those below the level of syntax, and even below the word, so that we include as a component a set of language-specific grammars of low-level objects: syllables, segments, down to acoustic events. The premise is that native speakers have access to all these information levels, and that some knowledge sources are essential because speech recognition can happen in noise. The latter premise assumes a technique of soft matching, whereby uncertain input can be discarded, as opposed to a forced-choice scheme.

Using different kinds of knowledge to solve a problem

Information can be gleaned from speech using familiar techniques employed in natural language research at higher levels such as syntax. The feasibility of "finding word boundaries" through such parsing methods, taking allophones as input, has already been demonstrated [Church83].
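To make the idea concrete, the following is a minimal sketch of how positional allophone constraints can yield word-boundary hypotheses. The allophone symbols and positional classes are hypothetical illustrations, and the sketch is written in Python rather than in the Lisp environment described below; it is not the grammar of [Church83] or of our system.

    # Hypothetical sketch: allophones restricted to particular syllable
    # positions constrain where a syllable (hence possibly a word) boundary
    # can fall.  The symbols and positional classes below are illustrative.

    SYLLABLE_INITIAL_ONLY = {"th", "kh", "ph"}   # e.g. aspirated stops
    SYLLABLE_FINAL_ONLY = {"t]", "p]"}           # e.g. unreleased stops

    def boundary_hypotheses(allophones):
        """Return indices at which a boundary may be hypothesized."""
        boundaries = set()
        for i, a in enumerate(allophones):
            if a in SYLLABLE_INITIAL_ONLY:
                boundaries.add(i)        # a boundary must precede this allophone
            if a in SYLLABLE_FINAL_ONLY:
                boundaries.add(i + 1)    # a boundary must follow this allophone
        return sorted(boundaries)

    # An aspirated [th], as in "a tall ...", forces a boundary before it:
    print(boundary_hypotheses(["a", "th", "o", "l"]))   # -> [1]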
We will take a lower level of knowledge, one similar to Church's, and a higher level consisting of strings of syllables. The incorporation of the latter reflects our belief that listeners attend to the rhythmic structure of speech in finding word boundaries. The lowest level entails a difficult task: developing a symbolic representation of the speech signal that we can compute with. In order to get at the information in a speech signal, however it is filtered or enhanced, we have to go through a qualitative labelling procedure. Although there are several ways to go about this initial step, we will limit the discussion to the labelling of spectrograms. The key point is that labelling should make reference only to the physical signal itself, not to higher-level linguistic categories. A theory of how the labelling should in principle be done will rest on a theory of perceptual grouping [Witkin & Tenenbaum83]. In this work we start at the second step, formulating a language-independent representation of perceptually salient events, without accounting for why the events are salient, e.g. why it is we know how to label a sharp burst. Thus, we are already relying on a certain kind of very complex knowledge.

The symbolic representation is the "transcript" of the speech events. It is language-independent and consists of strings of fairly obvious labels such as "silence double-burst second-formant-dropped," etc. The acoustic labels are parsed by the segmental parser into linguistic objects. That is, the initial labelling over each pair of vertices vi and vi' is parsed into higher-level objects. Unlike the labelling, the parsing is language-specific. While, for instance, a short silence and aperiodic energy can be parsed as the affricate "ch" of "choose" in English, they cannot be in French; we would claim that this rule is not included in the French acoustic grammar. Under certain circumstances, a gap followed by turbulence starting at a fairly low frequency, as in "cette chose," might be parsed as either a [c] or a [ts] in English, but never as the former in French.

Syllabic grammars impose language-specific constraints. However, these are not as stringent as one might expect, owing to the various permissible lenition processes such as vowel deletion in unstressed syllables. On the other hand, the combination of supra-syllabic grammars and soft matching can counteract the effects of reduction processes, as we will see.

An Example

A speech spectrogram is labelled (without prior segmentation). The labels are drawn from a fixed set and by default correspond to a "commonsense" naming of objects; objects that do not match any label are ignored. The labels are looked up and interpreted in an articulation lexicon. This step is not actually necessary, but it is interesting in that the mapping from acoustic events to articulatory events is many-to-many, and this property happens to cause some grammar rules to collapse. It also makes it easy to test various front ends, since any input is simply translated in the lexicon. The affricate rule, for example, is a [c] which rewrites as Closure followed by AlvpaFric, phonetic translations of the acoustic event transcript.

Analysis of the transcript

The process of analyzing the transcript is performed with an active chart parser and linguistic grammars. So long as we can formulate a grammar of acoustic events, we can use standard techniques to go from the represented signal to higher linguistic levels.
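As a minimal sketch of this step, the fragment below applies a language-specific acoustic grammar bottom-up to a transcript of event labels. The rule rewriting [c] as Closure followed by AlvpaFric is taken from the text; the remaining rules, the greedy control strategy, and the use of Python rather than the actual chart-parser machinery are illustrative assumptions only.

    # Illustrative sketch (not the actual system): a language-specific
    # acoustic grammar rewrites sequences of acoustic-event labels as
    # segments.  English includes the affricate rule; the French grammar
    # omits it, so the same transcript is analyzed differently.

    ENGLISH_GRAMMAR = {
        ("Closure", "AlvpaFric"): "[c]",    # affricate, as in "choose"
        ("Closure",):             "[t]",
        ("AlvpaFric",):           "[sh]",
    }
    FRENCH_GRAMMAR = {
        ("Closure",):             "[t]",
        ("AlvpaFric",):           "[sh]",
    }

    def parse_segments(transcript, grammar):
        """Greedy bottom-up pass: prefer the longest run of acoustic events
        that some rule rewrites as a segment; unmatched events are ignored."""
        segments, i = [], 0
        while i < len(transcript):
            for span in (2, 1):                      # longest match first
                rhs = tuple(transcript[i:i + span])
                if len(rhs) == span and rhs in grammar:
                    segments.append(grammar[rhs])
                    i += span
                    break
            else:
                i += 1                               # no rule applies: skip event
        return segments

    transcript = ["Closure", "AlvpaFric"]            # e.g. from "choose" / "cette chose"
    print(parse_segments(transcript, ENGLISH_GRAMMAR))  # -> ['[c]']
    print(parse_segments(transcript, FRENCH_GRAMMAR))   # -> ['[t]', '[sh]']

In the system itself the same effect falls out of the grammars: the affricate rule is simply absent from the French acoustic grammar, so the [c] hypothesis is never built.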
The parser is derived from Kaplan's General Syntactic Processor [Kaplan--], and the system is built on top of the Lisp-based LFG grammar-writer's environment [?]. Bottom-up parsing is the norm, but a flag for partial parses throughout the string also permits examination of especially degraded input. We make use of a chart window and a tree-structure window. The transcript is entered in the top-level typescript or "lisp listener" window. All successful parses (complete at the top level) are shown as trees; all successful hypotheses (complete at each level, going bottom-up) appear in the chart lattice. The root node can be selected with a menu or with code.

The sets of grammars are hierarchical by nature; however, mixed categories appear by design on the same level. For instance, in a word like "finger," the transcript may contain a medial nasal formant, the second and third formants of the vowel drawing together, and a brief amount of low-frequency energy. This is given labels which are interpreted phonetically as nasalization velarization nasal, which is parsed as [ n ]. On the other hand, there may not be any discernible movement of F2 and F3. In this case, the medial portion of the transcript string is parsed as NAS, a category more general than the allophone [ n ], but one which appears in the same position in the parse tree. NAS or [ n ] is in turn parsed as CDA, a syllable "coda." Other examples of categories at this level include syllable "onsets" and "medial onsets." Medial in this case refers to a position in a stress group or foot. This higher category contains many constraints that allow recovery from incomplete or obscured information.

Phonological constraints will allow us to hypothesize words once we have labelled a trajectory of acoustic events through time. In the case of our second parse tree for "finger," where only a vague nasal and not a velar nasal is hypothesized, foot-structure constraints will tell us the identity of the nasal, because it is the only possible foot-medial nasal that precedes a [g]. We wish to underscore the importance of this knowledge source, rather than relying solely on mapping to the word to provide information about obscured segments.

There are three reasons for relying on phonological constituents instead of merely the word. The first is that words are hard to find in connected speech; phonological constituents, on the other hand, can be built bottom-up. The second reason is that the appearance of an acoustic event depends on where a segment falls in a phonological constituent: a [p] will have a burst or not depending on its position. The third reason is that phonological categories provide interesting sequencing constraints, as we have just witnessed. Yet another reason has already been mentioned: the lexicon is not quite a "bin of parts," and we have to expect novel words. This is especially striking in morphologically rich languages like Finnish.

Intuitive Explanation

Each grammar rule can be viewed as a small information processor. We advocate an approach that uses separate stages of higher-level constraints, notably those found in the phonology (as opposed to even higher levels), which lets us say that we can, in some sense of the word, perceive (though perhaps not recognize) a sound in the world without first knowing it is part of a linguistic object.
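To make the "small information processor" view concrete, here is a sketch of a single foot-level constraint acting on the underspecified parse of "finger" discussed above. The category names (NAS, coda, foot-medial position) follow the text; the rule table, the notation [ng] for the velar nasal, and the Python form are illustrative assumptions rather than the system's actual rules.

    # Illustrative sketch: one foot-level rule as a small information
    # processor.  An underspecified nasal (NAS) parsed as a foot-medial
    # coda is resolved by looking at the following onset, as in "finger,"
    # where only the velar nasal (written [ng] here) can precede [g].

    FOOT_MEDIAL_NASAL_BEFORE = {
        "[g]": "[ng]",   # "finger", "anger"
        "[b]": "[m]",    # "timber"
        "[d]": "[n]",    # "thunder"
    }

    def resolve_foot_medial_nasal(coda, next_onset):
        """Recover the identity of an underspecified nasal coda from the
        sequencing constraint imposed by the following foot-medial onset."""
        if coda == "NAS" and next_onset in FOOT_MEDIAL_NASAL_BEFORE:
            return FOOT_MEDIAL_NASAL_BEFORE[next_onset]
        return coda

    # Second parse tree for "finger": only a vague nasal was hypothesized,
    # but the foot-structure constraint still identifies it.
    print(resolve_foot_medial_nasal("NAS", "[g]"))   # -> [ng]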
It seems that the way to go about labelling "only what is there" is to look at the uniformities in the actual signal, the acoustic events, and to assign categories based on the physical signal itself. The signal is analyzed in a qualitative way, but, perhaps unexpectedly, linguistic constraints do not mediate this process; they mediate a logically subsequent one. Viewed in this light, the research problem becomes one of determining the classification of events in the physical signal, then developing a parser that can map from these first to phonetic categories, then to phonological categories, and finally to words. Put differently, we can ask: how can we represent actual, physical signals, and how do we compute with those representations to extract meaning?

Conclusion

This approach to spoken natural language shares some of the same advantages, and disadvantages, as recent research paradigms in vision [Marr]. Its flexibility and the ease with which different kinds of knowledge can be incorporated, however, speak in its favor. It is perhaps surprising that language-specific grammars can be pushed all the way down to the level of acoustic events; but, that being the case, grammars appear particularly well-suited for encoding information at this level.

REFERENCES

[Church83]
[Kaplan--]
[Lowerre--]
[Marr]
[McDermott83]
[Witkin & Tenenbaum83]
[Zue & Shipman81]