I. Problems posed by increasing lexicon size and speaker community

II. Ways to artificially delimit the problem
    a. restrict vocabulary
    b. restrict kinds of words
    c. use one speaker
    d. use one dialect, voice type

III. How to avoid restricting lexicon size and excessive training to an individual? How to avoid rebuilding the world after training to an individual?

IV. Along what dimensions are all possible items maximally distinct?
    a. function word vs. content word segment statistics
    b. stress patterns and high-frequency words

IV. Phonotactic constraints and statistical properties of the English lexicon
    a. Shipman and Zue
    b. Huttenlocher
    c. Why Waltz-type search algorithms make sense for natural language
        i. phonotactic constraints plus accidental gaps

V. Radically underspecified acoustic phonetic information is possible for lexical lookup -- what information is reliable?
    a. stressed syllables
    b. Yet information in unstressed syllables is important: Church

VI. Syllables are not an easy unit to access because they are difficult to isolate when analyzing the signal
    a. They do not provide significant aid when mapping to the word

VII. Supra-syllabic units, or 'feet', can be deterministically parsed in isolated words
    a. should use morphological information
    b. but if only the stress is known and not the morphology, one algorithm is the following
        i. find primary stress foot, secondary stress feet, syllables with heavy rhymes, include "proclitic" syllables into feet
    c. statistics on feet if every segment is known

VIII. Parsing feet in an utterance (with no word boundaries) is non-deterministic
    a. In the worst case, it is only a little worse than n ??
       (cap = stressed, lower-case = unstressed)
       Let an utterance be aB*

       So aBaB
       There are 2 possible parses for this (2 stresses)
           (aB)(aB)
           (aBa)(B)

       aBaBaB
       There are 4 possible parses for this (3 stresses)
           (aB)(aB)(aB)
           (aBa)(Ba)(B)
           (aB)(aBa)(B)
           (aBa)(B)(aB)

       aBaBaBaBaBaB.....
       There are 22 possible parses for this (6 stresses)
           (aB)(aB)(aB)(aB)(aB)(aB)
           (aBa)(B)(aB)(aB)(aB)(aB)
           (aBa)(Ba)(B)(aB)(aB)(aB)
           (aBa)(Ba)(Ba)(B)(aB)(aB)
           (aBa)(Ba)(Ba)(Ba)(B)(aB)
           (aBa)(Ba)(Ba)(Ba)(Ba)(B)
           (aB)(aB)(aB)(aB)(aBa)(B)
           (aB)(aB)(aB)(aBa)(Ba)(B)
           (aB)(aB)(aBa)(Ba)(Ba)(B)
           (aB)(aBa)(Ba)(Ba)(Ba)(B)
           (aBa)(B)(aBa)(B)(aBa)(B)
           (aB)(aB)(aBa)(B)(aBa)(B)
           (aBa)(Ba)(Ba)(B)(aBa)(B)
           (aBa)(B)(aB)(aB)(aBa)(B)
           (aBa)(B)(aBa)(Ba)(Ba)(B)
           (aBa)(B)(aB)(aBa)(Ba)(B)
           (aBa)(Ba)(B)(aB)(aBa)(B)
           (aBa)(Ba)(B)(aBa)(Ba)(B)
           (aB)(aBa)(B)(aBa)(Ba)(B)
           (aB)(aBa)(B)(aB)(aBa)(B)
           (aB)(aBa)(B)(aBa)(B)(aB)
           (aBa)(Ba)(B)(aBa)(B)(aB)
    b. However, in the best case it can be deterministic, and it yields word boundaries. There is a distinction between "lexical feet" -- i.e. feet occurring inside a word in the language -- and "post-lexical" feet, which span word edges and which may or may not match lexical feet in terms of segment structure.

IX. Statistics
    a. Feet: maximally distinct (each segment counted)
       In a 20k lexicon, there are 27,675 feet (total). Of these, 16,473 are distinct. The expected number of instances of feet in each class is 9.6, calculated by S...
       There are 13,212 unique feet counted in this fashion, which equals 47% of all feet. 72% of the lexicon contains categories with five or fewer members.
       A sample of the breakdown:
           3404 feet in classes of size 2,    12 %
           1737 feet in classes of size 3,     6 %
           1172 feet in classes of size 4,     4 %
            875 feet in classes of size 5,     3 %
            684 feet in classes of size 6,     2 %
            574 feet in classes of size 7,     2 %
            . ........ (etc.)
             93 feet in classes of size 93,    0 %
            119 feet in classes of size 119,   0 %
            178 words in classes of size 178,  0 %
       The largest class contains 178 members.
    b. When this division of the lexicon into feet takes word frequency into account (Brown), the singleton class increases by 2%.
    c. When stress is factored into the equation, ??

X.
Statistics using broad phonetic categories
    a. Using broad phonetic categories instead of allophonic transcriptions, we find 5,171 types of feet, and the expected class size is 95.96058.
       The categories used were: Stops, Weak Fricatives (f, v, eth, theta), Strong Fricatives (includes affricates as well as fricatives), Vowels, Diphthongs, Glides (including liquids), and Nasals.
       There were 3,209 singleton feet, which was 11% of the sum total of 27,675. Six percent of the groups fall into extremely large classes containing more than 300 members.
       When we weight these results according to frequency, we find 5,171 types of feet, and the expected class size is 133.3836. Singleton feet compose only 4% of the sum total. Eight percent of the groups contained 18 members. 11% fall into extremely large classes containing more than 300 members.
    b. When stress is considered using broad phonetic classifications in the unweighted lexicon, the expected class size decreases slightly (87.49222 as opposed to 95.96058). The unique categories increase from 11% to 13%. 26% of the lexicon is seen to have categories containing 5 or fewer members, as opposed to 22% when stress was not calculated.
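
The class statistics quoted in IX and X can be illustrated with a small sketch. It rests on assumptions that are not in the outline: the formula for expected class size is elided there, so the code assumes the common definition (sum of squared class sizes divided by the total number of feet); the ARPAbet-style symbols, the partial BROAD_CLASS table, and the names broad_key and class_statistics are hypothetical, as are the toy feet.

```python
from collections import Counter

# Hypothetical, partial mapping from ARPAbet-style symbols to the seven
# broad categories named in X.a (dh = eth, th = theta).
BROAD_CLASS = {
    "p": "Stop", "t": "Stop", "k": "Stop", "b": "Stop", "d": "Stop", "g": "Stop",
    "f": "WeakFric", "v": "WeakFric", "th": "WeakFric", "dh": "WeakFric",
    "s": "StrongFric", "z": "StrongFric", "sh": "StrongFric", "ch": "StrongFric",
    "m": "Nasal", "n": "Nasal", "ng": "Nasal",
    "l": "Glide", "r": "Glide", "w": "Glide", "y": "Glide",
    "ae": "Vowel", "ih": "Vowel", "eh": "Vowel", "ah": "Vowel",
    "ay": "Diphthong", "aw": "Diphthong", "oy": "Diphthong",
}


def broad_key(foot):
    """Collapse a foot (a tuple of segment symbols) to its broad-class pattern."""
    return tuple(BROAD_CLASS[seg] for seg in foot)


def class_statistics(feet, key=broad_key):
    """Group feet by `key` and summarize the resulting equivalence classes.

    Assumption: expected class size = sum(n_i ** 2) / N, where n_i is the
    size of class i and N is the total number of feet.
    """
    sizes = Counter(key(foot) for foot in feet)
    total = sum(sizes.values())
    singleton_feet = sum(1 for n in sizes.values() if n == 1)
    return {
        "types": len(sizes),
        "total": total,
        "expected_class_size": sum(n * n for n in sizes.values()) / total,
        "singleton_share": singleton_feet / total,
    }


# Toy data, not taken from the 20k lexicon.
toy_feet = [
    ("p", "ih", "t"), ("b", "ae", "d"), ("k", "ah", "t"),
    ("s", "ih", "t"), ("m", "ae", "n"),
]
stats = class_statistics(toy_feet)
```

Passing key=tuple instead of broad_key keeps every segment distinct, which corresponds to the "maximally distinct" count in IX.a; under that key each of the five toy feet is its own class.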
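
The foot parses listed in VIII.a can be enumerated mechanically. The sketch below assumes only what the examples there suggest: a foot is a contiguous stretch of syllables containing exactly one stressed syllable (upper case), and every syllable belongs to some foot. The function names are hypothetical. Under that assumption alone the count is 2 for aBaB and 4 for aBaBaB, matching the outline, but 32 for six stresses rather than the 22 listed, so the list in VIII.a may be partial or may rely on a restriction not stated there.

```python
def foot_parses(utterance):
    """Return all ways to split `utterance` into feet.

    Assumption: a foot is a contiguous substring with exactly one stressed
    (upper-case) syllable; unstressed syllables are lower-case.
    """
    if not utterance:
        return [[]]  # one parse of the empty string: no feet
    parses = []
    for end in range(1, len(utterance) + 1):
        foot = utterance[:end]
        stresses = sum(ch.isupper() for ch in foot)
        if stresses > 1:
            break  # longer prefixes would only add more stresses
        if stresses == 1:
            for rest in foot_parses(utterance[end:]):
                parses.append([foot] + rest)
    return parses


def show(parse):
    """Render a parse in the outline's notation, e.g. (aBa)(B)."""
    return "".join(f"({foot})" for foot in parse)


two = [show(p) for p in foot_parses("aBaB")]
three = [show(p) for p in foot_parses("aBaBaB")]
six = foot_parses("aB" * 6)
```

In this pattern the first unstressed syllable has no foot to its left, and every later one can attach to either neighbouring stress, so the unconstrained count doubles with each added stress.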