CSL Notebook Topic
To VoiceInterest  Date September 12, 1983 3:57 pm
From Dan Swinehart  Location PARC
Subject "Cedar Voice"  Organization CSL
Release as [Indigo]<CSL—Notebook>SomethingOrOther.
Came from
 [Indigo]<Voice>Documentation>CedarVoice.tioga
Last edited by Dan Swinehart
Abstract This note describes a design for a programmer's interface to the Etherphone system. Its primary goal is to keep things simple, so that people will actually use it. A secondary goal is to make as many features of the Etherphone system available as possible without violating the primary goal. These features include simple placement (and possibly answering) of interactive telephone calls, the generation of various simple kinds of noises (mostly beeps), and a fairly extensive set of capabilities for manipulating recorded voice.
Introduction
The Etherphone system [GlobeCom] combines a conventional telephone system with a number of unconventional (or at least uncommon) features, including editable recorded voice segments and flexible workstation control of all of the above. The hardware and the software architecture are interesting, but not the subject of this note. (The reference gives an overview; the details have never been coherently recorded.)
Audio is inherently more one-dimensional than vision. Although people can in principle hear many independent noises, they have trouble making sense of more than one at a time. Besides that, it's temporal: a noise has to be going by before you can hear it -- and then it's gone unless you play it again. Compared with our two-dimensional, multi-viewer Cedar applications, these attributes limit the ways in which voice-related devices can be used. As usual, limits lead to complications: the control of the Etherphone system is quite complicated. This digression is the rationale for the decisions revealed in the next paragraph.
From the start, it has been our intent to provide an interface to the voice-related services that would allow programmers to add voice (and other noises) to their applications. It is still our intent. This document presents a preliminary design for such an interface. This design divides fairly naturally into several (presently five) components. The interface provides access to nearly all the underlying capabilities of some of these components, while severely restricting others. Generally, the restrictions apply to the heavy-duty interactive telephone functions, whose full generality we haven't really figured out how to accomplish yet. (The interface described here supports the placement of simple calls, by name or number, but doesn't get into the hairy conferencing, forwarding, call-holding, and call-filtering domains.) The remaining features are, however, compatible with the versions of the hairy functions that we intend to supply as standard equipment. Fortunately, most of the potential clients I've talked to really want to do the other things anyway: play tunes, record and replay voice, edit these recordings, annotate documents, and generally fling little snippets of noise about. That stuff is pretty well worked out in the interface.
A comment on names: this set of functions has traditionally been dubbed "CedarVoice". It has a lot less to do with Cedar than it does with Voice. In form, it will resemble "GrapevineUser" more than anything, so maybe we should call it "ThrushUser" -- Thrush being the telephone control server that manages the whole system -- or "VoiceUser". Actually, since it's a client interface to the underlying system, "ThrushClient" or "VoiceClient" is a more reasonable sobriquet. The temptation is to stick to the alate appellations, call it "Turkey", and be done with it. This issue is clearly not yet decided, nor have I decided how many separate interfaces to supply; like "GrapevineUser", a separation by component may well make sense. In this document, I'll continue to use "CedarVoice."
CedarVoice Components (hard stuff last)
Interactive Calls
You can place a simple two-party call by name or telephone number, with a number of parameters controlling how hard to try, whether to try for an intercom call, whether to record the call, and so on. This either succeeds or fails. If it fails, you find out something about why, but can do little about it. These features are intended to be used in conjunction with existing Finch facilities for manually controlling telephone directories and telephone calls. Anything your program can do via CedarVoice, your user can undo via Finch. We intend to allow more control later, but not until we learn how to do each thing well in Finch.
The interface also includes some functions to allow a program to detect and answer or reject an incoming call. This is quite primitive, for the same reasons.
Noises
A Lark can generate a tone comprising the sum of two sine waves with frequencies in the range from 0 to about 3500 Hz. If one of the frequencies is 0, a pure sine wave results. Thrush can direct the Lark to enqueue a number of timed sequences of these tones, perhaps interspersed with timed sequences of silence. This is how Larks ring and make busy signals. The Noises component of CedarVoice makes these abilities available. The client can specify whether to beep while a call is in progress and, if so, whether to let the other party hear the beeps. The interface includes two other forms of noise specification: Laurel mail tune format, and Mockingbird .music file format (precise interpretation TBD). The Mockingbird variety may be hard to get working.
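To make the tone model concrete, here is a minimal Python sketch of rendering such a sequence into samples. It assumes 8000 samples per second (a typical telephone rate; the Lark's actual rate and sample format are not specified in this note), and the names `Tone`, `tone_samples`, and `sequence_samples` are hypothetical, not part of the proposed interface. Real Larks generate the tones themselves under Thrush's direction; the sketch only shows what "sum of two sine waves, interspersed with silence" means.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE = 8000  # assumed telephone-quality sampling rate, samples per second
MAX_FREQ_HZ = 3500  # upper frequency bound quoted in the text


@dataclass(frozen=True)
class Tone:
    """One step of a noise sequence: the sum of two sines held for duration_ms.
    A frequency of 0 contributes nothing, so (f, 0) is a pure sine and (0, 0) is silence."""
    f1_hz: float
    f2_hz: float
    duration_ms: int
    amplitude: float = 1.0  # peak value reached only if both sines line up


def tone_samples(tone: Tone) -> list[float]:
    """Render one Tone as floating-point samples."""
    for f in (tone.f1_hz, tone.f2_hz):
        if not 0 <= f <= MAX_FREQ_HZ:
            raise ValueError(f"frequency {f} Hz outside 0..{MAX_FREQ_HZ} Hz")
    n = tone.duration_ms * SAMPLE_RATE // 1000
    half = tone.amplitude / 2  # each sine gets half the amplitude budget
    return [
        half * (math.sin(2 * math.pi * tone.f1_hz * i / SAMPLE_RATE)
                + math.sin(2 * math.pi * tone.f2_hz * i / SAMPLE_RATE))
        for i in range(n)
    ]


def sequence_samples(steps: list[Tone]) -> list[float]:
    """Render a timed sequence of tones and silences, played back to back."""
    out: list[float] = []
    for step in steps:
        out.extend(tone_samples(step))
    return out
```

For instance, `sequence_samples([Tone(440, 480, 200), Tone(0, 0, 100), Tone(440, 480, 200)])` yields two 200 ms beeps separated by 100 ms of silence, 4000 samples at the assumed rate. A ringing cadence is the same idea with longer steps repeated.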
Noisy Text
We have one (1) speech synthesizer that turns text into respectably intelligible English. We intend to connect it to a dedicated Lark (Etherphone box), so that its output can be sent to any other Lark. This component of CedarVoice has functions that accept text and produce voice, again either overriding or deferring to ongoing interactive telephone calls.
Voice File Directory
This directory isn't exactly a directory -- it's a Cypress database, and thus considerably more wonderful. And it doesn't really catalog voice files, but more complex beasts known as voice ropes (see the last, hard stuff, section.) But the simple ones are like files: single, contiguous stored utterances. Each has a unique identifier (entity name), then some other information: type of entry (simple or complicated), type indicating intended use (Walnut voice message, piece of Tioga file, and so on), creator, dates, access privileges, encryption keys, and the like.
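As a rough illustration of what an entry might carry, here is a Python sketch that models the fields listed above as a record plus an in-memory lookup table. It is not a Cypress schema: the names `VoiceEntry`, `EntryKind`, `IntendedUse`, and `VoiceDirectory` are hypothetical, only one date is shown, and a real entry would live in the database rather than in a dictionary.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional


class EntryKind(Enum):
    SIMPLE = "simple"            # one contiguous stored utterance
    COMPLICATED = "complicated"  # a voice rope assembled from other entries


class IntendedUse(Enum):
    WALNUT_MESSAGE = "walnut-message"  # voice message sent through Walnut
    TIOGA_PIECE = "tioga-piece"        # voice attached to part of a Tioga file
    OTHER = "other"


@dataclass
class VoiceEntry:
    entity_name: str                 # unique identifier of the entry
    kind: EntryKind
    use: IntendedUse
    creator: str
    created: datetime
    readers: set[str] = field(default_factory=set)  # access privileges: who may play it
    encryption_key: Optional[bytes] = None


class VoiceDirectory:
    """In-memory stand-in for the Cypress-backed directory (hypothetical)."""

    def __init__(self) -> None:
        self._entries: dict[str, VoiceEntry] = {}

    def add(self, entry: VoiceEntry) -> None:
        # Entity names must be unique, as in the text.
        if entry.entity_name in self._entries:
            raise KeyError(f"duplicate entity name {entry.entity_name!r}")
        self._entries[entry.entity_name] = entry

    def lookup(self, entity_name: str) -> VoiceEntry:
        return self._entries[entity_name]

    def by_use(self, use: IntendedUse) -> list[VoiceEntry]:
        # A query of the kind a database makes cheap and a flat file directory does not.
        return [e for e in self._entries.values() if e.use is use]
```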
Voice Ropes
The operations that are needed to manipulate recorded voice form a superset of the operations provided by Cedar ROPEs. Operations such as Substr, Concat, Length, and Fetch make sense; creation by recording a stream of voice samples has its analog in IO.PutFR. (Use of Fetch, to obtain the actual values of individual voice samples, is expected to be rare.)
At present, it appears that we can in fact use the existing ROPE implementation to extract segments of utterances, and to compose them with other segments. If this is not possible, we will at least use the immutable ROPE model as a well-understood metaphor for the voice manipulations. In this document, the actual ROPE implementation is assumed.
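The rope metaphor can be sketched independently of the Cedar ROPE package. The Python model below is a standalone illustration, not the actual ROPE implementation the text assumes: a voice rope is an immutable tuple of references to stored utterances, so Substr and Concat only rearrange references and never copy samples. The names `Piece`, `VoiceRope`, and the `store` dictionary passed to `fetch` (standing in for stored voice files) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    voice_id: str  # hypothetical identifier of a stored, contiguous utterance
    start: int     # index of the first sample used from that utterance
    length: int    # number of samples used


@dataclass(frozen=True)
class VoiceRope:
    """Immutable sequence of samples, represented as references to stored utterances."""
    pieces: tuple[Piece, ...] = ()

    def length(self) -> int:
        return sum(p.length for p in self.pieces)

    def concat(self, other: VoiceRope) -> VoiceRope:
        # No samples are copied; the result just lists both sets of pieces.
        return VoiceRope(self.pieces + other.pieces)

    def substr(self, start: int, length: int) -> VoiceRope:
        end = start + length
        if start < 0 or length < 0 or end > self.length():
            raise IndexError("substr range outside rope")
        out = []
        pos = 0  # rope offset where the current piece begins
        for p in self.pieces:
            lo, hi = max(start, pos), min(end, pos + p.length)
            if lo < hi:  # this piece overlaps the requested range
                out.append(Piece(p.voice_id, p.start + (lo - pos), hi - lo))
            pos += p.length
        return VoiceRope(tuple(out))

    def fetch(self, index: int, store: dict[str, list[int]]) -> int:
        """Return one sample; `store` stands in for the stored voice files."""
        if index < 0:
            raise IndexError("negative index")
        pos = 0
        for p in self.pieces:
            if index < pos + p.length:
                return store[p.voice_id][p.start + (index - pos)]
            pos += p.length
        raise IndexError("index beyond end of rope")
```

Because edits only produce new `Piece` lists, the original ropes stay valid after any Substr or Concat, which is the immutability property that makes ROPEs a good model here.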
In any case, we will need to extend the ROPE operations to include the recording and playback of any voice rope. We will also provide operations for dealing with other available attributes of the recorded voice, primarily the one-value-per-packet voice energy information that will allow applications to roughly locate phrase boundaries and to roughly represent utterances in some visible form. Finally, we provide a method for assigning correspondences between the voice information and other related data (text, for instance.) {Or do we?}
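Here is one simple way an application might use the per-packet energy values, sketched in Python under stated assumptions: energies arrive as a list of non-negative numbers, one per packet; a phrase boundary is guessed wherever energy stays below a threshold for some minimum number of packets; and a crude visible form is one character per packet. The function names, the threshold, and the run-length parameter are hypothetical tuning choices, not part of the design.

```python
def quiet_stretches(energy: list[float], threshold: float,
                    min_packets: int) -> list[tuple[int, int]]:
    """Return [start, end) packet ranges where energy stays below `threshold`
    for at least `min_packets` consecutive packets: candidate phrase boundaries."""
    runs: list[tuple[int, int]] = []
    run_start = None  # index where the current quiet run began, if any
    for i, e in enumerate(energy):
        if e < threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_packets:
                runs.append((run_start, i))
            run_start = None
    # A quiet run may extend to the end of the utterance.
    if run_start is not None and len(energy) - run_start >= min_packets:
        runs.append((run_start, len(energy)))
    return runs


def energy_strip(energy: list[float], levels: str = " .:#") -> str:
    """One character per packet, scaled to the loudest packet: a rough picture of the utterance."""
    peak = max(energy, default=0.0) or 1.0  # avoid dividing by zero on silence
    top = len(levels) - 1
    return "".join(levels[max(0, min(top, int(e / peak * top)))] for e in energy)
```

For example, `quiet_stretches([9, 9, 1, 1, 1, 9, 9, 1, 9, 1, 1, 1, 1], 2, 3)` reports the packet ranges `(2, 5)` and `(9, 13)`; the single quiet packet at index 7 is too short to count.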
The underlying implementation, unlike the existing ROPE package, will need to provide for the permanent storage of unflattened voice ropes. Flattening is infeasible because of the vast numbers of bits required to store voice. We expect Cypress to help us here, but performance may be an issue.
Interactive Calls Component
Types
Call Placement Hints
[ Urgencies, busy-test, intercom-recommendations, . . . ]
Reasons for things
[ Rationalization of all the NB's, Reasons, . . . ]
CallBack types
[ Filter Proc: caller, urgency, subject, returns yes, no, pass ]
[ Call progress proc. Info ONLY. Questionable. convID, caller, callee, urgency -- find approp. Thrush types to copy. ]
Procedures
PlaceCall
[ Includes various hints, but either succeeds or fails based on what might appear to be capricious bases determined by the implementation. If fails, reason is provided, also posted in conventional Finchy places if there are any (unspecified.) If fails, there's little reason to try again, soon, unless to up the priority? Parameter to indicate just checking callee busy? ]
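The succeed-or-fail contract might look something like the following Python stub. Everything here is hypothetical: the hint fields are taken from the bracketed notes (urgency, intercom, recording, "just checking callee busy"), while `directory` and `busy` stand in for state that only Thrush really knows. The stub carries urgency, intercom, and recording hints along but ignores them; only the shape of the call and its result is the point.

```python
import itertools
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Urgency(Enum):
    LOW = 1
    NORMAL = 2
    URGENT = 3


class Reason(Enum):
    UNKNOWN_PARTY = "no such name or number"
    BUSY = "callee busy"


@dataclass(frozen=True)
class CallHints:
    urgency: Urgency = Urgency.NORMAL  # ignored by this stub
    try_intercom: bool = False         # ignored by this stub
    record: bool = False               # ignored by this stub
    just_check_busy: bool = False      # report whether the call would go through, but don't place it


@dataclass(frozen=True)
class CallOutcome:
    ok: bool
    reason: Optional[Reason] = None         # set only on failure
    conversation_id: Optional[int] = None   # set only when a call was actually placed


_conversation_ids = itertools.count(1)


def place_call(callee: str, hints: CallHints,
               directory: dict[str, str], busy: set[str]) -> CallOutcome:
    """Resolve a name or number, then succeed or fail with a reason (stub)."""
    # Accept either a known name (mapped to its number) or a known number.
    number = directory.get(callee, callee if callee in directory.values() else None)
    if number is None:
        return CallOutcome(False, Reason.UNKNOWN_PARTY)
    if number in busy:
        return CallOutcome(False, Reason.BUSY)
    if hints.just_check_busy:
        return CallOutcome(True)
    return CallOutcome(True, None, next(_conversation_ids))
```

Returning a value rather than raising keeps the "little reason to try again" case cheap for the client to inspect; whether the real interface should use a result record or an error is still open.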
ValidateRecipient
[ Just makes sure the recipient parameter is valid, exists, that sort of thing. Not necessarily useful. ]
CallState
[ Simple information about ongoing calls (PlaceCall will succeed/fail, even if recip. is valid, and why, primarily) Supposed to apply to caller only. Use PlaceCall properly to deal with callee? ]
RegisterFilter
[ Registers a procedure that will have the opportunity to accept, reject, or pass on to the next filter an incoming call. Filters are tried in inverse order of registration, which might not be fair. If it's not fair, we'll fix it. Returns some sort of U. ]
WithdrawFilter
[ Given the U, undoes the binding. ]
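The filter rules in the two notes above (accept, reject, or pass; tried in inverse order of registration; a handle U for withdrawal) can be sketched in Python as follows. The names `Verdict`, `IncomingCall`, and `FilterChain` are hypothetical, the handle is assumed to be a small integer, and what happens when every filter passes (here, `PASS` is returned so normal ringing can proceed) is an assumption the notes leave open.

```python
import itertools
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    PASS = "pass"  # let the next filter decide


@dataclass(frozen=True)
class IncomingCall:
    caller: str
    urgency: int
    subject: str


Filter = Callable[[IncomingCall], Verdict]


class FilterChain:
    def __init__(self) -> None:
        self._filters: dict[int, Filter] = {}  # dicts keep insertion (registration) order
        self._handles = itertools.count(1)

    def register(self, f: Filter) -> int:
        """RegisterFilter: returns a handle (the 'U') for later withdrawal."""
        handle = next(self._handles)
        self._filters[handle] = f
        return handle

    def withdraw(self, handle: int) -> None:
        """WithdrawFilter: unknown handles are ignored (an assumption)."""
        self._filters.pop(handle, None)

    def screen(self, call: IncomingCall) -> Verdict:
        # Most recently registered filter gets the first chance.
        for f in reversed(list(self._filters.values())):
            verdict = f(call)
            if verdict is not Verdict.PASS:
                return verdict
        return Verdict.PASS  # nobody decided; leave the call to normal handling
```

The "might not be fair" worry is visible here: a filter registered later can shadow every earlier one simply by never returning `PASS`.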
RegisterCallReporter
[ Procedure that will find out when things happen. No control granted. Questionable? Returns a U. ]
WithdrawCallReporter
[ Given the U, undoes the binding. ]
Noises Component
Types
Style criteria
[ loudness, speed, that sort of thing. ]
Override criteria
[ Whether to override ongoing calls; whether other party hears it. ]
Procedures
ToneSequence
[ Everything necessary to produce reasonably complex series of things out of sine waves. Include amplitude, conditionally repeated frobs, sky's the limit. All forms of ringing cadences, and the like. Include QueueIt. Returns U. ]
RemoveSequence[U]
[ Eliminates just that one, or all after that one. ]
Flush
[ Remove all. Like No-Op, queueIt: FALSE ]
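The queue behavior behind ToneSequence, RemoveSequence, and Flush might be modeled as below. This Python sketch rests on two interpretations that the notes do not settle: `queue_it=False` means "discard whatever is pending, then play this" (which is why Flush is like a no-op with queueIt FALSE), and "all after that one" means that sequence together with everything queued behind it. `NoiseQueue` and its method names are hypothetical.

```python
import itertools
from typing import Any


class NoiseQueue:
    def __init__(self) -> None:
        self._pending: list[tuple[int, Any]] = []  # (handle, sequence), in play order
        self._handles = itertools.count(1)

    def enqueue(self, sequence: Any, queue_it: bool = True) -> int:
        """Like ToneSequence: returns a handle (the 'U')."""
        if not queue_it:
            self._pending.clear()  # assumed meaning of queueIt: FALSE
        handle = next(self._handles)
        self._pending.append((handle, sequence))
        return handle

    def remove_sequence(self, handle: int, and_following: bool = False) -> None:
        """RemoveSequence[U]: drop just that one, or it and everything after it."""
        for i, (h, _) in enumerate(self._pending):
            if h == handle:
                if and_following:
                    del self._pending[i:]
                else:
                    del self._pending[i]
                return
        # Already played or never queued: nothing to do (an assumption).

    def flush(self) -> None:
        """Flush: remove everything pending."""
        self._pending.clear()

    def pending(self) -> list[int]:
        return [h for h, _ in self._pending]
```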
LaurelTune[text, style, callControl, queueIt]
[ Takes a rope, somehow obtained, in Laurel Mail tune format. Rename to credit the original inventors of the format. Include style: loudness, speed, whatever other params needed to round it out. callControl is the stuff to determine whether noises override ongoing call, and whether other parties to ongoing calls hear the noises. ]
MockingbirdTune[text, voice1, voice2, style, callControl, queueIt]
[ Similar. Up to two notes extracted from up to two voices. Text is rope slurped from .music file? Use FS filename instead? One or both? Assume hard to implement, to be delayed. ]
Noisy Text Component
Types
[ See style, override criteria types ]
Procedures
Speak[rope, style, override, queueIt]
[ Loudness and speed may be reasonable global things to say. Otherwise, all the control's embedded in the Rope. Speech-Plus format. Returns U. The Flush and RemoveSequence operations from the Noises component all work here too. ]