Introduction
The Etherphone system [GlobeCom] combines a conventional telephone system with a number of unconventional (or at least uncommon) features, including editable recorded voice segments and flexible workstation control of all of the above. The hardware and the software architecture are interesting, but they are not covered here. (The reference gives an overview; the details have never been coherently recorded.)
Audio is inherently more one-dimensional than video. Although people can in principle hear many independent noises, they have trouble making sense out of more than one of them at a time. Besides that, it's temporal: a noise has to be going by before you can hear it -- and then it's gone unless you play it again. Compared with aspects of our two-dimensional multi-viewer Cedar applications, these attributes place limits on the ways in which voice-related devices can be used. As usual, limits lead to complications. The control of the Etherphone system is quite complicated. This digression is intended as rationale for the decisions revealed in the next paragraph.
From the start, it has been our intent to provide an interface to the voice-related services that would allow programmers to add voice (and other noises) to their applications. It is still our intent. This documentation is a preliminary design for such an interface. This design divides fairly naturally into several (presently five) components. The interface provides access to nearly all the underlying capabilities of some of these components, while severely restricting others. Generally, the restrictions apply to the heavy-duty interactive telephone functions, whose full generality we haven't really figured out how to accomplish yet. (The interface described here supports the placement of simple calls, by name or number, but doesn't get into the hairy conferencing, forwarding, call-holding, and call-filtering domains.) The remaining features are, however, compatible with the versions of the hairy functions that we intend to supply as standard equipment. Fortunately, most of the potential clients I've talked to really want to do the other things anyway: play tunes, record and replay voice, edit these recordings, annotate documents, and generally fling little snippets of noise about. That stuff is pretty well worked out in the interface.
A comment on names: this set of functions has traditionally been dubbed "CedarVoice". It has a lot less to do with Cedar than it does with Voice. In form, it will resemble "GrapevineUser" more than anything, so maybe we should call it "ThrushUser" -- Thrush being the telephone control server that manages the whole system -- or "VoiceUser". Actually, since it's a client interface to the underlying system, "ThrushClient" or "VoiceClient" is a more reasonable sobriquet. The temptation is to stick to the alate appellations, call it "Turkey", and be done with it. This issue is clearly not yet decided, nor have I decided how many separate interfaces to supply; like "GrapevineUser", a separation by component may well make sense. In this document, I'll continue to use "CedarVoice."
CedarVoice Components (hard stuff last)
Interactive Calls
You can place a simple two-party call by name or telephone number, with a number of parameters controlling how hard to try, whether to try for an intercom call, whether to record the call, and so on. This either succeeds or fails. If it fails, you find out something about why, but can do little about it. These features are intended to be used in conjunction with existing Finch facilities for manually controlling telephone directories and telephone calls. Anything your program can do via CedarVoice, your user can undo via Finch. We intend to allow more control later, but not until we learn how to do each thing well in Finch.
The interface also includes some functions to allow a program to detect and answer or reject an incoming call. This is quite primitive, for the same reasons as above.
Noises
A Lark can generate a tone comprising the sum of two sine waves with frequencies in the range from 0 to about 3500 Hz. If one of the frequencies is 0, a pure sine wave results. Thrush can direct the Lark to enqueue a number of timed sequences of these tones, perhaps interspersed with timed sequences of silence. This is how Larks ring and make busy signals. The Noises component of CedarVoice makes these abilities available. The client can specify whether to beep if there's a call in progress or not, and if so, whether to let the other party hear the beeps. The interface includes two other forms of noise specification: Laurel mail tune format, and Mockingbird .music file format (precise interpretation TBD). The Mockingbird variety may prove hard to get working.
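As a rough illustration of the two-sine-wave tone model and the queue of timed tones and silences, here is a minimal sketch in Python. Everything named in it (Tone, silence, render, the 8000 Hz sample rate, the amplitude scaling) is an assumption made for the example; none of it is the Lark's or Thrush's actual interface. The 480 Hz + 620 Hz pair is the usual North American busy tone and is used only as a familiar example, not as a statement of what Larks play.

```python
import math
from dataclasses import dataclass
from typing import List

SAMPLE_RATE_HZ = 8000  # assumed for this sketch; not stated in this document


@dataclass(frozen=True)
class Tone:
    """One queue entry: the sum of two sine waves, held for a fixed time."""
    f1_hz: float       # 0 contributes nothing, leaving a pure sine wave
    f2_hz: float
    duration_ms: int


def silence(duration_ms: int) -> Tone:
    """Both frequencies 0: every sample comes out as 0.0."""
    return Tone(0.0, 0.0, duration_ms)


def render(queue: List[Tone], amplitude: float = 0.5) -> List[float]:
    """Turn a queue of timed tones into samples in the range [-amplitude, amplitude]."""
    samples: List[float] = []
    for tone in queue:
        count = SAMPLE_RATE_HZ * tone.duration_ms // 1000
        for i in range(count):
            t = i / SAMPLE_RATE_HZ
            total = (math.sin(2 * math.pi * tone.f1_hz * t)
                     + math.sin(2 * math.pi * tone.f2_hz * t))
            samples.append(amplitude * total / 2)  # halve so the sum cannot clip
    return samples


# Half a second of dual tone followed by half a second of silence.
busy_cycle = render([Tone(480.0, 620.0, 500), silence(500)])
```

Repeating such an on/off pair in the queue gives the familiar cadence; a ring or a beep would differ only in the frequencies and durations chosen.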
Noisy Text
We have one (1) speech synthesizer that turns text into respectably intelligible English. We intend to connect it to a dedicated Lark (Etherphone box), so that its output can be sent to any other Lark. This component of CedarVoice has functions that accept text and produce voice, again either overriding or deferring to ongoing interactive telephone calls.
Voice File Directory
This directory isn't exactly a directory -- it's a Cypress database, thus considerably more wonderful. And it doesn't really catalog voice files, but more complex beasts known as voice ropes (see the last section, the hard stuff). But the simple ones are like files: single, contiguous stored utterances. Each has a unique identifier (entity name), then some other information: type of entry (simple or complicated), type indicating intended use (Walnut voice message, piece of Tioga file, and so on), creator, dates, access privileges, encryption keys, and the like.
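To make the shape of an entry concrete, the fields listed above might be laid out roughly as follows. This is a sketch in Python, not the Cypress schema: the class and field names, the string encoding of intended use, and the reader-set model of access privileges are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import FrozenSet, Optional


class EntryKind(Enum):
    SIMPLE = "simple"            # one contiguous stored utterance
    COMPLICATED = "complicated"  # a voice rope built from several pieces


@dataclass(frozen=True)
class VoiceEntry:
    """Hypothetical layout of one directory entry."""
    entity_name: str                        # the unique identifier
    kind: EntryKind
    intended_use: str                       # e.g. "Walnut voice message"
    creator: str
    created: datetime
    readers: FrozenSet[str] = frozenset()   # assumed access model: named readers
    encryption_key: Optional[bytes] = None


def may_play(entry: VoiceEntry, user: str) -> bool:
    """Assumed policy for the sketch: the creator and listed readers may play."""
    return user == entry.creator or user in entry.readers


# All names and values below are invented for the example.
memo = VoiceEntry(
    entity_name="voice-0001",
    kind=EntryKind.SIMPLE,
    intended_use="Walnut voice message",
    creator="alice",
    created=datetime(1985, 1, 1),
    readers=frozenset({"bob"}),
)
```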
Voice Ropes
The operations that are needed to manipulate recorded voice form a superset of the operations provided by Cedar ROPEs. Operations such as Substr, Concat, Length, and Fetch make sense; creation by recording a stream of voice samples has its analog in IO.PutFR. (Use of Fetch, to obtain the actual values of individual voice samples, is expected to be rare.)
At present, it appears that we can in fact use the existing ROPE implementation to extract segments of utterances, and to compose them with other segments. If this is not possible, we will at least use the immutable ROPE model as a well-understood metaphor for the voice manipulations. In this document, the actual ROPE implementation is assumed.
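The immutable-rope model for voice can be sketched as a small tree whose leaves point into stored utterances. The sketch below is in Python rather than Cedar, and it is not the ROPE package: Segment, Concat, and the functions are hypothetical names, and positions are counted in samples purely for convenience. Fetch is left out, since it would need access to the stored samples themselves.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Segment:
    """A contiguous run of samples inside one stored utterance."""
    utterance_id: str   # hypothetical identifier of a stored recording
    start: int          # first sample index within that utterance
    length: int         # number of samples


@dataclass(frozen=True)
class Concat:
    """Two ropes played one after the other."""
    left: "VoiceRope"
    right: "VoiceRope"


VoiceRope = Union[Segment, Concat]


def length(rope: VoiceRope) -> int:
    if isinstance(rope, Segment):
        return rope.length
    return length(rope.left) + length(rope.right)


def concat(a: VoiceRope, b: VoiceRope) -> VoiceRope:
    return Concat(a, b)


def substr(rope: VoiceRope, start: int, count: int) -> VoiceRope:
    """Return a new rope covering count samples from start; the input is unchanged."""
    if start < 0 or count < 0 or start + count > length(rope):
        raise IndexError("substr range lies outside the rope")
    if isinstance(rope, Segment):
        return Segment(rope.utterance_id, rope.start + start, count)
    left_len = length(rope.left)
    if start + count <= left_len:          # entirely within the left part
        return substr(rope.left, start, count)
    if start >= left_len:                  # entirely within the right part
        return substr(rope.right, start - left_len, count)
    return Concat(                         # straddles the join
        substr(rope.left, start, left_len - start),
        substr(rope.right, 0, start + count - left_len),
    )


def segments(rope: VoiceRope) -> List[Segment]:
    """List the stored pieces in playback order, without touching any samples."""
    if isinstance(rope, Segment):
        return [rope]
    return segments(rope.left) + segments(rope.right)


greeting = Segment("u1", 0, 8000)
reply = Segment("u2", 0, 4000)
both = concat(greeting, reply)
middle = substr(both, 6000, 4000)
```

Here middle refers to the last 2000 samples of u1 followed by the first 2000 of u2. No operation copies voice data; each result is just a small description of which stored pieces to play.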
In any case, we will need to extend the ROPE operations to include the recording and playback of any voice rope. We will also provide operations for dealing with other available attributes of the recorded voice, primarily the one-value-per-packet voice energy information that will allow applications to roughly locate phrase boundaries and to roughly represent utterances in some visible form. Finally, we provide a method for assigning correspondences between the voice information and other related data (text, for instance). {Or do we?}
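One way an application might use the per-packet energy values to roughly locate phrase boundaries is to look for runs of quiet packets. The following Python sketch assumes the energies arrive as a plain list of numbers; the function name, the threshold, and the minimum run length are all assumptions chosen for the example, not part of the interface.

```python
from typing import List, Optional, Tuple


def find_pauses(energies: List[float], threshold: float,
                min_packets: int) -> List[Tuple[int, int]]:
    """Return (first_packet, packet_count) for each run of quiet packets
    (energy below threshold) that lasts at least min_packets."""
    pauses: List[Tuple[int, int]] = []
    run_start: Optional[int] = None
    for i, energy in enumerate(energies):
        if energy < threshold:
            if run_start is None:
                run_start = i              # a quiet run begins here
        else:
            if run_start is not None and i - run_start >= min_packets:
                pauses.append((run_start, i - run_start))
            run_start = None
    # A quiet run may extend to the end of the utterance.
    if run_start is not None and len(energies) - run_start >= min_packets:
        pauses.append((run_start, len(energies) - run_start))
    return pauses


# Invented energy values: a three-packet pause, a one-packet dip, a trailing pause.
example = [9, 8, 1, 0, 1, 7, 9, 0, 8, 0, 0, 0]
boundaries = find_pauses(example, threshold=2, min_packets=3)
```

The one-packet dip is ignored because it is shorter than min_packets; only the two longer pauses are reported as candidate phrase boundaries. The same list of energies could drive a crude visible representation of the utterance.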
The underlying implementation, unlike the existing ROPE package, will need to provide for the permanent storage of unflattened voice ropes. Flattening is infeasible because of the vast numbers of bits required to store voice. We expect Cypress to help us here, though performance may be an issue.