[Indigo]<Voice>Stewart>voiceOperatingPlan.bravo!1

We plan to explore the utilization of "voice" within our office systems, where by "voice" we mean the manipulation (conversion, packaging, transport, storage, etc.) of sections of human utterance. We plan to build a microprocessor-based digital telephone which incorporates not only the necessary A to D and D to A transducers, but also the means for sending and receiving secure voice packets via the Ethernet. This Etherphone will include a handset and keypad much like the present telephone and will be used in the usual (as well as novel) ways for interpersonal voice communication. In addition it will serve as an input/output device for other voice related functions such as voice messages, voice-annotated documents, etc.

We do not plan (at present) to try either to create artificial utterances (synthesize speech) or to construct programs which attempt to understand human speech. Rather we will simply handle segments of human speech as a raw "bit-map" to which understanding is attached only by a human listener. Although it is easy to envision valuable applications for synthesis and understanding, we believe that doing a job in these areas adequate to permit practical systems is beyond our present capabilities. Despite this, nothing that we intend to do will preclude their incorporation at a future time.

We can divide speech manipulation into two classes: first, those situations where it forms part of a conversation, and second, those cases where speech is being recorded for later listening (either by some other person or by the speaker). Sometimes the two occur simultaneously, as when a conversation is recorded, but the distinction remains intact. The distinction is important in that the two cases give rise to significantly different transmission requirements. For conversational speech, delays in transmission above a certain threshold are intolerable since they harm the interactivity. On the other hand, delays in transmitting speech for recording (or in later playback) only affect the amount of buffering required. Our goal is to handle both sorts of service.
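To make the buffering point concrete: the storage needed to absorb a given transmission delay is the voice bit rate multiplied by the delay. The sketch below assumes 64 kilobit/second voice (8000 samples per second, 8 bits per sample, the usual telephone-quality rate). That rate, the example delays, and the function name are illustrative assumptions, not figures from this plan.

```python
def buffer_bytes(delay_ms, sample_rate_hz=8000, bits_per_sample=8):
    """Bytes of buffering needed to ride out a delay of delay_ms milliseconds.

    The default rate (8000 samples/s, 8 bits each) is an assumed figure.
    """
    bits = delay_ms * sample_rate_hz * bits_per_sample // 1000
    return bits // 8


# Assumed example delays, in milliseconds.
for delay_ms in (50, 500, 2000):
    print(delay_ms, "ms ->", buffer_bytes(delay_ms), "bytes")
```

Under these assumptions even a two-second delay costs only 16000 bytes of buffer for recorded speech, whereas the same delay in a conversation would be intolerable no matter how much buffering is available.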

We can envision a substantial number of ways in which voice could be incorporated into our systems. These include such things as voice messages, voice-annotation of documents, improved functionality for the telephone including new means for placing and receiving calls, etc. We believe that only by actual experiment can one determine the merit of various applications, and so it is our intention to provide a set of basic, enabling capabilities for dealing with voice which will permit us to explore the application space.

Our goal is to replace all of CSL’s telephones with Etherphones over the next two years. These will let us try out novel forms of telephony and utilize voice as data in various systems such as Laurel. The Etherphone itself will embody only primitive capabilities; the higher-level (lower duty cycle) control protocols will be provided by some other computer on the Ethernet. Certain higher-level functions will be provided by the user’s own work station - for example Laurel-like facilities (probably, in fact, extensions of Laurel) for constructing, sending, reviewing, and listening to voice messages. However, the basic functions required to place and receive calls must continue to operate independent of the state of the work station, so for these a special Etherphone Server will be provided. Another reason for putting the basic call placement/receipt functions into an Etherphone Server (instead of the work station) is that we eventually hope to build a stand-alone (work-station-independent) Etherphone which will have its own keyboard and perhaps a one-line display and, even on its own, will provide a good deal more functionality than the traditional telephone.

A gating item in the design of the Etherphone is the requirement for a small Ethernet controller. We have gotten under way by using Alto I’s (not using the disk and display) as functional prototypes, but a realistic prototype will require a single (or dual) chip Ethernet controller. We have decided not to wait for any of the commercial chips, since their delivery dates are remote and uncertain. Instead we plan to use the Shared Line Controller chip built by the MEC. This will require a separate 1.5 megabit/second Ethernet and a gateway to connect it to our 3 megabit Ethernet, but it will allow us to proceed.

Storage of voice messages, which must be absorbed and played out in real-time, places a new set of demands on a file server. The need for extreme reliability is replaced by real-time demands, and since these are so different, we plan to build a special Voice File Server tailored to these new requirements. We envision a system in which, for example, a voice message is played out from the Voice File Server to the appropriate Etherphone, mediated by the Etherphone Server under commands issued by the user at his work station.
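The playback path just described can be sketched roughly as follows. This is only an illustration of the division of roles: all class names, method names, and the sample data are hypothetical, and the sketch leaves out real-time pacing, security, and the Ethernet itself.

```python
class Etherphone:
    """Primitive endpoint: only accepts voice packets and plays them."""

    def __init__(self, name):
        self.name = name
        self.played = []

    def receive(self, packet):
        self.played.append(packet)


class VoiceFileServer:
    """Stores each voice message as a list of packets and plays it out in order."""

    def __init__(self):
        self.messages = {}

    def store(self, message_id, packets):
        self.messages[message_id] = list(packets)

    def play_out(self, message_id, phone):
        for packet in self.messages[message_id]:
            phone.receive(packet)


class EtherphoneServer:
    """Mediator: turns a work station command into a playback session."""

    def __init__(self, file_server, phones):
        self.file_server = file_server
        self.phones = phones  # maps user name -> that user's Etherphone

    def play_message(self, user, message_id):
        self.file_server.play_out(message_id, self.phones[user])


# Work station side: a (hypothetical) user asks to hear message "m1".
phone = Etherphone("office-phone")
vfs = VoiceFileServer()
vfs.store("m1", [b"pkt0", b"pkt1", b"pkt2"])
server = EtherphoneServer(vfs, {"alice": phone})
server.play_message("alice", "m1")
print(phone.played)
```

Note that the work station only issues the command; the voice packets flow from the Voice File Server to the Etherphone without passing through the work station.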

Continuing compatibility with the existing phone system is a major issue; we must be able to continue to make and receive "outside" calls. To achieve this, we plan to provide each Etherphone with a "back-door" connection to the phone line in the office. In the long run such connection to the phone system should be provided by a special server on the Ethernet which connects to a number of trunk lines. However, were we to try to design such a component now, it would inevitably lead us into reconsideration, and perhaps redesign, of the entire PARC phone system. The simple back-door approach avoids these complex distractions.