An Experimental Environment
for Voice System Development
Daniel C. Swinehart
Douglas B. Terry
Polle T. Zellweger
Xerox Palo Alto Research Center
Abstract
The Etherphone™ system has been developed to explore methods for extending existing multi-media office environments with the facilities needed to handle the transmission, storage, and manipulation of voice. Based on a hardware architecture that uses microprocessor-controlled telephones to transmit voice over an Ethernet that also supports a voice file server and a voice synthesis server, this system has been used for applications such as directory-based call placement, call logging, call filtering, and automatic call forwarding. Voice mail, voice annotation of multi-media documents, voice editing using standard text editing techniques, and applications of synthetic voice use the Etherphones for voice transmission. Recent work has focused on the creation of a comprehensive voice system architecture, both to specify programming interfaces for custom uses of voice, and to specify the role of different system components, so that equipment from multiple vendors could be integrated to provide sophisticated voice services.
1. Introduction
Suppose Alexander Graham Bell had waited to invent the telephone until personal workstations and distributed computing networks had been invented. What approach would he have taken in introducing voice communications into the modern computing environment? It was an attempt to answer this question that led to the creation of a voice communications project within the Computer Science Laboratory at Xerox PARC.
Stated more concretely, the project's aim was to extend our existing multi-media office environment with the facilities needed to handle the transmission, storage, and manipulation of voice. We believed that it should be possible to deal with voice as easily as we can manage text, electronic mail, or images. The desired result was an integrated workstation that could satisfy nearly all of a user's communications and computing needs.
A basic requirement was to provide conventional telephone facilities (so that casual users would not have to read a manual to make a phone call), but our goals went far beyond that. We had observed that most enhanced voice communications facilities had been developed by designers of telephone systems. In contrast, we wished to draw on our experience as developers of personal information systems running on powerful workstations with graphical interfaces. We were convinced that users would prefer to perform voice management functions using the power and convenience of workstation facilities such as on-screen menus, text editors, and comprehensive informational displays.
These aims led us to explore two related research domains:
"Taming the telephone": Despite an immense investment in research and development over the last 110 years, the user interface and the functionality of the telephone still leaves much to be desired. We contend that the personal workstation, combined with a telephone system whose characteristics we can control, make it possible to better match the behavior of the office telephone with the needs of its users. There are gains to be had in the placement of calls, the handling of incoming calls, and the capabilities available to telephone attendants (that is, switchboard operators, receptionists, and secretaries).
"Taming the tape recorder": We also believe that workstation techniques for creating, editing, and storing text or images can be modified to deal with digitally-recorded voice. Application areas such as electronic mail, document annotation, and dictation are candidates for improvement. Speech synthesis and recognition devices can be added to provide translation between textual and spoken information.
These two sets of activities are clearly related. A carefully designed system can support novel applications of both live and recorded voice.
In this overview we will describe the Etherphone™ system that we have developed and used to explore ways of integrating voice into a personal information environment. The following sections sketch the present hardware architecture, describe some of the more compelling applications that have been built to exploit it, and briefly explore the software and systems issues that have surfaced.
2. Etherphone System Description
In designing our prototype voice system, we surveyed several possible hardware architectures, including using our existing Centrex service or a commercially available PABX. We concluded that the most effective way to satisfy our needs was to construct our own transmission and switching system [Swinehart et al 1983]. Ethernet local-area networks, which provide the data communications backbone supporting personal distributed computing at Xerox PARC, have proven to be an effective medium for transmitting voice as well as data. Our prototype voice system consists of the following types of components connected by Ethernets, as shown in Figure 1.
Etherphones: telephone/speakerphone instruments that include a microcomputer, encryption hardware, and an Ethernet controller. Etherphones digitize, packetize, and encrypt telephone-quality voice and send it to each other directly over an Ethernet. Etherphone software is written in C. The current environment contains approximately 50 Etherphones, which are used daily by members of the Computer Science Laboratory as their only telephone service. A connection to the standard direct-dial telephone line provides access to telephones outside the Etherphone system. Additional information on the Etherphone hardware and the Voice Transmission Protocol can be found in a previous report [Swinehart et al 1983].
Voice Control Server: a program that provides control functions similar to a conventional PABX and manages the interactions between all the other components. It runs on a dedicated server that also maintains databases for public telephone directories, Etherphone-workstation assignments, and other shared information. The Voice Control Server is programmed in the Cedar programming environment [Swinehart et al 1986].
Voice File Server: a service that can hold conversations with Etherphones in order to record or play back user utterances. In addition to managing stored voice in a special-purpose file system, the Voice File Server provides operations for combining, rearranging, and replacing parts of existing voice recordings to create new voice objects. For security reasons, voice remains encrypted when stored on the file server.
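To suggest the flavor of these operations, the following C sketch shows one plausible representation for composite voice objects: a list of intervals into immutable stored recordings, so that editing rearranges references rather than copying (or decrypting) the underlying voice samples. The types and names are illustrative only, not the Voice File Server's actual data structures.

    #include <stddef.h>

    /* Illustrative sketch: a composite voice object is a chain of intervals,
     * each referring to a span of samples within an immutable recording. */
    typedef struct VoiceInterval {
        unsigned recordingId;        /* an immutable recording on the server */
        unsigned firstSample;        /* start of this piece within it */
        unsigned sampleCount;        /* length of this piece */
        struct VoiceInterval *next;  /* next piece of the composite object */
    } VoiceInterval;

    /* Concatenation is a pointer operation; the encrypted voice data
     * stored on the file server is never touched or copied. */
    VoiceInterval *Concatenate(VoiceInterval *a, VoiceInterval *b) {
        VoiceInterval *p = a;
        if (a == NULL) return b;
        while (p->next != NULL) p = p->next;
        p->next = b;
        return a;
    }

Deleting or replacing a span of voice works the same way: the interval list is split at the span's endpoints and relinked, leaving the original recordings intact.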
Text-to-speech Server: a service that receives text strings and returns the equivalent spoken text to the user's Etherphone. Each text-to-speech server is constructed by connecting a commercial text-to-speech synthesizer to a dedicated Etherphone and is available on a first-come-first-served basis. Two speech synthesizers, purchased from different manufacturers, have been installed.
Workstations: high-performance personal computers with large bitmapped displays and mouse pointing devices. Workstations are the key to providing enhanced user interfaces and control over the voice capabilities. We rely on the extensibility of the local programming environment—be it Cedar, Interlisp, Unix, or whatever—to facilitate the integration of voice into workstation-based applications. Workstation program libraries implement the client programmer interface to the voice system.
In addition, the architecture allows for the inclusion of other specialized sources or sinks of voice, such as speech recognition equipment or music synthesizers.
All of the communication required for control in the voice system is accomplished via a remote procedure call (RPC) protocol [Birrell and Nelson 1984]. For instance, conversations are established between two or more parties (Etherphones, servers, and so on) by performing remote procedure calls to the Voice Control Server. During the course of a conversation, RPC calls emanating from the Voice Control Server inform participants about various activities concerning the conversation. Active parties in a conversation use the Voice Transmission Protocol to actually exchange voice. Multiple implementations of the RPC mechanisms permit the integration of workstation programs and voice applications programmed in different environments.
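A hypothetical client-side rendering of this arrangement in C follows; the procedure names and signatures are illustrative, not the actual Cedar RPC interfaces.

    typedef unsigned ConversationID;
    typedef enum { Initiating, Ringing, Active, Terminated } ConvState;

    /* Client stubs for remote calls to the Voice Control Server. */
    ConversationID CreateConversation(const char *callingParty);
    int            AddParty(ConversationID c, const char *calledParty);

    /* A callback registered by each participant.  The Voice Control Server
     * invokes it, also via RPC, to report state transitions and other
     * progress in the conversation. */
    void StateChanged(ConversationID c, const char *party, ConvState newState);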
3. Examples of Applications to Date
Most of our user-level applications to date have been created in the Cedar environment, although limited functions have been provided for Interlisp and for standalone Etherphones. This section describes the voice applications that are available in Cedar, including telephone management, text-to-speech synthesis, and voice annotation and editing. Figure 2 shows a typical Cedar screen using voice, text, and graphics to support programming and document preparation activities.
In order to make voice a first-class citizen of the Cedar environment, Etherphone functions are typically available in several ways: through an Etherphone control panel, through commands that can be issued in a typescript, and through procedures that can be invoked from client programs. This integration of voice capabilities will be discussed more fully in the next section.
3.1. Telephone management
The telephone management functions provide improved capabilities for placing and receiving calls. Figure 3 shows an Etherphone control window, called Finch, and a personal telephone directory window.
Users can place calls by specifying a name, a number, or other attributes of the called party. A system directory database for local Xerox employees (about 1000 entries) is stored on the Voice Control Server. Etherphone users can also create personal directories, which are consulted before the system directory to locate the desired party. A soundex search mechanism [Knuth 1973] compensates for some kinds of spelling errors.
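For concreteness, here is the classic soundex encoding in C, essentially as Knuth describes it; this rendering is illustrative rather than the directory server's actual code. Names that sound alike map to the same four-character code: "Swinehart" and "Swynehardt" both encode as S563, so either spelling finds the right directory entry.

    #include <ctype.h>

    /* Soundex code for a surname: the first letter followed by up to three
     * digits.  Vowels (and Y) separate duplicate digits; H and W do not. */
    void Soundex(const char *name, char code[5]) {
        static const char digit[] = "0123012-02245501262301-202";  /* A..Z */
        int n = 0;
        char last = '0';
        for (const char *p = name; *p != '\0' && n < 4; p++) {
            char c, d;
            if (!isalpha((unsigned char)*p)) continue;
            c = (char)toupper((unsigned char)*p);
            d = digit[c - 'A'];
            if (n == 0)        { code[n++] = c; last = d; }  /* keep 1st letter */
            else if (d == '-') continue;                     /* H, W: transparent */
            else if (d == '0') last = '0';                   /* vowel: reset */
            else if (d != last) { code[n++] = d; last = d; }
        }
        while (n < 4) code[n++] = '0';                       /* pad to 4 chars */
        code[4] = '\0';
    }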
A variety of convenient workstation dialing methods are provided. A user can fill in fields in the Finch tool, select names or numbers from anywhere on the screen, or use either of two directory tools that present browsable lists of names and associated telephone numbers as speed-dialing buttons. Calls can also be placed by name or number from the telephone keypad.
Calls are announced audibly, visually, and functionally. Each Etherphone user selects a personalized ring tune, such as a few bars of "Mary Had a Little Lamb". This tune is played at a destination Etherphone to announce calls to that user. The caller hears the same tune as a ringback confirmation. During ringing, the telephone icon jangles with a superimposed indication of the caller, as shown in the middle portion of Figure 4. An active conversation is represented by an icon showing two people conversing, with a superimposed indication of the other party (also shown in Figure 4). The system automatically fills in the Finch tool's Calling Party or Called Party field to allow easy redialing. It also creates a new entry in a conversation log. A user can consult the conversation log to discover who called while he was out of the office.
Our methods of following a user around an office building rely upon the personalized ring tunes, which allow Etherphone users to identify calls to them wherever they may be: in their own offices, within earshot, or at other Etherphones. If an Etherphone user logs in at a workstation, his calls are automatically forwarded to the adjacent Etherphone. An additional feature, called visiting, allows him to register his presence with a second workstation or Etherphone, such as during a meeting. Registering with the destination location allows users to travel more freely than forwarding calls from the home location does. Each visit request cancels any earlier requests; visiting the home location cancels visiting. The common problem of forgetting to cancel forwarding is further eased by ringing both Etherphones during visiting.
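The sketch below illustrates how the resulting ringing behavior might be derived from a user's registration state; it is a simplification for exposition, not the Voice Control Server's actual algorithm. A single visit field naturally captures the rule that each visit request cancels any earlier one.

    /* Illustrative registration state for one user.  A zero field means
     * "not set"; visiting the home location simply clears visitPhone. */
    typedef struct {
        int homePhone;       /* Etherphone in the user's own office */
        int forwardPhone;    /* set when the user logs in at a workstation */
        int visitPhone;      /* set by an explicit visit request */
    } UserLocation;

    /* Fills ring[] with the Etherphones to ring; returns how many. */
    int RingingPhones(const UserLocation *u, int ring[2]) {
        int base = u->forwardPhone ? u->forwardPhone : u->homePhone;
        ring[0] = base;
        if (u->visitPhone && u->visitPhone != base) {
            ring[1] = u->visitPhone;  /* ring both phones during visiting */
            return 2;
        }
        return 1;
    }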
3.2. Text-to-speech synthesis
A user or program can generate speech as easily as printing a message on the display by using one of the Text-to-speech Servers. A user can select text in a display window and click the Finch tool's SpeakText menu button. A program can call a procedure with the desired text as a parameter. These features are implemented by creating a "conversation" between the user's Etherphone and a Text-to-speech Server. The system sets up a connection to the Text-to-speech Server, sends the text (via RPC), returns the digitized audio signal (via the Voice Transmission Protocol), and closes the connection when the text has been spoken. A similar mechanism is used for voice recording and playback.
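From the client programmer's perspective, the facility can be as simple as a single procedure call. The sketch below is hypothetical; the actual Cedar interface differs in names and details.

    #include <stdio.h>

    /* Finds a free Text-to-speech Server, establishes a conversation between
     * it and the user's Etherphone, ships the text via RPC, and closes the
     * conversation once the speech has been played.  Nonzero on success. */
    int SpeakText(const char *text);

    /* For example, a long-running tool might announce its progress audibly. */
    void AnnounceCompletion(int errorCount) {
        char msg[64];
        sprintf(msg, "Compilation finished with %d errors.", errorCount);
        SpeakText(msg);
    }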
Our primary uses for text-to-speech so far have been in programming environment and office automation applications. Programming environment tasks have included spoken progress indicators, prompts, and error messages. Office automation applications have included proofreading (especially comparing versions of a document when one version has no electronic form, such as proofing journal galleys) and audio reminder messages generated by calendar programs.
3.3. Voice annotation and editing
This section describes the addition of a voice annotation mechanism to Tioga, the standard text-editing program in Cedar. Tioga is essentially a high-quality galley editor, supporting the creation of text documents using a variety of type faces, type styles, and paragraph formats. Tioga includes the ability to incorporate illustrations and scanned images into its documents. Tioga is the underlying editor for all textual applications in Cedar, including the electronic mail system, the system command interpreter, and other tools that require the user to enter and manipulate text. Wherever Tioga is used, all of its formatting and multi-media facilities are available. Thus, by adding voice annotation to Tioga, we have made it available to a variety of Cedar applications.
Any text character within a Tioga document can be annotated with an audio recording of arbitrary length. The user interface of the voice annotation system is designed to be lightweight and easy to use, since spontaneity in adding vocal annotations is essential. Voice within a document is shown as a distinctive shape superimposed around a character, so that the document's visual layout is unaffected. Furthermore, adding voice to a document does not alter its contents as observed by other programs (such as compilers).
To add an annotation, the user simply selects the desired character within a text window and buttons AddVoice in that window's menu. Recording begins immediately, using either a hands-free microphone or the telephone handset, and continues until the user buttons STOP. A voice annotation becomes part of the document, although the voice data physically resides on the Voice File Server. Copies of the document may be stored on shared file servers or sent directly to other users as electronic mail. To listen to voice, a user selects a region containing one or more voice icons and buttons PlayVoice.
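The sketch below suggests what such an annotation might carry; the fields are illustrative only. Because the document records just a reference, copying or mailing an annotated document never moves the voice samples themselves.

    /* Illustrative: an annotation holds a reference to voice stored on the
     * Voice File Server, not the (encrypted) voice samples themselves. */
    typedef struct {
        char     fileServer[32];   /* name of the Voice File Server */
        unsigned voiceObjectId;    /* identifies the stored voice object */
        unsigned lengthInMs;       /* lets the user interface show a duration */
    } VoiceAnnotation;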
Simple voice editing is available: users can select a voice annotation and open a window showing its basic sound-and-silence profile. Sounds from the same or other voice windows can be cut and pasted together using the same editing operations supported by the Tioga editor. A lightweight "dictation facility" that uses a record/stop/backup model can be used to record and incorporate new sounds conveniently. Editing is done largely at the phrase level (never at the phoneme level), the granularity that yields the best results for the least effort in an office setting. The dictation facility can also be used when placing voice annotations directly into documents.
Sound-and-silence profiles alone do not provide adequate contextual information for users to identify desired editing locations. Several contextual aids are provided. A playback cue moves along the voice profile during playback, indicating exactly the position of the voice being heard (see Figure 5). While playback is in progress, a user can perform edits immediately or mark locations for future edits. Simple temporary markers can be used to keep track of important boundaries during editing operations, while permanent textual markers can be used to mark significant locations within extended transcriptions. Finally, the system provides a visual indication of the voice-editing history in an editing window. Newly-added voice appears in a bright yellow color, while less-recently-added phrases become gradually darker as new editing operations occur.
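One plausible scheme for this age-based dimming is sketched below for illustration; the actual Cedar implementation may differ. Each phrase records the editing generation in which it was added, and its brightness decays with every subsequent editing operation, down to a floor that keeps even old voice visible.

    /* Brightness in [0,1] for a phrase added in generation g, viewed after
     * editing generation now.  Newest voice is brightest (yellow on screen). */
    float PhraseBrightness(int now, int g) {
        float brightness = 1.0f;
        int age = now - g;
        while (age-- > 0 && brightness > 0.2f)  /* darken per later edit */
            brightness *= 0.8f;                 /* 0.2 floor stays visible */
        return brightness;
    }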
More information about the voice annotation and editing system can be found in [Ades and Swinehart 1986].
4. Progress toward a Voice System Architecture
The original goals of the Etherphone project were to produce experimental prototypes that could "tame the telephone" or "tame the tape recorder" in novel and useful ways. As the project developed, however, a more fundamental goal emerged: to create and experimentally validate a comprehensive architecture for voice applications. The best way to explain the value of a voice architecture is to list some of the properties it should have:
Completeness. It must be able to specify the role of telephone transmission and switching, workstations, voice file servers, and other network services in supporting all the kinds of capabilities we have identified, such as telephone services and recorded voice services.
Programmability. It must permit workstation programmers to modify existing voice-related applications and to create new ones. Simple applications should be easy to write, requiring little or no detailed understanding of how the system is implemented. More elaborate applications might require a more thorough knowledge of the underlying facilities. The architecture must be designed to minimize the effect of faulty programming on the reliability and performance of the overall system. (Users of experimental software might experience program failure or reduced performance, but other users should not.)
Openness. It should define the role of each major component, so that different kinds of components could be used to provide the same functions. In this way, multiple vendors of telephone and office equipment could cooperate to provide advanced voice functions in conjunction with workstation-based applications. For example, a conventional PABX (business telephone system) could be used in place of the Etherphones to provide voice switching. Ideally, such an architecture would evolve into a standard for voice component interconnection.
The development of the Etherphone system has included an ongoing effort to define such an architecture, and to implement the system in compliance with it. Following the general methodology employed by such modern communications architectures as the ISO reference model [ISO 1981] or the Xerox Network Systems protocols [Xerox 1981], the voice architecture is expressed as a set of layers, each calling on the capabilities of the layer below it through well-defined interfaces or protocols. We have identified five distinct layers. From highest to lowest, these are the Applications layer, the Service layer, the Conversation layer, the Transmission layer, and the Physical layer.
The best way we have found to explain this organization is from the inside out, beginning with the heart of the architecture, the Conversation layer. It provides a uniform approach to the establishment and management of voice connections, or conversations, among the various services. It also provides a standard method for distributing conversation state transitions and other progress reports to the various participants in each conversation. All communications between services are mediated by Conversation layer facilities. In the Etherphone system, the functions of the Conversation layer are implemented entirely within the Voice Control Server. However, the architecture does not mandate centralized control. For example, Etherphones built with larger memories and more powerful processors could support a distributed implementation, each managing the conversations that it or its associated workstation initiated.
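The sketch below illustrates the centralized style of the current implementation: when any party requests a state change, the Conversation layer applies it and then reports it, via outbound remote procedure calls, to every registered participant. As before, the types and names are illustrative only.

    #define MAXPARTIES 8

    typedef int Party;   /* handle for an Etherphone, server, or workstation */
    typedef enum { Initiating, Ringing, Active, Terminated } ConvState;

    typedef struct {
        ConvState state;
        int       nParties;
        Party     parties[MAXPARTIES];
    } Conversation;

    extern void ReportStateChange(Party p, ConvState s);  /* outbound RPC stub */

    void ChangeState(Conversation *c, ConvState newState) {
        int i;
        c->state = newState;
        for (i = 0; i < c->nParties; i++)   /* notify every participant */
            ReportStateChange(c->parties[i], newState);
    }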
The Service layer defines the various voice-related services—such as telephone functions, voice recording and storage, voice playback, speech synthesis, and speech recognition—that form the basis for the voice applications. Each of the services must follow the uniform Conversation layer protocols in creating voice connections with other services. However, each can register with the Conversation layer additional service-specific interfaces (protocol specifications). Connections may be formed between similar services (as in a call from one telephone to another), or among different services (such as a connection from a telephone to the recording service, mediated by a workstation program). It is not expected that ordinary programmers will produce new services; the services provide both the basic user facilities and interfaces to the building blocks for higher-level applications. In the Etherphone system, some services are implemented on the server machine that contains the Voice Control Server, others on separate server machines, still others on individual workstations.
The Applications layer represents client applications that use the voice capabilities of the architecture. To establish voice connections, a client uses simplified facilities provided by a service that resides on the workstation along with the application. Client applications also negotiate with the Conversation layer to gain access to specialized interfaces provided by other services. The previous section illustrated many of the present voice applications using the Etherphone environment.
Logically below the Conversation layer is the Transmission layer. This layer comprises the methods for representing digital voice, for transmitting and switching it, and for communicating control information among the components of the system. In the Etherphone system, voice is transmitted digitally, in discrete packets, using a standard 64 kilobits/second voice representation and our own voice transmission protocol. Other transmission and switching methods could be substituted without affecting the nature of the programs operating in layers above the Conversation layer. Possibilities include synchronous digital transmission or even analog transmission. The only requirement is that these components provide interfaces that allow the implementation of Conversation layer protocols. As we have mentioned, the control protocol selected for all control communications in the system was a locally-produced remote procedure call protocol. Other remote procedure protocols or message-based protocols would work equally well.
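The 64 kilobit/second encoding corresponds to 8000 one-byte samples per second, which makes the packet arithmetic easy to check. The figures below are illustrative; in particular, the 20-millisecond packetization interval is an assumption for exposition, not the published Etherphone parameter (see [Swinehart et al 1983] for the actual packet format).

    #include <stdio.h>

    int main(void) {
        int samplesPerSecond = 8000;   /* 64 kb/s at 8 bits per sample */
        int msPerPacket = 20;          /* assumed packetization interval */
        int payload = samplesPerSecond * msPerPacket / 1000;
        printf("%d-ms packets carry %d bytes; %d packets/second each way\n",
               msPerPacket, payload, 1000 / msPerPacket);  /* 160 bytes, 50/s */
        return 0;
    }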
Finally, the Physical layer represents the actual choice of communications media for the transmission of both voice information and control (not necessarily the same media!). Besides the research Ethernet that we use (operating at 1.5 megabits/second), voice transmission could employ standard Ethernets, synchronous or token-oriented ring networks, digital PABX switches, or analog telephone switches.
Looking at the architectural layers, it becomes easy to see how our efforts differ from work being done elsewhere. Most of the current efforts to "integrate voice and data", such as those systems built around the Integrated Services Digital Network (ISDN) definitions [Decina 1982], deal only with the Transmission and Physical layers. Other systems that include voice, such as the Diamond research effort at BBN [Thomas et al 1985] and commercial voice mail services, support some specialized applications but exhibit only rudimentary Service and Conversation layers; they mostly build directly on capabilities corresponding to our Transmission layer. By contrast, we have concentrated our efforts on Conversation and Service layer specifications, and on the architecture in general.
To date, only one instance each of the Physical, Transmission, and Conversation layers has been implemented. We have used the resulting facilities extensively to produce the various Services and Applications described in the preceding sections. We have produced a relatively complete workstation service for Cedar workstations, and a preliminary implementation for Interlisp.
We are not yet fully satisfied with the architecture, particularly the interface between the Conversation and Service layers. This interface has proven somewhat clumsy to use, while at the same time restricting the range of capabilities that can readily be produced. Recent progress is encouraging, however.
5. Current and Future Directions
Our efforts to date have been first to build basic facilities for voice communication, based on the general architecture outlined in the previous section, and then to produce a few interesting applications demonstrating the unique characteristics of the architecture and the flexibility of the Etherphone system for experimenting with novel voice applications. Voice project members have built most of the applications, although a few other programmers have added telephone management or voice synthesis functions to their own applications using interfaces provided by the Service layer.
We have begun to explore a number of new directions and enhancements to the current capabilities.
We have a skeletal implementation of call filtering that provides options based on the subject, urgency, or caller's identity to decrease the intrusiveness of the telephone for the callee. Our plan to integrate telephone conversation logs into the electronic mail system should have a side benefit of making the additional filtering information natural for the caller to supply.
We are considering novel kinds of interactive voice connections, such as all-day "background" telephone calls, use of the telephone system to broadcast internal talks or meetings (as a sort of giant conference telephone call), and conference calls that allow side-conversations to take place.
We plan to use conferencing capabilities (already supported by the hardware) to incorporate text-to-speech or recorded voice into telephone calls. Possible uses for text-to-speech include reading electronic mail over the telephone to a remote lab member, as in PhoneSlave [Schmandt and Arons 1984] or MICE [Herman et al 1986], but without dedicating a synthesizer solely to this task, and playing program-generated messages to callers, such as prompts or reports of the user's location, perhaps obtained by consulting the user's calendar program ("Dr. Smith is at the Distributed Systems seminar now; please call back after 5 o'clock").
We are also exploring a novel scripting mechanism for creating viewing paths through an electronic document or set of documents [Zellweger 1986]. Built on the capabilities of the voice architecture, scripted multi-media documents can contain any combination of text, pictures, audio, and action. Scripts can be used in a wide variety of ways, such as for formal demonstrations and presentations, for informal communications, and for organizing collections of information.
Finally, we would like to extend the system to other media, such as still and real-time video, and to other workstations and architectures.
We have discovered that managing real-time and stored voice in a distributed environment presents some interesting problems in the areas of distributed systems [Terry 1986], user interface design, and voice transmission and processing technologies. We intend to continue to investigate these problems.
6. References
[Ades and Swinehart 1986] Ades, S. and Swinehart, D. Voice annotation and editing in a workstation environment, Proc. of AVIOS Voice Applications '86 conference, September 1986.
[Birrell and Nelson 1984] Birrell, A. and Nelson, B. Implementing remote procedure call. ACM Transactions on Computer Systems 2, 1, February 1984.
[Decina 1982] Decina, Maurizio. Progress towards user access arrangements in Integrated Services Digital Networks, IEEE Trans. on Communications 30, September 1982, 2117-2130.
[Herman et al 1986] Herman, G., Ordun, M., Riley, C., and Woodbury, L. The Modular Integrated Communications Environment (MICE): a system for prototyping and evaluating communications services. To appear.
[ISO 1981] International Organization for Standardization. ISO open systems interconnection—Basic reference model. ISO/TC 97/SC 16 N 719, August 1981.
[Knuth 1973] Knuth, D. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, 1973, page 392.
[Schmandt and Arons 1984] Schmandt, C. and Arons, B. Phone Slave: A Graphical Telecommunications Interface. Proc. Society for Information Display 1984 International Symposium, June 1984.
[Swinehart et al 1983] Swinehart, D., Stewart, L., and Ornstein, S. Adding voice to an office computer network. Proc. of GlobeCom 83, IEEE Communications Society Conference, November 1983.
[Swinehart et al 1986] Swinehart, D., Zellweger, P., Beach, R., and Hagmann, R. A Structural View of the Cedar Programming Environment. ACM Trans. Programming Languages and Systems 8, 4, October 1986.
[Terry 1986] Terry, D. Distributed System Support for Voice in Cedar, Proc. of Second European SIGOPS Workshop on Distributed Systems, August 1986.
[Thomas et al 1985] Thomas, R., Forsdick, H., Crowley, T., Schaaf, R., Tomlinson, R., Travers, V., Robertson, G. Diamond: A Multimedia Message System Built Upon a Distributed Architecture. IEEE Computer, December 1985, 65-77.
[Xerox 1981] Xerox Corporation. An Internetwork Architecture. XSIS 028112, Xerox Corporation, Stamford, Conn., December 1981.
[Zellweger 1986] Zellweger, P. Scripted Documents. To appear.