From "Adding Voice"

Introduction

83-8 At PARC, we and our colleagues have approached the problem of office productivity from a different perspective. Our research into personal information systems has provided considerable experience with the design, production, and use of personal computers and programs that augment or supplant traditional paper-based office activities. At the heart of this work is the personal computer, such as the Xerox 8010 Professional Workstation. The personal workstation gives its user a coherent and integrated user interface supporting the creation, modification, storage, and printing of documents, memos, messages, and personal databases. High-speed Ethernet communications link each workstation to those of other users and to other computers that provide a variety of services, such as file storage, electronic mail distribution, and high-quality printing.

83-8 Ironically, the typical office in our laboratory includes a multi-function workstation and a stand-alone telephone instrument: two isolated systems. Telephone equipment manufacturers have approached the merger of these technologies by treating office applications as extensions to their basic telephone service [Strudwick, Jud, Edwards]. With the Etherphone project, we have the opportunity to gain additional insights into the potential for voice by instead treating voice communications as an extension to our existing office systems.

Objectives: Improved Control

83-8 The twelve pushbuttons and hookswitch on a telephone, combined with a set of call progress tones from the central office, provide a reasonable user interface for placing and answering telephone calls. However, newer telephone systems have made available myriad additional features without substantially improving the user interface.
Multiple-line telephones and telephones with additional pushbuttons for common functions improve matters, but most attempts to provide more telephone capabilities become too complicated to control long before their designers run out of ideas. Many features are never used because they are too hard to remember and too hard to use.

83-8 Similar problems with the composition and editing of text and graphical documents led to the development of our workstation-based systems [Smith]. The contextual power of the high-resolution display screen, "mouse" pointing device, and menu-based software have increased the ability of the casual user to master large numbers of operations and quite complex situations. We believe that the management of interactive voice will submit to the same techniques, provided the workstations have adequate programmed access to the telephone-switching functions.

Objectives: Voice as Data

83-8 In summary, we feel that our computing and communications environment can be brought to bear on voice in two ways. One way is to improve our control over otherwise ordinary voice communications; the other is to incorporate the voice medium into our integrated environment on an equal footing with text, graphics, and data. While the facilities we have built or proposed are not entirely new, we believe that the full integration of voice into our computer-based office systems will make existing functions more effective and new applications possible.

System Architecture: basic architectural decisions

83-8 We surveyed several possible architectures for the system. Subscriber line access to a voice storage system and use of our existing Centrex service through DTMF signalling was rejected due to the low bandwidth and the inflexible control over switching. Use of a commercial PABX was attractive, but the necessary switching capabilities were not commercially available.
The use of our own computers to control the switching operations of a standard PABX was rejected primarily due to the nontechnical difficulties of arranging the necessary cooperative effort.

83-8 Although reluctant to undertake the necessary development work, we concluded that the most effective way to satisfy our needs was to construct our own transmission and switching system. We elected to use the Ethernet to transmit voice as well as control information. We chose Ethernet because of its pervasiveness in our environment and in order to demonstrate the effectiveness of Ethernet for voice.

Ethernet transmission: new functions

83-8 The broadcast nature of the Ethernet admits some new communication possibilities. Conference calls among several sites can be achieved without the need for special conference-bridge hardware: multicast techniques permit voice packets from a given site to be received and combined by each of the others [Forgie]. In the extreme, voice from a meeting or conference can be transmitted once and received by any number of listeners, without consuming any more bandwidth than a single conversation consumes.

System Components: Telephone control server (RPC)

83-8 For system control, we were able to take advantage of the recent development of remote procedure call protocols [Birrell83, Nelson]. Remote Procedure Call (RPC) is a means for transforming the message-passing semantics of packet-switching networks into the procedure-call format familiar to programmers. RPC largely relieves applications programmers from any worries about packet formats, addressing, reliable communication, and security. This makes it possible to defer decisions about the partitioning of a distributed system until very late in the design. Our task was reduced to producing an RPC implementation for the Etherphone control processor, then specifying the procedural interfaces between different system components.
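The RPC idea described above can be illustrated with a minimal sketch. It is written in Python rather than Cedar, and it is not the Etherphone RPC implementation: the JSON message format, the `Server` and `ClientStub` classes, and the `place_call` procedure are all hypothetical names chosen for illustration. The point is only that the caller sees an ordinary procedure call while the stub handles the message format.

```python
import json
from typing import Any, Callable, Dict


class Server:
    """Dispatches incoming request messages to exported procedures."""

    def __init__(self) -> None:
        self._procedures: Dict[str, Callable[..., Any]] = {}

    def export(self, name: str, procedure: Callable[..., Any]) -> None:
        """Make a procedure callable by remote clients under the given name."""
        self._procedures[name] = procedure

    def handle(self, request: bytes) -> bytes:
        # Unmarshal the request, run the procedure, marshal the reply.
        message = json.loads(request.decode("utf-8"))
        try:
            result = self._procedures[message["proc"]](*message["args"])
            reply = {"ok": True, "result": result}
        except Exception as exc:  # failures travel back to the caller as data
            reply = {"ok": False, "error": str(exc)}
        return json.dumps(reply).encode("utf-8")


class ClientStub:
    """Turns attribute access into remote calls over a byte transport."""

    def __init__(self, transport: Callable[[bytes], bytes]) -> None:
        self._transport = transport

    def __getattr__(self, name: str) -> Callable[..., Any]:
        def remote_call(*args: Any) -> Any:
            # Marshal the call into a message; the caller never sees this.
            request = json.dumps({"proc": name, "args": list(args)})
            raw_reply = self._transport(request.encode("utf-8"))
            reply = json.loads(raw_reply.decode("utf-8"))
            if not reply["ok"]:
                raise RuntimeError(reply["error"])
            return reply["result"]

        return remote_call


def place_call(caller: str, callee: str) -> str:
    """Hypothetical control-server procedure; returns a conversation name."""
    return f"conversation:{caller}->{callee}"


server = Server()
server.export("place_call", place_call)

# An in-process function call stands in for the network transport here.
stub = ClientStub(server.handle)
conversation = stub.place_call("alice", "bob")
```

Because the stub depends only on a function from request bytes to reply bytes, the same client code would work whether the server runs in the same process or on another machine, which is one way to read the remark about deferring partitioning decisions.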
March 1985 Planning Document

Introduction

Mar85 The only forms of office automation most offices have yet seen are the electric typewriter, the copier, and the telephone. Voice communication is and will remain among the most powerful tools that people use to get their work done. If Xerox is to remain a force in offices it will have to understand the role of voice and the relationship of voice to its other products and services. In CSL, apart from the intriguing systems problems that we have encountered while building voice applications, we have two major reasons for wanting to study voice communications:

Taming and recording

Mar85 Despite the immense investment in research and development over the last 110 years, the functionality of the telephone (now extended by the answering machine and its more sophisticated voice-mail siblings) still leaves quite a bit to be desired. The telephone is easy to learn but hard to use, in the sense that an enormous proportion of attempted calls are unsuccessful either because they fail to reach the intended party, or because they succeed at inopportune times. The attempts to integrate the telephone with more powerful user interfaces and with voice mail capabilities are making some progress in taming the telephone, and in exploiting the use of recorded voice in conjunction with visual documents, but all the ones we've seen lack much of the insight we have into effective ways to integrate voice with the workstation capabilities that we have begun to master.

Architecture

Mar85 Even so, the need to interact with the voice products and services of other vendors is even more compelling than we have found in our other areas of interest (those addressed by Interpress, Interscript, XNS, international mail standards, etc.). For example, it is unlikely that Xerox would be able to capture a significant percentage of the business telephone switch market, even if we knew how to or wanted to.
Having identified the functions we want our systems to provide, we need to understand how to achieve them in conjunction with competitors' business telephone systems and with the existing telephone network. This should include both means for accommodating our requirements to the existing systems, and careful advice to the industry for improving the interface to their capabilities. This ought to take the form of a new voice communications architecture (in the spirit of XNS and/or Interscript).

Architecture

Mar85 The architecture proposal needs further discussion. Architectures as diverse as Interpress and XNS have two important attributes in common:

Mar85 They describe, in exacting detail, the range of capabilities that their clients have available, the way those capabilities are structured, and exactly how they are used. XNS describes the interfaces that clients use to communicate with other applications; Interpress defines the file formats that clients need to obey in order to get documents printed. In both cases, it is possible to define precisely, as subsets of the total architectures, any restrictions that a particular implementation might place on its clients.

Mar85 They serve as detailed specifications for the implementors of the services that are needed to support their clients. The best example comes from XNS: each layer in the architecture has to be implemented by somebody, whose clients will then implement the higher levels. This is very important for voice; we want to be able to specify precisely the features that a PBX should provide, for example.

Mar85 In the voice area, one can identify some simple protocols at about the level of RS-232 and maybe SDLC, but higher-level architectures tend to be implicit in the implementations of telephone switching systems, PBXs and the like.
All the existing proposals for integrating data and voice, or integrating local area networks and PBXs, inside and outside of Xerox, are expressed at this primitive level, so they are not as interesting as they sound at first. To use a specific example, they would give little guidance for how to build an integrated Etherphone-like system using a Northern Telecom SL-1 switch and an EMS voice mail machine (the one Xerox sells now). Given an InterVoice architecture, it should be much easier to decide which pieces of it Xerox wants to buy and which pieces it wants to make. With any luck, it could be presented as a de facto standard in the spirit of XNS. We are attempting to cast the design for extensions to the Etherphone prototype in terms of such an architecture.

1985 Year-end report

Executive Summary

EY85 The CSL Voice project is an effort to define and validate experimentally a prototype architecture, for which we have coined the term "Intervoice," that can incorporate live and recorded voice into office systems. The methodology is to produce working, extensible prototypes, primarily in Cedar.

Introduction

EY85 Voice is a vital form of interpersonal communication in the office. As a non-interactive medium (in messages and as annotations to text documents) it has yet to be fully assessed, but it appears to have considerable value. (The August 5, 1985 issue of the "Seybold Report on Office Systems" says: "Although voice applications have been relatively slow to take off in the United States, we suspect that lack of integration is the villain. Once voice functions are truly intertwined with text and data applications in an organization's information systems, they will become an irresistible and compelling adjunct to any office.") The appearance of integrated voice capabilities in Xerox workstation products is overdue.
EY85 The CSL Voice project is an effort to define and validate experimentally a prototype architecture, for which we have coined the term "Intervoice," that can incorporate live and recorded voice into office systems. Intervoice must be able to specify the role of telephone transmission and switching, workstations, voice file servers, and other network services in supporting voice communications, both real-time telephone calls and recorded voice as it is stored, manipulated, and experienced in documents. It must be an open architecture, permitting programmers to create new voice-related applications and modify existing ones without having to understand the detailed workings of the voice system and without endangering its normal operation. Ideally, this architecture could evolve into a standard defining the role of each major component, so that multiple vendors could cooperate to provide advanced voice functions in conjunction with our workstations.

EY85 The methodology is to produce working, extensible prototypes, primarily in the Cedar programming environment; to demonstrate the present state of the architecture; and to serve as an experimental base for improving it. The present experimental environment includes:

EY85 Etherphones: telephone/speakerphone instruments that communicate digitized voice and control information over the Ethernet. Each includes a microcomputer and encryption hardware. Approximately forty Etherphones are in daily use within CSL.

EY85 Telephone service: a control program that provides PBX functions and manages the communications of all the other components. It runs on a dedicated server. This service and workstation program libraries implement the client programmer interface to the Intervoice architecture.

EY85 Voice file service: a service that can connect to Etherphones in lieu of or in addition to other conversants. Supports digital recording, playback, and flexible editing of user utterances.
EY85 Synthesized voice service: a recently-added facility described in more detail below.

EY85 Workstation programs: provide enhanced user interfaces and control over the voice capabilities. The Cedar version is called Finch, and has been in use for some time.

Doug's Component List from 19-Sep-86

Introduction

DBT The goals of the Cedar Voice project include "taming the telephone" and being able to treat voice as data. Our approach has been to build an experimental prototype voice system to investigate ways of integrating voice into a distributed workstation environment.

System Components

DBT The prototype voice system consists of a collection of workstations, Etherphones, and servers connected by Ethernets.

Etherphones

DBT Etherphones are a user's audio interface to the system. They can digitize, packetize, and encrypt voice and send it directly over an Ethernet.

Workstations

DBT Workstations provide user-friendly interfaces to the voice system as well as program access to stored voice and telephone features.

Voice Control Server

DBT A Voice Control Server manages conversations between sources and sinks of voice and controls Etherphone functions. It also maintains databases for white-page directories, Etherphone-workstation assignments, etc.

Voice File Server

DBT A Voice File Server manages recorded voice.

Protocols and Operations

DBT A special voice transmission protocol is used for sending voice over an Ethernet. All control messages and operations performed on specialized servers, such as the Voice File Server or Text-to-speech Server, are transmitted using a remote procedure call protocol.

Voice Transmission

DBT Voice is packetized into 20 msec segments and sent at 50 packets per second (except that silent packets are not actually transmitted). We conservatively expect to be able to support approximately 225 telephones on a voice-only 10 Mbit Ethernet.
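The packet figures above can be checked with a little arithmetic. The sketch below takes the 20 msec packet interval and the 10 Mbit Ethernet from the text; the 64 kbit/s voice coding rate and the 30-byte per-packet header are assumptions introduced only for illustration, since the text states neither. It does not attempt to reproduce the 225-telephone estimate, which depends on silence suppression and usage assumptions that the text does not give.

```python
# From the text.
PACKET_INTERVAL_MS = 20               # 20 msec of voice per packet
ETHERNET_BITS_PER_S = 10_000_000      # 10 Mbit Ethernet

# Assumptions, not from the text.
VOICE_BITS_PER_S = 64_000             # assumed 8 kHz, 8-bit voice coding
HEADER_BYTES = 30                     # hypothetical per-packet overhead


def stream_bits_per_second(voice_bps: int, interval_ms: int,
                           header_bytes: int) -> int:
    """Bit rate of one continuously talking one-way voice stream."""
    packets_per_s = 1000 // interval_ms
    payload_bytes = voice_bps * interval_ms // 1000 // 8
    return (payload_bytes + header_bytes) * 8 * packets_per_s


packets_per_second = 1000 // PACKET_INTERVAL_MS
payload_bytes = VOICE_BITS_PER_S * PACKET_INTERVAL_MS // 1000 // 8
stream_rate = stream_bits_per_second(
    VOICE_BITS_PER_S, PACKET_INTERVAL_MS, HEADER_BYTES)

# Upper bound on simultaneous one-way streams if the Ethernet could be
# driven at 100% utilization with no silence suppression.
max_streams = ETHERNET_BITS_PER_S // stream_rate

print(packets_per_second, payload_bytes, stream_rate, max_streams)
```

Under these assumptions each packet carries 160 bytes of voice, and one continuously talking stream uses 76,000 bits per second. Changing `HEADER_BYTES` or `VOICE_BITS_PER_S` shows how sensitive the capacity bound is to those choices.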
Conversation Establishment

DBT Conversations are established between two or more parties by performing remote procedure calls to the Voice Control Server. Each party autonomously advances its state in the conversation. Active parties (that are voice terminals) use the Voice Transmission Protocol to actually exchange voice.

Service Reports

DBT During the course of a conversation, participants may asynchronously receive reports about various activities concerning the conversation, such as the availability of a new encryption key, the fact that the Voice File Server has started recording, etc.

Voice Recording and Playback

DBT Recording and playback of stored voice is accomplished by establishing a conversation between an Etherphone and the Voice File Server.

Voice Editing

DBT Editing operations on recorded voice are similar to those for manipulating text strings. Editing produces new voice objects that are stored in a database and reference voice files.

Text-to-speech Services

DBT A text-to-speech server receives text strings and returns the spoken text to an Etherphone using the Voice Transmission Protocol.

Applications

Call Placement and Telephone Directories

DBT Calls can be placed from a workstation using a telephone directory viewer or dialed by name from the telephone keypad.

Voice Mail and Annotated Documents

DBT Voice is sent in electronic mail messages by reference. This is a specific case of the more general techniques for annotating documents with voice.

Voice Editing Interface/Dictation Machine

DBT Voice viewers on a workstation display a visual representation of stored voice. Users can edit voice by rearranging parts of the existing voice or recording new voice passages to be inserted in selected places.

Animated Scripts

DBT The voice system facilities have also been used to give a running audio narration for documents.
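The Voice Editing description above, in which string-like edits produce new voice objects that reference voice files, can be sketched as follows. This is an illustrative Python model and not the Voice File Server's actual data structure: the `Interval` and `VoiceObject` classes, the field names, and the sample-based offsets are all assumptions. The sketch shows one way that edits can create new objects while the recorded files themselves stay untouched.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Interval:
    file_id: str   # hypothetical identifier of a recorded voice file
    start: int     # offset into that file, in samples
    length: int    # number of samples


@dataclass(frozen=True)
class VoiceObject:
    """An immutable sequence of references into voice files."""
    intervals: Tuple[Interval, ...] = ()

    def __len__(self) -> int:
        return sum(interval.length for interval in self.intervals)

    def concat(self, other: "VoiceObject") -> "VoiceObject":
        return VoiceObject(self.intervals + other.intervals)

    def substring(self, start: int, length: int) -> "VoiceObject":
        if start < 0 or length < 0 or start + length > len(self):
            raise ValueError("range outside voice object")
        end = start + length
        pieces = []
        position = 0  # offset of the current interval within this object
        for interval in self.intervals:
            lo = max(start, position)
            hi = min(end, position + interval.length)
            if lo < hi:  # the requested range overlaps this interval
                pieces.append(Interval(interval.file_id,
                                       interval.start + (lo - position),
                                       hi - lo))
            position += interval.length
        return VoiceObject(tuple(pieces))

    def replace(self, start: int, length: int,
                new: "VoiceObject") -> "VoiceObject":
        """Return a new object with [start, start+length) replaced by new."""
        head = self.substring(0, start)
        tail = self.substring(start + length, len(self) - start - length)
        return head.concat(new).concat(tail)


greeting = VoiceObject((Interval("file-A", 0, 1000),))
correction = VoiceObject((Interval("file-B", 0, 300),))

# Replace samples 400..599 of the greeting with the newly recorded passage.
edited = greeting.replace(400, 200, correction)
```

An insertion is a `replace` with a length of zero, and a deletion is a `replace` with an empty `VoiceObject`. Because every operation returns a new object, earlier versions such as `greeting` remain valid for any document or message that still refers to them.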
Summary and Status

DBT We have discovered that managing voice in a distributed environment presents some interesting problems, which are pertinent to other media such as images or video.

DBT The voice system with approximately 50 Etherphones is in everyday use by members of CSL.

References

DBT We should reference the "Adding Voice..." paper, Dan and Stephen's paper on editing, perhaps Polle's new CHI paper, perhaps Doug's distributed systems workshop paper, etc.

Voice Annotation and Editing ... (Ades&Swinehart)

Ades

Ades There has been a trend in office systems for some time toward a single computer-based system that can satisfy all of a user's working needs. With a trend also away from large time-shared systems and toward personal computers there has evolved the concept of the workstation, a powerful personal computer that performs a wide range of functions. These may include text composition, document preparation and typesetting, support for program development, spreadsheets and accounting facilities, and general-purpose local computing as well as access to remote computers. Today's workstation is usually equipped with a high-resolution display, a keyboard, a mouse or other pointing device, a network connection, and possibly a high-capacity local disk. The workstation environment of the near future will also include scanning equipment and document architectures that can describe scanned images [4], thereby taking over the role of the facsimile machine.

Ades The office telephone and the workstation together satisfy nearly all of a user's communications needs. It is therefore useful to consider combining them into one system. This affords two major possibilities:

• Ades Modern office telephone systems offer all kinds of facilities, such as call forwarding, dialing by name or abbreviation, and three-way conferencing.
These are invariably cumbersome to use, hard to learn, and hard to remember, since the user interface is limited to a twelve digit keypad, some progress tones, and perhaps a few special-purpose buttons and lights. We believe that the user would prefer to perform these functions using the power and convenience of such workstation facilities as on-screen menus, text editors, and comprehensive informational displays. The user should be able to master more of the capabilities of the telephone system and to accomplish telephone-related tasks more quickly and easily.

• Ades We are beginning to see document architectures integrating text, graphics, scanned images, and other visual media [3, 4, 18]. Integration of voice (or rather of sound in general) into documents promises a wide range of new applications, from voice mail to documents containing simple vocal annotations and even to documents whose self-contained `scripts' can generate automatic audio-visual presentations.

Extracts.tioga Swinehart, September 30, 1986 8:26:54 am PDT

DBT The facilities of the voice system have been used to integrate voice into our environment in various ways.