Inter-Office Memorandum

To: Voice Project
Date: July 21, 1983 3:12 pm
From: Stewart
Location: PARC
Subject: GlobeComOutline.tioga
Organization: CSL

XEROX

Adding Voice to an Office Computer Network

Abstract: The architecture and initial implementation of an Ethernet based telephone system are described.

Introduction

Voice, in meetings and in conversation, has always played a major role in the conduct of business in the office. Since the development of the telephone, the role of voice in the office has reached major proportions. Yet in spite of this role, extraordinarily little effort has been made to improve matters. With the exception of direct dialing, the telephone has not changed in any fundamental way since 1876. We feel that advances in computer and communication technology (office automation) over the past decade can be brought to bear on voice. This paper describes the architecture and initial implementation of Etherphone, a project within the Computer Science Laboratory of the Xerox Palo Alto Research Center.

(DCS) Other people trying to solve this problem too. Failing? If truth be known, concentrating more on saving money than improving things?

This should say what sections follow.

Goals and Motivation

We feel that computers and communications can be brought to bear on voice in two ways. One thrust is aimed at improving our control over otherwise ordinary voice communications. A second attack is aimed at the integration of voice as a medium for the office fully equal with text, graphics, and data.

The reader will observe that we have more to say about the telephone than about systems incorporating voice in cooperation with text and graphics. Everyone knows what is wrong with the telephone, but in truth, we do not yet know what functionality people will find useful and elegant. We do not necessarily advocate all of the features we can envision.
The provision of plain-old-telephone-service is a complex enterprise; we may be able to do a better or less expensive job of it than has been done, but that alone was not sufficient reason for us to undertake a large scale project. We can establish a set of capabilities which are clearly needed for work in this area, but we expect that the design of particular functionality (integrated systems) will be a continuing challenge long after a sufficient number of capabilities have been built.

Most of our proposed improvements in ``telephone'' service have to do with getting phone participants more smoothly into direct voice communication or with circumventing or supplementing the need for such direct communication through voice messages. We argue that the telephone provides reasonable communication once you are talking to the other person. It is in the process of negotiating the establishment of a conversation between the participants that the phone system is most inadequate and annoying. For principals, a personal secretary mediates both incoming and outgoing calls and thereby takes over most of the burden. We hope that we can provide some of this same relief for others. We also believe that voice in messages and documents will find application far beyond simple telephony.

We must be careful, because the conventions of telephone use are deeply established. Although people are often annoyed by the present arrangements, they also tend to be quite conservative and to resent changes in the system. Our proposals could cause substantial shifts of burden between callee and caller -- as will become evident below.

Better placement and receipt of calls -- Caller

One does not know and cannot find the desired number. The number is misdialled, the line is busy or doesn't answer. The call may be answered by an inflexible answering machine. These problems occur in subtle combination! The call is not answered, but it was the wrong number anyway.
Most of the difficulties of finding the right number can be solved by placing calls by name rather than number and by use of appropriate data bases (e.g. white pages, yellow pages). These matters have little to do with transmission, but we can smooth the interface of data base systems and telephone switching. We can also provide ``call placement'' services at a level well above simple number translation and automatic dialling. We could construct a calendar system incorporating a schedule of calls, and note taking and time accounting facilities. A lawyer might be interested in such a system.

Better placement and receipt of calls -- Callee

People are often busy with tasks more important than handling a particular call. One often wishes to know who is calling, what the call is about, and how long it will take. We live in a social environment in which answering the telephone has unduly high priority. Answering machines seem too impersonal and the receptionist too remote. In essence, we would like to provide everyone with the same services provided to those who have personal secretaries.

For calls arriving from inside the system, one can know who is calling. We can develop ways of letting the caller indicate the relative importance of calls and the general subject area. We can provide a call filter, instructions to one's telephone of the flavor: ``No calls for half an hour, except if it's Bob or if Fred calls about the budget.''

It is clear that new burdens will fall upon the caller, if only to judge the importance of calls. One is used to barriers when one makes a call to or receives one from a person with a secretary, but otherwise, as a caller, one is used to getting through and may resent buffering by a non-human mechanism.
Although we can try to soften the resentment by making the mechanisms as polite as possible, there seems no way to get around the fact that we are redressing an age-old imbalance -- the caller has traditionally had the ability to interrupt the callee.

Better Human-assisted call management

What happens when one is out? The situation is not too bad for a person calling into a good manual attendant system. However, in more ``advanced'' systems -- which are usually bought for reasons of cost-reduction rather than improved convenience -- the attendant doesn't know who is being called, the phone rings for a long time before anyone answers, and one is always put on hold. Messages don't always get through. If they do, they are usually late.

We can implement attendant features as extensions to the standard telephone/workstation functions. We will identify the original callee, by name as well as number. We can use a well designed user interface to indicate the status of multiple calls. Potential callees will be able to leave general or individual messages for potential callers: ``If Fred calls, tell him I'm at home.'' Because attendant facilities can be implemented with a standard telephone/workstation setup, the attendant's duties can be easily transferred (including calls in progress!) or shared. Messages for callees will be voice messages instead of handwritten notes.

We should add that we have not necessarily invented many new features. Modern PABX systems can handle these kinds of functions, but often the facilities are only available at the main console and are not well integrated with the office environment.

Voice Messages

Answering machines, which today are relatively inexpensive and quite powerful, ought to solve many of the problems that we have identified. They can solve the fundamental problem of communication when both parties are not available at the same time.
But in practice, especially in an internal office situation, anyone who has encountered an answering machine instead of the intended callee knows how unsuitable a conversant such a device is; it is one of the more intimidating of modern inventions. Its non-interactive nature essentially demands a well-constructed, composed message rather than an informal communication, and few of us are able to produce such a composition in real time.

Digital recording, editing, filing, dissemination, and playback of segments of voice permit us both to construct more reasonable versions of existing facilities and to construct entirely new facilities. For users within the system, the caller, not the callee, will make the decision to leave a recorded message. For outside callers, we envision providing facilities that will allow human attendants to handle incoming calls better and more efficiently than do current systems.

Voice as Data

Fully integrated systems, mixing text, voice, and imaginal material on an equal footing, will present both unforeseen opportunities and unforeseen problems. One possibility is ``document-level telephony,'' wherein two or more participants use their workstations and their telephones to interactively collaborate on the form and content of a document. Both participants see the same display and discuss their work over a conference call. This area is not strictly a voice application, since no special voice capabilities are required. It is just one of the many possibilities for systems combining the capabilities of computers, large displays, and digital and voice communications.

Notable achievements. Production use of Ethernet for interactive voice [cite O'Leary, Batnet, etc.]

(DCS) We are in a strong if not unique position to explore the integration aspects because of our already rich personal information diddlydah environment. In the next section the environment is briefly described.
PARC Environment

Rather than the traditional picture of a central computing installation surrounded by communications to users, the PARC environment has as its heart the communication internet, with computing power placed near the user. The environment consists primarily of a collection of personal workstations interconnected by a high bandwidth local area network. In addition, resources such as file servers, mail servers, and high quality laser printers are connected to the network and shared among users. Communications outside the local area are provided primarily by leased lines of up to 56 kbps. (Installation of the first 1.5 Mbps link is due in November 1983.)

Several types of personal workstations are in use, including the Xerox 1100, 1108, and 1132 (Dorado) processors. These same processors are used as the controllers for most servers. Personal workstations in our environment have a high resolution bitmap display (1000 by 800 pixels), a keyboard, and a mouse. Within our laboratory most users work within the Cedar programming environment, which is a multiwindow frob.

Check other recent papers for the schema. Cite Pup paper, Alto paper, Dorado paper... Figure out how to cite Grapevine paper as example of distributed system.

Brief discussion of Etherphone architecture

Given our goals of improved telephone functionality and integrated voice, we established a list of required capabilities for our system. Fundamentally, we must have rapid and versatile control over the voice transmission and switching system and we must have a high performance voice filing system. Together with the facilities of our existing environment for high quality user interfaces, distributed systems, and shared data, these capabilities are all that are required for the development of integrated voice systems.

We next surveyed three possible architectures for the system.
In-band control of our existing Centrex service, together with a dial-in voice storage system, was rejected due to the low-bandwidth and inflexible control over switching. Use of our own computers to control a commercial PABX was rejected primarily due to the non-technical difficulties of arranging the necessary cooperative effort. We concluded that the most effective way to begin adding voice to our systems was to construct our own transmission and switching system using packet voice transmitted over the Ethernet.

The Etherphone system transmits voice and control information over the Ethernet rather than over conventional phone wires. For terminal equipment, we have designed the Etherphone as a stand alone Ethernet peripheral without much local intelligence. Its job is digitization and transmission of voice over the Ethernet. The Etherphone Server is the system controller. Its responsibilities include monitoring the state of the system, keeping track of the state of each Etherphone, and setting up all connections. Software in existing users' workstations provides an enhanced user interface. The Voice File Server is a general purpose computer with high capacity disks. It performs more or less standard file server functions, but is specialized for the real-time needs of telephony. A gateway function connects the Etherphone system to the public switched telephone network. Existing standard file servers and data base services are used for storage of white and yellow page information and for storage of users' call filters and other information.

Ethernet Voice

The fundamental transmission requirements for interactive voice are voice quality and transmission delay. In digital voice communications, voice quality is mostly a matter of allocating sufficient bandwidth. We have chosen to use 64 Kbps mu-255 companded PCM coding.

In a packet voice system, delay is composed of two components, packetization delay and transmission delay.
Packetization delay arises because the first sample of a packet cannot be transmitted until the last sample has been digitized. The delay introduced is equal to the packet size in samples divided by the sampling rate (8 kHz in our case). Transmission delay is in turn composed of network access delays and software delays. For an Ethernet that is not overloaded, the network access delay is very small.

A CSMA/CD network such as Ethernet exhibits complex relationships among packet size, offered load, and access delay [Metcalfe, Gonsalves, O'Leary, etc.]. Generally, there are very few collisions and delays are very small for offered loads below some threshold. Above the threshold delays climb rapidly, approaching infinity as the network is saturated. For large packets suitable for data transmission, the threshold is well above 90% utilization. For packets of suitable size for interactive voice, thresholds appear to range from 50 - 80% utilization. One would choose to engineer an Ethernet voice system to operate below the knee of this delay curve.

As an example, consider the use of voice packets containing 160 eight bit voice samples (20 milliseconds) on the 10 Mbps standard Ethernet. Consider further that each packet includes 30 bytes of overhead - consistent with our present protocols - and that control traffic is negligible compared to voice traffic. Each (one-way) voice connection would then consume 76 Kbps. Assuming a worst case of 50% utilization for low access delay, the bandwidth available is 5 Mbps which would support 65 voice streams or 33 full-duplex "trunks." Using a TASI advantage of 1.6 for this size trunk group [ATT], a 10 Mbps Ethernet would support about 52 conversations. If at most 20% of telephones are in use at once, such a network could support in excess of 250 subscribers.
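The arithmetic of this example can be checked with a short calculation. The Python sketch below is illustrative only and is not part of the Etherphone software; all function and constant names are hypothetical. Rounding 32.9 trunks to 33 is an assumption made to match the figures quoted above, and packetization delay is taken as packet size divided by sampling rate.

```python
# Illustrative capacity estimate for packet voice on a 10 Mbps Ethernet.
# All names are hypothetical; the numbers are those quoted in the text.

SAMPLE_RATE_HZ = 8000          # 8 kHz sampling, one byte per sample
SAMPLES_PER_PACKET = 160       # voice samples carried in each packet
OVERHEAD_BYTES = 30            # addressing and control overhead per packet
ETHERNET_BPS = 10_000_000      # standard 10 Mbps Ethernet
UTILIZATION_PERCENT = 50       # worst-case load that keeps access delay low
TASI_ADVANTAGE_TENTHS = 16     # TASI advantage of 1.6, kept as an integer
PERCENT_IN_USE = 20            # share of telephones in use at once


def voice_capacity():
    """Return the intermediate and final figures of the capacity estimate."""
    packets_per_second = SAMPLE_RATE_HZ // SAMPLES_PER_PACKET
    packetization_ms = 1000 * SAMPLES_PER_PACKET // SAMPLE_RATE_HZ
    bits_per_packet = (SAMPLES_PER_PACKET + OVERHEAD_BYTES) * 8
    stream_bps = bits_per_packet * packets_per_second
    usable_bps = ETHERNET_BPS * UTILIZATION_PERCENT // 100
    streams = usable_bps // stream_bps
    # Assumption: 32.9 full-duplex trunks is rounded to 33, as in the text.
    trunks = round(usable_bps / (2 * stream_bps))
    conversations = trunks * TASI_ADVANTAGE_TENTHS // 10
    subscribers = conversations * 100 // PERCENT_IN_USE
    return {
        "packets_per_second": packets_per_second,
        "packetization_ms": packetization_ms,
        "stream_bps": stream_bps,
        "streams": streams,
        "trunks": trunks,
        "conversations": conversations,
        "subscribers": subscribers,
    }


if __name__ == "__main__":
    for name, value in voice_capacity().items():
        print(f"{name}: {value}")
```

Changing `SAMPLES_PER_PACKET` shows the trade-off discussed above: smaller packets reduce packetization delay but spend a larger share of the bandwidth on the fixed 30 byte overhead.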
While we have not yet built a system of this size, it seems clear that Ethernet voice transmission is feasible for sizeable installations even without consideration of multiple Ethernet cables linked by packet gateways or by conventional circuit switches.

Voice protocols

For interactive voice, we have chosen to transmit fifty packets per second, each containing 160 voice samples and 30 bytes of addressing and control overhead. The system delay budget allows for 40 milliseconds end-to-end. This delay consists of a packetization delay of 20 milliseconds, hardware latencies of 5 milliseconds for encryption, Ethernet transmission, and decryption, software delays of 5 milliseconds, and an anti-jitter delay of 10 milliseconds. Anti-jitter delay is buffering introduced at the receiving station to allow for variations in the arrival times of future packets. One might say we operate with about one-half packet of buffering. This delay budget does not allow time for retransmissions in the event of lost packets. We have found that the native packet loss rate of well-designed Ethernet components is less than one packet in two million.

Recording and playback of stored voice has slightly different characteristics. Transmission delay is not particularly important, provided that the startup delay from request to playback is reasonably short. We have chosen to use a protocol with about 100 milliseconds of buffering and a retransmission capability.

One of the most intriguing problems of voice protocol design is the matter of clock synchronization. Each Etherphone has its own crystal controlled clock. A pair of Etherphones may have clocks differing by as much as one part in 10,000. In the steady state, this frequency error causes the quantity of buffered voice at the receiving Etherphone to slowly increase or decrease. Since we use silence detection (TASI) to reduce transmission bandwidth, we are able to re-establish the correct buffer depth during a silence interval.
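To give a feel for the magnitudes involved, the following Python sketch adds up the delay budget and estimates how quickly a clock mismatch of one part in 10,000 would consume the anti-jitter buffer. It is a back-of-the-envelope illustration with hypothetical names, not a description of the Etherphone software. It assumes the buffer starts at its nominal 10 millisecond depth and ignores network jitter.

```python
from fractions import Fraction

# Illustrative figures taken from the text; all names are hypothetical.
SAMPLE_RATE_HZ = 8000
DELAY_BUDGET_MS = {
    "packetization": 20,
    "hardware": 5,      # encryption, Ethernet transmission, decryption
    "software": 5,
    "anti_jitter": 10,
}
CLOCK_ERROR = Fraction(1, 10_000)  # worst-case mismatch between two clocks


def end_to_end_delay_ms():
    """Sum the components of the interactive-voice delay budget."""
    return sum(DELAY_BUDGET_MS.values())


def buffering_in_packets():
    """Anti-jitter buffering expressed as a fraction of one packet."""
    return Fraction(DELAY_BUDGET_MS["anti_jitter"],
                    DELAY_BUDGET_MS["packetization"])


def drift_samples_per_second():
    """Samples gained or lost per second at the receiver."""
    return SAMPLE_RATE_HZ * CLOCK_ERROR


def seconds_until_buffer_exhausted():
    """Time for the drift to use up the whole anti-jitter buffer.

    Assumes the buffer starts at its nominal depth and no jitter.
    """
    buffer_samples = SAMPLE_RATE_HZ * DELAY_BUDGET_MS["anti_jitter"] // 1000
    return buffer_samples / drift_samples_per_second()


if __name__ == "__main__":
    print("end-to-end delay (ms):", end_to_end_delay_ms())
    print("buffering (packets):", buffering_in_packets())
    print("drift (samples/s):", float(drift_samples_per_second()))
    print("buffer exhausted after (s):",
          float(seconds_until_buffer_exhausted()))
```

Under these assumptions the drift is under one sample per second, so resetting the buffer depth at silence intervals that occur every minute or so would be ample.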
Communications with the Voice File Server avoid this problem by use of a software delay-locked loop - the file server is driven by the Etherphone clock.

Because the Ethernet is inherently a broadcast medium, we have elected to provide security by encrypting all voice and control communications traffic using the Data Encryption Standard (DES). The availability of single chip DES devices operating at 900 Kbps makes this possible. Key distribution is accomplished using a trusted authentication server [Needham & Schroeder].

The broadcast nature of the transmission medium also admits some new communication possibilities. Conference calls among several Etherphones are easily achieved by multicast. Voice packets from a given Etherphone are received by each of the others. The effect is that of a distributed conference bridge. Another possibility is broadcast. A meeting or conference can be transmitted once and received by any number of interested listeners at the cost of a single conversation.

Control protocols

For transmission, we were forced to develop new protocols because of the special requirements of voice. For system control, we were able to take advantage of the recent development of remote procedure call protocols [Nelson]. Remote Procedure Call is a means for transforming the message passing semantics of packet nets into a procedure call form familiar to programmers. RPC largely relieves applications programmers from any worries about packet formats, addressing, reliable communication, and security. Our task was reduced to specifying the procedural interfaces between different system components. RPC even makes it possible to defer decisions about the partitioning of a distributed system until very late in the design.

More detailed architectural description

Given the existing PARC environment we found it necessary to design only the Etherphone itself and the trunk server connection to the public switched network.
Each of the other system components was built by writing software for existing general purpose computers connected to the network. Indeed, one of our goals was to take advantage of the users' personal workstations to provide a user interface for the telephone system.

We have chosen not to implement the trunk server at this time. Rather, we have retained users' existing subscriber lines and added hardware to each Etherphone permitting it to attach to the existing subscriber line as well as the Ethernet. Inside calls are handled by Ethernet transmission while outside calls (usually) traverse the subscriber line. One way to view this decision is as a distributed trunk server! As well as saving substantial hardware development time, this decision permitted direct use of the subscriber line as a fallback position. We have too much experience in large software systems to let our phone service depend on one early in development!

In the next several sections we describe the telephone hardware and software (Lark), and the software for the Etherphone Server (Thrush), the Voice File Server (Bluejay), and the preliminary workstation user interface program (Finch).

Lark

The Etherphone is a general purpose computer with interfaces for the Ethernet, DES encryption, RS-232, and a microprocessor based voice peripheral. The hardware is divided into an analog board and a digital board packaged with a power supply in a convection cooled cabinet about 12" x 13" x 5". This package is designed to sit under a user's desk, while the telephone set, microphone, and speaker occupy positions of convenience.

The analog board is centered around an 8 by 8 analog crossbar switch interconnecting various voice sources and sinks: telephone set, telephone line, CODECs, DTMF decoder, microphone and speaker, and line level inputs and outputs for external voice devices.
The telephone set and telephone line interfaces are connected by relays so that a power failure or system crash will restore standard telephone service.

The digital board contains the main and slave CPUs, memory systems, timing logic, and digital interfaces for the Ethernet, RS-232, DES encryption, digital voice, and control of the analog hardware. A watchdog timer is present for improved reliability. The main CPU is an 8 MHz Intel 8088 with 8K of EPROM and 56K of RAM. The slave processor, which is dedicated to voice signal processing, has access to 8K of private EPROM, 2K of RAM, and shared access to 48K of main memory. DES encryption is accomplished by a 900 Kbps single chip device with DMA access to memory. The Ethernet controller is an internal Xerox LSI part using the protocols of the 3 Mbps "Experimental" Ethernet, but operating at 1.5 Mbps. The RS-232 interface is intended for connection of a local display and keyboard for those situations in which no workstation is located nearby.

The slave processor executes a small, carefully coded assembly language program. Its functions include silence detection, conference bridge, echo suppression, gain control, CODEC I/O, and low level buffer management in shared memory. It might be considered firmware because of its close cooperation with the hardware.

The main processor is programmed primarily in the C language, with a small amount of assembler in low level parts of device drivers. The software includes a round-robin scheduled multitask operating system, the network communications package, the voice protocol package, and the remote procedure call package. In addition there is the "applications" program which implements a remote procedure call interface permitting high level command and status communications with the Etherphone Server.

The overall performance of the Etherphone is sufficient to support either participation in a four party conference call or two simultaneous Ethernet conversations.
The latter case arises if the telephone set is in use for an Ethernet call and it is then desired to forward the outside line across the network to an attendant.

Thrush

Traditional telephone switching. Intelligence for stand-alone Etherphones. Mediation of control among the Etherphone keypad and one or more workstation applications programs. Database access for information associating network addresses of Etherphones with office locations, nearby workstations, outside phone number, and individuals. Database access for user options (ring level etc.).

Thrush includes some maintenance facilities not directly connected with system operation. For example, Thrush contains software which detects the power-up or failure of an Etherphone and which permits automatic downloading and remote debugging of Etherphone software. Through use of another database, Thrush can support the simultaneous operation of different model Etherphones or Etherphones with different software and capabilities.

Bluejay

Bluejay, the Voice File Server program, runs on a Xerox Dorado (1132) equipped with a 300 MB disk. Voice storage is done at 8000 bytes per second, giving a storage capacity of over 8 hours. As we gain experience with stored voice in the office environment, we may need to expand capacity by adding disks.

The disk is organized so that storage is allocated in one second quantities. This permits disk activity on behalf of a single record or playback operation to be limited to one disk transfer per second. Nevertheless, user software can specify the order and duration of voice segment access at a grain of one millisecond. This facility makes it possible to experiment with voice editing.

In addition to disk management, the very high network communication loads presented by multiple voice protocol connections posed a special challenge. The present implementation is capable of handling about eight simultaneous transfers.
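The relationship between one second allocation units and millisecond-grain access can be illustrated with a small sketch. The Python below is a hypothetical model, not Bluejay's actual disk layout or interface: it assumes that a recording occupies consecutively numbered one second blocks of 8000 bytes each, and it maps a request given in milliseconds onto reads of (block, byte offset, byte length).

```python
# Hypothetical model of millisecond-grain access to voice stored in
# one second blocks; names and layout are assumptions for illustration.

BYTES_PER_SECOND = 8000                    # storage rate given in the text
BYTES_PER_MS = BYTES_PER_SECOND // 1000    # 8 bytes per millisecond
BLOCK_BYTES = BYTES_PER_SECOND             # one allocation unit = one second


def block_reads(start_ms, duration_ms):
    """Map a millisecond range onto reads of one second blocks.

    Returns a list of (block_index, byte_offset, byte_length) tuples,
    assuming the recording's blocks are numbered consecutively from 0.
    """
    position = start_ms * BYTES_PER_MS
    end = position + duration_ms * BYTES_PER_MS
    reads = []
    while position < end:
        block = position // BLOCK_BYTES
        offset = position % BLOCK_BYTES
        # Read to the end of this block or the end of the request.
        length = min(BLOCK_BYTES - offset, end - position)
        reads.append((block, offset, length))
        position += length
    return reads


if __name__ == "__main__":
    # One second of voice starting 1.5 seconds into a recording.
    for read in block_reads(1500, 1000):
        print(read)
```

In this model an edited playback list is simply a sequence of such ranges, and each range touches at most one block per second of voice played.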
All voice is stored in encrypted form, with the associated keys stored in a data base along with information granting appropriate access to each voice segment. Since each voice segment is stored only once, regardless of the number of users granted access, the file server data base also keeps track of the number of outstanding references to each segment. In this way, voice storage is reclaimed automatically.

Finch

The Finch user interface program runs in the Cedar programming environment and provides a rudimentary manual user interface for controlling the telephone and the voice message system. Cedar is available on several different types of workstations in our laboratory so we do not expect availability to be a problem. The functions that Finch provides are designed to provide improved placement and receipt of calls, management of a personal telephone directory, and management of the voice message system.

The voice message system portions of Finch are patterned after our text mail system Walnut. Facilities are provided for recording and reviewing messages, for giving a voice message a text subject field, and for directing delivery of the message to one or to several recipients. Eventually we expect to fully integrate the voice and text mail systems. Finch also provides a table of contents of the user's messages. Individual messages can be played, deleted, or saved for a later time.

The telephone management portions of Finch allow for the placement of calls by name rather than number, for the placement of calls by pointing at the callee's name in the directory, for annunciation of received calls by the caller's name, and for logging of all telephone activity.

Examples of usage

Rudimentary voice messages, call placement, ringing volume.

Status and Plans

In June 1983 we put an eight user system into service using engineering prototype Etherphones.
At this writing (August 1983) we are beginning the deployment of an additional 40 Etherphones, with which we will offer service to the entire Computer Science Laboratory. With the backbone system in place, we are just beginning to explore the potential applications of integrated voice in the office environment.

Summary/Zinger closing paragraph