Voice Proposal: Architecture
Inter-Office Memorandum
To: File    Date: September 13, 1981
From: L. Stewart    Location: Palo Alto
Subject: Voice Proposal: Architecture    Organization: CSL
XEROX
Filed on: <Audio>Doc>VPArchitecture.bravo
SYSTEM CAPABILITIES
While the voice project must have some means for shipping voice around, there are additional capabilities required of any system providing the kinds of functionality described in our visions. Without too much predisposition towards a particular architecture, we have managed to group the basic capabilities into a number of areas. Some of the areas described below imply matching collections of hardware, but some describe functional requirements which are quite abstract.
Availability
Availability is less a capability than a requirement. A telephone system should always work. In the large, this means that the system must continue to function even after some of its components break. Essential components, of course, must be designed with high availability in mind. Availability in the small is more subtle: one’s phone must still work even if one’s workstation is in the debugger.
Terminal Equipment
In the same sense that our workstations usually have a display and a keyboard, the voice system requires machinery for voice input and output and enough of a "digital" user interface to control it. This might take the form of an ordinary desk telephone set with its earpiece receiver and mouthpiece transmitter, plus a 12-button keypad and hookswitch. It might take the form of a fancier telephone, perhaps with a full typewriter-like keyboard and a one-line display. It might simply be a speaker and microphone attached to a regular workstation display and keyboard. (In any case, it will be possible to connect a variety of transducers to the voice terminal, such as a speakerphone or headset.)
Transmission and switching
We require some means of getting voice from place to place. Possibilities include the existing telephone switching and transmission system, a private switching system such as a PABX (Private Automatic Branch Exchange) using traditional wiring, and the Ethernet. Good quality telephony places minimum performance requirements on any method we might choose; the most crucial are high bandwidth and low delay. Most telephone industry digital voice uses 64,000 bits per second, although some new PABXs use more in order to obtain higher quality. Data compression schemes exist which might reduce this rate to as low as 8,000 bits per second, but they are quite expensive. Delays arise from propagation time over long distances (speed-of-light limits) and from buffering in packet-switched systems. Acceptable delays depend on such factors as background noise and echo level. The transmission system we use must supply enough bandwidth for an adequate number of users without excessive delays.
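As a rough illustration of these numbers, the sketch below derives the 64,000 bits per second figure from standard telephone sampling (8,000 samples per second, 8 bits per sample) and shows how much delay packetization alone adds, since a packet cannot be sent until it has been filled. The packet sizes are illustrative assumptions, not values chosen for the Etherphone protocol.

```python
# Voice bit rate and packetization delay for PCM telephony.
# Assumes standard telephone sampling; packet sizes are illustrative.

SAMPLE_RATE_HZ = 8_000   # samples per second
BITS_PER_SAMPLE = 8      # one companded PCM sample per byte


def voice_bit_rate() -> int:
    """Bits per second for one direction of a PCM voice stream."""
    return SAMPLE_RATE_HZ * BITS_PER_SAMPLE


def packetization_delay_ms(payload_bytes: int) -> float:
    """Milliseconds spent filling one packet before it can be sent."""
    samples = payload_bytes * 8 // BITS_PER_SAMPLE
    return 1000.0 * samples / SAMPLE_RATE_HZ


if __name__ == "__main__":
    print(voice_bit_rate())  # 64000
    for size in (80, 160, 400):  # hypothetical payload sizes in bytes
        print(size, "bytes ->", packetization_delay_ms(size), "ms")
```

Larger packets reduce per-packet overhead on the Ethernet but add delay one-for-one, and a receiver will usually buffer some additional time to smooth out variations in arrival.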
Control
We must have rapid and versatile control over the transmission and switching system. We must be able to manage connections among people, between people and servers, and between machines. Because we must communicate with the outside world, we cannot entirely rule out generating tones to control the traditional phone system, but the quality of the voice system we construct will depend a great deal on the speed and accuracy with which system control is handled. Some of our visions depend on the flexibility of control arrangements; for example, forwarding requires the ability to change the association between numbers and terminals.
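Forwarding, for instance, amounts to changing which terminal a number resolves to. A minimal sketch, assuming a simple table kept by whatever component handles control; the class, its methods, and the extension numbers and terminal names used with it are hypothetical:

```python
# Forwarding as a change in the number-to-terminal association.
# Directory and its methods are hypothetical names for illustration.


class Directory:
    def __init__(self) -> None:
        self.home: dict[str, str] = {}       # number -> home terminal
        self.forwarded: dict[str, str] = {}  # number -> number it forwards to

    def register(self, number: str, terminal: str) -> None:
        self.home[number] = terminal

    def forward(self, number: str, to_number: str) -> None:
        self.forwarded[number] = to_number

    def cancel_forward(self, number: str) -> None:
        self.forwarded.pop(number, None)

    def resolve(self, number: str) -> str:
        """Follow forwarding links to the terminal that should ring."""
        seen = set()
        while number in self.forwarded:
            if number in seen:
                raise ValueError(f"forwarding loop at {number}")
            seen.add(number)
            number = self.forwarded[number]
        return self.home[number]
```

Because calls are routed by looking the number up at call time, changing the table takes effect on the very next call, with no rewiring.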
Voice Filing
In order to handle voice messages, annotated documents, answering machine facilities, and the like, we must have a voice filing system with the necessary real-time capability. Although voice messages could be implemented with analog tape recorders attached to terminals, voice segments which might be accessed by many people seem to require digital storage using high performance disks.
Voice composition and editing
Once we begin constructing documents incorporating voice, we will need systems for composing and editing voice. Such systems might range from simple "record" and "pause" buttons to some graphical representation of a voice passage. The voice filing machinery must be versatile enough to handle complex restructuring of a passage (e.g., by something like a piece table).
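A piece table keeps recorded voice immutable and represents an edited passage as a list of references (file, starting sample, length) into it, so cutting or splicing never copies voice data. The sketch below assumes 8-bit samples stored as bytes; Piece, VoicePassage, and the file names are hypothetical:

```python
# A piece table over stored voice: edits rearrange references only.
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    file_id: str  # which stored voice file the samples come from
    start: int    # first sample (offset into that file)
    length: int   # number of samples


class VoicePassage:
    def __init__(self, pieces: list[Piece]) -> None:
        self.pieces = list(pieces)

    def length(self) -> int:
        return sum(p.length for p in self.pieces)

    def _split(self, pos: int) -> int:
        """Ensure a piece boundary at pos; return index of the piece starting there."""
        offset = 0
        for i, p in enumerate(self.pieces):
            if pos == offset:
                return i
            if pos < offset + p.length:
                k = pos - offset
                self.pieces[i:i + 1] = [
                    Piece(p.file_id, p.start, k),
                    Piece(p.file_id, p.start + k, p.length - k),
                ]
                return i + 1
            offset += p.length
        if pos == offset:
            return len(self.pieces)
        raise IndexError(pos)

    def delete(self, pos: int, count: int) -> None:
        i = self._split(pos)
        j = self._split(pos + count)
        del self.pieces[i:j]

    def insert(self, pos: int, piece: Piece) -> None:
        self.pieces.insert(self._split(pos), piece)

    def render(self, files: dict[str, bytes]) -> bytes:
        """Materialize the passage; the stored voice is never modified."""
        return b"".join(files[p.file_id][p.start:p.start + p.length]
                        for p in self.pieces)
```

Because edits only rearrange references, several passages can share one recording without copying it.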
Data-bases
Although we do not expect to construct data base systems ourselves, we expect to make heavy use of such systems in the implementation of white pages, yellow pages, and perhaps for organizing voice messages.
Encryption
People expect their telephone calls to be private. There are legal sanctions against wiretapping, but wiretapping on a broadcast medium, where a wiretap might be just a program, would be very difficult to detect. High speed single-chip DES encryptors are now inexpensive, and key distribution using a trusted server is straightforward.
ARCHITECTURAL PRINCIPLES
The basic premise of the Etherphone approach is that actual transmission of the voice data, as well as all control information, is done in digital form over the Ethernet. Connections to the outside telephone network are made by servers with trunks to the phone company. This scenario has the advantage of complete control over the telephone transmission system. We benefit by the natural multiplexing of the ether and by direct access to voice-as-data. Control of the system is distributed; negotiation for a call takes place between the source and destination Etherphones and other parties.
We plan to guide our implementation by four architectural principles:
Use the Ethernet for both voice and control information. This is the basic premise. We plan both to control the telephone/voice system by digital communications through the internet and to transmit voice on the Ethernet both for conversations and for storage.
High availability and reliability. We plan to construct a system with as close to the availability and reliability of the existing telephone system as we can.
Keep the Etherphones simple. We plan to treat the Etherphones themselves as simple terminals without much intelligence. The complexities of software and system control will reside in a more powerful server.
Use workstations, when available, for wonderful user interfaces. We plan to take considerable advantage of the workstations in most of our offices. Only they can provide the large displays and versatile user input capabilities we will need to provide advanced yet friendly functions. In locations without workstations we will provide telephones with somewhat fancier user interfaces than the usual twelve buttons.
SYSTEM COMPONENTS
We turn to a more detailed discussion of the Etherphone architecture, shown in Figure EPSystem.sil.
Etherphone -- We view the Etherphone primarily as an Ethernet peripheral. Its job is to perform A/D and D/A conversion of voice and to transmit and receive voice over the Ethernet. These activities are carried out under the close control of the Etherphone Server. The digital part of the user interface, the buttons and lights, will be controlled by the server, but the Etherphone must transmit and receive event notifications.
Etherphone Server -- The Etherphone Server is the system controller. It monitors the state of the system, keeping track of each Etherphone’s state and of which conversations are in progress, and it controls the system. It is responsible for setting up all connections. To achieve high reliability we plan eventually to use redundant Etherphone Servers.
Voice File Server -- The voice file server is a general purpose computer with high capacity disks. It performs more or less standard file server functions, but is specialized for the real-time needs of telephony. The voice file server must reliably handle several simultaneous file stores and retrieves at the telephone data rate of 64 Kilobits per second.
POTS gateway -- "Plain Old Telephone Service" gateway refers to a server machine that provides access from the Etherphone world to the public switched telephone network. Calls from outside arrive at the gateway and are routed, under control of the Etherphone Server, to the appropriate Etherphone. Calls originating on the Ethernet but bound elsewhere use the gateway as their path to the outside world.
The next two system components are slightly different in character; we need them to provide a complete system, but in some sense we do not have to do all the work ourselves. These are not new components.
Workstation -- We will depend in large measure on workstations for good user interfaces.
Database -- We plan to use existing and planned standard file servers and data base services for storage of white pages and yellow pages information and for storage of users’ call filters and other information. We may use data base services for storage of the voice file server directory, although not for the voice files themselves.
COMPATIBILITY
We have no immediate plans to build the POTS server. Instead, we plan to retain the present system of individual phone lines. Rather than connecting them to standard telephone sets, we will connect them as a "back-door" to the individual Etherphones. Calls for a particular station might arrive either over the Ethernet or over the back-door standard phone line. If a user dials an outside number, the Etherphone Server will direct the Etherphone to use the back-door line rather than the Ethernet.
We are taking this approach (which may be considered a distributed POTS gateway) largely for compatibility with the existing PARC phone system. If we removed all the existing lines and instead acquired direct-inward-dialing trunks for connection to a POTS server, we could no longer be part of our Centrex phone system. Those without Etherphones would have to call us as "outside calls".
This organization also offers protection against system failures. We will provide a deadman timer to reconnect the outside line automatically in the event of Etherphone or system failure.
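The deadman logic can be sketched as follows. This assumes the Etherphone hears periodically from the Etherphone Server and treats prolonged silence as failure; the 5-second timeout and all names are hypothetical. In practice the timer would probably be a hardware circuit, so that it also covers a crashed Etherphone processor; this model shows only the decision.

```python
# Deadman timer model: fall back to the back-door line when the
# server goes quiet. Names and the timeout value are hypothetical.

TIMEOUT_S = 5.0


class DeadmanTimer:
    def __init__(self, now: float, timeout: float = TIMEOUT_S) -> None:
        self.timeout = timeout
        self.last_heard = now
        self.fallback = False  # True: phone wired directly to the outside line

    def keep_alive(self, now: float) -> None:
        """Called whenever a valid message arrives from the server."""
        self.last_heard = now
        # A real design would avoid switching back in the middle of a call.
        self.fallback = False

    def tick(self, now: float) -> bool:
        """Called periodically; returns True if fallback is in effect."""
        if now - self.last_heard > self.timeout:
            self.fallback = True
        return self.fallback
```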
COMPONENT OVERVIEW
We now turn to overall descriptions of the various system components. More detailed descriptions will be found in the various design documents, as they appear.
Etherphones
We plan to produce three or four versions of the Etherphone.
Etherphone 0
The Etherphone 0 exists now. It consists of an Alto I together with an Auburn audio board and a Danray telephone set. The program is written in BCPL and incorporates a first Ethernet voice transmission protocol together with a simple connection mechanism. One can "dial" the destination station’s Ethernet address and the program will ring the destination phone or return a busy signal. The program includes silence detection and a number of facilities for performance evaluation.
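Silence detection reduces Ethernet load by not sending packets while a speaker is quiet. This memo does not describe the Etherphone 0 method; the sketch below shows one simple approach, an energy threshold over fixed frames, assuming linear 16-bit samples. The frame size and threshold are hypothetical:

```python
# Energy-threshold silence detection over fixed-size frames.
# FRAME and THRESHOLD are illustrative values, not Etherphone 0 settings.

FRAME = 160          # samples per frame (20 ms at 8 kHz)
THRESHOLD = 500.0    # mean-square energy below which a frame is "silent"


def frame_energy(frame: list[int]) -> float:
    return sum(s * s for s in frame) / len(frame)


def frames_to_send(samples: list[int], frame: int = FRAME,
                   threshold: float = THRESHOLD) -> list[int]:
    """Return the indices of frames loud enough to transmit."""
    keep = []
    for i in range(0, len(samples) - frame + 1, frame):
        if frame_energy(samples[i:i + frame]) > threshold:
            keep.append(i // frame)
    return keep
```

A practical detector would also keep sending for a short "hangover" period after speech stops, so that word endings are not clipped.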
Etherphone I
The Etherphone I will use essentially the same hardware as the Etherphone 0, with the addition of a "back-door" interface to the office phone line. The Etherphone I program will also be BCPL, but should be simpler (more refined) than the existing program. We plan to let the Etherphone Server control the collection of Etherphone I’s.
Etherphone II A and II B
The Etherphone II series will be the first real Etherphones. Etherphone II will be a microcomputer system with its power supply in a shoebox on the floor and its telephone set on the desk. The "B" model would include a keyboard and a small display for telephone applications without a nearby workstation. The Etherphone II will be built using off-the-shelf LSI components. It will be programmed in assembly language or in a higher level language such as C or Pascal. The program should be a near transliteration of the Etherphone I program.
Etherphone III
After we gain sufficient operational experience with the Etherphone II, and as available LSI parts improve, we may build newer, smaller Etherphones. We hope the Etherphone III will include a Dragon and be programmed in Mesa.
Etherphone Server
The Etherphone Server will be responsible for management and control of the entire voice system. The individual Etherphones will act as peripherals of the server; a user’s actions of pushing buttons on an Etherphone will be transmitted to the Etherphone Server for interpretation. However, after the server directs two Etherphones to establish a connection, the actual two-way transmission of voice will proceed without further intervention by the server. We are taking this centralized approach because the Etherphones are relatively low powered and because we see a medium term need for some centralized management of the system. In the longer term, the server functions should be assignable to one or two Etherphones in small systems.
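The division of labor described above can be sketched as a small state machine in the server: Etherphones report events, and the server replies with commands, ending with a pair of "connect" messages after which voice flows directly between the two Etherphones. This is an illustration under assumed message names and addresses, not the Thrush design:

```python
# Centralized call control: the server interprets events and issues
# commands; voice then flows Etherphone-to-Etherphone.
# Message names and addresses are hypothetical.


class EtherphoneServer:
    def __init__(self, directory: dict[str, str]) -> None:
        self.directory = directory  # number -> Etherphone address
        self.state = {a: "idle" for a in directory.values()}
        self.peer: dict[str, str] = {}
        self.outbox: list[tuple[str, str, str]] = []  # (to, message, argument)

    def send(self, to: str, msg: str, arg: str = "") -> None:
        self.outbox.append((to, msg, arg))

    def on_dial(self, caller: str, number: str) -> None:
        callee = self.directory.get(number)
        if callee is None or callee == caller or self.state[callee] != "idle":
            self.send(caller, "busy")
            return
        self.state[caller], self.state[callee] = "calling", "ringing"
        self.peer[caller], self.peer[callee] = callee, caller
        self.send(callee, "ring", caller)
        self.send(caller, "ringback", callee)

    def on_answer(self, callee: str) -> None:
        if self.state.get(callee) != "ringing":
            return
        caller = self.peer[callee]
        self.state[caller] = self.state[callee] = "talking"
        # After these two messages the server is out of the voice path.
        self.send(caller, "connect", callee)
        self.send(callee, "connect", caller)

    def on_hangup(self, party: str) -> None:
        other = self.peer.pop(party, None)
        self.state[party] = "idle"
        if other is not None:
            self.peer.pop(other, None)
            self.state[other] = "idle"
            self.send(other, "disconnect", party)
```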
Since the Etherphone Server will be responsible for interpreting users’ actions, it is the logical place for the software controlling many system functions such as forwarding, call filtering, and control of the Voice File Server.
We will use a Dolphin running Pilot/Cedar for the Etherphone Server. We have tried to avoid time critical tasks for the Etherphone Server, so it should be possible to take advantage of some of Cedar’s facilities. The Etherphone Server program (Thrush) design is further described in a later section.
Voice File Server
The Voice File Server will be controlled by the Etherphone Server. Storage of voice in real time is a sufficiently specialized activity that we feel no existing file server can fill our needs. The voice file server will have to speak the same protocol as do the Etherphones, and it will have to play and record several simultaneous voice streams at 64 Kilobits per second each. No special voice hardware is needed, because the voice will have already been digitized on its way to the Ethernet. Large capacity disks, however, will be important.
We will use a Dolphin running Pilot for the VFS. Use of Cedar facilities might compromise the real-time requirements, but we intend to take advantage of the improving programming environment.
REMAINING ISSUES
Ethernet Transmission
While we do not fully understand the details of Ethernet behavior at very high loads, we are confident that the network behaves well (with low delay) up to a load sufficient to build a usable Etherphone system. We intend to incorporate load management into the system to ensure that the Ether does not become overloaded. We estimate that a 10 MBit Ethernet would handle around 400 telephones, a 3 MBit Ethernet around 150 telephones, and a 1.5 MBit Ethernet about 80 telephones.
More measurements and experiments are probably required to pin these matters down precisely. People in SDD (Bob Printis & Yogen Dalal), GSL (Bernard Huberman), the Science Center (Shoch) and CSL (Stewart, Baskett, Swinehart, Nowicki, Gonsalves) are interested in various experimental and theoretical efforts.
In the appendix there is a document "Voice Transmission Protocol Issues" by Nowicki and Stewart which includes some discussion of these matters.
Telephone Systems
Not all telephones are in use at the same time. One very useful figure is that only about 22% of telephones are busy during the busiest hour of the day. (This number is via Lynch; we should locate supporting data. Can we find out how many calls are inside vs. outside?)
Long-delay telephone circuits sometimes have trouble with echoes. Echoes can be caused, for example, by acoustic echo at the far end of a call or by an imperfectly balanced hybrid at a junction between 2-wire and 4-wire circuits. Because the Etherphone system has "high delay" connections by industry standards, it may require echo handling circuits on some outside calls. Single-chip echo cancellers are slowly becoming available. We have a collection of literature, and we may have to invest in experiments and hardware at some point.
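Echo cancellers commonly work by learning the echo path with an adaptive filter and subtracting the predicted echo. The sketch below uses the normalized LMS algorithm, one widely used technique; we have not established what the single-chip parts use. The echo path, filter length, and step size are assumptions, and the sketch omits the double-talk detection a real canceller needs when both parties speak at once:

```python
# Adaptive echo cancellation with a normalized LMS (NLMS) filter.
# The echo path, taps and mu are illustrative assumptions.
import random


def nlms_cancel(far_end, near_in, taps=8, mu=0.5, eps=1e-6):
    """Return the residual after subtracting the estimated echo.

    far_end: samples sent toward the hybrid (the far talker's voice)
    near_in: samples coming back (here: echo only, no local talker)
    """
    w = [0.0] * taps        # estimated echo-path impulse response
    history = [0.0] * taps  # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, near_in):
        history = [x] + history[:-1]
        estimate = sum(wi * hi for wi, hi in zip(w, history))
        e = d - estimate    # what remains after cancellation
        norm = eps + sum(h * h for h in history)
        w = [wi + mu * e * hi / norm for wi, hi in zip(w, history)]
        residual.append(e)
    return residual


if __name__ == "__main__":
    random.seed(1)
    echo_path = [0.5, 0.3, -0.1]  # hypothetical hybrid echo response
    x = [random.uniform(-1, 1) for _ in range(4000)]
    d = [sum(c * x[n - k] for k, c in enumerate(echo_path) if n - k >= 0)
         for n in range(len(x))]
    e = nlms_cancel(x, d)
    before = sum(v * v for v in d[-500:])
    after = sum(v * v for v in e[-500:])
    print(f"residual echo energy {after:.2e} (was {before:.2e})")
```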
Compression
While compression does not seem particularly attractive for transmission, it may be attractive for storage of voice. We have no plans to use compression. 64,000 bits/second is 8,000 bytes per second, or roughly 4 IFS pages per second. A single T-300 disk drive can accommodate 9.5 hours of speech. We think that will hold us for a while. If voice messages really take off, however, 9.5 hours is only about 11 minutes for each of 50 CSL members.
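The storage arithmetic can be checked quickly. The 2,048-byte IFS page size is an assumption (it agrees with "roughly 4 pages per second"), and the disk capacity is derived from the 9.5-hour figure rather than from a T-300 specification:

```python
# Back-of-envelope check of the voice storage figures.
# IFS_PAGE_BYTES is an assumed value; disk capacity is implied, not quoted.

BITS_PER_SECOND = 64_000
IFS_PAGE_BYTES = 2_048  # assumed page size

bytes_per_second = BITS_PER_SECOND // 8               # 8,000
pages_per_second = bytes_per_second / IFS_PAGE_BYTES  # about 3.9

hours_on_disk = 9.5
implied_disk_bytes = hours_on_disk * 3600 * bytes_per_second  # about 274 MB

members = 50
minutes_each = hours_on_disk * 60 / members  # 11.4

print(bytes_per_second, round(pages_per_second, 1),
      round(implied_disk_bytes / 1e6), round(minutes_each, 1))
```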
Conference Calls
We don’t really know the right way to handle conference calls. If we use multicasting, then when more than one person is speaking the involved Etherphones must mix the audio or decide who should be allowed to talk. Another approach is that used by most PBXs. We could build a few "conference bridges" and mix the audio and send it back out on the net.
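For the conference-bridge approach, the mixing step might look like the following sketch. It assumes each participant's audio has already been converted to linear 16-bit samples, gives each participant the sum of everyone else's audio (so no one hears their own voice), and clips to the 16-bit range. The names are hypothetical:

```python
# Conference bridge mixing: each participant hears everyone but themselves.
# Assumes equal-length frames of linear 16-bit samples.

LIMIT = 32767


def clip(v: int) -> int:
    """Saturate to the signed 16-bit range."""
    return max(-LIMIT - 1, min(LIMIT, v))


def mix_for_each(frames: dict[str, list[int]]) -> dict[str, list[int]]:
    """frames: participant -> one frame of samples (all the same length)."""
    n = len(next(iter(frames.values())))
    total = [sum(f[i] for f in frames.values()) for i in range(n)]
    return {who: [clip(total[i] - f[i]) for i in range(n)]
            for who, f in frames.items()}
```

Computing one total and subtracting each participant's own frame keeps the work proportional to the number of participants rather than its square.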
Traffic management
In the appendix there is a memo "Voice Vs. Data" by Stewart which generally raises the spectre of class-of-service in datagram networks. If we are ever to use internet voice (imagine a collection of 10 MBit Ethernets hooked together by 1.5 MBit point to point links), then the gateways must allocate guaranteed bandwidth to voice users. Similar problems apply to combined voice and data on a single net or to use of an Ethernet as a transit network.
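One simple form of guaranteed bandwidth is admission control: a gateway reserves a fixed share of its link for voice and refuses a new call when the reservation is exhausted, rather than letting all traffic degrade. A sketch, assuming 64,000 bits per second per call in each direction and ignoring packet header overhead; the class name and the 50% voice share used with it are hypothetical:

```python
# Admission control for voice calls crossing a gateway link.
# Link speed, voice share and per-call rate are assumptions.


class VoiceAdmission:
    def __init__(self, link_bps: int, voice_share: float,
                 per_call_bps: int = 64_000) -> None:
        self.budget = int(link_bps * voice_share)  # bandwidth reserved for voice
        self.per_call = per_call_bps
        self.calls: set[str] = set()

    def admit(self, call_id: str) -> bool:
        if call_id in self.calls:
            return True
        if (len(self.calls) + 1) * self.per_call > self.budget:
            return False  # refuse rather than degrade calls already in progress
        self.calls.add(call_id)
        return True

    def release(self, call_id: str) -> None:
        self.calls.discard(call_id)
```

With a 1.5 MBit link and half of it reserved for voice, this admits 11 simultaneous calls.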
System reliability
To a degree, we have not signed up to build something as reliable as the existing phone system. We are using commercial power, and each Etherphone retains an outside phone line for use when the system or the Etherphone breaks. We are centralizing control in the Etherphone Server. What if it breaks? (In fact we will probably have to allow multiple Etherphone Servers anyway, in order to allow internet Etherphone calls between areas served by different Etherphone Servers.)
Alternative Architectures
There is some material in the appendices about alternatives to the single line Etherphone. We think that the advanced functions could be built on a variety of voice transmission systems, but we think that the single line Etherphone is the most appropriate for us.
INITIAL PLANS AND SCHEDULE
By this Fall we plan to have the Alto I Etherphone I prototypes (2 to 5 of them) able to talk to each other and to standard phone lines with the aid of a first Etherphone Server. We plan to have essentially frozen the Ethernet voice transmission protocol and to have collected information about (and perhaps measured) Ethernet performance for voice traffic. We are already collecting information we need to start the Etherphone II design.
By this Winter we will have more of the basic Etherphone Server operational as well as a basic Voice File Server. We expect some preliminary applications work to start in parallel with these activities.
By next Summer we expect to have a number of Etherphone II prototypes and some solid applications.
Shortly thereafter we intend to build enough Etherphone IIs to supply all of CSL.
As was mentioned, we have no particular plans at this time to build the POTS gateway.
PERSONNEL, PLANS, FACILITIES, AND BUDGET
Dan Swinehart, Larry Stewart, and Severo Ornstein are the primary participants. They are spending roughly 70% of their time on the Voice project. In addition, John Ousterhout and Susan Owicki are working on the project: John one day a week, and Susan one day a week for now and more after mid-October.
Specific roles are as follows:

Dan: Define protocols; work on Etherphone Server program
Larry: Work on Etherphone hardware and software
Severo: Manage things and work on Etherphone hardware with Larry
John: Work on Voice File Server
Susan: Work on Etherphone Server design with Dan and File Server design with John
We are presently designing and will soon begin implementing both hardware and software.
Dan and Susan will develop the Etherphone Server code on Dan’s Dolphin. If we can’t find another machine by the time we’re ready to start integrating the Etherphone Server with the Etherphones, we’ll use Dan’s Dolphin as the Server - and get him an Alto II.
John will develop the Voice File Server code on a Dorado. (He has special sign-up dispensation for his one day a week here.) Larry Stewart has a Dolphin, with the understanding that if we can’t find another machine for the server by the time we are ready to start integrating the Voice File Server with the other machines, we will use his machine - and replace it with an Alto II.
We have ordered a number of parts for the Etherphone II and are beginning to try out some small circuits to become familiar with how the parts work. We hope to have a working breadboard by about the end of the year. Most of our hardware work will consist of plugging together small numbers of reasonably high level standard parts on prototype boards. A lab bench or table will do for this. We’ll also need some power supplies and scopes and things, but we don’t foresee a need for substantial lab space. On the other hand, during development we will want to assemble all of the elements of the distributed system in one room to make debugging feasible. When full-blown, this could include the following items of equipment:
1 or 2 Alto based Etherphone I’s
1 or 2 prototype Etherphone II’s
1 Dolphin Voice File Server
1 T-80 for Voice File Server
1 Dolphin Etherphone Server
1 control ("Midas")* Alto for Etherphone II’s
1 Alto gateway between the 3 MB and 1.5 MB Ethernets (could be elsewhere)
Scopes, tables, etc.
* Not really Midas, but serves some of the same basic functions
Not a trivial pile of stuff. We plan to use the Nursery. It’s about the right size and near the lab. (Our use fits the mold of the Nursery’s original purpose.) Over the next few weeks it will be cleared of other occupants and we will start to move in - at first with the Alto I Etherphone 0’s (presently in Dan’s office and in the Nursery) together with our hardware breadboarding. Gradually the other stuff will assemble there. We hope to have a rudimentary combined system working next spring. By next summer we expect to have the Etherphone II design solidified enough to order enough units to outfit everyone in CSL with an Etherphone. We have allocated $50K in the 1982 CSL budget for this purpose (not including server machines and disk drives). By the end of 1982 we hope to have a working backbone system in place in CSL that will handle telephone calls in standard ways, permit simple storage and retrieval of voice messages, and generally put us in a position to start experimenting with more esoteric uses of the voice handling facilities, e.g. integration with Laurel (or a derivative).