[Indigo]<Voice>Stewart>ap2.bravo!1

While the voice project must have some means for shipping voice around, there are additional capabilities required of any system providing the kinds of functionality described in our visions. Without too much predisposition towards a particular architecture, we have managed to group the basic capabilities into a number of areas. Some of the areas described below imply matching collections of hardware, but some describe functional requirements which are quite abstract.

Perhaps availability is less a capability than a requirement. We are engaged in constructing a telephone system. It should always work. In the large, this means that the system must continue to function even after some of its components break. Essential components, of course, must be designed with high availability in mind. Availability in the small is more subtle. One’s phone must still work even if one’s workstation is in the debugger.

In the same sense that our workstations usually have a display and a keyboard, the voice system requires machinery for voice input and output and enough of a "digital" user interface to control it. This might take the form of an ordinary desk telephone set with its earpiece receiver and mouthpiece transmitter, plus a 12-button keypad and hookswitch. It might take the form of a fancier telephone, perhaps with a full typewriter-like keyboard and a one-line display. It might simply be a speaker and microphone attached to a regular workstation display and keyboard. (In any case, it will be possible to connect a variety of transducers to the voice terminal, such as a speakerphone or headset.)

We require some means of getting voice from place to place. Some possibilities are the existing telephone switching and transmission system, a private switching system such as a PABX (Private Automatic Branch Exchange) using traditional wiring, and the Ethernet. Good quality telephony places lower limits on the performance of any method we might choose. The most crucial performance requirements are related to high bandwidth and low delay. A telephone industry compatible voice stream uses 64,000 bits per second. While data compression schemes exist which might reduce this rate to as low as 8,000 bits per second, they are quite expensive. Delays are generated by speed-of-light problems or by buffering in packet switched systems. Acceptable delays depend on such factors as background noise and echo level. The transmission system we use must supply enough bandwidth for an adequate number of users without excessive delays.

We refer to control over the transmission and switching system. We must be able to manage connections between people, between people and servers, and between machines. Because we must communicate with the outside world, we cannot entirely rule out generating tones to control the traditional phone system, but the quality of the voice system we construct will depend a great deal on the speed and accuracy with which system control is handled. Some of our visions depend on the flexibility of control arrangements; for example, forwarding requires the ability to change the association between numbers and terminals.
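As a rough illustration of what "changing the association between numbers and terminals" could look like, here is a minimal sketch. The class name, method names, and the example numbers are all hypothetical; nothing here is taken from an actual Etherphone implementation.

```python
class NumberDirectory:
    """Hypothetical table mapping dialed numbers to terminals, with forwarding."""

    def __init__(self):
        self._terminal = {}  # number -> terminal identifier
        self._forward = {}   # number -> number it is forwarded to

    def assign(self, number, terminal):
        self._terminal[number] = terminal

    def forward(self, number, target):
        self._forward[number] = target

    def cancel_forward(self, number):
        self._forward.pop(number, None)

    def resolve(self, number):
        """Follow forwarding links and return the terminal that should ring."""
        seen = set()
        while number in self._forward:
            if number in seen:
                raise ValueError("forwarding loop")
            seen.add(number)
            number = self._forward[number]
        return self._terminal[number]


directory = NumberDirectory()
directory.assign("4401", "etherphone-A")
directory.assign("4402", "etherphone-B")
directory.forward("4401", "4402")  # calls to 4401 now ring etherphone-B
```

Because forwarding is only a table entry, it can be changed or cancelled without touching the terminals themselves.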

In order to handle voice messages, annotated documents, answering machine facilities, and the like, we must have a voice filing system with the necessary real-time capability. Although voice messages could be implemented with analog tape recorders attached to terminals, voice segments which might be accessed by many people seem to require digital storage using high performance disks.

Once we begin constructing documents incorporating voice, we will need systems for composing and editing voice. Such systems might range from simple "record" and "pause" buttons through some graphical representation of a voice passage. The voice filing machinery must be sufficiently versatile to handle complex restructuring of a passage (e.g. by something like a piece table).
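To make the piece table idea concrete, the following is a minimal sketch under our own assumptions: recordings are immutable and addressed by sample offset, and an edited passage is just an ordered list of (recording, start, length) pieces. The names Piece and VoicePassage are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    """A run of samples inside an immutable stored recording."""
    recording: str  # name of the stored voice file (hypothetical)
    start: int      # first sample offset within that recording
    length: int     # number of samples in this piece


class VoicePassage:
    """An edited passage: an ordered list of pieces; no samples are copied."""

    def __init__(self, pieces=None):
        self.pieces = list(pieces or [])

    def __len__(self):
        return sum(p.length for p in self.pieces)

    def _split(self, pos):
        """Ensure a piece boundary at sample position pos; return its index."""
        if not 0 <= pos <= len(self):
            raise IndexError("position outside passage")
        offset = 0
        for i, p in enumerate(self.pieces):
            if pos == offset:
                return i
            if pos < offset + p.length:
                cut = pos - offset
                self.pieces[i:i + 1] = [
                    Piece(p.recording, p.start, cut),
                    Piece(p.recording, p.start + cut, p.length - cut),
                ]
                return i + 1
            offset += p.length
        return len(self.pieces)

    def insert(self, pos, piece):
        """Insert a piece so that it begins at sample position pos."""
        self.pieces.insert(self._split(pos), piece)

    def delete(self, pos, length):
        """Remove length samples starting at sample position pos."""
        i = self._split(pos)
        j = self._split(pos + length)
        del self.pieces[i:j]


# Two seconds of greeting at 8000 samples per second, with a half-second
# correction spliced into the middle and the first quarter second trimmed.
message = VoicePassage([Piece("greeting", 0, 16000)])
message.insert(8000, Piece("correction", 0, 4000))
message.delete(0, 2000)
```

Editing touches only the list of pieces, so the stored recordings can be shared by many passages.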

Although we do not expect to construct data base systems ourselves, we expect to make heavy use of such systems in the implementation of white pages, yellow pages, and perhaps for organizing voice messages.

While speech synthesis and recognition capabilities would be extremely valuable, we do not propose to develop them. We feel that synthesis and recognition capabilities are not required for basic systems work in voice, and we feel we will be able to incorporate them into our systems as they become available.

The basic premise of the Etherphone approach is that actual transmission of the voice data, as well as all control information, is done in digital form over the Ethernet. Connections to the outside telephone world would be done by servers with trunks to the phone company. This scenario has the advantage of complete control over the telephone transmission system. We benefit by the natural multiplexing of the ether and by direct access to voice-as-data. Control of the system is distributed; negotiation for a call might take place directly between the source and destination Etherphones.

Use the Ethernet for both voice and control information. This is the basic premise. We plan both to control the telephone/voice system by digital communications through the internet and to transmit voice on the Ethernet both for conversations and for storage.

Keep the Etherphones simple. We plan to treat the Etherphones themselves as simple terminals without much intelligence. The complexities of software and system control will reside in a more powerful server.

Use workstation when available for wonderful user interface. We plan to take considerable advantage of the workstations in most of our offices. Only they can provide the large displays and versatile user input capabilities we will need to provide advanced yet user friendly functions. In places without workstations we will provide telephones with somewhat fancier user interfaces.

Etherphone -- We view the Etherphone primarily as an Ethernet peripheral. Its job is A/D and D/A conversion of voice and transmitting and receiving voice over the Ethernet. These activities are carried out under the close control of the Etherphone Server. The digital part of the user interface, the buttons and lights, will be controlled by the server.

Etherphone Server -- The Etherphone Server is the system controller. It is in charge both of monitoring the state of the system (it keeps track of the state of each Etherphone and of who is talking to whom) and of controlling the system (it is responsible for setting up all connections). In order to achieve high reliability we may eventually use redundant Etherphone Servers.

Voice File Server -- The voice file server is a general purpose computer with high capacity disks. It performs more or less standard file server functions, but is specialized for the real-time needs of telephony. The voice file server must reliably handle several simultaneous file stores and retrieves at the telephone data rate of 64 Kilobits per second.

POTS gateway -- "Plain Old Telephone Service" gateway refers to a server machine that provides access from the Etherphone world to the public switched telephone network. Calls arriving from the outside arrive at the gateway and are routed under control of the Etherphone Server to the appropriate Etherphone. Calls originating on the Ethernet but bound elsewhere use the gateway as a path to the outside world.

The next two system components are slightly different in character; we need them to provide a complete system, but in some sense we do not have to do all the work ourselves. These are not new components.

Database -- We plan to use existing and planned standard file servers and data base services for storage of white pages and yellow pages information and for storage of users’ call filters and other information. We may use data base services for storage of the voice file server directory, although not for the voice files themselves.

We have no immediate plans to actually build the POTS server. Instead, we plan to retain the present system of individual phone lines, but rather than connecting them to standard telephone sets, we will connect them as a "back-door" to the individual Etherphones. Calls for a particular station might arrive either over the Ethernet or over the back-door standard phone line. If a user dials an outside number, the Etherphone Server will direct the Etherphone to use the back-door line rather than the Ethernet.
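The routing decision described above might be sketched as follows. This is an illustration only: the function name, the Path values, and the rule that "outside" means "not found in the directory of Etherphones" are assumptions, since the text does not say how the server recognizes an outside number.

```python
from enum import Enum


class Path(Enum):
    ETHERNET = "ethernet"
    BACK_DOOR = "back-door line"


def choose_path(dialed, etherphone_directory):
    """Return the path the server would direct the calling Etherphone to use.

    Assumption: any number not known as an Etherphone is an outside number.
    """
    if dialed in etherphone_directory:
        return Path.ETHERNET
    return Path.BACK_DOOR


# Hypothetical station names and outside number, for illustration only.
known_etherphones = {"etherphone-A", "etherphone-B"}
inside = choose_path("etherphone-B", known_etherphones)
outside = choose_path("9-555-0100", known_etherphones)
```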

We are taking this approach (which may be considered a distributed POTS gateway) largely for compatibility with the existing Parc phone system. If we removed all the existing lines and instead acquired direct-inward-dialling trunks for connection to a POTS server, we could no longer be part of the Xerox Palo Alto phone system; those without Etherphones would have to call us as "outside calls". This organization also offers additional protection against system failures. We will provide a deadman timer to automatically reconnect the outside line in the event of Etherphone or system failure.
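One way to picture the deadman timer is the sketch below. It assumes that a healthy Etherphone "kicks" the timer periodically and that, if the kicks stop, the outside line falls back to a direct connection that bypasses the Etherphone. That fallback wiring, the latching behavior, and all names are our assumptions; the text specifies only that the line is reconnected automatically on failure.

```python
class DeadmanTimer:
    """Latches the outside line to a direct connection if not kicked in time."""

    def __init__(self, timeout_s, now=0.0):
        self.timeout_s = timeout_s
        self.last_kick = now
        self.tripped = False

    def kick(self, now):
        """Called periodically by a healthy Etherphone; ignored once tripped."""
        if not self.tripped:
            self.last_kick = now

    def poll(self, now):
        """Return where the outside line should be connected at time now."""
        if now - self.last_kick > self.timeout_s:
            self.tripped = True
        return "direct" if self.tripped else "etherphone"


timer = DeadmanTimer(timeout_s=2.0)
timer.kick(now=1.0)
healthy = timer.poll(now=2.5)  # 1.5 s since the last kick: still healthy
failed = timer.poll(now=3.5)   # 2.5 s since the last kick: timer trips
```

Time is passed in explicitly so the behavior is deterministic; a real device would use a hardware timer.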

The proposed organization, in which there is one outside line per station, can also be considered a "key telephone system", although functionally, our organization is more like a PABX with many trunks.

The Etherphone 0 exists now. It consists of an Alto I together with an Auburn audio board and a Danray telephone set. The program is written in bcpl and incorporates a first Ethernet voice transmission protocol together with a simple connection mechanism. One can "dial" the destination station’s Ethernet address and the program will ring the destination phone or return a busy signal. The program includes silence detection and a number of facilities for performance evaluation.

The Etherphone I will use essentially the same hardware as the Etherphone 0, with the addition of a "back-door" interface to the office phone line. The Etherphone I program will also be bcpl, but should be simpler (although more refined) than the existing program. We plan to let the Etherphone Server control the collection of Etherphone I’s.

The Etherphone II series will be the first real Etherphones. We are thinking in terms of a microcomputer system with power supply in a shoebox on the floor plus a telephone set on the desk. The "B" model would additionally include a keyboard and a small display for telephone applications without a nearby workstation. The Etherphone II will be built using off-the-shelf LSI components and programmed in assembler or a higher level language such as C or Pascal. The program should be a straight transliteration of the Etherphone I program. We plan to use the Xerox SLC Ethernet chip because it is the only one available. Since the SLC may not run faster than 1.5 Mb per second, we may have to string a 1.5 Mb per second Ethernet and use a gateway.

After we gain sufficient operational experience with the Etherphone II and as available LSI parts improve we plan newer, smaller Etherphones. Perhaps the Etherphone III could include a Dragon and be programmed in Mesa.

The Etherphone Server will be responsible for management and control of the entire voice system. The individual Etherphones will act as peripherals of the server; a user’s actions of pushing buttons on an Etherphone will be transmitted to the Etherphone Server for interpretation. However, after the server directs two Etherphones to establish a connection, the actual two-way transmission of voice will proceed without further intervention of the server.
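The division of labor described here can be sketched in a few lines. This is a toy model under our own assumptions, not a protocol definition: the class and method names are hypothetical, the "dialed digits" are simply the destination address, and a phone already in a call is treated as busy.

```python
class Etherphone:
    """A simple terminal: it reports button pushes and obeys the server."""

    def __init__(self, address, server):
        self.address = address
        self.server = server
        self.peer = None    # address that voice packets are exchanged with
        self.signal = None  # "ring", "busy", or None

    def push_buttons(self, digits):
        # The Etherphone does not interpret the digits itself.
        self.server.handle_dial(self.address, digits)


class EtherphoneServer:
    """Interprets user actions and sets up connections."""

    def __init__(self):
        self.phones = {}

    def register(self, phone):
        self.phones[phone.address] = phone

    def handle_dial(self, caller_address, digits):
        caller = self.phones[caller_address]
        callee = self.phones.get(digits)
        if callee is None or callee is caller or callee.peer is not None:
            caller.signal = "busy"
            return
        callee.signal = "ring"
        # Point the two ends at each other; from here on voice would flow
        # directly between them without passing through the server.
        caller.peer, callee.peer = callee.address, caller.address


server = EtherphoneServer()
a, b, c = (Etherphone(name, server) for name in ("A", "B", "C"))
for phone in (a, b, c):
    server.register(phone)
a.push_buttons("B")  # A calls B
c.push_buttons("A")  # C calls A, who is already in a call
```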

Since the Etherphone Server will be responsible for interpreting users’ actions, it is the logical place for the software controlling many system functions such as forwarding, call filtering, and control of the Voice File Server.

We envision use of a Dolphin running Pilot/Cedar for the Etherphone Server. We have tried to avoid time critical tasks for the Etherphone Server, so it should be possible to take advantage of some of Cedar’s facilities.

The Voice File Server will be a peripheral of the Etherphone Server, although a somewhat more intelligent one than the collection of Etherphones. Storage of voice in real-time is a sufficiently specialized activity that we feel no existing file server can fill our needs. The voice file server will have to speak the same protocol as do the Etherphones, and it will have to play and record several simultaneous streams at 64 Kilobits per second each. No special voice hardware is needed, because the voice will have already been digitized on its way to the Ethernet. Large capacity disks, however, will be important.
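For a sense of scale, the storage and throughput implied by 64 Kilobits per second can be worked out directly. The sketch below uses only that rate from the text; the function names and the choice of four simultaneous streams are illustrative assumptions.

```python
VOICE_BITS_PER_SECOND = 64_000  # telephone data rate given in the text


def storage_bytes(seconds, bits_per_second=VOICE_BITS_PER_SECOND):
    """Bytes of disk needed to hold `seconds` of uncompressed voice."""
    return seconds * bits_per_second // 8


def aggregate_bits_per_second(streams, bits_per_second=VOICE_BITS_PER_SECOND):
    """Sustained rate the server must handle for simultaneous streams."""
    return streams * bits_per_second


one_minute = storage_bytes(60)               # 480,000 bytes
one_hour = storage_bytes(3600)               # 28,800,000 bytes
four_streams = aggregate_bits_per_second(4)  # 256,000 bits per second
```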

By this Fall we plan to have the Alto I Etherphone I prototypes (2 to 5 of them) able to talk to each other and to standard phone lines with the aid of a first Etherphone Server. We plan to have essentially frozen the Ethernet voice transmission protocol and to have collected information about (and perhaps measured) Ethernet performance for voice traffic. We are already collecting information we need to start the Etherphone II design.

By this Winter we will have more of the basic Etherphone Server operational as well as a basic Voice File Server. We expect some preliminary applications work to start in parallel with these activities.

Precis: While we do not fully understand the details of Ethernet behavior in a very high load regime, we are confident that the network behaves well (low delay) up to a load sufficient to build a usable Etherphone system. We intend to incorporate load management into the system to ensure that the Ether does not become overloaded.

The access delay of an Ethernet depends on the number of stations waiting to transmit at the end of some other transmission, be it a successful packet or a collision. The delay (including the average time remaining of a transmission in progress when a new station becomes ready) is complicated, but it is very small if there is only one station desiring to transmit. In this case, the delay is only that of waiting for a previous transmission to finish. Obviously, if an Ethernet were perfectly scheduled, this condition would persist up to 100% load. Since an Ethernet is not scheduled, statistical fluctuations and retransmission delays result in several stations waiting to transmit on occasion even though the average is less than one. Measurements by T. Gonsalves of Stanford, while not exactly applicable to Etherphone, seem to show that the delays remain very low until the load reaches 60% to 80% of capacity. The exact position of the "knee" in the curve depends on the packet size in use.

Interactive voice (telephony) requires low delay, under 100 milliseconds and preferably well under. In a datagram system, this requirement may be translated into packet size. Shorter packets mean lower end-to-end delays since less time is spent in accumulating a packet. In an Ethernet, shorter packets are less efficient than longer ones for two reasons: shorter packets have relatively more overhead and less data, and shorter packets have a lower "knee" in the delay curve.
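The first of these two effects (relatively more overhead in shorter packets) can be illustrated numerically. The sketch below assumes a 64,000 bit per second voice stream and a fixed per-packet overhead; the overhead of 40 bytes is a made-up figure chosen only to show the shape of the tradeoff, not a measured Ethernet value, and the second effect (the position of the "knee") is not modeled at all.

```python
VOICE_BYTES_PER_SECOND = 8000  # 64,000 bits per second
ASSUMED_OVERHEAD_BYTES = 40    # hypothetical per-packet overhead


def accumulation_delay_ms(voice_bytes):
    """Time spent filling one packet with voice before it can be sent."""
    return 1000.0 * voice_bytes / VOICE_BYTES_PER_SECOND


def efficiency(voice_bytes, overhead_bytes=ASSUMED_OVERHEAD_BYTES):
    """Fraction of each transmitted packet that is voice data."""
    return voice_bytes / (voice_bytes + overhead_bytes)


# Smaller packets cut the delay but waste more of the network on overhead.
for size in (40, 80, 160, 320):
    print(size, accumulation_delay_ms(size), round(efficiency(size), 2))
```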

As was mentioned, interactive voice requires low delay, but not necessarily perfect reliability. The ear will not notice occasional garbles. Telephone quality voice is readily achieved by sampling the voice waveform 8,000 times per second and encoding each sample in 8 bits, for a data rate of 64,000 bits per second. We are planning a system design of roughly 50 packets per second (160 voice bytes per packet). This figure gives 30 to 40 milliseconds end-to-end delay. The delay is made up of 20 milliseconds packet assembly time, plus a minimum transmission delay, plus some "anti-jitter" buffering delay at the receiver to cover up momentary longer delays.
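The arithmetic behind these figures can be checked in a few lines. All inputs come from the paragraph above; the only derived quantity not stated there is the 10 to 20 milliseconds left over for transmission and anti-jitter buffering once assembly time is subtracted from the quoted end-to-end range. The variable and function names are ours.

```python
SAMPLES_PER_SECOND = 8000
BITS_PER_SAMPLE = 8
PACKETS_PER_SECOND = 50

bit_rate = SAMPLES_PER_SECOND * BITS_PER_SAMPLE                # 64,000 bits/s
voice_bytes_per_packet = bit_rate // 8 // PACKETS_PER_SECOND   # 160 bytes
assembly_ms = 1000 / PACKETS_PER_SECOND                        # 20.0 ms


def remaining_budget_ms(end_to_end_ms):
    """Delay left for transmission plus anti-jitter buffering."""
    return end_to_end_ms - assembly_ms


low = remaining_budget_ms(30)   # 10.0 ms
high = remaining_budget_ms(40)  # 20.0 ms
```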

The figure of 64,000 bits per second is a compromise in many ways. Higher rates would give higher voice quality at the expense of extra Ethernet bandwidth per conversation and higher voice file storage costs. Lower rates could be achieved by speech compression, but would not greatly improve the capacity of an Ethernet. The number of packets per second needed for an interactive conversation must be kept high to minimize delays. Required Ethernet bandwidth would be reduced by compression, but the efficiency would drop as well, due to the shorter packets.

Silence detection provides an opportunity to improve the voice capacity of an Ethernet. Usually only one of the two parties of a phone call is speaking. The peak capacity must be sufficient to allow both parties to speak at once, but sometimes neither party will be speaking. The average is around 50% utilization of the full duplex channel.
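A back-of-the-envelope estimate shows how much silence detection might matter on average. The sketch below is deliberately crude: it ignores packet overhead, treats the 60% to 80% "knee" as a hard limit on usable load, and uses a 1.5 Mb per second Ethernet purely as an illustrative capacity. The function name and its parameters are our own, and the result is an average figure, not a guarantee for the peak case where both parties speak at once.

```python
def average_conversations(capacity_bps, usable_percent, duplex_percent,
                          voice_bps=64_000):
    """Conversations that fit, on average, within the usable load.

    usable_percent: share of raw capacity assumed usable before delays rise.
    duplex_percent: average utilization of the two-way channel
                    (100 without silence detection, about 50 with it).
    """
    usable_bps = capacity_bps * usable_percent // 100
    per_conversation_bps = 2 * voice_bps * duplex_percent // 100
    return usable_bps // per_conversation_bps


without_detection = average_conversations(1_500_000, 60, 100)  # 7
with_detection = average_conversations(1_500_000, 60, 50)      # 14
```

Under these assumptions silence detection roughly doubles the average number of conversations the network can carry.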