[Indigo]<Voice>LCSAudio>epprotocol.bravo!1

Telephone quality voice can be achieved with rates of 8000 bits per second or so, but at this writing, the required techniques are computationally expensive. Intermediate bit rates are a possibility, but 64,000 bits per second represents the present telephone industry standard. For this reason, we restrict our attention to 64 Kbps telephone-industry-compatible speech. Such digital voice signals are produced by sampling the voice 8000 times per second and representing each sample as an 8-bit encoding of its amplitude. The standard encoding is μ-255 law, a form of segmented logarithmic companding [ref for mu-law].
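As a rough illustration of the arithmetic and of logarithmic companding, the following sketch computes the 64 Kbps figure and evaluates the continuous mu-law curve with mu = 255. It is not the segmented 8-bit encoding used in practice, which approximates this curve piecewise; the function names are hypothetical and samples are assumed to be normalized to the range -1 to 1.

```python
import math

MU = 255  # the "255" in mu-255 law

SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
bit_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE  # bits per second


def mu_law_compress(x: float) -> float:
    """Continuous mu-law curve for a normalized sample x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


if __name__ == "__main__":
    print(bit_rate)
    # Small amplitudes get a disproportionately large share of the output range.
    for x in (0.01, 0.1, 0.5, 1.0):
        print(x, round(mu_law_compress(x), 4))
```

The point of the curve is that quiet samples are represented with finer resolution than loud ones, which suits the wide dynamic range of speech.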

Voice communication from human to human (telephony) is a real-time communications problem. The perceived delay must be fairly small and constant. Tolerable delays are generally below 100 milliseconds. [Notes on the Network]

Voice filing, the transmission of voice between a human and a storage device, is a half-duplex kind of function. As such, it can tolerate higher delays provided that the initial delay, when a connection is set up, is not too long.

Echoes are responsible for much of the perceived annoyance caused by delay. There is a tradeoff between allowable delay and the loudness of echo. Generally speaking, the greater the return loss (the quieter the echo), the longer the delay that can be tolerated. There are many sources of echo. Two important classes of echo are acoustic echo, which occurs when acoustic energy from the receiver (speaker) enters the transmitter (microphone), and hybrid echo (?), which is an electrical effect caused by reflections from hybrid circuits or impedance discontinuities in 2-wire voice paths. [Notes on the Network]

While a conversation between people is usually full duplex (both people can talk at once), usually only one participant at a time is speaking. In addition, when a person is speaking, there are often gaps between words or sentences. On the other hand, both participants occasionally speak at once. Over conversations in general, something like 47% of the full-duplex channel is used.

The laws of large numbers apply to these statistics. Useful data points derive from the telephone industry's use of Time Assigned Speech Interpolation (TASI), in which a certain number of trunk circuits (e.g. transoceanic cable circuits) are overcommitted. If 24 full-duplex trunks are available, usually 36 conversations can be supported, for a ratio of 1.5. If 150 circuits are available, 300 conversations can usually be supported, for a ratio of 2.0. [Notes on the Network, BSTJ] These statistical effects are usually referred to as the TASI advantage.
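The ratios quoted above can be checked with a few lines of arithmetic. The sketch below also computes a rough ceiling on the advantage under one assumption that is ours, not the source's: that the 47% figure is the fraction of time each direction of a conversation is active, so that with a very large number of trunks the ratio could approach the reciprocal of that fraction. The function names are hypothetical.

```python
def tasi_advantage(conversations: int, trunks: int) -> float:
    """Ratio of supported conversations to available trunk circuits."""
    return conversations / trunks


def asymptotic_advantage(activity: float) -> float:
    """Rough ceiling on the ratio, ASSUMING `activity` is the fraction
    of time one direction of a conversation carries speech."""
    return 1.0 / activity


if __name__ == "__main__":
    print(tasi_advantage(36, 24))     # small trunk group
    print(tasi_advantage(300, 150))   # larger trunk group
    print(round(asymptotic_advantage(0.47), 2))
```

The advantage grows with the size of the trunk group because the fraction of simultaneously active talkers fluctuates less as the group gets larger.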

To a certain extent, the human ear is tolerant of distortion in speech. For digital speech this means that, to a certain extent, the ear is tolerant of errors in the digital representation of speech.

Consistent service is perhaps the wrong title for this concept. Once set up, a voice connection should maintain an adequate quality. In the presence of network overloading it would be better to reject (block) connection attempts altogether than to offer poor quality. (A corollary to this is that it is certainly better to block new calls than to degrade old ones.)

The above considerations have several consequences for the design of a datagram-based voice transmission protocol. The two most critical requirements are that the voice protocol and the end devices have sufficient power to support the voice data rate in a steady state and that the end-to-end delay be sufficiently small. It is not sufficient for the system just to support the average data rate. The system must support the average data rate with sufficiently low variance to maintain a constant low delay. (Variance in the data rate can be compensated for by increased delay: voice data is buffered at the receiving end so that the buffer runs dry with very low probability.)

To the extent that the voice protocol is essentially a 4-wire transmission path, it need not be directly concerned with echo. However, acoustic echo may exist at one or both ends, and a voice protocol connection might be connected in tandem with a 2-wire transmission path, leading to hybrid echo.

Voice is digitized at 8000 samples per second. The samples are encoded in 8 bits using the industry standard μ-255 companding. In general, each Alto transmits 50 packets per second, each with 160 voice samples plus a sequence number indicating the sample number of the first sample of the packet. Since the samples are generated at a fixed 8000 Hz rate, this sequence number is equivalent to a timestamp.
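The relationship between packet rate, packet size and the sequence number can be sketched as follows. The names are hypothetical, and the sketch assumes the sample counter starts at zero when the connection starts and keeps advancing at 8000 Hz regardless of whether packets are being sent.

```python
SAMPLE_RATE_HZ = 8000
SAMPLES_PER_PACKET = 160
PACKETS_PER_SECOND = SAMPLE_RATE_HZ // SAMPLES_PER_PACKET


def first_sample_number(block_index: int) -> int:
    """Sequence number carried by the packet holding the given 160-sample block."""
    return block_index * SAMPLES_PER_PACKET


def sample_time_ms(sample_number: int) -> float:
    """Time since the start of sampling at which this sample was digitized."""
    return sample_number * 1000 / SAMPLE_RATE_HZ


if __name__ == "__main__":
    print(PACKETS_PER_SECOND)
    for block in (0, 1, 2, 50):
        seq = first_sample_number(block)
        print(block, seq, sample_time_ms(seq))
```

Because the sequence number counts samples rather than packets, a receiver can place any packet on the sender's time axis without a separate timestamp field.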

The Etherphone 0 audio microcode computes the sum of the upper 8 bits of the absolute value of the 12-bit linearly encoded samples produced by the Auburn A/D converter. If this value, summed over a given 160-sample block, falls below a certain threshold, the input is deemed to be silent. After a certain number of consecutive silent blocks appear, the originating machine stops transmitting packets. By this means, typically half the required communications bandwidth is saved.
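The silence test described above might look roughly like the following in a high-level language. This is only a sketch: the actual microcode is not shown in this memo, the threshold and the number of silent blocks are not stated (both appear here as hypothetical parameters), and we assume that the first `hangover_blocks` silent blocks are still sent before transmission stops.

```python
from typing import Iterable, List, Sequence

SAMPLES_PER_BLOCK = 160


def block_energy(samples: Sequence[int]) -> int:
    """Sum of the upper 8 bits of |sample| for 12-bit linear samples."""
    return sum(abs(s) >> 4 for s in samples)


def transmit_decisions(blocks: Iterable[Sequence[int]],
                       threshold: int,
                       hangover_blocks: int) -> List[bool]:
    """For each block, decide whether its packet would be transmitted.

    threshold and hangover_blocks are hypothetical tuning parameters.
    """
    decisions = []
    silent_run = 0
    for block in blocks:
        if block_energy(block) < threshold:
            silent_run += 1
        else:
            silent_run = 0
        # ASSUMPTION: keep sending until more than hangover_blocks
        # consecutive silent blocks have been seen.
        decisions.append(silent_run <= hangover_blocks)
    return decisions


if __name__ == "__main__":
    loud = [1000] * SAMPLES_PER_BLOCK
    quiet = [3] * SAMPLES_PER_BLOCK
    print(transmit_decisions([loud, quiet, quiet, quiet, quiet, loud],
                             threshold=100, hangover_blocks=2))
```

Discarding the low 4 bits of each sample means that very low-level noise contributes nothing to the sum, so the block total stays at zero for a quiet input.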

During a silence interval, the receiving station plays silence to the listener. After a silence interval, packets again begin to arrive at the receiving station. In order to account for jitter in the arrival of future packets, the very first packet is delayed by 10 milliseconds before it is played.

In the steady state, there should be 30+ milliseconds between the time a particular sample is digitized at the originating station and the time at which it appears at the D/A converter of the receiving station. Two thirds of this delay, 20 milliseconds (or 160 sample times) is absorbed by the very process of packetization. The first sample of a packet cannot be sent to the receiver until the last sample of the packet has been digitized. Twenty milliseconds per packet results in a 50 packets per second transmission rate. The remaining delay has a certain minimum value corresponding to the minimum transmission delay between the two stations, but is actually made longer in order to smooth over jitter or variations in transmission delay. (Note that the component of jitter associated with access to the Ethernet is typically much smaller than the jitter associated with the scheduling of processes in the sending and receiving stations.)
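The delay budget described above can be written out as simple arithmetic. In this sketch the transmission delay is a free parameter, since the memo gives no figure for it, and the 10 millisecond smoothing delay is the one applied to the first packet after a silence interval. The names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000
SAMPLES_PER_PACKET = 160
SMOOTHING_DELAY_MS = 10.0


def packetization_delay_ms() -> float:
    """Time needed to collect one full packet of samples."""
    return SAMPLES_PER_PACKET * 1000 / SAMPLE_RATE_HZ


def steady_state_delay_ms(first_packet_transit_ms: float) -> float:
    """A/D-to-D/A delay: packetization + transit of first packet + smoothing."""
    return packetization_delay_ms() + first_packet_transit_ms + SMOOTHING_DELAY_MS


if __name__ == "__main__":
    print(packetization_delay_ms())
    for transit in (0.0, 1.5, 5.0):  # illustrative transit delays only
        print(transit, steady_state_delay_ms(transit))
```

With a negligible transit delay the total is 30 milliseconds, of which packetization accounts for two thirds.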

The implementation of the jitter reduction delay is as follows. An assumption is made that the first packet to arrive does so with a typical transmission delay. A 10 millisecond silence is placed on the D/A queue in front of the first packet. Assuming that the clocks of the sending and receiving station are running at the same frequency, the delay between A/D at the originating station and D/A at the receiving station becomes fixed at 20 msec for packetization plus the transmission delay of the first packet plus 10 msec smoothing delay inserted at the receiving station. This process is repeated at the end of each silence interval.
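One way to picture the effect of the smoothing delay is to compute, on the receiver's clock, the time at which each packet of a talk spurt is due at the D/A converter and compare it with the packet's arrival time. The sketch below does this under the stated assumption that both clocks run at the same frequency; the function names and the example arrival times are hypothetical.

```python
from typing import List, Sequence

PACKET_MS = 20.0      # duration of the 160 samples in one packet
SMOOTHING_MS = 10.0   # silence queued ahead of the first packet


def playout_times(first_arrival_ms: float, n_packets: int) -> List[float]:
    """Receiver-clock times at which each packet starts playing."""
    return [first_arrival_ms + SMOOTHING_MS + k * PACKET_MS
            for k in range(n_packets)]


def late_packets(arrivals_ms: Sequence[float]) -> List[int]:
    """Indices of packets that arrive after their playout time.

    arrivals_ms must be non-empty and list consecutive packets of one spurt.
    """
    schedule = playout_times(arrivals_ms[0], len(arrivals_ms))
    return [k for k, (arrived, due) in enumerate(zip(arrivals_ms, schedule))
            if arrived > due]


if __name__ == "__main__":
    arrivals = [100.0, 121.0, 155.0, 160.0]
    print(playout_times(arrivals[0], len(arrivals)))
    print(late_packets(arrivals))
```

A packet is late only if its transmission delay exceeds that of the first packet by more than the 10 millisecond smoothing delay, which is why the first packet is assumed to have a typical delay.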

In fact, the resynchronization is done whenever the sequence number of an arriving packet does not match the expected sequence number; thus a lost packet is deemed to have been a silence interval.
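The resynchronization rule can be sketched as a small receiver state machine. The class and method names are hypothetical, and the sketch ignores wraparound of the sequence number, since the width of that field is not given here.

```python
SAMPLES_PER_PACKET = 160


class Receiver:
    """Tracks the expected sequence number and counts resynchronizations."""

    def __init__(self) -> None:
        self.expected_seq = None  # nothing received yet
        self.resyncs = 0

    def on_packet(self, seq: int) -> bool:
        """Return True if smoothing silence is queued ahead of this packet."""
        resync = seq != self.expected_seq
        if resync:
            # First packet, end of a silence interval, or a lost packet:
            # all three look the same to the receiver.
            self.resyncs += 1
        self.expected_seq = seq + SAMPLES_PER_PACKET
        return resync


if __name__ == "__main__":
    r = Receiver()
    # The packet with sequence number 480 is missing from this stream.
    print([r.on_packet(s) for s in (0, 160, 320, 640, 800)])
    print(r.resyncs)
```

The receiver cannot tell from the sample sequence number alone whether a gap was caused by silence suppression or by loss.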

The Etherphone 0 protocol is deficient in several areas. The protocol is not instrumented. We are engaged in an experimental enterprise; before we can design the ultimate voice protocol for our communications environment, we must assess the performance of that environment. The protocol makes no provision for system-wide management of bandwidth. To achieve what is called consistent behavior, we must permit only the number of active voice connections compatible with the available transmission bandwidth.

The Etherphone I system also uses Alto I/Auburn as the Etherphone. The system includes an Etherphone Server and uses an enhanced transmission protocol. The Etherphone Server keeps track of the state of the entire Etherphone system and is responsible for setting up calls.

We would like to instrument the Etherphone system in order to keep track of lost packets. Since the voice transmission protocol is not reliable in the sense that it does not retransmit lost packets, we are trying to engineer the system in such a way that it does not lose too many. Each packet will have not only a sequence number, but also a second sequence number indicating the number of packets which have been transmitted -- allowing the receiving station to accurately count lost packets.
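A receiver could use the second sequence number roughly as follows. This is a sketch under several assumptions of ours: the packet counter starts at zero and increases by one per transmitted packet, packets arrive in order, and counter wraparound is ignored. The header is modelled as a hypothetical pair (sample sequence number, packet count).

```python
from typing import Iterable, Tuple


def count_lost_packets(headers: Iterable[Tuple[int, int]]) -> int:
    """Count packets missing between the first and last packet received.

    Each header is (sample_seq, packet_count); only packet_count is used,
    so gaps in sample_seq caused by silence are not counted as losses.
    """
    lost = 0
    previous_count = None
    for _sample_seq, packet_count in headers:
        if previous_count is not None and packet_count > previous_count + 1:
            lost += packet_count - previous_count - 1
        previous_count = packet_count
    return lost


if __name__ == "__main__":
    received = [
        (0, 0),
        (160, 1),
        (1600, 2),   # jump in sample_seq only: a silence interval
        (1920, 4),   # jump in packet_count: packet 3 was lost
    ]
    print(count_lost_packets(received))
```

Packets lost before the first or after the last packet received are not visible to this count.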

As a general principle, we are trying to avoid keeping distributed state that must be kept up to date. For example, two Etherphones must be able to tell whether a connection between them should be closed. In addition (another principle), a voice connection should not be disturbed by other traffic. It is better to prevent a call from starting than to allow it to be disturbed once set up. Thus, even during silence intervals, Etherphones will transmit packets, although they will be shorter than packets carrying voice data and will be transmitted at a lower rate.