[Indigo]<Voice>Doc>EpProtocol.Bravo!2

The major criticism of the Ethernet local area network is that it is not suitable for real-time applications. Transmission of the human voice in the form of telephony is one application with severe real-time constraints. This memo describes some characteristics of voice transmission and Ethernet. We have designed and implemented a real-time voice transmission protocol based on Pup/OIS packets.

Telephone quality voice can be achieved with transmission rates down to 8000 bits per second, but the required compression techniques are currently very computationally expensive. Intermediate bit rates are a possibility, but 64,000 bits per second is the present telephone industry standard. For this reason, we restrict our attention to 64 Kbps telephone industry compatible speech. Such digital voice is produced by sampling the voice 8000 times per second and representing each sample as an 8-bit encoding of its amplitude. The standard encoding is called μ-255 law, a form of segmented logarithmic companding [AT&T 80] of pulse code modulation.
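As a rough illustration of logarithmic companding, the sketch below applies the continuous μ-law curve with μ = 255 to a sample normalized to [-1, 1] and rounds the result to a signed 8-bit level. This is the smooth textbook formula, not the segmented (piecewise-linear) encoding used in telephone equipment, and the function names are illustrative only.

```python
import math

MU = 255  # companding parameter for mu-255 law

def mu_law_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress."""
    y = max(-1.0, min(1.0, y))
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)

def quantize_8bit(y: float) -> int:
    """Round a companded value to a signed level in -127..127 (fits in 8 bits)."""
    return round(y * 127)

# 8000 samples per second at 8 bits per sample gives the 64 Kbps rate.
BIT_RATE = 8000 * 8
```

The logarithmic curve spends most of the 8-bit code space on quiet samples: a linear input of 0.01 compresses to roughly 0.23, so low-level speech keeps far more resolution than it would under uniform 8-bit quantization.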

Voice communication from human to human (telephony) is a real-time communications problem. The perceived delay must be fairly small and constant. Tolerable delays are generally below 100 milliseconds [AT&T 80].

Voice filing, the transmission of voice between a human and a storage device, is a half-duplex kind of function. As such, it can tolerate higher delays, provided that the initial delay when a connection is set up is not too long.

Echoes are responsible for much of the perceived annoyance caused by delay in current long-distance telephone calls. There is a tradeoff between allowable delay and the loudness of echo. Generally speaking, the more return loss (the quieter the echo), the longer the delay that can be tolerated. There are many sources of echo. Two important classes are acoustic echo, which occurs when acoustic energy from the receiver (speaker) enters the transmitter (microphone), and hybrid echo, which is an electrical effect caused by reflections from hybrid circuits or impedance discontinuities in 2-wire voice paths [AT&T 80, Section 7.2]. The major concern is "Talker Echo," which is generated on the receiver side (the person listening) but perceived on the transmitter side (the person talking).

While a conversation between people is usually full duplex (both people can talk at once), usually only one participant at a time is speaking. In addition, when a person is speaking, there are often gaps between words or sentences. On the other hand, both participants do occasionally speak at once. Over conversations in general, about 47% of the full-duplex channel capacity is used.

The laws of large numbers apply to these statistics. Useful data points come from the telephone industry's use of Time Assigned Speech Interpolation (TASI), in which a certain number of trunk circuits (such as transoceanic cables) are overcommitted. If 24 full duplex trunks are available, usually 36 conversations can be supported, for a ratio of 1.5. If 150 circuits are available, 300 conversations can usually be supported, for a ratio of 2.0 [AT&T 80, Section 7.11, BSTJ]. These statistical effects are usually referred to as the TASI advantage.

It is also fairly well known that only about 20% of the phones will be in use at the busiest hour. Most of the time, almost all of the phones will be unused, but of course the system must be designed for worst case behavior.

To a certain extent, the human ear is tolerant of brief distortions in speech. For digital speech this means that small, transient errors in the digital representation of speech can be ignored. Dropouts of up to several milliseconds will be perceived as "pops" and "clicks," and will be tolerated as long as they are kept within a reasonable level.

Once set up, a voice connection should maintain an adequate quality. In the presence of network overloading it would be better to reject connection attempts altogether than to offer poor quality. It is certainly better to block new calls than to degrade old ones.

The above considerations have several consequences for the design of a datagram based voice transmission protocol. The two most critical requirements are that the voice protocol and the end devices have sufficient power to support the voice data rate in a steady state and that the end-to-end delay be sufficiently small.

It is not sufficient for the system just to support the average data rate. The system must support the average data rate with sufficiently low variance to maintain a constant low delay. Some variance in the data rate can be compensated by increased initial delay. Voice data is buffered at the receiving end so that the buffer runs dry with very low probability.

The voice protocol must transmit enough packets per second to achieve a small delay. However, the number of packets per second must be low enough to be handled by software in the etherphone and the audio file server. Experiments have shown this limit to be about 100 packets per second. Inexpensive microprocessors will probably handle fewer packets per second. For this reason speech compression techniques offer little help. They reduce the bandwidth, but the packet rate is usually the limiting factor.

Simple arithmetic shows that the three million bits per second of bandwidth available on the experimental Ethernet currently in use within PARC can support about 40 simultaneous transmissions at 64 Kbps plus overhead. This means about 150-200 telephones could be supported on a single network. On a 10 Mbps Ethernet, about 400 telephones could be supported, and about 100 phones on a 1.5 Mbps network. The number of telephones does not scale linearly with bandwidth because of the greater potential for collisions at higher speeds.
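One way to reproduce the 3 Mbps figures is sketched below. The memo does not spell out the derivation, so several inputs are assumptions: a per-packet overhead of 26 bytes (a hypothetical value, not a measured header size), the 50 packets per second used by the implementation, the 20% busy-hour figure quoted earlier, and the reading that silence suppression lets a two-way call consume about one stream's worth of bandwidth on average.

```python
VOICE_BPS = 64_000         # 8000 samples/s * 8 bits
PACKETS_PER_SECOND = 50    # packet rate used by the implementation

def max_streams(link_bps, overhead_bytes_per_packet=0, usable_fraction=1.0):
    """Number of one-way 64 Kbps streams that fit on the link.

    overhead_bytes_per_packet and usable_fraction are illustrative
    parameters, not values taken from the memo.
    """
    per_stream_bps = VOICE_BPS + PACKETS_PER_SECOND * overhead_bytes_per_packet * 8
    return int(link_bps * usable_fraction // per_stream_bps)

def phones_supported(streams, busy_fraction=0.20, streams_per_call=1.0):
    """Phones served if only busy_fraction of them are in use at the busiest hour.

    streams_per_call=1.0 assumes silence suppression halves the cost of a
    two-way call; this is our reading, not a stated derivation.
    """
    return round(streams / (busy_fraction * streams_per_call))

raw = max_streams(3_000_000)                                       # no overhead
with_overhead = max_streams(3_000_000, overhead_bytes_per_packet=26)
phones = phones_supported(with_overhead)
```

Under these assumptions the link carries 46 raw streams, 40 once overhead is included, and about 200 telephones, which is the upper end of the range quoted above. The sketch ignores collisions, so it does not reproduce the sub-linear scaling at 10 Mbps.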

To the extent that the types and number of errors in the system are tolerable to the end users, the protocol does not need acknowledgements or retransmissions. Experiments have determined packet loss rates to be about 0.1% through all layers of software. This rate will be worse for heavier network loads, but could be better with improved Ethernet interface hardware.

In order to benefit from the TASI advantage, the voice protocol must detect periods of silence and utilize reduced bandwidth while silence is present. This factor of two reduction in average bandwidth guarantees efficient utilization of the channel capacity [Shoch and Hupp].

To the extent that each connection in the voice protocol is essentially two independent (4-wire) transmission paths, it need not be directly concerned with echo. However, acoustic echo may exist at one or both ends, and a voice protocol connection might be connected in tandem with a 2-wire transmission path, leading to hybrid echo. The silence threshold must be high enough to avoid detecting this echo as speech.

Voice is digitized at 8000 samples per second. The samples are encoded in 8 bits using the industry standard μ-255 companding. In general, each Alto transmits 50 packets per second, each with 160 voice samples plus a sequence number indicating the sample number of the first sample of the packet. Since the samples are generated at a fixed 8000 Hz rate, this sequence number is equivalent to a timestamp.
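The packetization just described can be sketched as follows. The `VoicePacket` class and its field names are illustrative; the memo does not give the actual Pup packet layout or the width of the sequence number field.

```python
from dataclasses import dataclass

SAMPLE_RATE_HZ = 8000
SAMPLES_PER_PACKET = 160   # 20 ms of voice, so 50 packets per second

@dataclass
class VoicePacket:
    first_sample: int  # sequence number: index of the first sample in the packet
    samples: bytes     # 160 companded 8-bit samples

def packetize(stream: bytes, start_sample: int = 0):
    """Split a run of 8-bit samples into full 160-sample packets.

    A trailing partial block is left for the next call, since a packet
    cannot be sent until its last sample has been digitized.
    """
    packets = []
    last_start = len(stream) - SAMPLES_PER_PACKET
    for offset in range(0, last_start + 1, SAMPLES_PER_PACKET):
        chunk = stream[offset:offset + SAMPLES_PER_PACKET]
        packets.append(VoicePacket(start_sample + offset, chunk))
    return packets

def timestamp_ms(packet: VoicePacket) -> float:
    """Convert the sample-number sequence field to milliseconds of stream time."""
    return packet.first_sample * 1000 / SAMPLE_RATE_HZ

one_second = packetize(bytes(SAMPLE_RATE_HZ))  # one second of zero-valued samples
```

One second of input yields 50 packets whose sequence numbers advance by 160, so the second packet carries sequence number 160 and corresponds to a timestamp of 20 ms.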

The Etherphone 0 audio microcode computes the sum of the upper 8 bits of the absolute value of the 12 bit linear encoded samples produced by the Auburn A/D converter. If this value, summed over a given 160 sample block, falls below a certain threshold, the input is deemed to be silent. After a certain number of consecutive silent blocks appear, the originating machine stops transmitting packets. By this means, typically half the required communications bandwidth is saved. Care must be taken, however, to set the number of consecutive packets before shutdown high enough to avoid the annoying effect of shutting down during very short pauses, such as those that occur between words.
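A minimal sketch of this silence detector follows. The threshold and the number of consecutive silent blocks are not given in this memo, so both are left as parameters; taking the "upper 8 bits" as a right shift by 4 of the 12-bit magnitude is our assumption about the microcode.

```python
class SilenceDetector:
    """Decide, block by block, whether the sender should keep transmitting."""

    def __init__(self, threshold: int, hangover_blocks: int):
        self.threshold = threshold              # hypothetical tuning value
        self.hangover_blocks = hangover_blocks  # silent blocks before shutdown
        self.silent_run = 0

    @staticmethod
    def block_energy(samples) -> int:
        # Sum of the upper 8 bits of each 12-bit magnitude (assumed: abs >> 4).
        return sum(abs(s) >> 4 for s in samples)

    def should_transmit(self, samples) -> bool:
        if self.block_energy(samples) < self.threshold:
            self.silent_run += 1
        else:
            self.silent_run = 0
        # Keep sending through short pauses; stop once the run of silent
        # blocks reaches hangover_blocks.
        return self.silent_run < self.hangover_blocks

detector = SilenceDetector(threshold=160, hangover_blocks=3)
loud = [1000] * 160
quiet = [5] * 160
decisions = [detector.should_transmit(b) for b in (loud, quiet, quiet, quiet, loud)]
```

With a hangover of three blocks, the first two quiet blocks are still transmitted, the third is suppressed, and transmission resumes as soon as a loud block appears. A larger hangover trades bandwidth for fewer shutdowns between words.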

During a silence interval, the receiving station plays silence to the listener. After a silence interval, packets again begin to arrive at the receiving station. In order to account for jitter in the arrival of future packets, the very first packet is delayed by 10 milliseconds before it is played.

In the steady state, there should be about 30 milliseconds between the time a particular sample is digitized at the originating station and the time at which it appears at the D/A converter of the receiving station. Two thirds of this delay, 20 milliseconds (or 160 sample times), is absorbed by the very process of packetization: the first sample of a packet cannot be sent to the receiver until the last sample of the packet has been digitized. The remaining delay has a certain minimum value corresponding to the minimum transmission delay between the two stations, but is actually made longer in order to smooth over jitter or variations in transmission delay. Note that the component of jitter associated with access to the Ethernet [we should find out exactly what it is; from Golslaves’ studies it seems like it is zero up to 80% load] is typically much smaller than the jitter associated with the scheduling of processes in the sending and receiving stations.

The implementation of the jitter reduction delay is as follows. An assumption is made that the first packet to arrive does so with a typical transmission delay. A 10 millisecond silence is placed on the D/A queue in front of the first packet. Assuming that the clocks of the sending and receiving station are running at roughly the same frequency, the delay between A/D at the originating station and D/A at the receiving station becomes fixed at 20 msec for packetization plus the transmission delay of the first packet plus 10 msec smoothing delay inserted at the receiving station. This process is repeated at the end of each silence interval.

In fact, the resynchronization is done whenever the sequence number of an arriving packet does not match the expected sequence number; thus a lost packet is treated exactly like a silence interval.
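The receiver behavior described in the last two paragraphs can be sketched as below. The `PlayoutQueue` class is a simplified model with illustrative names: it records what is placed on the D/A queue rather than driving a converter, and it assumes packets arrive in order.

```python
SAMPLE_RATE_HZ = 8000
SAMPLES_PER_PACKET = 160
SMOOTHING_SAMPLES = 80   # 10 ms of silence at 8000 samples per second

class PlayoutQueue:
    """Model of the receiver's D/A queue with sequence-number resynchronization."""

    def __init__(self):
        self.expected = None   # sequence number we expect next; None before first packet
        self.queue = []        # entries: ("silence", n_samples) or ("voice", first_sample)
        self.resyncs = 0

    def on_packet(self, first_sample: int, samples: bytes) -> None:
        if first_sample != self.expected:
            # First packet, end of a silence interval, or a lost packet:
            # all three are handled by inserting the smoothing delay again.
            self.queue.append(("silence", SMOOTHING_SAMPLES))
            self.resyncs += 1
        self.queue.append(("voice", first_sample))
        self.expected = first_sample + len(samples)

def steady_state_delay_ms(first_packet_transit_ms: float) -> float:
    """20 ms packetization + transit of the first packet + 10 ms smoothing."""
    packetization_ms = SAMPLES_PER_PACKET * 1000 / SAMPLE_RATE_HZ
    smoothing_ms = SMOOTHING_SAMPLES * 1000 / SAMPLE_RATE_HZ
    return packetization_ms + first_packet_transit_ms + smoothing_ms

rx = PlayoutQueue()
for seq in (0, 160, 480):          # the packet starting at sample 320 never arrives
    rx.on_packet(seq, bytes(SAMPLES_PER_PACKET))
```

In this run the queue receives a 10 ms silence before sample 0 and another before sample 480, so the missing packet triggers the same resynchronization as the start of a talk spurt.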

The Etherphone I system also uses Alto I/Auburn as the Etherphone. The system includes an Etherphone server and uses an enhanced transmission protocol. The Etherphone Server keeps track of the state of the entire Etherphone system and is responsible for setting up calls.

We would like to instrument the Etherphone system in order to keep track of lost packets. Since the voice transmission protocol is not reliable in the sense that it does not retransmit lost packets, we are trying to engineer the system in such a way that it does not lose too many. Each packet will have not only a sequence number, but also a second sequence number indicating the number of packets which have been transmitted -- allowing the receiving station to accurately count lost packets.
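The sample-number sequence field alone cannot distinguish a lost packet from a silence interval, since both appear as a jump in sample numbers; a per-packet counter removes that ambiguity. A minimal sketch of the receiver-side bookkeeping follows, assuming in-order delivery and ignoring counter wrap-around. The class and field names are hypothetical.

```python
class LossCounter:
    """Count lost packets from the sender's transmitted-packet counter."""

    def __init__(self):
        self.last_count = None  # counter value of the most recent packet seen
        self.received = 0
        self.lost = 0

    def on_packet(self, packet_count: int) -> None:
        if self.last_count is not None:
            if packet_count <= self.last_count:
                return  # duplicate or late packet; ignored in this sketch
            # Any gap in the counter is exactly the number of packets lost.
            self.lost += packet_count - self.last_count - 1
        self.last_count = packet_count
        self.received += 1

    def loss_rate(self) -> float:
        total = self.received + self.lost
        return self.lost / total if total else 0.0

counter = LossCounter()
for n in (1, 2, 5, 6):   # packets 3 and 4 were sent but never arrived
    counter.on_packet(n)
```

Here the receiver sees four packets and infers two losses, regardless of how many silence intervals occurred in between.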

As a general principle, we are trying to avoid keeping distributed state that must be kept up to date. For example, two Etherphones must be able to tell whether a connection between them should be closed. In addition (another principle) a voice connection should not be disturbed by other traffic. It is better to prevent a call from starting than to allow it to be disturbed once set up. Thus: even during silence intervals, Etherphones transmit packets, although they will be shorter than packets carrying voice data and will be transmitted at a lower rate (two per second).