Ethernet Voice Protocols
To File  Date February 8, 1984
From L. Stewart  Location PARC
Subject Ethernet Voice Protocols  Organization CSL
Release as VoiceProtocol.tioga
Came from
 /Indigo/Voice/Documentation/VoiceProtocol.tioga
Last edited by L. Stewart, February 19, 1984 1:23 pm
Abstract This paper describes the voice transport protocols used by the PARC-CSL Voice Project.
The CSL Voice Project
For the past 2 1/2 years, work has been underway in the Computer Science Laboratory of the Xerox Palo Alto Research Center on the integration of voice into our office systems.
Goals and motivation
This section is new.
System Architecture
Give overview. The relevant sections are the Etherphone, the Voice File Server, and the non-existent trunk server.
The Etherphone
This section is new.
Network Capacity Analysis
This section stolen from Globecom paper.
One major advantage of Ethernet voice transmission is its efficient sharing of a high-bandwidth channel among users. Combined with a method that we call silence detection, this sharing permits concentration effects similar to those of Time Assigned Speech Interpolation (TASI) [ATT]. The silence-detection method involves transmitting only those packets containing significant signal energy.
To estimate the resulting capacity, consider an implementation whose voice packets contain 160 eight-bit samples (representing 20 milliseconds of voice), collected at 125 microsecond intervals. Using the 10 Mbps standard Ethernet, each packet includes 64 bytes of overhead, so each (one-way) voice connection consumes 90 Kbps. Assuming a worst case of 50% utilization for low access delay, the bandwidth available in a network dedicated to voice transmission is 5 Mbps, yielding 28 full-duplex "trunks" (55 one-way voice streams). If we estimate a TASI advantage of 1.6 for this size trunk group [ATT], a dedicated 10 Mbps Ethernet can support about 45 conversations. If at most 20% of telephones are in use at once, such a network could support in excess of 225 subscribers. This argument requires that control traffic and other non-voice traffic be negligible compared to voice traffic.
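The arithmetic can be checked quickly; the sketch below (in C) follows the figures in the text, with the memo's rounding of 27.9 trunks up to 28 accounting for the 45-conversation and 225-subscriber results.
/* Back-of-the-envelope check of the capacity figures in the text. */
#include <stdio.h>

int main(void)
{
    double payloadBytes  = 160.0;   /* 20 ms of 8-bit samples  */
    double overheadBytes = 64.0;    /* per-packet overhead     */
    double packetsPerSec = 50.0;    /* one packet every 20 ms  */

    double bpsPerStream = (payloadBytes + overheadBytes) * 8.0 * packetsPerSec;
    double usableBps    = 10e6 * 0.5;          /* 50% utilization ceiling */

    double oneWay = usableBps / bpsPerStream;  /* ~55 one-way streams     */
    double trunks = oneWay / 2.0;              /* ~28 full-duplex trunks  */

    /* 28 trunks x TASI 1.6 = about 45 conversations, rounded as in the text */
    double conversations = 45.0;

    printf("per-stream: %.0f bps\n", bpsPerStream);              /* 89600 */
    printf("one-way streams: %.1f, trunks: %.1f\n", oneWay, trunks);
    printf("subscribers at 20%% peak use: %.0f\n", conversations / 0.20);  /* 225 */
    return 0;
}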
Thus it seems clear that a simple single-cable Ethernet voice-transmission system can support a sizeable installation. The number of subscribers could be further increased by linking a number of Ethernet cables with packet gateways or conventional circuit switches. Installation of a multiple-cable system would require attention to usage patterns and other traffic-engineering considerations.
Security
Security can be a problem with Ethernet or any other broadcast medium, since no physical protection against eavesdropping is possible. We impose security on our transmissions by encrypting all voice and control traffic using the Data Encryption Standard [DES]. The availability of single-chip DES devices operating at 900 Kbps makes this possible.
Control protocol security
Security is built into the Remote Procedure Call package used for the control protocols. The facilities provided by RPC include authentication, using a method similar to the trusted-authentication-server approach advocated by Needham and Schroeder [Needham], and secure calls, using DES encryption. The authentication phase of call setup (which need happen only occasionally) is responsible for key distribution.
Voice protocol security
Security of the voice protocols is handled by DES encryption of most of the interesting fields of the packet headers as well as of the voice data itself. Since our hardware does not support cipher block chaining (CBC), we use the electronic-codebook (ECB) scheme of independently enciphering each eight-sample block. This is not as secure as CBC, but it seems good enough for our system.
The vulnerability of ECB for voice stems from the repeating patterns associated with silence. Even if the electronics generate some idle-channel noise, there would be only a few distinct patterns for silence. One might imagine that an adversary could partially decode voice through knowledge of the silent segments.
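As a sketch of the per-packet operation just described, the 160-sample payload is cut into eight-byte blocks and each block is enciphered independently. The des_encrypt_block routine below is a stand-in, not a real interface; in the actual system the DES chip does this work in hardware.
/* ECB encryption of one voice packet payload: 20 independent 8-byte blocks. */
#include <stdint.h>
#include <stddef.h>

/* hypothetical single-block DES primitive (8 bytes in, 8 bytes out) */
extern void des_encrypt_block(const uint8_t key[8],
                              const uint8_t in[8], uint8_t out[8]);

static void encrypt_voice_payload(const uint8_t key[8], uint8_t samples[160])
{
    for (size_t off = 0; off < 160; off += 8) {
        uint8_t block[8];
        des_encrypt_block(key, samples + off, block);
        /* identical plaintext blocks (e.g. idle-channel silence) map to
           identical ciphertext blocks -- the ECB weakness noted above */
        for (size_t i = 0; i < 8; i++)
            samples[off + i] = block[i];
    }
}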
Initial key problem
In a distributed system incorporating many simple devices (such as our Etherphones), there is a problem of getting the system off the ground. We have chosen to program an initial collection of keys into the EPROM of each Etherphone. (Each EPROM is different.) These keys are secure as long as the Etherphone hardware is physically secure. In addition, we have incorporated a mechanism for entry of keys from outside the system. A user may enter a key through the telephone touchpad. The associated touchpad transitions are not transmitted over the network, but are retained locally as the key.
Traffic analysis
The Etherphone system, at present, provides no security against traffic analysis. Both control and voice packets necessarily contain cleartext network addresses, so by network monitoring, it is possible to determine who is talking to whom, and when. Through more general use of the conference call (multicast) facilities described below, a great deal more security against traffic analysis could be provided, because there would not be a fixed relationship between network address and telephone.
Ethernet Voice Protocols
The voice protocols for the Etherphone are based on the preceding analysis. We transmit 8000 8-bit samples per second, using the industry-standard mu-255 encoding [Henning]. The Etherphone actually supports two related protocols, one for interactive voice (telephone calls), the other to play and record stored voice.
The interactive-voice protocol was designed to meet a delay budget of 40 milliseconds end-to-end. This delay consists of a packetization delay of 20 milliseconds, hardware latency of 5 milliseconds for encryption and Ethernet transmission, software delays of up to 5 milliseconds, and what we call an anti-jitter delay of 10 milliseconds. To meet this budget, the protocol specifies the transmission of fifty packets per second, each containing 160 voice samples and 36 bytes of addressing and control overhead.
Anti-jitter delay is introduced by buffering at the receiving station to allow for variations in the arrival times of packets. One might say the protocol operates with about one-half packet of buffering. This delay budget does not allow time for retransmissions in the event of lost packets. This has proven acceptable, because the native packet loss rate of well-designed Ethernet components has been shown to be less than one packet in two million [Hupp].
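A minimal sketch of how such buffering might be realized follows; it is not the Etherphone's exact scheduling, just the shape of it. The first packet of a talkspurt is held for the 10 millisecond anti-jitter delay, and later packets are played 20 milliseconds apart, so ordinary arrival jitter is absorbed by the half-packet of buffered samples.
/* Receiver-side playout scheduling with a fixed anti-jitter delay (sketch). */
#include <stdint.h>

#define PACKET_MS      20   /* 160 samples at 8 kHz  */
#define ANTI_JITTER_MS 10   /* about half a packet   */

typedef struct {
    int      inTalkspurt;
    uint32_t nextPlayMs;    /* local clock time to start playing the next packet */
} PlayoutState;

/* Returns the local time at which this packet should start playing;
   0 here stands for "arrived too late, discard" (a real implementation
   would signal that differently). */
static uint32_t schedule_playout(PlayoutState *s, uint32_t arrivalMs)
{
    if (!s->inTalkspurt) {
        s->inTalkspurt = 1;
        s->nextPlayMs  = arrivalMs + ANTI_JITTER_MS;
    } else if (arrivalMs > s->nextPlayMs) {
        return 0;           /* missed its slot */
    }
    uint32_t playAt = s->nextPlayMs;
    s->nextPlayMs += PACKET_MS;
    return playAt;
}

/* When silence is detected, the talkspurt ends and the next spurt
   re-establishes the buffer depth. */
static void end_of_talkspurt(PlayoutState *s) { s->inTalkspurt = 0; }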
Recording and playback of stored voice has slightly different characteristics. Transmission delay is not particularly important, provided that the startup delay from request to playback is reasonably short. The stored-voice protocol calls for 100 milliseconds of buffering. The additional buffering is desirable because the shared voice file server may not be able to schedule the transmission times of packets as accurately as a dedicated voice terminal. The additional buffering also permits retransmission, although we have not found the need to implement it.
Voice Packet Types
There are several different packet formats in use, but all of them lie within the Pup framework. Pup [xxx], which stands for PARC Universal Packet, is both an internetwork packet and a family of internet protocols. The basic Pup looks like this:
PUP -- PARC Universal Packet
Pup: TYPE = MACHINE DEPENDENT RECORD [
length: CARDINAL, -- in bytes, including checksum word
transportControl: Byte,
type: Byte,
id: RECORD [a, b: CARDINAL], -- contents depends on packet type
destination: RECORD [net, host: Byte, socketA, socketB: CARDINAL],
source: RECORD [net, host: Byte, socketA, socketB: CARDINAL],
contents: -- 0 to 532 bytes --,
checkSum: CARDINAL -- all ones means not-checksummed
];
At the datagram level, all Pups look alike.
On the 3 Mbps experimental Ethernet, and on the 1.5 Mbps voice network, Pups are encapsulated in Ethernet packets:
Experimental Ethernet Packet
EthernetPacket: TYPE = MACHINE DEPENDENT RECORD [
preamble: CARDINAL, -- 15 0's and a 1
destination: Byte, -- 0 is the broadcast address
source: Byte,
type: CARDINAL, -- octal 1000 is a Pup
contents: Pup,
crc: CARDINAL
];
General
The id field of the Pup is relevant to voice packets. This 32-bit field is broken into two 16-bit fields by the voice protocols. The second word is always used as a monotonically increasing packet serial number. This serial number is used primarily for statistics collection.
The first id word varies in application among the different voice packets, but generally serves as a timestamp.
Stored-voice protocol packet
This packet type is used to carry voice information from the voice file server to the Lark.
The first id word, id.a, indicates the time-to-play. It is a timestamp, in milliseconds according to the Lark clock, of when the packet is to be played out.
The second id word, id.b, is a sequence number of packets sent from the voice file server to the Lark on the current connection.
The probeRequest bit tells the Lark to reply to this packet with a packet of type ProbeReply.
BluejayVoiceType = 250
id.a: timeToPlay
id.b: bluejay sequence number
StoredVoicePacketContents: TYPE = MACHINE DEPENDENT RECORD [
blank1: [0..77B],
ignore: BOOL,
probeRequest: BOOL,
blank2: [0..17B],
encryptionKeyIndex: [0..17B],
energy: CARDINAL,
silenceMS: CARDINAL,
blank3: CARDINAL,
data: ARRAY [0..0) OF Byte
];
LarkVoiceType = 254, LarkFirstVoiceType = 252
id.a: Lark time corresponding to first byte of silence.
id.b: Monotonic increasing packet sequence number (ignored).
LarkPacketObject: TYPE = MACHINE DEPENDENT RECORD [
blank1: [0..377B],
blank2: [0..17B],
encryptionKeyIndex: [0..17B],
energy: CARDINAL,
silenceMS: CARDINAL, -- preceding data
blank3: CARDINAL,
data: ARRAY [0..0) OF Byte
];
ProbeReplyType = 251
id.a: Lark clock
id.b: Lark sequence number
ProbeReplyContents: TYPE = MACHINE DEPENDENT RECORD [
replyID: CARDINAL,
maxPacketSize: CARDINAL,
maxBuffers: CARDINAL,
blank1: CARDINAL
];
RetransmitRequestType = 253
id.a: Timestamp of first data not received.
id.b: Monotonic increasing packet sequence number.
RetransmitRequestContents: TYPE = MACHINE DEPENDENT RECORD [
];
Multicast
The Ethernet coaxial cable is a broadcast medium: every packet transmitted can be heard at every station, although normally stations do not listen to packets addressed to others. This broadcast capability provides a means for naturally supporting conference calls and other value-added broadcast voice services.
Conference Calls
In a broadcast environment, each terminal participating in a conference call can hear and be heard by each of the other participants. Rather than using a traditional conference bridge, conference calls can be implemented through multicast. In this scheme, all participants transmit and listen using the same network address, so that packets transmitted by a given terminal are received by each of the terminals participating in the call (including the transmitting terminal). Each terminal then combines the voice streams from each of the other participants, in effect forming a distributed conference bridge.
This design for conference calls suffers the disadvantage that each terminal receives, and must reject in software, its own packets. If the network interface hardware can support more than one address simultaneously, this restriction can be removed. Imagine a three-party conference call involving parties A, B, and C. Imagine further that the Ethernet interface hardware can support address recognition for three addresses. Party A would listen to address 100 (A's native address) and to addresses 201 and 202, and would transmit to address 200. Party B would listen to address 101 (B's native address) and to addresses 200 and 202, and would transmit to address 201. Party C would listen to address 102 (C's native address) and to addresses 200 and 201, and would transmit to address 202. With this configuration, each party would continue to receive control traffic (addressed to the various native addresses) and the voice streams from the other participants, but not his own voice stream.
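A small sketch of this address assignment, assuming an interface that can match several receive addresses at once; the addresses come from the example above and everything else is illustrative.
/* Listen/transmit addresses for the three-party example. */
#include <stdint.h>

typedef struct {
    uint8_t native;      /* control traffic, always received   */
    uint8_t transmit;    /* this party's outgoing voice stream */
    uint8_t listen[2];   /* the other parties' voice streams   */
} ConfAddresses;

static const ConfAddresses party[3] = {
    /* A */ { 100, 200, { 201, 202 } },
    /* B */ { 101, 201, { 200, 202 } },
    /* C */ { 102, 202, { 200, 201 } },
};

/* A packet is accepted if addressed to the native address or to one of the
   listened-to voice addresses; a party never hears its own stream. */
static int accept_packet(const ConfAddresses *p, uint8_t dest)
{
    return dest == p->native || dest == p->listen[0] || dest == p->listen[1];
}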
The multicast forwarder
The present Etherphone system operates using the "experimental" 3 Mbps Ethernet, but running at half speed, 1.5 Mbps. The network interface controller is a Xerox-developed VLSI device. The experimental Ethernet design made no real provision for the use of multicast; therefore the available network interface controller is capable of listening only to a single network address at a time (although the address may be reprogrammed from time to time) and to broadcast packets, which are received by all stations. These limitations pose some interesting problems for the application of multicast conference calls.
One obvious solution might be to use the broadcast address for conference calls. Unfortunately, broadcast voice packets in high volume would swamp many network stations. Two key observations permitted a solution: control traffic is light and the Ethernet controller address is programmable.
Ordinary point-to-point calls are made by transmitting voice packets using the participating terminals' native addresses. Conference calls are made using a common multicast address, so that all participants both transmit and receive on a common address. This has two implications. First, each terminal must reject its own voice packets. Second, because the hardware cannot listen both to the native address and to the multicast address at the same time, all control traffic must be redirected to the multicast address for the duration of the conference call.
We wished to avoid the large modifications to the internal structure of the remote procedure call package that would have been necessary to change the network address of a caller or callee in mid-call. Instead, taking advantage of the light control traffic load, we chose to implement a multicast forwarder.
The Dorado Ethernet controller does address recognition in microcode. The destination address of each packet on the network is inspected. We modified the Ethernet microcode to use a bit table of all possible network addresses (256 of them). When a packet arrives, the microcode uses the bit table, indexed by the packet destination address, to decide whether or not to accept the packet.
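The acceptance test itself is simple; the sketch below shows the bit-table logic (in the real system this test is performed by the Dorado Ethernet microcode, not C).
/* One bit per possible 8-bit host address, indexed by packet destination. */
#include <stdint.h>

static uint8_t addressTable[256 / 8];   /* 256 bits, one per host address */

static void listen_to(uint8_t addr) { addressTable[addr >> 3] |=  (uint8_t)(1 << (addr & 7)); }
static void ignore(uint8_t addr)    { addressTable[addr >> 3] &= (uint8_t)~(1 << (addr & 7)); }

/* Accept the packet only if the bit for its destination address is set. */
static int accept(uint8_t destination)
{
    return (addressTable[destination >> 3] >> (destination & 7)) & 1;
}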
Suppose that terminal A, with native address 100, is actually operating using address 200, due to its participation in a conference call. Control traffic for terminal A is transmitted using address 100. The multicast forwarder however, has been set up to receive packets addressed to host 100 and to retransmit them to address 200.
This all works because packets are filtered by software as well as by hardware address recognition. Since all participants in the conference call are using address 200, terminals other than A would receive A's control traffic. Such extraneous packets are rejected by the other terminals because the internal, higher-level addresses in those packets identify A, not B or C. In addition, the multicast forwarder must handle only control traffic, which demands much lower bandwidth than the voice traffic.
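A minimal sketch of the forwarder's job as described above: packets arriving at a conferencing terminal's native address are retransmitted to the conference address. The send_packet routine and the table contents are illustrative, not the real implementation.
/* Multicast forwarder logic (sketch). */
#include <stdint.h>

typedef struct {
    uint8_t nativeAddress;      /* e.g. 100 for terminal A             */
    uint8_t conferenceAddress;  /* e.g. 200, where A is now listening  */
} ForwardEntry;

static ForwardEntry forwardTable[16];
static int          forwardCount;

extern void send_packet(uint8_t destination, const void *pkt, int length);

static void forward_if_needed(uint8_t destination, const void *pkt, int length)
{
    for (int i = 0; i < forwardCount; i++) {
        if (forwardTable[i].nativeAddress == destination) {
            /* control traffic only; voice never passes through the forwarder */
            send_packet(forwardTable[i].conferenceAddress, pkt, length);
            return;
        }
    }
}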
Clock drift
Because each Etherphone has its own crystal-controlled clock, synchronization of two Etherphones participating in a conversation presents an intriguing problem. A pair of Etherphones may have clocks differing by as much as one part in 10,000. In the steady state, this frequency error causes the quantity of buffered voice at the receiving Etherphone to slowly increase or decrease. Since silence detection is used to reduce transmission bandwidth, the correct buffer depth is reestablished during each silent interval. Communications with the voice file server avoid this problem by use of software feedback -- the file server is driven by the Etherphone clock.
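The size of the effect is easy to estimate; the arithmetic below is a sketch, and the one-minute figure is simply illustrative.
/* At one part in 10^4, buffered voice at the receiver gains or loses about
   0.1 ms per second of continuous speech, roughly 6 ms over a one-minute
   talkspurt -- comparable to the 10 ms anti-jitter allowance, which is why
   the buffer depth must be re-centered at silent intervals. */
#include <stdio.h>

int main(void)
{
    double relativeDrift = 1e-4;                  /* worst-case clock error */
    double msPerSecond   = relativeDrift * 1000.0;
    printf("buffer drift: %.2f ms/s, %.1f ms/min\n",
           msPerSecond, msPerSecond * 60.0);      /* 0.10 ms/s, 6.0 ms/min  */
    return 0;
}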
A longer term solution also uses software feedback to control the frequency of clocks throughout the network to exactly the same frequency. Imagine that each terminal has an adjustable clock, and that some station on the network with an accurate clock, (such as the gateway to the synchronous telephone system) transmits time-of-day packets on a periodic basis.
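As a rough illustration of the idea only (not the Etherphone's implementation), such a loop might trim the local clock rate in proportion to the offset observed from each time-of-day packet; the gain and names below are assumptions.
/* Generic software clock-discipline loop (sketch). */
typedef struct {
    double rateTrim;    /* fractional adjustment applied to the local clock */
} ClockLoop;

static void on_time_of_day_packet(ClockLoop *loop, double masterMs, double localMs)
{
    double offsetMs = localMs - masterMs;   /* positive: local clock is ahead */
    double gain     = 1e-3;                 /* small, assumed loop gain       */
    loop->rateTrim -= gain * offsetMs;      /* slow the clock if it is ahead  */
}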
Software phase-locked loop and packet scheduling
Explain here how Interval Timer and the probes work to adjust TX time.
Voice File Server
The voice file server uses the stored-voice protocol to record and play back digitized voice segments. It runs on a Xerox Dorado equipped with a 300 Mbyte disk. At 8000 bytes per second, the resulting voice-storage capacity is over 8 hours. Capacity can be expanded as usage warrants. The disk is organized so that storage is allocated in one-second units. This permits disk activity on behalf of a single record or playback operation to be limited to one contiguous disk transfer per second. Nevertheless, user software can specify the order and duration of voice-segment access at a grain of one millisecond. This facility makes it possible to experiment with voice editing. The very high network-communication loads generated by multiple voice-protocol connections pose a special challenge. The present implementation is capable of handling about eight simultaneous transfers.
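The one-second allocation unit and one-millisecond addressing grain combine simply: at 8000 bytes per second there are 8 bytes of mu-law voice per millisecond. The sketch below shows the arithmetic; the names are illustrative, not the file server's actual interface.
/* Map a millisecond offset within a voice segment to its storage location. */
#include <stdint.h>

#define BYTES_PER_SECOND 8000u
#define BYTES_PER_MS     8u

typedef struct {
    uint32_t unit;        /* which one-second allocation unit           */
    uint32_t byteOffset;  /* offset of the requested millisecond in it  */
} VoiceAddress;

static VoiceAddress locate_ms(uint32_t millisecond)
{
    VoiceAddress a;
    a.unit       = (millisecond * BYTES_PER_MS) / BYTES_PER_SECOND;
    a.byteOffset = (millisecond * BYTES_PER_MS) % BYTES_PER_SECOND;
    return a;
}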
All voice is stored in encrypted form. The associated keys are stored by the Etherphone control server in a directory along with information granting appropriate access to each voice segment. Since each voice segment is stored only once, regardless of the number of users granted access, the file server directory also keeps track of the number of outstanding references to each segment. Voice storage is reclaimed automatically when no references remain.
Internet Considerations
Here are some relevant entries from the Voice vs. Data paper.
Proposals
Class-of-Service. Stray from the ideologically pure notion of a stateless datagram network and build a system that understands some semantics of the kinds of traffic using it. We have already departed from purity by recognizing "interactive traffic" and promoting small packets to the head of queues. Legitimizing these activities will require a class-of-service field in our internet packets.
Employ traffic engineering. In our present datagram-only internet, we have escaped with only rudimentary traffic engineering because we had only one class of users. With the addition of voice traffic and with larger internets in general, we will have to keep loose track of "blocking probability", line utilization, and user populations and add capacity as appropriate.
Mechanisms
Load Control. At least for real-time applications, users should be turned away once the load on a network or link has reached capacity. The same information used on a minute by minute basis to handle loading can be used in the longer term to guide traffic engineering.
Hints. Although routers, gateways, and other load control points must keep track of who is using how much bandwidth for what, they can do so in a nearly stateless fashion by using hints. We want the advantages of centralized control without the reliability problems. The same bandwidth and delay requirements that cause real-time or voice packets to pass fairly often permit the "state" information in routers to time-out rapidly. Bad information will not persist long enough to disturb the internet.
Examples
Here are examples of the application of these proposals.
Managing the bandwidth of a point-to-point line
Consider the case of two 10 Mbit Ethernets connected by a point-to-point 1.5 Mbit link. There is plenty of bandwidth around, but it is not infinite. A pair of routers connected by a 1.5 Mbit line would have a parameter indicating that up to 1 Mbit of line capacity may be used for voice (or other real-time traffic), with the remainder reserved for data. When there is less than 1 Mbit of real-time traffic flowing, the idle capacity can be used for data datagrams (and the data queue empties faster), but when there is real-time traffic around, it gets the reserved capacity. The routers keep an eye on packets coming in. Suppose a router sees a real-time, how-much=64 Kbit packet for a new source-destination pair. The router takes this as a hint that a new "stream" is being set up and makes a table entry "reserving" capacity for the connection. By using the how-much field together with the packet length, the router can predict when the next packet of the connection is expected. The table entry can be deleted (timed out) if the next packet doesn't show up. (Thus there is no "stream setup" protocol; it is all done with hints.) When the (n+1)-th apparent stream shows up, the router drops the packet and sends an error reply: "no capacity now".
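A minimal sketch of such a hint table follows, under the assumptions above (1 Mbit reserved for real-time traffic, entries refreshed by each packet and timed out when the predicted next packet fails to appear). The names, table size, and "allow one missed packet" slack are illustrative.
/* Hint-based reservation table for real-time streams on a point-to-point line. */
#include <stdint.h>
#include <string.h>

#define MAX_STREAMS        16
#define REAL_TIME_CAP_BPS  1000000u   /* 1 Mbit of the 1.5 Mbit line */

typedef struct {
    uint32_t source, destination;     /* internet addresses               */
    uint32_t bps;                     /* from the packet's how-much field */
    uint32_t expiresMs;               /* entry dies if no packet by then  */
} StreamHint;

static StreamHint table[MAX_STREAMS];
static uint32_t   reservedBps;

/* Returns 0 if the packet should be dropped and a "no capacity now"
   error reply sent; 1 if it may be forwarded. */
static int note_real_time_packet(uint32_t src, uint32_t dst, uint32_t bps,
                                 uint32_t packetBits, uint32_t nowMs)
{
    if (bps == 0)
        return 0;
    uint32_t dueMs = nowMs + 2u * (packetBits * 1000u / bps);  /* allow one miss */
    int freeSlot = -1;

    for (int i = 0; i < MAX_STREAMS; i++) {
        if (table[i].bps != 0 && table[i].expiresMs < nowMs) {
            reservedBps -= table[i].bps;            /* hint timed out */
            memset(&table[i], 0, sizeof table[i]);
        }
        if (table[i].bps == 0) { freeSlot = i; continue; }
        if (table[i].source == src && table[i].destination == dst) {
            table[i].expiresMs = dueMs;             /* refreshed by each packet */
            return 1;
        }
    }
    if (freeSlot < 0 || reservedBps + bps > REAL_TIME_CAP_BPS)
        return 0;                                   /* the (n+1)-th stream: refuse */
    table[freeSlot] = (StreamHint){ src, dst, bps, dueMs };
    reservedBps += bps;
    return 1;
}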
Managing the bandwidth of an Ethernet
Consider the use of an Ethernet for telephones. So long as the total offered load is below the "knee" in the delay curve, the Ethernet works very well. Much above the knee, its performance may not be adequate for voice. The exact position of the knee is dependent on the distribution of packet sizes and on the average number of stations contending for the channel but it is in the 50% to 80% area for voice packets.
If too many people attempt to make calls at the same time, the Ethernet delays would grow rapidly, disrupting service for all. One solution is to register calls with a server -- callers would not get dial-tone if the Ethernet could not handle their call. Another solution is to monitor the general levels of Ethernet traffic and to split the network into two parts (adding capacity) well before the loading reaches dangerous levels. (This is just a localized version of the counterproposal described above. Its successful application might depend on separate Ethernets for voice and for data.)
More complex is the problem of using an Ethernet as a transit network in an internet. While a telephone server might register calls and perform load control on a local basis, who could take responsibility for internet traffic? One approach might have the routers (perhaps using special hardware) watch every packet on the Ethernet and keep track, by hints, of the traffic levels. Transit connections could then be blocked before entering a congested region.