Audio Proposal
Inter-Office Memorandum
To: Thacker, McCreight, Baskett, Lampson, Rovner, Crowther
Date: July 9, 1981
From: S. M. Ornstein, L. Stewart, D. Swinehart
Location: Palo Alto
Subject: Audio Proposal
Organization: CSL
XEROX
Filed on: <Audio>lydoc>AudioProposal.memo
ABSTRACT
This memo presents what we think is an appropriate Audio game for CSL, discusses a vision of new functionality that might come into our lives as a result of the endeavor, describes the enabling architecture as we see it, and then discusses a plan for setting out.
INTRODUCTION
Why audio? We see two domains. First, in the real-time world, our information management skills can give us improved control over voice communications (the telephone). Second, when we can manipulate voice as we now manipulate data and integrate voice with our other endeavors (messages, documents, . . .), we can add an additional dimension to all those activities.
We have concluded that the most useful way to add audio into our systems is to start by providing a new and better, Ethernet-based, form of telephone service. It seems likely that we can make genuine improvements in our own lives; and, in the process, we will have to build those fundamental tools for handling voice from which other extensions into the world of audio will grow naturally.
Our eventual goal is to provide every member of CSL with a 20-40 chip stand-alone Ethernet telephone, or Etherphone. This device would include a telephone handset, keypad or keyboard, and a single line display. An Etherphone would transmit voice and control information over the Ethernet rather than over conventional phone wires: to another Etherphone, to an audio file server, or to a telephone gateway to the national telephone network. The collection of Etherphones can operate as telephones by themselves, with a few extra features such as redialing the last number. With the aid of an Etherphone server, the system could provide name to number translation and call management functions. With the audio file server, answering machine functions, voice annotation of documents, and voice messages become possible. The telephone system gateway provides outside call access.
Because we cannot build such a stand-alone device at least until single chip Ethernet controllers are available, our immediate goal is to build a 2-5 Etherphone key telephone system (KTS) using Alto I's and the existing Auburn audio board. We can avoid an immediate need for a telephone system gateway by retaining the current office phone lines -- rather than having a few trunks to the central office, we would have one line per Etherphone. ``Internal'' calls would travel via Ethernet, while ``outside'' calls, for the moment, would be placed over the existing phone system. (Think of this organization as a distributed gateway.)
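The ``distributed gateway'' arrangement amounts to a per-call routing decision, which can be sketched in Python as follows. This is only an illustration: the directory contents, the extension numbers, the host names, and the function name route_call are all invented here and do not come from the memo.

```python
from typing import Dict, Tuple

# Hypothetical directory: office extension -> Ethernet host name of that
# office's Etherphone. Offices without an Etherphone are simply absent.
ETHERPHONES: Dict[str, str] = {
    "4401": "etherphone-a",
    "4402": "etherphone-b",
}


def route_call(number: str, etherphones: Dict[str, str] = ETHERPHONES) -> Tuple[str, str]:
    """Decide how a call from an Etherphone should travel.

    "Internal" calls (the destination also has an Etherphone) go over the
    Ethernet; everything else is placed on the caller's existing phone line.
    """
    host = etherphones.get(number)
    if host is not None:
        return ("ethernet", host)
    return ("phone_line", number)
```

Because every Etherphone keeps its own office line, the fallback path needs no shared trunk or central gateway.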
We will use this initial system for development of the Ethernet voice transport protocols and call management protocols and for the development of value added functions. Our intention is that this first system be a full-time operational system (for its small community) rather than a ``demo'' system. It is for this reason that we plan to use separate computers (not our workstations) as Etherphones. A very rough estimate for this effort might be 12 to 18 months for a few Etherphones, a first audio file server, and a small heap of value added functions.
Capabilities/functions
To provide some common ground for the discussion of design choices, here is an ill-ordered list of mixed capabilities and functions, presented without regard to desirability.
We can identify three classes of audio capabilities: voice as data, call management, and system control. The distinctions are sometimes blurred, but here are some of the capabilities and higher level functions that fall into these categories.
Voice as Data -- Storage and retrieval of voice bits and automatic interpretation of those bits.
Voice storage and retrieval -- answering machine, voice Laurel, document annotation
Compression -- Take less space for voice storage
Speech Synthesis -- have the machine read you your electronic mail over the phone while you are out of town
Speech Recognition -- voice control of our existing information systems, voice control of the telephone
Before proceeding further, we would like to exclude compression and speech synthesis and recognition from this discussion. We claim that compression is not an immediate issue. Even at 64,000 bits per second, a single T-300 will hold around eight and a half hours of speech -- let's ignore the problem for now. Speech synthesis and recognition admittedly hold exciting prospects, but we cannot afford to work on them and we can build very exciting systems without them.
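The storage estimate above is simple arithmetic, sketched here in Python. The memo gives only the bit rate (64,000 bits per second) and the result (around eight and a half hours); the usable disk capacity is not stated, so the figure below is back-derived from those two numbers and should be read as an assumption, not a drive specification.

```python
BITS_PER_SECOND = 64_000  # uncompressed voice rate quoted in the memo


def hours_of_speech(capacity_bytes: int, bits_per_second: int = BITS_PER_SECOND) -> float:
    """Return how many hours of speech fit in capacity_bytes at the given rate."""
    seconds = capacity_bytes * 8 / bits_per_second
    return seconds / 3600


def bytes_for_hours(hours: float, bits_per_second: int = BITS_PER_SECOND) -> int:
    """Return the storage needed to hold `hours` of speech at the given rate."""
    return int(hours * 3600 * bits_per_second / 8)


# Assumption: usable capacity implied by the memo's "eight and a half hours".
ASSUMED_CAPACITY_BYTES = bytes_for_hours(8.5)

if __name__ == "__main__":
    print(ASSUMED_CAPACITY_BYTES)
    print(hours_of_speech(ASSUMED_CAPACITY_BYTES))
```

At this rate one hour of speech costs 28.8 million bytes, so the eight-and-a-half-hour figure corresponds to roughly 245 million bytes of usable storage.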
Call Management -- Information management of the part of telephony that is already digital.
Data base for phone numbers -- white/yellow pages, call by name, not by number
Redial last number, speed calling
Computer setup of calls -- a daily calendar of calls to be placed, and when, time accounting for lawyers etc.
System Control -- Telling the phone system what you want and getting back status information.
Conventional forwarding -- ring once here, twice there, then switch downstairs.
Come from forwarding -- Logging in to a terminal sets up forwarding to where you are.
Conference calling -- ways of ascertaining the availability of all the parties and setting up a conference call. Obviously great overlap with call management.
Intercom/paging -- The speaker in your phone works even when it is on the hook.
Filtering/Negotiation -- ``Hold all calls except from Taylor.'', ``But it's important.''
Call Waiting, Hold, Camp-on busy
Background calls -- ways of using your phone without necessarily making it busy for others. One might envision keeping an intercom channel open between two offices all morning while engaged in some cooperative activity. It would be important to still permit incoming calls to get through.
Attendant functions -- answer by name (``Smith's office'')
Manager-Secretary interaction -- ``Your call to Smith is ready.''
Call identification -- calling number available from system
Locators -- a personal transponder, the phone closest to you would ring.
VISIONARY OVERVIEW
The next section provides the reader with a preview of some of the improved services a user might see in this new world.
New User Services
Most of the improvements in "telephone" service have to do with getting phone participants more smoothly into direct voice communication - or circumventing (supplementing) the need for such direct communication through voice messages. We argue that the telephone provides reasonable communication once you are talking to the other person. It is in the process of negotiating the establishment of a conversation between the participants that the phone system is most inadequate and annoying. For the caller this means dialing, waiting for the busy signal, redialing, waiting for the callee to answer, or a secretary, waiting for the real callee to be found and come to the phone, or waiting while the airline reservation system plays music at you . . . etc. For the callee, it means listening to the ringing, answering, establishing who is calling, what they want, whether it's important enough to bother with now, etc. For "Big Guns" a personal secretary mediates both incoming and outgoing calls and thereby takes over most of the burden. We hope that we can provide some of this same relief for smaller guns.
We must be careful, because the conventions of phone use are deeply established. Although people are often annoyed by the present arrangements, they also tend to be quite conservative and to resent changes in the system. Our proposals could cause substantial shifts of burden between callee and caller -- as will become evident below.
First let's distinguish between a conversation and a message. Although the distinction seems obvious, in talking about audio people often use the terms improperly (and sometimes even interchangeably). A conversation is a two way matter - it is used if you have a question or some matter that needs to be interactively discussed. A message is something you have to impart (maybe short, maybe long) that doesn't require immediate, interactive response. That's why we call Laurel a "message" system. Obviously you can turn a sequence of messages into a sort of conversation but it's awkward - and becomes increasingly useless as the degree of interaction rises. Let us first turn to conversations.
Benefits to the Callee
In the current phone system, the caller has the upper hand. If you are talking to someone in your office and the phone rings, you don't know who it is, how urgent it is -- nothing. Naturally you play it safe and interrupt your conversation to answer -- frequently to your regret. If you are lucky the person on the other end will:
1. quickly identify him/herself
2. tell you what it's about, and
3. if it's not a short question, ask if you're free to talk now.
Many people aren't that polite -- you have to listen until you figure out it can wait, and then interrupt to say "I'll call you back". If you are lucky enough to have a private secretary, you can provide her(him) with a filter for incoming calls which will typically let through only certain ones (emergencies, girlfriends, boyfriends, etc.).
In our visionary world, the phone will similarly filter incoming calls, letting through only those you have specified as admissible. Your filter can be as complex as you are willing to specify. ("John and Mary under any circumstances, Peter if he calls about the budget, others only if they claim it's urgent, but under no circumstances George or Nelson"). If a call arrives from within the system, it will arrive complete with some amount of digital information: who is calling, urgent or not, perhaps even information about the subject of the call ("budget"). With that information your phone can easily determine how to handle the call. For calls arriving from outside the system either blanket restrictions could apply or we could offer advanced attendant facilities. (When a call for you is switched to the attendant, your filtering information is presented so that the attendant will know what action to take: "Please hold on" or "May I take a message.") Or if you choose, incoming calls could be directed to something like today's answering machine and a message taken.
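The example filter quoted above can be sketched as an ordered list of rules, written here in Python. Everything in the sketch is illustrative: the IncomingCall fields, the action names, and the first-match-wins ordering are assumptions, and the memo does not say how rules should interact. In particular, this version lets Peter through on a claim of urgency even when the subject is not the budget, which is one possible reading of "others".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

RING = "ring"
TAKE_MESSAGE = "take_message"


@dataclass(frozen=True)
class IncomingCall:
    """Digital information assumed to accompany a call from inside the system."""
    caller: str
    urgent: bool = False
    subject: Optional[str] = None


Rule = Tuple[Callable[[IncomingCall], bool], str]

# First matching rule wins. The blocked callers come first so that a claim
# of urgency cannot override "under no circumstances".
PROFILE: List[Rule] = [
    (lambda c: c.caller in {"George", "Nelson"}, TAKE_MESSAGE),
    (lambda c: c.caller in {"John", "Mary"}, RING),
    (lambda c: c.caller == "Peter" and c.subject == "budget", RING),
    (lambda c: c.urgent, RING),
]


def decide(call: IncomingCall, rules: List[Rule], default: str = TAKE_MESSAGE) -> str:
    """Return the action of the first rule that matches, else the default."""
    for matches, action in rules:
        if matches(call):
            return action
    return default
```

Keeping the profile as data rather than code is one way to let each user make the filter "as complex as you are willing to specify".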
From this scenario, it is clear that new burdens fall upon the caller - to identify himself, specify the topic and the importance of his call, etc. One is used to such buffering when one makes a call to or receives one from a Big Gun, but otherwise, as a caller, one is used to getting through and will resent buffering by a non-human mechanism. Although one can try to soften this by making the mechanism as polite as possible, there seems no way to get around the fact that we are fixing an age-old problem wherein the caller has had the ability to interrupt the callee (except for very strong willed individuals or those with private secretaries - rarely do I observe people letting their phones ring while they continue a conversation). This "upper hand" has, in the past, been justified because the only way to filter incoming calls was for a human to answer and find out whether it was important. People don't mind if you answer but then tell them you're busy and ask them to call back; I postulate that this is in part because they have an appeal -- they can claim urgency. I believe that giving them that option will go a long way toward relieving the unhappiness that is to some extent inevitable (if we are to clean up the clutter of incoming phone calls).
Benefits to the Caller
For outgoing calls, you can call the receiver by name. You can either make the call while staying on the line - that is you will deal directly with what happens from the far end - or you can turn the call over to the phone to make for you.
We could construct a daily schedule/calendar system, with lists of whom to call and when: ``make my next call please''.
The phone should ideally distinguish between (1) the remote phone's ringing, (2) a busy signal, (3) a real person answering, and (4) the remote phone answering. How much of this sort of thing we can do remains to be seen. In your telephone profile you can specify your "persistence" constant - i.e. how long to go on ringing, whether and how frequently to try again on a busy call, etc. Of course we can have a ``try again'' button that redials the last call.
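The "persistence" profile described above might look something like the following Python sketch. The four outcomes mirror the four cases the phone should ideally distinguish; the field names, the default values, and the attempt and wait hooks are hypothetical, since the memo leaves open how much of this detection is actually achievable.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Outcome(Enum):
    NO_ANSWER = "rang without answer"
    BUSY = "busy signal"
    PERSON = "a real person answered"
    MACHINE = "the remote phone answered"


@dataclass(frozen=True)
class Persistence:
    """Per-user settings; the defaults are placeholders, not recommendations."""
    max_ring_seconds: int = 30
    busy_retries: int = 3
    retry_interval_seconds: int = 60


def place_call(
    attempt: Callable[[int], Outcome],
    profile: Persistence,
    wait: Callable[[int], None] = lambda seconds: None,
) -> Outcome:
    """Make one call attempt, retrying on busy as the profile allows.

    `attempt(max_ring_seconds)` is a hypothetical hook that tries the call
    once and reports what happened; `wait(seconds)` pauses between tries.
    """
    outcome = attempt(profile.max_ring_seconds)
    retries_left = profile.busy_retries
    while outcome is Outcome.BUSY and retries_left > 0:
        wait(profile.retry_interval_seconds)
        retries_left -= 1
        outcome = attempt(profile.max_ring_seconds)
    return outcome
```

Passing the attempt and wait steps in as functions keeps the retry policy separate from whatever mechanism eventually detects ringing, busy, and answer.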
In the case that the destination is busy or tied up, we can provide the caller with the option of leaving a message on an audio file server. This should be better than the current situation with answering machines. Placing the record button under control of the caller should result in better voice messages since the caller would have time to collect his thoughts.
A Potential Benefit for Both - a Locator
Suppose everyone wore a tiny device which broadcast his/her identity very locally, say 15 feet - to the nearest telephone. Such a mechanism would allow the system to recognize when you were near your phone. Thus it could desist from trying to set up a call for you after you left your office. Furthermore, as recipient of your calls, your phone, using its Locator, could change its answering strategy when you left your office. It could begin to answer calls with "He's out; please leave a message". Alternatively, with a more sophisticated system the phone nearest you could ring with a ``signature tune'' wherever you went - assuming your profile indicates that's what you want.
Voice Messages
Voice messages travel from a phone to the Audio File Server to await later retrieval. If the recipient has a workstation, then news about the message (title, sender, length, etc. but not the contents) will go into his regular Laurel mailbox -- and will show up there in the listing of his messages. When he "displays" a voice message, it plays out (from the Audio File Server) on his phone -- instead of on his screen. If the recipient has no workstation but only a phone, then the Etherphone's more limited features will provide a primitive capability for auditing voice mail. The display will let you scan the voice message list (titles, senders) sequentially, listening only to those you choose (just like today's answering machines plus some extras).
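The split described above, with a short notice in the mailbox and the recording itself left on the Audio File Server, can be sketched as a small Python data structure. The field names, the audio_ref handle, and the play_on_phone hook are invented for illustration and are not part of the memo.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class VoiceMessageNotice:
    """What goes into the mailbox: news about the message, not its contents."""
    title: str
    sender: str
    length_seconds: int
    audio_ref: str  # hypothetical handle for the recording on the Audio File Server


def mailbox_listing(notices: List[VoiceMessageNotice]) -> List[str]:
    """Render one listing line per voice message, as a mail reader might."""
    return [f"{n.sender}: {n.title} ({n.length_seconds} s, voice)" for n in notices]


def play_message(notice: VoiceMessageNotice, play_on_phone: Callable[[str], None]) -> None:
    """'Display' a voice message by asking the phone to play it from the server."""
    play_on_phone(notice.audio_ref)
```

The same notices could drive the phone-only case: a single-line display would step through the listing lines one at a time and call play_message for the ones the user selects.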
Background: Three alternative approaches
Now that we have described at least some aspects of what a new phone system could provide, we turn to the how.
One can separate three sub-areas of work: telephony, filing, and system integration. Telephony has to do with the transmission of voice data, with the elementary control functions of placing calls, and with terminal equipment (hardware). Filing has to do with the ability to store and retrieve voice messages from a file server on the Ethernet and with basic editing capabilities. Integration has to do with advanced control facilities such as use of a data base to store telephone numbers, and with the manipulation of voice data in cooperation with our other activities: voice Laurel messages, voice annotation of documents, etc. Integration depends upon the basic utilities provided by Telephony and Filing.
It is important to note that filing and some kinds of integration can proceed with audio I/O equipment that is not combined with the telephone. We could concentrate on voice editing and annotation, and ignore telephones, but we think that there is a lot to be gained by the integration of telephony and filing.
We have identified three alternative system approaches to audio. These differ (in various ways) in where the control information is passed and in where the voice data itself is passed. We believe that, at the top level, very nearly the same collection of functions could be provided through any of these approaches, but that the choice greatly affects the performance and elegance of those functions. We should not forget that eventually a difference in degree becomes a difference in kind: a system that takes 1/2 second to place a call is very different from a system that takes 10 seconds to place a call.
1. Existing telephone systems
The current CSL phone system (Centrex) provides a single direct dial number for each office and has a few value-added features such as call-forwarding and an attendant console. (One can forward calls to another number and one's phone transfers to the attendant automatically after three rings.) The distinguishing features of this system are that control information is ``in-band'', consisting of beeps and clicks on the voice channel, and that the voice channels pass through conventional phone wires and through a switch downtown. (With the exception of switch location, these remarks also apply to existing (dumb) PABXs.)
In this system, both the control signals and the voice data pass through the standard phone system (although our computers would look up numbers over the Ethernet and perhaps check with the destination workstation whether or not the destination phone was in use.)
In order to use this existing phone system to provide the capabilities described above we would build an updated version of the old ``Ross Box'' (after Bill Ross). This device would connect to a workstation and permit the machine to pick up and dial one's office phone. In addition, we would need some number of audio I/O interfaces on server machines. The audio board device, in small numbers, would provide a way of getting voice on and off the Ethernet, where our programs can work with it, file it, annotate with it, and so on. The A/D and D/A conversions are provided by a server, rather than by one's own machine. Essentially the voice as data functions would be delegated to servers, the call management functions would be done by our workstations placing calls by dialing our existing phones, and the system control functions would be limited to those already available by dialing one's phone.
This system provides the capability for a number of really impressive systems: voice messages, voice annotation of documents, semi-automatic call placement, and so on. There are also some crippling disadvantages: our control of the operation of the voice transmission is somewhat uncertain, and an important resource, one's telephone, is tied up for extended periods. The second problem is really a consequence of the first. Because our existing system uses in-band signalling, the only controls available over the telephone system are obtained by electrically picking up the office telephone and electronically generating beeps. This can be made to work, but is a slow and somewhat uncertain process. The progress of an attempted call is determined by the return of various noises over the voice path: ringing, dial-tone, busy, reorder, etc. It is quite difficult for a machine to sort out these noises, yet they cannot be ignored because calls do not always get through; a line may be busy. In addition, the placing of a call requires several seconds: one or two to dial the call, perhaps one for the system to connect a local call, and as many as six seconds for ringing to be detected at the destination. This means that for applications such as annotation of a document, one's office phone is effectively tied up. Callers will get a busy signal! We might obtain "call-waiting" or similar functions from the phone company, but again, it would be hard for our machines to recognize the associated "beep" and almost as hard to handle the situation in a reasonable manner.
Of course we could buy a second telephone line for everyone but the speed and uncertainty of call placement would still be with us.
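The call placement delay quoted for the existing phone system can be tallied stage by stage, as in this Python sketch. The dial and connect figures are taken from the text; the memo gives only an upper bound ("as many as six seconds") for ringing detection, so the zero lower bound for that stage is an assumption.

```python
from typing import Dict, Tuple

# Stage durations in seconds as (best case, worst case).
STAGES: Dict[str, Tuple[int, int]] = {
    "dial": (1, 2),            # "one or two to dial the call"
    "connect": (1, 1),         # "perhaps one for the system to connect a local call"
    "detect_ringing": (0, 6),  # upper bound from the text; lower bound assumed
}


def setup_time_bounds(stages: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Return (best, worst) total seconds to place a call through all stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return (best, worst)


if __name__ == "__main__":
    print(setup_time_bounds(STAGES))
```

Under these figures a machine-placed call takes between 2 and 9 seconds, which sits at the slow end of the half-second versus ten-second contrast drawn earlier in the memo.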
2. Control of a PABX
Under this scenario we would replace our present telephone system with a commercial PABX and use a computer to control it. D/A and A/D conversion functions for manipulation of voice-as-data would be done by a server, and call placement would be done either by manually dialing one's telephone or by having one's workstation instruct the PABX (possibly through another server) to place a call.
In this system, the voice data still travels through more or less conventional wires and switches, but the control information is entirely digital, on the Ethernet, and under our control.
In this system, calls inside the PABX would be very fast and we would have easy access to the state of the system. If we want to know if a number is busy, just (digitally) ask the PABX. It would be possible to connect to a server for just a few seconds to record a voice annotation, while still remaining open for incoming calls. We could instruct the switch to do just about anything, such as forwarding calls after leaving one's office, rather than before. At base, these functions are available because the phone system would be entirely under our control. We would still require an adequate number of servers to meet our D/A and A/D needs.
The key disadvantages of this system are that we do not have such a controllable PABX (getting one would cost quite a bit), and before installing such a system, we would have to negotiate control of the switch with the vendor.
3. Etherphone
The basic premise of the Etherphone approach is that actual transmission of the voice data, as well as all control information, is done in digital form over the Ethernet. There are many variations, but the eventual system might provide each CSL member with a 20-40 chip microcomputer based telephone interfaced to the Ethernet. Connections to the outside telephone world would be done by servers with trunks to the phone company. This scenario has the advantage of complete control over the telephone transmission system. We benefit by the natural multiplexing of the ether and by direct access to voice-as-data. Control of the system is distributed; negotiation for a call might take place directly between the source and destination Etherphones.
The disadvantages of this system are the uncertainties of Ethernet voice (not too serious), and the major fact that a 20-40 chip Etherphone cannot be built at least until a single chip Ethernet controller is available. (But we can build prototype Etherphones now, using whole computers for the job.)
Discussion
The key telephone system proposal is a first step on the route to a full Etherphone system. We feel that the first option -- control of the existing phone system -- is unacceptable because it does not offer sufficient reliable functionality and performance. The PABX route -- control of a commercial telephone switch -- is impracticable for us because we do not have one. The third alternative, Etherphone, is difficult to pursue now because it is expensive to build an Etherphone today (although that will change).
Our KTS proposal is really a combination of the first and last scenarios. By building a few expensive (Alto I) Etherphones now, we can give a few people all the benefits of the Etherphone and develop all the required protocols and work on applications while at the same time working towards our true goal of the 20-40 chip Etherphone for everyone. In addition, the KTS idea, wherein all the clients retain their original phone lines, avoids the problem that not everyone has an Etherphone. No one's view of the phone system need change; the same 4-digit number still works, but for those with Etherphones, many value-added functions become available.
In addition, Ethernet telephony may well be cost-competitive in a few years with a conventional PABX.
What about alternative audio hardware?
One obvious way to avoid the expensive separate Etherphone is to place audio hardware in our workstations. This approach is probably fine for annotation of documents, but our workstations are not designed for 100% availability (You can't get calls while you are in the debugger.) and they are not designed for real-time performance (Your call to your friend breaks up because the collector starts running.) Basically, if we want to use the system only while a special program is running then workstation audio hardware is fine, but we can't build a telephone system that way -- it has to work all the time.
One way to avoid using up Alto I's is to construct stand alone Etherphones out of commercial 16 bit microcomputers. At the present time, both the processor and Ethernet would be full boards, the audio hardware would be a few extra chips, and a fairly bulky power supply and cabinet would be needed. On top of that, we would probably not have a very good program development environment. Our early efforts would be greatly diverted by hardware and software development struggles.