The only forms of office automation most offices have yet seen are the electric typewriter, the copier, and the telephone. Voice communication is and will remain among the most powerful tools that people use to get their work done. If Xerox is to remain a force in offices it will have to understand the role of voice and the relationship of voice to its other products and services. In CSL, apart from the intriguing systems problems that we have encountered while building voice applications, we have two major reasons for wanting to study voice communications:
Despite the immense investment in research and development over the last 110 years, the functionality of the telephone (now extended by the answering machine and its more sophisticated voice-mail siblings) still leaves quite a bit to be desired. The telephone is easy to learn but hard to use, in the sense that an enormous proportion of attempted calls are unsuccessful, either because they fail to reach the intended party or because they succeed at inopportune times. The attempts to integrate the telephone with more powerful user interfaces and with voice-mail capabilities are making some progress in taming the telephone, and in exploiting the use of recorded voice in conjunction with visual documents, but all of the attempts we have seen lack much of the insight we have gained into effective ways to integrate voice with the workstation capabilities that we have begun to master.
Even so, the need to interact with the voice products and services of other vendors is even more compelling than we have found in our other areas of interest (those addressed by Interpress, Interscript, XNS, international mail standards, etc.). For example, it is unlikely that Xerox would be able to capture a significant percentage of the business telephone switch market, even if we knew how to or wanted to. Having identified the functions we want our systems to provide, we need to understand how to achieve them in conjunction with competitors' business telephone systems and with the existing telephone network. This should include both means for accommodating our requirements to the existing systems, and careful advice to the industry for improving the interface to their capabilities. This ought to take the form of a new voice communications architecture (in the spirit of XNS and/or Interscript).
I. What Are We Doing?
A. Who?
The voice communications group at present consists of Dan Swinehart, Polle Zellweger, and Doug Terry (in proportions not yet fully determined), with help from a consultant, Luis Felipe Cabrera. Polle and Doug have made only limited commitments (6 months to a year) to the project.
B. Why?
After many attempts to include voice in larger office information systems projects, and after a number of one-person underground hacks, we decided in 1981 that to fully exploit the ideas we had for voice we needed to focus on it. A three-person group, with occasional help from consultants and students, has produced the present prototype voice environment based on Etherphones.
C. Customer
The Etherphone system serves CSL staff members, and to some extent we are also their customers, since they are providing valuable insights into what works and what doesn't. As with all our research, we expect this work to influence the world at large through our publications. We have been able to share our approaches with the advanced development group working on workstation-based voice products in OSD, and with planning and marketing people from XSG and the corporate staff. Because the Etherphone is based on special-purpose hardware and cabling, we have deferred any plans to extend it to other parts of PARC or Xerox.
D. Output
We have published an overview of the Etherphone system in the proceedings of a conference and as a Blue and White report. But the primary output to date is the existing Etherphone prototype: microprocessor-based telephones that use the Ethernet for both digitized voice transmission and control; a server providing switching management, a telephone user interface, and a voice file service; and some simple voice-message and telephone-management tools in Cedar (Finch). At this time we are beginning a new phase of system design.
II. What Do We Want to Do?
A. Who?
A group of three (preferably four) people is adequate to make continued progress on voice communications within CSL. The voice work calls on existing and planned facilities of Cedar, making it a lot easier than starting from scratch. Conversely, once the planned capabilities are in place, the Etherphone system will become a resource that other programmers will be able to use; by so doing they will further extend the work. There are, however, some aspects of the project, such as switchboard operator or receptionist functions, that will not occur through serendipity, but will require more specific effort.
We do not expect the present makeup of the voice project to last too long; its members have other interests that they want to pursue. If we decide we want to progress beyond what we can accomplish this year we will need to recruit one or two people with strong interests in voice communications (or other forms of person-to-person communications).
B. Why?
The "why" was spelled out in the introduction. We have new ideas that we're anxious to explore in the area of user functionality, and Xerox needs to understand better (and preferably to help specify) how its voice products will fit into the larger arena.
C. Customer
We hope to extend the customer base for Etherphones to other research areas within PARC. The Portland outpost of SCL is one possibility for extensive interactions; their interest in fostering close communications among geographically separate groups makes them a natural prospect for collaboration. We also intend to explore with ISL some opportunities for joint work in voice synthesis (using commercial text-to-voice products effectively in applications such as reading text-mail by telephone). The CoLab project would be another intriguing context in which to explore flexible uses of voice.
There is a remote possibility that some or all of the Etherphone approach could be developed into a competitive product. (That was not our original intent, but the performance of Ethernet voice transmission has actually been better than we expected.) In such a case, we would be involved in the transfer of the existing technology into a development organization.
Our most important recipients will be the customers, inside and outside of Xerox, of the voice architecture described in the next section. It should be of considerable help to SDD as they extend their plans beyond the initial basic telephone-management functions.
D. Output
With the current Etherphone system, we have done the "means" part; now it's time to get to the "ends." We have begun designing extensions to the Etherphone that will provide:
A more interesting set of telephone-management tools for the workstation.
Voice ropes and some sample applications of them. Voice ropes are fragments of recorded voice that can be segmented and combined in a manner analogous to the treatment of text in Cedar. Additional operations are required for recording, playing, and identifying phrase or sentence boundaries, and for dealing with the immutable and voluminous nature of recorded voice. The applications we will pursue include Tioga-based visually-cued editors for voice, the annotation of Tioga files with voice commentary, and the use of annotated Tioga documents as a means for providing integrated voice mail.
Implementation of Etherphone-related workstation programs in other environments. We're willing to work with programmers who would like to implement Finch-like programs in XDE, Interlisp, or Smalltalk. The most likely first example will be in Interlisp, since Henry Thompson's RPC implementation is nearing completion.
We have no plans to explore the theory of speech recognition, or to incorporate any existing speech-recognition capabilities into our prototypes, but we do intend to produce a design that would accommodate a successful speech-recognition implementation, including provisions for indicating and assisting in the correction of incomplete or imperfect translations.
InterVoice. An architecture for office voice communications.
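To make the voice-rope item in the list above more concrete, here is a minimal sketch in Python of one way such a structure might behave. It is an illustration only, not the design of the Etherphone extensions: the names `Segment`, `VoiceRope`, `recording_id`, `substr`, `concat`, and `replace` are all assumptions, as is the choice to measure positions in samples. The point it tries to show is that, because recorded voice is immutable and voluminous, editing can be done by rearranging references to stored recordings rather than by copying audio.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Segment:
    """A reference to a span of samples inside a stored recording."""
    recording_id: str  # hypothetical key into a voice file service
    start: int         # first sample of the span, inclusive
    length: int        # number of samples in the span


@dataclass(frozen=True)
class VoiceRope:
    """An immutable sequence of segments; editing never copies audio."""
    segments: Tuple[Segment, ...] = ()

    def __len__(self) -> int:
        return sum(s.length for s in self.segments)

    def concat(self, other: "VoiceRope") -> "VoiceRope":
        return VoiceRope(self.segments + other.segments)

    def substr(self, start: int, length: int) -> "VoiceRope":
        """Return the rope covering samples [start, start + length)."""
        if start < 0 or length < 0 or start + length > len(self):
            raise ValueError("span outside rope")
        out = []
        pos = 0  # rope offset of the current segment's first sample
        end = start + length
        for seg in self.segments:
            seg_end = pos + seg.length
            lo, hi = max(start, pos), min(end, seg_end)
            if lo < hi:  # this segment overlaps the requested span
                out.append(Segment(seg.recording_id,
                                   seg.start + (lo - pos), hi - lo))
            pos = seg_end
        return VoiceRope(tuple(out))

    def replace(self, start: int, length: int,
                new: "VoiceRope") -> "VoiceRope":
        """Return a rope with [start, start + length) replaced by new."""
        head = self.substr(0, start)
        tail = self.substr(start + length, len(self) - start - length)
        return head.concat(new).concat(tail)


# Usage: splice two hypothetical recordings and cut across the join.
greeting = VoiceRope((Segment("rec-1", 0, 8000),))
comment = VoiceRope((Segment("rec-2", 100, 4000),))
message = greeting.concat(comment)
excerpt = message.substr(6000, 4000)
```

In this sketch every operation returns a new rope and leaves the recordings untouched, which mirrors the treatment of text ropes that the list item alludes to. Recording, playback, and phrase-boundary detection are outside its scope.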
The architecture proposal needs further discussion. Architectures as diverse as Interpress and XNS have two important attributes in common:
They describe, in exacting detail, the range of capabilities that their clients have available, the way those capabilities are structured, and exactly how they are used. XNS describes the interfaces that clients use to communicate with other applications; Interpress defines the file formats that clients need to obey in order to get documents printed. In both cases, it is possible to define precisely, as subsets of the total architectures, any restrictions that a particular implementation might place on its clients.
They serve as detailed specifications for the implementors of the services that are needed to support their clients. The best example comes from XNS: each layer in the architecture has to be implemented by somebody, whose clients will then implement the higher levels. This is very important for voice; we want to be able to specify precisely the features that a PBX should provide, for example.
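The two attributes above can be illustrated with a small sketch, under loudly stated assumptions. The capability names below are hypothetical (this memo does not enumerate the features of any voice architecture), and `Implementation` and `missing` are invented for illustration. The sketch shows only the idea that an implementation is described as a precise subset of the architecture, so that a client can determine exactly which required features a given product, such as a PBX, does not provide.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Hypothetical capability names; the memo does not enumerate any.
ARCHITECTURE: FrozenSet[str] = frozenset(
    {"place_call", "transfer_call", "record_voice",
     "play_voice", "store_message"}
)


@dataclass(frozen=True)
class Implementation:
    """A product described by the subset of the architecture it supports."""
    name: str
    provides: FrozenSet[str]

    def __post_init__(self) -> None:
        # An implementation may restrict the architecture, not extend it.
        unknown = self.provides - ARCHITECTURE
        if unknown:
            raise ValueError(f"not in the architecture: {sorted(unknown)}")


def missing(impl: Implementation,
            required: FrozenSet[str]) -> FrozenSet[str]:
    """Capabilities a client needs that this implementation lacks."""
    return required - impl.provides


# Usage: a hypothetical PBX checked against a voice-mail client's needs.
pbx = Implementation("example PBX",
                     frozenset({"place_call", "transfer_call"}))
voice_mail_client = frozenset({"place_call", "record_voice",
                               "store_message"})
gaps = missing(pbx, voice_mail_client)
```

A real specification would describe interfaces and formats in far more detail than a set of names; the set is used here only because it makes the subset relationship easy to state and check.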
In the voice area, one can identify some simple protocols at about the level of RS-232 and maybe SDLC, but higher-level architectures tend to be implicit in the implementations of telephone switching systems, PBXs and the like. All the existing proposals for integrating data and voice, or integrating local area networks and PBXs, inside and outside of Xerox, are expressed at this primitive level, so they are not as interesting as they sound at first. To use a specific example, they would give little guidance for how to build an integrated Etherphone-like system using a Northern Telecom SL-1 switch and an EMS voice mail machine (the one Xerox sells now). Given an InterVoice architecture, it should be much easier to decide which pieces of it Xerox wants to buy and which pieces it wants to make. With any luck, it could be presented as a de facto standard in the spirit of XNS. We are attempting to cast the design for extensions to the Etherphone prototype in terms of such an architecture.
We have identified functions, and flexible methods for providing them, that dominate anything we have seen in the market or in research projects we know of. We also know of no attempt, successful or otherwise, to produce a comprehensive architectural specification for the use of voice in personal information systems; "InterVoice" has a very good chance of serving as an archetype for this kind of thing.