IEEEPaper.tioga
Swinehart, September 30, 1986 8:49:24 am PDT
An Experimental Environment for Voice System Development (or something)
by Moe, Larry, and Curly
Notes:
Underlines used for original sketch material (wrote first, from whole cloth)
"Other" font used for extracts. (Lots) more is included than is intended to be kept.
Vanilla font used for actual intended text (repeats the other stuff where necessary). There isn't any yet.
Sources for extracts
ey85: 1985 Yearend report, before emasculation
Mar85: March 1985 customers-and-outputs document.
DBT: Doug's list of components.
83-8: CSL 83-8: Adding Voice to an ...
Ades: Ades&Swinehart Tioga Voice paper.
Don't think we should include the terms "Intervoice", "Voice ropes" (for different reasons).
Should never say anything twice.
Intro
ey85 Voice is a vital form of interpersonal communication in the office. As a non-interactive medium (in messages and as annotations to text documents) it has yet to be fully assessed, but it appears to have considerable value. (The August 5, 1985 issue of the "Seybold Report on Office Systems" says: "Although voice applications have been relatively slow to take off in the United States, we suspect that lack of integration is the villain. Once voice functions are truly intertwined with text and data applications in an organization's information systems, they will become an irresistible and compelling adjunct to any office.")
Ades There has been a trend in office systems for some time toward a single computer-based system that can satisfy all of a user's working needs. With a trend also away from large time-shared systems and toward personal computers there has evolved the concept of the workstation, a powerful personal computer that performs a wide range of functions. These may include text composition, document preparation and typesetting, support for program development, spreadsheets and accounting facilities, and general-purpose local computing as well as access to remote computers. Today's workstation is usually equipped with a high-resolution display, a keyboard, a mouse or other pointing device, a network connection, and possibly a high-capacity local disk. The workstation environment of the near future will also include scanning equipment and document architectures that can describe scanned images [4], thereby taking over the role of the facsimile machine.
We've built this voice stuff as an integral part of a multi-media programming (and other) environment. Goal: to see what a mature office architecture could do for voice. Approach: build an experimental prototype voice system to investigate ways of integrating voice into a distributed workstation environment.
83-8 At PARC, we and our colleagues have approached the problem of office productivity from a different perspective. Our research into personal information systems has provided considerable experience with the design, production, and use of personal computers and programs that augment or supplant traditional paper-based office activities. At the heart of this work is the personal computer, such as the Xerox 8010 Professional Workstation. The personal workstation gives its user a coherent and integrated user interface supporting the creation, modification, storage, and printing of documents, memos, messages, and personal databases. High-speed Ethernet communications link each workstation to those of other users and to other computers that provide a variety of services, such as file storage, electronic mail distribution, and high-quality printing.
83-8 Ironically, the typical office in our laboratory includes a multi-function workstation and a stand-alone telephone instrument: two isolated systems. Telephone equipment manufacturers have approached the merger of these technologies by treating office applications as extensions to their basic telephone service [Strudwick, Jud, Edwards]. With the Etherphone project, we have the opportunity to gain additional insights into the potential for voice by instead treating voice communications as an extension to our existing office systems.
DBT Our approach has been to build an experimental prototype voice system to investigate ways of integrating voice into a distributed workstation environment.
ey85 The CSL Voice project is an effort to define and validate experimentally a prototype architecture, for which we have coined the term "Intervoice," that can incorporate live and recorded voice into office systems. Intervoice must be able to specify the role of telephone transmission and switching, workstations, voice file servers, and other network services in supporting voice communications — both real-time telephone calls and recorded voice as it is stored, manipulated, and experienced in documents. It must be an open architecture, permitting programmers to create new voice-related applications and modify existing ones without having to understand the detailed workings of the voice system and without endangering its normal operation. Ideally, this architecture could evolve into a standard defining the role of each major component, so that multiple vendors could cooperate to provide advanced voice functions in conjunction with our workstations.
Mar85 The only forms of office automation most offices have yet seen are the electric typewriter, the copier, and the telephone. Voice communication is and will remain among the most powerful tools that people use to get their work done. If Xerox is to remain a force in offices it will have to understand the role of voice and the relationship of voice to its other products and services. In CSL, apart from the intriguing systems problems that we have encountered while building voice applications, we have two major reasons for wanting to study voice communications:
Ades The office telephone and the workstation together satisfy nearly all of a user's communications needs. It is therefore useful to consider combining them into one system. This affords two major possibilities:
Taming the telephone (what would Alex do if he started today?)
83-8 The twelve pushbuttons and hookswitch on a telephone, combined with a set of call progress tones from the central office, provide a reasonable user interface for placing and answering telephone calls. However, newer telephone systems have made available myriad additional features without substantially improving the user interface. Multiple-line telephones and telephones with additional pushbuttons for common functions improve matters, but most attempts to provide more telephone capabilities become too complicated to control long before their designers run out of ideas. Many features are never used because they are too hard to remember and too hard to use.
83-8 Similar problems with the composition and editing of text and graphical documents led to the development of our workstation-based systems [Smith]. The contextual power of the high-resolution display screen, "mouse" pointing device, and menu-based software have increased the ability of the casual user to master large numbers of operations and quite complex situations. We believe that the management of interactive voice will submit to the same techniques, provided the workstations have adequate programmed access to the telephone-switching functions.
· Ades Modern office telephone systems offer all kinds of facilities, such as call forwarding, dialing by name or abbreviation, and three-way conferencing. These are invariably cumbersome to use, hard to learn, and hard to remember, since the user interface is limited to a twelve-button keypad, some progress tones, and perhaps a few special-purpose buttons and lights. We believe that the user would prefer to perform these functions using the power and convenience of such workstation facilities as on-screen menus, text editors, and comprehensive informational displays. The user should be able to master more of the capabilities of the telephone system and to accomplish telephone-related tasks more quickly and easily.
Applications of recorded voice, etc.
DBT The goals of the Cedar Voice project include "taming the telephone" and being able to treat voice as data.
· Ades We are beginning to see document architectures integrating text, graphics, scanned images, and other visual media [3, 4, 18]. Integration of voice (or rather of sound in general) into documents promises a wide range of new applications, from voice mail to documents containing simple vocal annotations and even to documents whose self-contained `scripts' can generate automatic audio-visual presentations.
Mar85 Despite the immense investment in research and development over the last 110 years, the functionality of the telephone (now extended by the answering machine and its more sophisticated voice-mail siblings) still leaves quite a bit to be desired. The telephone is easy to learn but hard to use, in the sense that an enormous proportion of attempted calls are unsuccessful — either because they fail to reach the intended party, or because they succeed at inopportune times. The attempts to integrate the telephone with more powerful user interfaces and with voice mail capabilities are making some progress in taming the telephone, and in exploiting the use of recorded voice in conjunction with visual documents, but all the ones we've seen lack much of the insight we have into effective ways to integrate voice with the workstation capabilities that we have begun to master.
83-8 In summary, we feel that our computing and communications environment can be brought to bear on voice in two ways. One way is to improve our control over otherwise ordinary voice communications; the other is to incorporate the voice medium into our integrated environment on an equal footing with text, graphics, and data. While the facilities we have built or proposed are not entirely new, we believe that the full integration of voice into our computer-based office systems will make existing functions more effective and new applications possible.
other motherhood from before; they interact.
In this note we will be describing the particular(?) hardware/software system we built. But we want to stress that we noticed the need for a voice architecture as well. Cedar architecture involves separate, hierarchical, open architectures for communications, filing, putting together the OS (lip service to Cedar structure), user interface, and so on. In comparison, most voice-based architectures are pretty crude or custom-tailored and not extensible.
Third goal, as yet only partly realized: develop a similar open architecture for voice that can
Support extended applications
Allow applications from multiple languages and environments
Simple things are simple, elaborate ones possible
Mar85 Even so, the need to interact with the voice products and services of other vendors is even more compelling than we have found in our other areas of interest (those addressed by Interpress, Interscript, XNS, international mail standards, etc.). For example, it is unlikely that Xerox would be able to capture a significant percentage of the business telephone switch market, even if we knew how to or wanted to. Having identified the functions we want our systems to provide, we need to understand how to achieve them in conjunction with competitors' business telephone systems and with the existing telephone network. This should include both means for accommodating our requirements to the existing systems, and careful advice to the industry for improving the interface to their capabilities. This ought to take the form of a new voice communications architecture (in the spirit of XNS and/or Interscript).
Mar85 The architecture proposal needs further discussion. Architectures as diverse as Interpress and XNS have two important attributes in common:
Mar85 They describe, in exacting detail, the range of capabilities that their clients have available, the way those capabilities are structured, and exactly how they are used. XNS describes the interfaces that clients use to communicate with other applications; Interpress defines the file formats that clients need to obey in order to get documents printed. In both cases, it is possible to define precisely, as subsets of the total architectures, any restrictions that a particular implementation might place on its clients.
Mar85 They serve as detailed specifications for the implementors of the services that are needed to support their clients. The best example comes from XNS: each layer in the architecture has to be implemented by somebody, whose clients will then implement the higher levels. This is very important for voice; we want to be able to specify precisely the features that a PBX should provide, for example.
Mar85 In the voice area, one can identify some simple protocols at about the level of RS-232 and maybe SDLC, but higher-level architectures tend to be implicit in the implementations of telephone switching systems, PBXs and the like. All the existing proposals for integrating data and voice, or integrating local area networks and PBXs, inside and outside of Xerox, are expressed at this primitive level, so they are not as interesting as they sound at first. To use a specific example, they would give little guidance for how to build an integrated Etherphone-like system using a Northern Telecom SL-1 switch and an EMS voice mail machine (the one Xerox sells now). Given an Intervoice architecture, it should be much easier to decide which pieces of it Xerox wants to buy and which pieces it wants to make. With any luck, it could be presented as a de facto standard in the spirit of XNS. We are attempting to cast the design for extensions to the Etherphone prototype in terms of such an architecture.
Etherphone Project Description
System developed over past few years in CSL — space won't permit, but see CSL 83-8 and refer to Fig 1. Need updated Figure 1 (architecture) including synthesizer(s). Doug?
Experimental Environment — Hardware architecture
DBT The prototype voice system consists of a collection of workstations, Etherphones, and servers connected by Ethernets.
Tried lots of things — for adequate control & flex, decided to build own. Only critical piece of new hardware: device called Etherphone — voice transmission on Ethernet — concept: control through Ethernet, too; Etherphone is network peripheral; bring transmission&voice switching to network. [Gloss over trunk completely, except one sentence later on.]
83-8 We surveyed several possible architectures for the system. Subscriber line access to a voice storage system and use of our existing Centrex service through DTMF signalling was rejected due to the low bandwidth and the inflexible control over switching. Use of a commercial PABX was attractive, but the necessary switching capabilities were not commercially available. The use of our own computers to control the switching operations of a standard PABX was rejected primarily due to the nontechnical difficulties of arranging the necessary cooperative effort.
83-8 Although reluctant to undertake the necessary development work, we concluded that the most effective way to satisfy our needs was to construct our own transmission and switching system. We elected to use the Ethernet to transmit voice as well as control information. We chose Ethernet because of its pervasiveness in our environment and in order to demonstrate the effectiveness of Ethernet for voice.
83-8 The broadcast nature of the Ethernet admits some new communication possibilities. Conference calls among several sites can be achieved without the need for special conference-bridge hardware: multicast techniques permit voice packets from a given site to be received and combined by each of the others [Forgie]. In the extreme, voice from a meeting or conference can be transmitted once and received by any number of listeners, without consuming any more bandwidth than a single conversation consumes.
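[Illustration only, not from any of the sources.] A minimal sketch of how each receiving site might combine the voice packets it hears from the other conference participants, so that no conference bridge is needed. It assumes 16-bit linear samples and one equal-length frame per sender per interval; the name `mix_frames` and the sample format are hypothetical, not the Etherphone's actual encoding.

```python
from typing import Dict, List

# Assumed sample range: 16-bit signed linear samples (an assumption, not the
# Etherphone's documented encoding).
SAMPLE_MIN, SAMPLE_MAX = -32768, 32767


def mix_frames(frames_by_sender: Dict[str, List[int]], listener: str) -> List[int]:
    """Combine one frame from every site except `listener` into a single frame."""
    others = [frame for sender, frame in frames_by_sender.items() if sender != listener]
    if not others:
        return []
    # Only mix as many samples as every sender supplied.
    length = min(len(frame) for frame in others)
    mixed = []
    for i in range(length):
        total = sum(frame[i] for frame in others)
        # Clip so the sum stays inside the assumed sample range.
        mixed.append(max(SAMPLE_MIN, min(SAMPLE_MAX, total)))
    return mixed


frames = {"a": [100, 200], "b": [1, 2], "c": [30000, -30000]}
heard_by_a = mix_frames(frames, "a")  # site a hears b and c, never itself
```

Each site transmits its frame once; the combining cost is paid at every receiver rather than at a central bridge.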
EY85 Etherphones: telephone/speakerphone instruments that communicate digitized voice and control information over the Ethernet. Each includes a microcomputer and encryption hardware. Approximately forty Etherphones are in daily use within CSL.
DBT Etherphones are a user's audio interface to the system. They can digitize, packetize, and encrypt voice and send it directly over an Ethernet.
DBT A special voice transmission protocol is used for sending voice over an Ethernet. All control messages and operations performed on specialized servers, such as the Voice File Server or Text-to-speech Server, are transmitted using a remote procedure call protocol.
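[Illustration only, not from any of the sources.] A rough sketch of the packetizing step, to make the split between voice transmission and control concrete. The header layout (sequence number plus payload length), the 160-byte payload size, and the names `packetize` and `depacketize` are all assumptions; this is not the actual Voice Transmission Protocol, and encryption is left out entirely.

```python
import struct
from typing import Iterator, List

# Hypothetical header: 16-bit sequence number, 16-bit payload length.
HEADER = struct.Struct("!HH")
# Assumed payload size: 20 ms of 8 kHz, 8-bit samples.
PAYLOAD_BYTES = 160


def packetize(voice: bytes, first_seq: int = 0) -> Iterator[bytes]:
    """Split digitized voice into fixed-size packets with a small header."""
    for offset in range(0, len(voice), PAYLOAD_BYTES):
        chunk = voice[offset:offset + PAYLOAD_BYTES]
        seq = (first_seq + offset // PAYLOAD_BYTES) & 0xFFFF
        yield HEADER.pack(seq, len(chunk)) + chunk


def depacketize(packets: List[bytes]) -> bytes:
    """Reorder packets by sequence number and join their payloads.

    Ignores lost packets and sequence-number wraparound.
    """
    decoded = []
    for packet in packets:
        seq, length = HEADER.unpack_from(packet)
        decoded.append((seq, packet[HEADER.size:HEADER.size + length]))
    decoded.sort(key=lambda item: item[0])
    return b"".join(payload for _, payload in decoded)


voice = bytes(range(256)) * 2
packets = list(packetize(voice))
restored = depacketize(list(reversed(packets)))  # arrival order need not match
```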
Remainder of core system in software —
Control server
. Provides interpretation of telephone actions, controls them.
. Supports connections for other sources of voice traffic, such as voice recording server.
. As we'll see, manages the interactions between WS and (own telephone and other services).
. Various specialized databases.
. Primary enforcer & provider of voice architecture.
EY85 Telephone service: a control program that provides PBX functions and manages the communications of all the other components. It runs on a dedicated server. This service and workstation program libraries implement the client programmer interface to the Intervoice architecture.
DBT A Voice Control Server manages conversations between sources and sinks of voice and controls Etherphone functions. It also maintains databases for white-page directories, Etherphone-workstation assignments, etc.
DBT Conversations are established between two or more parties by performing remote procedure calls to the Voice Control Server. Each party autonomously advances its state in the conversation. Active parties (that are voice terminals) use the Voice Transmission Protocol to actually exchange voice.
DBT During the course of a conversation, participants may asynchronously receive reports about various activities concerning the conversation, such as the availability of a new encryption key, the fact that the Voice File Server has started recording, etc.
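[Illustration only, not from any of the sources.] A toy model of the conversation mechanism described above: parties join a conversation through calls on a control server, each party advances its own state, and every participant receives reports about what happened. Ordinary method calls stand in for RPC, reports are delivered as synchronous callbacks rather than asynchronously, and the class, method, and state names are all hypothetical.

```python
from typing import Callable, Dict, List

# A report callback receives (conversation id, event text).
Report = Callable[[str, str], None]


class ControlServer:
    """Toy stand-in for the Voice Control Server; not the real interface."""

    STATES = ("notified", "active", "idle")  # hypothetical state names

    def __init__(self) -> None:
        self._parties: Dict[str, Dict[str, str]] = {}  # conv id -> party -> state
        self._listeners: Dict[str, Report] = {}

    def register(self, party: str, on_report: Report) -> None:
        """Record where reports for `party` should be delivered."""
        self._listeners[party] = on_report

    def create_conversation(self, conv_id: str, parties: List[str]) -> None:
        self._parties[conv_id] = {party: "notified" for party in parties}
        self._report(conv_id, "conversation created")

    def advance(self, conv_id: str, party: str, new_state: str) -> None:
        """Let one party change its own state in the conversation."""
        if new_state not in self.STATES:
            raise ValueError(new_state)
        self._parties[conv_id][party] = new_state
        self._report(conv_id, f"{party} is now {new_state}")

    def active_parties(self, conv_id: str) -> List[str]:
        states = self._parties[conv_id]
        return sorted(party for party, state in states.items() if state == "active")

    def _report(self, conv_id: str, event: str) -> None:
        # Every registered participant in the conversation hears every event.
        for party in self._parties[conv_id]:
            listener = self._listeners.get(party)
            if listener is not None:
                listener(conv_id, event)


server = ControlServer()
log = []
server.register("alice", lambda conv, event: log.append(("alice", conv, event)))
server.create_conversation("c1", ["alice", "bob"])
server.advance("c1", "alice", "active")
```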
Voice file service, including recording & playback, management of utterances, management of details of editing.
EY85 Voice file service: a service that can connect to Etherphones in lieu of or in addition to other conversants. Supports digital recording, playback, and flexible editing of user utterances.
DBT A Voice File Server manages recorded voice.
DBT Recording and playback of stored voice is accomplished by establishing a conversation between an Etherphone and the Voice File Server.
DBT Editing operations on recorded voice are similar to those for manipulating text strings. Editing produces new voice objects that are stored in a database and reference voice files.
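[Illustration only, not from any of the sources.] One way to picture string-like editing that produces new voice objects referencing voice files: a voice object is an immutable list of intervals into recorded files, so editing rearranges references and never copies or alters samples. The `Segment` and `Voice` types and their methods are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Segment:
    """An interval of samples within one recorded voice file."""
    file_id: str
    start: int  # first sample, inclusive
    end: int    # last sample, exclusive

    def __len__(self) -> int:
        return self.end - self.start


@dataclass(frozen=True)
class Voice:
    """Immutable voice object: an ordered tuple of file intervals."""
    segments: Tuple[Segment, ...] = ()

    def __len__(self) -> int:
        return sum(len(segment) for segment in self.segments)

    def concat(self, other: "Voice") -> "Voice":
        return Voice(self.segments + other.segments)

    def substring(self, start: int, end: int) -> "Voice":
        """Return samples [start, end) as a new object over the same files."""
        out = []
        pos = 0
        for segment in self.segments:
            seg_start, seg_end = pos, pos + len(segment)
            lo, hi = max(start, seg_start), min(end, seg_end)
            if lo < hi:
                out.append(Segment(segment.file_id,
                                   segment.start + lo - seg_start,
                                   segment.start + hi - seg_start))
            pos = seg_end
        return Voice(tuple(out))

    def insert(self, at: int, other: "Voice") -> "Voice":
        """Insert `other` at sample position `at`, as with a text string."""
        return self.substring(0, at).concat(other).concat(self.substring(at, len(self)))


dictation = Voice((Segment("f1", 0, 100),))
correction = Voice((Segment("f2", 10, 30),))
edited = dictation.insert(40, correction)  # dictation itself is unchanged
```

Because the original objects are never modified, a document or message that already refers to `dictation` keeps playing the same thing after the edit.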
Synthesizer — Etherphone with synthesizer hardware (two manufacturers) and specialized code in server to meter text... too specific.
EY85 Synthesized voice service: a recently-added facility described in more detail below.
DBT A text-to-speech server receives text strings and returns the spoken text to an Etherphone using the Voice Transmission Protocol.
Other possibilities — recognition equipment, music synthesizer supplied by ordinary computers on net, and so on.
All of above not very interesting without incorporating the office workstation. At PARC, high-performance personal computers and TS machines (Xerox machines, Vaxen, Suns, PC's — all on the network). Environments such as Interlisp, Smalltalk, Unix, Cedar. Have done the server development in Cedar, most of workstation development in Cedar (but there's an Interlisp existence proof). Many ways to look at Cedar (see Cedar papers); for this purpose, multiple document-management activities through use of structured, multi-media editor (Tioga); programming facilities remain fully available for further development directly in same integrated environment. Typical approach is to build applications in terms of others —> Electronic mail uses full document editor for text . . . See Fig. 2 for typical Cedar Screen.
EY85 Workstation programs: provide enhanced user interfaces and control over the voice capabilities. The Cedar version is called Finch, and has been in use for some time.
System exists in Internetwork Environment depicted in Fig 1; other workstations and services worldwide communicate directly. Applications of this and similar environments fairly mature by now, as you might find in [some of ours, some of somebody else's].
83-8 For system control, we were able to take advantage of the recent development of remote procedure call protocols [Birrell83, Nelson]. Remote Procedure Call (RPC) is a means for transforming the message-passing semantics of packet-switching networks into the procedure-call format familiar to programmers. RPC largely relieves applications programmers from any worries about packet formats, addressing, reliable communication, and security. This makes it possible to defer decisions about the partitioning of a distributed system until very late in the design. Our task was reduced to producing an RPC implementation for the Etherphone control processor, then specifying the procedural interfaces between different system components.
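[Illustration only, not from any of the sources.] A small sketch of the RPC idea: a client-side stub turns an ordinary procedure call into a request message, and a server-side dispatcher turns the message back into a call. JSON messages and an in-process transport are stand-ins chosen for brevity; the names `Stub`, `make_dispatcher`, and `Directory` are hypothetical and have nothing to do with the Cedar RPC implementation.

```python
import json
from typing import Any, Callable


class Stub:
    """Client-side proxy: attribute calls become request messages."""

    def __init__(self, send: Callable[[bytes], bytes]) -> None:
        self._send = send  # transport: takes a request, returns the reply

    def __getattr__(self, name: str) -> Callable[..., Any]:
        def call(*args: Any) -> Any:
            request = json.dumps({"proc": name, "args": list(args)}).encode()
            reply = json.loads(self._send(request).decode())
            if "error" in reply:
                raise RuntimeError(reply["error"])
            return reply["result"]
        return call


def make_dispatcher(service: object) -> Callable[[bytes], bytes]:
    """Server side: decode a request, call the named procedure, encode the reply."""
    def dispatch(request: bytes) -> bytes:
        message = json.loads(request.decode())
        name = message["proc"]
        proc = None if name.startswith("_") else getattr(service, name, None)
        if proc is None:
            return json.dumps({"error": "no such procedure: " + name}).encode()
        return json.dumps({"result": proc(*message["args"])}).encode()
    return dispatch


class Directory:
    """Hypothetical service with made-up entries."""

    def __init__(self) -> None:
        self._numbers = {"alice": "x100"}

    def lookup(self, name: str):
        return self._numbers.get(name)


# In-process "network": the stub hands requests straight to the dispatcher.
directory = Stub(make_dispatcher(Directory()))
extension = directory.lookup("alice")
```

The caller sees only `directory.lookup(...)`; swapping the in-process transport for a network one would not change the calling code, which is the property that lets partitioning decisions be deferred.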
Examples of Applications to Date
DBT The facilities of the voice system have been used to integrate voice into our environment in various ways.
Informational displays -- who is calling (by tune and icon) -> Fig 3.
Simple commands — ?
DBT Calls can be placed from a workstation using a telephone directory viewer or dialed by name from the telephone keypad.
Phoning from DB or browsing; see Fig. 4 for DBT Browser.
Voice Annotation and Editing —
Set up connection to file server instead of other phone —> can record arb-length dictation, connect to document. Fig 5.
DBT Voice is sent in electronic mail messages by reference. This is a specific case of the more general techniques for annotating documents with voice.
Also in Fig 5, picture of segment of voice that can be edited to edit the voice — annotations marking, color cues combine w/graceful features to provide assistance in editing and locating things later [TV paper citations].
DBT Voice viewers on a workstation display a visual representation of stored voice. Users can edit voice by rearranging parts of the existing voice or recording new voice passages to be inserted in selected places.
Built up from voice record/edit capabilities & Tioga, + ability to manage recorded-voice values. Can be used wherever Tioga can, so is available for constructing voice messages.
Further example of applications building on each other — narrated documents — cite Polle paper, show in Fig 6 (imagine scrolling action).
DBT The voice system facilities have also been used to give a running audio narration for documents.
State (Summary)
In daily use by 50 people as sole phone (connections to other lines & outside trunks provided in undescribed ways); have developed applications above. Still working to define architecture that would make these applications easier to build and more robust (problem with competing uses for connections). Want to explore a number of areas, including telephone filtering, attendant console stuff, ... Need to experiment with applications of same architecture to different workstation environments, different hardware architectures (could even combine them). Also other media, like still and real-time video.
DBT We have discovered that managing voice in a distributed environment presents some interesting problems, which are pertinent to other media such as images or video.
DBT The voice system with approximately 50 Etherphones is in everyday use by members of CSL.
Acknowledgments, References, 250 pages of appendices (only kidding)
DBT We should reference the "Adding Voice..." paper, Dan and Stephen's paper on editing, perhaps Polle's new CHI paper, perhaps Doug's distributed systems workshop paper, etc.