An Experimental Environment for Voice System Development

by Moe, Larry, and Curly

Introduction

Suppose Alexander Graham Bell had waited to invent the telephone until personal workstations and distributed computing networks had been invented. What approach would he have taken in introducing voice communications into the modern computing environment? An attempt to answer this question led to the creation of a voice communications project within the Computer Science Laboratory at PARC. Stated more concretely, the idea was to extend existing multi-media office environments with the facilities needed to handle the transmission, storage, and manipulation of voice. We believed it should be possible to deal with voice as easily as we manage text, electronic mail, or images, in an integrated fashion. The result would be a combined workstation that could satisfy nearly all of a user's communications and computing needs.

Whatever we did would need to provide conventional telephone service, so that casual users would not have to read a manual to make a phone call. Beyond that, we wished to draw on our experience with two-dimensional workstations to address voice communications from the perspective of developers of personal information systems, rather than that of designers of advanced telephone equipment. We believe that users would prefer to perform these functions using the power and convenience of such workstation tools as on-screen menus, text editors, and comprehensive informational displays.

In this note we describe the Etherphone™ system that we have developed and used to explore ways to integrate voice into a personal information environment. In separate sections, we sketch the present hardware architecture, describe some of the more compelling applications that have been built to exploit it, and briefly explore some of the software and systems issues that have surfaced. Before proceeding, we give the reader a better picture of the topics we have chosen to focus on in this work.

Areas of Exploration

"Taming the telephone": Despite an immense investment in research and development over the last 110 years, the user interface and the functionality of the telephone still leave something to be desired.
We have experimented with comprehensive on-line telephone directories, improved methods for locating people as they move about an office building, and convenient ways of leaving recorded messages, all intended to improve the placement of telephone calls. Looking at the problems of the called party, we have provided distinctive "ring tunes", call filtering based on the subject, urgency, or caller's identity, call logging, and recorded messages, all designed to reduce the intrusive nature of the telephone. We are also exploring novel kinds of interactive voice connections, such as all-day "background" telephone calls, use of the telephone system to broadcast internal talks or meetings (a sort of giant conference telephone call), and conference calls that allow side-conversations to take place.

"Taming the tape recorder": Telephone applications must deal with the real-time demands of live conversations. Another class of applications, more closely resembling traditional workstation functions, involves the manipulation and use of recorded or synthesized voice. Our investigations of recorded voice range from the conventional dictation or answering machine, through voice mail, to more flexible methods for adding voice annotations to documents. The voice produced by a commercial text-to-speech synthesizer can be used in analogous ways. We have not yet incorporated speech recognition equipment, but there is clearly a place for it in this spectrum of applications, as there is for recorded or synthesized music and other sources of audio information. Our aim is to incorporate all of these ways of generating or accepting audio information into a coherent part of a multi-media workstation environment.

On the surface, telephony and applications of recorded or synthetic voice can be considered independently, but they are clearly interrelated. On the one hand, electronic mail with voice annotations is an effective recourse when a call goes unanswered. On the other hand, the most natural device for voice input and output in an integrated environment is the telephone (or speakerphone). Not only does this permit telephone calls and other voice applications to be coordinated without confusion, but it also admits the possibility that some voice applications can be remotely located and reached by telephone, which is especially effective if connections can be established without incurring dialing and ringing delays.

An Architecture for Voice Applications

The original goals of the Etherphone project were to produce experimental prototypes to test novel approaches to taming the telephone and taming the tape recorder, drawing on our experience with text, graphics, and other visual media. Along the way, however, a more fundamental requirement has emerged: the need to create and experimentally validate a comprehensive architecture for voice applications. Such an architecture must specify the roles of telephone transmission and switching, workstations, voice file servers, and other network services in supporting voice communications, both real-time telephone calls and recorded voice as it is stored, manipulated, and experienced in documents. It must be an open architecture, permitting programmers to create new voice-related applications and modify existing ones without having to understand the detailed workings of the voice system and without endangering its normal operation.
Ideally, this architecture could evolve into a standard defining the role of each major component, so that multiple vendors of telephone and office equipment could cooperate to provide advanced voice functions in conjunction with workstation-based applications. Simple things should be simple; elaborate ones should be possible.

The best way to explain this requirement is by analogy with existing architectures. The Xerox Network Systems protocols [XNS] represent a comprehensive architecture for packet-switched data communications. XNS specifies a common packet format that all connected computers must agree on for the transmission and routing of information from one system to another (the transport layer). Below this level are various device-dependent link-level specifications for transmitting XNS packets over various media, such as Ethernets, telephone lines, or satellite channels. Above the transport level are additional levels of protocol that support reliable end-to-end connections, important services such as remote procedure calls, bulk data transport, and terminal-to-host communications, and finally various specific applications. The TCP/IP protocol family used in the DARPA research networks has a similar organization, as do many other communications protocols, all of which follow the general structure defined by the ISO Reference Model for Open Systems Interconnection [ISO].

Similarly, modern portable operating systems are structured in layers. Although these systems vary widely in their details and even in their overall philosophy [Unix, Cedar], they share many structural attributes. At the lowest level, machine-dependent programs implement an abstract machine that allows the remainder of the system to be largely machine-independent, supporting applications that can run on a variety of hardware configurations. The levels above the abstract machine provide memory and device management, services such as character streams and window packages, and finally support for very specific applications.

Finally, document architectures are beginning to emerge that specify standards for representing complex multi-media documents, permitting the interchange of documents among workstations and hardcopy printers produced by multiple vendors [Interpress, Postscript, ODA].

As diverse as these architectures are, they share several elements. They describe, in exacting detail, the range of capabilities available to their clients, the way those capabilities are structured, and exactly how they are used. For example, XNS describes the interfaces that clients use to communicate with other applications; Interpress defines the file formats that clients must obey in order to get documents printed. In both cases, it is possible to define precisely, as subsets of the total architecture, any restrictions that a particular implementation might place on its clients. Such architectures also serve as detailed specifications for the implementors of the services that support these clients. This is very important for voice; for example, it should be possible to specify precisely the features that a PBX should provide.
In comparison, most architectures supporting digital voice that occupy levels higher than the analog of the ISO transport layer, to the extent that they exist at all, are embedded in and custom-tailored to the particular applications they support. One can identify some simple protocols at about the level of RS-232 and perhaps SDLC, but higher-level architectures tend to be implicit in the implementations of telephone switching systems, PBXs, and the like. Most existing proposals for integrating data and voice, or for integrating local area networks and PBXs, are expressed at this primitive level (one exception is MICE []). They would, for example, give little guidance on how to build an integrated Etherphone-like system using a Northern Telecom SL-1 switch and a commercial voice mail machine.

We are attempting to derive such a voice architecture from our experience with the Etherphone prototype. It is this architectural work that distinguishes our efforts from systems with similar capabilities. Before describing further aspects of the architectural design as it stands today, we describe the general structure and user facilities of the system that we have built to test these ideas.

Etherphone Project Description

In designing our prototype voice system, we surveyed several possible architectures, including the use of our existing Centrex service or a commercially available PABX. We concluded that the most effective way to satisfy our needs was to construct our own transmission and switching system [CSL-83-8]. Ethernet local-area networks, which provide the data communications backbone supporting personal distributed computing at Xerox PARC, have proven to be an effective medium for transmitting voice as well as data. Our prototype voice system (see Figure 1) consists of the following types of components, connected by Ethernets:

Etherphones: telephone/speakerphone instruments, each built around a custom microprocessor board (the Lark), a modified telephone instrument, a speaker, and a microphone, together with encryption hardware and an Ethernet controller; an adjacent workstation is optional. Etherphones digitize, packetize, and encrypt voice and send it directly over an Ethernet. Our current environment contains approximately 50 Etherphones. Additional information on the Etherphone hardware and the Voice Transmission Protocol can be found in a previous report [CSL-83-8].

Voice Control Server: a control program that provides functions similar to those of a conventional PBX and manages the interactions among all the other components. It runs on a dedicated server that also maintains databases for public telephone directories, Etherphone-workstation assignments, and other shared information.

Voice File Server: a service that can hold conversations with Etherphones in order to record or play back user utterances. In addition to managing stored voice, the Voice File Server provides operations for combining, rearranging, and replacing parts of existing voice recordings to create new voice objects. For security reasons, voice remains encrypted when stored on the file server.

Text-to-speech Server: a service that receives text strings and returns the equivalent spoken text to the user's Etherphone. We currently have two speech synthesizers, purchased from different manufacturers.

Workstations: high-performance personal computers. Workstations are the key to providing enhanced user interfaces and control over the voice capabilities.
We rely on the extensibility of the local programming environment, be it Cedar, Interlisp, Smalltalk, or another, to facilitate the integration of voice into workstation-based applications. Workstation program libraries implement the client programmer interface to the voice system. Workstation applications to date have been written primarily in the Cedar environment, although an Interlisp existence proof also exists. Figure 2 shows a Cedar screen in use, including voice, text, graphical, and program development activities. In addition, the architecture allows for the inclusion of other specialized sources or sinks of voice, such as speech recognition equipment or music synthesizers. Most of the server programming has also been done in Cedar.

All of the communication required for control in the voice system is accomplished via a remote procedure call (RPC) protocol [Birrell and Nelson]. For instance, conversations are established between two or more parties (Etherphones, servers, and so on) by performing remote procedure calls to the Voice Control Server. During the course of a conversation, RPC calls emanating from the Voice Control Server inform participants about various activities concerning the conversation. Active parties in a conversation use the Voice Transmission Protocol to actually exchange voice. Multiple implementations of RPC permit workstation programs and voice applications programmed in different environments to be integrated.
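To convey the flavor of this control structure, the sketch below renders a conversation setup in Python. It is our illustration only: the actual interfaces are Cedar programs, and the names (create_conversation, add_party, report_state) are invented.

    # Hypothetical sketch of the Voice Control Server's RPC interface.
    # Names and structure are illustrative; the real system is written
    # in Cedar and uses the RPC protocol of Birrell and Nelson.

    import itertools

    class VoiceControlServer:
        """Centralized switch: parties call in; the server calls back."""

        def __init__(self):
            self._ids = itertools.count(1)
            self._conversations = {}   # conversation id -> list of parties

        def create_conversation(self, originator):
            """RPC from an Etherphone or workstation to start a call."""
            conv = next(self._ids)
            self._conversations[conv] = [originator]
            return conv

        def add_party(self, conv, party):
            """RPC to invite another party (Etherphone, recording server, ...)."""
            self._conversations[conv].append(party)
            # The server, in turn, makes RPC calls *to* each participant
            # to report progress (ringing, answered, idle, ...).
            for p in self._conversations[conv]:
                p.report_state(conv, state="ringing", who=party.name)

    class Party:
        """A participant: an Etherphone, a voice file server, and so on."""
        def __init__(self, name):
            self.name = name
        def report_state(self, conv, state, who):
            print(f"{self.name}: conversation {conv} is {state} ({who})")
            # Once the parties reach the active state, they exchange
            # voice directly, using the Voice Transmission Protocol.

    server = VoiceControlServer()
    caller, callee = Party("Etherphone-12"), Party("Etherphone-34")
    conv = server.create_conversation(caller)
    server.add_party(conv, callee)

The important property is the callback structure: parties ask the server for changes, and the server reports progress to every participant, so that a workstation tool can track a call without polling.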
Examples of Applications to Date

The Etherphone system provides a variety of user-level services, including telephone management, text-to-speech synthesis, and voice annotation and editing. These services are typically available through an Etherphone control panel, through commands that can be issued in a typescript, and through procedures that can be incorporated into client programs. Recall that our primary goal is to develop a comprehensive and robust voice architecture that permits the construction of such services; although we currently provide an interesting assortment of them, some of the more complex items, such as conference specifications, are not yet complete.

Telephone management

The voice system supports a comprehensive set of functions for managing simple telephone calls, including call placement, call logging, directory assistance, and the locating of callers and callees. (Conference and background calls are supported by the architecture and the underlying hardware, but do not yet have user interfaces.) We first describe the functionality the system provides, then the underlying architecture and some sample call scenarios, that is, the sequences of steps required to complete a few different types of calls.

Call placement. From a workstation adjacent to an Etherphone, a user can place a telephone call in several ways. She can fill in a name or number in the Called Party: field of the Finch tool and click its button; she can select a name or number anywhere on the screen (possibly in an electronic message) and click the Phone button in the tool header; she can type Phone followed by a name or number in any Command Tool viewer; or she can use one of two directory systems that present browsable lists of names and associated telephone numbers as speed-dialing buttons.

In addition, calls can be placed by name or number from the telephone keypad. To call by name, we use a simple encoding that translates each letter into the single digit printed on that key (Q, Z, and a few special characters are also given key assignments). Keypad dialing gives an error indication when the result of this encoding is not unique, as it is for AHenderson and CHenderson. Such collisions occur rarely in our relatively small database, but a more complex scheme would be needed for a system with thousands of subscribers. (We plan to present a list of choices, either on the display or audibly using the text-to-speech server, but even this would be unwieldy in a large system.) A sketch of the encoding appears below.
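The keypad encoding is simple enough to state exactly. The following sketch is ours, not the system's code; the assignments of Q and Z are assumptions based on common practice.

    # Illustrative sketch of keypad dialing by name (not the actual
    # Etherphone code). Each letter maps to the digit printed on its
    # key; we assume Q->7 and Z->9, the conventional assignments.

    KEYS = {2: "ABC", 3: "DEF", 4: "GHI", 5: "JKL",
            6: "MNO", 7: "PQRS", 8: "TUV", 9: "WXYZ"}
    DIGIT = {letter: str(d) for d, letters in KEYS.items() for letter in letters}

    def encode(name):
        """Translate a name into its keypad-digit string."""
        return "".join(DIGIT[c] for c in name.upper() if c in DIGIT)

    def lookup(dialed, directory):
        """Return every directory name whose encoding matches the dialed
        digits. More than one match is a collision, which the Etherphone
        reports as an error indication."""
        return [name for name in directory if encode(name) == dialed]

    directory = ["AHenderson", "CHenderson", "Zellweger"]
    print(lookup(encode("AHenderson"), directory))
    # -> ['AHenderson', 'CHenderson']: A and C share key 2, so both
    #    names encode to the same digit string.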
Directory assistance. The system includes a central whitepages directory database for all Xerox employees in the Palo Alto area (about 1000 entries), and individual Etherphone users can easily create personal directories from simple text files. Call-placement routines consult first the personal directories and then the system directory to translate names to numbers. The first of the two directory systems mentioned above transforms a text file into an unlimited set of speed-dialing buttons; the usual Tioga operations of searching and level-clipping apply to these directories. The second provides a query-browsing interface to a collection of directory databases. The results of a query are again a set of speed-dialing buttons. The user can formulate complex queries based on pattern-matching of names, numbers, and other database information. In addition, a soundex search mechanism [Knuth] compensates for some kinds of spelling errors.

Locating callees. Within the Etherphone system, callees are identified primarily by name. The system tries to locate the named callee and ring the nearest Etherphone, as follows: if the person is logged in at a workstation, the adjacent Etherphone rings with that person's ring tune; otherwise, the default telephone listed in the system database rings. If the person has registered as visiting another workstation or Etherphone (by issuing the Visit command), then that Etherphone rings in addition to any other.

Although the Etherphone system interfaces to the normal telephone system, functionality for outside calls is more limited. Calls to outside numbers can be specified in a whitepages directory, but incoming outside calls are not identified more specifically than "outside line", and calls to people outside the Etherphone system cannot use the locating machinery. The Etherphone hardware is constructed so that software or hardware failures connect the telephone instrument directly to the outside telephone line. On one hand, this has made it easier to recruit experimental users, because normal telephone service is always guaranteed; on the other hand, it has made users less likely to report Etherphone failures, because they can still get their work done.

Call announcement. Calls are announced audibly, visually, and functionally. Audibly, each Etherphone user is given a personalized ring tune, such as a few bars of "We're Off To See The Wizard", that is played at a destination Etherphone to announce calls to that user; the caller hears the same tune as ringback confirmation. Visually, the telephone icon jangles with a superimposed indication of the caller, as shown in Figure X; an active conversation is represented as a conversation between two people with a superimposed indication of the other party, as shown in Figure Y. This icon feedback gives status information in a minimal screen area. Functionally, the Finch tool's Calling Party: or Called Party: field is automatically filled in to allow easy redialing, and a new conversation log entry is created; the conversation log can be consulted to discover who called during an absence from the office.

In principle, the choice of ring tune could depend on the caller, the subject of the call, the time of day, and so on, but we have found that a single tune allows people to distinguish their calls at a distance, almost subconsciously (much as we subconsciously filter noises for our name). Ring tunes have been the single most popular feature of the Etherphone system. We could also announce calls using the text-to-speech server ("Call for Swinehart from Zellweger"), but this contributes to office noise pollution if done loudly enough to catch people away from their offices; it remains a possibility of last resort after all other attempts to locate a person have failed.

Specializing Etherphone behavior. Ring tunes and ringing behaviors for each Etherphone (such as "ring my secretary between 3 and 5 pm" or "answer calls about the Cedar compiler with a particular recorded message") are specified in the centralized switching database, and users can modify these behaviors by writing new database entries. Another important consideration has been to allow all of the callee's agents in an Etherphone call (the callee's Etherphone, its adjacent workstation, and the switching server) to cooperate in deciding how a call should be answered. The switching server is consulted first, using the central database, to decide which Etherphones and which workstations to inform about a call. The workstations are then consulted, allowing them to evaluate arbitrarily complex filtering functions. Finally, the Etherphones themselves perform their default behavior, which can itself be specialized through the database: answer automatically, or ring the phone.
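The three-stage decision can be summarized schematically as below. The rule formats and field names are invented for illustration; the actual behaviors live in the centralized switching database.

    # Schematic rendering of the three-stage call-answering decision
    # (switching server -> workstation -> Etherphone). The rule format
    # and all names are invented for illustration.

    def switching_server(call, database):
        """Stage 1: use the central database to choose the agents to
        inform and the tune to play."""
        entry = database.get(call["callee"], {})
        return entry.get("ring_tune", "default"), entry.get("workstation")

    def workstation_filter(call, hour):
        """Stage 2: a user-supplied predicate, arbitrarily complex."""
        if call["subject"] == "Cedar compiler":
            return "play-recorded-message"
        if 15 <= hour < 17:
            return "ring-secretary"
        return "ring"        # fall through to the Etherphone's default

    def etherphone_default(action, tune):
        """Stage 3: the instrument's default, itself database-specified
        (answer automatically, or ring the phone)."""
        print(f"action: {action}, tune: {tune}")

    database = {"Swinehart": {"ring_tune": "Off To See The Wizard",
                              "workstation": "WS-7"}}
    call = {"callee": "Swinehart", "subject": "lunch"}
    tune, workstation = switching_server(call, database)
    etherphone_default(workstation_filter(call, hour=11), tune)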
Text-to-speech synthesis

In addition to the voice file server, which supports voice recording and playback, the Etherphone system includes two commercial speech synthesizers, a DECtalk and a Prose 2000. Each can convert arbitrary ASCII text to reasonably intelligible audio output, with control over speaking speed, pitch, and other voice characteristics. Words that do not follow the usual English pronunciation rules can be specified as sequences of phonemes. (Much of this section would apply equally well to recorded voice; it concerns uses of voice sources in the absence of telephony.)

A common commercial use of such a synthesizer is to provide telephone access to a database, such as stock quotations or bank balances, where much of the text is a canned script that has been hand-tuned for maximum intelligibility. In our system, each synthesizer is connected to a dedicated Etherphone, forming a text-to-speech server. Each server is available to any Etherphone-equipped workstation on a first-come-first-served, one-user-at-a-time basis.

A user or program can generate speech as easily as printing a message on the display. A user can select text in any display window; a program can call a procedure with the desired text as a parameter. The system takes care of setting up the connection to the text-to-speech server, sending the text (via remote procedure call), returning the digitized audio signal (via the Voice Transmission Protocol), and closing the connection when the text has been spoken.

Our primary uses for text-to-speech so far have been in programming environment and office automation applications. The ability to speak text selected in any screen window has been used directly for proofreading, and has been particularly valuable for comparing versions of a document when one version has no electronic form, as when proofing journal galleys. Calendar and reminder programs have been augmented to deliver audio reminders. Some users have added spoken progress indicators to their long computations, allowing them to "keep an ear" on a computation while performing other tasks; similarly, audio prompts and error messages allow users to focus their attention elsewhere without losing track of a program that requires intervention.

Although present-day synthesizers are less intelligible for arbitrary text than for the hand-tuned scripts used in commercial dial-up applications, the controllability of the generated speech suggests interesting future research in "audio style" for documents, in which speed, pitch, and other voice characteristics would be applied automatically to communicate italics, boldface, or quotations.

Because we treat connections to voice sources explicitly, and because the ability to include more than two parties in a conversation is not yet available above the Etherphone hardware level, we have not yet been able to experiment with uses of voice sources in telephone calls. Among the planned uses for text-to-speech are: (1) providing audio confirmation of the person or number dialed as a call begins; (2) reading electronic mail over the telephone to a remote lab member, without dedicating a synthesizer solely to this task; and (3) playing program-generated messages to callers, such as prompts or reports of the user's location, possibly obtained by consulting the user's calendar program ("Dr. Smith is at the Distributed Systems seminar now; please call back after 5 o'clock").
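From the client programmer's perspective, generating speech really is a one-procedure affair. The sketch below conveys the idea; the procedure names and parameters are our inventions, standing in for the actual Cedar RPC interface, and the stubs stand in for the RPC machinery.

    # Invented illustration of the client interface to the text-to-
    # speech server; the real interface is a Cedar RPC binding.

    class TTSServer:
        def pronounce(self, conv, text, rate, pitch):
            print(f"[conversation {conv}] speaking at {rate} wpm: {text!r}")

    def reserve_tts_server():
        return TTSServer()          # first-come-first-served reservation

    def establish_conversation(server, etherphone):
        return 42                   # conversation id from the control server

    def release(server, conv):
        pass                        # tear the conversation down

    def speak(text, rate=180, pitch="default", etherphone="my-office"):
        """Have text spoken on the given Etherphone: reserve a
        synthesizer, connect it to the Etherphone, ship the text by
        RPC, and disconnect when the text has been spoken."""
        server = reserve_tts_server()
        conv = establish_conversation(server, etherphone)
        server.pronounce(conv, text, rate=rate, pitch=pitch)
        release(server, conv)

    # A long computation can keep an absent user informed:
    for step in ("parsing", "compiling", "linking"):
        speak(f"Finished {step}")   # audible progress report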
Voice annotation and editing

The Etherphone system supports voice annotation of documents. This capability is built on top of Tioga, the Cedar environment's screen-based editor.

Tioga is a what-you-see-is-what-you-get galley editor, used both for programming and for document preparation, that supports rich formatting and typography using a variety of type faces, type styles, and paragraph formats. Tioga is unusual among text editors in that its documents are tree-structured rather than plain running text: paragraphs can be nested under a section heading, for example, and documents can be displayed at any level of detail, from the single root node to the full tree. This makes possible such operations as displaying only the section headings or key paragraphs of a document, so that scanning for a particular section can be done quickly and effortlessly. Tioga is also extensible: individual characters or nodes can have arbitrary properties associated with them. One use of node properties is to specify bitmaps and specialized screen-painting procedures for embedded pictures, so that documents can incorporate illustrations and scanned images, in black-and-white or full color.

Cedar has been designed so that other applications can employ the Tioga editor; these include the electronic mail system, the system command interpreter, and any tool that requires the entry and manipulation of text. This gives considerable unity to the editing interface, since identical keystrokes perform identical functions in all the applications in which Tioga is used, and wherever Tioga is used, all of its formatting and multi-media facilities are available. Thus, by adding voice annotation to Tioga, we have made it available to a variety of Cedar applications.

The user interface of the voice annotation system is designed to be lightweight and easy to use, since spontaneity in adding vocal annotations is essential. Voice within a document is shown as a distinctive shape superimposed on a character, so that neither the document's visual layout nor its contents as observed by other programs (compilers, for example) is affected. Users point at text selections and use menus to add and listen to voice.

Simple voice editing is also available: users can select a voice annotation and open a window showing its basic sound-and-silence profile. Sounds from the same or other voice windows can be cut and pasted together using the same editing operations the Tioga editor supports, and a lightweight dictation facility using a record/stop/backup model can be used to record and incorporate new sounds conveniently. Editing is done largely at the phrase level (never at the phoneme level), the granularity at which editing can be done with the best results and the least effort. The visual voice representation can itself be annotated: simple temporary markers keep track of important boundaries during editing operations, while permanent textual markers mark significant locations within extended transcriptions. As a further contextual aid, the system gives a visual indication of the age of the voice in an editing window: newly added voice appears in a bright yellow color, and less recently added phrases become gradually darker as further editing operations occur.
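Phrase-level editing is cheap because edits manipulate references to stored voice rather than the voice itself. The sketch below is our rendering of that idea, with invented names; cut and paste rearrange phrase descriptors without touching (or decrypting) the samples on the voice file server.

    # Our sketch of phrase-level voice editing (names invented). A piece
    # of voice is a sequence of phrase descriptors, each referring to an
    # interval of an immutable recording held on the voice file server.
    # Editing rearranges descriptors; no samples are copied.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Phrase:
        recording: str     # identifier of a stored recording
        start_ms: int      # interval within that recording
        end_ms: int

    def cut(voice, i, j):
        """Remove phrases i..j-1, returning (remaining, clipboard)."""
        return voice[:i] + voice[j:], voice[i:j]

    def paste(voice, i, clipboard):
        """Insert the clipboard's phrases before position i."""
        return voice[:i] + clipboard + voice[i:]

    take1 = [Phrase("rec-17", 0, 1800), Phrase("rec-17", 2200, 5100)]
    take2 = [Phrase("rec-20", 0, 2500)]
    draft, clip = cut(take1, 1, 2)          # pull out the second phrase
    final = paste(draft + take2, 1, clip)   # splice it into the new take
    print(final)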
The dictation facility can also be used when placing voice annotations directly into documents.

Basic annotation. Figure 1 shows a text document window, or viewer, from a Cedar workstation screen; its caption defines the various regions of the viewer, indicating how one selects objects of interest and how one performs operations on them. We use the terminology defined in Figure 1 throughout this discussion.

Any single text character within a Tioga document can be annotated with an audio recording of arbitrary length. To add an annotation, the user simply selects the desired character within a text viewer and buttons AddVoice in that viewer's menu. Recording begins immediately, using either a hands-free microphone or the telephone handset, and continues until the user buttons STOP. As recording begins, a distinctive iconic indication of the presence of a voice annotation is displayed as a decoration of the selected character: currently, this voice icon is an outline in the shape of a comic-strip dialog balloon surrounding the entire character. The second line of text in the first paragraph shown in Figure 1 contains a voice icon.

Adding a voice icon does not alter the layout of a document in any way. Thus, voice annotations can be used to comment on the content, format, or appearance of formatted text, and programs such as compilers can read the text, ignoring voice icons just as they ignore font information. Voice annotations may be used, for example, to explain portions of a program text without affecting the ability to compile it; draft versions of this paper were annotated with verbal suggestions for alterations and improvements. Like font information, voice icons are copied along with the text they annotate when editing operations move or copy that text, either within the same document or from one document to another.

A voice annotation becomes a permanent part of a Tioga document; copies of the document may be stored on shared file servers or sent directly to other users as electronic mail, so voice mail becomes a special case of voice annotation. To listen to voice, a user selects a region containing one or more voice icons and buttons PlayVoice. Since playback takes some time, the user may queue up additional PlayVoice requests during playback; these are played in sequence, and the STOP button halts the whole process at any time.

One aspect of the Etherphone system's architecture is particularly relevant to voice editing: digitized voice is stored not on individual workstations but on a voice file server on the Ethernet, designed specifically for recording and playing voice. A document carries references to the stored voice rather than the voice itself, so that, unlike the bits of an embedded picture, the voice is not in the document file; storing voice by reference also allows the server eventually to reclaim recordings to which no references remain. The only time the voice data is directly accessed is to play it back, by sending voice packets from the server to an Etherphone.
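One plausible rendering of the underlying representation appears below. It is our guess at the flavor, not Tioga's actual property scheme: the annotated character simply carries a property whose value names voice stored on the file server.

    # Our illustration (not Tioga's actual scheme) of how a character-
    # level property can carry a voice annotation by reference. The
    # document text is unchanged; programs that do not understand the
    # property ignore it, just as they ignore font information.

    from dataclasses import dataclass, field

    @dataclass
    class Char:
        ch: str
        properties: dict = field(default_factory=dict)

    def add_voice(doc, index, voice_ref):
        """Attach a reference to server-resident voice to one character."""
        doc[index].properties["Voice"] = voice_ref   # e.g. a file-server id

    def plain_text(doc):
        """What a compiler sees: the text, with voice icons invisible."""
        return "".join(c.ch for c in doc)

    doc = [Char(c) for c in "x := x + 1"]
    add_voice(doc, 0, "voice-file-server:rec-83")    # balloon drawn on 'x'
    assert plain_text(doc) == "x := x + 1"           # text and layout intact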
The key points of the design are:

· Voice is treated as an additional medium to be incorporated into a multi-media document management system. The voice facilities have been added by extending the semantics of an existing user interface to encompass voice where appropriate, and then by adding new techniques to deal with the idiosyncrasies of the audio medium.

· There are cases where simply converting the semantics of a text editing interface to voice would yield poor results. In such cases, we have produced a deliberately different interface. For example, we restrict voice editing to the manipulation of quantities no smaller than a spoken phrase, using a very simple capillary representation of the phrase structure; we have concluded that more elaborate energy-profile representations stress too fine a level of detail, and may provide more distraction than contextual information.

· The prototype voice editor required only two months to implement. This was possible because the components of the Cedar programming environment were designed to be extensible: the editor directly used a number of user interface facilities already available in the environment, and the Etherphone system supplied the underlying capabilities for telephone control as well as for recording, playback, and low-level voice editing. Extensions were linked into Tioga to add voice icons and the specialized voice recording, playback, and dictation commands. (Tioga's property mechanism already supported embedded pictures in just this style: an Artwork property on a node selects a specialized paint procedure, companion properties carry the picture bits and boundary, and the node's text contents are ignored when the picture is painted.)

We have just begun to test this voice editor within the Cedar community, and will discover which aspects of our design find favor with users and which need improvement. There are many ways in which this work could be extended, some of which have been outlined above. We believe that future work should continue our efforts to balance the need for a user interface that is easy to understand and use against the desire for an extensible and general structure that enables fluent and efficient manipulation of a variety of media.

Narrated documents

An additional mechanism that draws on the capabilities of the voice system is the script, which provides a way to layer additional structure on top of an electronic document or set of documents. A script is a directed path through a document or set of documents that need not follow their linear order. Each entry in a script consists of a document location, which is a contiguous sequence of characters, together with an associated action and an associated timing. A sample action might consist of playing back a previously recorded voice annotation, sending some text to a text-to-speech synthesizer, opening a new window, or running a program, which might animate a picture or retrieve items from a database. A single document can have multiple scripts traversing it for different purposes, and a single script can traverse multiple documents to organize collections of information.

A script can be played back as a whole, in which case the cursor moves to the first location (l1) in a document and performs its associated action (a1). The document scrolls to display that location if it does not currently appear on the screen, and the location is highlighted to call attention to it. After the associated time (t1), the cursor moves to the location specified in the next script entry (l2), performs its action, and so on.
The same location in a document can appear at multiple points in the script, with the same or different associated actions and timings. Another way to play a script is more user-directed: the timing information is ignored, and the script reader proceeds from one entry to the next at his or her own pace.

Allowing arbitrary actions at a scripted location lets scripted documents perform a wide variety of tasks: demonstrations, tutorials, and so on. Parameterized actions allow a script to be personalized ("Hi ⟨name⟩") or to reflect the current state of affairs more accurately ("There are ⟨n⟩ entries in this category"); for speech, this capability requires a text-to-speech synthesizer.

Scripted multi-media documents can thus contain any combination of text, pictures, audio, and action. Scripts need not follow the normal linear order of their associated documents, and the script writer can construct multiple viewing paths through the documents for different readers and purposes. This novel mechanism allows writers to communicate additional information to readers. Scripts can be used in a wide variety of ways: to construct formal demonstrations and presentations, to construct informal communications, and to organize collections of information. As work in progress, an additional tool allows a script writer to order voice annotations into a sequence, creating documents that provide a running audio narration.
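Stripped to its essentials, a script is an ordered list of (location, action, timing) entries plus a playback loop, as the following sketch of ours (with invented names) suggests.

    # Our reconstruction of the script mechanism (names invented): a
    # script is an ordered list of (location, action, timing) entries;
    # playback scrolls to each location, performs the action, waits for
    # the entry's dwell time, and moves on. In the user-directed mode
    # the timing is ignored and the reader advances by hand.

    import time
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Entry:
        location: tuple              # (document, first char, last char)
        action: Callable[[], None]   # play voice, speak text, run program
        seconds: float               # dwell time before the next entry

    def scroll_and_highlight(location):
        doc, lo, hi = location
        print(f"[{doc}: characters {lo}-{hi} highlighted]")

    def play(script, self_paced=False):
        for entry in script:
            scroll_and_highlight(entry.location)
            entry.action()
            if self_paced:
                input("next> ")      # reader proceeds at his or her own pace
            else:
                time.sleep(entry.seconds)

    script = [
        Entry(("report.tioga", 0, 40), lambda: print("play voice rec-5"), 4.0),
        Entry(("figures.tioga", 12, 30), lambda: print("speak caption"), 2.5),
    ]
    play(script)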
Progress in the design of a voice architecture

We have approached the problem of designing a voice architecture in a conventional manner: we identified a set of capabilities that we would like the system to have, then designed a system to provide them, taking as much care as we could to structure the system for modularity and flexibility. We have been using the resulting system as a model, both positive and negative, for the facilities, interfaces, and protocols that should make up a general architecture for voice systems. From highest to lowest, the layers of the emerging architecture are:

Application layer: client applications that use voice.
Service layer: telephony, recording, speech synthesis, speech understanding, and so on.
Conversation layer: conversation establishment and management.
Transmission layer: voice representations and transmission protocols, plus the associated control protocols.
Physical layer: the communication media.

The components of our prototype fit these layers as follows:

Application layer: WalnutVoice, TiogaVoice, and the Finch tool.
Service layer: the Lark Etherphone software, the Bluejay voice file server and its stored-voice operations, the text-to-speech servers, and Finch's telephone-management service.
Conversation layer: Thrush, the program of the Voice Control Server.
Transmission layer: the Etherphone voice transmission protocol, standard telephone company transmission, and RPC for control.
Physical layer: Ethernets and normal telephone lines.

A complete architecture must also specify how conventional (especially conventional mu-law digital) switches would fit in, how other voice representations (say, ADPCM or LPC encodings on the disk) would coexist, how other transmission protocols would fit, and the right design for tandem connections to the analog and digital telephone world. Gateways at the Transmission layer can connect different Physical layer components and even convert between different Transmission layer protocols; Etherphones can play this role, to an extent, in our current prototype, since a conversation can travel over the Ethernet and then be forwarded over a phone line via an Etherphone's back door.

At the Transmission layer, there is a distinction between the representation of voice and the protocol for transmitting it. We use the same voice representation as the telephone company (64 kilobits per second, mu-law PCM) but a different protocol: packet switching instead of circuit switching. If we had Cambridge rings or IBM token rings, we would still do packet switching, but would likely use different packet sizes than the Etherphone protocol does. Supporting multiple voice formats would require translation gateways between them. A PBX provides the bottom two layers, and could possibly provide the Conversation layer as well.

Our prototype Conversation layer is provided by a centralized server. We have not yet thought hard about a decentralized implementation of this layer, though one is necessary if the system is to scale well. The Service layer is already fairly rich, distributed among servers and workstation programs, although open questions remain about the interactions between workstations and multiple servers providing identical or similar services. Finally, we have a few good examples of Application layer programs.

Looking at these layers, it becomes easier to see how our efforts differ from work being done elsewhere. Most current efforts to "integrate voice and data" deal only with the Transmission and Physical layers. Other systems that include voice, such as Diamond and commercial voice mail services, have some specialized applications but very scanty Service and Conversation layers; they mostly build directly on a Transmission layer. By contrast, we have concentrated our efforts on Conversation and Service layer specifications, and on the architecture in general. It is here, we believe, that we can make an important contribution.
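The Transmission layer arithmetic is worth making explicit: at 8000 mu-law samples per second, one octet per sample, the representation costs 64 kilobits per second per direction. The sketch below assumes 20-millisecond packets purely for illustration (the actual packet format appears in [CSL-83-8]), and its field names are invented.

    # Illustrative Transmission layer arithmetic. The packet duration
    # and field names are assumptions for illustration; the actual
    # Etherphone packet format is described in CSL-83-8.
    # Representation: 64 kbit/s mu-law PCM, 8000 one-octet samples/s.

    from dataclasses import dataclass

    SAMPLE_RATE = 8000            # mu-law samples per second
    PACKET_MS = 20                # assumed packet duration
    SAMPLES_PER_PACKET = SAMPLE_RATE * PACKET_MS // 1000   # 160 octets

    @dataclass
    class VoicePacket:
        conversation: int         # which conversation the samples belong to
        sequence: int             # ordering and loss detection
        key_id: int               # which key encrypted the samples
        samples: bytes            # 160 encrypted mu-law octets

    bit_rate = SAMPLE_RATE * 8                 # 64,000 bits per second
    packets_per_second = 1000 // PACKET_MS     # 50 packets per second
    pkt = VoicePacket(7, 0, 3, bytes(SAMPLES_PER_PACKET))
    print(bit_rate, packets_per_second, len(pkt.samples))

A ring network with different natural frame sizes would change the packet arithmetic but not the representation, which is the point of keeping the two separable within the layer.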
Several open issues remain in this layering. The placement of the control protocols is somewhat arbitrary: RPC appears at the Transmission layer, yet control traffic in some ways parallels the Conversation layer, and the architecture still needs a natural home for database access, multicast, and low-level Etherphone control. The original design does not depend heavily on a centralized server; it would work reasonably well fully decentralized, one control program per Etherphone, with each conversation managed by the Etherphone that initiated it, even if that user eventually dropped out. The harder problem is to design the Conversation layer for a few cooperating centralized servers; similar problems arise when one contemplates replicating the voice file server. The clear trend is for services to drift from workstations to servers as they become stable and well understood, leaving only simple interface stubs on the workstation; replicated services remain a challenge that has traditionally been outside the scope of this project.

State

The Etherphone system, with approximately 50 Etherphones, is in everyday use by members of CSL as their sole telephone (connections to other lines and outside trunks are provided as described above). On top of it we have developed the applications described in this note, and we are still working to define an architecture that would make such applications easier to build and more robust; competing uses for voice connections remain a particular problem. We want to explore a number of further areas, including telephone filtering, attendant-console facilities, and conferencing. We also need to experiment with applying the same architecture to different workstation environments and different hardware architectures, and to other media, such as still and real-time video. We have discovered that managing voice in a distributed environment presents interesting problems that are pertinent to these other media as well. If we get the architecture right, and enough of it built, we can open the system up so that other programmers can contribute additional applications; we have already done some of that.

References

[Birrell and Nelson] Birrell, A. D., and Nelson, B. J. Implementing remote procedure calls. ACM Transactions on Computer Systems 2, 1 (February 1984), 39-59.
[Cedar] Teitelman, W. A tour through Cedar. IEEE Software 1, 2 (April 1984).
[CSL-83-8] Swinehart, D. C., Stewart, L. C., and Ornstein, S. M. Adding voice to an office computer network. Xerox PARC Technical Report CSL-83-8, 1983.
[ISO] Zimmermann, H. OSI reference model: the ISO model of architecture for open systems interconnection. IEEE Transactions on Communications COM-28, 4 (April 1980), 425-432.
[Knuth] Knuth, D. E. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, 1973.
[Ades] Ades, S., and Swinehart, D. C. Voice annotation and editing in a workstation environment. Manuscript in preparation.
[Unix] Ritchie, D. M., and Thompson, K. The UNIX time-sharing system. Communications of the ACM 17, 7 (July 1974), 365-375.
[XNS] Xerox Corporation. Internet Transport Protocols. Xerox System Integration Standard XSIS 028112, December 1981.
[Zellweger] Zellweger, P. T. Scripted documents. Manuscript in preparation, September 1985.