Voice applications
The Etherphone system provides a variety of user-level services, including telephone management, text-to-speech synthesis, and voice annotation and editing. We first describe the capabilities available on a workstation adjacent to an Etherphone. These functions are typically available through an Etherphone control panel, through commands issued in a typescript, and through procedures that can be incorporated into client programs.
This section describes the user-level services that we have implemented to date. Recall that our primary goal is to develop a comprehensive and robust voice architecture that permits the construction of such user services. Although we currently provide an interesting assortment of user facilities, some of the more complex items, such as conference specifications, are not yet completed.
Background explanations of Cedar
Viewers
Tioga
common text editor used for editable viewers
multi-media capabilities (based on Tioga's extensibility)
Open system
Integration
Hardware configuration
custom microprocessor-based Lark, modified telephone instrument, speaker, microphone, optional adjacent workstation -- from previous section
Telephone management
The voice system supports a growing collection of telephone management functions, including call placement and call logging: a comprehensive set of functions for managing simple telephone calls. More elaborate facilities, such as conference and background calls, are supported by the architecture and the underlying hardware, but no user interface for them has been built yet.
We first describe the functionality the system provides, then the underlying system architecture and some sample call scenarios, that is, the sequences of steps required to complete a few different types of calls.
Call placement. From a workstation adjacent to an Etherphone, a user can place a telephone call in several ways. She can fill in a name or number in the Called Party: field of the Finch tool and click its button; she can select a name or number anywhere on the screen (possibly in an electronic message) and click the Phone button in the tool header; she can type Phone followed by a name or number in any Command Tool viewer; or she can use one of two directory systems that present a browsable list of names and associated telephone numbers as speed-dialing buttons.
In addition, calls can be placed by name or number from the telephone keypad. To call by name, we use a simple encoding that translates each letter into the single digit printed on that key (Q, Z, and a few special characters are also given key assignments). Keypad dialing gives an error indicator when the result of this encoding is not unique, such as for AHenderson and CHenderson. Such collisions occur rarely for our relatively small database, but a more complex scheme would be needed for a system with thousands of subscribers. [We plan to construct a list of choices, either on the display or audially, using the text-to-speech server, but even this would be unwieldy in a large system.]
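The keypad encoding and its collision check can be sketched as follows. This is a hypothetical illustration: the digit assignments for Q, Z, and special characters are assumptions (modern keypad conventions), since the text does not specify them.

```python
# Assumed letter-to-digit mapping; Q and Z placements are a guess,
# since the original key assignments are not given in the text.
KEYPAD = {
    'abc': '2', 'def': '3', 'ghi': '4', 'jkl': '5',
    'mno': '6', 'pqrs': '7', 'tuv': '8', 'wxyz': '9',
}
DIGIT_FOR = {ch: d for letters, d in KEYPAD.items() for ch in letters}

def encode(name):
    """Translate a name into the digit string dialed on the keypad."""
    return ''.join(DIGIT_FOR[c] for c in name.lower() if c in DIGIT_FOR)

def lookup(dialed_digits, directory):
    """Return every name whose encoding matches; more than one match
    is a collision and would trigger the error indicator."""
    return [name for name in directory if encode(name) == dialed_digits]
```

For example, `AHenderson` and `CHenderson` encode identically because A and C share key 2, so `lookup` returns both and the Etherphone must signal an error.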
Directory assistance. The system includes a central whitepages directory database for all Xerox employees in the Palo Alto area (about 1000 entries). Individual Etherphone users can easily create personal directories from simple text files. The workstation call-placement routines consult the personal directories first and then the system directory to translate names to numbers.
The first directory system transforms a text file into an unlimited set of speed-dialing buttons. The usual Tioga functions of searching and level-clipping apply to these directories. The second system provides a query-browsing style interface to a collection of directory databases. The results of a query are again a set of speed-dialing buttons. The user can formulate complex queries based on pattern-matching of names, numbers, and other database information. In addition, a soundex search mechanism [Knuth] compensates for some kinds of spelling errors.
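The soundex mechanism cited above is a standard algorithm; the following is a minimal sketch of the classic version, which may differ in detail from the directory system's actual variant.

```python
# Classic soundex [Knuth]: keep the first letter, code the remaining
# consonants, collapse adjacent equal codes, pad to three digits.
CODES = {}
for digit, letters in {'1': 'bfpv', '2': 'cgjkqsxz', '3': 'dt',
                       '4': 'l', '5': 'mn', '6': 'r'}.items():
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    name = name.lower()
    first, rest = name[0], name[1:]
    digits = [CODES.get(first, '')]
    for ch in rest:
        if ch in 'hw':            # h and w do not separate equal codes
            continue
        d = CODES.get(ch, '')     # vowels code to '', resetting adjacency
        if d != digits[-1] or d == '':
            digits.append(d)
    code = ''.join(digits[1:])    # first letter's own code is dropped
    return first.upper() + (code + '000')[:3]
```

Because misspellings such as "Rupert" for "Robert" produce the same code, a soundex index lets the directory query succeed despite the error.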
Locating callees. Within the Etherphone system, callees are identified primarily by name. The system searches for the named person as follows: if the person is logged in at a workstation, the adjacent Etherphone rings with that person's ring tune; otherwise, the default telephone listed in the system database rings. If the person has registered (by issuing the Visit command) that they are visiting another workstation or Etherphone, then that Etherphone rings in addition to any other Etherphone.
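The precedence just described can be sketched as follows; the function and parameter names are invented for illustration.

```python
def phones_to_ring(logged_in_at, visiting, default_phone):
    """Return the set of Etherphones to ring for a named callee.

    logged_in_at:  the phone adjacent to the callee's workstation, or None
    visiting:      a phone registered via the Visit command, or None
    default_phone: the phone listed for the callee in the system database
    """
    phones = set()
    if logged_in_at is not None:
        phones.add(logged_in_at)    # ring where the callee is logged in
    else:
        phones.add(default_phone)   # otherwise ring the default phone
    if visiting is not None:
        phones.add(visiting)        # a visited Etherphone rings in addition
    return phones
```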
Although the Etherphone system interfaces to the normal telephone system, functionality for outside calls is fairly limited. Calls to outside locations can be specified in a whitepages directory, but calls from the outside are not identified more specifically than "outside line". Calls to people outside the Etherphone system do not have locating capabilities.
The Etherphone hardware is constructed so that software or hardware failures connect the telephone instrument directly to the outside telephone line. On one hand, this has made it easier to get experimental users, because at least normal telephone service is guaranteed. On the other hand, it has also made users less likely to report Etherphone failures, because they can still get their work done.
Call announcement. Calls are announced aurally, visually, and functionally. Aurally, each Etherphone user is given a personalized ring tune, such as a few bars of "We're Off To See The Wizard", that is played at a destination Etherphone to announce calls to that user. The caller hears the same tune as ringback confirmation. Visually, the telephone icon jangles with a superimposed indication of the caller, as shown in Figure X. An active conversation is represented as a conversation between two people with a superimposed indication of the other party, as shown in Figure Y. This icon feedback gives status information in a minimal screen area. Functionally, the Finch tool's Calling Party: or Called Party: field is automatically filled in to allow easy redialing, and a new conversation log entry is created. The conversation log can be consulted to discover who called during an absence from the office.
In principle, the choice of ring tune could depend on the caller, the subject of the call, the time of day, and so on, but we have found that a single tune allows people to distinguish their calls at a distance, almost subconsciously (much as we subconsciously filter noises for our name). Ring tunes have been the single most popular feature of the Etherphone system. We could also announce calls using our text-to-speech server, saying, for example, "Call for Swinehart from Zellweger", but this contributes more to office noise pollution if done loudly enough to catch people away from their offices. It remains a possibility of last resort, however, after all other attempts to locate the person have failed.
Specializing Etherphone behavior. Ring tunes and ringing behaviors for each Etherphone (such as "ring my secretary between 3 and 5 pm" or "answer calls about the Cedar compiler with a particular recorded message") are specified in the centralized switching database. The user can modify these behaviors by writing new database entries. Another important consideration has been to allow all of the callee's agents in an Etherphone call (that is, the callee's Etherphone, its adjacent workstation, and the switching server) to cooperate in deciding how a call should be answered. The switching server is consulted first, using the central database, to decide which Etherphones and workstations to inform about a call. Then the workstations are consulted, allowing them to evaluate any complex filtering functions. Finally, the Etherphones themselves perform their default behavior, which can itself be specialized in the database: answer automatically, for example, or simply ring the phone.
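The three-stage cooperation among these agents might be sketched as follows. All class and function names here are invented for illustration; the real system is implemented in Cedar, not Python.

```python
class Database:
    """Stand-in for the centralized switching database."""
    def __init__(self, entries, defaults):
        self.entries = entries      # callee name -> Etherphones to inform
        self.defaults = defaults    # Etherphone -> its default behavior
    def targets_for(self, callee):
        return self.entries.get(callee, [])
    def default_behavior(self, targets):
        return [(phone, self.defaults.get(phone, 'ring')) for phone in targets]

def route_call(callee, database, workstation_filters):
    """Decide how a call is answered, consulting three agents in order."""
    # 1. The switching server consults the database to decide which
    #    Etherphones and workstations to inform.
    targets = database.targets_for(callee)
    # 2. Each informed workstation may apply a complex filtering function.
    for filt in workstation_filters:
        decision = filt(callee)
        if decision is not None:        # e.g. "answer with recorded message"
            return decision
    # 3. Otherwise each Etherphone performs its database-specified default.
    return database.default_behavior(targets)
```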
Distributed intelligence about Etherphone behavior
Call placement and receipt
Ether calls vs. outside calls
Call logging
White pages
browser, personal lists, public lists
Identifying and locating callers and callees (visiting and poaching)
part of Controlling telephone behavior?
filtering
Text-to-speech synthesis
Note: much of this section would apply equally well to recorded voice; it's about uses of voice sources in the absence of telephony.
In addition to the voice file server, which supports voice recording and playback, the Etherphone system includes two commercial speech synthesizers, a DECtalk and a Prose 2000. Each can convert arbitrary ASCII text to reasonably intelligible audio output, with control of speaking speed, pitch, and other voice characteristics. Words that do not follow usual English pronunciation rules can be specified as a sequence of phonemes. A common commercial use of such a synthesizer is to provide telephone access to a database, such as stock quotations or bank balances. Often much of the text is a canned script that is typically hand-tuned for maximum intelligibility.
In our system, each synthesizer is connected to a dedicated Etherphone, forming a text-to-speech server. Each server is available to any Etherphone-equipped workstation, on a first-come-first-served, one-user-at-a-time basis. A user or program can generate speech as easily as printing a message on the display. To generate speech, a user can select text in a display window. A program can call a procedure with the desired text as a parameter. The system takes care of setting up the connection to the text-to-speech server, sending the text (via remote procedure call), returning the digitized audio signal (via the voice transmission protocol), and closing the connection when the text has been spoken.
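The first-come-first-served, one-user-at-a-time allocation might look like this in outline. All names are invented; the real interfaces are Cedar procedures, and the text transfer and audio return happen over the network protocols named above.

```python
import threading

class SpeechServer:
    """One synthesizer plus its dedicated Etherphone."""
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()    # one user at a time

def speak(text, servers):
    """Acquire the first free server, speak, release: first come, first served."""
    for server in servers:
        if server.lock.acquire(blocking=False):
            try:
                # In the real system the text goes out via remote procedure
                # call and digitized audio returns via the voice protocol.
                return (server.name, text)
            finally:
                server.lock.release()
    raise RuntimeError('all text-to-speech servers are busy')
```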
Our primary uses for text-to-speech so far have been in programming environment and office automation applications. The ability to select text in any screen window has been used directly for proofreading tasks. This has been particularly valuable for comparing versions of a document when one version has no electronic form, such as proofing journal galleys. Calendar and reminder programs have been augmented to allow audio reminders. Some users have added spoken progress indicators to their long computations, allowing them to "keep an ear" on the computation while they perform other tasks. Similarly, audio prompts and error messages allow users to focus their attention elsewhere without losing track of a program that requires intervention. Although present-day synthesizers are less intelligible for arbitrary text than for the hand-tuned scripts that are used in commercial dial-up applications, the controllability of the generated speech suggests interesting future research in "audio style" for documents, in which speed, pitch, and other voice characteristics could be applied automatically to communicate italicization, boldface, or quotations.
Because connections to voice sources must be set up explicitly, and the ability to include more than two parties in a conversation is not yet available above the Etherphone hardware level, we have not yet been able to experiment with uses of voice sources within telephone calls. Among the planned uses for text-to-speech are: (1) providing audio confirmation of the person or number dialed as a call begins, (2) reading electronic mail over the telephone to a remote lab member (without dedicating a synthesizer solely to this task), and (3) playing program-generated messages to callers, such as prompts or reports of the user's location, possibly obtained by consulting the user's calendar program ("Dr. Smith is at the Distributed Systems seminar now; please call back after 5 o'clock").
server. speak selected text, speak under program control, audio reminder, debugging, progress reports, proofreading
Voice annotation and editing
The Etherphone system supports voice annotation of documents. This capability is built on top of Tioga, the Cedar environment's screen-based editor. The Tioga editor is a what-you-see-is-what-you-get galley editor that is used for both programming and document preparation. Documents can have rich formatting and typography, and can include pictures and voice. Tioga documents are tree-structured: for example, paragraphs can be nested under a section heading. Documents can be displayed at any level of detail, from the single root node to the full tree.
Tioga is also extensible. Individual characters or nodes can have arbitrary properties associated with them. One use of node properties is to specify bitmaps and specialized screen-painting procedures for embedded pictures.
A&S Tioga is the standard text-editing program in Cedar. Tioga is essentially a high-quality galley editor, supporting the creation of text documents using a variety of type faces, type styles, and paragraph formats. Tioga is unusual among text editors in that its documents are tree-structured rather than being plain running text. This makes possible such operations as displaying only the section headings or key paragraphs of a document, which means that scanning a Tioga document for a particular section can be done quickly and effortlessly. Finally, Tioga includes the ability to incorporate illustrations and scanned images into its documents. Tioga can create both black-and-white and full-color documents.
A&S Cedar has been designed so that other applications can employ the capabilities of the Tioga editor. These include the electronic mail system, the system command interpreter, and any tools that require the entry and manipulation of text by the user. This gives considerable unity to the editing interface, since for all the different types of application in which Tioga is used, identical keystrokes will perform identical functions. Wherever Tioga is used, all of its formatting and multi-media facilities are available. Thus, by adding voice annotation to Tioga, we have made it available to a variety of Cedar applications.
A&S The user interface of the voice annotation system is designed to be lightweight and easy to use, since spontaneity in adding vocal annotations is essential. Voice within a document is shown as a distinctive shape superimposed around a character, so that the document's visual layout and its contents as observed by other programs (e.g., compilers) are unaffected. Users point at text selections and use menus to add and listen to voice.
A&S Simple voice editing is available: users can select a voice annotation and open a window showing its basic sound-and-silence profile. Sounds from the same or other voice windows can be cut and pasted together using the same editing operations supported by the Tioga editor. A lightweight `dictation facility' that uses a record/stop/backup model can be used to record and incorporate new sounds conveniently. Editing is done largely at the phrase level (never at the phoneme level), representing the granularity at which editing can be done with best results and least effort. The visual voice representation itself can be annotated: simple temporary markers are used to keep track of important boundaries during editing operations, while permanent textual markers are used to find significant locations within extended transcriptions. As a further contextual aid, the system provides a visual indication of the age of the voice in an editing window. Newly-added voice appears in a bright yellow color, while less-recently-added phrases become gradually darker as new editing operations occur. The dictation facility can also be used when placing voice annotations straight into documents.
A&S Basic annotation
Figure 1 shows a text document window, or viewer, from a Cedar workstation screen. Its caption defines the various regions of the viewer, indicating how one selects objects of interest and how one performs various operations on those objects. We will use the terminology defined in Figure 1 throughout the discussion.
A&S Any single text character within a Tioga document can be annotated with an audio recording of arbitrary length. To add an annotation, the user simply selects the desired character within a text viewer and buttons AddVoice in that viewer's menu. Recording begins immediately, using either a hands-free microphone or the telephone handset, and continues until the user buttons STOP. As recording begins, a distinctive iconic indication of the presence of a voice annotation is displayed as a sort of decoration of the selected character. Currently, this voice icon is an outline in the shape of a comic-strip dialog `balloon' surrounding the entire character. The second line of text in the first paragraph shown in Figure 1 contains a voice icon.
A&S Adding a voice icon does not alter the layout of a document in any way. Thus, voice annotations can be used to comment on the content, format, or appearance of formatted text. Moreover, programs such as compilers can read the text, ignoring voice icons just as they ignore font information. Voice annotations may be used, for example, to explain portions of a program text without affecting the ability to compile it. Like font information, voice icons are copied along with the text they annotate when editing operations move or copy that text to other locations, either within the same document or from one document to another.
A&S A voice annotation becomes a permanent part of a Tioga document. Copies of the document may be stored on shared file servers or sent directly to other users as electronic mail. To listen to voice, a user selects a region containing one or more voice icons and buttons PlayVoice. Since playback takes some time, the user may queue up additional PlayVoice requests during playback. These will then be played in sequence. The STOP button can be used to halt the whole process at any time.
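The queueing behavior of PlayVoice and STOP can be sketched as follows; the class and method names are hypothetical stand-ins for the real mechanism.

```python
from collections import deque

class Player:
    """Hypothetical stand-in for the PlayVoice/STOP machinery."""
    def __init__(self):
        self.queue = deque()
    def play_voice(self, icons):
        """Queue the selected voice icons; they will play in sequence,
        even if requested while earlier playback is still in progress."""
        self.queue.extend(icons)
    def step(self):
        """Play the next queued icon, if any (stands in for audio output)."""
        return self.queue.popleft() if self.queue else None
    def stop(self):
        """Halt the whole process, discarding any queued requests."""
        self.queue.clear()
```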
A&S One aspect of the Etherphone system's architecture is particularly relevant to voice editing systems. Digitized voice is stored not by individual workstations, but by a voice file server on the Ethernet, designed specifically for recording and playing voice. The only time the data is directly accessed is to play it back, by sending voice packets from the server back to an Etherphone.
A&S In this paper, we have described the user interface for a voice annotation and editing system. The key points of our design are:
· Voice is treated as an additional medium to be incorporated into a multi-media document management system. The voice facilities have been added by extending the semantics of an existing user interface to encompass voice where appropriate, then by adding new techniques to deal with the idiosyncrasies of the audio medium.
· There are some cases where simply converting the semantics of a text editing interface to voice would yield poor results. In such cases, we have produced a deliberately different interface. For example, we restrict voice editing to the manipulation of quantities no smaller than a spoken phrase, using a very simple capillary representation of the phrase structure. We have concluded that more elaborate energy profile representations stress too fine a level of detail, and may provide more distraction than contextual information.
· This prototype voice editor required only two months to implement. This was possible because the components of the Cedar programming environment were designed to be extensible. The editor was able to use directly a number of user interface facilities already available in the environment. The Etherphone system supplied the underlying capabilities for telephone control as well as for recording, playback, and low-level voice-editing operations. Extensions were linked into Tioga to add voice icons and the specialized voice recording, playback, and dictation commands.
A&S We have just begun to test this voice editor within the Cedar community. We will discover which aspects of our design find favor with users and which need improvement. There are many ways in which this work could be extended, some of which have been outlined above. We believe that future work should continue our efforts to balance the need for a user interface that is easy to understand and easy to use against the desire for an extensible and general structure that enables fluent and efficient manipulation of a variety of media.
Added to Tioga documents
editing functions at phrase level - simulate dictation machine
Tioga documents can contain pictures and formatting also
=> multi-media documents (picture bits are in file, voice is not)
Tioga's ArtworkInterpress implementation: Tioga allows a paint procedure to be registered for certain nodes. The Artwork property (value = Interpress) determines which procedure to call; the Interpress property holds the picture bits (an implementation using a file name has been proposed); and the Bounds property gives the picture boundary. The text contents of an ArtworkInterpress node is a comment telling the user to enable ArtworkInterpress; the ArtworkInterpress paint procedure ignores the text contents. The use of properties means that the picture bits are at the end of the file.
Tioga docs are sent as electronic mail
=> voice mail
Voice ropes
voice interests, garbage collection,
Narrated documents
An additional mechanism that draws on the capabilities of the voice system is the script, described next.
PTZ We have developed a mechanism that we call a script, which provides a way to layer additional structure on top of an electronic document or set of documents. A script is a directed path through a document or set of documents that need not follow their linear order. Each entry in a script consists of a document location, which is a contiguous sequence of characters, together with an associated action and an associated timing. A sample action might consist of playing back a previously-recorded voice annotation, sending some text to a text-to-speech synthesizer, opening a new window, or running a program, which might animate a picture or retrieve items from a database. A single document can have multiple scripts traversing it for different purposes, and a single script can traverse multiple documents to organize collections of information.
PTZ A script can be played back as a whole, in which case the cursor moves to the first location (l1) in a document and performs its associated action (a1). The document scrolls to display that location if the location does not currently appear on the screen, and the location is highlighted to call attention to it. After the associated time (t1), the cursor moves to the location specified in the next script entry (l2), performs its action, and so on. The same location in a document can appear at multiple points in the script, with the same or different associated actions and timing.
PTZ Another way to play a script is more user-directed. In this case, the timing information is ignored, and the script reader proceeds from one entry to the next at his or her own pace.
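Both playback modes can be sketched as follows. The Display stand-in and all names are invented for illustration; script entries follow the text's (location, action, timing) structure.

```python
import time

class Display:
    """Stand-in for the document viewer."""
    def __init__(self):
        self.events = []
    def scroll_to(self, location):
        self.events.append(('scroll', location))     # scroll if off-screen
    def highlight(self, location):
        self.events.append(('highlight', location))  # call attention to it
    def wait_for_user(self):
        pass                                         # e.g. await a mouse click

def play_script(script, display, user_paced=False):
    """Play each entry (l_i, a_i, t_i): move, highlight, act, then pace."""
    for location, action, timing in script:
        display.scroll_to(location)
        display.highlight(location)
        action(location)            # play voice, speak text, run a program, ...
        if user_paced:
            display.wait_for_user() # timing ignored; the reader sets the pace
        else:
            time.sleep(timing)      # wait the entry's associated time t_i
```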
PTZ Arbitrary actions at a scripted location allow scripted documents to perform a wide variety of tasks: demonstrations, tutorials, etc. Parameterized actions allow a script to be personalized ("Hi <username>") or to more accurately reflect the current state of affairs ("There are <curnum> entries in this category"). For speech, this capability requires a text-to-speech synthesizer.
PTZ Scripted multi-media documents can contain any combination of text, pictures, audio, and action. Scripts need not follow the normal linear order of their associated document(s). In addition, the script writer can construct multiple viewing paths through the document(s) for different readers and purposes. This novel mechanism allows writers to communicate additional information to readers. Scripts can be used in a wide variety of ways, including: to construct formal demonstrations and presentations, to construct informal communications, and to organize collections of information.
Work in progress....
An additional tool allows a script writer to order voice annotations into a sequence, creating documents that provide a narration.
Features and Drawbacks
something about the difficulty of modifying the system?
want database hooks to allow user to specialize behavior
Future Work
conferencing, ...