<<IEEEApplications.tioga>>
    <<Polle Zellweger (PTZ) November 6, 1986 7:19:20 pm PST>>
Voice applications
The Etherphone system provides a variety of user-level services, including telephone management, text-to-speech synthesis, and 
voice annotation and editing.  We first describe the capabilities available on a workstation adjacent to an Etherphone.  These 
functions are typically available through an Etherphone control panel, through commands that can be issued in a typescript, and 
through procedures that can be incorporated into client programs.  {backwards, really.  Intervoice concept/pkgs then tools = procs 
first, other user interfaces built on top}
This section describes the user-level services that we have implemented to date.  Recall that our primary goal is to develop a 
comprehensive and robust voice architecture that permits the construction of such user services.  Although we currently provide an 
interesting assortment of user facilities, some of the more complex items, such as conference specifications, are not yet completed.
Background explanations of Cedar
Viewers
Tioga
common text editor used for editable viewers
multi-media capabilities (based on Tioga's extensibility)
Open system
Integration
Hardware configuration
custom microprocessor-based Lark, modified telephone instrument, speaker, microphone, optional adjacent workstation -- from 
previous section
Telephone management
The voice system supports a growing collection of telephone management functions, including call placement, call logging, ... a 
comprehensive set of functions to manage simple telephone calls (as opposed to conference, background, etc., which are supported by 
the architecture and underlying hardware, but currently no user interface...weak, too bad).
We first describe the functionality provided by the system, then we describe the underlying system architecture and some sample 
call scenarios, that is, the sequence of steps required to complete a few different types of calls.
Call placement.  From a workstation adjacent to an Etherphone, a user can place a telephone call in several ways.  She can fill in 
a name or number in the Called Party: field of the Finch tool and click its button; she can select a name or number anywhere on the 
screen (possibly in an electronic message) and click the Phone button in the tool header; she can type Phone followed by a name or 
number in any Command Tool viewer; or she can use one of two directory systems that present a browsable list of names and 
associated telephone numbers as speed-dialing buttons.
In addition, calls can be placed by name or number from the telephone keypad.  To call by name, we use a simple encoding that 
translates each letter into the single digit printed on that key (Q, Z, and a few special characters are also given key 
assignments).  Keypad dialing gives an error indicator when the result of this encoding is not unique, such as for AHenderson and 
CHenderson.  Such collisions occur rarely for our relatively small database, but a more complex scheme would be needed for a system 
with thousands of subscribers.  [We plan to construct a list of choices, either on the display or audially, using the 
text-to-speech server, but even this would be unwieldy in a large system.]
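The keypad encoding and its collisions can be illustrated with a minimal sketch.  The letter groups below are the usual telephone keypad layout; placing Q on 7 and Z on 9 is an assumption, since the text says only that they are given key assignments.  The names encode_name, build_index, and dial are hypothetical and are not taken from the Etherphone software.

```python
from collections import defaultdict

# Usual keypad letter groups. Putting Q on 7 and Z on 9 is an assumption.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {ch: digit for digit, letters in KEYPAD.items() for ch in letters}


def encode_name(name):
    """Translate each letter of a name into the digit printed on its key."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in name.upper() if ch in LETTER_TO_DIGIT)


def build_index(names):
    """Group directory names by their keypad encoding."""
    index = defaultdict(list)
    for name in names:
        index[encode_name(name)].append(name)
    return index


def dial(index, digits):
    """Return ("ok", [name]), ("ambiguous", names), or ("unknown", [])."""
    matches = list(index.get(digits, []))
    if len(matches) == 1:
        return ("ok", matches)
    if matches:
        return ("ambiguous", matches)  # the case that gives an error indicator
    return ("unknown", [])
```

With this layout, AHenderson and CHenderson both encode to the same digit string, so dial reports the entry as ambiguous instead of picking one of them.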
Directory assistance.  The system includes a central whitepages directory database for all Xerox employees in the Palo Alto area 
(about 1000 entries).  Individual Etherphone users can easily create personal directories from simple text files.  Workstation (?) 
call-placement routines consult the personal directories first, then the system directory, to translate names to numbers.
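The lookup order can be sketched as follows, assuming each directory is a simple mapping from names to numbers.  The function name resolve_name and the dictionary representation are illustrative assumptions, not the actual directory interface.

```python
def resolve_name(name, personal_directories, system_directory):
    """Return the number for a name, or None if no directory lists it."""
    for directory in personal_directories:
        if name in directory:
            return directory[name]      # personal entries take precedence
    return system_directory.get(name)   # fall back to the central whitepages
```

A personal entry therefore overrides the system entry for the same name, which lets a user redirect a name without touching the central database.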
The first directory system transforms a text file into an unlimited set of speed-dialing buttons.  The usual Tioga functions of 
searching and level-clipping apply to these directories.  The second system provides a query-browsing style interface to a 
collection of directory databases.  The results of a query are again a set of speed-dialing buttons.  The user can formulate 
complex queries based on pattern-matching of names, numbers, and other database information.  In addition, a soundex search 
mechanism [Knuth] compensates for some kinds of spelling errors.
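As an illustration of how a soundex search tolerates spelling errors, here is a minimal sketch of the classic four-character Soundex code.  It follows the commonly described rules (keep the first letter, code the remaining consonants, collapse adjacent equal codes, let H and W pass without separating them); whether the directory tool uses exactly this variant is an assumption.

```python
# Consonant classes of the classic Soundex code.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def soundex(name):
    """Return the four-character Soundex code of a name ("" if it has no letters)."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = _CODES.get(ch, "")       # vowels and Y have no code
        if code and code != previous:
            result += code
        previous = code                 # a vowel resets the run of equal codes
    return (result + "000")[:4]         # pad or truncate to four characters
```

Two spellings match when their codes are equal, so a query for "Hendersen" would still find "Henderson".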
Locating callees.  Within the Etherphone system, the primary callee identification is by name.  The system searches for the person 
with that name as follows: if the person is logged in at a workstation, the adjacent telephone will ring with that person's ring 
tune.  Otherwise, the default telephone listed in the system database will ring.  If the person has registered as a visitor at 
another workstation or Etherphone (by issuing the Visit command), then that Etherphone will ring in addition to any other 
Etherphone.
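The locating rule can be sketched as a small function.  The three mappings (logged-in location, default telephone, and Visit registration) and the name phones_to_ring are hypothetical stand-ins for the system database.

```python
def phones_to_ring(person, logged_in_at, default_phone, visiting):
    """Return the Etherphones that should ring for a call to this person."""
    targets = []
    # Prefer the Etherphone next to the workstation where the person is logged in.
    primary = logged_in_at.get(person) or default_phone.get(person)
    if primary:
        targets.append(primary)
    # A Visit registration adds a phone; it does not replace the primary one.
    visited = visiting.get(person)
    if visited and visited not in targets:
        targets.append(visited)
    return targets
```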
Although the Etherphone system interfaces to the normal telephone system, functionality for outside calls is fairly limited.  Calls 
to outside locations can be specified in a whitepages directory, but calls from the outside are not identified more specifically 
than "outside line".  Calls to people outside the Etherphone system do not have locating capabilities.
The Etherphone hardware is constructed so that software or hardware failures connect the telephone instrument directly to the 
outside telephone line.  On one hand, this has made it easier to get experimental users, because at least normal telephone service 
is guaranteed.  On the other hand, it has also made users less likely to report Etherphone failures, because they can still get 
their work done.
The system tries to locate the named callee and ring the nearest Etherphone.
Call announcement.  Calls are announced audially, visually, and functionally.  Audially, each Etherphone user is given a 
personalized ring tune, such as a few bars of "We're Off To See The Wizard", that is played at a destination Etherphone to announce 
calls to that user.  The caller hears the same tune as a ringback confirmation.  Visually, the telephone icon jangles with a 
superimposed indication of the caller, as shown in Figure X.  An active conversation is represented as a conversation between two 
people with a superimposed indication of the other party, as shown in Figure Y.  This icon feedback gives status information in a 
minimal screen area.  Functionally, the Finch tool's Calling Party: or Called Party: field is automatically filled in to allow easy 
redialing, and a new conversation log entry is created.  The conversation log can be consulted to discover who has called during an 
absence from the office.
In principle, the choice of ring tune could depend on the caller, the subject of the call, the time of day, and so on, but we have 
found that a single tune allows people to distinguish their calls at a distance, almost subconsciously (much as we subconsciously 
filter noises for our name).  Ring tunes have been the single most popular feature of the Etherphone system.  We could also 
announce calls using our text-to-speech server, such as "Call for Swinehart from Zellweger", but this contributes more to office 
noise pollution if done loudly enough to catch people away from their offices.  It remains a possibility of last resort, however, 
after all other attempts to locate the person have failed.
Specializing Etherphone behavior.  Ring tunes and ringing behaviors for each Etherphone (such as ring my secretary between 3 and 5 
pm, or answer calls about the Cedar compiler with a particular recorded message) are specified in the centralized switching 
database.  The user can modify these behaviors by writing new database entries.  Another important consideration has been to allow 
all of the callee's agents in an Etherphone call (that is, the callee's Etherphone, its adjacent workstation, and the switching 
server) to cooperate in deciding how a call should be answered.  The switching server is consulted first to decide what 
Etherphone(s) and what workstation(s) to inform about a call.  This is where the central database comes into play.  Then the 
workstations are consulted to allow them to evaluate any complex filtering functions.  Finally, the Etherphones themselves are 
allowed to perform their default behavior (which can still be somewhat specialized: answer automatically or ring the phone -- but 
that's specified in the database too!)
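The three-stage consultation might be sketched as below.  This is only an illustration under stated assumptions: the call is a dictionary, the switching database maps a callee to a list of Etherphones, a workstation filter returns a decision or None to decline, and the decision strings are invented for the example.

```python
def route_call(call, switching_db, workstation_filters, phone_defaults):
    """Return a mapping from each informed Etherphone to its answering decision."""
    # Stage 1: the switching server consults the central database.
    phones = switching_db.get(call["callee"], [])
    outcome = {}
    for phone in phones:
        # Stage 2: the adjacent workstation, if any, may apply a complex filter.
        filter_fn = workstation_filters.get(phone)
        decision = filter_fn(call) if filter_fn else None
        # Stage 3: otherwise the Etherphone performs its default behavior.
        if decision is None:
            decision = phone_defaults.get(phone, "ring")
        outcome[phone] = decision
    return outcome
```

Each agent sees the call only if the earlier stage chose to involve it, which keeps the central database as the single place where routing is decided.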
Distributed intelligence about Etherphone behavior
Call placement and receipt
Ether calls vs. outside calls
Call logging
White pages
browser, personal lists, public lists
Identifying and locating callers and callees (visiting and poaching)
part of Controlling telephone behavior?
filtering
Text-to-speech synthesis
Note: much of this section would apply equally well to recorded voice; it's about uses of voice sources in the absence of telephony.
In addition to the voice file server, which supports voice recording and playback, the Etherphone system includes two commercial 
speech synthesizers, a DECtalk and a Prose 2000.  Each can convert arbitrary ASCII text to reasonably intelligible audio output, 
with control of speaking speed, pitch, and other voice characteristics.  Words that do not follow usual English pronunciation rules 
can be specified as a sequence of phonemes.  A common commercial use of such a synthesizer is to provide telephone access to a 
database, such as stock quotations or bank balances.  Often much of the text is a canned script that is typically hand-tuned for 
maximum intelligibility.  
In our system, each synthesizer is connected to a dedicated Etherphone, forming a text-to-speech server.  Each server is available 
to any Etherphone-equipped workstation, on a first-come-first-served, one-user-at-a-time basis.  A user or program can generate 
speech as easily as printing a message on the display.  To generate speech, a user can select text in a display window.  A program 
can call a procedure with the desired text as a parameter.  The system takes care of setting up the connection to the 
text-to-speech server, sending the text (via remote procedure call), returning the digitized audio signal (via the voice 
transmission protocol), and closing the connection when the text has been spoken.
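The first-come-first-served, one-user-at-a-time discipline can be sketched with a request queue drained by a single worker.  The class below is a toy model, not the actual server: the synthesize callable stands in for the synthesizer hardware, and the log entries stand in for connection setup, speech, and teardown.

```python
import queue
import threading


class TextToSpeechServer:
    """Toy model of a server that speaks one request at a time, in arrival order."""

    def __init__(self, synthesize):
        self._synthesize = synthesize          # stand-in for the synthesizer
        self._requests = queue.Queue()         # FIFO: first come, first served
        self.log = []
        threading.Thread(target=self._serve, daemon=True).start()

    def speak(self, user, text):
        """Queue text for speaking; the returned event is set when it is done."""
        done = threading.Event()
        self._requests.put((user, text, done))
        return done

    def _serve(self):
        while True:
            user, text, done = self._requests.get()
            self.log.append(("connect", user))      # set up the connection
            self._synthesize(text)                  # speak the text
            self.log.append(("speak", user, text))
            self.log.append(("disconnect", user))   # close the connection
            done.set()
```

A client program would call speak with the desired text and, if it needs to know when the speech has finished, wait on the returned event.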
Our primary uses for text-to-speech so far have been in programming environment and office automation applications.  The ability to 
select text in any screen window has been used directly for proofreading tasks.  This has been particularly valuable for comparing 
versions of a document when one version has no electronic form, such as proofing journal galleys.  Calendar and reminder programs 
have been augmented to allow audio reminders.  Some users have added spoken progress indicators to their long computations, 
allowing them to "keep an ear" on the computation while they perform other tasks.  Similarly, audio prompts and error messages 
allow users to focus their attention elsewhere without losing track of a program that requires intervention.  Although present-day 
synthesizers are less intelligible for arbitrary text than for the hand-tuned scripts that are used in commercial dial-up 
applications, the controllability of the generated speech suggests interesting future research in "audio style" for documents, in 
which speed, pitch, and other voice characteristics could be applied automatically to communicate italicization, boldface, or 
quotations.
Uses so far have been general applications of text-to-speech in the electronic office, including: proofreading (especially 
comparing versions of a document when one version has no electronic form, such as proofing journal galleys), audio reminder 
messages, program progress indicators, and error messages.  
Because we treat connections to voice sources explicitly, and because conversations with more than two parties are not yet 
supported above the Etherphone hardware level, we have not yet been able to experiment with uses of voice sources in telephone 
calls.  Among planned uses for text-to-speech are: (1) providing audio confirmation of the person or number dialled as a call 
begins, (2) reading electronic mail over the telephone to a remote lab member (without dedicating a synthesizer solely to this 
task), and (3) playing program-generated messages to callers, such as prompts or reports of the user's location (possibly by 
consulting the user's calendar program, such as "Dr. Smith is at the Distributed Systems seminar now, please call back after 5 
o'clock").
server.  speak selected text, speak under program control, audio reminder, debugging, progress reports, proofreading
Voice annotation and editing
The Etherphone system supports voice annotation of documents.  This capability is built on top of Tioga, the Cedar environment's 
screen-based editor.  The Tioga editor is a what-you-see-is-what-you-get galley editor that is used for both programming and 
document preparation.  Documents can have rich formatting and typography, and can include pictures and voice.  Tioga documents are 
tree-structured: for example, paragraphs can be nested under a section heading.  Documents can be displayed at any level of detail, 
from the single root node to the full tree.
Tioga is also extensible.  Individual characters or nodes can have arbitrary properties associated with them.  One use of node 
properties is to specify bitmaps and specialized screen-painting procedures for embedded pictures.
Ades & Swinehart Tioga is the standard text-editing program in Cedar.  Tioga is essentially a high-quality galley editor, 
supporting the creation of text documents using a variety of type faces, type styles, and paragraph formats.  Tioga is unusual 
among text editors in that its documents are tree-structured rather than being plain running text.  This makes possible such 
operations as displaying only the section headings or key paragraphs of a document, which means that scanning a Tioga document for 
a particular section can be done quickly and effortlessly.  Finally, Tioga includes the ability to incorporate illustrations and 
scanned images into its documents.  Tioga can create both black-and-white and full-color documents.
A&S Cedar has been designed so that other applications can employ the capabilities of the Tioga editor.  These include the 
electronic mail system, the system command interpreter, and any tools that require the entry and manipulation of text by the user.  
This gives considerable unity to the editing interface, since for all the different types of application in which Tioga is used, 
identical keystrokes will perform identical functions.  Wherever Tioga is used, all of its formatting and multi-media facilities 
are available.  Thus, by adding voice annotation to Tioga, we have made it available to a variety of Cedar applications.
A&S The user interface of the voice annotation system is designed to be lightweight and easy to use, since spontaneity in adding 
vocal annotations is essential.  Voice within a document is shown as a distinctive shape superimposed around a character, so that 
the document's visual layout and its contents as observed by other programs (e.g., compilers) are unaffected.  Users point at text 
selections and use menus to add and listen to voice.
A&S Simple voice editing is available: users can select a voice annotation and open a window showing its basic sound-and-silence 
profile.  Sounds from the same or other voice windows can be cut and pasted together using the same editing operations supported by 
the Tioga editor.  A lightweight `dictation facility' that uses a record/stop/backup model can be used to record and incorporate 
new sounds conveniently.  Editing is done largely at the phrase level (never at the phoneme level), representing the granularity at 
which editing can be done with best results and least effort.  The visual voice representation itself can be annotated: simple 
temporary markers are used to keep track of important boundaries during editing operations, while permanent textual markers are 
used to find significant locations within extended transcriptions.  As a further contextual aid, the system provides a visual 
indication of the age of the voice in an editing window.  Newly-added voice appears in a bright yellow color, while 
less-recently-added phrases become gradually darker as new editing operations occur.  The dictation facility can also be used when 
placing voice annotations straight into documents.

A&S  Basic annotation
Figure 1 shows a text document window, or viewer, from a Cedar workstation screen.  Its caption defines the various regions of the 
viewer, indicating how one selects objects of interest and how one performs various operations on those objects.  We will use the 
terminology defined in Figure 1 throughout the discussion.
A&S Any single text character within a Tioga document can be annotated with an audio recording of arbitrary length.  To add an 
annotation, the user simply selects the desired character within a text viewer and buttons AddVoice in that viewer's menu.  
Recording begins immediately, using either a hands-free microphone or the telephone handset, and continues until the user buttons 
STOP.  As recording begins, a distinctive iconic indication of the presence of a voice annotation is displayed as a sort of 
decoration of the selected character.  Currently, this voice icon is an outline in the shape of a comic-strip dialog `balloon' 
surrounding the entire character.  The second line of text in the first paragraph shown in Figure 1 contains a voice icon.
A&S Adding a voice icon does not alter the layout of a document in any way.  Thus, voice annotations can be used to comment on the 
content, format, or appearance of formatted text.  Moreover, programs such as compilers can read the text, ignoring voice icons 
just as they ignore font information.  Voice annotations may be used, for example, to explain portions of a program text without 
affecting the ability to compile it.  Like font information, voice icons are copied along with the text they annotate when editing 
operations move or copy that text to other locations, either within the same document or from one document to another.
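One way to picture why a voice icon leaves both layout and compilation unaffected is to model voice as a property attached to a character, alongside properties such as font.  The sketch below is an assumption-laden illustration; the Char class, the "voice" property key, and the identifier string are hypothetical and do not reflect Tioga's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Char:
    ch: str
    props: dict = field(default_factory=dict)   # e.g. font, voice


def make_text(s):
    return [Char(c) for c in s]


def plain_text(chars):
    """What a compiler sees: the characters only, with every property ignored."""
    return "".join(c.ch for c in chars)


def add_voice(chars, index, voice_id):
    """Annotate one character; the text itself is not changed."""
    chars[index].props["voice"] = voice_id


def copy_span(chars, start, end):
    """Copy characters together with their properties, as editing operations do."""
    return [Char(c.ch, dict(c.props)) for c in chars[start:end]]
```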
A&S A voice annotation becomes a permanent part of a Tioga document.  Copies of the document may be stored on shared file servers 
or sent directly to other users as electronic mail.  To listen to voice, a user selects a region containing one or more voice icons 
and buttons PlayVoice.  Since playback takes some time, the user may queue up additional PlayVoice requests during playback.  These 
will then be played in sequence.  The STOP button can be used to halt the whole process at any time.
A&S One aspect of the Etherphone system's architecture is particularly relevant to voice editing systems.  Digitized voice is 
stored not by individual workstations, but by a voice file server on the Ethernet, designed specifically for recording and playing 
voice.  The only time the data is directly accessed is to play it back, by sending voice packets from the server back to an 
Etherphone.
A&S In this paper, we have described the user interface for a voice annotation and editing system.  The key points of our design 
are:
•    Voice is treated as an additional medium to be incorporated into a multi-media document management system.  The voice 
facilities have been added by extending the semantics of an existing user interface to encompass voice where appropriate, then by 
adding new techniques to deal with the idiosyncrasies of the audio medium.
•    There are some cases where simply converting the semantics of a text editing interface to voice would yield poor results.  In 
such cases, we have produced a deliberately different interface.  For example, we restrict voice editing to the manipulation of 
quantities no smaller than a spoken phrase, using a very simple capillary representation of the phrase structure.  We have 
concluded that more elaborate energy profile representations stress too fine a level of detail, and may provide more distraction 
than contextual information.
•    This prototype voice editor only required two months to implement.  This was possible because the components of the Cedar 
programming environment were designed to be extensible.  The editor was able to use directly a number of user interface facilities 
already available in the environment.  The Etherphone system supplied the underlying capabilities for telephone control as well as 
for recording, playback, and low-level voice-editing operations.  Extensions were linked into Tioga to add voice icons and the 
specialized voice recording, playback, and dictation commands.
A&S We have just begun to test this voice editor within the Cedar community.  We will discover which aspects of our design find 
favor with users and which need improvement.  There are many ways in which this work could be extended, some of which have been 
outlined above.  We believe that future work should continue our efforts to balance the need for a user interface that is easy to 
understand and easy to use against the desire for an extensible and general structure that enables fluent and efficient 
manipulation of a variety of media.
Added to Tioga documents
editing functions at phrase level - simulate dictation machine
Tioga documents can contain pictures and formatting also
=> multi-media documents (picture bits are in file, voice is not)
Tioga's ArtworkInterpress impl:  Tioga allows registering of paintproc for certain nodes.  Artwork prop (value=Interpress) 
determines what proc to call, Interpress prop (value=picture bits, impl for filename proposed), Bounds prop (value=picture 
boundary).  Text contents of ArtworkInterpress nodes is comment telling user to enable ArtworkInterpress.  ArtworkInterpress 
paintproc ignores text contents.  Use of props means that picture bits are at end of file.
Tioga docs are sent as electronic mail
=> voice mail
Voice ropes
voice interests, garbage collection, 
Narrated documents
An additional mechanism that draws on the capabilities of the voice system is ....
PTZ We have developed a mechanism that we call a script, which provides a way to layer additional structure on top of an electronic 
document or set of documents.  A script is a directed path through a document or set of documents that need not follow their linear 
order.  Each entry in a script consists of a document location, which is a contiguous sequence of characters, together with an 
associated action and an associated timing.  A sample action might consist of playing back a previously-recorded voice annotation, 
sending some text to a text-to-speech synthesizer, opening a new window, or running a program, which might animate a picture or 
retrieve items from a database.  A single document can have multiple scripts traversing it for different purposes, and a single 
script can traverse multiple documents to organize collections of information.
PTZ A script can be played back as a whole, in which case the cursor moves to the first location (l1) in a document and performs 
its associated action (a1).  The document scrolls to display that location if the location does not currently appear on the screen, 
and the location is highlighted to call attention to it.  After the associated time (t1), the cursor moves to the location 
specified in the next script entry (l2), performs its action, and so on.  The same location in a document can appear at multiple 
points in the script, with the same or different associated actions and timing.
PTZ Another way to play a script is more user-directed.  In this case, the timing information is ignored, and the script reader 
proceeds from one entry to the next at his or her own pace.
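The script structure and its two playback modes can be sketched as follows.  The field names, the play_script function, and the show and advance callbacks are assumptions made for illustration; the description above fixes only that an entry pairs a document location with an action and a timing.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScriptEntry:
    document: str                 # which document the location belongs to
    start: int                    # first character of the location
    end: int                      # one past the last character
    action: Callable[[], None]    # e.g. play a voice annotation, run a program
    seconds: float                # time to wait before the next entry


def play_script(entries, show, advance):
    """Visit each entry in script order, which need not be document order."""
    for entry in entries:
        show(entry)        # scroll to the location and highlight it
        entry.action()     # perform the associated action
        advance(entry)     # decide when to move on


def timed(entry):
    """Whole-script playback: honor the entry's timing."""
    time.sleep(entry.seconds)


def user_paced(entry):
    """User-directed playback: ignore the timing and wait for the reader."""
    input("Press Return for the next entry: ")
```

Passing timed or user_paced as the advance argument selects between the two modes without changing the script itself.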
PTZ Arbitrary actions at a scripted location allow scripted documents to perform a wide variety of tasks: demonstrations, 
tutorials, etc.  Parameterized actions allow a script to be personalized ("Hi <username>") or to more accurately reflect the 
current state of affairs ("There are <curnum> entries in this category").  For speech, this capability requires a text-to-speech 
synthesizer.
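Parameter substitution of this kind can be sketched in a few lines, assuming the angle-bracket placeholder syntax used in the examples above.  The function name fill and the choice to leave unknown placeholders untouched are assumptions.

```python
import re


def fill(template, values):
    """Replace each <name> placeholder with its current value, if one is known."""
    def substitute(match):
        return str(values.get(match.group(1), match.group(0)))
    return re.sub(r"<(\w+)>", substitute, template)
```

The filled-in string would then be sent to the text-to-speech synthesizer, since its content is not known when the script is written.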
PTZ Scripted multi-media documents can contain any combination of text, pictures, audio, and action.  Scripts need not follow the 
normal linear order of their associated document(s).  In addition, the script writer can construct multiple viewing paths through 
the document(s) for different readers and purposes.  This novel mechanism allows writers to communicate additional information to 
readers.  Scripts can be used in a wide variety of ways, including: to construct formal demonstrations and presentations, to 
construct informal communications, and to organize collections of information.
Work in progress....
An additional tool allows a script writer to order voice annotations into a sequence, creating documents that provide a narration.
Features and Drawbacks
something about the difficulty of modifying the system?
want database hooks to allow user to specialize behavior
Other
conferencing, ...
Future Work
conferencing, ...