An Experimental Environment for Voice System Development

by Moe, Larry, and Curly

Introduction

Suppose Alexander Graham Bell had waited to invent the telephone until personal workstations and distributed computing networks had been invented. What approach would he have taken in introducing voice communications into the modern computing environment? An attempt to answer this question led to the creation of a voice communications project within the Computer Science Laboratory at PARC. Stated more concretely, the idea was to extend existing multi-media office environments with the facilities needed to handle the transmission, storage, and manipulation of voice. We believed it should be possible to deal with voice as easily as we manage text, electronic mail, or images, in an integrated fashion. The result would be a combined workstation that could satisfy nearly all of a user's communications and computing needs.

Whatever we did would need to provide conventional telephone service, so that casual users would not have to read a manual to make a phone call. Beyond that, we wished to draw on our experience with two-dimensional workstations to address voice communications from the perspective of developers of personal information systems, rather than that of designers of advanced telephone equipment. We believe that users would prefer to perform these functions using the power and convenience of such workstation tools as on-screen menus, text editors, and comprehensive informational displays.

In this note we describe the Etherphone™ system that we have developed and used to explore ways to integrate voice into a personal information environment. In separate sections, we sketch the present hardware architecture, describe some of the more compelling applications that have been built to exploit it, and briefly explore some of the software and systems issues that have surfaced. Before proceeding, we give the reader a better picture of the topics we have chosen to focus on in this work.

Areas of Exploration

"Taming the telephone": Despite an immense investment in research and development over the last 110 years, the user interface and the functionality of the telephone still leave something to be desired.
We have experimented with comprehensive on-line telephone directories, improved methods for locating people as they move about an office building, and convenient ways of leaving recorded messages, all intended to improve the placement of telephone calls. Looking at the problems of the called party, we have provided distinctive "ring tunes", call filtering based on the subject, urgency, or caller's identity, call logging, and recorded messages, all designed to reduce the intrusive nature of the telephone. We are also exploring novel kinds of interactive voice connections, such as all-day "background" telephone calls, use of the telephone system to broadcast internal talks or meetings (a sort of giant conference telephone call), and conference calls that allow side-conversations to take place.

"Taming the tape recorder": Telephone applications must deal with the real-time demands of live conversations. Another class of applications, more closely resembling traditional workstation functions, involves the manipulation and use of recorded or synthesized voice. Our investigations of recorded voice range from the conventional dictation or answering machine, through voice mail, to more flexible methods for adding voice annotations to documents. The voice produced by a commercial text-to-speech synthesizer can be used in analogous ways. We have not yet incorporated speech recognition equipment, but there is clearly a place for it in this spectrum of applications, as there is for recorded or synthesized music and other sources of audio information. Our aim is to incorporate all of these ways of generating or accepting audio information into a coherent part of a multi-media workstation environment.

On the surface, telephony and applications of recorded or synthetic voice can be considered independently, but they are clearly interrelated. On the one hand, electronic mail with voice annotations is an effective recourse when a call goes unanswered. On the other hand, the most natural device for voice input and output in an integrated environment is the telephone (or speakerphone). Not only does this permit telephone calls and other voice applications to be coordinated without confusion, but it also admits the possibility that some voice applications can be remotely located and reached by telephone, which is especially effective if connections can be established without incurring dialing and ringing delays.

An Architecture for Voice Applications

The original goals of the Etherphone project were to produce experimental prototypes to test novel approaches to taming the telephone and taming the tape recorder, drawing on our experience with text, graphics, and other visual media. Along the way, however, a more fundamental requirement has emerged: the need to create and experimentally validate a comprehensive architecture for voice applications. Such an architecture must specify the roles of telephone transmission and switching, workstations, voice file servers, and other network services in supporting voice communications, both real-time telephone calls and recorded voice as it is stored, manipulated, and experienced in documents. It must be an open architecture, permitting programmers to create new voice-related applications and modify existing ones without having to understand the detailed workings of the voice system and without endangering its normal operation.
Ideally, this architecture could evolve into a standard defining the role of each major component, so that multiple vendors of telephone and office equipment could cooperate to provide advanced voice functions in conjunction with workstation-based applications. Simple things should be simple; elaborate ones should be possible.

The best way to explain this requirement is by analogy with existing architectures. The Xerox Network Systems protocols [XNS] represent a comprehensive architecture for packet-switched data communications. XNS specifies a common packet format that all connected computers must agree on for the transmission and routing of information from one system to another (the transport layer). Below this level are various device-dependent link-level specifications for transmitting XNS packets over various media, such as Ethernets, telephone lines, or satellite channels. Above the transport level are additional levels of protocol that support reliable end-to-end connections, important services such as remote procedure calls, bulk data transport, and terminal-to-host communications, and finally various specific applications. The TCP/IP protocol family used in the DARPA research networks has a similar organization, as do many other communications protocols, all of which follow the general structure defined by the ISO Reference Model for Open Systems Interconnection [ISO].

Similarly, modern portable operating systems are structured in layers. Although these systems vary widely in their details and even in their overall philosophy [Unix, Cedar], they share many structural attributes. At the lowest level, machine-dependent programs implement an abstract machine that allows the remainder of the system to be largely machine-independent, supporting applications that can run on a variety of hardware configurations. The levels above the abstract machine provide memory and device management, services such as character streams and window packages, and finally support for very specific applications.

Finally, document architectures are beginning to emerge that specify standards for representing complex multi-media documents, permitting the interchange of documents among workstations and hardcopy printers produced by multiple vendors [Interpress, Postscript, ODA].

As diverse as these architectures are, they share several elements. They describe, in exacting detail, the range of capabilities available to their clients, the way those capabilities are structured, and exactly how they are used. For example, XNS describes the interfaces that clients use to communicate with other applications; Interpress defines the file formats that clients must obey in order to get documents printed. In both cases, it is possible to define precisely, as subsets of the total architecture, any restrictions that a particular implementation might place on its clients. Such architectures also serve as detailed specifications for the implementors of the services that support these clients. This is very important for voice; for example, it should be possible to specify precisely the features that a PBX should provide.
In comparison, most architectures supporting digital voice that occupy levels higher than the analog of the ISO transport layer, to the extent that they exist at all, are embedded in and custom-tailored to the particular applications they support. One can identify some simple protocols at about the level of RS-232 and perhaps SDLC, but higher-level architectures tend to be implicit in the implementations of telephone switching systems, PBXs, and the like. Most existing proposals for integrating data and voice, or for integrating local area networks and PBXs, are expressed at this primitive level (one exception is MICE []). They would, for example, give little guidance on how to build an integrated Etherphone-like system using a Northern Telecom SL-1 switch and a commercial voice mail machine.

We are attempting to derive such a voice architecture from our experience with the Etherphone prototype. It is this architectural work that distinguishes our efforts from systems with similar capabilities. Before describing further aspects of the architectural design as it stands today, we describe the general structure and user facilities of the system that we have built to test these ideas.

Etherphone Project Description

In designing our prototype voice system, we surveyed several possible architectures, including the use of our existing Centrex service or a commercially available PABX. We concluded that the most effective way to satisfy our needs was to construct our own transmission and switching system [CSL-83-8]. Ethernet local-area networks, which provide the data communications backbone supporting personal distributed computing at Xerox PARC, have proven to be an effective medium for transmitting voice as well as data. Our prototype voice system (see Figure 1) consists of the following types of components, connected by Ethernets:

Etherphones: telephone/speakerphone instruments, each built around a custom microprocessor board (the Lark), a modified telephone instrument, a speaker, and a microphone, together with encryption hardware and an Ethernet controller; an adjacent workstation is optional. Etherphones digitize, packetize, and encrypt voice and send it directly over an Ethernet. Our current environment contains approximately 50 Etherphones. Additional information on the Etherphone hardware and the Voice Transmission Protocol can be found in a previous report [CSL-83-8].

Voice Control Server: a control program that provides functions similar to those of a conventional PBX and manages the interactions among all the other components. It runs on a dedicated server that also maintains databases for public telephone directories, Etherphone-workstation assignments, and other shared information.

Voice File Server: a service that can hold conversations with Etherphones in order to record or play back user utterances. In addition to managing stored voice, the Voice File Server provides operations for combining, rearranging, and replacing parts of existing voice recordings to create new voice objects. For security reasons, voice remains encrypted when stored on the file server.

Text-to-speech Server: a service that receives text strings and returns the equivalent spoken text to the user's Etherphone. We currently have two speech synthesizers, purchased from different manufacturers.

Workstations: high-performance personal computers. Workstations are the key to providing enhanced user interfaces and control over the voice capabilities.
We rely on the extensibility of the local programming environment, be it Cedar, Interlisp, Smalltalk, or another, to facilitate the integration of voice into workstation-based applications. Workstation program libraries implement the client programmer interface to the voice system. Workstation applications to date have been written primarily in the Cedar environment, although an Interlisp existence proof also exists. Figure 2 shows a Cedar screen in use, including voice, text, graphical, and program development activities. In addition, the architecture allows for the inclusion of other specialized sources or sinks of voice, such as speech recognition equipment or music synthesizers. Most of the server programming has also been done in Cedar.

All of the communication required for control in the voice system is accomplished via a remote procedure call (RPC) protocol [Birrell and Nelson]. For instance, conversations are established between two or more parties (Etherphones, servers, and so on) by performing remote procedure calls to the Voice Control Server. During the course of a conversation, RPC calls emanating from the Voice Control Server inform participants about various activities concerning the conversation. Active parties in a conversation use the Voice Transmission Protocol to actually exchange voice. Multiple implementations of RPC permit workstation programs and voice applications programmed in different environments to be integrated.
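To convey the flavor of this control structure, the sketch below renders a conversation setup in Python. It is our illustration only: the actual interfaces are Cedar programs, and the names (create_conversation, add_party, report_state) are invented.

    # Hypothetical sketch of the Voice Control Server's RPC interface.
    # Names and structure are illustrative; the real system is written
    # in Cedar and uses the RPC protocol of Birrell and Nelson.

    import itertools

    class VoiceControlServer:
        """Centralized switch: parties call in; the server calls back."""

        def __init__(self):
            self._ids = itertools.count(1)
            self._conversations = {}   # conversation id -> list of parties

        def create_conversation(self, originator):
            """RPC from an Etherphone or workstation to start a call."""
            conv = next(self._ids)
            self._conversations[conv] = [originator]
            return conv

        def add_party(self, conv, party):
            """RPC to invite another party (Etherphone, recording server, ...)."""
            self._conversations[conv].append(party)
            # The server, in turn, makes RPC calls *to* each participant
            # to report progress (ringing, answered, idle, ...).
            for p in self._conversations[conv]:
                p.report_state(conv, state="ringing", who=party.name)

    class Party:
        """A participant: an Etherphone, a voice file server, and so on."""
        def __init__(self, name):
            self.name = name
        def report_state(self, conv, state, who):
            print(f"{self.name}: conversation {conv} is {state} ({who})")
            # Once the parties reach the active state, they exchange
            # voice directly, using the Voice Transmission Protocol.

    server = VoiceControlServer()
    caller, callee = Party("Etherphone-12"), Party("Etherphone-34")
    conv = server.create_conversation(caller)
    server.add_party(conv, callee)

The important property is the callback structure: parties ask the server for changes, and the server reports progress to every participant, so that a workstation tool can track a call without polling.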
Examples of Applications to Date

The Etherphone system provides a variety of user-level services, including telephone management, text-to-speech synthesis, and voice annotation and editing. These services are typically available through an Etherphone control panel, through commands that can be issued in a typescript, and through procedures that can be incorporated into client programs. Recall that our primary goal is to develop a comprehensive and robust voice architecture that permits the construction of such services; although we currently provide an interesting assortment of them, some of the more complex items, such as conference specifications, are not yet complete.

Telephone management

The voice system supports a comprehensive set of functions for managing simple telephone calls, including call placement, call logging, directory assistance, and the locating of callers and callees. (Conference and background calls are supported by the architecture and the underlying hardware, but do not yet have user interfaces.) We first describe the functionality the system provides, then the underlying architecture and some sample call scenarios, that is, the sequences of steps required to complete a few different types of calls.

Call placement. From a workstation adjacent to an Etherphone, a user can place a telephone call in several ways. She can fill in a name or number in the Called Party: field of the Finch tool and click its button; she can select a name or number anywhere on the screen (possibly in an electronic message) and click the Phone button in the tool header; she can type Phone followed by a name or number in any Command Tool viewer; or she can use one of two directory systems that present browsable lists of names and associated telephone numbers as speed-dialing buttons.

In addition, calls can be placed by name or number from the telephone keypad. To call by name, we use a simple encoding that translates each letter into the single digit printed on that key (Q, Z, and a few special characters are also given key assignments). Keypad dialing gives an error indication when the result of this encoding is not unique, as it is for AHenderson and CHenderson. Such collisions occur rarely in our relatively small database, but a more complex scheme would be needed for a system with thousands of subscribers. (We plan to present a list of choices, either on the display or audibly using the text-to-speech server, but even this would be unwieldy in a large system.) A sketch of the encoding appears below.
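The keypad encoding is simple enough to state exactly. The following sketch is ours, not the system's code; the assignments of Q and Z are assumptions based on common practice.

    # Illustrative sketch of keypad dialing by name (not the actual
    # Etherphone code). Each letter maps to the digit printed on its
    # key; we assume Q->7 and Z->9, the conventional assignments.

    KEYS = {2: "ABC", 3: "DEF", 4: "GHI", 5: "JKL",
            6: "MNO", 7: "PQRS", 8: "TUV", 9: "WXYZ"}
    DIGIT = {letter: str(d) for d, letters in KEYS.items() for letter in letters}

    def encode(name):
        """Translate a name into its keypad-digit string."""
        return "".join(DIGIT[c] for c in name.upper() if c in DIGIT)

    def lookup(dialed, directory):
        """Return every directory name whose encoding matches the dialed
        digits. More than one match is a collision, which the Etherphone
        reports as an error indication."""
        return [name for name in directory if encode(name) == dialed]

    directory = ["AHenderson", "CHenderson", "Zellweger"]
    print(lookup(encode("AHenderson"), directory))
    # -> ['AHenderson', 'CHenderson']: A and C share key 2, so both
    #    names encode to the same digit string.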
Directory assistance. The system includes a central whitepages directory database for all Xerox employees in the Palo Alto area (about 1000 entries), and individual Etherphone users can easily create personal directories from simple text files. Call-placement routines consult first the personal directories and then the system directory to translate names to numbers. The first of the two directory systems mentioned above transforms a text file into an unlimited set of speed-dialing buttons; the usual Tioga operations of searching and level-clipping apply to these directories. The second provides a query-browsing interface to a collection of directory databases. The results of a query are again a set of speed-dialing buttons. The user can formulate complex queries based on pattern-matching of names, numbers, and other database information. In addition, a soundex search mechanism [Knuth] compensates for some kinds of spelling errors.

Locating callees. Within the Etherphone system, callees are identified primarily by name. The system tries to locate the named callee and ring the nearest Etherphone, as follows: if the person is logged in at a workstation, the adjacent Etherphone rings with that person's ring tune; otherwise, the default telephone listed in the system database rings. If the person has registered as visiting another workstation or Etherphone (by issuing the Visit command), then that Etherphone rings in addition to any other.

Although the Etherphone system interfaces to the normal telephone system, functionality for outside calls is more limited. Calls to outside numbers can be specified in a whitepages directory, but incoming outside calls are not identified more specifically than "outside line", and calls to people outside the Etherphone system cannot use the locating machinery. The Etherphone hardware is constructed so that software or hardware failures connect the telephone instrument directly to the outside telephone line. On one hand, this has made it easier to recruit experimental users, because normal telephone service is always guaranteed; on the other hand, it has made users less likely to report Etherphone failures, because they can still get their work done.

Call announcement. Calls are announced audibly, visually, and functionally. Audibly, each Etherphone user is given a personalized ring tune, such as a few bars of "We're Off To See The Wizard", that is played at a destination Etherphone to announce calls to that user; the caller hears the same tune as ringback confirmation. Visually, the telephone icon jangles with a superimposed indication of the caller, as shown in Figure X; an active conversation is represented as a conversation between two people with a superimposed indication of the other party, as shown in Figure Y. This icon feedback gives status information in a minimal screen area. Functionally, the Finch tool's Calling Party: or Called Party: field is automatically filled in to allow easy redialing, and a new conversation log entry is created; the conversation log can be consulted to discover who called during an absence from the office.

In principle, the choice of ring tune could depend on the caller, the subject of the call, the time of day, and so on, but we have found that a single tune allows people to distinguish their calls at a distance, almost subconsciously (much as we subconsciously filter noises for our name). Ring tunes have been the single most popular feature of the Etherphone system. We could also announce calls using the text-to-speech server ("Call for Swinehart from Zellweger"), but this contributes to office noise pollution if done loudly enough to catch people away from their offices; it remains a possibility of last resort after all other attempts to locate a person have failed.

Specializing Etherphone behavior. Ring tunes and ringing behaviors for each Etherphone (such as "ring my secretary between 3 and 5 pm" or "answer calls about the Cedar compiler with a particular recorded message") are specified in the centralized switching database, and users can modify these behaviors by writing new database entries. Another important consideration has been to allow all of the callee's agents in an Etherphone call (the callee's Etherphone, its adjacent workstation, and the switching server) to cooperate in deciding how a call should be answered. The switching server is consulted first, using the central database, to decide which Etherphones and which workstations to inform about a call. The workstations are then consulted, allowing them to evaluate arbitrarily complex filtering functions. Finally, the Etherphones themselves perform their default behavior, which can itself be specialized through the database: answer automatically, or ring the phone.
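The three-stage decision can be summarized schematically as below. The rule formats and field names are invented for illustration; the actual behaviors live in the centralized switching database.

    # Schematic rendering of the three-stage call-answering decision
    # (switching server -> workstation -> Etherphone). The rule format
    # and all names are invented for illustration.

    def switching_server(call, database):
        """Stage 1: use the central database to choose the agents to
        inform and the tune to play."""
        entry = database.get(call["callee"], {})
        return entry.get("ring_tune", "default"), entry.get("workstation")

    def workstation_filter(call, hour):
        """Stage 2: a user-supplied predicate, arbitrarily complex."""
        if call["subject"] == "Cedar compiler":
            return "play-recorded-message"
        if 15 <= hour < 17:
            return "ring-secretary"
        return "ring"        # fall through to the Etherphone's default

    def etherphone_default(action, tune):
        """Stage 3: the instrument's default, itself database-specified
        (answer automatically, or ring the phone)."""
        print(f"action: {action}, tune: {tune}")

    database = {"Swinehart": {"ring_tune": "Off To See The Wizard",
                              "workstation": "WS-7"}}
    call = {"callee": "Swinehart", "subject": "lunch"}
    tune, workstation = switching_server(call, database)
    etherphone_default(workstation_filter(call, hour=11), tune)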
Text-to-speech synthesis

In addition to the voice file server, which supports voice recording and playback, the Etherphone system includes two commercial speech synthesizers, a DECtalk and a Prose 2000. Each can convert arbitrary ASCII text to reasonably intelligible audio output, with control over speaking speed, pitch, and other voice characteristics. Words that do not follow the usual English pronunciation rules can be specified as sequences of phonemes. (Much of this section would apply equally well to recorded voice; it concerns uses of voice sources in the absence of telephony.)

A common commercial use of such a synthesizer is to provide telephone access to a database, such as stock quotations or bank balances, where much of the text is a canned script that has been hand-tuned for maximum intelligibility. In our system, each synthesizer is connected to a dedicated Etherphone, forming a text-to-speech server. Each server is available to any Etherphone-equipped workstation on a first-come-first-served, one-user-at-a-time basis.

A user or program can generate speech as easily as printing a message on the display. A user can select text in any display window; a program can call a procedure with the desired text as a parameter. The system takes care of setting up the connection to the text-to-speech server, sending the text (via remote procedure call), returning the digitized audio signal (via the Voice Transmission Protocol), and closing the connection when the text has been spoken.

Our primary uses for text-to-speech so far have been in programming environment and office automation applications. The ability to speak text selected in any screen window has been used directly for proofreading, and has been particularly valuable for comparing versions of a document when one version has no electronic form, as when proofing journal galleys. Calendar and reminder programs have been augmented to deliver audio reminders. Some users have added spoken progress indicators to their long computations, allowing them to "keep an ear" on a computation while performing other tasks; similarly, audio prompts and error messages allow users to focus their attention elsewhere without losing track of a program that requires intervention.

Although present-day synthesizers are less intelligible for arbitrary text than for the hand-tuned scripts used in commercial dial-up applications, the controllability of the generated speech suggests interesting future research in "audio style" for documents, in which speed, pitch, and other voice characteristics would be applied automatically to communicate italics, boldface, or quotations.

Because we treat connections to voice sources explicitly, and because the ability to include more than two parties in a conversation is not yet available above the Etherphone hardware level, we have not yet been able to experiment with uses of voice sources in telephone calls. Among the planned uses for text-to-speech are: (1) providing audio confirmation of the person or number dialed as a call begins; (2) reading electronic mail over the telephone to a remote lab member, without dedicating a synthesizer solely to this task; and (3) playing program-generated messages to callers, such as prompts or reports of the user's location, possibly obtained by consulting the user's calendar program ("Dr. Smith is at the Distributed Systems seminar now; please call back after 5 o'clock").
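From the client programmer's perspective, generating speech really is a one-procedure affair. The sketch below conveys the idea; the procedure names and parameters are our inventions, standing in for the actual Cedar RPC interface, and the stubs stand in for the RPC machinery.

    # Invented illustration of the client interface to the text-to-
    # speech server; the real interface is a Cedar RPC binding.

    class TTSServer:
        def pronounce(self, conv, text, rate, pitch):
            print(f"[conversation {conv}] speaking at {rate} wpm: {text!r}")

    def reserve_tts_server():
        return TTSServer()          # first-come-first-served reservation

    def establish_conversation(server, etherphone):
        return 42                   # conversation id from the control server

    def release(server, conv):
        pass                        # tear the conversation down

    def speak(text, rate=180, pitch="default", etherphone="my-office"):
        """Have text spoken on the given Etherphone: reserve a
        synthesizer, connect it to the Etherphone, ship the text by
        RPC, and disconnect when the text has been spoken."""
        server = reserve_tts_server()
        conv = establish_conversation(server, etherphone)
        server.pronounce(conv, text, rate=rate, pitch=pitch)
        release(server, conv)

    # A long computation can keep an absent user informed:
    for step in ("parsing", "compiling", "linking"):
        speak(f"Finished {step}")   # audible progress report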
Voice annotation and editing

The Etherphone system supports voice annotation of documents. This capability is built on top of Tioga, the Cedar environment's screen-based editor.

Tioga is a what-you-see-is-what-you-get galley editor, used both for programming and for document preparation, that supports rich formatting and typography using a variety of type faces, type styles, and paragraph formats. Tioga is unusual among text editors in that its documents are tree-structured rather than plain running text: paragraphs can be nested under a section heading, for example, and documents can be displayed at any level of detail, from the single root node to the full tree. This makes possible such operations as displaying only the section headings or key paragraphs of a document, so that scanning for a particular section can be done quickly and effortlessly. Tioga is also extensible: individual characters or nodes can have arbitrary properties associated with them. One use of node properties is to specify bitmaps and specialized screen-painting procedures for embedded pictures, so that documents can incorporate illustrations and scanned images, in black-and-white or full color.

Cedar has been designed so that other applications can employ the Tioga editor; these include the electronic mail system, the system command interpreter, and any tool that requires the entry and manipulation of text. This gives considerable unity to the editing interface, since identical keystrokes perform identical functions in all the applications in which Tioga is used, and wherever Tioga is used, all of its formatting and multi-media facilities are available. Thus, by adding voice annotation to Tioga, we have made it available to a variety of Cedar applications.

The user interface of the voice annotation system is designed to be lightweight and easy to use, since spontaneity in adding vocal annotations is essential. Voice within a document is shown as a distinctive shape superimposed on a character, so that neither the document's visual layout nor its contents as observed by other programs (compilers, for example) is affected. Users point at text selections and use menus to add and listen to voice.

Simple voice editing is also available: users can select a voice annotation and open a window showing its basic sound-and-silence profile. Sounds from the same or other voice windows can be cut and pasted together using the same editing operations the Tioga editor supports, and a lightweight dictation facility using a record/stop/backup model can be used to record and incorporate new sounds conveniently. Editing is done largely at the phrase level (never at the phoneme level), the granularity at which editing can be done with the best results and the least effort. The visual voice representation can itself be annotated: simple temporary markers keep track of important boundaries during editing operations, while permanent textual markers mark significant locations within extended transcriptions. As a further contextual aid, the system gives a visual indication of the age of the voice in an editing window: newly added voice appears in a bright yellow color, and less recently added phrases become gradually darker as further editing operations occur.
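Phrase-level editing is cheap because edits manipulate references to stored voice rather than the voice itself. The sketch below is our rendering of that idea, with invented names; cut and paste rearrange phrase descriptors without touching (or decrypting) the samples on the voice file server.

    # Our sketch of phrase-level voice editing (names invented). A piece
    # of voice is a sequence of phrase descriptors, each referring to an
    # interval of an immutable recording held on the voice file server.
    # Editing rearranges descriptors; no samples are copied.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Phrase:
        recording: str     # identifier of a stored recording
        start_ms: int      # interval within that recording
        end_ms: int

    def cut(voice, i, j):
        """Remove phrases i..j-1, returning (remaining, clipboard)."""
        return voice[:i] + voice[j:], voice[i:j]

    def paste(voice, i, clipboard):
        """Insert the clipboard's phrases before position i."""
        return voice[:i] + clipboard + voice[i:]

    take1 = [Phrase("rec-17", 0, 1800), Phrase("rec-17", 2200, 5100)]
    take2 = [Phrase("rec-20", 0, 2500)]
    draft, clip = cut(take1, 1, 2)          # pull out the second phrase
    final = paste(draft + take2, 1, clip)   # splice it into the new take
    print(final)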
The dictation facility can also be used when placing voice annotations directly into documents.

Basic annotation. Figure 1 shows a text document window, or viewer, from a Cedar workstation screen; its caption defines the various regions of the viewer, indicating how one selects objects of interest and how one performs operations on them. We use the terminology defined in Figure 1 throughout this discussion.

Any single text character within a Tioga document can be annotated with an audio recording of arbitrary length. To add an annotation, the user simply selects the desired character within a text viewer and buttons AddVoice in that viewer's menu. Recording begins immediately, using either a hands-free microphone or the telephone handset, and continues until the user buttons STOP. As recording begins, a distinctive iconic indication of the presence of a voice annotation is displayed as a decoration of the selected character: currently, this voice icon is an outline in the shape of a comic-strip dialog balloon surrounding the entire character. The second line of text in the first paragraph shown in Figure 1 contains a voice icon.

Adding a voice icon does not alter the layout of a document in any way. Thus, voice annotations can be used to comment on the content, format, or appearance of formatted text, and programs such as compilers can read the text, ignoring voice icons just as they ignore font information. Voice annotations may be used, for example, to explain portions of a program text without affecting the ability to compile it; draft versions of this paper were annotated with verbal suggestions for alterations and improvements. Like font information, voice icons are copied along with the text they annotate when editing operations move or copy that text, either within the same document or from one document to another.

A voice annotation becomes a permanent part of a Tioga document; copies of the document may be stored on shared file servers or sent directly to other users as electronic mail, so voice mail becomes a special case of voice annotation. To listen to voice, a user selects a region containing one or more voice icons and buttons PlayVoice. Since playback takes some time, the user may queue up additional PlayVoice requests during playback; these are played in sequence, and the STOP button halts the whole process at any time.

One aspect of the Etherphone system's architecture is particularly relevant to voice editing: digitized voice is stored not on individual workstations but on a voice file server on the Ethernet, designed specifically for recording and playing voice. A document carries references to the stored voice rather than the voice itself, so that, unlike the bits of an embedded picture, the voice is not in the document file; storing voice by reference also allows the server eventually to reclaim recordings to which no references remain. The only time the voice data is directly accessed is to play it back, by sending voice packets from the server to an Etherphone.
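One plausible rendering of the underlying representation appears below. It is our guess at the flavor, not Tioga's actual property scheme: the annotated character simply carries a property whose value names voice stored on the file server.

    # Our illustration (not Tioga's actual scheme) of how a character-
    # level property can carry a voice annotation by reference. The
    # document text is unchanged; programs that do not understand the
    # property ignore it, just as they ignore font information.

    from dataclasses import dataclass, field

    @dataclass
    class Char:
        ch: str
        properties: dict = field(default_factory=dict)

    def add_voice(doc, index, voice_ref):
        """Attach a reference to server-resident voice to one character."""
        doc[index].properties["Voice"] = voice_ref   # e.g. a file-server id

    def plain_text(doc):
        """What a compiler sees: the text, with voice icons invisible."""
        return "".join(c.ch for c in doc)

    doc = [Char(c) for c in "x := x + 1"]
    add_voice(doc, 0, "voice-file-server:rec-83")    # balloon drawn on 'x'
    assert plain_text(doc) == "x := x + 1"           # text and layout intact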
The key points of the design are:

· Voice is treated as an additional medium to be incorporated into a multi-media document management system. The voice facilities have been added by extending the semantics of an existing user interface to encompass voice where appropriate, and then by adding new techniques to deal with the idiosyncrasies of the audio medium.

· There are cases where simply converting the semantics of a text editing interface to voice would yield poor results. In such cases, we have produced a deliberately different interface. For example, we restrict voice editing to the manipulation of quantities no smaller than a spoken phrase, using a very simple capillary representation of the phrase structure; we have concluded that more elaborate energy-profile representations stress too fine a level of detail, and may provide more distraction than contextual information.

· The prototype voice editor required only two months to implement. This was possible because the components of the Cedar programming environment were designed to be extensible: the editor directly used a number of user interface facilities already available in the environment, and the Etherphone system supplied the underlying capabilities for telephone control as well as for recording, playback, and low-level voice editing. Extensions were linked into Tioga to add voice icons and the specialized voice recording, playback, and dictation commands. (Tioga's property mechanism already supported embedded pictures in just this style: an Artwork property on a node selects a specialized paint procedure, companion properties carry the picture bits and boundary, and the node's text contents are ignored when the picture is painted.)

We have just begun to test this voice editor within the Cedar community, and will discover which aspects of our design find favor with users and which need improvement. There are many ways in which this work could be extended, some of which have been outlined above. We believe that future work should continue our efforts to balance the need for a user interface that is easy to understand and use against the desire for an extensible and general structure that enables fluent and efficient manipulation of a variety of media.

Narrated documents

An additional mechanism that draws on the capabilities of the voice system is the script, which provides a way to layer additional structure on top of an electronic document or set of documents. A script is a directed path through a document or set of documents that need not follow their linear order. Each entry in a script consists of a document location, which is a contiguous sequence of characters, together with an associated action and an associated timing. A sample action might consist of playing back a previously recorded voice annotation, sending some text to a text-to-speech synthesizer, opening a new window, or running a program, which might animate a picture or retrieve items from a database. A single document can have multiple scripts traversing it for different purposes, and a single script can traverse multiple documents to organize collections of information.

A script can be played back as a whole, in which case the cursor moves to the first location (l1) in a document and performs its associated action (a1). The document scrolls to display that location if it does not currently appear on the screen, and the location is highlighted to call attention to it. After the associated time (t1), the cursor moves to the location specified in the next script entry (l2), performs its action, and so on.
The same location in a document can appear at multiple points in the script, with the same or different associated actions and timings. Another way to play a script is more user-directed: the timing information is ignored, and the script reader proceeds from one entry to the next at his or her own pace.

Allowing arbitrary actions at a scripted location lets scripted documents perform a wide variety of tasks: demonstrations, tutorials, and so on. Parameterized actions allow a script to be personalized ("Hi ⟨name⟩") or to reflect the current state of affairs more accurately ("There are ⟨n⟩ entries in this category"); for speech, this capability requires a text-to-speech synthesizer.

Scripted multi-media documents can thus contain any combination of text, pictures, audio, and action. Scripts need not follow the normal linear order of their associated documents, and the script writer can construct multiple viewing paths through the documents for different readers and purposes. This novel mechanism allows writers to communicate additional information to readers. Scripts can be used in a wide variety of ways: to construct formal demonstrations and presentations, to construct informal communications, and to organize collections of information. As work in progress, an additional tool allows a script writer to order voice annotations into a sequence, creating documents that provide a running audio narration.
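Stripped to its essentials, a script is an ordered list of (location, action, timing) entries plus a playback loop, as the following sketch of ours (with invented names) suggests.

    # Our reconstruction of the script mechanism (names invented): a
    # script is an ordered list of (location, action, timing) entries;
    # playback scrolls to each location, performs the action, waits for
    # the entry's dwell time, and moves on. In the user-directed mode
    # the timing is ignored and the reader advances by hand.

    import time
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Entry:
        location: tuple              # (document, first char, last char)
        action: Callable[[], None]   # play voice, speak text, run program
        seconds: float               # dwell time before the next entry

    def scroll_and_highlight(location):
        doc, lo, hi = location
        print(f"[{doc}: characters {lo}-{hi} highlighted]")

    def play(script, self_paced=False):
        for entry in script:
            scroll_and_highlight(entry.location)
            entry.action()
            if self_paced:
                input("next> ")      # reader proceeds at his or her own pace
            else:
                time.sleep(entry.seconds)

    script = [
        Entry(("report.tioga", 0, 40), lambda: print("play voice rec-5"), 4.0),
        Entry(("figures.tioga", 12, 30), lambda: print("speak caption"), 2.5),
    ]
    play(script)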
Progress in the design of a voice architecture

We have approached the problem of designing a voice architecture in a conventional manner: we identified a set of capabilities that we would like the system to have, then designed a system to provide them, taking as much care as we could to structure the system for modularity and flexibility. We have been using the resulting system as a model, both positive and negative, for the facilities, interfaces, and protocols that should make up a general architecture for voice systems. From highest to lowest, the layers of the emerging architecture are:

Application layer: client applications that use voice.
Service layer: telephony, recording, speech synthesis, speech understanding, and so on.
Conversation layer: conversation establishment and management.
Transmission layer: voice representations and transmission protocols, plus the associated control protocols.
Physical layer: the communication media.

The components of our prototype fit these layers as follows:

Application layer: WalnutVoice, TiogaVoice, and the Finch tool.
Service layer: the Lark Etherphone software, the Bluejay voice file server and its stored-voice operations, the text-to-speech servers, and Finch's telephone-management service.
Conversation layer: Thrush, the program of the Voice Control Server.
Transmission layer: the Etherphone voice transmission protocol, standard telephone company transmission, and RPC for control.
Physical layer: Ethernets and normal telephone lines.

A complete architecture must also specify how conventional (especially conventional mu-law digital) switches would fit in, how other voice representations (say, ADPCM or LPC encodings on the disk) would coexist, how other transmission protocols would fit, and the right design for tandem connections to the analog and digital telephone world. Gateways at the Transmission layer can connect different Physical layer components and even convert between different Transmission layer protocols; Etherphones can play this role, to an extent, in our current prototype, since a conversation can travel over the Ethernet and then be forwarded over a phone line via an Etherphone's back door.

At the Transmission layer, there is a distinction between the representation of voice and the protocol for transmitting it. We use the same voice representation as the telephone company (64 kilobits per second, mu-law PCM) but a different protocol: packet switching instead of circuit switching. If we had Cambridge rings or IBM token rings, we would still do packet switching, but would likely use different packet sizes than the Etherphone protocol does. Supporting multiple voice formats would require translation gateways between them. A PBX provides the bottom two layers, and could possibly provide the Conversation layer as well.

Our prototype Conversation layer is provided by a centralized server. We have not yet thought hard about a decentralized implementation of this layer, though one is necessary if the system is to scale well. The Service layer is already fairly rich, distributed among servers and workstation programs, although open questions remain about the interactions between workstations and multiple servers providing identical or similar services. Finally, we have a few good examples of Application layer programs.

Looking at these layers, it becomes easier to see how our efforts differ from work being done elsewhere. Most current efforts to "integrate voice and data" deal only with the Transmission and Physical layers. Other systems that include voice, such as Diamond and commercial voice mail services, have some specialized applications but very scanty Service and Conversation layers; they mostly build directly on a Transmission layer. By contrast, we have concentrated our efforts on Conversation and Service layer specifications, and on the architecture in general. It is here, we believe, that we can make an important contribution.
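The Transmission layer arithmetic is worth making explicit: at 8000 mu-law samples per second, one octet per sample, the representation costs 64 kilobits per second per direction. The sketch below assumes 20-millisecond packets purely for illustration (the actual packet format appears in [CSL-83-8]), and its field names are invented.

    # Illustrative Transmission layer arithmetic. The packet duration
    # and field names are assumptions for illustration; the actual
    # Etherphone packet format is described in CSL-83-8.
    # Representation: 64 kbit/s mu-law PCM, 8000 one-octet samples/s.

    from dataclasses import dataclass

    SAMPLE_RATE = 8000            # mu-law samples per second
    PACKET_MS = 20                # assumed packet duration
    SAMPLES_PER_PACKET = SAMPLE_RATE * PACKET_MS // 1000   # 160 octets

    @dataclass
    class VoicePacket:
        conversation: int         # which conversation the samples belong to
        sequence: int             # ordering and loss detection
        key_id: int               # which key encrypted the samples
        samples: bytes            # 160 encrypted mu-law octets

    bit_rate = SAMPLE_RATE * 8                 # 64,000 bits per second
    packets_per_second = 1000 // PACKET_MS     # 50 packets per second
    pkt = VoicePacket(7, 0, 3, bytes(SAMPLES_PER_PACKET))
    print(bit_rate, packets_per_second, len(pkt.samples))

A ring network with different natural frame sizes would change the packet arithmetic but not the representation, which is the point of keeping the two separable within the layer.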
Several open issues remain in this layering. The placement of the control protocols is somewhat arbitrary: RPC appears at the Transmission layer, yet control traffic in some ways parallels the Conversation layer, and the architecture still needs a natural home for database access, multicast, and low-level Etherphone control. The original design does not depend heavily on a centralized server; it would work reasonably well fully decentralized, one control program per Etherphone, with each conversation managed by the Etherphone that initiated it, even if that user eventually dropped out. The harder problem is to design the Conversation layer for a few cooperating centralized servers; similar problems arise when one contemplates replicating the voice file server. The clear trend is for services to drift from workstations to servers as they become stable and well understood, leaving only simple interface stubs on the workstation; replicated services remain a challenge that has traditionally been outside the scope of this project.

State

The Etherphone system, with approximately 50 Etherphones, is in everyday use by members of CSL as their sole telephone (connections to other lines and outside trunks are provided as described above). On top of it we have developed the applications described in this note, and we are still working to define an architecture that would make such applications easier to build and more robust; competing uses for voice connections remain a particular problem. We want to explore a number of further areas, including telephone filtering, attendant-console facilities, and conferencing. We also need to experiment with applying the same architecture to different workstation environments and different hardware architectures, and to other media, such as still and real-time video. We have discovered that managing voice in a distributed environment presents interesting problems that are pertinent to these other media as well. If we get the architecture right, and enough of it built, we can open the system up so that other programmers can contribute additional applications; we have already done some of that.

References

[Birrell and Nelson] Birrell, A. D., and Nelson, B. J. Implementing remote procedure calls. ACM Transactions on Computer Systems 2, 1 (February 1984), 39-59.
[Cedar] Teitelman, W. A tour through Cedar. IEEE Software 1, 2 (April 1984).
[CSL-83-8] Swinehart, D. C., Stewart, L. C., and Ornstein, S. M. Adding voice to an office computer network. Xerox PARC Technical Report CSL-83-8, 1983.
[ISO] Zimmermann, H. OSI reference model: the ISO model of architecture for open systems interconnection. IEEE Transactions on Communications COM-28, 4 (April 1980), 425-432.
[Knuth] Knuth, D. E. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, 1973.
[Ades] Ades, S., and Swinehart, D. C. Voice annotation and editing in a workstation environment. Manuscript in preparation.
[Unix] Ritchie, D. M., and Thompson, K. The UNIX time-sharing system. Communications of the ACM 17, 7 (July 1974), 365-375.
[XNS] Xerox Corporation. Internet Transport Protocols. Xerox System Integration Standard XSIS 028112, December 1981.
[Zellweger] Zellweger, P. T. Scripted documents. Manuscript in preparation, September 1985.