Voice Annotation and Editing in a Workstation Environment
Stephen Ades and Daniel C. Swinehart
XEROX Palo Alto Research Center

The Integrated Office
there is a trend in the modern office away from large time-sharing
systems towards personal computers
increasingly one computer plays a part in all the tasks of the working
day - this is
The Workstation
supporting text manipulation, spreadsheets, programming, electronic mail
soon will scan and transmit documents
with the phone, it provides all an office's communications needs
Integrating the phone and the workstation would enable
 control of telephone facilities from the workstation
 incorporation of voice into document architectures and other computing
applications
These are the aims of the Etherphone project at Xerox PARC
[this talk addresses the latter area]
Existing Voice Editors
 some emphasise the need for a lightweight and spontaneous interface
 e.g. the Centerpoint system
but this system is strictly limited in its capabilities
some emphasise flexible, general and extensible design
 e.g. the Diamond system
but this system is cumbersome for simple operations
The balance between speed and generality is hard to set
We aim to produce an interface
which allows effortless manipulation with a powerful command structure
in which commands have a clear and unambiguous meaning
An Integrated Editor
we want to support a uniform command interface across different media
where analogous operations exist
but
 the idea of uniformity should not be applied relentlessly where
inappropriate
for example
creating subwindows within text is appropriate to the insertion of
graphics but not voice
text is commonly edited at the character level, but voice is better
handled at the phrase level
Freshly Created Voice
the speaker knows whether what has just been said sounded correct
a sentence can easily be recreated if incorrect
an editor should facilitate the "record, rewind, replay, (rub out), resume
recording" action typical of a dictation machine
The Tioga Editor
is the basis of the environment of our voice editor
we have used many ideas from it
Tioga is very versatile
documents are structured
Tioga can handle fonts, layout styles, illustrations
documents can be in colour or monochrome

Tioga is nevertheless effortless to use
the command interface makes common operations very fast
pairs of selections are the basis of many operations
accelerators allow common command sequences to be combined
Simple Annotation
involves only selecting a point, hitting "AddVoice", speaking, hitting
"Stop"

annotations are represented by simple icons which
do not alter the text layout
do not affect the text as seen by e.g. compilers

the voice
gets copied when the text containing it is copied
is in all ways a permanent part of a document - it can be mailed for
example
Voice Viewers
to edit voice utterances, voice viewers can be opened by a single button
push
the icon changes to identify the open window
the viewer shows a capillary display of the voice
white for silence, black for sound
no attempt is made to show an energy profile or similar, which would
distract the eye with useless information
encourage phoneme level editing
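A minimal sketch of how a two-level sound and silence display might be derived, for illustration only. It assumes the voice is available as a list of per-frame energy values and that a fixed threshold separates sound from silence; the threshold, the function names and the text rendering are hypothetical and are not taken from the Etherphone implementation. The energy values are used only to make the decision and are never shown.

```python
from typing import List, Tuple

Run = Tuple[str, int, int]  # (kind, start_frame, end_frame_exclusive)


def sound_silence_runs(energies: List[float], threshold: float) -> List[Run]:
    """Collapse per-frame energies into runs of 'sound' or 'silence'.

    Assumption: a frame counts as sound when its energy reaches the threshold.
    """
    runs: List[Run] = []
    for i, energy in enumerate(energies):
        kind = "sound" if energy >= threshold else "silence"
        if runs and runs[-1][0] == kind:
            # Extend the current run by one frame.
            runs[-1] = (kind, runs[-1][1], i + 1)
        else:
            # Start a new run at this frame.
            runs.append((kind, i, i + 1))
    return runs


def render(runs: List[Run]) -> str:
    """Draw the runs as text: '#' stands for black (sound), '.' for white (silence)."""
    return "".join(("#" if kind == "sound" else ".") * (end - start)
                   for kind, start, end in runs)
```

For example, render(sound_silence_runs([0.0, 0.1, 0.9, 0.8, 0.0], 0.5)) gives "..##.".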
Simple Editing
keystrokes for simple operations are just the same as those used in
Tioga
insertion of fresh voice is as simple as basic annotation of text
Selection Issues
how much to select
Tioga allows character, word, paragraph selections
The voice editor steers the user towards phrase selections

what to select
Pairs of textual selections are easy to make by eye
No similar context is provided by our plain and simple voice displays
Selecting `behind the cue'
Context in a voice viewer is best gained by listening to it
As a viewer is `played', a cue moves along to point out the portion heard
All editing operations can be done `behind the cue' for speed in simple
edits
Marks can also be made `behind the cue' for later use
Simple Marks
a transposition of two voice segments needs two selections
one needs to be marked whilst the other is found
simple lightweight marks serve this purpose
they are volatile and disappear with the voice viewer
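A transposition using one mark and one selection can be sketched as below. This is an illustration under stated assumptions: the voice is modelled as a list of phrases, and the mark and the selection are (start, end) index pairs with an exclusive end. The function name and this representation are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end_exclusive) indices into the phrase list


def transpose(phrases: List[str], mark: Span, selection: Span) -> List[str]:
    """Return a new phrase list with the marked and selected spans swapped."""
    (a0, a1), (b0, b1) = sorted([mark, selection])
    if a1 > b0:
        raise ValueError("mark and selection must not overlap")
    # Order: before first span, second span, the gap between, first span, rest.
    return (phrases[:a0] + phrases[b0:b1] + phrases[a1:b0]
            + phrases[a0:a1] + phrases[b1:])
```

The mark plays the role of the first selection while the user listens for the second; either may come first in the voice.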
Textual Marks
created by typing at a voice selection
better than energy profiles for context recognition
keywords marking phrases could even be generated automatically
the marks are permanent - they are retained when the voice is stored in
text
Age Marks
four different colours, distinguishing the most recent four alterations
from a background colour
the colours represent an aging process
they enable the user to locate most recent changes
they are used for dictation accelerators
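The aging process might be modelled as below. This is a sketch, not the editor's implementation: it assumes each voice segment carries an age counter that is incremented on every alteration, and the colour names are placeholders, since the slides do not say which colours are used.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical palette: index 0 is the most recent alteration.
AGE_COLOURS = ["red", "orange", "yellow", "green"]
BACKGROUND = "grey"


@dataclass
class Segment:
    label: str
    age: int  # 0 = most recent alteration


def record_alteration(segments: List[Segment], label: str, position: int) -> None:
    """Age every existing segment by one step, then insert the new segment."""
    for seg in segments:
        seg.age += 1
    segments.insert(position, Segment(label, 0))


def colour_of(seg: Segment) -> str:
    """The four most recent alterations get a colour; older voice fades to background."""
    return AGE_COLOURS[seg.age] if seg.age < len(AGE_COLOURS) else BACKGROUND
```

After five successive insertions, only the last four are coloured; the first insertion has aged into the background along with the original voice.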
Dictation accelerators
support the "record, rewind, replay, (rub out), resume recording"
sequence
operate from the selection to the end of the `freshest voice'
Play From Selection
Resume From Selection
Resume From End
together make for fast correction of input
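One possible reading of the three accelerators is sketched below. The semantics are assumptions drawn from the "record, rewind, replay, (rub out), resume recording" sequence: Play From Selection replays up to the end of the freshest voice, Resume From Selection rubs out that same span and records in its place, and Resume From End records at the end of the freshest voice. The class, its methods, and the phrase-list model are hypothetical.

```python
from typing import List


class DictationViewer:
    """Toy model of a voice viewer: a list of phrases plus the end of the freshest voice."""

    def __init__(self) -> None:
        self.phrases: List[str] = []
        self.fresh_end = 0  # index just past the freshest voice

    def record(self, at: int, new_phrases: List[str]) -> None:
        """Insert newly recorded phrases; they become the freshest voice."""
        self.phrases[at:at] = new_phrases
        self.fresh_end = at + len(new_phrases)

    def play_from_selection(self, sel: int) -> List[str]:
        """Replay from the selection to the end of the freshest voice."""
        return self.phrases[sel:self.fresh_end]

    def resume_from_selection(self, sel: int, new_phrases: List[str]) -> None:
        """Rub out from the selection to the end of the freshest voice, then record there."""
        del self.phrases[sel:self.fresh_end]
        self.record(sel, new_phrases)

    def resume_from_end(self, new_phrases: List[str]) -> None:
        """Continue recording at the end of the freshest voice."""
        self.record(self.fresh_end, new_phrases)
```

A typical correction: record two phrases, play back from the second, replace it with Resume From Selection, then carry on with Resume From End.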
When can the dictation facility be used?
at any time, just like Tioga accelerators
- using the Dictation Machine button to produce a fresh viewer
- in the midst of annotating text, upon making a mistake
- in the midst of adding to a voice viewer, operating up to the end of
the youngest voice
Conclusion
[remark that some implementation issues are discussed in text, some
discussed in other papers from CSL]
[mention that extensions to work are also in text - adding speech
recognition and generalising the document architecture]
We have tried to
build an interface which sets a good balance between speed for simple
operations and power for application to many situations
encourage phrase level editing
exploit the uniform editor concept but not stretch it where voice and
other media need different treatment
Because Tioga is used as the basis of most textual applications, voice
annotation has been made available in all those applications