
Voice Annotation and Editing in a Workstation Environment

Stephen Ades and Daniel C. Swinehart

XEROX Palo Alto Research Center

The Integrated Office

there is a trend in the modern office away from large time-sharing
systems towards personal computers

increasingly one computer plays a part in all the tasks of the working
day - this is

The Workstation

supporting text manipulation, spreadsheets, programming, electronic mail

soon will scan and transmit documents

together with the phone, it provides for all of an office's communication needs

Integrating the phone and the workstation would enable

control of telephone facilities from the workstation

incorporation of voice into document architectures and other computing
applications

These are the aims of the Etherphone project at Xerox PARC

[this talk addresses the latter area]

Existing Voice Editors

some emphasise the need for a lightweight and spontaneous interface

e.g. the Centerpoint system

but this system is strictly limited in its capabilities

some emphasise flexible, general and extensible design

e.g. the Diamond system

but this system is cumbersome for simple operations

The balance between speed and generality is hard to set

We aim to produce an interface

which allows effortless manipulation with a powerful command structure

in which commands have a clear and unambiguous meaning

An Integrated Editor

we want to support a uniform command interface across different media
where analogous operations exist

but

uniformity should not be applied relentlessly where it is
inappropriate

for example

creating subwindows within text is appropriate to the insertion of
graphics but not voice

text is commonly edited at the character level, but voice is better
handled at the phrase level

Freshly Created Voice

the speaker knows whether what has just been said sounded correct

a sentence can easily be recreated if incorrect

an editor should facilitate the "record, rewind, replay, (rub out), resume
recording" action typical of a dictation machine

The Tioga Editor

is the basis of the environment of our voice editor

we have used many ideas from it

Tioga is very versatile

documents are structured

Tioga can handle fonts, layout styles, illustrations

documents can be in colour or monochrome

Tioga is nevertheless effortless to use

the command interface makes common operations very fast

pairs of selections are the basis of many operations

accelerators allow common command sequences to be combined

Simple Annotation

involves only selecting a point, hitting "AddVoice", speaking, hitting
"Stop"

annotations are represented by simple icons which

do not alter the text layout

do not affect the text as seen by e.g. compilers

the voice

gets copied when the text containing it is copied

is in all ways a permanent part of a document - it can be mailed for
example

Voice Viewers

to edit voice utterances, voice viewers can be opened by a single button
push

the icon changes to identify the open window

the viewer shows a capillary display of the voice

white for silence, black for sound

no attempt is made to show an energy profile or similar, which would

distract the eye with useless information

encourage phoneme level editing
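
[a minimal sketch of how such a two-tone display might be derived, not
the editor's actual method: the function name, the frame size and the
silence threshold are all assumptions made for illustration; each frame
of samples becomes one cell, black for sound and white for silence]

```python
def capillary(samples, frame_size=4, threshold=0.1):
    """Return one character per frame of samples.

    '#' stands for a black cell (sound), '.' for a white cell (silence).
    frame_size and threshold are illustrative values, not taken from
    the voice editor itself.
    """
    cells = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        # A frame counts as sound if any sample reaches the threshold.
        peak = max(abs(s) for s in frame)
        cells.append('#' if peak >= threshold else '.')
    return ''.join(cells)
```

[only the sound/silence decision survives into the display, so the
cells carry no energy detail that could invite phoneme level editing]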

Simple Editing

keystrokes for simple operations are just the same as those used in
Tioga

insertion of fresh voice is as simple as basic annotation of text

Selection Issues

how much to select

Tioga allows character, word, paragraph selections

The voice editor steers the user towards phrase selections

what to select

Pairs of textual selections are easy to make by eye

No similar context is provided by our plain and simple voice displays

Selecting `behind the cue'

Context in a voice viewer is best gained by listening to it

As a viewer is `played', a cue moves along to point out the portion heard

All editing operations can be done `behind the cue' for speed in simple
edits

Marks can also be made `behind the cue' for later use

Simple Marks

a transposition of two voice segments needs two selections

one needs to be marked whilst the other is found

simple lightweight marks serve this purpose

they are volatile and disappear with the voice viewer

Textual Marks

created by typing at a voice selection

better than energy profiles for context recognition

keywords marking phrases could even be generated automatically

the marks are permanent - they are retained when the voice is stored in
text

Age Marks

four different colours, distinguishing the most recent four alterations
from a background colour

the colours represent an aging process

they enable the user to locate the most recent changes

they are used for dictation accelerators
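
[one way the aging process could be modelled, offered as a sketch only:
the class, its method names and the colour labels are hypothetical;
each new alteration ages the earlier ones by one step, and anything
older than the fourth falls back to the background colour]

```python
# Placeholder labels; the colours actually used are not stated here.
AGE_COLOURS = ["newest", "second", "third", "fourth"]
BACKGROUND = "background"


class AgeMarks:
    """Track which voice segments were altered most recently."""

    def __init__(self):
        self._ages = {}  # segment id -> age, 0 being the most recent

    def record_alteration(self, segment_id):
        # Every previously altered segment grows one step older.
        for key in list(self._ages):
            self._ages[key] += 1
        self._ages[segment_id] = 0

    def colour(self, segment_id):
        age = self._ages.get(segment_id)
        if age is None or age >= len(AGE_COLOURS):
            return BACKGROUND
        return AGE_COLOURS[age]
```

[after five alterations a, b, c, d, e in that order, e takes the newest
colour, b the fourth, and a has aged into the background]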

Dictation accelerators

support the "record, rewind, replay, (rub out), resume recording"
sequence

operate from the selection to the end of the `freshest voice'

Play From Selection

Resume From Selection

Resume From End

together make for fast correction of input
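
[a sketch of one possible reading of the three accelerators, not a
description of the implementation: voice is modelled as a list of
phrases, fresh_end is an assumed index just past the freshest voice,
and Resume From Selection is assumed to rub out everything from the
selection to that point before recording continues]

```python
class Dictation:
    """Toy model of the dictation accelerators over a list of phrases."""

    def __init__(self, phrases, fresh_end):
        self.phrases = list(phrases)
        self.fresh_end = fresh_end  # index one past the freshest voice

    def play_from_selection(self, selection):
        # Replay from the selection to the end of the freshest voice.
        return self.phrases[selection:self.fresh_end]

    def resume_from_selection(self, selection, new_phrases):
        # Rub out from the selection onwards, then record in its place.
        new_phrases = list(new_phrases)
        self.phrases[selection:self.fresh_end] = new_phrases
        self.fresh_end = selection + len(new_phrases)

    def resume_from_end(self, new_phrases):
        # Keep what was said and carry on after the freshest voice.
        new_phrases = list(new_phrases)
        self.phrases[self.fresh_end:self.fresh_end] = new_phrases
        self.fresh_end += len(new_phrases)
```

[in this model, voice lying beyond the freshest voice is never touched,
which matches the idea of operating only up to the end of it]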

When can the dictation facility be used?

at any time, just like Tioga accelerators

- using the Dictation Machine button to produce a fresh viewer

- in the midst of annotating text, upon making a mistake

- in the midst of adding to a voice viewer, operating up to the end of
the youngest voice

Conclusion

[remark that some implementation issues are discussed in text, some
discussed in other papers from CSL]

[mention that extensions to work are also in text - adding speech
recognition and generalising the document architecture]

We have tried to

build an interface which sets a good balance between speed for simple
operations and power for application to many situations

encourage phrase level editing

exploit the uniform editor concept but not stretch it where voice and
other media need different treatment

Because Tioga is used as the basis of most textual applications, voice
annotation has been made available in all those applications