Voice Annotation and Editing in a Workstation Environment
Stephen Ades and Daniel C. Swinehart
XEROX Palo Alto Research Center

The Integrated Office
    there is a trend in the modern office away from large time-sharing systems towards personal computers
    increasingly one computer plays a part in all the tasks of the working day - this is The Workstation
        supporting text manipulation, spreadsheets, programming, electronic mail
        soon will scan and transmit documents
    with the phone, it provides all an office's communications needs

Integrating the phone and the workstation
    would enable
        control of telephone facilities from the workstation
        incorporation of voice into document architectures and other computing applications
    These are the aims of the Etherphone project at Xerox PARC
    [this talk addresses the latter area]

Existing Voice Editors
    some emphasise the need for a lightweight and spontaneous interface
        e.g. the Centerpoint system
        but this system is strictly limited in its capabilities
    some emphasise flexible, general and extensible design
        e.g. the Diamond system
        but this system is cumbersome for simple operations
    The balance between speed and generality is hard to set
    We aim to produce an interface
        which allows effortless manipulation
        with a powerful command structure
        in which commands have a clear and unambiguous meaning

An Integrated Editor
    we want to support a uniform command interface across different media where analogous operations exist
    but the idea of uniformity should not be applied relentlessly where it is inappropriate
    for example
        creating subwindows within text is appropriate to the insertion of graphics but not voice
        text is commonly edited at the character level, but voice is better handled at the phrase level

Freshly Created Voice
    the speaker knows whether what has just been said sounded correct
    a sentence can easily be recreated if incorrect
    an editor should facilitate the "record, rewind, replay, (rub out), resume recording" action typical of a dictation machine

The Tioga Editor
    is the basis of the environment of our voice editor
    we have used many ideas from it
    Tioga is very versatile
        documents are structured
        Tioga can handle fonts, layout styles, illustrations
        documents can be in colour or monochrome
    Tioga is nevertheless effortless to use
        the command interface makes common operations very fast
        pairs of selections are the basis of many operations
        accelerators allow common command sequences to be combined

Simple Annotation
    involves only selecting a point, hitting "AddVoice", speaking, hitting "Stop"
    annotations are represented by simple icons which
        do not alter the text layout
        do not affect the text as seen by e.g. compilers
    the voice
        gets copied when the text containing it is copied
        is in all ways a permanent part of a document - it can be mailed for example

Voice Viewers
    to edit voice utterances, voice viewers can be opened by a single button push
    the icon changes to identify the open window
    the viewer shows a capillary display of the voice
        white for silence, black for sound
    no attempt is made to show an energy profile or similar, which would
        distract the eye with useless information
        encourage phoneme level editing

Simple Editing
    keystrokes for simple operations are just the same as those used in Tioga
    insertion of fresh voice is as simple as basic annotation of text

Selection Issues
    how much to select
        Tioga allows character, word, paragraph selections
        The voice editor steers the user towards phrase selections
    what to select
        Pairs of textual selections are easy to make by eye
        No similar context is provided by our plain and simple voice displays

Selecting `behind the cue'
    Context in a voice viewer is best gained by listening to it
    As a viewer is `played', a cue moves along to point out the portion heard
    All editing operations can be done `behind the cue' for speed in simple edits
    Marks can also be made `behind the cue' for later use

Simple Marks
    a transposition of two voice segments needs two selections
    one needs to be marked whilst the other is found
    simple lightweight marks serve this purpose
    they are volatile and disappear with the voice viewer

Textual Marks
    created by typing at a voice selection
    better than energy profiles for context recognition
    keywords marking phrases could even be generated automatically
    the marks are permanent - they are retained when the voice is stored in text

Age Marks
    four different colours, distinguishing the most recent four alterations from a background colour
    the colours represent an aging process
    they enable the user to locate the most recent changes
    they are used for dictation accelerators

Dictation accelerators
    support the "record, rewind, replay, (rub out), resume recording" sequence
    operate from the selection to the end of the `freshest voice'
        Play From Selection
        Resume From Selection
        Resume From End
    together make for fast correction of input

When can the dictation facility be used?
    at any time, just like Tioga accelerators
        - using the Dictation Machine button to produce a fresh viewer
        - in the midst of annotating text, upon making a mistake
        - in the midst of adding to a voice viewer, operating up to the end of the youngest voice

Conclusion
    [remark that some implementation issues are discussed in text, some discussed in other papers from CSL]
    [mention that extensions to work are also in text - adding speech recognition and generalising the document architecture]
    We have tried to build an interface which
        sets a good balance between speed for simple operations and power for application to many situations
        encourages phrase level editing
        exploits the uniform editor concept but does not stretch it where voice and other media need different treatment
    Because Tioga is used as the basis of most textual applications, voice annotation has been made available in all those applications
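
Illustrative sketch (not from the talk): the age marks and the dictation accelerators can be modelled in a few lines of Python. Everything here is an assumption made for illustration: the names `Segment`, `VoiceViewer`, `end_of_freshest` and `resume_from_selection` are hypothetical, a voice viewer is treated as a plain list of phrase-level segments, and "Resume From Selection" is read as "rub out from the selection to the end of the freshest voice, then record there". The actual editor may work quite differently.

```python
from dataclasses import dataclass

# Assumption: ages 0..3 map to the four distinguishing colours,
# and age 4 stands for the shared background colour.
BACKGROUND = 4


@dataclass
class Segment:
    label: str             # stand-in for a recorded phrase
    age: int = BACKGROUND  # 0 is the freshest voice


class VoiceViewer:
    def __init__(self, labels=()):
        self.segments = [Segment(label) for label in labels]

    def _age_all(self):
        # Every alteration makes existing voice one step older,
        # until it fades into the background colour.
        for seg in self.segments:
            seg.age = min(seg.age + 1, BACKGROUND)

    def insert(self, index, label):
        """Insert fresh voice at index; older segments age by one step."""
        self._age_all()
        self.segments.insert(index, Segment(label, age=0))

    def end_of_freshest(self):
        """Index just past the last segment carrying the youngest age."""
        if not self.segments:
            return 0
        youngest = min(seg.age for seg in self.segments)
        last = max(i for i, seg in enumerate(self.segments)
                   if seg.age == youngest)
        return last + 1

    def resume_from_selection(self, selection, label):
        """Hypothetical 'Resume From Selection': rub out from the selection
        to the end of the freshest voice, then record new voice there."""
        del self.segments[selection:self.end_of_freshest()]
        self.insert(selection, label)


viewer = VoiceViewer(["a", "b"])
viewer.insert(2, "c")                  # first dictated phrase
viewer.insert(3, "d")                  # second phrase, spoken badly
viewer.resume_from_selection(3, "d2")  # rub out "d" and re-record it
```

In this model the user never has to select the end of the mistake: the age marks already identify where the freshest voice stops, which is what lets a single selection drive the whole correction.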