ieeeappl.tioga
Polle Zellweger (PTZ) November 14, 1986 6:27:27 pm PST
Add to Voice File Server description:
Etherphones: telephone/speakerphone instruments that include a microcomputer, encryption hardware, and an Ethernet controller. Etherphones digitize, packetize, and encrypt telephone-quality voice and send it to each other directly over an Ethernet. Etherphone software is written in C. Our current environment contains approximately 50 Etherphones, which are used daily by members of our laboratory as their only telephone service. A connection to the standard direct-dial telephone line provides access to telephones outside the Etherphone system. Additional information on the Etherphone hardware and the Voice Transmission Protocol can be found in a previous report [#A].
Voice File Server: a service that can hold conversations with Etherphones in order to record or play back user utterances. In addition to managing stored voice in a special-purpose file system, the Voice File Server provides operations for combining, rearranging, and replacing parts of existing voice recordings to create new voice objects. For security reasons, voice remains encrypted when stored on the file server.
Voice applications
Most of our user-level applications to date have been created in the Cedar environment, although limited functions have been provided for Interlisp and for standalone Etherphones. This section describes the voice applications that are available in Cedar, including telephone management, text-to-speech synthesis, and voice annotation and editing. Figure 2 shows a typical Cedar screen using voice, text, and graphics to support programming and document preparation activities.
Figure 2. A Cedar screen in use. The two windows in the upper left show a document preparation task, including voice annotation of this paper. The bottom window of this pair shows new voice (represented by arrowheads) being inserted in the middle of an existing voice annotation. The two windows in the lower left show a programming task that is monitoring part of the voice annotation system. The two windows in the upper right show images created with several graphical illustration packages. The window at the lower right accepts user commands, similar to a Unix shell. The bottom row of icons represents files and tools that are active but are not currently being manipulated by the user.
In order to make voice a first-class citizen of the Cedar environment, Etherphone functions are typically available in several ways: through an Etherphone control panel, through commands that can be issued in a typescript, and through procedures that can be invoked from client programs. This integration of voice capabilities will be discussed more fully in the next section.
Telephone management
The telephone management functions provide improved capabilities for both placing and receiving calls. Figure 3 shows an Etherphone control window, called Finch, and a personal telephone directory window.
Users can place calls by specifying a name, a number, or other attributes of the called party. A system directory database for local Xerox employees (about 1000 entries) is stored on the Voice Control Server. Etherphone users can also create personal directories, which are consulted before the system directory to locate the desired party. A soundex search mechanism [Knuth] compensates for some kinds of spelling errors.
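The directory lookup described above can be sketched as follows. This is a minimal illustration, not the Etherphone implementation: it assumes the standard American Soundex coding, directories keyed by surname, and a hypothetical lookup function that tries an exact match and then a Soundex match in each directory, personal directory first. The names and extensions in the usage note are invented.

```python
# Letter-to-digit table for standard American Soundex (assumed variant).
_CODES = {}
for _digit, _letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(name):
    """Return the four-character Soundex code for a name, or "" if it has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept after it
    return (code + "000")[:4]


def lookup(name, personal, system):
    """Hypothetical lookup: personal directory first, then the system directory.

    Within each directory an exact (case-insensitive) match is tried before
    a Soundex match. Returns (entry, number) or None.
    """
    key = soundex(name)
    for directory in (personal, system):
        for entry, number in directory.items():
            if entry.lower() == name.lower():
                return entry, number
        for entry, number in directory.items():
            if soundex(entry) == key:
                return entry, number
    return None
```

For example, with a system directory of {"Zellweger": "x1234"}, the misspelling "Zelweger" still resolves, because both spellings share the code Z426.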
A variety of convenient workstation dialing methods are provided. A user can fill in fields in the Finch tool, select names or numbers from anywhere on the screen, or use either of two directory tools that present browsable lists of names and associated telephone numbers as speed-dialing buttons. Calls can also be placed by name or number from the telephone keypad.
Calls are announced audibly, visually, and functionally. Each Etherphone user selects a personalized ring tune, such as a few bars of "Mary Had a Little Lamb". This tune is played at a destination Etherphone to announce calls to that user. The caller hears the same tune as a ringback confirmation. During ringing, the telephone icon jangles with a superimposed indication of the caller, as shown in the middle portion of Figure 4. An active conversation is represented as a conversation between two people with a superimposed indication of the other party (also shown in Figure 4). A collection of icons is displayed in Figure 4. The system automatically fills in the Finch tool's Calling Party or Called Party field to allow easy redialing. It also creates a new entry in a conversation log. A user can consult the conversation log to discover who called while he was away from his office.
Our methods of following a user around an office building rely on the personalized ring tunes, which allow Etherphone users to identify calls to them wherever they may be: in their own offices, within earshot, or at other Etherphones. If an Etherphone user logs in at a workstation, his calls are automatically forwarded to the adjacent Etherphone. An additional feature, called visiting, allows him to register his presence with a second workstation or Etherphone, such as during a meeting. Registering with the destination location allows users to travel more freely than forwarding calls from the home location does. Each visit request cancels any earlier requests; visiting the home location cancels visiting. The common problem of forgetting to cancel forwarding is further eased by ringing both Etherphones during visiting.
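The visiting rules can be summarized in a small sketch. The class and method names below are hypothetical and the data structures are an assumption. Only the three rules themselves come from the description above: a new visit replaces any earlier one, visiting the home location cancels visiting, and both Etherphones ring while a visit is in effect.

```python
class CallRouter:
    """Hypothetical model of the visiting rules; not the actual server code."""

    def __init__(self):
        self.home = {}      # user -> home Etherphone
        self.visiting = {}  # user -> currently visited Etherphone, if any

    def register_home(self, user, phone):
        self.home[user] = phone

    def visit(self, user, phone):
        if phone == self.home[user]:
            # Visiting the home location cancels visiting.
            self.visiting.pop(user, None)
        else:
            # Each visit request cancels any earlier request.
            self.visiting[user] = phone

    def ring_targets(self, user):
        """Return the Etherphones that ring for a call to this user."""
        targets = [self.home[user]]
        if user in self.visiting:
            # Both the home and the visited Etherphone ring during visiting.
            targets.append(self.visiting[user])
        return targets
```

Because the home Etherphone always rings, a user who forgets to cancel a visit still receives calls at his own office.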
Text-to-speech synthesis
A user or program can generate speech as easily as printing a message on the display by using one of the Text-to-speech Servers. A user can select text in a display window and click the Finch tool's SpeakText menu button. A program can call a procedure with the desired text as a parameter. These features are implemented by creating a "conversation" between the user's Etherphone and a Text-to-speech Server. The system sets up a connection to the Text-to-speech Server, sends the text (via RPC), returns the digitized audio signal (via the Voice Transmission Protocol), and closes the connection when the text has been spoken. A similar mechanism is used for voice recording and playback.
Our primary uses for text-to-speech so far have been in programming environment and office automation applications. Programming environment tasks have included spoken progress indicators, prompts, and error messages. Office automation applications have included proofreading (especially comparing versions of a document when one version has no electronic form, such as proofing journal galleys) and audio reminder messages generated by calendar programs.
Voice annotation and editing
Section intro... prototype system ... Figure 5 in here somewhere... Tioga
The user interface of the voice annotation system is designed to be lightweight and easy to use, since spontaneity in adding vocal annotations is essential. Voice within a document is shown as a distinctive shape superimposed around a character, so that the document's visual layout and its contents as observed by other programs (such as compilers) are unaffected.
To add an annotation, the user simply selects the desired character within a text window and buttons AddVoice in that window's menu. Recording begins immediately, using either a hands-free microphone or the telephone handset, and continues until the user buttons STOP. A voice annotation becomes a permanent part of a Tioga document, although the voice data physically resides on the Voice File Server. Copies of the document may be stored on shared file servers or sent directly to other users as electronic mail. To listen to voice, a user selects a region containing one or more voice icons and buttons PlayVoice.
Simple voice editing is available: users can select a voice annotation and open a window showing its basic sound-and-silence profile. Sounds from the same or other voice windows can be cut and pasted together using the same editing operations supported by the Tioga editor. A lightweight "dictation facility" that uses a record/stop/backup model can be used to record and incorporate new sounds conveniently. Editing is done largely at the phrase level (never at the phoneme level), representing the granularity at which editing can be done with best results and least effort. The dictation facility can also be used when placing voice annotations directly into documents.
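One way to picture cut-and-paste editing of voice is to treat an edited voice object as a list of intervals that refer to existing recordings, so that editing builds new lists and never changes the recorded samples. The sketch below rests on that assumption. The Segment type, the sample-offset units, and the function names are all hypothetical and are not taken from the Voice File Server interface.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    recording: str  # identifier of an existing recording (hypothetical)
    start: int      # offset of the first sample in that recording
    end: int        # offset one past the last sample


def length(voice: List[Segment]) -> int:
    """Total number of samples described by a voice object."""
    return sum(seg.end - seg.start for seg in voice)


def extract(voice: List[Segment], start: int, end: int) -> List[Segment]:
    """Return a new voice object covering positions [start, end) of voice."""
    out = []
    pos = 0  # position of the current segment within the whole voice object
    for seg in voice:
        seg_len = seg.end - seg.start
        lo = max(start, pos)
        hi = min(end, pos + seg_len)
        if lo < hi:
            out.append(Segment(seg.recording, seg.start + lo - pos, seg.start + hi - pos))
        pos += seg_len
    return out


def insert(voice: List[Segment], at: int, new: List[Segment]) -> List[Segment]:
    """Paste new into voice at position at, leaving both inputs unchanged."""
    return extract(voice, 0, at) + new + extract(voice, at, length(voice))


def replace(voice: List[Segment], start: int, end: int, new: List[Segment]) -> List[Segment]:
    """Replace positions [start, end) of voice with new."""
    return extract(voice, 0, start) + new + extract(voice, end, length(voice))
```

Inserting a 30-sample phrase from recording "B" at position 40 of a 100-sample recording "A" yields three segments: A[0:40], B[0:30], and A[40:100]. Under this representation an edit touches only the interval list, which is why the operations can be cheap regardless of how long the recordings are.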
Furthermore, the visual voice representation itself can be annotated: simple temporary markers are used to keep track of important boundaries during editing operations, while permanent textual markers are used to find significant locations within extended transcriptions. As a further contextual aid, the system provides a visual indication of the age of the voice in an editing window. Newly-added voice appears in a bright yellow color, while less-recently-added phrases become gradually darker as new editing operations occur.
More information about the voice annotation and editing system can be found in [Ades].
Future Directions/ Future Exploration/ Work in Progress
Section intro ...
Call filtering. options based on the subject, urgency, or caller's identity
Novel voice connections. We have begun to explore novel kinds of interactive voice connections, such as all-day "background" telephone calls, use of the telephone system to broadcast internal talks or meetings (as a sort of giant conference telephone call), and conference calls that allow side-conversations to take place. incorporate still and real-time video into system architecture?
Using text-to-speech/voice sources in telephone calls. Among planned uses for text-to-speech are: (1) providing audio confirmation of the person or number dialed as a call begins, (2) reading electronic mail over the telephone to a remote lab member (without dedicating a synthesizer solely to this task) as in PhoneSlave [ref] or MICE [ref], and (3) playing program-generated messages to callers, such as prompts or reports of the user's location, possibly obtained by consulting the user's calendar program (for example, "Dr. Smith is at the Distributed Systems seminar now; please call back after 5 o'clock"). DCS: make smaller
scripted/narrated documents
References
Phone Slave
Knuth