Executive Summary

The CSL Voice project is an effort to define and validate experimentally a prototype architecture, for which we have coined the term "Intervoice," that can incorporate live and recorded voice into office systems. The methodology is to produce working, extensible prototypes, primarily in Cedar. There has been significant progress during the last six months: a major revision to the telephone service has been designed and largely implemented; an implementation of a common database package needed to support a number of functions of the present Intervoice has been completed; a voice synthesis service has been added; and a preliminary Interlisp-D version of the workstation program was demonstrated. An extensive questionnaire was distributed to the present Etherphone users (the most obvious customers for this work), soliciting comments on the adequacy of the existing design and ideas for future extensions. Plans include the production of advanced applications that will exhibit the power of the underlying architecture, and a continued attempt to define the Intervoice architecture in a way that could accommodate components running on other hardware and in other environments.

Introduction

Voice is a vital form of interpersonal communication in the office. As a non-interactive medium (in messages and as annotations to text documents) it has yet to be fully assessed, but it appears to have considerable value. (The August 5, 1985 issue of the "Seybold Report on Office Systems" says: "Although voice applications have been relatively slow to take off in the United States, we suspect that lack of integration is the villain. Once voice functions are truly intertwined with text and data applications in an organization's information systems, they will become an irresistible and compelling adjunct to any office.") The appearance of integrated voice capabilities in Xerox workstation products is overdue.

The CSL Voice project is an effort to define and validate experimentally a prototype architecture, for which we have coined the term "Intervoice," that can incorporate live and recorded voice into office systems. Intervoice must be able to specify the role of telephone transmission and switching, workstations, voice file servers, and other network services in supporting voice communications: both real-time telephone calls and recorded voice as it is stored, manipulated, and experienced in documents. It must be an open architecture, permitting programmers to create new voice-related applications and modify existing ones without having to understand the detailed workings of the voice system and without endangering its normal operation. Ideally, this architecture could evolve into a standard defining the role of each major component, so that multiple vendors could cooperate to provide advanced voice functions in conjunction with our workstations.

The methodology is to produce working, extensible prototypes, primarily in the Cedar programming environment, that demonstrate the present state of the architecture and serve as an experimental base for improving it.

The present experimental environment includes:
Etherphones: telephone/speakerphone instruments that communicate digitized voice and control information over the Ethernet. Each includes a microcomputer and encryption hardware. Approximately forty Etherphones are in daily use within CSL.
Telephone service: a control program that provides PBX functions and manages the communications of all the other components. It runs on a dedicated server. This service and the workstation program libraries together implement the client programmer interface to the Intervoice architecture.
Voice file service: a service that can connect to Etherphones in lieu of or in addition to other conversants. It supports digital recording, playback, and flexible editing of user utterances.
Synthesized voice service: a recently-added facility described in more detail below.
Workstation programs: provide enhanced user interfaces and control over the voice capabilities. The Cedar version is called Finch, and has been in use for some time.

There has been significant progress during the last six months: a major revision to the telephone service has been designed and largely implemented; an implementation of a common database package needed to support a number of functions of the present Intervoice has been completed; a voice synthesis service has been added; and a preliminary Interlisp-D version of the workstation program was demonstrated. An extensive questionnaire was distributed to the present Etherphone users (the most obvious customers for this work), soliciting comments on the adequacy of the existing design and ideas for future extensions.

Activities

Database Facilities

The voice system requires access to a number of high-performance, shared databases that may reside on servers or personal workstations. A simple data management system, called LoganBerry, has been developed to support these needs. LoganBerry treats data as a set of untyped key:value pairs. Data is maintained in one or more logs, which can be stored on a local disk, an Alpine file server, or in stable battery-backed RAM. To prevent a database from being corrupted by processor crashes, new database entries are always appended to the end of a log file; delete operations are simply logged. Monitor locks provide the necessary concurrency control. LoganBerry databases can be backed up using the DF facilities already present in Cedar.

The basic query operations fetch an entry (or enumerate a range of entries) comprising a number of key:value pairs, given one such pair. These queries are efficiently supported through the use of B-Tree indices.
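
The log-and-index scheme described above can be illustrated with a minimal sketch in Python. It is not LoganBerry's Cedar interface: the class name, the method names, and the in-memory dictionary standing in for a B-Tree index are all assumptions made for illustration. It shows only the ideas stated in the text: entries are appended to a log, deletes are logged rather than applied in place, and an entry is fetched given one of its key:value pairs.

```python
class LogDatabase:
    """Toy append-only store; each entry is a list of (key, value) string pairs."""

    def __init__(self, indexed_keys):
        self.indexed_keys = set(indexed_keys)
        self.log = []  # records are only ever appended, never rewritten
        # Hypothetical stand-in for a B-Tree: key -> value -> set of log positions.
        self.index = {key: {} for key in self.indexed_keys}

    def _index_entry(self, pos, entry):
        for key, value in entry:
            if key in self.indexed_keys:
                self.index[key].setdefault(value, set()).add(pos)

    def _unindex_entry(self, pos):
        _, entry = self.log[pos]
        for key, value in entry:
            if key in self.indexed_keys:
                self.index[key][value].discard(pos)

    def write(self, entry):
        """Append a new entry to the end of the log and index it."""
        stored = list(entry)
        pos = len(self.log)
        self.log.append(("write", stored))
        self._index_entry(pos, stored)
        return pos

    def delete(self, key, value):
        """Log a delete for every live entry containing the given pair."""
        for pos in sorted(self.index[key].get(value, ())):
            self.log.append(("delete", pos))
            self._unindex_entry(pos)

    def fetch(self, key, value):
        """Return the live entries that contain the given key:value pair."""
        return [self.log[pos][1] for pos in sorted(self.index[key].get(value, ()))]

    def enumerate_range(self, key, low, high):
        """Return live entries whose value for key lies in [low, high], in value order."""
        results = []
        for value in sorted(self.index[key]):  # a real B-Tree would avoid this sort
            if low <= value <= high:
                results.extend(self.log[pos][1] for pos in sorted(self.index[key][value]))
        return results

    def rebuild_index(self):
        """Recreate the index by replaying the log, as one might after a crash."""
        self.index = {key: {} for key in self.indexed_keys}
        for pos, (operation, payload) in enumerate(self.log):
            if operation == "write":
                self._index_entry(pos, payload)
            else:
                self._unindex_entry(payload)
```

Because the log is never rewritten, the index in this sketch is purely derived data: discarding it and replaying the log yields the same answers, which is one plausible reading of why appending protects the database against crashes.
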
A query package layered on top of these basic operations allows one to formulate queries involving multiple keys, a variety of pattern-matching techniques, and merged databases. All LoganBerry operations can be invoked locally or via remote procedure calls (RPC). A general purpose browser permits interactive querying of LoganBerry databases.

Although the creation of LoganBerry was motivated by the needs of the voice project, its facilities have wider utility. It has been released as a Cedar library package, and has already been used to produce a database of publication references.

Speech Synthesis Services

In addition to the voice file server, which supports voice recording and playback, the Etherphone system now includes two voice synthesizers produced by different manufacturers. Each can convert arbitrary ASCII text to reasonably intelligible audio output, with control of speaking speed, pitch, and other voice characteristics. Words that do not follow usual English pronunciation rules can be specified as a sequence of phonemes.

Each synthesizer is connected to a dedicated Etherphone, called a text-to-speech server. Both servers are available to any Etherphone-equipped workstation, on a first-come, first-served, one-user-at-a-time basis. Normal Etherphone connections are used to transmit the voice from the server to the user's Etherphone. A user or program can generate speech as easily as printing a message on the display.

Uses so far have been general applications of text-to-speech in the electronic office, including: proofreading (especially comparing versions of a document when one version has no electronic form), audio reminders, program progress indicators, and error messages. The controllability of the generated speech suggests interesting future research in "audio style" for documents, in which speed, pitch, and other voice characteristics could be used to communicate italicization, boldface, or quotations.
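
The first-come, first-served, one-user-at-a-time access to the two text-to-speech servers described above might be arbitrated along the following lines. This is a sketch in Python under stated assumptions, not the Etherphone system's actual mechanism: the class and method names are hypothetical, and the report does not say whether a user who finds both servers busy is queued or simply refused. Queuing is assumed here.

```python
from collections import deque


class TextToSpeechPool:
    """Hypothetical arbiter granting each server to one user at a time."""

    def __init__(self, server_names):
        self.free = deque(server_names)  # servers nobody is using
        self.in_use = {}                 # user -> server currently held
        self.waiting = deque()           # users in arrival order (assumed policy)

    def request(self, user):
        """Return a server for this user, or None if the user must wait."""
        if user in self.in_use:
            return self.in_use[user]
        if self.free:
            server = self.free.popleft()
            self.in_use[user] = server
            return server
        if user not in self.waiting:
            self.waiting.append(user)
        return None

    def release(self, user):
        """Give up a server; hand it to the longest-waiting user, if any.

        Returns the user who received the server, or None if it went idle.
        """
        server = self.in_use.pop(user)
        if self.waiting:
            next_user = self.waiting.popleft()
            self.in_use[next_user] = server
            return next_user
        self.free.append(server)
        return None
```

With two servers in the pool, the first two requesters are served immediately and a third waits until one of them releases its server.
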
Redesign of Telephone Service

The telephone service was designed to support a wide range of advanced capabilities. Two years' operational experience with the existing design has revealed the need for an improved version, whose design and implementation have been largely completed during the last six months. The primary improvements are:

A number of special-purpose databases have been replaced by LoganBerry databases, resulting in a much simpler system structure. These include the assignments of users to Etherphones, system-wide telephone directories, user-specific telephone options, and filing information needed to manage recorded voice messages.

More of the control of the system has been implemented in terms of databases that describe the required configurations or behavior.

It has always been a goal of the Etherphone system to give the workstation controlling a telephone priority in decisions such as whether to answer a call, without affecting the reliability of the underlying telephone system when the workstation fails or behaves poorly. A revision of the control program that gives each of the participating workstation and Etherphone processors more autonomy, while preserving the needed priority, has also enabled better management of conference calls, multiple simultaneous calls, and the background use of the single voice channel to each office (allowing, for example, the playing of synthesized voice without interfering with incoming telephone calls).

Within an office setting, people often use workstations away from their own offices and telephones, or visit other offices for extended periods. The system has been extended to allow the automatic forwarding of telephone calls to the location where a user is known to be, as well as the proper caller identification for outgoing calls. This is an improvement over the forwarding capabilities of other systems, which require forwarding to be manually requested before leaving the office.
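
The location-based forwarding and caller identification described above can be sketched as lookups over a few tables. The sketch below is in Python and is illustrative only: the function names and the three dictionaries (standing in for the user-to-Etherphone assignments and location information that the report says are kept in databases) are assumptions, as is the rule that a visitor's identity takes precedence over the owner's on outgoing calls.

```python
def ring_target(callee, home_phone, current_office, phone_in_office):
    """Choose the Etherphone that should ring for a call to callee.

    home_phone:      user -> Etherphone normally assigned to that user
    current_office:  user -> office where the user is currently known to be
    phone_in_office: office -> Etherphone installed in that office
    All three tables are hypothetical stand-ins for database lookups.
    """
    office = current_office.get(callee)
    if office is not None and office in phone_in_office:
        return phone_in_office[office]  # forward to where the user is now
    return home_phone[callee]           # location unknown: ring the usual phone


def caller_identity(phone, home_phone, current_office, phone_in_office):
    """Identify who is placing an outgoing call from the given Etherphone."""
    # Assumed rule: a user known to be visiting this office is the caller.
    for user, office in current_office.items():
        if phone_in_office.get(office) == phone:
            return user
    # Otherwise fall back to the user the Etherphone is assigned to.
    for user, assigned in home_phone.items():
        if assigned == phone:
            return user
    return None
```

Neither function requires the user to request forwarding in advance; the routing follows whatever the location table currently says, which is the property the text contrasts with other systems.
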

Intervoice Progress

The Intervoice architecture that is emerging to describe the structure and functional roles of the Etherphone system components is beginning to rely heavily on the use of flexible, high-performance shared databases that are very hard to damage. The ones we have now store declarative information, described above. We will need to add the ability to store methods for execution when triggered by later events, as well as call-back information that can trigger execution in other parts of the system when the database is changed. These requirements are strikingly similar to those identified by SCL for their object server.

Plans

Plans include the implementation of "voice ropes," a design done last year to provide flexible primitives for editing voice. Using voice ropes, LoganBerry databases, and the improved telephone service, we intend first to add full voice annotation to Tioga documents, then to use the result as a component of a project integrating voice messaging and telephone conversation logging with Walnut electronic mail facilities, seeking a common paradigm spanning interactive conversations and more conventional off-line messages.

LoganBerry will also be used as the basis for an improved telephone directory service, combining personal and public directories that can be either browsed or queried.

Use of the speech synthesizer by applications programmers will be encouraged. Possible applications include audio confirmation of the number dialled as a call begins, reading electronic mail over the telephone to a remote lab member, and playing program-generated prompts and messages to callers ("X is at the Forum now, please call back after 5pm").

The definition of a voice architecture to support all these facilities will continue.

[Swinehart, Terry, Zellweger]