Managing Stored Voice in the Etherphone System
Abstract: The voice rope facilities and implementations provided by the voice manager in the Etherphone™ system support recording, editing, and playing stored voice in a distributed personal computing environment. To facilitate sharing, voice is stored on a special voice file server that is accessible via the local internet. Etherphones transfer encrypted voice to and from the voice file server over an Ethernet. The voice manager provides operations for editing a voice passage once it has been recorded. Rather than rearranging the contents of voice files when they are edited, the voice manager simply builds a data structure to represent the edited voice and stores it in a server database. This data structure, called a voice rope, consists of a list of intervals within voice files. Clients refer to voice ropes solely by reference. Typical uses include embedding references in multimedia documents. A modified style of reference counting, called interests, enables unwanted voice ropes to be garbage collected. These interests are grouped into classes and can be invalidated according to a class-specific algorithm that runs periodically.
1. Introduction
Voice is an important and widely used medium for interpersonal communication. Computers facilitate interpersonal communication through electronic mail and shared documents. Yet, our computer systems have traditionally forced us to communicate textually. A major goal of the Etherphone™ system developed at Xerox PARC was to allow voice to be incorporated into computing environments and used in much the same way as text. This paper addresses the problems associated with managing stored voice in a distributed computing environment.
The Etherphone system is intended for use in a locally distributed computing environment containing multiple workstations and programming environments, multiple networks and communication protocols, and perhaps even multiple telephone transmission and switching choices. The system is intended to be extensible in that introducing new applications, network services, workstations, networks, and other components is possible.
As with text, we want the ability to incorporate voice easily into electronic mail messages, voice-annotated documents, user interfaces, and other interactive applications. Nicholson gives a good discussion of many office applications that are made possible by treating voice as data [18]. Clients should be able to combine previously recorded voice in various ways and insert fresh voice into existing voice passages. Clients should be able to share voice as freely as they share files. Also, the system should permit programmer control over all of these functions.
The characteristics of voice, however, differ greatly from those of text. Standard telephone-quality uncompacted voice occupies 64 Kbits of storage per second of recorded voice. This is several orders of magnitude greater than the equivalent typed text. Voice also requires special devices for recording and playing it; that is, a user cannot simply type in a voice passage. More importantly, voice transmission has stringent real-time requirements. These differences dictate special methods for manipulating and sharing voice.
To this end, we have developed application-independent methods for recording, playing back, editing, and otherwise manipulating digitized voice, based on an abstraction that we call voice ropes. Section 2 presents the operations on voice ropes available to application programs in the Etherphone system.
Section 3 then discusses the design and implementation of these operations and the rationale governing various design choices. The storage requirements of recorded voice demanded that editing be accomplished in such a way that existing voice passages need not be moved, copied, or decrypted. For similar reasons, sharing of voice must not involve copying the voice passages themselves. The major technical contributions described in this paper involve the use of simple databases to:
(1) describe the results of editing operations, and
(2) provide a modified style of reference counting required to allow the automatic reclamation of obsolete voice.
Section 4 relates our experiences to date with incorporating voice into workstation applications, and section 5 reviews our design principles and how they were met. First, the following subsection gives a quick overview of the Etherphone system architecture.
1.1 The Etherphone System
Figure 1 depicts the basic components of the Etherphone system in a simple configuration [25]. Each personal workstation is associated with, but not directly attached to, a microprocessor-based telephone instrument called an Etherphone. Etherphones digitize, packetize, and encrypt telephone-quality voice and transmit it directly over an Ethernet. A voice control server provides control functions similar to a conventional business telephone system and manages the interactions between all the other components. In particular, it allows voice-carrying conversations to be established between two or more Etherphones, workstations, or servers. A voice file server, discussed further in Section 3.1, provides storage for recorded voice. The system can also include other specialized sources or sinks of voice, such as a text-to-speech server that receives text strings and returns the equivalent spoken text to the user's Etherphone.
Workstations are the key to providing enhanced user interfaces and control over the voice capabilities. We rely on the extensibility of the local programming environment—be it Cedar, Interlisp, or the Xerox Development Environment—to facilitate the integration of voice into workstation-based applications. Workstation program libraries implement the client programmer interface to the voice system.
All of the communication required for control in the voice system, such as conversation establishment, is accomplished via a remote procedure call (RPC) protocol [4]. Multiple implementations of the RPC mechanisms permit the integration of workstation programs and voice applications programmed in different environments. During the course of a conversation, RPC calls emanating from the voice control server inform participants about various activities concerning the conversation. Active parties in a conversation exchange voice using a voice transmission protocol [23].
The server software and the initial workstation software were developed in the Cedar programming environment [24]. More information on the equipment and protocols used in the Etherphone system, as well as the applications built to date, can be found in related papers [23] [25].
2. Operational Overview
The implementation of facilities for recorded voice is somewhat involved, but the actions to be performed are conceptually quite simple. In looking for an application-independent abstraction to present to application programmers, it occurred to us that many of these actions closely resembled operations normally associated with text string manipulation. The Cedar system provides a powerful text string abstraction called a rope [24]. By analogy, in the Etherphone system, we refer to sequences of stored voice samples as voice ropes. Each voice rope is defined by a unique identifier, a VRID, instead of by a memory pointer, because (unlike Cedar ropes) voice ropes are persistent objects that are meant to last a long time. To aid in sharing and to facilitate the use of voice by heterogeneous workstations, the storage for voice ropes, as well as the operations on them, are provided by a network service, the voice manager.
Clients refer to voice ropes solely by reference, that is, by their unique identifiers (VRIDs). The voice manager places no restrictions on a client's use of voice ropes. For instance, voice ropes could be used by an interactive interface to provide audio feedback. Most uses involve embedding speech in some type of document, such as an annotated manuscript, program documentation, or electronic mail. The use of such embedded references to refer to voice, video, and other diverse types of information has been termed a hypermedia system [27].
From a client's perspective, a voice-annotated document should behave as though the voice were stored directly in the document's file rather than being included by reference. For example, once a voice message is sent using electronic mail, it should not be possible for the author or another user to change the message's contents. For this reason, voice ropes are immutable. The recording and editing operations create new voice ropes; they do not modify existing ones.
2.1 Recording and playback
To record or play back a voice rope, a conversation is set up between the voice manager and an Etherphone. The main operations supported by the voice manager are as follows:
RECORD[conversation] -> VRID, requestID
Voice received by the server over the communication path defined by the given conversation is stored and assigned a unique VRID; recording continues until a subsequent STOP operation. The requestID identifies this operation in subsequent reports (see below).
PLAYBACK[conversation, VRID, interval] -> requestID
The specified interval of the voice rope is transmitted over the given conversation. An interval denotes either the entire voice rope or a time-indexed portion of it at a resolution of about 1 ms.
STOP[conversation]
Any recording or playback operations that are in progress or queued for the given conversation are immediately halted.
These operations are invoked on the voice manager using the Cedar RPC facility. The RECORD and PLAYBACK operations are performed asynchronously. That is, the remote procedure call returns after the operation has been queued by the server. Queued operations are performed in order.
The voice manager generates event reports upon the start and completion of a queued operation. The requestID returned by each invocation is used to associate reports with specific operations. In particular, the voice manager makes the following call to all participants in a conversation to inform them of the status of various requested operations concerning that conversation:
REPORT[requestID, {started | finished | flushed}]
The requested operation has been started, successfully completed, or halted by a STOP operation.
Having reports flow from server to clients is conceptually similar to Clark's upcalls and is accomplished in a similar manner [8].
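The queueing and reporting behavior described in this section can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the Cedar implementation: the class name ConversationQueue, the report callback (standing in for the REPORT RPC), and the integer requestIDs are all hypothetical, and STOP is simplified to flush queued operations without modeling an operation that is interrupted in mid-transfer.

```python
from collections import deque
from itertools import count


class ConversationQueue:
    """Toy model of one conversation's operation queue on the voice manager."""

    def __init__(self, report):
        self._report = report      # callback standing in for the REPORT RPC
        self._pending = deque()    # queued (request_id, description) pairs
        self._ids = count(1)       # hypothetical requestID generator

    def enqueue(self, description):
        """Queue an operation and return its requestID immediately."""
        request_id = next(self._ids)
        self._pending.append((request_id, description))
        return request_id

    def run_next(self):
        """Perform the oldest queued operation, reporting start and finish."""
        request_id, _ = self._pending.popleft()
        self._report(request_id, "started")
        # ... voice would be recorded or played here ...
        self._report(request_id, "finished")

    def stop(self):
        """Halt everything still queued, reporting each as flushed."""
        while self._pending:
            request_id, _ = self._pending.popleft()
            self._report(request_id, "flushed")


reports = []
queue = ConversationQueue(lambda rid, status: reports.append((rid, status)))
first = queue.enqueue("PLAYBACK VR1")
second = queue.enqueue("PLAYBACK VR2")
queue.run_next()   # the first operation runs to completion
queue.stop()       # the second operation is flushed before it starts
```

Because each call returns its requestID as soon as the operation is queued, a client can issue several requests in a row and later match incoming reports to them.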
2.2 Editing support
Once recorded, voice ropes can be used in editing operations to produce new, immutable voice ropes. Several of the operations on Cedar ropes, such as producing substrings or concatenating existing strings, are directly applicable to voice. Their transliteration for voice ropes yields these functions:
CONCATENATE[VRID1, VRID2, ...] -> VRID
Produces a new voice rope that is the concatenation of the given voice ropes.
SUBSTRING[VRID1, interval] -> VRID
Produces a new voice rope consisting of the specified interval of VRID1.
REPLACE[VRID1, interval, VRID2] -> VRID
Produces a new voice rope that is obtained by replacing the particular interval of VRID1 with VRID2. This is a composition of the CONCATENATE and SUBSTRING operations, provided for efficiency and convenience.
LENGTH[VRID] -> length
Returns the length of the given voice rope in milliseconds.
One additional operation peculiar to voice ropes was provided to aid in editing:
DESCRIBE[VRID] -> intervals
Returns a list of time intervals that denote the non-silent talkspurts of the given voice rope. A talkspurt is defined to be any sequence of voice samples separated by some minimum amount of silence.
These operations, available via RPC calls to the voice manager, are intended for use by programmers. Applications that handle voice must employ these operations to construct the facilities visible to the end user.
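To make the talkspurt definition used by DESCRIBE concrete, the following Python sketch derives talkspurt intervals from a sequence of per-millisecond energy values. It is an illustration only: the text does not say how the Etherphone system detects silence, so the energy representation and the threshold and min_silence parameters are assumptions introduced here.

```python
def describe(energies, threshold, min_silence):
    """Return (start, length) talkspurts from per-millisecond energy values.

    Silent gaps shorter than min_silence are absorbed into the surrounding
    talkspurt; threshold and min_silence are hypothetical tuning parameters.
    """
    spurts = []
    start = None       # start of the talkspurt being built, if any
    last_loud = None   # index of the most recent non-silent sample
    for i, energy in enumerate(energies):
        if energy < threshold:
            continue
        # A long enough silent gap closes the current talkspurt.
        if start is not None and i - last_loud - 1 >= min_silence:
            spurts.append((start, last_loud - start + 1))
            start = None
        if start is None:
            start = i
        last_loud = i
    if start is not None:
        spurts.append((start, last_loud - start + 1))
    return spurts


# Loud at 0-2, a 1 ms gap, loud at 4-5, a 4 ms gap, loud at 10-11.
energies = [9, 9, 9, 0, 9, 9, 0, 0, 0, 0, 9, 9]
talkspurts = describe(energies, threshold=5, min_silence=3)
```

With a minimum silence of 3 ms, the 1 ms gap is absorbed and the 4 ms gap separates two talkspurts.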
2.3 Interests
The voice manager also provides operations for managing voice references. These operations provide a sort of directory for voice ropes. Although simple applications can use this directory as their means for naming and locating voice ropes, that is not its primary purpose. As with any storage system, unreferenced storage space should be reclaimed. With voice, or other voluminous media such as video, the need is particularly acute. In the Etherphone system, client code must assist with garbage collection by using the directory operations to express an interest in each referenced voice rope. We list the client operations here, deferring to Section 3.4 a discussion of the rationale for this approach and a description of the underlying implementation. The interest operations are:
RETAIN[VRID, class, interest]
Registers an interest of the particular class in the given voice rope. The interest identifies a reference to the voice rope within the class. This operation is idempotent; successive calls with the same arguments register at most one interest in the given voice rope.
FORGET[VRID, class, interest]
Deregisters the specified interest.
LOOKUP[class, interest] -> LIST OF VRID
Returns the unordered list of voice ropes associated with a particular interest.
Both interest and class are arbitrary text string values. The form of the interest value is generally class-specific; moreover, clients are responsible for generating unique values for different interests within a class.
The class identifies the way in which the voice rope is being used by a particular application. For example, we use the class "FileAnnotation" to indicate that a document stored in a named file is annotated by a set of utterances; the interest field is the file name. The class "Message" indicates that the reference is a part of an electronic mail message incorporating recorded voice, and the interest is the unique postmark supplied by the message system.
A combination of client workstation software and automatic collection methods must hide these interest operations from actual users. Client applications must always register interests in order to ensure retention of voice ropes to which they hold references. For some classes, clients must also explicitly FORGET their interests; for others, such as the "FileAnnotation" class, automatic methods described in Section 3.4 make this unnecessary.
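The client-visible behavior of the three interest operations can be sketched in Python as follows. This is an in-memory stand-in written for illustration, not the voice manager's database-backed implementation; the class name InterestDirectory and the file name used in the example are hypothetical, and lookup sorts its result only to make the example deterministic (the operation itself is specified as returning an unordered list).

```python
class InterestDirectory:
    """In-memory stand-in for the interest operations of the voice manager."""

    def __init__(self):
        self._entries = set()   # each entry is (vrid, class_name, interest)

    def retain(self, vrid, class_name, interest):
        # Adding to a set is idempotent: repeated calls leave a single entry.
        self._entries.add((vrid, class_name, interest))

    def forget(self, vrid, class_name, interest):
        # discard() ignores missing entries, so a retried call is harmless.
        self._entries.discard((vrid, class_name, interest))

    def lookup(self, class_name, interest):
        return sorted(v for v, c, i in self._entries
                      if c == class_name and i == interest)


directory = InterestDirectory()
directory.retain("VR7", "FileAnnotation", "/docs/report.tioga")
directory.retain("VR7", "FileAnnotation", "/docs/report.tioga")  # a retry
directory.retain("VR9", "FileAnnotation", "/docs/report.tioga")
annotations = directory.lookup("FileAnnotation", "/docs/report.tioga")
directory.forget("VR7", "FileAnnotation", "/docs/report.tioga")
remaining = directory.lookup("FileAnnotation", "/docs/report.tioga")
```

Representing each interest as a full (VRID, class, interest) triple, rather than as a counter, is what allows a repeated RETAIN to be harmless.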
3. Detailed Design Decisions
The voice manager uses a voice file server to store voice data. This server provides RECORD, PLAYBACK, and STOP operations that are semantically similar to those described in section 2.1, but operate on voice files. The more complex voice rope editing and directory structures have been implemented as separate, higher-level components, in part to make the voice facilities independent of the choice of the underlying file storage. Voice ropes are actually made up of pieces of one or more voice files.
The implementations of both voice rope editing and interest management depend on a simple but robust database facility that we developed for these purposes. Although the operations for editing voice ropes have been patterned after the Cedar Rope package, a different underlying implementation was necessitated by the disparate characteristics of voice and text. Specifically, editing voice by actually copying the bytes, as is sometimes done for Cedar's ropes, is expensive since voice is voluminous. Thus, rather than rearranging the contents of voice files to edit them, the voice manager simply builds a data structure using the facilities of the database package. Interests are registered in a similar database.
A garbage collector uses interests to reclaim voice storage that is no longer needed. Devising techniques for automatically collecting garbage in a distributed, heterogeneous environment was one of the most difficult problems faced in the design of the voice manager.
The components of the voice manager are logically layered as in Figure 2. Each is discussed in more detail in the following sections.
Figure 2. Voice Storage Components
3.1 Voice file server
A voice file server differs from normal file servers [22] in that it must support the real-time requirements of voice. In particular, it must be able to maintain a sustained transfer rate of 64 Kbits/sec, and it should be able to support several such transfers simultaneously. There is no inherent reason why a general-purpose file server could not be extended to support these stringent real-time requirements. However, the file systems we had available at the outset of this project, having been optimized for different styles of access, could not, in fact, support them. At present, our file server is a special-purpose system built on top of Cedar's standard file system [24].
We will not describe the workings of the Etherphone voice file server in any detail. It implements the operations needed to allocate, record and play back voice files that are named by unique identifiers (VFIDs). Tables locating the boundaries between sound and silence are stored along with the voice to permit efficient execution of the DESCRIBE operation.
Etherphones encrypt voice using DES electronic-codebook (ECB) encryption [17] as it is transmitted. The voice file server simply stores the voice in its encrypted form. This adds a requirement in the voice rope component to manage encryption keys. The stored voice is never decrypted except by an Etherphone when being played back.
3.2 Database facilities
Voice ropes and interests could be implemented using specialized data structures. However, we realized that the facilities of a database would serve these needs well, and the performance requirements were not particularly stringent. The objects representing voice ropes are immutable; multi-object atomic updates are not required; and the usage of the system is expected to generate relatively infrequent changes to small numbers of voice segments, each several seconds or even minutes in duration. An original prototype voice rope system was built using an existing entity-relation database system (Cypress) in the Cedar environment [7]. Cypress provides transaction semantics for reliability and data consistency, typed data, facilities for constructing elaborate queries, and adequate steady-state performance. However, in the current Cypress implementation, the overhead of opening and committing transactions is too great to permit sharing a database among many client programs that are making frequent updates. For these reasons, a simple, robust database representation that is particularly well-suited to voice ropes was developed.
A database entry is a simple sequence of key/value pairs, each represented as a text string. The database system stores each entry in a write-ahead log [11]. Unlike most database systems in which the data is logged only until it can be committed and written to a more permanent location, the log is itself the permanent source of data. Once logged, the data is never moved. To allow rapid queries or enumeration of database entries, B-Tree indices [2] are built to map the values of one or more keys to the corresponding locations in the log file. This log-based database package was inspired by similar methods found in the Walnut electronic mail system [9] and recommended by Lampson [13].
In addition to processing queries, the database system supports atomic insert, replace, and delete operations for single entries. If a crash occurs while writing a log entry, the incomplete entry is discovered upon recovery. If a crash occurs while updating a B-Tree index, the log must be replayed to build the index from scratch, which is an expensive operation. However, since an untimely hardware crash is a rare event, rebuilding an index when necessary seems less costly than building and running a transaction system to update the log and indices atomically.
Since no long-term locks are maintained, multiple clients can easily interleave queries and updates. The append-only log leads to predictable and consistent performance (except during crash recovery), with no lengthy pauses due to transaction conflicts. A compaction operation is provided that enumerates the current entries (in the order specified by the primary key) to produce a new, minimum-length log file; the old one can be saved on archival storage, then deleted.
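The log-plus-index organization can be sketched in a few lines of Python. The sketch makes several assumptions that are not part of the system described here: entries are encoded as one JSON object per line, a Python dictionary stands in for the B-Tree index, only insert (which also serves as replace) is shown, and an incomplete trailing entry found during recovery is simply truncated. The class name LogDatabase is hypothetical.

```python
import io
import json


class LogDatabase:
    """Append-only log of entries with an in-memory index on one key."""

    def __init__(self, log, key):
        self._log = log      # seekable binary file-like object
        self._key = key      # name of the indexed key
        self._index = {}     # key value -> byte offset of the latest entry
        self.rebuild_index()

    def insert(self, entry):
        """Append the entry to the log, then point the index at it."""
        offset = self._log.seek(0, io.SEEK_END)
        self._log.write(json.dumps(entry).encode() + b"\n")
        self._index[entry[self._key]] = offset

    def lookup(self, value):
        """Follow the index into the log and decode the entry found there."""
        self._log.seek(self._index[value])
        return json.loads(self._log.readline())

    def rebuild_index(self):
        """Replay the whole log, as after a crash during an index update."""
        self._index.clear()
        self._log.seek(0)
        offset = 0
        for line in self._log:
            try:
                entry = json.loads(line)
            except ValueError:
                # Incomplete trailing entry from an interrupted write:
                # drop it (an assumption of this sketch) and stop.
                self._log.truncate(offset)
                break
            self._index[entry[self._key]] = offset
            offset += len(line)


log = io.BytesIO()
db = LogDatabase(log, key="VRID")
db.insert({"VRID": "VR1", "VFID": "VF1", "start": "0", "length": "4000"})
db.insert({"VRID": "VR2", "VFID": "VF2", "start": "500", "length": "2000"})
recovered = LogDatabase(log, key="VRID")   # index rebuilt purely from the log
```

Since the log is the only permanent copy of the data, losing an index costs a replay of the log but never loses entries.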
3.3 Voice rope structure
The data structure representing a voice rope consists of a list of [VFID, interval] pairs. Simple voice ropes consist of a single interval within a single voice file, often the whole file. More complex voice ropes can be constructed using the editing operations presented in section 2.2. For example, suppose two simple voice ropes, VR1 and VR2, exist with the following structures:
VR1 = <VFID: VF1, interval: [start: 0, length: 4000]>
VR2 = <VFID: VF2, interval: [start: 500, length: 2000]>
Then the operation
REPLACE[base: VR1, interval: [start: 1000, length: 1000], with: VR2]
produces a new voice rope, VR3, with the structure:
VR3 = <VFID: VF1, interval: [start: 0, length: 1000], VFID: VF2, interval: [start: 500, length: 2000], VFID: VF1, interval: [start: 2000, length: 2000]>
as depicted in Figure 3.
Figure 3. Structure of VR3 after REPLACE operation
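The REPLACE example can be reproduced with a short Python sketch of the flat representation. Here a voice rope is modeled as a list of (VFID, start, length) tuples; that encoding and the function names are assumptions made for illustration, and each function returns a new list so that existing ropes are never modified.

```python
def substring(rope, start, length):
    """Return the pieces of rope covering [start, start + length)."""
    pieces = []
    end = start + length
    position = 0   # offset of the current piece within the rope
    for vfid, piece_start, piece_length in rope:
        lo = max(start, position)
        hi = min(end, position + piece_length)
        if lo < hi:
            pieces.append((vfid, piece_start + lo - position, hi - lo))
        position += piece_length
    return pieces


def length_of(rope):
    """Total length of the rope in milliseconds."""
    return sum(piece_length for _, _, piece_length in rope)


def concatenate(*ropes):
    """Return a new flat rope containing the pieces of each rope in order."""
    return [piece for rope in ropes for piece in rope]


def replace(base, start, length, replacement):
    """Compose SUBSTRING and CONCATENATE, as REPLACE is described."""
    head = substring(base, 0, start)
    tail_start = start + length
    tail = substring(base, tail_start, length_of(base) - tail_start)
    return concatenate(head, replacement, tail)


vr1 = [("VF1", 0, 4000)]
vr2 = [("VF2", 500, 2000)]
vr3 = replace(vr1, 1000, 1000, vr2)
```

The result lists three intervals, two in VF1 and one in VF2, and refers only to voice files, never to VR1 or VR2 themselves.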
To record a new voice rope, the voice manager calls on the voice file server to allocate and then record a voice file. Once recording completes, a simple voice rope is added to the voice rope database to represent the complete voice file just recorded. A database index allows voice rope structures to be retrieved efficiently by VRID; an index is also maintained on VFIDs, which is useful in garbage collection.
When playing a voice rope, the voice manager retrieves the voice rope's structure from the database, distributes the encryption keys of the various intervals to the parties participating in the conversation, and calls upon the voice file server to play the intervals of the voice rope in the appropriate order. Requesting the playback of several intervals in succession relies on the asynchronous nature of the voice file server operations to ensure that gaps do not get inserted between the intervals. Secure RPC [5] is used to distribute encryption keys safely.
The structure of voice ropes is kept "flat" to optimize playback. By having each voice rope refer directly to voice files, only a single database access is required to determine the voice rope's complete structure. An alternative design, more closely modeled on the Cedar Rope abstraction, would store complex voice ropes as intervals of other voice ropes. In such a design, a voice rope would conceptually be the root of a tree of other voice ropes with intervals of voice files at the leaves of the tree. This alternative design would reduce the work associated with each editing operation, but would increase the number of database accesses required to play a voice rope. The flat design was chosen because it improves playback behavior, and, in practice, playback is much more frequent than editing. Moreover, it yields simpler and more compact data structures when used (as intended) to represent small numbers of coarse-grained edits to voice.
Note that the actual voice is neither moved nor copied once recorded in a voice file, even during editing. The voice also is never decrypted by the voice file server. The encryption keys for the various intervals that compose a voice rope are stored in the database along with the voice rope's structure. Independently enciphering small blocks of voice using ECB encryption [17] ensures that voice can be edited on millisecond-resolution boundaries while remaining encrypted.
3.4 Storage reclamation
This section presents the rationale for the interest operations of section 2.3 and describes their implementation. These operations were included primarily to permit automatic reclamation of storage for voice ropes and their associated voice files.
Once all voice ropes that reference a given voice file have been deleted, no voice rope will ever again refer to that voice file. The voice file can then be deleted as well. This condition is easily determined by a database query. The more difficult problem is deciding when voice ropes themselves can be reclaimed.
From the client standpoint, the most straightforward method for garbage collecting voice ropes would be periodically to examine all of the clients' storage for references to voice ropes, then to collect any unreferenced ropes in a conventional sweep pass. Unfortunately, this sort of distributed garbage collector would be impossible to implement in our open, relatively heterogeneous, environment. We do not wish to restrict the uses clients make of voice ropes, or how and where clients store VRIDs.
A common alternative is to provide a reference counting scheme in which counters are used to determine the number of clients interested in a particular object. When an object's counter goes to zero, the object's storage can be reclaimed. The burden is placed on clients to increment and decrement the counts for the objects that they are using.
The use of standard reference counting presents formidable problems in a distributed environment. Reference counts cannot be managed reliably unless an atomic transaction spans the use of the reference and the reference count operation. Without transactions, if the server or client fails in the process of incrementing or decrementing a reference count, the client may be left in an uncertain state regarding the outcome of the operation. In particular, client failure-recovery procedures might incorrectly repeat a reference count operation. Furthermore, it is not always possible to arrange for a reference count to be decremented (such as when a reference-containing file residing on an uncooperative server is deleted). Finally, reference counts are anonymous, giving no help in locating erroneous references.
Interests were designed to remedy these shortcomings associated with simple reference counts. In a way, interests represent a return to the full-scan method first proposed: since an examination of the entire environment is not possible, clients are required to record their voice rope interests in a known place. The interests serve as proxies for the actual references for reclamation purposes. Interests address two problems: how to retain voice ropes for which references exist, and how to determine when it is permissible to reclaim them.
3.4.1 Retention of voice ropes
The use of interests to retain voice ropes is straightforward. Calling RETAIN adds an entry to the system's interest database (if the entry is not already there), recording the supplied VRID, class, and interest values along with the user's identification. The interest database includes indices permitting queries based on any of these attributes. Since a given entry appears in the database at most once, the RETAIN operation is idempotent. It can be safely retried in case of failures or uncertainty.
The information stored in the interest database is sufficient to allow either human administrators or client applications to determine whether or not an interest is still valid.
3.4.2 Loss of interest
Determining when to invalidate interest entries is more involved. Some client programs can determine both when to RETAIN an interest and when to FORGET it. One example is the electronic mail system that implements the "Message" class for which the interest design was first developed. When the mail system deletes an instance of a text message that contains voice references, it issues a FORGET for all of the associated VRIDs. Another example is the "SysNoises" class, a set of useful recorded announcements and sounds that system administrators manage manually. The FORGET implementation simply deletes the associated interest entry or entries from the interest database, ignoring requests to delete nonexistent entries to ensure that FORGET is also an idempotent operation. When every interest for a voice rope has been forgotten, the voice rope becomes eligible for deletion.
As we discovered when we began including voice ropes as annotations embedded in ordinary structured text documents, it is not always easy to arrange for clients to issue FORGET actions at the necessary times. Consider the following scenario. A user records a voice rope and embeds a reference to it in a document; an interest in the voice rope is then registered for the document. The user then copies this document from his workstation to a public file server and announces its existence in a message to interested parties. Several months later, he deletes the file, without remembering that it had voice annotations. Unless further actions are taken, the interest, and hence the referenced voice rope, will never be reclaimed.
Expecting an arbitrary file server to delete interests is not really reasonable. In the above scenario, one could argue that the file server should have issued the necessary FORGET operation when the file was deleted. However, this implies that the software running on the file server could or should be modified to recognize the existence of voice in files and to take appropriate action. Many of the file servers in a typical network environment (including ours) cannot be so modified, either because they are old and written in some obscure programming language, or because they were purchased from outside vendors and the source code is not available.
In this case, although interests cannot be explicitly deleted by the agents whose actions invalidate them, it is possible to determine automatically when a particular interest is no longer valid. We assume that a knowledgeable workstation client program will issue an operation like
RETAIN[vrID: VRIDi, class: "FileAnnotation", interest: "annotatedFile"]
for each voice rope VRIDi referred to in a file as the file is moved from temporary workstation storage to the file named "annotatedFile" on a public file server. At any later time, standard directory operations can be used to determine whether that specific named instance of the file still exists; if not, the associated interest is no longer valid and may be deleted.
As another example, a "Timeout" class could record an expiration time as its interest value. The interest is not valid after that time.
3.4.3 Garbage collection
It remains only to devise methods for automatically locating and removing outdated interests. The implementor of any interest class may register a procedure of the following type with the voice manager:
GARBAGE[VRID, interest] -> Yes/No
Determines in a class-specific way whether the given interest has become garbage, that is, whether it no longer applies to the particular voice rope.
For example, for the class "FileAnnotation", this procedure returns Yes if and only if the file instance identified by the interest parameter no longer exists.
An interest verifier periodically enumerates the database of interests and calls the class-specific GARBAGE procedure for each interest. If the procedure returns Yes, then the verifier calls FORGET[VRID, class, interest] to delete the interest from the database.
A garbage collector for voice ropes also runs periodically. For each voice rope in the database it (1) deletes the voice rope if no interests exist that reference it, and (2) deletes voice files used by the voice rope if they are no longer a part of any other voice rope. This process refuses to collect voice ropes that are too young in order to prevent it from collecting a newly created voice rope before a client has the opportunity to express an interest in it. This method will find all unreferenced voice ropes, including those for which no client ever expressed an interest. Note that, unlike a mark-and-sweep style garbage collector, these algorithms can be safely executed while the system is running and need not complete a full pass through the database in order to perform useful work.
In summary, garbage collection takes place on three levels. The voice manager deletes voice files when they are no longer referenced by voice ropes. Voice ropes are deleted if no interests exist for them. Interests are either explicitly forgotten by client applications or automatically deleted based on a class-specific test for validity.
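The three levels of collection can be sketched together in Python. The sketch takes a Yes answer from a class's GARBAGE procedure to mean that the interest is obsolete, as the verifier's subsequent FORGET call implies. Everything else is an assumption made for illustration: interests are (VRID, class, interest) tuples, voice ropes are lists of (VFID, start, length) tuples, time is a plain number, and the file names, creation times, and min_age value are invented.

```python
def verify_interests(interests, garbage_procedures):
    """Drop every interest whose class-specific procedure answers Yes."""
    return {
        (vrid, cls, interest)
        for vrid, cls, interest in interests
        if not (cls in garbage_procedures
                and garbage_procedures[cls](vrid, interest))
    }


def collect(ropes, interests, created, now, min_age):
    """Delete unreferenced voice ropes, then unreferenced voice files.

    ropes maps VRID -> list of (VFID, start, length); created maps
    VRID -> creation time. Ropes younger than min_age are always kept.
    """
    retained = {vrid for vrid, _, _ in interests}
    kept = {
        vrid: pieces
        for vrid, pieces in ropes.items()
        if vrid in retained or now - created[vrid] < min_age
    }
    live_files = {vfid for pieces in kept.values() for vfid, _, _ in pieces}
    all_files = {vfid for pieces in ropes.values() for vfid, _, _ in pieces}
    return kept, all_files - live_files


existing_files = {"/docs/a.tioga"}   # hypothetical file server contents
procedures = {
    "FileAnnotation": lambda vrid, interest: interest not in existing_files,
    "Timeout": lambda vrid, interest: float(interest) < 1000.0,  # now = 1000
}
interests = {
    ("VR1", "FileAnnotation", "/docs/a.tioga"),
    ("VR2", "FileAnnotation", "/docs/deleted.tioga"),
    ("VR3", "Timeout", "500"),
}
ropes = {
    "VR1": [("VF1", 0, 4000)],
    "VR2": [("VF2", 500, 2000)],
    "VR3": [("VF1", 0, 1000), ("VF2", 500, 2000)],
    "VR4": [("VF3", 0, 100)],
}
created = {"VR1": 0, "VR2": 0, "VR3": 0, "VR4": 990}
valid = verify_interests(interests, procedures)
kept, deleted_files = collect(ropes, valid, created, now=1000, min_age=60)
```

In this example VR4 survives despite having no interest because it is younger than min_age, and VF2 is reclaimed once VR2 and VR3, the only ropes that refer to it, are gone.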
4. Experience and Evaluation
Approximately 50 Etherphones are in daily use in the Computer Science Laboratory. Our current voice file server runs on a Dorado [12] with a 300 Mbyte local disk. Thus, it has the capacity to store over 7 hours of recorded voice; the actual storage capacity depends on the amount of suppressed silence. Most of our user-level applications to date have been created in the Cedar environment [24], although limited functions have been provided for Interlisp and for stand-alone Etherphones. We have had a voice mail system running for over two years and a prototype voice editor for about 8 months.
4.1 Voice-annotated documents
Manipulating stored voice solely by textual references, besides allowing efficient sharing and resource management, makes it easy to integrate voice into documents. For example, we were able to build a local voice mail system without changing the mail transport protocols or servers. Also, annotated documents can be stored on conventional file servers that are not aware that the documents logically contain voice.
Significant performance benefits accrue from having documents refer to voice that is stored remotely. Although most requests to record or play back voice ropes are initiated from a workstation, the voice data is never received by the workstation; instead, it is transmitted directly to the associated Etherphone.
These techniques for sharing, however, require clients to have high-bandwidth network access to the voice manager. Transferring voice-annotated documents between remote sites would require special mechanisms such as DARPA's experimental multimedia mail protocol [19].
4.2 Editing voice
We have gained considerable experience with the voice manager by building a voice editing system in Cedar. Figure 4 displays a document containing voice annotations, and just below it a visual representation of one of the voice annotations that has been opened up for editing. The bar patterns in the visual representation indicate with solid bars the contiguous intervals (talkspurts) of voice or other sounds, separated by white regions that represent periods of silence [1]. Given a voice rope representing the entire annotation, the DESCRIBE operation is sufficient for generating such a display.
The small set of editing operations provided by the voice manager has proven to be a sufficient base on which to build a complex voice editor and a dictation machine. However, to reduce traffic to the voice manager, the Cedar voice editor maintains its own data structures to represent the edited voice temporarily. That is, the voice editor ended up replicating much of the functionality of the voice manager, something we were trying to avoid. Only when a user elects to save the edited voice passage does the voice manager get called to perform the necessary operations. Given this arrangement, it may have been better to let clients simply pass the voice manager a complex voice rope that it could store in its database.
Experience indicates that editing a voice passage invariably produces a set of "temporary" voice ropes that are used in the construction of the finished result. These objects are eventually collected by the garbage collector, so they do not present much of a problem except that their creation requires seemingly unnecessary work of the voice manager. To alleviate the problem somewhat, the voice manager's interface was changed slightly so that an interval could be given for any voice rope in any operation. This substantially reduced the voice editor's use of the SUBSTRING operation.
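The effect of accepting an interval with every voice rope argument can be sketched as follows. This is an illustration under stated assumptions rather than the voice manager's interface: a voice rope is modeled as an immutable tuple of segments, offsets are in milliseconds, and the names Segment, slice_rope, and concatenate are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    file_id: str   # voice file holding the samples
    start: int     # offset within the file, in milliseconds
    length: int    # duration, in milliseconds


Rope = Tuple[Segment, ...]              # immutable list of file intervals
Interval = Optional[Tuple[int, int]]    # (start, length); None means whole rope


def rope_length(rope: Rope) -> int:
    return sum(seg.length for seg in rope)


def slice_rope(rope: Rope, interval: Interval) -> Rope:
    """Return the segments covering `interval`, without copying any voice."""
    if interval is None:
        return rope
    start, length = interval
    end = start + length
    if start < 0 or length < 0 or end > rope_length(rope):
        raise ValueError("interval lies outside the voice rope")
    out: List[Segment] = []
    pos = 0
    for seg in rope:
        seg_end = pos + seg.length
        lo, hi = max(start, pos), min(end, seg_end)
        if lo < hi:
            out.append(Segment(seg.file_id, seg.start + (lo - pos), hi - lo))
        pos = seg_end
    return tuple(out)


def concatenate(*parts: Tuple[Rope, Interval]) -> Rope:
    """Join several (rope, interval) pairs into one new rope in a single call."""
    result: List[Segment] = []
    for rope, interval in parts:
        result.extend(slice_rope(rope, interval))
    return tuple(result)
```

With this shape, a client that wants the middle of one rope followed by the end of another makes a single call and creates one new rope, instead of first creating two temporary substrings.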
Event reporting is important in allowing the voice editor to coordinate its visual feedback with the activities of the voice file server. In particular, the voice editor moves a cursor along the screen as a voice rope is being played (the gray marker below the word "score" in the voice displayed in Figure 4). A report indicating that the playback of a particular voice passage has started or finished is essential to synchronize the movement of the cursor with the transmission of voice data.
Although the voice file server writes files on disk so that 1-second segments can be continuously transferred, clients are allowed to edit voice ropes on 1-millisecond boundaries. The file server could not possibly play back a voice rope in real time if it had to perform a disk seek every millisecond. Fortunately, users of the voice editor are encouraged to insert, delete, and rearrange voice passages at the granularity of a sentence or phrase rather than trying to modify individual words or phonemes [1]. Thus, in practice, one rarely sees segments of a voice rope that are less than several seconds in length.
4.3 Interests
The notion of grouping interests into classes and providing class-specific garbage collection algorithms is a useful and workable concept. However, we are still grappling with the details of how best to use these mechanisms. We have found several interest classes to be useful in Cedar.
The Cedar mail system automatically registers and deregisters interests of class "Message" as voice messages are saved and deleted by users. In addition, a "Timeout" class has been used to retract an interest automatically after a certain amount of time. For instance, when sending a voice message, a timeout can be set by the sender that is long enough to give the recipients a chance to receive the message and register their own interests if so desired. Of course, problems can arise if a recipient is on vacation for a period of time longer than the timeout. For this reason, we have a means of archiving voice files before deleting them from the server.
For annotated documents in Cedar, the workstation software detects when a file is copied from the local disk to a public file server; it then automatically registers the appropriate "FileAnnotation" interests for the public file. Having workstation software automatically register interests as a file is copied to a file server works remarkably well. However, some important operations are not covered by this approach: renaming a file on a file server or copying files between two file servers. We see no way to detect such operations except by modifying file server software.
We have defined the "FileAnnotation" interest class such that its interest represents a publicly stored file name including the version number. With this scheme, interests must be reregistered for each new version of the file, that is, whenever a file is written to a public server. Unfortunately, the times that people want to annotate documents are precisely those times when the document is being updated often, so many interests are registered repeatedly. We rely on the garbage collector to get rid of old interests. An alternative would be to register a file without a version number, but that causes minor problems if voice is deleted from the file but the file itself remains in existence.
4.4 Reliability
The voice file server, voice manager, and voice control server were implemented so that they could run on separate physical processors. That is, they all communicate among themselves and with voice clients using RPC. In practice, we run all three on the same Dorado. There is little to be gained by running them separately, since the voice file server cannot record or play back voice files if the control server is down. Similarly, the voice manager cannot record or play back voice ropes if the voice file server is down. For all practical purposes, voice also cannot be edited if the voice file server is down, because users invariably need to listen to the voice passages that they are editing.
Thus, availability is not adversely affected by having the voice manager and file server colocated with the control server. If this server crashes or is otherwise unavailable, then no operations can be performed on stored voice. For the most part, this is simply an inconvenience to users in the same way that unavailability of conventional file servers is an inconvenience. In Cedar, the file servers containing the important system files, fonts, and documentation are replicated to improve their availability. We have not found it necessary to pay the cost to provide a highly-available voice file server.
The one exception to this concerns voice interests. It is often the case that clients wish to register or deregister interests in voice ropes independently of playing the referenced voice. For example, an interest of type "FileAnnotation" is registered when a voice-annotated document is copied from a personal workstation to a public file server. A user should not be prevented from performing such a copy simply because the voice manager is unavailable. We have also observed that the interests for voice messages fail to get properly registered or deregistered if a person saves or deletes a voice message while the voice server is down. This has led us to contemplate writing a program that enumerates a person's mail database and checks that all voice messages have properly registered interests. The better solution is to make the voice interest database highly-available. Rather than fully replicating the database, we are planning to provide a mechanism whereby operations to RETAIN or FORGET a voice interest are logged locally by a user's workstation if the voice server is unavailable; the operations in this log will be retried when the workstation detects that the server is reachable.
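The planned logging mechanism might look roughly like the sketch below. It is speculative, since the mechanism is described only as a plan: the InterestClient class, the ServerUnavailable exception, and the send callback (standing in for the RPC to the voice manager) are all hypothetical, and the log is kept in memory, whereas a workstation would presumably keep it on its local disk.

```python
from collections import deque
from typing import Callable, Deque, Tuple

# (operation name, voice rope ID, interest class, interest value)
Operation = Tuple[str, str, str, str]


class ServerUnavailable(Exception):
    """Raised by the (hypothetical) transport when the voice server is down."""


class InterestClient:
    """Workstation-side wrapper that logs RETAIN/FORGET calls it cannot deliver."""

    def __init__(self, send: Callable[[Operation], None]) -> None:
        self._send = send                         # stand-in for the RPC call
        self._pending: Deque[Operation] = deque()

    def retain(self, vrid: str, class_name: str, interest: str) -> None:
        self._submit(("RETAIN", vrid, class_name, interest))

    def forget(self, vrid: str, class_name: str, interest: str) -> None:
        self._submit(("FORGET", vrid, class_name, interest))

    def pending_count(self) -> int:
        return len(self._pending)

    def _submit(self, op: Operation) -> None:
        # Always append first so that operations reach the server in order.
        self._pending.append(op)
        self.retry_pending()

    def retry_pending(self) -> int:
        """Replay logged operations, oldest first; stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ServerUnavailable:
                break
            self._pending.popleft()
            sent += 1
        return sent
```

A workstation would call retry_pending when it detects that the server is reachable again. Replaying in order matters because a FORGET that overtakes an earlier RETAIN for the same interest would leave the database in the wrong state.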
4.5 Performance
(Performance measurements of the system are not available for publication at this time, but we should be able to include some in the final paper. We are convinced that the time performance of both the voice file server and the voice rope facilities exceeds the requirements of intended applications, so there will be little to learn there. What we need to measure is the way people use these facilities: how many edits they make, at how fine a grain, etc. We also need to estimate and/or measure the cost of database log compaction, once the system is in heavy use. Space requirements for archiving actual digitized voice are also an issue.)
4.6 Related work
Several companies provide speech message systems that can be accessed from standard telephones; one of the earliest examples of this type of system was IBM's experimental Speech Filing System, which was operational in 1975 [10]. Certainly the Etherphone system's facilities can be accessed from telephones, but that was not the driving application. We were interested in allowing voice to be integrated easily into a user's existing means of digital communications, rather than forcing users to learn a completely new system. The Sydis Information Manager provides workstation control over the recording, editing, and playing of voice as in the Etherphone system, but requires special workstations called VoiceStations [18]. Ruiz also developed a prototype voice system that integrates voice and data into some simple workstation applications; however, he did not address the important issues of sharing stored voice [20].
Maxemchuk's speech storage system [15] provided many of the same facilities for recording, editing, and playing voice as our voice file server. (Actually, he provided much more control over the playback of voice than we do, such as the ability to vary playback speeds or adjust silence intervals.) Also, the division of function between a main computer and a storage computer is quite similar to the separation between our voice manager and voice file server. However, Maxemchuk's system edits voice using divide and join operations that modify the control sectors of stored voice messages. Our technique of building data structures that reference voice files better supports sharing, by making voice ropes immutable, and simplifies the requirements placed on the voice file server. For instance, our techniques are very amenable to write-once storage technologies such as optical disks.
Version Storage in the Swallow system [21] has many similar characteristics to our voice manager. That is, it manages immutable objects of various sizes. Also, its "structured version images" used for large objects are similar to the data structures used by the voice file server to describe voice files. However, unlike the voice manager, Swallow maintains histories to link together objects that are derived from one another and provides atomic operations on multiple objects. Also, it provides no editing mechanisms or garbage collection, just read and write operations.
The Diamond Document Store [26], like the Etherphone system, manages documents that contain various media elements by reference; it also allows documents to be shared among users by reference. A simple reference count scheme suffices for deallocating objects that reside in the Document Store but are not referenced by any document or document folder since the Diamond system does not allow documents stored outside the system to reference internally stored objects. The Etherphone system, on the other hand, strives to provide voice services that can be used along with other existing services, such as the Grapevine mail system [3] and Alpine file servers [6].
The Cambridge File Server [16] was perhaps the first network-accessible storage system to require clients to take an explicit action to prevent files from being automatically garbage collected. In particular, it deletes files that are not accessible from server-maintained, but client-updated "indices". Thus, these indices play much the same role as the voice manager's interest database.
Liskov and Ladin present an example of a distributed garbage collector [14]. Their approach requires all sites that store references to other objects to run a garbage collector locally and send information about non-local references to a reference server. In some sense, their use of a reference server is similar to our use of registered interests, but much more limited. One interesting contribution they make is how to build a highly-available reference server; we could use these techniques to build an interest server.
5. Conclusions
The facilities for managing stored voice in the Etherphone system were designed in adherence to the following principles:
· Permit sharing among various clients
Maintaining voice on a publicly accessible server facilitates sharing. Clients can freely share references to voice ropes without incurring the overhead of transmitting the voice itself. Because voice ropes are immutable, even though they are incorporated into documents by reference, they exhibit copy semantics.
· Support easy editing of voice by programs
The editing operations provided by the voice manager are similar to those in the Cedar Rope package. This is intentional so that programmers can manipulate voice in the ways to which they are accustomed for text. The basic facilities to support editing reside on a server; workstations are responsible for providing a user interface that is integrated with their programming environment.
· Move voice data as little as possible
Once recorded in the voice file server, voice is never copied until a workstation sends a playback request; at this point the voice is transmitted directly to an Etherphone. In particular, although workstations initiate most of the operations in the Etherphone system, there is little reason for them to receive the actual voice data since they have no way of playing it.
Furthermore, to support efficient editing, we maintain a two-level storage hierarchy: voice ropes refer to intervals of voice files. A given voice rope can consist of intervals from several voice files, and a given voice file can be used by several voice ropes. A database stores the many-to-many relationships that exist between voice ropes and files. Editing operations simply create new voice ropes from old ones and add them to the database.
· Allow diverse workstations to be integrated into the system
All of the operations on stored voice are performed on a server, the voice manager. Due to the heterogeneous nature of our environment, providing a single implementation of these facilities on a server seemed better than requiring each different workstation programming environment to provide its own implementation. Moreover, the only requirements placed on a workstation in order to make use of the voice services are that it have an associated Etherphone and an RPC implementation. In particular, workstations need not have hardware support for encryption or voice I/O.
· Do not restrict the uses of voice in client applications
Voice management is provided by a server exporting an RPC interface. The voice manager makes no assumptions about the way clients make use of its services. This particularly impacted the design of the voice garbage collector.
· Provide a level of security at least as good as that of conventional file servers
We use secure RPC for all control functions in the Etherphone system and DES encryption for transmitted voice. Thus promiscuous machines are prevented from listening to any communications in the system. Storing the voice in its encrypted form protects the voice on the server and also means that the voice need not be reencrypted on playback. All in all, the voice system actually provides better security than most file servers.
· Reclaim automatically the storage occupied by unneeded voice
Garbage collection of voice ropes is done using a modified type of reference counting. Clients register interests in particular voice ropes. These interests are grouped into classes and can be invalidated according to a class-specific algorithm. For the most part, users of voice applications are not aware of how or when interests are registered since it is handled transparently by the application software.
The Etherphone system has provided an environment in which to explore the management of voluminous, shared data among distributed and heterogeneous workstation clients. The techniques presented in this paper are applicable to and beneficial for the management of various types of data including voice, video, images, and music.
Acknowledgments
The design of voice ropes evolved for several years and many people contributed valuable suggestions. Others also deserve credit for the implementation of the voice file server and the voice editor. (Names will be included in the final paper.)
References
[1]
S. Ades and D. C. Swinehart.
Voice annotation and editing in a workstation environment.
Proceedings AVIOS Voice Applications '86, September 1986, pages 13-28.
[2]
R. Bayer and E. McCreight.
Organization and maintenance of large ordered indexes.
Acta Informatica 1(3):173-189, 1972.
[3]
A. Birrell, R. Levin, R. M. Needham, and M. D. Schroeder.
Grapevine: An exercise in distributed computing.
Communications of the ACM 25(4):260-274, April 1982.
[4]
A. D. Birrell and B. J. Nelson.
Implementing remote procedure calls.
ACM Transactions on Computer Systems 2(1):39-59, February 1984.
[5]
A. D. Birrell.
Secure communication using remote procedure calls.
ACM Transactions on Computer Systems 3(1):1-14, February 1985.
[6]
M. R. Brown, K. Kolling, and E. A. Taft.
The Alpine File System.
ACM Transactions on Computer Systems 3(4):261-293, November 1985.
[7]
R. G. G. Cattell.
Design and implementation of a relationship-entity-datum data model.
Xerox Palo Alto Research Center, Technical Report CSL-83-4, May 1983.
[8]
D. D. Clark.
The structuring of systems using upcalls.
Proceedings Tenth Symposium on Operating Systems Principles, Orcas Island, Washington, December 1985, pages 171-180.
[9]
J. Donahue and W. Orr.
Walnut: Storing electronic mail in a database.
Xerox Palo Alto Research Center, Technical Report CSL-85-9, November 1985.
[10]
J. D. Gould and S. J. Boies.
Speech filing—An office system for principals.
IBM Systems Journal 23(1): 65-81, January 1984.
[11]
J. N. Gray.
Notes on database operating systems.
In Bayer et al., Operating Systems: An Advanced Course, Springer-Verlag, 1978, pages 393-481.
[12]
B. W. Lampson and K. A. Pier.
A processor for a high-performance personal computer.
Proceedings 7th Symposium on Computer Architecture, La Baule, May 1980, pages 146-160.
[13]
B. W. Lampson.
Hints for computer system design.
Proceedings Ninth Symposium on Operating Systems Principles, Bretton Woods, New Hampshire, October 1983, pages 33-48.
[14]
B. Liskov and R. Ladin.
Highly-available distributed services and fault-tolerant distributed garbage collection.
Proceedings of Symposium on Principles of Distributed Computing, Calgary, Alberta, Canada, August 1986, pages 29-39.
[15]
N. Maxemchuk.
An experimental speech storage and editing facility.
Bell System Technical Journal 59(8): 1383-1395, October 1980.
[16]
J. G. Mitchell and J. Dion.
A comparison of two network-based file servers.
Communications of the ACM 25(4):233-245, April 1982.
[17]
National Bureau of Standards.
Data Encryption Standard.
Federal Information Processing Standard (FIPS) Publication 46, U. S. Department of Commerce, January 1977.
[18]
R. Nicholson.
Integrating voice in the office world.
BYTE 8(12):177-184, December 1983.
[19]
J. K. Reynolds, J. B. Postel, A. R. Katz, G. G. Finn, and A. L. DeSchon.
The DARPA experimental multimedia mail system.
Computer 18(10):82-89, October 1985.
[20]
A. Ruiz.
Voice and telephony applications for the office workstation.
Proceedings 1st International Conference on Computer Workstations, San Jose, CA, November 1985, pages 158-163.
[21]
L. Svobodova.
A reliable object-oriented data repository for a distributed computer system.
Proceedings Eighth Symposium on Operating Systems Principles, Pacific Grove, California, December 1981, pages 47-58.
[22]
L. Svobodova.
File servers for network-based distributed systems.
ACM Computing Surveys 16(4):353-398, December 1984.
[23]
D. C. Swinehart, L. C. Stewart, and S. M. Ornstein.
Adding voice to an office computer network.
Proceedings IEEE GlobeCom '83, November 1983.
Also available as Xerox Palo Alto Research Center, Technical Report CSL-83-8, February 1984.
[24]
D. C. Swinehart, P. T. Zellweger, R. J. Beach, and R. B. Hagmann.
A structural view of the Cedar programming environment.
ACM Transactions on Programming Languages and Systems 8(4):419-490, October 1986.
[25]
D. C. Swinehart, D. B. Terry, and P. T. Zellweger.
An experimental environment for voice system development.
IEEE Office Knowledge Engineering Newsletter, February 1987.
[26]
R. H. Thomas, H. C. Forsdick, T. R. Crowley, R. W. Schaaf, R. S. Tomlinson, V. M. Travers, and G. G. Robertson.
Diamond: A multimedia message system built on a distributed architecture.
Computer 18(12):65-78, December 1985.
[27]
N. Yankelovich, N. Meyrowitz, and A. van Dam.
Reading and writing the electronic book.
Computer 18(10):15-30, October 1985.