Managing Stored Voice in the Etherphone System
Abstract: The voice rope facilities and implementations provided by the voice manager in the Etherphone™ system support recording, 
editing, and playing stored voice in a distributed personal computing environment.  To facilitate sharing, voice is stored on a 
special voice file server that is accessible via the local internet.  Etherphones transfer encrypted voice to and from the voice 
file server over an Ethernet.  The voice manager provides operations for editing a voice passage once it has been recorded.  Rather 
than rearranging the contents of voice files when they are edited, the voice manager simply builds a data structure to represent the 
edited voice and stores it in a server database.  This data structure, called a voice rope, consists of a list of intervals within 
voice files.  Clients refer to voice ropes solely by reference.  Typical uses include embedding references in multimedia documents. 
 A modified style of reference counts, called interests, enables unwanted voice ropes to be garbage collected.  These interests are 
grouped into classes and can be invalidated according to a class-specific algorithm that runs periodically.
1. Introduction
    Voice is an important and widely used medium for interpersonal communication.  Computers facilitate interpersonal communication 
    through electronic mail and shared documents.  Yet, our computer systems have traditionally forced us to communicate textually. 
     A major goal of the Etherphone™ system developed at Xerox PARC was to allow voice to be incorporated into computing 
    environments and used in much the same way as text.  This paper addresses the problems associated with managing stored voice in 
    a distributed computing environment.
    The Etherphone system is intended for use in a locally distributed computing environment containing multiple workstations and 
    programming environments, multiple networks and communication protocols, and perhaps even multiple telephone transmission and 
    switching choices.  The system is intended to be extensible in that introducing new applications, network services, 
    workstations, networks, and other components is possible.
    As with text, we want the ability to incorporate voice¹ easily into electronic mail messages, voice-annotated documents, user 
    interfaces, and other interactive applications.  Nicholson gives a good discussion of many office applications that are made 
    possible by treating voice as data [18].  Clients should be able to combine previously recorded voice in various ways and 
    insert fresh voice into existing voice passages.  Clients should be able to share voice as freely as they share files.  Also, 
    the system should permit programmer control over all of these functions.
¹ We are interested in capturing individual voices, conversations, music, and other sounds with reasonable fidelity.  This affects 
the choice of encoding and as a result the volume of storage required, but not the management methods described here.  The 
remainder of this discussion thus mentions only recorded voice, without precluding any of the other uses.
    The characteristics of voice, however, differ greatly from those of text.  Standard telephone-quality uncompacted voice occupies 64 
    Kbits of storage per second of recorded voice.  This is several orders of magnitude greater than the equivalent typed text.  
    Voice also requires special devices for recording and playing it; that is, a user cannot simply type in a voice passage.  More 
    importantly, voice transmission has stringent real-time requirements.  These differences dictate special methods for 
    manipulating and sharing voice.
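As a rough illustration of the storage volumes involved, the following sketch computes the space needed for uncompressed voice.  It assumes that 64 Kbits/sec means 64,000 bits per second (8,000 eight-bit samples per second); the function name is ours and is not part of the Etherphone system.

```python
# Storage arithmetic for telephone-quality voice.
# Assumption: 64 Kbits/sec is taken as 64,000 bits per second.

VOICE_BITS_PER_SECOND = 64_000


def voice_bytes(seconds: float) -> int:
    """Bytes needed to store `seconds` of uncompressed voice."""
    return int(seconds * VOICE_BITS_PER_SECOND) // 8


if __name__ == "__main__":
    for label, seconds in [("1 second", 1), ("1 minute", 60), ("1 hour", 3600)]:
        print(f"{label}: {voice_bytes(seconds):,} bytes")
```

Under these assumptions a one-minute message occupies 480,000 bytes, which is why the methods described later avoid copying or moving recorded voice.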
    To this end, we have developed application-independent methods for recording, playing back, editing, and otherwise manipulating 
    digitized voice, based on an abstraction that we call voice ropes.  Section 2 presents the operations on voice ropes available 
    to application programs in the Etherphone system.
    Section 3 then discusses the design and implementation of these operations and the rationale governing various design choices.  The 
    storage requirements of recorded voice demanded that editing be accomplished in such a way that existing voice passages need 
    not be moved, copied, or decrypted.  For similar reasons, sharing of voice must not involve copying the voice passages 
    themselves.  The major technical contributions described in this paper involve the use of simple databases to:
    (1)     describe the results of editing operations, and
    (2)    provide a modified style of reference counting required to allow the automatic reclamation of obsolete voice.
Section 4 relates our experiences to date with incorporating voice into workstation applications, and section 5 reviews our design 
principles and how they were met.  First, the following subsection gives a quick overview of the Etherphone system architecture.
1.1 The Etherphone System
    Figure 1 depicts the basic components of the Etherphone system in a simple configuration [25].  Each personal workstation is 
    associated with, but not directly attached to, a microprocessor-based telephone instrument called an Etherphone.  Etherphones 
    digitize, packetize, and encrypt telephone-quality voice and transmit it directly over an Ethernet.  A voice control server 
    provides control functions similar to a conventional business telephone system and manages the interactions between all the 
    other components.  In particular, it allows voice-carrying conversations to be established between two or more Etherphones, 
    workstations, or servers.  A voice file server, discussed further in Section 3.1, provides storage for recorded voice.  The 
    system can also include other specialized sources or sinks of voice, such as a text-to-speech server that receives text strings 
    and returns the equivalent spoken text to the user's Etherphone. 
Figure 1.  A Simple Etherphone System Environment
    Workstations are the key to providing enhanced user interfaces and control over the voice capabilities.  We rely on the 
    extensibility of the local programming environment (be it Cedar, Interlisp, or the Xerox Development Environment) to facilitate 
    the integration of voice into workstation-based applications.  Workstation program libraries implement the client programmer 
    interface to the voice system.
    All of the communication required for control in the voice system, such as conversation establishment, is accomplished via a remote 
    procedure call (RPC) protocol [4].  Multiple implementations of the RPC mechanisms permit the integration of workstation 
    programs and voice applications programmed in different environments.  During the course of a conversation, RPC calls emanating 
    from the voice control server inform participants about various activities concerning the conversation.  Active parties in a 
    conversation exchange voice using a voice transmission protocol [23].  
    The server software and the initial workstation software were developed in the Cedar programming environment [24].  More information 
    on the equipment and protocols used in the Etherphone system, as well as the applications built to date, can be found in 
    related papers [23] [25].
2. Operational Overview
    The implementation of facilities for recorded voice is somewhat involved, but the actions to be performed are conceptually quite 
    simple.  In looking for an application-independent abstraction to present to application programmers, it occurred to us that 
    many of these actions closely resembled operations normally associated with text string manipulation.  The Cedar system 
    provides a powerful text string abstraction called a rope [24].  By analogy, in the Etherphone system, we refer to sequences of 
    stored voice samples as voice ropes.  Each voice rope is defined by a unique identifier, a VRID, instead of by a memory 
    pointer, because (unlike Cedar ropes) voice ropes are persistent objects that are meant to last a long time.  To aid in sharing 
    and to facilitate the use of voice by heterogeneous workstations, the storage for voice ropes, as well as the operations on 
    them, are provided by a network service, the voice manager.
    Clients refer to voice ropes solely by reference, that is, by their unique identifiers (VRIDs).  The voice manager places no 
    restrictions on a client's use of voice ropes.  For instance, voice ropes could be used by an interactive interface to provide 
    audio feedback.  Most uses involve embedding speech in some type of document, such as an annotated manuscript, program 
    documentation, or electronic mail.  The use of such embedded references to refer to voice, video, and other diverse types of 
    information has been termed a hypermedia system [27].
    From a client's perspective, a voice-annotated document should behave as though the voice were stored directly in the document's 
    file rather than being included by reference.  For example, once a voice message is sent using electronic mail, it should not 
    be possible for the author or another user to change the message's contents.  For this reason, voice ropes are immutable.  The 
    recording and editing operations create new voice ropes; they do not modify existing ones.
2.1 Recording and playback
    To record or play back a voice rope, a conversation is set up between the voice manager and an Etherphone.  The main operations 
    supported by the voice manager are as follows: 
RECORD[conversation] -> VRID, requestID
    Voice received by the server over the communication path defined by the given conversation is stored and assigned a unique VRID; 
    recording continues until a subsequent STOP operation.  The returned requestID identifies this operation in subsequent reports 
    (see below).
PLAYBACK[conversation, VRID, interval] -> requestID
    The specified interval of the voice rope is transmitted over the given conversation.  An interval denotes either the entire voice 
    rope or a time-indexed portion of it at a resolution of about 1 ms.
STOP[conversation]
    Any recording or playback operations that are in progress or queued for the given conversation are immediately halted.
These operations are invoked on the voice manager using the Cedar RPC facility.  The RECORD and PLAYBACK operations are performed 
asynchronously.  That is, the remote procedure call returns after the operation has been queued by the server.  Queued operations 
are performed in order.  
    The voice manager generates event reports upon the start and completion of a queued operation.  The requestID returned by each 
    invocation is used to associate reports with specific operations.  In particular, the voice manager makes the following call to 
    all participants in a conversation to inform them of the status of various requested operations concerning that conversation:
REPORT[requestID, {started | finished | flushed}]
    The requested operation has been started, successfully completed, or halted by a STOP operation.
Having reports flow from server to clients is conceptually similar to Clark's upcalls and is accomplished in a similar manner [8].
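The queuing and reporting behavior described above can be modeled roughly as follows.  This is a single-process sketch, not the voice manager's implementation: the class name, the method names, the VRID format, and the use of a callback in place of an RPC-delivered REPORT are all assumptions made for illustration.

```python
from collections import deque
from itertools import count


class QueuedVoiceOps:
    """Toy model of asynchronous RECORD/PLAYBACK queuing with REPORT callbacks."""

    def __init__(self, report):
        self._report = report      # called as report(request_id, status)
        self._queues = {}          # conversation -> deque of queued request IDs
        self._active = {}          # conversation -> request ID in progress
        self._ids = count(1)

    def _enqueue(self, conversation):
        request_id = next(self._ids)
        self._queues.setdefault(conversation, deque()).append(request_id)
        return request_id          # the call returns as soon as the request is queued

    def record(self, conversation):
        request_id = self._enqueue(conversation)
        return f"VR{request_id}", request_id   # hypothetical VRID format

    def playback(self, conversation, vrid, interval=None):
        return self._enqueue(conversation)

    def start_next(self, conversation):
        """Begin the oldest queued operation if none is in progress."""
        queue = self._queues.get(conversation)
        if conversation in self._active or not queue:
            return None
        request_id = queue.popleft()
        self._active[conversation] = request_id
        self._report(request_id, "started")
        return request_id

    def finish_active(self, conversation):
        """Mark the operation in progress as successfully completed."""
        request_id = self._active.pop(conversation, None)
        if request_id is not None:
            self._report(request_id, "finished")
        return request_id

    def stop(self, conversation):
        """Halt the operation in progress and discard everything queued."""
        halted = []
        active = self._active.pop(conversation, None)
        if active is not None:
            halted.append(active)
        halted.extend(self._queues.pop(conversation, deque()))
        for request_id in halted:
            self._report(request_id, "flushed")
        return halted
```

In this model, three playback requests queued on one conversation are started in order; a STOP issued while the second is playing produces a "flushed" report for both the second and the still-queued third.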
2.2 Editing support
    Once recorded, voice ropes can be used in editing operations to produce new, immutable voice ropes.  Several of the operations on 
    Cedar ropes, such as producing substrings or concatenating existing strings, are directly applicable to voice.  Their 
    transliteration for voice ropes yields these functions:
CONCATENATE[VRID1, VRID2, ...] 
    Produces a new voice rope that is the concatenation of the given voice ropes.
SUBSTRING[VRID1, interval] 
    Produces a new voice rope consisting of the specified interval of VRID1.
REPLACE[VRID1, interval, VRID2] 
    Produces a new voice rope that is obtained by replacing the particular interval of VRID1 with VRID2.  This is a composition of the 
    CONCATENATE and SUBSTRING operations, provided for efficiency and convenience.
LENGTH[VRID] 
    Returns the length of the given voice rope in milliseconds.
One additional operation peculiar to voice ropes was provided to aid in editing:
DESCRIBE[VRID] 
    Returns a list of time intervals that denote the non-silent talkspurts of the given voice rope.  A talkspurt is defined to be any 
    sequence of voice samples separated by some minimum amount of silence.
These operations, available via RPC calls to the voice manager, are intended for use by programmers.  Applications that handle 
voice must employ these operations to construct the facilities visible to the end user.  
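To make the talkspurt definition used by DESCRIBE concrete, the sketch below derives talkspurt intervals from a sequence of per-frame loudness values.  The frame size, the loudness threshold, and the minimum silence duration are illustrative assumptions; the paper does not give the values or the detection method used by the voice file server.

```python
def talkspurts(energies, frame_ms=20, threshold=100, min_silence_ms=200):
    """Return [(start_ms, length_ms), ...] for the non-silent runs.

    energies: one loudness value per frame.  Frames at or above `threshold`
    count as speech.  A quiet run shorter than `min_silence_ms` does not
    end a talkspurt.  All default values are illustrative assumptions.
    """
    spurts = []
    start = None        # index of the first frame of the current talkspurt
    last_loud = None    # index of the most recent loud frame in it
    min_silent_frames = max(1, min_silence_ms // frame_ms)

    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud >= min_silent_frames:
            # Enough silence has accumulated: close the talkspurt at the
            # last loud frame, excluding the trailing silence.
            spurts.append((start * frame_ms, (last_loud - start + 1) * frame_ms))
            start = None

    if start is not None:
        spurts.append((start * frame_ms, (last_loud - start + 1) * frame_ms))
    return spurts
```

An editor could use such a list to display the talkspurts of a voice rope and to let the user select intervals on talkspurt boundaries.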
2.3 Interests
    The voice manager also provides operations for managing voice references.  These operations provide a sort of directory for voice 
    ropes.  Although simple applications can use this directory as their means for naming and locating voice ropes, that is not its 
    primary purpose.  As with any storage system, unreferenced storage space should be reclaimed.  With voice, or other voluminous 
    media such as video, the need is particularly acute.  In the Etherphone system, client code must assist with garbage collection 
    by using the directory operations to express an interest in each referenced voice rope.  We list the client operations here, 
    deferring to Section 3.4 a discussion of the rationale for this approach and a description of the underlying implementation.  
    The interest operations are:
RETAIN[VRID, class, interest] 
    Registers an interest of the particular class in the given voice rope.  The interest identifies a reference to the voice rope 
    within the class.  This operation is idempotent; successive calls with the same arguments register at most one interest in the 
    given voice rope.  
FORGET[VRID, class, interest]
    Deregisters the specified interest.
LOOKUP[class, interest] 
    Returns the unordered list of voice ropes associated with a particular interest.
Both interest and class are arbitrary text string values.  The form of the interest value is generally class-specific; moreover, 
clients are responsible for generating unique values for different interests within a class.
    The class identifies the way in which the voice rope is being used by a particular application.  For example, we use the class 
    "FileAnnotation" to indicate that a document stored in a named file is annotated by a set of utterances; the interest field is 
    the file name.  The class "Message" indicates that the reference is a part of an electronic mail message incorporating recorded 
    voice, and the interest is the unique postmark supplied by the message system.
    A combination of client workstation software and automatic collection methods must hide these interest operations from actual 
    users.  Client applications must always register interests in order to ensure retention of voice ropes to which they hold 
    references.  For some classes, clients must also explicitly FORGET their interests; for others, such as the "FileAnnotation" 
    class, automatic methods described in Section 3.4 make this unnecessary.
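The interest operations can be pictured as operations on a set of (VRID, class, interest) triples.  The following is a minimal in-memory sketch under that assumption; the class and method names are ours, and the actual voice manager keeps these entries in a database, as described in Section 3.4.

```python
class InterestDirectory:
    """In-memory sketch of RETAIN, FORGET, and LOOKUP."""

    def __init__(self):
        self._entries = set()   # {(vrid, class_name, interest)}

    def retain(self, vrid, class_name, interest):
        # Adding to a set is idempotent: repeating the call changes nothing.
        self._entries.add((vrid, class_name, interest))

    def forget(self, vrid, class_name, interest):
        # discard() ignores entries that are not present.
        self._entries.discard((vrid, class_name, interest))

    def lookup(self, class_name, interest):
        """Return the (unordered) set of VRIDs registered under an interest."""
        return {v for (v, c, i) in self._entries
                if c == class_name and i == interest}

    def __len__(self):
        return len(self._entries)


if __name__ == "__main__":
    directory = InterestDirectory()
    directory.retain("VR1", "FileAnnotation", "/docs/report")   # hypothetical file name
    directory.retain("VR1", "FileAnnotation", "/docs/report")   # retried call, no effect
    directory.retain("VR2", "FileAnnotation", "/docs/report")
    print(sorted(directory.lookup("FileAnnotation", "/docs/report")))
```

Because an entry either exists or does not, a client that is unsure whether an earlier call succeeded can simply repeat it.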
3. Detailed Design Decisions
    The voice manager uses a voice file server to store voice data.  This server provides RECORD, PLAYBACK, and STOP operations that 
    are semantically similar to those described in section 2.1, but operate on voice files.  The more complex voice rope editing 
    and directory structures have been implemented as separate, higher-level components, in part to make the voice facilities 
    independent of the choice of the underlying file storage.  Voice ropes are actually made up of pieces of one or more voice 
    files.  
    The implementations of both voice rope editing and interest management depend on a simple but robust database facility that we 
    developed for these purposes.  Although the operations for editing voice ropes have been patterned after the Cedar Rope 
    package, a different underlying implementation was necessitated by the disparate characteristics of voice and text.  
    Specifically, editing voice by actually copying the bytes, as is sometimes done for Cedar's ropes, is expensive since voice is 
    voluminous.  Thus, rather than rearranging the contents of voice files to edit them, the voice manager simply builds a data 
    structure using the facilities of the database package.  Interests are registered in a similar database.
    A garbage collector uses interests to reclaim voice storage that is no longer needed.  Devising techniques for automatically 
    collecting garbage in a distributed, heterogeneous environment was one of the most difficult problems faced in the design of 
    the voice manager.
    The components of the voice manager are logically layered as in Figure 2.  Each is discussed in more detail in the following 
    sections.
    Figure 2.  Voice Storage Components
3.1 Voice file server
    A voice file server differs from normal file servers [22] in that it must support the real-time requirements of voice.  In 
    particular, it must be able to maintain a sustained transfer rate of 64 Kbits/sec, and it should be able to support several 
    such transfers simultaneously.  There is no inherent reason why a general-purpose file server could not be extended to support 
    these stringent real-time requirements.  However, the file systems we had available at the onset of this project, having been 
    optimized for different styles of access, could not, in fact, support them.  At present, our file server is a special-purpose 
    system built on top of Cedar's standard file system [24].
    We will not describe the workings of the Etherphone voice file server in any detail.  It implements the operations needed to 
    allocate, record and play back voice files that are named by unique identifiers (VFIDs).  Tables locating the boundaries 
    between sound and silence are stored along with the voice to permit efficient execution of the DESCRIBE operation.
    Etherphones encrypt voice using DES electronic-codebook (ECB) encryption [17] as it is transmitted.  The voice file server simply 
    stores the voice in its encrypted form.  This adds a requirement in the voice rope component to manage encryption keys.  The 
    stored voice is never decrypted except by an Etherphone when being played back.
3.2 Database facilities
    Voice ropes and interests could be implemented using specialized data structures.  However, we realized that the facilities of a 
    database would serve these needs well, and the performance requirements were not particularly stringent.  The objects 
    representing voice ropes are immutable; multi-object atomic updates are not required; and, the usage of the system is expected 
    to generate relatively infrequent changes to small numbers of voice segments, each several seconds or even minutes in duration. 
     An original prototype voice rope system was built using an existing entity-relation database system (Cypress) in the Cedar 
    environment [7].  Cypress provides transaction semantics for reliability and data consistency, typed data, facilities for 
    constructing elaborate queries, and adequate steady-state performance.  However, in the current Cypress implementation, the 
    overhead of opening and committing transactions is too great to permit sharing a database among many client programs that are 
    making frequent updates.  For these reasons, a simple, robust database representation that is particularly well-suited to voice 
    ropes was developed.
    A database entry is a simple sequence of key/value pairs, each represented as a text string.  The database system stores each 
    entry in a write-ahead log [11].  Unlike most database systems in which the data is logged only until it can be committed and 
    written to a more permanent location, the log is itself the permanent source of data.  Once logged, the data is never moved.  
    To allow rapid queries or enumeration of database entries, B-Tree indices [2] are built to map the values of one or more keys 
    to the corresponding locations in the log file.  This log-based database package was inspired by similar methods found in the 
    Walnut electronic mail system [9] and recommended by Lampson [13].
    In addition to processing queries, the database system supports atomic insert, replace, and delete operations for single entries.  
    If a crash occurs while writing a log entry, the incomplete entry is discovered upon recovery.  If a crash occurs while 
    updating a B-Tree index, the log must be replayed to build the index from scratch, which is an expensive operation.  However, 
    since an untimely hardware crash is a rare event, rebuilding an index when necessary seems less costly than building and 
    running a transaction system to update the log and indices atomically.
    Since no long-term locks are maintained, multiple clients can easily interleave queries and updates.  The append-only log leads to 
    predictable and consistent performance (except during crash recovery), with no lengthy pauses due to transaction conflicts.  A 
    compaction operation is provided that enumerates the current entries (in the order specified by the primary key) to produce a 
    new, minimum-length log file; the old one can be saved on archival storage, then deleted.
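The log-based design can be sketched in a few dozen lines.  The version below is an illustration only: it uses one JSON object per line as the log format, a Python dictionary in place of the B-Tree index, and a "_deleted" marker for deletions, none of which come from the paper.  It also overwrites the old log during compaction rather than archiving it.

```python
import json
import os
import tempfile


class LogDatabase:
    """Append-only log of entries plus an in-memory index from key to log offset."""

    def __init__(self, path, key="VRID"):
        self.path, self.key = path, key
        self.index = {}                    # key value -> byte offset of latest entry
        open(path, "ab").close()           # create the log if it does not exist
        self.rebuild_index()

    def rebuild_index(self):
        """Replay the whole log, as would be needed after a crash."""
        self.index.clear()
        with open(self.path, "r+b") as log:
            while True:
                offset = log.tell()
                line = log.readline()
                if not line:
                    break
                if not line.endswith(b"\n"):
                    log.truncate(offset)   # drop an incomplete final entry
                    break
                entry = json.loads(line)
                if entry.get("_deleted"):
                    self.index.pop(entry[self.key], None)
                else:
                    self.index[entry[self.key]] = offset

    def _append(self, entry):
        with open(self.path, "ab") as log:
            log.seek(0, os.SEEK_END)
            offset = log.tell()
            log.write(json.dumps(entry).encode() + b"\n")
        return offset

    def put(self, entry):
        """Insert or replace the entry with this key; old data is never moved."""
        self.index[entry[self.key]] = self._append(entry)

    def delete(self, key_value):
        self._append({self.key: key_value, "_deleted": True})
        self.index.pop(key_value, None)

    def get(self, key_value):
        offset = self.index.get(key_value)
        if offset is None:
            return None
        with open(self.path, "rb") as log:
            log.seek(offset)
            return json.loads(log.readline())

    def compact(self):
        """Write the current entries, in key order, to a new minimum-length log."""
        new_path = self.path + ".compact"
        with open(new_path, "wb") as out:
            for key_value in sorted(self.index):
                out.write(json.dumps(self.get(key_value)).encode() + b"\n")
        os.replace(new_path, self.path)    # a real system could archive the old log
        self.rebuild_index()


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "ropes.log")
    db = LogDatabase(log_path)
    db.put({"VRID": "VR1", "VFID": "VF1"})
    print(db.get("VR1"))
```

Replaying the log is linear in its length, which matches the observation above that rebuilding an index is expensive but acceptable if crashes are rare.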

3.3 Voice rope structure
    The data structure representing a voice rope consists of a list of [VFID, interval] pairs.  Simple voice ropes consist of a single 
    interval within a single voice file, often the whole file.  More complex voice ropes can be constructed using the editing 
    operations presented in section 2.2.  For example, suppose two simple voice ropes, VR1 and VR2, exist with the following 
    structures:
    VR1 = <VFID: VF1, interval: [start: 0, length: 4000]>
    VR2 = <VFID: VF2, interval: [start: 500, length: 2000]>
Then the operation
REPLACE[base: VR1, interval: [start: 1000, length: 1000], with: VR2]
produces a new voice rope, VR3, with the structure:
VR3 = <VFID: VF1, interval: [start: 0, length: 1000],
       VFID: VF2, interval: [start: 500, length: 2000],
       VFID: VF1, interval: [start: 2000, length: 2000]>
as depicted in Figure 3.

Figure 3.  Structure of VR3 after REPLACE operation
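The REPLACE example above can be reproduced with a small sketch that treats a voice rope as a flat list of [VFID, interval] pairs.  The type and function names are ours; the sketch assumes that intervals are measured in milliseconds and ignores encryption keys and the database.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    vfid: str     # voice file identifier
    start: int    # offset within the voice file, in ms
    length: int   # duration, in ms


VoiceRope = List[Segment]


def rope_length(rope: VoiceRope) -> int:
    return sum(seg.length for seg in rope)


def substring(rope: VoiceRope, start: int, length: int) -> VoiceRope:
    """Return the part of `rope` covering [start, start + length)."""
    result = []
    end = start + length
    pos = 0                                   # rope position where `seg` begins
    for seg in rope:
        seg_end = pos + seg.length
        lo, hi = max(start, pos), min(end, seg_end)
        if lo < hi:                           # the segment overlaps the interval
            result.append(Segment(seg.vfid, seg.start + (lo - pos), hi - lo))
        pos = seg_end
    return result


def concatenate(*ropes: VoiceRope) -> VoiceRope:
    """Return a new flat rope; no voice data is touched."""
    return [seg for rope in ropes for seg in rope]


def replace(base: VoiceRope, start: int, length: int,
            replacement: VoiceRope) -> VoiceRope:
    """Compose SUBSTRING and CONCATENATE, as described for REPLACE."""
    head = substring(base, 0, start)
    tail_start = start + length
    tail = substring(base, tail_start, rope_length(base) - tail_start)
    return concatenate(head, replacement, tail)


if __name__ == "__main__":
    vr1 = [Segment("VF1", 0, 4000)]
    vr2 = [Segment("VF2", 500, 2000)]
    for seg in replace(vr1, 1000, 1000, vr2):
        print(seg)
```

Each operation returns a new list and leaves its arguments unchanged, mirroring the immutability of voice ropes.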
    To record a new voice rope, the voice manager calls on the voice file server to allocate and then record a voice file.  Once 
    recording completes, a simple voice rope is added to the voice rope database to represent the complete voice file just 
    recorded.  A database index allows voice rope structures to be retrieved efficiently by VRID; an index is also maintained on 
    VFIDs, which is useful in garbage collection.
    When playing a voice rope, the voice manager retrieves the voice rope's structure from the database, distributes the encryption 
    keys of the various intervals to the parties participating in the conversation, and calls upon the voice file server to play 
    the intervals of the voice rope in the appropriate order.   Requesting the playback of several intervals in succession relies 
    on the asynchronous nature of the voice file server operations to ensure that gaps do not get inserted between the intervals.  
    Secure RPC [5] is used to distribute encryption keys safely.
    The structure of voice ropes is kept "flat" to optimize playback.  By having each voice rope refer directly to voice files, only a 
    single database access is required to determine the voice rope's complete structure.  An alternative design, more closely 
    modeled on the Cedar Rope abstraction, would store complex voice ropes as intervals of other voice ropes.  In such a design, a 
    voice rope would conceptually be the root of a tree of other voice ropes with intervals of voice files at the leaves of the 
    tree.  This alternative design would reduce the work associated with each editing operation, but would increase the number of 
    database accesses required to play a voice rope.  The flat design was chosen because it improves playback behavior, and, in 
    practice, playback is much more frequent than editing.  Moreover, it yields simpler and more compact data structures when used 
    (as intended) to represent small numbers of coarse-grained edits to voice.
    Note that the actual voice is neither moved nor copied once recorded in a voice file, even during editing.  The voice also is never 
    decrypted by the voice file server.  The encryption keys for the various intervals that compose a voice rope are stored in the 
    database along with the voice rope's structure.  Independently enciphering small blocks of voice using ECB encryption [17] 
    ensures that voice can be edited on millisecond-resolution boundaries while remaining encrypted.
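The millisecond figure is consistent with simple arithmetic on the DES block size.  The calculation below assumes that 64 Kbits/sec means 64,000 bits per second and that edit boundaries fall on cipher block boundaries; the second assumption is our inference rather than a statement from the system description.

```latex
t_{\text{block}}
  = \frac{\text{DES block size}}{\text{voice data rate}}
  = \frac{64\ \text{bits}}{64{,}000\ \text{bits/s}}
  = 1\ \text{ms}
```

Since each 64-bit block is enciphered independently in ECB mode, an interval that begins and ends on block boundaries can be decrypted by the Etherphone without any of the surrounding voice.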
3.4 Storage reclamation
    This section presents the rationale for the interest operations of section 2.3 and describes their implementation.  These 
    operations were included primarily to permit automatic reclamation of storage for voice ropes and their associated voice files.
    Once all voice ropes that reference a given voice file have been deleted, no voice rope will ever again refer to that voice file.  
    The voice file can then be deleted as well.  This condition is easily determined by a database query.  The more difficult 
    problem is deciding when voice ropes themselves can be reclaimed.
    From the client standpoint, the most straightforward method for garbage collecting voice ropes would be periodically to examine all 
    of the clients' storage for references to voice ropes, then to collect any unreferenced ropes in a conventional sweep pass.  
    Unfortunately, this sort of distributed garbage collector would be impossible to implement in our open, relatively 
    heterogeneous, environment.  We do not wish to restrict the uses clients make of voice ropes, or how and where clients store 
    VRIDs.
    A common alternative is to provide a reference counting scheme in which counters are used to determine the number of clients 
    interested in a particular object.  When an object's counter goes to zero, the object's storage can be reclaimed.  The burden 
    is placed on clients to increment and decrement the counts for the objects that they are using.
    The use of standard reference counting presents formidable problems in a distributed environment.  Reference counts cannot be 
    managed reliably unless an atomic transaction spans the use of the reference and the reference count operation.  Without 
    transactions, if the server or client fails in the process of incrementing or decrementing a reference count, the client may be 
    left in an uncertain state regarding the outcome of the operation.  In particular, client failure-recovery procedures might 
    incorrectly repeat a reference count operation.  Furthermore, it is not always possible to arrange for a reference count to be 
    decremented (such as when a reference-containing file residing on an uncooperative server is deleted).  Finally, reference 
    counts are anonymous, giving no help in locating erroneous references.
    Interests were designed to remedy these shortcomings associated with simple reference counts.  In a way, interests represent a 
    return to the full-scan method first proposed: since an examination of the entire environment is not possible, clients are 
    required to record their voice rope interests in a known place.  The interests serve as proxies for the actual references for 
    reclamation purposes.  Interests address two problems: how to retain voice ropes for which references exist, and how to 
    determine when it is permissible to reclaim them.
3.4.1 Retention of voice ropes
    The use of interests to retain voice ropes is straightforward.  Calling RETAIN adds an entry to the system's interest database (if 
    the entry is not already there), recording the supplied VRID, class, and interest values along with the user's identification.  
    The interest database includes indices permitting queries based on any of these attributes.  Since a given entry appears in the 
    database at most once, the RETAIN operation is idempotent.  It can be safely retried in case of failures or uncertainty.
    The information stored in the interest database is sufficient to allow either human administrators or client applications to 
    determine whether or not an interest is still valid.
3.4.2 Loss of interest
    Determining when to invalidate interest entries is more involved.   Some client programs can determine both when to RETAIN an 
    interest and when to FORGET it.  One example is the electronic mail system that implements the "Message" class for which the 
    interest design was first developed.  When the mail system deletes an instance of a text message that contains voice 
    references, it issues a FORGET for all of the associated VRIDs.  Another example is the "SysNoises" class, a set of useful 
    recorded announcements and sounds that system administrators manage manually.  The FORGET implementation simply deletes the 
    associated interest entry or entries from the interest database, ignoring requests to delete nonexistent entries to ensure that 
    FORGET is also an idempotent operation.  When every interest for a voice rope has been forgotten, the voice rope becomes eligible 
    for deletion.
    As we discovered when we began including voice ropes as annotations embedded in ordinary structured text documents, it is not 
    always easy to arrange for clients to issue FORGET actions at the necessary times.  Consider the following scenario.  A user 
    records a voice rope and embeds a reference to it in a document; an interest in the voice rope is then registered for the 
    document.  The user then copies this document from his workstation to a public file server and announces its existence in a 
    message to interested parties.  Several months later, he deletes the file, without remembering that it had voice annotations.  
    Unless further actions are taken, the interest, and hence the referenced voice rope, will never be reclaimed.  
    Expecting an arbitrary file server to delete interests is not really reasonable.  In the above scenario, one could argue that the 
    file server should have issued the necessary FORGET operation when the file was deleted.  However, this implies that the 
    software running on the file server could or should be modified to recognize the existence of voice in files and to take 
    appropriate action.  Many of the file servers in a typical network environment (including ours) cannot be so modified, either 
    because they are old and written in some obscure programming language, or because they were purchased from outside vendors and 
    the source code is not available.
    In this case, although interests cannot be explicitly deleted by the agents whose actions invalidate them, it is possible to 
    determine automatically when a particular interest is no longer valid.  We assume that a knowledgeable workstation client 
    program will issue an operation like
RETAIN[vrID: VRIDi, class: "FileAnnotation", interest: "annotatedFile"]
for each voice rope VRIDi referred to in a file as the file is moved from temporary workstation storage to the file named 
"annotatedFile" on a public file server.  At any later time, standard directory operations can be used to determine whether that 
specific named instance of the file still exists; if not, the associated interest is no longer valid and may be deleted.
    As another example, a "Timeout" class could record an expiration time as its interest value.  The interest is not valid after that 
    time.
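    The two validity tests just described can be sketched as follows.  This is only an illustration, not the Etherphone 
    implementation: the function names, the set of names standing in for a file server directory, and the numeric timestamps are 
    all assumptions made for the example.

```python
from typing import Set


def file_annotation_valid(interest: str, server_files: Set[str]) -> bool:
    """A "FileAnnotation" interest names a file on a public server.

    It remains valid only while that named file instance still exists.
    `server_files` is a hypothetical stand-in for a directory lookup.
    """
    return interest in server_files


def timeout_valid(expiration: float, now: float) -> bool:
    """A "Timeout" interest records an expiration time as its value.

    The interest is not valid after that time.
    """
    return now <= expiration


# The annotated file exists, so its interest is still valid.
print(file_annotation_valid("annotatedFile", {"annotatedFile"}))
# The file has since been deleted, so the interest may be removed.
print(file_annotation_valid("annotatedFile", set()))
```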
3.4.3 Garbage collection
    It remains only to devise methods for automatically locating and removing outdated interests.  The implementor of any interest 
    class may register a procedure of the following type with the voice manager:
GARBAGE[VRID, interest] -> Yes/No
Determines in a class-specific way whether the given interest has become invalid for the particular voice rope; Yes means the 
interest is garbage and may be deleted.
For example, for the class "FileAnnotation", this procedure returns Yes if and only if the file instance identified by the interest 
parameter no longer exists.
    An interest verifier periodically enumerates the database of interests and calls the class-specific GARBAGE procedure for each 
    interest.  If the procedure returns Yes, then the verifier calls FORGET[VRID, class, interest] to delete the interest from the 
    database.
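    One pass of such a verifier might look like the sketch below.  The InterestDB class, its method names, and the choice to keep 
    interests whose class has no registered procedure are assumptions made for illustration; only the RETAIN, FORGET, and GARBAGE 
    operations themselves come from the text.

```python
from typing import Callable, Dict, Set, Tuple

Interest = Tuple[str, str, str]            # (vrID, class, interest value)
GarbageProc = Callable[[str, str], bool]   # GARBAGE[VRID, interest] -> Yes/No


class InterestDB:
    """Toy in-memory stand-in for the voice manager's interest database."""

    def __init__(self) -> None:
        self.interests: Set[Interest] = set()
        self.garbage_procs: Dict[str, GarbageProc] = {}

    def retain(self, vr_id: str, cls: str, interest: str) -> None:
        self.interests.add((vr_id, cls, interest))

    def forget(self, vr_id: str, cls: str, interest: str) -> None:
        self.interests.discard((vr_id, cls, interest))

    def register_class(self, cls: str, proc: GarbageProc) -> None:
        self.garbage_procs[cls] = proc

    def verify(self) -> int:
        """Run one verifier pass; return how many interests were forgotten."""
        forgotten = 0
        # Iterate over a snapshot because forget() mutates the set.
        for vr_id, cls, interest in list(self.interests):
            proc = self.garbage_procs.get(cls)
            # Assumption: interests of unregistered classes are left alone.
            if proc is not None and proc(vr_id, interest):
                self.forget(vr_id, cls, interest)
                forgotten += 1
        return forgotten


# Hypothetical directory contents of a public file server.
existing_files = {"annotatedFile"}

db = InterestDB()
db.register_class("FileAnnotation",
                  lambda vr_id, name: name not in existing_files)
db.retain("VR1", "FileAnnotation", "annotatedFile")
db.retain("VR2", "FileAnnotation", "deletedFile")
forgotten = db.verify()
print(forgotten, sorted(db.interests))
```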
    A garbage collector for voice ropes also runs periodically.  For each voice rope in the database it (1) deletes the voice rope if 
    no interests exist that reference it, and (2) deletes voice files used by the voice rope if they are no longer a part of any 
    other voice rope.  This process refuses to collect voice ropes that are too young in order to prevent it from collecting a 
    newly created voice rope before a client has the opportunity to express an interest in it.  This method will find all 
    unreferenced voice ropes, including those for which no client ever expressed an interest.  Note that, unlike a mark-and-sweep 
    style garbage collector, these algorithms can be safely executed while the system is running and need not complete a full pass 
    through the database in order to perform useful work.
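    The collector's two steps and its minimum-age rule can be sketched as follows.  The VoiceRope record, the collect function, 
    and the representation of interests as a set of voice rope identifiers are assumptions for the example.  Because each voice 
    rope is examined independently, the loop can stop after any iteration and still have done useful work.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class VoiceRope:
    created: float            # creation time, in seconds
    files: Tuple[str, ...]    # voice files whose intervals the rope uses


def collect(ropes: Dict[str, VoiceRope], interested: Set[str],
            voice_files: Set[str], now: float, min_age: float) -> None:
    """One collector pass over the voice rope database (mutates in place)."""
    for vr_id in list(ropes):
        rope = ropes[vr_id]
        if vr_id in interested:
            continue  # some interest still references this rope
        if now - rope.created < min_age:
            continue  # too young: a client may not have registered yet
        del ropes[vr_id]  # step (1): no interests reference the rope
        for f in rope.files:
            # Step (2): delete files that no remaining rope uses.
            if not any(f in other.files for other in ropes.values()):
                voice_files.discard(f)


ropes = {
    "VR1": VoiceRope(created=0.0, files=("F1", "F2")),
    "VR2": VoiceRope(created=0.0, files=("F2",)),
    "VR3": VoiceRope(created=95.0, files=("F3",)),
}
voice_files = {"F1", "F2", "F3"}
collect(ropes, interested={"VR2"}, voice_files=voice_files,
        now=100.0, min_age=10.0)
print(sorted(ropes), sorted(voice_files))
```

    In this run VR1 is collected along with file F1; F2 survives because VR2 still uses it, and VR3 survives because it is too 
    young.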
    In summary, garbage collection takes place on three levels.  The voice manager deletes voice files when they are no longer 
    referenced by voice ropes.  Voice ropes are deleted if no interests exist for them.  Interests are either explicitly forgotten 
    by client applications or automatically deleted based on a class-specific test for validity.
4. Experience and Evaluation
    Approximately 50 Etherphones are in daily use in the Computer Science Laboratory.  Our current voice file server runs on a Dorado 
    [12] with a 300 Mbyte local disk.  Thus, it has the capacity to store over 7 hours of recorded voice; the actual storage 
    capacity depends on the amount of suppressed silence.  Most of our user-level applications to date have been created in the 
    Cedar environment [24], although limited functions have been provided for Interlisp and for stand-alone Etherphones.  We have 
    had a voice mail system running for over two years and a prototype voice editor for about 8 months.
4.1 Voice-annotated documents
    Manipulating stored voice solely by textual references, besides allowing efficient sharing and resource management, makes it easy 
    to integrate voice into documents.  For example, we were able to build a local voice mail system without changing the mail 
    transport protocols or servers.  Also, annotated documents can be stored on conventional file servers that are not aware that 
    the documents logically contain voice.
    Significant performance benefits accrue from having documents refer to voice that is stored remotely.  Although most requests to 
    record or play back voice ropes are initiated from a workstation, the voice data is never received by the workstation; instead, 
    it is transmitted directly to the associated Etherphone.
    These techniques for sharing, however, require clients to have high-bandwidth network access to the voice manager.  Transferring 
    voice-annotated documents between remote sites would require special mechanisms such as DARPA's experimental multimedia mail 
    protocol [19].
4.2 Editing voice
    We have gained considerable experience with the voice manager by building a voice editing system in Cedar.  Figure 4 displays a 
    document containing voice annotations, and just below it a visual representation of one of the voice annotations that has been 
    opened up for editing.  The bar patterns in the visual representation indicate with solid bars the contiguous intervals 
    (talkspurts) of voice or other sounds, separated by white regions that represent periods of silence [1].  Given a voice rope 
    representing the entire annotation, the DESCRIBE operation is sufficient for generating such a display.
Figure 4.  An example of voice annotation and editing. 
    The small set of editing operations provided by the voice manager has proven to be a sufficient base on which to build a complex 
    voice editor and a dictation machine.  However, to reduce traffic to the voice manager, the Cedar voice editor maintains its 
    own data structures to represent the edited voice temporarily.  That is, the voice editor ended up replicating much of the 
    functionality of the voice manager, something we were trying to avoid.  Only when a user elects to save the edited voice 
    passage does the voice manager get called to perform the necessary operations.  Given this arrangement, it may have been better 
    to let clients simply pass the voice manager a complex voice rope that it could store in its database.  
    Experience indicates that editing a voice passage invariably produces a set of "temporary" voice ropes that are used in the 
    construction of the finished result.  These objects are eventually collected by the garbage collector, so they do not present 
    much of a problem except that their creation requires seemingly unnecessary work of the voice manager.  To alleviate the 
    problem somewhat, the voice manager's interface was changed slightly so that an interval could be given for any voice rope in 
    any operation.  This substantially reduced the voice editor's use of the SUBSTRING operation.
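    The effect of this interface change can be illustrated with a small sketch.  Here a voice rope is modeled as an immutable 
    tuple of (file, start, length) segments measured in milliseconds; the concatenate function, its argument convention, and the 
    segment layout are assumptions for the example rather than the voice manager's actual interface.  Passing an interval with 
    each argument lets the caller avoid creating a separate stored voice rope for every extracted piece.

```python
from typing import List, Optional, Tuple

Segment = Tuple[str, int, int]   # (voice file ID, start ms, length ms)
Rope = Tuple[Segment, ...]       # immutable sequence of file intervals
Interval = Tuple[int, int]       # (start ms, length ms) within a rope


def length(rope: Rope) -> int:
    return sum(seg_len for _, _, seg_len in rope)


def substring(rope: Rope, start: int, count: int) -> Rope:
    """Return the segments covering [start, start + count) of the rope."""
    out: List[Segment] = []
    end = start + count
    pos = 0  # offset of the current segment within the rope
    for file_id, seg_start, seg_len in rope:
        lo = max(start, pos)
        hi = min(end, pos + seg_len)
        if lo < hi:
            out.append((file_id, seg_start + (lo - pos), hi - lo))
        pos += seg_len
    return tuple(out)


def concatenate(*parts: Tuple[Rope, Optional[Interval]]) -> Rope:
    """Join ropes, each optionally restricted to an interval.

    No intermediate rope is stored for the restricted pieces.
    """
    out: List[Segment] = []
    for rope, interval in parts:
        if interval is not None:
            rope = substring(rope, *interval)
        out.extend(rope)
    return tuple(out)


a: Rope = (("F1", 0, 5000),)
b: Rope = (("F2", 1000, 3000), ("F3", 0, 2000))
edited = concatenate((a, (1000, 2000)), (b, (2000, 2000)), (a, None))
print(edited, length(edited))
```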
    Event reporting is important in allowing the voice editor to coordinate its visual feedback with the activities of the voice file 
    server.  In particular, the voice editor moves a cursor along the screen as a voice rope is being played (the gray marker below 
    the word "score" in the voice displayed in Figure 4).  A report indicating that the playback of a particular voice passage has 
    started or finished is essential to synchronize the movement of the cursor with the transmission of voice data.
    Although the voice file server writes files on disk so that 1-second segments can be continuously transferred, clients are allowed 
    to edit voice ropes on 1-millisecond boundaries.  The file server could not possibly play back a voice rope in real time if it 
    had to perform a disk seek every millisecond.  Fortunately, users of the voice editor are encouraged to insert, delete, and 
    rearrange voice passages at the granularity of a sentence or phrase rather than trying to modify individual words or phonemes 
    [1].  Thus, in practice, one rarely sees segments of a voice rope that are less than several seconds in length.  
4.3 Interests
    The notion of grouping interests into classes and providing class-specific garbage collection algorithms is a useful and workable 
    concept.  However, we are still grappling with the details of how best to use these mechanisms.  We have found several interest 
    classes to be useful in Cedar.  
    The Cedar mail system automatically registers and deregisters interests of class "Message" as voice messages are saved and deleted 
    by users.  In addition, a "Timeout" class has been used to retract an interest automatically after a certain amount of time.  
    For instance, when sending a voice message, a timeout can be set by the sender that is long enough to give the recipients a 
    chance to receive the message and register their own interests if so desired.  Of course, problems can arise if a recipient is 
    on vacation for a period of time longer than the timeout.  For this reason, we have a means of archiving voice files before 
    deleting them from the server.
    For annotated documents in Cedar, the workstation software detects when a file is copied from the local disk to a public file 
    server; it then automatically registers the appropriate "FileAnnotation" interests for the public file.  Having workstation 
    software automatically register interests as a file is copied to a file server works remarkably well.  However, some important 
    operations are not covered by this approach: renaming a file on a file server or copying files between two file servers.  We 
    see no way to detect such operations except by modifying file server software.
    We have defined the "FileAnnotation" interest class such that its interest represents a publicly stored file name including the 
    version number.  With this scheme, interests must be reregistered for each new version of the file, that is, whenever a file is 
    written to a public server.  Unfortunately, the times that people want to annotate documents are precisely those times when the 
    document is being updated often, so many interests are registered repeatedly.  We rely on the garbage collector to get rid of 
    old interests.  An alternative would be to register a file without a version number, but that causes minor problems if voice is 
    deleted from the file but the file itself remains in existence.
4.4 Reliability
    The voice file server, voice manager, and voice control server were implemented so that they could run on separate physical 
    processors.  That is, they all communicate among themselves and with voice clients using RPC.  In practice, we run all three on 
    the same Dorado.  There is little to be gained by running them separately, since the voice file server cannot record or 
    play back voice files if the control server is down.  Similarly, the voice manager cannot record or play back voice ropes if the 
    voice file server is down.  For all practical purposes, voice also cannot be edited if the voice file server is down, because 
    users invariably need to listen to the voice passages that they are editing.  
    Thus, availability is not adversely affected by having the voice manager and file server colocated with the control server.  If 
    this server crashes or is otherwise unavailable, then no operations can be performed on stored voice.  For the most part, this 
    is simply an inconvenience to users in the same way that unavailability of conventional file servers is an inconvenience.  In 
    Cedar, the file servers containing the important system files, fonts, and documentation are replicated to improve their 
    availability.  We have not found it necessary to pay the cost to provide a highly-available voice file server.
    The one exception to this concerns voice interests.  It is often the case that clients wish to register or deregister interests in 
    voice ropes independently of playing the referenced voice.  For example, an interest of type "FileAnnotation" is registered 
    when a voice-annotated document is copied from a personal workstation to a public file server.  A user should not be prevented 
    from performing such a copy simply because the voice manager is unavailable.  We have also observed that the interests for 
    voice messages fail to get properly registered or deregistered if a person saves or deletes a voice message while the voice 
    server is down.  This has led us to contemplate writing a program that enumerates a person's mail database and checks that all 
    voice messages have properly registered interests.  The better solution is to make the voice interest database 
    highly-available.  Rather than fully replicating the database, we are planning to provide a mechanism whereby operations to 
    RETAIN or FORGET a voice interest are logged locally by a user's workstation if the voice server is unavailable; the operations 
    in this log will be retried when the workstation detects that the server is reachable.
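    A minimal sketch of this planned mechanism follows.  The InterestClient class, the ServerUnavailable exception, and the 
    FakeServer used to exercise it are all hypothetical.  The sketch keeps the log in memory; a real workstation would presumably 
    write it to local disk.  It also assumes that RETAIN and FORGET are safe to repeat, since an operation could be resent after 
    a crash.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Op = Tuple[str, str, str, str]   # (operation, vrID, class, interest)


class ServerUnavailable(Exception):
    """Raised by the (hypothetical) RPC stub when the server is down."""


class InterestClient:
    def __init__(self, send: Callable[[Op], None]) -> None:
        self._send = send
        self.pending: Deque[Op] = deque()   # local log of unsent operations

    def retain(self, vr_id: str, cls: str, interest: str) -> None:
        self._submit(("RETAIN", vr_id, cls, interest))

    def forget(self, vr_id: str, cls: str, interest: str) -> None:
        self._submit(("FORGET", vr_id, cls, interest))

    def _submit(self, op: Op) -> None:
        # Always log first, then try to drain, so ordering is preserved.
        self.pending.append(op)
        self.flush()

    def flush(self) -> bool:
        """Replay logged operations in order; return True if the log is empty."""
        while self.pending:
            try:
                self._send(self.pending[0])
            except ServerUnavailable:
                return False   # try again when the server is reachable
            self.pending.popleft()
        return True


class FakeServer:
    """Stand-in for the voice server, used only to exercise the sketch."""

    def __init__(self) -> None:
        self.up = False
        self.received: List[Op] = []

    def send(self, op: Op) -> None:
        if not self.up:
            raise ServerUnavailable()
        self.received.append(op)


server = FakeServer()
client = InterestClient(server.send)
client.retain("VR1", "FileAnnotation", "annotatedFile")   # server is down
client.forget("VR2", "Message", "msg-17")                 # still down
queued = len(client.pending)
server.up = True
drained = client.flush()   # the workstation detects that the server is back
print(queued, drained, server.received)
```

    With this arrangement, copying an annotated document or deleting a voice message never blocks on the voice server; the 
    interest database simply catches up later.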
4.5 Performance
    (Performance measurements of the system are not available for publication at this time, but we should be able to include some in 
    the final paper.  We are convinced that the time performance of both the voice file server and the voice rope facilities exceeds 
    the requirements of intended applications, so there will be little to learn there.  What we need to measure is the way people 
    use these facilities: how many edits they make, at how fine a grain, etc.  We also need to estimate and/or measure the cost of 
    database log compaction, once the system is in heavy use.  Space requirements for archiving actual digitized voice are also an 
    issue.)
4.6 Related work
    Several companies provide speech message systems that can be accessed from standard telephones; one of the earliest examples of 
    this type of system was IBM's experimental Speech Filing System, which was operational in 1975 [10].  Certainly the Etherphone 
    system's facilities can be accessed from telephones, but that was not the driving application.  We were interested in allowing 
    voice to be integrated easily into a user's existing means of digital communications, rather than forcing users to learn a 
    completely new system.  The Sydis Information Manager provides workstation control over the recording, editing, and playing of 
    voice as in the Etherphone system, but requires special workstations called VoiceStations [18].  Ruiz also developed a 
    prototype voice system that integrates voice and data into some simple workstation applications; however, he did not address 
    the important issues of sharing stored voice [20].
    Maxemchuk's speech storage system [15] provided many of the same facilities for recording, editing, and playing voice as our voice 
    file server.  (Actually, he provided much more control over the playback of voice than we do, such as the ability to vary 
    playback speeds or adjust silence intervals.)  Also, the division of function between a main computer and a storage computer is 
    quite similar to the separation between our voice manager and voice file server.  However, Maxemchuk's system edits voice using 
    divide and join operations that modify the control sectors of stored voice messages.  Our technique of building data structures 
    that reference voice files better supports sharing, by making voice ropes immutable, and simplifies the requirements placed on 
    the voice file server.  For instance, our techniques are very amenable to write-once storage technologies such as optical disks.
    Version Storage in the Swallow system [21] shares many characteristics with our voice manager.  That is, it manages immutable 
    objects of various sizes.  Also, its "structured version images" used for large objects are similar to the data structures used 
    by the voice file server to describe voice files.  However, unlike the voice manager, Swallow maintains histories to link 
    together objects that are derived from one another and provides atomic operations on multiple objects.  Also, it provides no 
    editing mechanisms or garbage collection, just read and write operations.
    The Diamond Document Store [26], like the Etherphone system, manages documents that contain various media elements by reference; it 
    also allows documents to be shared among users by reference.  A simple reference count scheme suffices for deallocating objects 
    that reside in the Document Store but are not referenced by any document or document folder since the Diamond system does not 
    allow documents stored outside the system to reference internally stored objects.  The Etherphone system, on the other hand, 
    strives to provide voice services that can be used along with other existing services, such as the Grapevine mail system [3] 
    and Alpine file servers [6].
    The Cambridge File Server [16] was perhaps the first network-accessible storage system to require clients to take an explicit 
    action to prevent files from being automatically garbage collected.  In particular, it deletes files that are not accessible 
    from server-maintained, but client-updated "indices".  Thus, these indices play much the same role as the voice manager's 
    interest database.
    Liskov and Ladin present an example of a distributed garbage collector [14].  Their approach requires all sites that store 
    references to other objects to run a garbage collector locally and send information about non-local references to a reference 
    server.  In some sense, their use of a reference server is similar to our use of registered interests, but much more limited.  
    One interesting contribution they make is how to build a highly-available reference server; we could use these techniques to 
    build an interest server.
5. Conclusions
    The facilities for managing stored voice in the Etherphone system were designed in adherence to the following principles:
• Permit sharing among various clients
    Maintaining voice on a publicly accessible server facilitates sharing.  Clients can freely share references to voice ropes without 
    incurring the overhead of transmitting the voice itself.  Because voice ropes are immutable, even though they are incorporated 
    into documents by reference, they exhibit copy semantics.
• Support easy editing of voice by programs
    The editing operations provided by the voice manager are similar to those in the Cedar Rope package.  This is intentional so that 
    programmers can manipulate voice in the ways to which they are accustomed for text.  The basic facilities to support editing 
    reside on a server; workstations are responsible for providing a user interface that is integrated with their programming 
    environment.
• Move voice data as little as possible
    Once recorded in the voice file server, voice is never copied until a workstation sends a playback request; at this point the voice 
    is transmitted directly to an Etherphone.  In particular, although workstations initiate most of the operations in the 
    Etherphone system, there is little reason for them to receive the actual voice data since they have no way of playing it.  
    Furthermore, to support efficient editing, we maintain a two level storage hierarchy: voice ropes refer to intervals of voice 
    files.  A given voice rope can consist of intervals from several voice files, and a given voice file can be used by several 
    voice ropes.  A database stores the many-to-many relationships that exist between voice ropes and files.  Editing operations 
    simply create new voice ropes from old ones and add them to the database.
• Allow diverse workstations to be integrated into the system
    All of the operations on stored voice are performed on a server, the voice manager.  Due to the heterogeneous nature of our 
    environment, providing a single implementation of these facilities on a server seemed better than requiring each different 
    workstation programming environment to provide its own implementation.  Moreover, the only requirements placed on a workstation 
    in order to make use of the voice services are that it have an associated Etherphone and an RPC implementation.  In particular, 
    workstations need not have hardware support for encryption or voice I/O.
• Do not restrict the uses of voice in client applications
    Voice management is provided by a server exporting an RPC interface.  The voice manager makes no assumptions about the way clients 
    make use of its services.  This particularly impacted the design of the voice garbage collector.  
• Provide a level of security at least as good as that of conventional file servers
    We use secure RPC for all control functions in the Etherphone system and DES encryption for transmitted voice.  Thus promiscuous 
    machines are prevented from listening to any communications in the system.  Storing the voice in its encrypted form protects 
    the voice on the server and also means that the voice need not be reencrypted on playback.  All in all, the voice system 
    actually provides better security than most file servers.
• Reclaim automatically the storage occupied by unneeded voice
    Garbage collection of voice ropes is done using a modified type of reference counting.  Clients register interests in particular 
    voice ropes.  These interests are grouped into classes and can be invalidated according to a class-specific algorithm.  For the 
    most part, users of voice applications are not aware of how or when interests are registered since it is handled transparently 
    by the application software.
    The Etherphone system has provided an environment in which to explore the management of voluminous, shared data among distributed 
    and heterogeneous workstation clients.  The techniques presented in this paper are applicable to and beneficial for the 
    management of various types of data including voice, video, images, and music.
Acknowledgments
    The design of voice ropes evolved over several years and many people contributed valuable suggestions.  Others also deserve credit 
    for the implementation of the voice file server and the voice editor.  (Names will be included in the final paper.)
References
[1]
S. Ades and D. C. Swinehart. 
Voice annotation and editing in a workstation environment, 
Proceedings AVIOS Voice Applications '86, September 1986, pages 13-28.
[2]
R. Bayer and E. McCreight.
Organization and maintenance of large ordered indexes.
Acta Informatica 1(3):173-189, 1972.
[3]
A. Birrell, R. Levin, R. M.  Needham, and M. D. Schroeder.
Grapevine: An exercise in distributed computing.
Communications of the ACM 25(4):260-274, April 1982.
[4]
A. D. Birrell and B. J. Nelson.
Implementing remote procedure calls.
ACM Transactions on Computer Systems 2(1):39-59, February 1984.
[5]
A. D. Birrell.
Secure communication using remote procedure calls.
ACM Transactions on Computer Systems 3(1):1-14, February 1985.
[6]
M. R. Brown, K. Kolling, and E. A. Taft.
The Alpine File System.
ACM Transactions on Computer Systems 3(4):261-293, November 1985.
[7]
R. G. G. Cattell.
Design and implementation of a relationship-entity-datum data model.
Xerox Palo Alto Research Center, Technical Report CSL-83-4, May 1983.
[8]
D. D. Clark.
The structuring of systems using upcalls.
Proceedings Tenth Symposium on Operating Systems Principles, Orcas Island, Washington, December 1985, pages 171-180.
[9]
J. Donahue and W. Orr.
Walnut: Storing electronic mail in a database.
Xerox Palo Alto Research Center, Technical Report CSL-85-9, November 1985. 
[10]
J. D. Gould and S. J. Boies.
Speech filingAn office system for principles.
IBM Systems Journal 23(1): 65-81, January 1984.
[11]
J. N. Gray.
Notes on database operating systems.
In Bayer et al., Operating Systems: An Advanced Course, Springer-Verlag, 1978, pages 393-481.
[12]
B. W. Lampson and K. A. Pier.
A processor for a high-performance personal computer.
Proceedings 7th Symposium on Computer Architecture, La Baule, May 1980, pages 146-160.
[13]
B. W. Lampson.
Hints for computer system design.
Proceedings Ninth Symposium on Operating Systems Principles, Bretton Woods, New Hampshire, October 1983, pages 33-48. 
[14]
B. Liskov and R. Ladin.
Highly-available distributed services and fault-tolerant distributed garbage collection.
Proceedings of Symposium on Principles of Distributed Computing, Calgary, Alberta, Canada, August 1986, pages 29-39.
[15]            
N. Maxemchuk. 
An experimental speech storage and editing facility. 
Bell System Technical Journal 59(8): 1383-1395, October 1980.
[16]
J. G. Mitchell and J. Dion.
A comparison of two network-based file servers.
Communications of the ACM 25(4):233-245, April 1982. 
[17]
National Bureau of Standards.
Data Encryption Standard.
Federal Information Processing Standard (FIPS) Publication 46, U. S. Department of Commerce, January 1977.
[18]                
R. Nicholson. 
Integrating voice in the office world. 
BYTE 8(12):177-184, December 1983.
[19]
J. K. Reynolds, J. B. Postel, A. R. Katz, G. G. Finn, and A. L. DeSchon.
The DARPA experimental multimedia mail system.
Computer 18(10):82-89, October 1985.
[20]                    
A. Ruiz. 
Voice and telephony applications for the office workstation. 
Proceedings 1st International Conference on Computer Workstations, San Jose, CA, November 1985, pages 158-163.
[21]
L. Svobodova.
A reliable object-oriented data repository for a distributed computer system.
Proceedings Eighth Symposium on Operating Systems Principles, Pacific Grove, California, December 1981, pages 47-58. 
[22]
L. Svobodova.
File servers for network-based distributed systems.
ACM Computing Surveys 16(4):353-398, December 1984.
[23]
D. C. Swinehart, L. C. Stewart, and S. M. Ornstein.
Adding voice to an office computer network.
Proceedings IEEE GlobeCom '83, November 1983.
Also available as Xerox Palo Alto Research Center, Technical Report CSL-83-8, February 1984.
[24]        
D. C. Swinehart, P. T. Zellweger, R. J. Beach, and R. B. Hagmann. 
A structural view of the Cedar programming environment. 
ACM Transactions on Programming Languages and Systems 8(4):419-490, October 1986.
[25]
D. C. Swinehart, D. B. Terry, and P. T. Zellweger.
An experimental environment for voice system development.
IEEE Office Knowledge Engineering Newsletter, February 1987.
[26]
R. H. Thomas, H. C. Forsdick, T. R. Crowley, R. W. Schaaf, R. S. Tomlinson, V. M. Travers, and G. G. Robertson.
Diamond: A multimedia message system built on a distributed architecture.
Computer 18(12):65-78, December 1985.
[27]
N. Yankelovich, N. Meyrowitz, and A. van Dam.
Reading and writing the electronic book.
Computer 18(10):15-30, October 1985.