Page Numbers: Yes X: 527 Y: 10.5" First Page: 21
Part II: Some Aspects of Grapevine in Detail
6. Updating the Registration Data Base
The choice of methods for managing the distributed registration data base was largely determined by the requirement that Grapevine provide highly available, decentralized administrative functions. Administrative functions are performed by changing the registration data base. Replication of this data base makes high availability of administrative functions possible. An inappropriate choice of the method for ensuring the consistency of copies of the data, however, might limit this potential high availability. In particular, if we demanded that data base updates be atomic across all servers, then most servers would have to be accessible before any update could be started. For Grapevine, the nature of the services dependent on the registration data allows a definition of consistency that results in higher availability of the update function. Grapevine guarantees only that the multiple copies of a registration data base entry eventually will have the same new value following an update to one of them. If all servers containing copies are up and can communicate with one another, convergence will occur within a few minutes at most. While an update is converging, clients may detect inconsistency by reading the value of an entry from several servers.
6.1 Representation
The value for each entry in the registration data base is represented mainly as a collection of lists. The membership set of a group is one such list. Each list is represented as two sublists of items, called the active sublist and the deleted sublist. An item consists of a string and a timestamp. A particular string can appear only once in a list, either in the active or the deleted sublist. A timestamp is a unique identifier whose most significant bits are a time and least significant bits an internet address. The time is that perceived by the server that placed the item in the list; the address is that server’s. As long as a particular server never includes the same time in two different timestamps, all timestamps from all servers are totally ordered.
For example, Figure 3 presents the complete entry for a group named "LaurelImp↑.pa" from the registration data base as it appeared in early April 1981. There are three such lists in this entry; the membership set labelled members and two access control lists labelled owners and friends (see section 6.4 for the semantics of these). There are four current members followed by the corresponding five timestamps, and one deleted member followed by the corresponding timestamp. The owners and friends lists each contain one name and there are no deletions recorded from either.
A registration data base entry also contains a version timestamp. This timestamp, which has the same form as an item timestamp, functions as a version number for the entry. Whenever anything in an entry changes the version timestamp increases in value. (Usually it is the maximum of the other timestamps in the entry.) When interrogating the data base, a client can compare the version timestamp on which it based some cached information with that in the data base. If the cached timestamp is still current then the client is saved the expense of obtaining the data base value again and recomputing the cached information. The version timestamp appears in the prefix line in Figure 3.
Prefix: [3#14, 1-Apr-81 12:46:45], type = group, LaurelImp↑.pa
Remark: (stamp=[3#22, 22-Aug-80 23:42:14]) Laurel Team
Members: Birrell.pa, Brotz.pa, Levin.pa, Schroeder.pa
Stamps: [3#22, 23-Aug-80 17:27:45], [3#22, 23-Aug-80 17:42:35], [3#22, 23-Aug-80 19:31:01], [3#22, 23-Aug-80 20:50:23]
DelMembers: Butterfield.pa
Stamps: [3#14, 25-Mar-81 14:15:12]
Owners: Brotz.pa
Stamps: [3#14, 22-Aug-80 23:43:09]
DelOwners: none
Stamps: null
Friends: LaurelImp↑.pa
Stamps: [3#14, 1-Apr-81 12:46:45]
DelFriends: none
Stamps: null
Figure 3: A group from the registration data base
6.2 Primitive Operations
We use two primitive operations on the lists in a registration data base entry. An update operation can insert or delete an item from a list. To add/delete the string s to/from a list, any item with the matching string in either of the sublists first is removed. Then a timestamp t is produced from the server’s internet address and clock. Finally the item (s,t) is added to the active/deleted sublist. A merge operation combines two versions of a complete list to produce a new list with the most recent information from both. Each string that appears in either version will appear precisely once in the result. Each string will be in the active or deleted sublist of the merge according to the largest timestamp value associated with that string in either version. That largest timestamp value also provides the timestamp for the string in the result. Keeping the sublists sorted by string value greatly increases the speed with which the merge can be performed. The update and merge operations are atomic in each particular server.
6.3 Propagation
The administrative interface to Grapevine is provided by client software running in an administrator’s computer. To make a change to the data of any registry, a client machine connects to some registration server that knows about that registry. (The GrapevineUser package resource location facilities are used to find an appropriate server.) That registration server performs an update operation on the local copy of the entry. Once this update has been completed the client can go about its other business. The server propagates the change to the copies of the entry in other servers. The means used to propagate changes is Grapevine’s delivery service itself, since it gives a guarantee of delivery and provides buffering when other servers are temporarily inaccessible. As described in section 5.1, the members of the group that represent the registry are the registration servers that contain a copy of the data for that registry. Thus, if the change is to an entry in the reg registry, the accepting server sends a change message to the members, other than itself, of the distribution list reg.gv. A change message contains the name of the affected entry and the entire new value for the entry. Registration servers poll their inboxes for new messages every 30 seconds. When a change message is received by a server it uses merge operations to combine the entry from the change message with its own copy.
With this propagation algorithm, the same final state eventually prevails everywhere. When a client makes multiple updates to an entry at the same server, the same sequence of entry values will occur everywhere, even if the resulting change messages are processed in different orders by different servers. If two administrators perform conflicting updates to the data base (such as adding and removing the same member of a group), initiating the updates at different servers at nearly the same time, it is hard to predict which one of them will prevail; this appears to be acceptable, since the administrators presumably are not communicating with each other outside the system. Also, since copies will be out of step until the change messages are received and acted upon, clients must be prepared to cope with transient inconsistencies. The algorithms used by clients have to be convergent in the sense that an acceptable result will eventually ensue even if different and inconsistent versions of the registration data appear at various stages in a computation. The message delivery algorithms have this property.
Similar update propagation techniques have been proposed by others who have encountered situations that do not demand instantaneous consistency [BBN, SDD-1].
If deleted items are never removed from an entry, continued updates may cause the data base to grow. Deleted items are kept in an entry so that out-of-order arrival of change messages involving addition followed by deletion of the same string will not cause the wrong final state. Deleted items also provide a record of recent events for use by human administrators. We declare an upper bound on the clock asynchrony among the registration servers, on message delivery delay, and on administrative hindsight. The Grapevine servers each scan their local data base once a day during inactive periods and purge all deleted items older than the bound. We use a bound of 14 days.
If a change message gets destroyed because of a software bug or equipment failure, there is a danger that a permanent inconsistency will result. Since a few destroyed messages over the life of the system are inevitable, we must provide some way to resynchronize the data base. At one point we dealt with this problem by detecting during the merge operation whether the local copy of the entry contained information that was missing from the incoming copy. Missing information caused the server to send the result of the merge in a change message to all servers for the registry. While this "anti-entropy" mechanism tended to push the data base back into a consistent state, the effect was too haphazard to be useful; errors were not corrected until the next change to an entry. Our present plan for handling long-term inconsistencies is for each registration server periodically (say once a night) to compare its copy of the data base for a registry with another and to use merges to resolve any inconsistencies that are discovered. The version timestamp in each entry makes this comparison efficient. If two version timestamps are equal then the entries match. Care must be taken that the comparisons connect all registration servers for a registry, or else disconnected regions of inconsistency can survive.
6.4 Creating and Deleting Names
The rule that the latest timestamp wins does not deal adequately with the creation of new names. If two administrators connect to two different registration servers at about the same time and try to create a new data base entry with the same name, it is likely that both will succeed. When this data base change propagates, the entry with the latest time timestamp will prevail. The losing administrator may be very surprised, if he ever finds out. Because the later creation could be trapped in a crashed registration server for some time, an administrator could never be sure that his creation had won. For name creation we want the earlier creation to prevail, in contrast with other updates, such as changing the membership set of a group, where the later update should prevail. We faced the possibility of having to implement one of the known and substantial algorithms for atomic updates to replicated databases [Gifford], which seemed excessive, or of working out a way to make all names unique by appending a hidden timestamp, which seemed complex. We instead fell back on observations about the way in which systems of this nature are used. For each registry there is usually some human-level centralization of name creation, if only to deal with questions of suitability of RNames (not having a junior clerk preempt the RName which everyone would associate with the company president). We consider this centralization enough to solve the problem. Note that there is no requirement that a particular server be used for name creation: there is no centralization at the machine level.
Deleting names is straightforward. A deleted entry is marked as such and retained in the data base with a version timestamp. Further updates to a deleted entry are not allowed. Recreation of a deleted entry is not allowed. Sufficiently old deleted entries are removed from the data base by the purging process described in section 6.3.
6.5. Access Controls
An important aspect of system administration is control of who can make which administrative changes. To address this need we associate two access control lists with each group: the owners list and the friends list. (These lists appear in the example entry in Figure 3.) The interpretation of these access lists is the responsibility of the registration server; for example, for ordinary groups the conventions are as follows. Membership in the owners list confers permission to add or remove any group member, owner, or friend; membership in the friends list confers permission to add or remove oneself. The names in the owners and friends lists may themselves be the names of groups. Quite separately, clients of the registration server have freedom to use membership in groups for access control purposes about which the registration server itself knows nothing at all. The owners and friends lists on the groups that represent registries are used to control name creation and deletion within registries; these lists also provide the default access controls on groups whose owners list is empty. While we have spent some time adjusting the specific semantics of the Grapevine access controls, we do not present further details here.
6.6 Other Consequences of Changes
The registration servers and message servers are normal clients of one another’s services, with no special relationship. Registration servers use message server delivery functions and message servers use the registration service to authenticate clients, locate inboxes, etc. This view, however, is not quite complete. If a change is made to the inbox locations of any individual, notice has to be given to all message servers that are removed, so they can redeliver any messages for that individual buffered in local inboxes. To achieve this effect, the registration server delivers a message to the message servers in question informing them of the change. Correctness requires only that the last registration server that changes its copy of the entry emit the message; we achieve this effect by having each registration server emit such a message as the change is made. A message server receiving an inbox removal message simply redelivers all messages in the affected inbox. Redelivery is sufficient to rebuffer the messages in the proper server. In the system as implemented a simplification is made; inbox removal messages are sent to all inbox sites for the affected individual. While this may appear to be wasteful, it is most unusual for any site other than the primary one to have anything to redeliver.
Other registration service clients that use the registration data base to control resource bindings may also desire notification of changes to certain entries. A general notification facility would require associating notification lists with data base entries. Any change to an entry would result in a message being sent to the RNames on its notification list. We have not provided this general facility in the present implementation, but would do so if the data base were being reformatted.
7. Finding an Inbox Site
The structure and distribution of the Grapevine registration data base are quite complex, with many indirections. Algorithms for performing actions based on this data base should execute reliably in the face of administrative changes to the registration data base (including those which cause dynamic reconfiguration of the system) and multiple servers that can crash independently. In their full generality such algorithms are expensive to execute. To counter this, we have adopted a technique of using caches and hints to optimize these algorithms. By cache we mean a record of the parameters and results of previous calculations. A cache is useful if accessing it is much faster than repeating the calculation and frequently produces the required value. By hint we mean a value that is highly likely to be correct and that is easier to check than to recalculate. To illustrate how caches and hints can work, we describe here in some detail how the message server caches hints about individuals’ inbox sites.
The key step in the delivery process is mapping the name of an individual receiving a message to the preferred inbox site. The mapping depends upon the current state of the registration data base and the current availability of particular message servers.
To make this mapping process as efficient as possible, each message server maintains an inbox site cache that maps RNames of individuals to a hint for the currently preferred inbox site. Each message server also maintains a down server list containing the names of message servers that it believes to be inaccessible at present. A message server is placed on this list when it does not accept connections or fails during a connection. The rules for using the inbox site cache to determine the preferred message server for a recipient I are:
1.If an entry for I is in the cache and the site indicated for I in the cache is not on down server list, then use that site;
2.Otherwise get the inbox site list for I from the registration service; cache and return for use the first site not on the down server list; if the selected site is not first on the list, mark the entry as "secondary".
There has to be a rule for removing message servers from the down server list: this happens when the server shows signs of life by responding to a periodic single packet poll.
When a message server is removed from the down server list, the inbox site cache must be brought up to date. Any entry which is marked as "secondary" and which is not the revived site could be there as a substitute for the revived site; all such entries are removed from the cache. This heuristic removes from the cache a superset of the entries whose preferred inbox site has changed (but not all entries in the cache) and will cause recalculation of the preferred inbox site for those entries the next time they are needed.
We noted earlier that changing an individual’s inbox site list may require a message server to redeliver all messages in that individual’s inbox, and that this redelivery is triggered by a messages from registration servers to the affected message servers. The same changes also can cause site caches to become out-of-date. Part of this problem is solved by having the inbox redelivery messages also trigger appropriate site cache flushing in the servers that had an affected inbox. Unfortunately any message server potentially has a site cache entry made out-of-date by the change. Instead of sending a message to all message servers, we correct the remaining obsolete caches by providing feedback from one message server to another when incorrect forwarding occurs as a result of an out-of-date cache. Thus, the site cache really does contain hints.
To summarize the inbox location algorithm, then, registration servers remove servers from an inbox site list and send messages to all servers originally on the list. Each responds by removing any entry for the subject individual from its site cache and redelivering any messages found in that individual’s inbox. During this redelivery process, the cache entry will naturally be refreshed. Other message servers with out-of-date caches may continue to forward messages here for the subject individual. Upon receiving any message forwarded from another server, then, the target message server repeats the inbox site mapping for each name in the steering list. If the preferred site is indeed this target message server, then the message is added to the corresponding inbox. If not, then the target site does the following:
1.Forwards the message according to the new mapping result;
2.Sends a cache flush notification for the subject individual back to the server that incorrectly forwarded the message here.
The cache flush notification is a single packet sent unreliably: if it fails to arrive, another one will be provoked in due course. This strategy results in the minimum of cache flush notifications being sent - one to each message server whose cache actually needs attention, sent when the need for attention has become obvious. This mechanism is more economical than the alternative of sending cache flush notifications to all message servers, and even if that were done it would still be necessary to cope with the arrival of messages at old inbox sites.
8. System Configuration
As described in section 5, the configuration of the Grapevine system is controlled by its registration data base. Various entries in the data base define the computers available to Grapevine and the ways in which the data and functions of Grapevine are distributed among them. We now consider procedures for reconfiguring Grapevine.
8.1. Adding and Deleting Registry Replicas
The set of registration servers that contain some registry is defined by the membership set for the corresponding group in the GV registry. When a change occurs, the affected server(s) need to acquire or discard a copy of the registry data. To discover such changes, the registration servers simply monitor all change messages for groups in the GV registry, watching for additions or deletions of their own name. A registration server responds to being deleted by discarding the local replica of the registry. With the present implementation, a registration server ignores being added to a registry site list. Responding to a registry addition in the obvious way - by connecting to another registration server for the registry and retrieving the registry data - is not sufficient. Synchronization problems arise that can lead to the failure to send change messages to the added server. Solving these problems may require the use of global locks, but we would prefer a solution more compatible with the looser synchronization philosophy of Grapevine. For the present adding a registry replica is triggered manually, after waiting for the updates to the GV registry to propagate and after ensuring that other such reconfigurations are not in progress.
8.2. Creating Servers
Installing a new Grapevine computer requires creating a new registration server and a new message server. To create the new registration server named, say, Cabernet.gv, a system administrator first creates that individual (with password) in the registration data base, and gives it a connect site that is the internet address of the new computer. Next, Cabernet.gv is added to the membership set of all registries that are to be recorded in this new registration server. To create the new mail server named, say, Cabernet.ms, the administrator creates that individual with the same connect site, then adds Cabernet.ms to MailDrop.ms. Both servers are assigned inbox sites.
Once the data base changes have been made, the registration and message servers are started on the new computer. The first task for each is to determine its own name and password so that it may authenticate itself to the other Grapevine servers; the registration server must also obtain copies of the data base entries for those registries which it will support. A server obtains its name by noting its own internet address, which is always available to a machine, then consulting the data base in a different registration server to determine which server is specified to be at that address: the registration server looks for a name in the group gv.gv, the message server looks for a name in the group MailDrop.ms. Having found its name, the server asks a human operator to type its password; the operator being able to do this correctly is the fundamental source of the server’s authority. The server verifies its password by the authentication protocol, again using a registration server that is already in operation, and then records its name and password on its own disk. The new registration server then consults some other registration server to obtain the contents of the GV registry in order to determine which groups in the GV registry contain its name: these specify which registries the new server should contain. It then contacts appropriate other servers to obtain copies of the data base for these registries. Because the new server can authenticate itself as an individual in the GV registry, other registration servers are willing to give it entire database entries, including individuals’ passwords.
Obtaining the registry replicas for the new registration server suffers from the same synchronization problems as adding a registry replica to an existing server. We solve them the same way, by waiting for the administrative updates to the GV registry to propagate before starting the new computer and avoiding other simultaneous reconfigurations.
8.3. Stopping and Restarting Servers
Stopping a server is very easy. Grapevine computers can be stopped without disturbing any disk write in progress. The message and registration server are programmed so that, when interrupted between disk page writes, they can be restarted without losing any permanent information. While a message or registration server is not running, messages for it accumulate in its inboxes in message servers elsewhere, to be read after it restarts.
Whenever a message and registration server restart, each verifies its name and password by consulting other servers, and verifies that its internet address corresponds to the connect site recorded for it in the data base; if necessary it changes the connect site recorded in the data base. Updating the connect site allows a server to be moved to a new machine just by moving the contents of the disk. After restarting, a registration server acts on all accumulated data base change messages before declaring itself open for business.
Using the internet, it is possible, subject to suitable access controls, to load a new software version into a remote running Grapevine computer, stop it, and restart it with the new version.
8.4. Other Reconfigurations
One form of reconfiguration of the system requires great care: changing the location of inbox sites for a registration server. Unless special precautions are taken, the registration server may never encounter the change message telling it about a new inbox site, because that message is waiting for it at the new site. A similar problem arises when we change the internet address of a mail server that contains a registration server’s inbox. Restrictions on where such data base changes can be initiated appear to be sufficient to solve these problems, but we have not automated them. Although this resolution of this problem is somewhat inelegant, the problem is not common enough to justify special mechanisms.
9. Proper Authentication Mechanisms
So far, we have said little about authentication and security, which we regard as important aspects of a message delivery and registration service. Earlier we described the weak password-based authentication mechanism that is actually implemented. Here we present the design for a more satisfactory authentication mechanism.
Guarantees of authenticity and security in a distributed environment require the use of encryption-based protocols. An earlier paper [Needham & Schroeder] describes one such set of protocols. We shall not reproduce here the detailed descriptions of these protocols, nor the discussions of the hazards which these protocols guard against. Instead we show how a selection of these protocols may be applied to Grapevine, and how Grapevine’s registration servers can function as the authentication servers described there. The protocols are based on "secret key" encryption, such as the federal Data Encryption Standard [DES].
9.1. Security Goals
First we must consider what assurances we want the protocols to provide. One is a requirement for authenticity of individuals. It should not be possible to send messages under the name of another and it should not be possible (without authorization) to read another’s incoming messages. The present Grapevine system makes a rather feeble attempt at this type of authentication; its major failings are that passwords are transmitted over the internet as clear-text and that clients of the authentication service see individuals’ passwords. It also does not provide two-way authentication: clients cannot authenticate servers nor servers each other.
Separate from these questions of authenticity is one of security. There should be some reasonable assurance that messages do not get read or altered in transit, or delivered to places other than those intended.
9.2. Protocols
We assume that physical security makes the registration server disks safe repositories for secret keys. The authenticator recorded in the registration data base for each individual is used as that individual’s secret key. Thus all assurances are derived from the knowledge that the registration server knows an individual’s secret key. The two-way authentication assurance is derived from the fact that the registration service knows the secret keys of both individuals, although it is not necessary for a single registration server to know both. We take advantage of the fact that the GV registry is known to every registration server: this allows registration servers to communicate securely with each other, since they each know every registration server’s secret key. Our other assumptions are as given in the earlier paper.
On these assumptions it is possible to proceed in a mutually authenticated manner by use of the following protocol, which is equivalent to that described in sections four and five of [Needham & Schroeder]. Suppose a client A (with secret key KA) is trying to conduct mutual authentication with a server B (with secret key KB). RA and RB are the relevant registration servers, with their secret keys KRA and KRB. We use the notation {V}K to mean the value of V encrypted with the key K, with suitable care being taken to seriate separate blocks of V. I1, I2, and I3 are nonce identifiers (that is, they are used only for this one conversation). CK is a conversation key chosen, in a non-predictable fashion, for this conversation. The required protocol is as follows:
1.A->RA:A, B, I1
2.RA->RB:{ CK, A, B, I1 }KRB
3.RB->RA:{ {CK,A}KB, I1, A }KRA
4.RA->A:{ {CK,A}KB, I1, B, CK }KA
5.A->B:{CK,A}KB, {I2}CK
6.B->A:{ I2-1, I3 }CK
7.A->B:{ I3-1 }CK
At the conclusion of the protocol A and B can have reasonable confidence in each other’s authenticity, and they share a secret key, CK which has never been transmitted in the clear; they may use CK for encrypting their subsequent communication. It is possible to shortcut the protocol a little by having RA maintain a cache of items such as that transmitted at step 4, for direct issue to A on receipt of the request at step 1, or by A caching the data received in step 4. If RA and RB are the same machine, then steps 2 and 3 can be omitted.
In the most general case, a client wishes to contact some service whose name is known, but the client does not a priori know the names of the servers providing this service. The client can obtain assurance of the authenticity of a server as an instance of a service by using secure communication when talking to the registration servers to obtain the names of servers providing the service, then using the above protocol.
It is also possible to provide end-to-end authentication and security for transmitted mail based on the authenticator {CK,A}KB, as described in section seven of [Needham & Schroeder].
Part III: Conclusions
10.Present State
The Grapevine system was first made available to a limited number of clients during 1980. At present, Fall 1981, it is responsible for most of the mail traffic and distribution lists on the Xerox research internet. There are five dedicated Grapevine computers, each containing a registration server and a mail server. The computers are physically distributed among northern and southern California and New York. We plan to add a Grapevine computer in Texas and one in England in the near future. The registration data base contains about 1500 individuals and 500 groups, divided mainly into four major registries; there are two other registries used by non-mail clients of the registration service, plus the GV and MS registries. The total mail traffic amounts to some 2500 messages each working day, with an average of 9 recipients each; the messages average about 500 characters, and are almost exclusively text. The registration data base also is used for authentication and configuration of various file servers, for authentication and access control in connection with maintenance of the basic software and data bases that support our internet gateways, and for resource location associated with remote procedure call binding. The registration data base is administered almost exclusively by non-technical staff. There are at least three separate computer mail interface programs in use for human-readable mail.
The Grapevine registration and message servers are programmed in Mesa [Mesa]. They contain some 33,000 lines of custom written code, together with standard packages for runtime support and PUP-level communications. The Grapevine computers are Altos [Alto] with 256K bytes of main memory and 5M bytes of disk storage. A running Grapevine computer has between 40 and 70 Mesa processes [Lampson & Redell], and handles 12 simultaneous connections. The peak load of messages handled by a single message server exceeds 150 per hour and 1000 messages per day. The maximum number of primary inboxes that have been assigned to a server is 340.
11.Discussion
The fundamental design decision to use a distributed data base as the basis for Grapevine’s message delivery services has worked out well. The distributed data base allowed us to meet the design goals specified in section 2, and has not generated operational difficulties. The distributed update algorithms that trade atomic update for increased availability have had the desired effect. The temporary inconsistencies do not bother the users or administrators and the ability to continue data base changes while the internet is partitioned by failed long-distance links is exercised enough to be appreciated.
In retrospect, our particular implementation of the data base for Grapevine was too inflexible. As the use of the system grew, the need for various extensions to the values recorded in individual and group entries has become apparent. Reformatting the existing distributed data base to include space for the new values is difficult operationally. In a new implementation we would consider providing facilities for dynamic extension of the value set in each entry. With value set extension, however, we would keep the present update algorithm and its loose consistency guarantees. These guarantees are sufficient for Grapevine’s functional domain, and their simplicity and efficiency are compelling. There is a requirement in a message system for some data base which allows more flexible descriptions of recipients or distribution lists to be mapped onto message system RNames (such as the white or yellow page services of the telephone system), but in our view that service falls outside of Grapevine’s domain. A system which provides more flexibility in this direction is described in [COCOS].
Providing all naming semantics by indirection through the registration data base has been very powerful. It has allowed us to separate the concept of naming a recipient from that of addressing the recipient. For example, the fact that a recipient is named Birrell.pa says nothing about where his mail should be sent. This is in contrast to many previous message systems. Indirections also provide us with flexibility in configuring the system.
One feature which recurs in descriptions of Grapevine is the concept of a "group" as a generalization of a distribution list. Our experience with use of the system confirms the utility of use of the single "group" mechanism for distribution lists, access control lists, services, and administrative purposes.
Clients other than computer mail interfaces are beginning to use Grapevine’s naming, authentication, and resource location facilities. Their experience suggests that these are an important set of primitives to provide in an internet for constructing other distributed applications. Message transport as a communication protocol for data other than textual messages is a useful addition to our set of communication protocols. The firm separation between Grapevine and its clients was a good decision; it allows us to serve a wide variety of clients and to give useful guarantees to our clients, even if the clients operate in different languages and in different computing environments.
At several points in Grapevine, we have defined and implemented mechanisms of substantial versatility. As a consequence, the algorithms to implement these mechanism in their full generality are expensive. The techniques of caches and hints are powerful tools that allow us to regain acceptable efficiency without sacrificing "correct" structure. The technique of adding caches and hints to a general mechanism is preferable to the alternative style of using special case short cut mechanisms whose existence complicates algorithmic invariants.
Our experience with Grapevine has reinforced our belief in the value of producing "real" implementations of systems to test ideas. At several points in the implementation, reality forced us to rethink initial design proposals: for example, the arrangements to ensure long-term consistency of the data base in the presence of lost messages. There is no alternative to a substantial user community when investigating how the design performs under heavy load and incremental expansion.
Grapevine was built partly to demonstrate the assertion that a properly designed replicated system can provide a very robust service. The chance of all replicas being unavailable at the same time seems low. Our experience suggests that unavailability due to hardware failure follows this pattern. No more than one Grapevine computer at a time has ever been down because of a hardware problem. On the other hand, some software bugs do not exhibit this independence. Generally all servers are running the same software version. If a client’s action provokes a bug that causes a particular server to fail, then in taking advantage of the service replication that client may cause many servers to fail. A client once provoked a protocol bug when attempting to present a message for delivery. By systematically trying again at each server in MailDrop.ms, that client soon crashed all the Grapevine computers. Another widespread failure occurred as a result of a malformed registration data base update propagating to all servers for a particular registry. We conclude that it is hard to design a replicated system that is immune from such coordinated software unreliability.
12.Acknowledgements
Many people have contributed to the success of the Grapevine project. Bob Taylor and Bob Metcalfe recognized early the need for work on computer mail systems and encouraged us to develop Grapevine. Ben Wegbreit participated in the initial system design effort. Many colleagues have helped the project in various ways: Dave Boggs, Doug Brotz, Jeremy Dion, Jim Horning, Robert Kierr, and Ed Taft deserve special mention. Jerry Saltzer and several anonymous referees have made valuable commentaries on earlier drafts of the paper.
13.References
[Alto]
C. P. Thacker, E. M. McCreight, B. W. Lampson, R. F. Sproull and D. R. Boggs, "Alto: A Personal Computer", in Siewiorek, Bell & Newell, Computer Structures: Readings and Examples, 2nd edition, McGraw-Hill, 1981.
[BBN]
R. H. Thomas, "A Solution to the Update Problem for Multiple Copy Data Base which Used Distributed Control", Bolt, Beranek and Newman technical report #3340, July 1976.
[COCOS]
N. Dawes, S. Harris, M. Magoon, S. Maveety and D. Petty, "The Design and Service Impact of COCOS - An Electronic Office System", in Proc. IFIP International Symposium on Computer Message Systems, (April 1981).
[DES]
National Bureau of Standards, "Data Encryption Standard", Federal Information Processing Standards 46, January 1977.
[Ethernet]
R. M. Metcalfe and D. R. Boggs, "Ethernet: Distributed Packet Switching for Local Computer Networks", Comm. ACM 19, 7 (July 1976), 395-404.
[Gifford]
D. K. Gifford, "Weighted Voting for Replicated Data", in Proc. 7th Symposium on Operating Systems Principles, (December 1979).
[Internet]
J. F. Shoch, "Internetwork Naming, Addressing and Routing", in Proc. 17th IEEE Computer Society International Conference, September 1978.
[Lampson & Redell]
B. W. Lampson and D. D. Redell, "Experience with Processes and Monitors in Mesa", Comm. ACM 23, 2 (Feb 1980), 105-117.
[Levin & Schroeder]
R. Levin and M. D. Schroeder, "Transport of Electronic Messages through a Network", Technical Report CSL-79-4, Xerox Palo Alto Research Center, 1979.
[Mesa]
J. G. Mitchell, W. Maybury and R. Sweet, "Mesa Language Manual (Version 5.0)", Technical Report CSL-79-3, Xerox Palo Alto Research Center, 1979.
[Needham & Schroeder]
R. M. Needham and M. D. Schroeder, "Using Encryption for Authentication in Large Networks of Computers", Comm ACM 21, 12 (December 1978), 993-999.
[PUP]
D. R. Boggs, J. F. Shoch, E. A. Taft and R. M. Metcalfe, "PUP: an Internetwork Architecture", IEEE Trans. on Communications 28, 4 (April 1980), 612-634.
[SDD-1]
J. B. Rothnie, N. Goodman and P. A. Bernstein, "The Redundant Update Methodology of SDD-1: A System for Distributed Databases (The Fully Redundant Case)", Computer Corporation of America, June 1977.