\boldhead{2.1 Dormant Death Certificates} There is a distributed way of extending the time threshold much further back than the space on any one server would permit. This scheme, which we call {\it dormant death certificates}, is based on the following observation: if a death certificate is older than the expected time required to propagate it to all sites, then the existence of an obsolete copy of the corresponding data item anywhere in the network is unlikely. We can therefore delete very old death certificates at most sites, retaining ``dormant'' copies at only a few sites. When an obsolete update encounters a dormant death certificate, the death certificate is ``awakened'' and propagated again to all sites. This operation is expensive, but it occurs infrequently. In this way we ensure that if a death certificate is present at {\it any} site in the network, resurrection of the associated data item will not persist for any appreciable time. Note the analogy to an immune reaction, with the awakened death certificates behaving like antibodies.

The implementation uses two thresholds, $\tau_1$ and $\tau_2$, and attaches a list of $r$ retention site names to each death certificate. When a death certificate is created, its retention sites are chosen at random, and it is propagated by the same mechanism used for ordinary updates. Each server retains all death certificates timestamped within $\tau_1$ of the current time. Once $\tau_1$ is reached, most servers delete the death certificate, but every server on the death certificate's retention site list saves a dormant copy. Dormant death certificates are discarded when $\tau_1 + \tau_2$ is reached. (For simplicity we ignore the differences between sites' local clocks. It is realistic to assume that the clock synchronization error is at most $\epsilon \ll \tau_1$, which has no significant effect on the arguments.) Dormant death certificates will occasionally be lost due to permanent server failures.
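The two-threshold retention rule can be sketched as follows. This is an illustrative sketch, not the original implementation; the names (\verb|make_certificate|, \verb|should_retain|, \verb|tau1|, \verb|tau2|) are our own, and certificates are modeled as plain dictionaries.

```python
import random

def make_certificate(item_id, timestamp, servers, r, rng=random):
    """Create a death certificate, choosing r retention sites at random.

    Illustrative structure only; the real system also carries update
    metadata that is omitted here.
    """
    return {
        "item": item_id,
        "timestamp": timestamp,
        "retention_sites": set(rng.sample(servers, r)),
    }

def should_retain(cert, server, now, tau1, tau2):
    """Decide whether `server` keeps a copy of `cert` at time `now`."""
    age = now - cert["timestamp"]
    if age < tau1:
        # Within tau1 of creation: every server retains the certificate.
        return True
    if age < tau1 + tau2:
        # Between tau1 and tau1 + tau2: only the r retention sites
        # keep a dormant copy.
        return server in cert["retention_sites"]
    # After tau1 + tau2 the certificate is discarded everywhere.
    return False
```

For example, with $\tau_1 = 30$ and $\tau_2 = 300$ (in arbitrary time units), every server retains a certificate of age 10, only its retention sites retain it at age 100, and no server retains it at age 400.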
For example, after one server half-life the probability that all $r$ servers holding dormant copies of a given death certificate have failed is $2^{-r}$; the value of $r$ can be chosen to make this probability acceptably small.

To compare this scheme with one using a single fixed threshold $\tau$, assume that the rate of deletion is steady over time. The single-threshold scheme stores each death certificate at all $n$ sites for time $\tau$, while the two-threshold scheme stores it at all $n$ sites for $\tau_1$ and at only $r$ sites for a further $\tau_2$. Equating the space used, $n\tau = n\tau_1 + r\tau_2$, and assuming $\tau > \tau_1$, we obtain $$\tau_2 = (\tau - \tau_1)n/r .$$ That is, there is an $O(n)$ improvement in the amount of history we can maintain. In our existing system, this would enable us to increase the effective history from 30 days to several years.

At first glance, the dormant death certificate algorithm would appear to scale indefinitely. Unfortunately, this is not the case. The problem is that the expected time to propagate a new death certificate to all sites, which grows with $n$, will eventually exceed the threshold $\tau_1$, which does not grow with $n$. While the propagation time is less than $\tau_1$, it is seldom necessary to reactivate old death certificates; once the propagation time grows beyond $\tau_1$, reactivation of old death certificates becomes more and more frequent. Reactivations add load to the network for propagating the old death certificates, thereby further degrading the propagation time. The ultimate result is catastrophic failure. To avoid such a failure, system parameters must be chosen to keep the expected propagation time below $\tau_1$. As described above, the time required for a new update to propagate through the network using anti-entropy is expected to be $O(\log n)$. This implies that for sufficiently large networks $\tau_1$, and hence the space required at each server, eventually must grow as $O(\log n)$.
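The equal-space trade-off and the loss probability are easy to check numerically. The parameter values below ($n = 300$, $r = 4$, $\tau = 30$ days, $\tau_1 = 15$ days) are assumptions chosen for illustration, not figures from the deployed system.

```python
def dormant_window(tau, tau1, n, r):
    """Dormant retention time tau2 that gives the same total space as a
    single-threshold scheme with threshold tau: tau2 = (tau - tau1) * n / r."""
    return (tau - tau1) * n / r

# Assumed parameters for illustration only.
n, r = 300, 4            # network size and retention-list length
tau, tau1 = 30.0, 15.0   # days: single threshold vs. first threshold

tau2 = dormant_window(tau, tau1, n, r)   # 1125 days, roughly three years
loss = 2 ** -r                           # probability all r retention sites
                                         # fail within one server half-life
```

With these assumed parameters, $\tau_1 + \tau_2$ exceeds three years while using the same space as a 30-day single threshold, consistent with the $O(n)$ improvement claimed above, and the per-half-life loss probability is $2^{-4} = 1/16$.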