\input Paper.tex
\centerline{\titlefont Section 2 of the Paper}

{\noindent\bf 2. Deletion}

{\noindent\bf 2.0 Death Certificates}

Using either anti-entropy or rumor mongering, we cannot delete an item from the database simply by removing a local copy and expecting the ``absence'' of the item to spread to other sites. Just the opposite happens: the propagation mechanism spreads old copies of the item from elsewhere in the database back to the site where we deleted it. Unless we can simultaneously delete all copies of an obsolete item from the database, it will eventually be ``resurrected'' in this way.

To remedy this problem we replace deleted items with {\sl death certificates}, which carry time stamps and spread like ordinary data. During propagation, when old copies of deleted items meet death certificates, the old items are removed. If we keep a death certificate long enough, it eventually cancels all old copies of its associated item.

Unfortunately this does not completely solve the problem: we must still decide when to delete the death certificates themselves, so that they do not ultimately consume all available storage at the sites. One strategy is to retain each death certificate until it can be determined that every site has received it. Sarin and Lynch [??] describe a protocol for making this determination, based on the distributed snapshot algorithm of Chandy and Lamport [??]. Separate protocols are needed for adding and removing sites (the site addition protocol is not actually described in [??]). If any site has failed since the creation of a death certificate, that death certificate cannot be deleted until the site removal protocol has been run. In a network of several hundred sites this restriction is quite significant: in our experience there is a fairly high probability that at any given time some site will be down for hours or even days, preventing the distributed snapshot or site removal algorithm from completing.

A much simpler strategy is to hold death certificates for some fixed length of time, such as 30 days, and then discard them. With such a threshold we run the risk that obsolete items older than the threshold will be resurrected, as described above. Fortunately, this risk can be made very small.

{\noindent\bf 2.1 Death Certificate Decay}

There is a distributed way of extending the threshold back further than the space on any one server would permit. This scheme, which we call {\sl death certificate decay}, uses two thresholds, $\tau_1$ and $\tau_2$. Each server keeps every death certificate timestamped within $\tau_1$ of the current time; once a death certificate is older than $\tau_1$, it is discarded at rate $1/\tau_2$, so the probability that it is retained falls off exponentially:
$$P_1(t) = \cases{1, & $0 \le t < \tau_1$;\cr e^{-{t - \tau_1 \over \tau_2}}, & $t \ge \tau_1$.\cr}$$
(For simplicity we ignore the differences between sites' local clocks. It is realistic to assume that the clock synchronization error is at most $\epsilon \ll \tau_1$; this has no significant effect on the arguments below.) Here $P_1(t)$ is the probability that a death certificate of age $t$ is still present in the database of a single site. This probability drops off very quickly beyond $\tau_1$. However, as will be seen below, we can ensure that resurrection of a data item cannot occur so long as its death certificate is present at {\sl any} site in the network.

For $n$ sites, the probability that at least one site holds a given death certificate of age $t$ is
$$P_n(t) = 1 - (1 - P_1(t))^n.$$
To compare this with a fixed threshold $\tau$, assume that the rate of deletion is steady over time, so that for equivalent use of storage at each site we should choose $\tau_1 + \tau_2 = \tau$. For example, suppose we have $500$ sites using a fixed $\tau = 30$ days, and we replace it with decaying thresholds $\tau_1 = 10$ days and $\tau_2 = 20$ days. There is now a negligible ($10^{-100}$) chance that a death certificate of age $30$ days has been discarded by every site, and a significant chance ($.965$) that a certificate of age $110$ days is still held by some site.
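These figures are easy to check numerically. The sketch below (plain Python; the function names are ours, chosen for this illustration only) evaluates $P_1$ and $P_n$ directly and reproduces both probabilities quoted above.

import math

def p1(t, tau1, tau2):
    # P1(t): probability that a single site still holds a death
    # certificate of age t, per the case formula above.
    if t < tau1:
        return 1.0
    return math.exp(-(t - tau1) / tau2)

def pn(t, tau1, tau2, n):
    # Pn(t): probability that at least one of n sites holds it.
    return 1.0 - (1.0 - p1(t, tau1, tau2)) ** n

n, tau1, tau2 = 500, 10.0, 20.0          # tau1 + tau2 = tau = 30 days

# Chance that ALL 500 sites have discarded a 30-day-old certificate:
print((1.0 - p1(30, tau1, tau2)) ** n)   # ~2e-100, negligible

# Chance that SOME site still holds a 110-day-old certificate:
print(pn(110, tau1, tau2, n))            # ~0.966, the ".965" quoted above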
{\noindent\bf 2.2 Anti-Entropy with Death Certificate Decay}

If anti-entropy is used for distributing updates, decaying death certificates older than $\tau_1$ should not normally be propagated during anti-entropy exchanges. Whenever such a death certificate encounters an obsolete data item, however, the death certificate must be ``reactivated'' in some way, so that it propagates to all sites and cancels the obsolete item.

The obvious way to reactivate a death certificate is to set its timestamp forward to the current clock value. This might be acceptable in a system that did not allow deleted data items to be ``reinstated.'' In general it is incorrect, because somewhere in the network there could be a legitimate update with a timestamp between the original and revised timestamps of the death certificate (e.g., one reinstating the deleted item). Such an update would incorrectly be cancelled by the reactivated death certificate.

To avoid this problem, we store a second timestamp, called the {\sl activation timestamp}, with each death certificate. Initially the ordinary and activation timestamps are the same. A death certificate cancels a corresponding data item if its ordinary timestamp is greater than the timestamp of the item; it is propagated during anti-entropy if its activation timestamp is within $\tau_1$ of the current time. To reactivate a death certificate, we set its activation timestamp to the current time, leaving its ordinary timestamp unchanged. This has the desired effect of propagating the reactivated death certificate without cancelling more recent updates.
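The two-timestamp rule is simple enough to state as code. The following is a minimal sketch in the same spirit, assuming one record per death certificate and a caller that supplies the current time; the record layout and names are ours, for illustration only.

from dataclasses import dataclass

@dataclass
class DeathCert:
    ordinary: float     # timestamp of the deletion; never changes
    activation: float   # initially equal to the ordinary timestamp

def cancels(cert, itemTimestamp):
    # Cancellation tests the ORDINARY timestamp, so a legitimate
    # later update (e.g. one reinstating the item) survives even a
    # reactivated certificate.
    return cert.ordinary > itemTimestamp

def shouldPropagate(cert, now, tau1):
    # Anti-entropy propagates the certificate only while its
    # ACTIVATION timestamp is within tau1 of the current time.
    return now - cert.activation < tau1

def reactivate(cert, now):
    # On meeting an obsolete copy of the item: refresh only the
    # activation timestamp; the ordinary timestamp is unchanged.
    cert.activation = now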
{\noindent\bf 2.3 Rumor Mongering with Death Certificate Decay}

If rumor mongering is used for distributing updates, decaying death certificates require no special treatment. Each death certificate is created as an active rumor; eventually it propagates through the network and becomes inactive at all sites. Whenever a death certificate with a timestamp older than $\tau_1$ encounters an obsolete data item at some site, it is reactivated simply by making it an active rumor again at that site. The normal rumor mongering mechanism then distributes the death certificate throughout the network, without regard for the fact that its timestamp may be quite old.

{\noindent\bf 2.4 Scaling Properties}

Notice that the performance of the death certificate decay algorithm, measured by the retained history in death certificates, improves as the number of sites increases. More precisely, we can choose a fixed probability $p$ and compute the time $t$ at which $P_n(t)$ falls below $p$, as a function of $n$. The result is
$$t = \tau_2 \ln n + O(1).$$
(To see this, write $P_n(t) = 1 - (1 - P_1(t))^n \approx 1 - e^{-nP_1(t)}$; then $P_n(t) = p$ requires $nP_1(t) = \ln{1 \over 1-p}$, a constant, and solving $ne^{-(t - \tau_1)/\tau_2} = \ln{1 \over 1-p}$ for $t$ yields $t = \tau_1 + \tau_2 \ln n - \tau_2 \ln\ln{1 \over 1-p} = \tau_2 \ln n + O(1)$.) That is, for a database with $n$ sites we can expect to retain a factor of $O(\log n)$ more history by using death certificate decay.

As described above, the time required for a new update to propagate through the network, using either anti-entropy or rumor mongering, is also expected to be $O(\log n)$, so at first glance this algorithm would appear to scale indefinitely. Unfortunately, this is not the case. The problem is that the expected time to propagate a new update grows with $n$, and will eventually exceed the threshold value $\tau_1$, which does not. While the propagation time is less than $\tau_1$, it is seldom necessary to reactivate old death certificates; once the propagation time grows beyond $\tau_1$, reactivation of old death certificates becomes more and more frequent. This places additional load on the network for propagating the old death certificates, thereby further degrading the propagation time. The ultimate result is catastrophic failure. To avoid such a failure, system parameters must be chosen to keep the expected propagation time safely below $\tau_1$.
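A rough feasibility check is easy to make. The sketch below is ours and purely illustrative: it assumes an update reaches essentially all sites in about $\log_2 n$ anti-entropy cycles (consistent with the $O(\log n)$ propagation time above) and that each cycle takes a fixed period; both the cycle length and the safety margin are stand-in values, not measurements.

import math

def expectedPropagationDays(nSites, cycleMinutes=30.0):
    # Rough model: about log2(n) anti-entropy cycles to reach all
    # sites, each cycle of fixed length (assumed, not measured).
    return math.log2(nSites) * cycleMinutes / (60.0 * 24.0)

def decayParamsSafe(nSites, tau1Days, cycleMinutes=30.0, margin=4.0):
    # The constraint from this section: keep the expected
    # propagation time below tau1, with a comfortable margin.
    return margin * expectedPropagationDays(nSites, cycleMinutes) < tau1Days

# With tau1 = 10 days and 30-minute cycles, even a million sites
# needs only about 20 cycles (roughly 10 hours), far below tau1.
print(decayParamsSafe(10**6, tau1Days=10.0))   # True

\par\vfill\end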