\boldhead{1.4 Complex Epidemics} As we have seen already, direct mailing of updates has several problems: it can fail because of message loss, or because the originator has incomplete information about other database sites, and it introduces an $O(n)$ bottleneck at the originating site. Some of these problems would be remedied by a broadcast mailing mechanism, but most likely that mechanism would itself depend on distributed information. The epidemic mechanisms we are about to describe do avoid these problems, but they have a different, explicit probability of failure that must be studied carefully with analysis and simulations. Fortunately this probability of failure can be made arbitrarily small. We refer to these mechanisms as ``complex'' epidemics only to distinguish them from anti-entropy, which is a simple epidemic; complex epidemics still have simple implementations.

Recall that with respect to an individual update, a database is either {\it susceptible} (it does not know the update), {\it infective} (it knows the update and is actively sharing it with others), or {\it removed} (it knows the update but is not spreading it). It is a relatively easy matter to implement this so that a sharing step does not require a complete pass through the database. The sender keeps a list of infective updates, and the recipient tries to insert each update into its own database and adds all new updates to its infective list. The only complication lies in deciding when to remove an update from the infective list.

Before we discuss the design of a ``good'' epidemic, let's look at one example that is usually called rumor spreading in the epidemiology literature.

Rumor spreading is based on the following scenario: There are $n$ individuals, initially inactive (susceptible). We plant a rumor with one person who becomes active (infective), phoning other people at random and sharing the rumor. Every person hearing the rumor also becomes active and likewise shares the rumor. When an active individual makes an unnecessary phone call (the recipient already knows the rumor), then with probability $1/k$ the active individual loses interest in sharing the rumor (the individual becomes removed). We would like to know how fast the system converges to an inactive state (a state in which no one is infective) and the percentage of people who know the rumor (are removed) when this state is reached.

Following the epidemiology literature, rumor spreading can be modeled deterministically with a pair of differential equations. We let $s$, $i$, and $r$ represent the fraction of individuals susceptible, infective, and removed respectively, so that $s+i+r=1$:
$$ \eqalign{ {d s\over d t} &= -si\cr {d i\over d t} &= +si - {1\over k}(1-s)i\cr} \eqno (*)$$
The first equation suggests that susceptibles will be infected according to the product $si$. The second equation has an additional term for loss due to individuals making unnecessary phone calls. A third equation for $r$ is redundant.

A standard technique for dealing with equations like (*) is to consider the ratio $di/ds$ [Ba]. This eliminates $t$ and lets us solve for $i$ as a function of $s$:
$${d i\over d s} = -{k+1\over k} + {1 \over k s} $$
$$ i(s) = -{k+1\over k}s + {1 \over k} \log s + c $$
where $c$ is a constant of integration that can be determined by the initial conditions: $i(1 - \epsilon) = \epsilon$.
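Spelling out this step, with $\epsilon$ the initial infective fraction (roughly $1/n$, since the rumor starts at a single site):
$$ \epsilon = i(1-\epsilon) = -{k+1\over k}(1-\epsilon) + {1 \over k} \log (1-\epsilon) + c, \qquad\hbox{so}\qquad c = \epsilon + {k+1\over k}(1-\epsilon) - {1 \over k} \log (1-\epsilon). $$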
For large $n$, $\epsilon$ goes to zero, giving:
$$ c = {k+1\over k} $$
and a solution of
$$ i(s) = {k+1\over k}(1 - s) + {1 \over k} \log s.$$
The function $i(s)$ is zero when
$$ s = e^{-(k+1)(1-s)}.$$
This is an implicit equation for $s$, but the dominant term shows $s$ decreasing exponentially with $k$. Thus increasing $k$ is an effective way of ensuring that almost everybody hears the rumor. For example, at $k = 1$ this formula suggests that $20\%$ will miss the rumor, while at $k = 2$ only $6\%$ will miss it.

\noindent {\bf Variations.} So far we have seen only one complex epidemic, based on the rumor spreading technique. In general we would like to understand how to design a ``good'' epidemic, so it is worth pausing now to review the criteria used to judge epidemics. We are principally concerned with:

\item{1.} {\bf Residue.} This is the value of $s$ when $i$ is zero, that is, the remaining susceptibles when the epidemic finishes. We would like the residue to be as small as possible, and, as the above analysis shows, it is feasible to make the residue arbitrarily small.

\item{2.} {\bf Traffic.} Presently we are measuring traffic in database updates sent between sites, without regard for the topology of the network. It is convenient to use an average $m$, the number of messages sent from a typical site:
$$m = {\hbox{Total update traffic}\over \hbox{Number of sites}} $$
In section 3 we will refine this metric to incorporate traffic on individual links.

\item{3.} {\bf Delay.} There are two interesting times. The average delay is the difference between the time of the initial injection of an update and the arrival of the update at a given site, averaged over all sites. We will refer to this as $t_{\hbox{\seveni ave}}$. A similar quantity, $t_{\hbox{\seveni last}}$, is the delay until the reception by the last site that will receive the update during the epidemic. Update messages may continue to appear in the network after $t_{\hbox{\seveni last}}$, but they will never reach a susceptible site. We found it necessary to introduce two times because they often behave differently, and the designer is legitimately concerned about both.

\noindent Next, let us consider a few simple variations of rumor spreading. First we will describe the practical aspects of the modifications, and later we will discuss residue, traffic, and delay.

\noindent {\bf Blind vs. Feedback.} The rumor example used feedback from the recipient; a sender loses interest only if the recipient already knows the rumor. A blind variation loses interest with probability $1/k$ regardless of the recipient. This obviates the need for the bit vector response from the recipient.

\noindent {\bf Counter vs. Coin.} Instead of losing interest with probability $1/k$ we can use a counter, so that we lose interest only after $k$ unnecessary contacts. The counters require keeping extra state for elements on the infective lists. Note that we can combine counter with blind, remaining infective for $k$ cycles independent of any feedback.

A surprising aspect of the above variations is that they share the same fundamental relationship between traffic and residue:
$$ s = e^{-m} $$
This is relatively easy to see by noticing that there are $nm$ updates sent and the chance that a single site misses all these updates is $(1-1/n)^{nm}$. (Since $m$ is not constant this relationship depends on the moments around the mean of the distribution of $m$ going to zero as $n\rightarrow \infty$, a fact that we have observed empirically, but have not proven.)
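These variations are compact enough to prototype. The Python sketch below is meant only to make the push state machine concrete; the synchronous-cycle model, the function name, and its parameters are assumptions of the sketch, not details of our implementation or of the simulator behind the tables that follow, so its output should only roughly track those tables.

import random

def rumor_push(n=1000, k=1, feedback=True, counter=True, seed=None):
    """Simulate one push epidemic; return (residue s, traffic m, t_ave, t_last).

    Synchronous cycles: each infective site contacts one random other site per
    cycle and pushes the update.  With counter=True a site loses interest after
    k unnecessary contacts; otherwise it loses interest with probability 1/k per
    unnecessary contact.  With feedback=True only contacts to sites that already
    knew the update are unnecessary; with feedback=False ("blind") every contact
    counts.
    """
    SUSCEPTIBLE, INFECTIVE, REMOVED = 0, 1, 2
    rng = random.Random(seed)
    state = [SUSCEPTIBLE] * n
    state[0] = INFECTIVE                      # the update is injected at site 0, cycle 0
    unnecessary = [0] * n                     # per-site count of unnecessary contacts
    heard_at = [None] * n
    heard_at[0] = 0
    traffic = 0
    cycle = 0
    while INFECTIVE in state:
        cycle += 1
        for site in [i for i in range(n) if state[i] == INFECTIVE]:
            dest = rng.randrange(n - 1)
            if dest >= site:
                dest += 1                     # choose a partner other than ourselves
            traffic += 1
            already_knew = state[dest] != SUSCEPTIBLE
            if not already_knew:
                state[dest] = INFECTIVE
                heard_at[dest] = cycle
            if already_knew or not feedback:  # an "unnecessary" contact for this variation
                if counter:
                    unnecessary[site] += 1
                    if unnecessary[site] >= k:
                        state[site] = REMOVED
                elif rng.random() < 1.0 / k:
                    state[site] = REMOVED
    delays = [t for t in heard_at if t is not None]
    return (1.0 - len(delays) / n,            # residue s
            traffic / n,                      # traffic m
            sum(delays) / len(delays),        # t_ave (over sites that heard the update)
            max(delays))                      # t_last

For instance, averaging a few runs of this sketch with $k = 2$, feedback and counters on, should give a residue and traffic in the general vicinity of the $k = 2$ row of Table 1 below, while turning both off should give blind/coin behavior comparable to Table 2.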
Delay is the only consideration that distinguishes the above possibilities: simulations indicate that counters and feedback improve the delay, with counters playing a more significant r\^ole than feedback.

\vskip 10pt
\noindent {\bf Table 1.} Performance of an epidemic on 1000 sites using feedback and counters.
\penalty1000 \vskip 5pt \penalty1000
{\vbox{\tabskip=0pt \offinterlineskip
\def\tr{\noalign{\hrule}}
\halign to\hsize{\strut#& \vrule#\tabskip=1em plus2em&\hfil#\hfil& \vrule#& \hfil#\hfil& \vrule#&\hfil#\hfil& \vrule#& \hfil#\hfil& \vrule#&\hfil#\hfil& \vrule#\tabskip=0pt\cr\tr
&& Counter && Residue && Traffic &&\multispan3 Convergence &\cr
&& $k$ && $s$ && $m$ && $t_{\hbox{\seveni ave}}$ &\omit& $t_{\hbox{\seveni last}}$ &\cr\tr
&& 1 && 0.18 && 1.7 && 11.0 && 16.8 &\cr\tr
&& 2 && 0.037 && 3.3 && 12.1 && 16.9 &\cr\tr
&& 3 && 0.011 && 4.5 && 12.5 && 17.4 &\cr\tr
&& 4 && 0.0036 && 5.6 && 12.7 && 17.5 &\cr\tr
&& 5 && 0.0012 && 6.7 && 12.8 && 17.7 &\cr\tr
}}}
\vskip 10pt\penalty-500
\noindent {\bf Table 2.} Performance of an epidemic on 1000 sites using blind and coin.
\penalty1000 \vskip 5pt \penalty1000
{\vbox{\tabskip=0pt \offinterlineskip
\def\tr{\noalign{\hrule}}
\halign to\hsize{\strut#& \vrule#\tabskip=1em plus2em&\hfil#\hfil& \vrule#& \hfil#\hfil& \vrule#&\hfil#\hfil& \vrule#& \hfil#\hfil& \vrule#&\hfil#\hfil& \vrule#\tabskip=0pt\cr\tr
&& Coin && Residue && Traffic &&\multispan3 Convergence &\cr
&& $k$ && $s$ && $m$ && $t_{\hbox{\seveni ave}}$ &\omit& $t_{\hbox{\seveni last}}$ &\cr\tr
&& 1 && 0.96 && 0.04 && 19 && 38 &\cr\tr
&& 2 && 0.20 && 1.6 && 17 && 33 &\cr\tr
&& 3 && 0.060 && 2.8 && 15 && 32 &\cr\tr
&& 4 && 0.021 && 3.9 && 14.1 && 32 &\cr\tr
&& 5 && 0.008 && 4.9 && 13.8 && 32 &\cr\tr
}}}
\vskip 10pt
\noindent {\bf Push vs. Pull.} So far we have assumed that all the sites would randomly choose destination sites and {\it push} infective updates to the destinations. The {\it push} vs.\ {\it pull} distinction made already for anti-entropy can be applied to rumor mongering as well. {\it Pull} has some advantages, but the primary practical consideration is the rate of update injection into the distributed database. If there are numerous independent updates a {\it pull} request is likely to find a source with a non-empty rumor list, triggering useful information flow. By contrast, if the database is quiescent, the {\it push} algorithm ceases to introduce traffic overhead, while the {\it pull} variation continues to inject fruitless requests for updates. Our own CIN application has a high enough update rate to warrant the use of {\it pull}.

The chief advantage of {\it pull} is that it does significantly better than the $ s = e^{-m} $ relationship of {\it push}. Here the blind vs. feedback and counter vs. coin variations are important. Simulations indicate that the counter and feedback variations improve residue, with counters being more important than feedback. We have a recurrence relation modeling the counter-with-feedback case that exhibits $s = e^{-\Theta(m^3)}$ behavior.
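For concreteness, here is how one cycle of the {\it pull} variation (counter with feedback) might look in the style of the earlier sketch. The end-of-cycle counter rule is the one given in the footnote to Table 3 below; everything else, including the assumption that a source ships its rumored update to every site that pulls from it while the update is on its list, is an assumption of this sketch rather than a detail of our implementation.

def pull_cycle(state, hits, k, rng, n):
    """One synchronous pull cycle (counter with feedback): every site asks one
    random partner for rumors; a partner that was infective at the start of the
    cycle ships the update.  Counters are settled at the end of the cycle using
    the rule from the footnote to Table 3."""
    SUSCEPTIBLE, INFECTIVE, REMOVED = 0, 1, 2
    was_infective = [st == INFECTIVE for st in state]
    pullers_needed = [[] for _ in range(n)]   # per source: did each puller need the update?
    traffic = 0
    for site in range(n):                     # every site is a recipient in every cycle
        src = rng.randrange(n - 1)
        if src >= site:
            src += 1                          # pick a partner other than ourselves
        if was_infective[src]:                # the update is still on the source's rumor list
            traffic += 1
            needed = state[site] == SUSCEPTIBLE
            pullers_needed[src].append(needed)
            if needed:
                state[site] = INFECTIVE
    for src in range(n):                      # end-of-cycle counter rule (Table 3 footnote)
        if not pullers_needed[src]:
            continue                          # nobody pulled from this source this cycle
        if any(pullers_needed[src]):
            hits[src] = 0                     # some puller needed it: reset the counter
        else:
            hits[src] += 1                    # all pulls were unnecessary: count one more
            if hits[src] >= k:
                state[src] = REMOVED          # lose interest after k such cycles
    return traffic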
\vskip 10pt
\noindent {\bf Table 3.} Performance of a pull epidemic on 1000 sites using feedback and counters\raise.6ex\hbox{\dag}.
\vskip 5pt
{\vbox{\tabskip=0pt \offinterlineskip
\def\tr{\noalign{\hrule}}
\halign to\hsize{\strut#& \vrule#\tabskip=1em plus2em&\hfil#\hfil& \vrule#& \hfil#\hfil& \vrule#&\hfil#\hfil& \vrule#& \hfil#\hfil& \vrule#&\hfil#\hfil& \vrule#\tabskip=0pt\cr\tr
&& Counter && Residue && Traffic &&\multispan3 Convergence &\cr
&& $k$ && $s$ && $m$ && $t_{\hbox{\seveni ave}}$ &\omit& $t_{\hbox{\seveni last}}$ &\cr\tr
&& 1 && $3.1 \times 10^{-2}$ && 2.7 && 9.97 && 17.6 &\cr\tr
&& 2 && $5.8 \times 10^{-4}$ && 4.5 && 10.07 && 15.4 &\cr\tr
&& 3 && $4.0 \times 10^{-6}$ && 6.1 && 10.08 && 14.0 &\cr\tr
}}}
\noindent \raise.6ex\hbox{\dag} If more than one recipient pulls from a site in a single cycle, then at the end of the cycle the effects on the counter are as follows: if any recipient needed the update then the counter is reset; if no recipient needed the update then the counter is incremented by one.
\vskip 10pt
\noindent {\bf Minimization.} It is also possible to make use of the counters of both parties in an exchange to make the removal decision. The idea is to use a push and a pull together, and if both sites already know the update, then only the site with the smaller counter is incremented (in the case of equality both must be incremented). This requires sending the counters over the network, but it results in the smallest residue we have seen so far.

\noindent {\bf Connection Limit.} It is unclear whether connection limitation should be seen as a difficulty or an advantage. So far we have ignored connection limitations. Under the {\it push} model, for example, we have assumed that a site can become the recipient of more than one push in a single cycle, and in the case of {\it pull} we have assumed that a site can service an unlimited number of requests. Since we plan to run the rumor spreading algorithms frequently, realism dictates that we use a connection limit. The connection limit affects the {\it push} and {\it pull} mechanisms differently: {\it pull} gets significantly worse, and, paradoxically, {\it push} gets significantly better.

To see why {\it push} gets better, assume that the database is nearly quiescent, that is, only one update is being spread, and that the connection limit is one. If two sites contact the same recipient then one of the contacts is rejected. The recipient still gets the update, but with one less unit of traffic. (We have chosen to measure traffic only in terms of updates sent. Some network activity arises in attempting the rejected connection, but this is less than that involved in transmitting the update. We have, in essence, shortened a connection that was otherwise useless.) How many connections are rejected? Since an epidemic grows exponentially, we assume most of the traffic occurs at the end, when nearly everybody is infective and the probability of rejection is close to $e^{-1}$. So we would expect that {\it push} with connection limit one would behave like:
$$ s = e^{-\lambda m} \qquad\qquad \lambda = {1\over 1 - e^{-1} } $$
(this estimate is spelled out below). Simulations indicate that the counter variations are closest to this behavior (counter with feedback being the most effective). The coin variations do not fit the above assumptions, since they do not have most of their traffic occurring when nearly all sites are infective. Nevertheless they still do better than $ s = e^{-m} $. In all variations, since {\it push} on a nearly quiescent network works best with a connection limit of $1$, it seems worthwhile to enforce this limit even if more connections are possible.
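To spell out the estimate (our own arithmetic, under the same assumption that essentially all of the traffic is generated while nearly every site is infective): when all $n$ sites push to uniformly chosen destinations in a cycle, about $n(1 - (1-1/n)^n) \approx n(1 - e^{-1})$ distinct destinations are reached, so roughly a fraction $e^{-1}$ of the attempted connections is rejected. A counted traffic of $nm$ updates therefore corresponds to about $nm/(1 - e^{-1})$ attempted connections, and a particular site misses all of them, and hence the update, with probability
$$ \Bigl(1 - {1\over n}\Bigr)^{nm/(1-e^{-1})} \approx e^{-m/(1-e^{-1})} = e^{-\lambda m}. $$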
{\it Pull} gets worse with a connection limit, because its good performance depends on every site being a recipient in every cycle. As soon as there is a finite connection failure probability $\delta$, the asymptotic behavior of {\it pull} changes. Assuming, as before, that almost all the traffic occurs when nearly all sites are infective, the chance of a site missing an update during this active period is roughly:
$$ s = \delta^m = e^{-\lambda m} \qquad \qquad \lambda = -\ln \delta .$$
Fortunately, with only modest-sized connection limits, the probability of failure becomes extremely small, since the number of sites pulling from a given site in a cycle is approximately Poisson distributed with mean one: the chance of a site having $j$ connections in a cycle is $e^{-1}/j!$.

\noindent {\bf Hunting.} If a connection is rejected then the choosing site can ``hunt'' for alternate sites. Hunting is relatively inexpensive and seems to improve all connection-limited cases. In the extreme case, a connection limit of $1$ with an infinite hunt limit results in a complete permutation. Push and pull then become equivalent, and the residue is very small.
\vskip 10pt
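Finally, a sketch of a single connection-limited {\it push} cycle with optional hunting, in the same style as the earlier sketches; the connection limit, the hunt limit, and the choice to count only accepted contacts as update traffic follow the discussion above, while the names and the synchronous-cycle model are again assumptions of the sketch (loss of interest is omitted and would be handled as in the first sketch).

def push_cycle_with_limit(state, rng, n, connection_limit=1, hunt_limit=0):
    """One push cycle with a per-site connection limit and optional hunting.

    Each infective site tries to contact one partner; a partner that has already
    accepted connection_limit connections this cycle rejects further ones, and a
    rejected sender may "hunt", i.e. retry with up to hunt_limit alternate
    partners.  Only accepted contacts ship the update and count as traffic.
    """
    SUSCEPTIBLE, INFECTIVE = 0, 1             # same encoding as the earlier sketches
    accepted = [0] * n                        # connections accepted per site this cycle
    sent = rejects = 0
    for site in [i for i in range(n) if state[i] == INFECTIVE]:
        for _ in range(1 + hunt_limit):       # the original attempt plus any hunts
            dest = rng.randrange(n - 1)
            if dest >= site:
                dest += 1                     # pick a partner other than ourselves
            if accepted[dest] < connection_limit:
                accepted[dest] += 1
                sent += 1                     # only accepted contacts ship the update
                if state[dest] == SUSCEPTIBLE:
                    state[dest] = INFECTIVE
                break
            rejects += 1                      # connection refused; maybe hunt for another
    return sent, rejects

Plugged into the cycle loop of the first sketch, this makes the observations above easy to check: rejected attempts cost no update traffic, and raising the hunt limit reduces how often a sender fails to deliver at all in a cycle.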