Tradeoffs in FileStore online recovery
CSL Notebook Entry
To: Alpine designers        Date: June 6, 1981
From: Mark Brown        Location: PARC/CSL
Subject: Tradeoffs in FileStore online recovery        File: [Ivy]<Alpine>Doc>RecoveryTradeoffs.bravo
XEROX
Attributes: informal, technical, Alpine, Filing
Abstract: This memo attempts to understand the tradeoffs in the design of FileStore’s online recovery facilities. We sketch a relatively simple design that seems to give performance that meets our needs, and discuss the properties of a more complex design.
What online recovery is and is not
At any instant of time, the state of a FileStore file is defined by the current state of a FilePageMgr file (the base file or simply base) that represents the FileStore file, plus the contents of the log (giving recent updates to the FileStore file.) Online recovery of a file consists of deriving the true contents of the file from these two sources, and then updating the base to contain these true contents. Online recovery of a FileStore is required after a crash caused, for instance, by an intermittent hardware problem.
The speed of online recovery is important. In some environments, a recovery time greater than 1 minute is considered unacceptable. In CSL the recovery speed requirement is clearly less severe. If crashes are sufficiently rare (say 1 per month), a recovery time of 10 minutes for most crashes is probably acceptable to most members of CSL.
Online recovery is not the process of restoring a FileStore to a transaction-consistent state after a failure of disk storage. Recovery from media failure is related to online recovery only in that much of the information required to drive each process is the same, and is recorded in the log. Media failure is a sufficiently rare event that much slower recovery from media failure is acceptable in CSL.
We restrict the following discussion to the recovery of file-level actions (for instance, "update page".) File-level recovery is the first phase of online recovery; a FileStore is restored to a state that is transaction-consistent with respect to file actions before the recovery of client-defined actions is started. We expect the time for file-level recovery to be the largest component of the time for recovery as a whole.
The basic tradeoff: extra logging speeds recovery but is usually wasted
In what follows we shall assume that the FileStore implementation uses a deferred-update strategy for pages, i.e. that no uncommitted page update is ever propagated to the base. Toward the end of this memo we shall explore the recovery consequences of not deferring page update.
Recovery must determine the difference between the contents of the base and the true contents of the FileStore, and properly update the base. This suggests that to make recovery fast, we should:
1) never let the base and the FileStore get too far apart, and
2) leave information in the log that makes it easy to determine the difference between the base and the (committed) FileStore.
Point 1) implies "don’t allow large (in terms of amount of update) transactions", because while a transaction is in progress there is an inevitable gap between the base and the potential (depending upon commit or abort of the transaction) contents of the FileStore. We should probably discourage large transactions anyway, since it is difficult to optimize for all types of transactions and our intuition is that small transactions are more important. We can even make a case for banning such transactions, on the grounds that they can potentially make recovery too slow for all users of the system.
Point 1) also implies that we be careful, in our deferred-update strategy, not to get too far behind in carrying out committed updates. This requires control of internal scheduling so that finishing deferred work has a slightly higher priority than initiating new work, at least when the server is heavily loaded.
Point 2) reveals a real tradeoff. The more work we put into writing redundant information into the log (to make it easy to determine the difference between the base and the FileStore), the less work is required at recovery time. Since the extra logging is wasted most of the time (we rarely crash), we should choose to log the minimum amount that is consistent with a reasonably simple recovery scheme (recovery should be reliable) and with our target recovery speed. So to choose a design we must be able to estimate the actual recovery time for candidate designs.
The general form of log-based recovery
First observe that the recovery algorithm for a file system can gain no information about what updates to redo by inspecting base file pages. The reason is that these pages may contain arbitrary bit patterns, specified by the client. If some space in each page were reserved for control information (as is done in some file systems that are used exclusively for storing databases), then this control information could be used to help decide which updates to redo. But we shall assume that recovery is driven strictly from the log, and the base state is used only for the data it contains (for instance, if a partial page update is logged, the base contents for the non-updated portion of the page may be relevant.)
A log-based recovery algorithm has the form:
1) Use the restart record to locate the approximate section of log that was active before the crash. Find a log record (the recovery start point) such that all older log records are irrelevant to recovery. An underestimate of the maximum recovery start point is logically ok (we can always recover from the start of the log), but the better the estimate, the faster the recovery will be.
2) Analyze the log from the recovery start point forward to the end. The analysis divides all transactions that were active in this section of log into winners (committed), losers (aborted explicitly or by the crash), and in-doubt (in the ready state, waiting for a decision to commit or abort from the transaction coordinator.) The analysis may determine more information, such as the set of pages that must be updated during recovery.
3) Carry out updates to the base. Depending upon the degree of detail in the preceding analysis, few or many of these updates will actually be redundant. There may or may not be logging activity during this step to reflect these updates. (Reacquire locks for resources updated by in-doubt transactions.)
4) Commit the recovery by writing appropriate log records and then updating the restart record. If the server crashes prior to the restart record update, then recovery starts from scratch. If no logging took place during step 3, then the base updates performed in the second recovery will be identical to the first, since recovery is driven strictly by the log.
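To make the shape of this algorithm concrete, here is a minimal sketch of the four steps, written in Python notation. The log, restart record, and base interfaces it calls (log.scan, restart.read, base.apply_update, and so on) are hypothetical stand-ins rather than actual FileStore interfaces, and the sketch assumes the deferred-update strategy, so losers require no undo.

    # Minimal sketch of log-driven restart (steps 1-4), assuming hypothetical
    # log, restart record, and base interfaces, and assuming deferred update
    # (so aborted transactions never touched the base and need no undo).
    def online_recovery(log, restart, base):
        # Step 1: the restart record gives the recovery start point.
        start = restart.read().recovery_start_point

        # Step 2: analysis -- one sequential scan from the start point to the end.
        winners, losers, in_doubt, updates = set(), set(), set(), []
        for rec in log.scan(start):
            if rec.kind == "update":
                updates.append(rec)
            elif rec.kind == "commit":
                winners.add(rec.txn)
            elif rec.kind == "abort":
                losers.add(rec.txn)
            elif rec.kind == "ready":
                in_doubt.add(rec.txn)
        in_doubt -= winners | losers
        # Transactions still active at the crash are losers too.
        losers |= {r.txn for r in updates} - winners - in_doubt

        # Step 3: carry out committed updates against the base (some of these
        # writes will be redundant), and reacquire locks for in-doubt work.
        for rec in updates:
            if rec.txn in winners:
                base.apply_update(rec)
            elif rec.txn in in_doubt:
                base.reacquire_lock(rec)

        # Step 4: commit the recovery, then advance the restart record.
        log.write(kind="recovery_complete")
        log.force()
        restart.update(recovery_start_point=log.end())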
The role of checkpoints
Step 1 of the recovery scheme just outlined is crucial. Somehow, the recovery algorithm must bound the amount of log to be examined during restart. The standard way to accomplish this is to periodically record a checkpoint in the log. A checkpoint log record contains information of some kind to put an upper bound on restart. The restart record points to the most recent checkpoint.
The time between checkpoints generally has some influence on the time for online recovery: the more frequent the checkpoints, the faster the recovery. It is feasible to take a cheap checkpoint more frequently than an expensive one. So cheap checkpoints are desirable. In particular, a checkpoint should definitely not require quiescing the system, either by forcing all transactions in the system to commit or abort, or by stopping all FilePageMgr updates and forcing out the FilePageMgr buffer pool.
There is a wide range of possibilities for the amount of information recorded in a checkpoint. One possibility is to record essentially the entire volatile state of the FileStore: the log map, the set of active transactions, the contents of the FilePageMgr buffer pool. This is an expensive checkpoint. At the other extreme, a checkpoint might record a single log record id: the recovery start point. The best possible recovery start point is the oldest log record recorded in the log map, or the oldest log record representing an update currently residing in the FilePageMgr buffer pool, whichever is older. Note, however, that determining the oldest log record representing an update currently residing in the FilePageMgr buffer pool could be difficult.
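A sketch of the cheapest form of checkpoint, again in Python notation with hypothetical log map and buffer pool interfaces (log_map.oldest_record_id, buffer_pool.oldest_dirty_record_id), may help fix ideas; as noted above, the second of these quantities may in practice be difficult for FilePageMgr to supply.

    # Sketch of a cheap checkpoint that records only a recovery start point.
    # The log_map and buffer_pool interfaces are hypothetical; the buffer pool
    # quantity is assumed available here, though it may be hard to obtain.
    def take_cheap_checkpoint(log, restart, log_map, buffer_pool):
        # Oldest log record for a committed update not yet carried out.
        oldest_deferred = log_map.oldest_record_id()            # None if empty
        # Oldest log record for an update living only in the buffer pool.
        oldest_buffered = buffer_pool.oldest_dirty_record_id()  # None if clean

        candidates = [r for r in (oldest_deferred, oldest_buffered) if r is not None]
        start_point = min(candidates) if candidates else log.end()

        checkpoint = log.write(kind="checkpoint", recovery_start_point=start_point)
        log.force()
        restart.update(checkpoint=checkpoint)  # points at the newest checkpoint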
Saving update work in recovery
A good case can be made that the potentially most expensive aspect of recovery is the update activity of step 3. Step 2 (the analysis), if arranged properly, consists of a sequential scan of the log from either the checkpoint or the recovery start point to the end. This scan must maintain some volatile data structures, but the size of these structures should be less than the size of volatile structures during normal system operation. Step 3 involves page writes that are likely to be scattered at random over the disk, and hence run a factor of 100 slower than sequential log reads. (To be fair, we may not be I/O bound during the analysis phase.) So one effective way to speed up recovery is to reduce the number of pages written during step 3.
There are two plausible methods for saving update work: on a per-transaction basis, and on a per-page basis. The per-transaction method is to write log records to note that all updates performed by a given committed transaction have been fully carried out (pages written to the base, files deleted or shrunk), or all updates performed by a given aborted transaction have been fully undone (files deleted or shrunk.) These records allow the redoing or undoing of all updates by such transactions to be skipped in step 3.
In general it is quite difficult to determine when a transaction’s updates have been fully carried out. The FilePageMgr has no notion of transaction, and it has control of the scheduling of page writes from its buffer pool. Note that a page may be written several times by different transactions before actually migrating to disk; if a transaction writes a page that is later completely overwritten by another transaction, then we consider that the first transaction’s update has been "fully carried out" even though it never appeared on the disk.
The following design, suggested by Ed Taft, seems like a good compromise. The procedure FilePageMgr.ForceOutEverything should operate as follows: first mark all known dirty pages as "needs to be forced out". Then enumerate all pages so marked, forcing each one out and clearing the mark. The rate of this enumeration should perhaps be controlled by a parameter to the procedure. When all pages that were dirty on entry have been forced out, return. The implementation of FileStore periodically activates a process that (1) notes all transactions whose deferred updates have been completely written through to FilePageMgr since the last time the process was activated, (2) calls FilePageMgr.ForceOutEverything, and when it returns (3) writes a log record showing that the transactions noted in (1) are finished.
The log record written in (3) could even be a checkpoint. In that case, instead of mentioning transactions that are finished, it would list transactions that are not known to be finished. The recovery start point is the oldest log record of any transaction not known to be finished. If the log record written in (3) is not a checkpoint, then it can be considered an update to a checkpoint, and should contain enough information to not only eliminate transactions but to move the recovery start point forward as far as possible.
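Rendered in the same Python notation, the periodic process and the checkpoint variant of its log record look roughly as follows; the bookkeeping object and its methods are hypothetical, while FilePageMgr.ForceOutEverything is the procedure described above.

    # Sketch of the periodic process (steps 1-3 above), with the log record of
    # step 3 written as a checkpoint.  The bookkeeping interface is hypothetical.
    def note_finished_transactions(log, restart, file_page_mgr, bookkeeping, rate=None):
        # (1) Transactions whose deferred updates have all been written through
        #     to FilePageMgr since the last activation of this process.
        finished = bookkeeping.transactions_written_through()

        # (2) Force out every page that was dirty on entry; the rate parameter
        #     lets the caller limit the resulting disk traffic.
        file_page_mgr.ForceOutEverything(rate=rate)

        # (3) Checkpoint variant: list the transactions not known to be finished,
        #     and take as recovery start point the oldest log record of any of them.
        unfinished = bookkeeping.all_known_transactions() - finished
        start_point = min((bookkeeping.oldest_record_of(t) for t in unfinished),
                          default=log.end())
        checkpoint = log.write(kind="checkpoint",
                               unfinished=list(unfinished),
                               recovery_start_point=start_point)
        log.force()
        restart.update(checkpoint=checkpoint)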
The per-page method for reducing update work is to write a log record to note that a single page has been written to the base. This log record should contain the log record ID of the most recent update to the page that was actually included in what was written to the base. Even for small pages, the volume of logging for such "end write" records would be small compared to the logging required for the page update itself; furthermore, one "end write" log record could easily cover a number of page writes.
The direct way to implement the per-page method is to modify FilePageMgr to do the log record ID accounting and write the "end write" log records. This means that whenever a FilePageMgr page is modified, the log record ID of the log record representing the update should be passed in to FilePageMgr. Note that if FilePageMgr has this information, it can contribute to the recovery start point calculation (since it knows the oldest log record that it is depending upon.) Giving FilePageMgr this information also allows page update before commit (as discussed in a section below.)
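The following sketch, with hypothetical class and method names, shows the bookkeeping this would add to a FilePageMgr-like buffer pool: each update carries its log record ID, each batch of page writes is covered by one "end write" record, and the pool can report the oldest log record it still depends upon (assuming full page updates only, so the newest update to a page supersedes earlier ones).

    # Sketch of per-page "end write" logging inside a FilePageMgr-like buffer
    # pool.  Class and method names are hypothetical; full page updates only.
    class BufferedPage:
        def __init__(self, file_id, page_number, page_size=512):
            self.file_id = file_id
            self.page_number = page_number
            self.data = bytearray(page_size)
            self.newest_update = None     # log record ID of most recent update
            self.dirty = False

    class PageManager:
        def __init__(self, log, disk):
            self.log = log
            self.disk = disk
            self.pages = {}               # (file_id, page_number) -> BufferedPage

        def update_page(self, page, data, record_id):
            # The caller passes the log record ID along with the new contents.
            page.data[:] = data
            page.newest_update = record_id
            page.dirty = True

        def write_batch(self, batch):
            # Write a batch of dirty pages, then cover them all with a single
            # "end write" record naming the newest update included in each write.
            for page in batch:
                self.disk.write(page.file_id, page.page_number, page.data)
                page.dirty = False
            self.log.write(kind="end_write",
                           pages=[(p.file_id, p.page_number, p.newest_update)
                                  for p in batch])

        def recovery_start_point(self):
            # Oldest log record the pool still depends upon; with full page
            # updates, the newest update to a dirty page supersedes earlier ones.
            dirty = [p.newest_update for p in self.pages.values() if p.dirty]
            return min(dirty) if dirty else None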
A drawback of the per-page method is that a more complex analysis phase must be used in recovery to take advantage of the per-page logging. This may slow down the analysis phase significantly, and it will certainly make it more difficult to get right.
By doing extra bookkeeping above the FilePageMgr, it is possible to implement the per-page method in an analogous fashion to the per-transaction implementation just described. But the per-transaction method seems superior for this type of implementation.
Partial page updates
From the recovery viewpoint it is relatively easy to support partial page updates. A partial page update may be useful, for instance, in updating the version number in the leader page of a file.
If FilePageMgr knows about log record IDs, then with partial page updates each page needs to have two log record IDs associated with it: (1) the newest update to the page, and (2) (an underestimate of) the oldest update to the page that has not been written to disk. Log record ID (2) is used in the recovery start point calculation; (1) is used when writing the "end write" record. If a page is updated only by full page update commands, then the two IDs are the same. To avoid complex bookkeeping, ID (2) is probably changed only when a full page update is done or when the page is written out. So an update call to FilePageMgr must pass the log record ID, and a BOOLEAN saying whether the update is full or partial.
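A sketch of this two-ID rule, with hypothetical names, shows how little state is involved; the handling of the first update to a clean page is an assumption beyond what is said above.

    # Sketch of the two log record IDs kept per page once partial updates exist.
    # Names are hypothetical; the rule for ID (2) follows the text, plus an
    # assumed case for the first update after the page was last written out.
    class PageRecordIds:
        def __init__(self):
            self.newest_update = None     # (1) newest update to the page
            self.oldest_unwritten = None  # (2) oldest update not yet on disk
                                          #     (an underestimate is acceptable)

        def note_update(self, record_id, full_page):
            self.newest_update = record_id
            if full_page:
                # A full page update supersedes all earlier updates to the page.
                self.oldest_unwritten = record_id
            elif self.oldest_unwritten is None:
                # First update since the page was last written out (assumed case).
                self.oldest_unwritten = record_id
            # Otherwise a partial update of an already-dirty page leaves (2)
            # alone, avoiding more complex bookkeeping.

        def note_written_out(self):
            # (1) goes into the "end write" record; (2) is cleared.
            self.oldest_unwritten = None
            return self.newest_update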
To actually implement partial page updates, the log map must be sufficiently general. Alternatively, such an update could require an exclusive page lock and be performed in the FilePageMgr buffer pool before commit (using undo-redo logging.) See the next section.
Update before commit
Once the FilePageMgr knows about log record IDs, it is straightforward for the system to support updates in the FilePageMgr buffer pool before commit. The FilePageMgr is then responsible for enforcing the write-ahead log protocol (before writing a page, log records responsible for updates to the page must be written to the log.)
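A sketch of the check FilePageMgr would make before each page write follows, assuming it tracks the newest log record ID per page as above; log.force_through is a hypothetical call that forces the log to disk through a given record.

    # Sketch of write-ahead log enforcement on page write-out.  Assumes the
    # per-page record ID bookkeeping above; log.force_through is hypothetical.
    def write_page_with_wal(log, disk, page):
        # The log records responsible for the page's current contents must be
        # on disk before the page itself is written.
        if page.newest_update is not None:
            log.force_through(page.newest_update)
        disk.write(page.file_id, page.page_number, page.data)
        page.dirty = False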
To actually implement update before commit, the FileStore implementation must either exclusive-lock the updated page (or bytes), or the log map entry must be generalized so that other transactions can find the old contents (in the log.)
The recovery start point must factor in the oldest log record representing an already-performed update by an uncommitted transaction, as well as the oldest deferred update. And a somewhat different analysis must be done: for each page updated by an aborted transaction, we need to redo all committed updates and apply the "preimages" of the aborted updates (that is, undo them).
Given the mix of transactions that we expect (file transfers, which do not require deferred update, and short database transactions), we have no strong motivation to implement update before commit.
Conclusion
The per-transaction strategy for saving work in recovery seems most promising. The proposed implementation is simple, keeping both transactions and log record IDs out of the FilePageMgr. The FilePageMgr retains a large degree of control over disk transfer scheduling.
The per-page strategy, with logging of end-writes by FilePageMgr, is significantly more complex than the per-transaction strategy, especially in the analysis phase of recovery. It is less well-adapted to using the Pilot swapper to control disk I/O (since the swapper gives no notification of end-write to its client.) The extra complexity does make it possible to perform FilePageMgr updates before commit. But we see no strong motivation for this capability in the FileStore implementation.