Recovery in database management systems
DRAFT - DRAFT - DRAFT - DRAFT
CSL Notebook Entry
To: Alpine designers        Date: March 31, 1981
From: Mark Brown        Location: PARC/CSL
Subject: Recovery in database management systems        File: [Ivy]<Alpine>Doc>DBMSRecovery.bravo
XEROX
Attributes: technical, Database, Distributed computing, Filing
Abstract: We discuss the issue of recovery in database management systems. We present several implementation schemes for recoverable actions, and explore the tradeoffs between CPU cycles, main storage, I/Os, and concurrency. One of our goals is to identify facilities that can be supplied by the underlying operating system, rather than being duplicated by each program that implements recoverable actions.
Contents
1. Transactions
Actions
Transactions and recoverable actions
Implementing recoverable actions: locking, logging, and recovery
2. Recoverable actions with physical locking
File system as simple DBMS
Buffer manager interface
DBMS using a file system’s transactions and physical locking
Access patterns that perform poorly with physical locking
Fine-grained physical locking
3. Recoverable actions with logical locking
Using a shared transaction for base I/O
The recovery manager’s algorithm
Writing action recovery procedures (undo, redo)
Logical and physical recoverable actions for a single transaction
4. Recoverable actions in a file system implementation
Basic recoverable file system actions
Recovering low level file structures (the VAM and VFM)
File properties
Log implementation
Graceful recovery from crashes and media failures
5. Other issues
Remote cache consistency
Transaction save points
Broken locks
6. References
1. Transactions
Actions
A database management system (DBMS) provides actions on the objects it implements. A DBMS typically supports objects such as record, record set, and field within record, and operations such as create record, locate record using key, fetch value of record field, and assign value to record field. (A commercial DBMS usually includes a data communications component, which supports a terminal object with actions such as send message to terminal.) These actions typically involve other low-level actions, performed on objects that the DBMS implementation hides from its client: adding a page to a file, searching a hash-addressed record set, inserting an item into a B-Tree index, etc. A typical DBMS is able to execute actions for several client programs concurrently. (Existing systems are generally not able to execute multiple actions for the same client program concurrently, even if multiple nodes of a network are involved.)
Transactions and recoverable actions
A transaction is used to make a sequence of DBMS actions atomic: either the entire sequence happens or it has no effect. A client program first creates a transaction by the call
t: Transaction ← BeginTransaction[];
The client then supplies t as a parameter to each action that it considers part of this transaction; the DBMS groups these actions into a single recovery unit. A recoverable action is any action that takes a Transaction parameter in order to join a transaction in this way.
To end the transaction t the client calls CommitTransaction[t]. If this call returns successfully, then all actions that were included in t have been committed. The effects of committed actions persist even in the event of a later system failure; the system provides no operation for undoing the effects of a committed transaction.
If the client gets into trouble, it calls AbortTransaction[t], which undoes all actions performed under t. The DBMS may also abort t unilaterally, for instance if t gets into a deadlock with other transactions or uses too much of some resource. (If the DBMS includes data communications, the system ensures that no message sent by an aborted transaction is received by the terminal to which it was addressed.)
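To make the client's view concrete, here is a minimal sketch, written in Python rather than Mesa purely for illustration, of a client using this interface. The dbms object and its method names (begin_transaction, fetch_field, assign_field, commit_transaction, abort_transaction) are hypothetical stand-ins for the operations just described, not part of any existing interface.

class TransactionAborted(Exception):
    """Raised when the DBMS aborts a transaction unilaterally."""

def transfer(dbms, from_rec, to_rec, amount):
    t = dbms.begin_transaction()
    try:
        # Each recoverable action takes t as a parameter, joining the transaction.
        balance = dbms.fetch_field(t, from_rec, "balance")
        dbms.assign_field(t, from_rec, "balance", balance - amount)
        dbms.assign_field(t, to_rec, "balance",
                          dbms.fetch_field(t, to_rec, "balance") + amount)
        dbms.commit_transaction(t)   # both updates take effect, or ...
    except TransactionAborted:
        pass                         # ... neither does; the client may retry
    except Exception:
        dbms.abort_transaction(t)    # undo every action performed under t
        raise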
Besides guaranteeing the all-or-nothingness of a sequence of actions, a transaction also guarantees the consistency of these actions. For our purposes this means that any two transactions appear to run serially (as if there were no concurrency.)
In case of a system crash, the DBMS typically aborts an uncommitted transaction that was in progress at the time of the crash. The DBMS preserves a committed transaction in spite of the crash. (If the DBMS includes data communications, the system ensures that each message sent by a committed transaction is received at least once by the terminal to which it was addressed.)
Even the best implementation of transactions only provides the behavior just described with high probability. Carefully-designed redundancy is used to ensure that k independent failures are necessary to cause the loss of transaction integrity, for some specific k. Several commercial database management systems provide integrity in the face of any single failure of a storage device, and often survive multiple failures.
Few existing systems support the notion of transaction as we have described it. In some systems, a transaction is implicitly created by receiving a message from a terminal, and committed by sending the reply. System R supports transactions, but since no IBM file systems or other IBM data management systems support System R transactions, it is impossible to group actions not implemented by System R into a transaction. System R also does not perform message recovery or allow any parallelism within a transaction.
In our view, there is a universal notion of transaction that applies to all systems that wish to use it. We expect file systems, and general- and special-purpose database management systems to implement actions capable of taking part in a transaction. This allows any number of processes, executing on any number of machines, to perform logically related actions concurrently as part of a single transaction. The underlying operating system supports the type Transaction and the operations BeginTransaction, CommitTransaction, and AbortTransaction.
From the point of view of a client of a transactional file system or database management system, this is essentially all there is to know about transactions. The remaining question is: what interface should the operating system provide to the implementor of a recoverable action? The most effective way to ensure that a number of different systems all agree upon the same notion of transaction is to make it easy to write recoverable actions for such transactions.
Implementing recoverable actions: locking, logging, and recovery
I shall now reserve the term database to mean a logical storage entity which is operated on by recoverable actions. In a quiesced DBMS there is some identifiable storage volume that contains a database, in the sense that any database record can be found at some location on this volume. I shall call this storage volume the base of the database.
Naturally there are several ways to implement recoverable actions. Most implementations include three distinguishable but related components, responsible for locking, logging, and recovery.
Locking is used to ensure the consistency of concurrent transactions. LockConcepts0.bravo discusses locking issues at length. Providing an adequate interface to the lock manager does not seem very difficult. It is necessary for clients of the lock manager to lock objects that are meaningful to them, such as index intervals and individual records. The lock manager for such entities should be able to communicate with the lock manager that the file system uses to lock files and pages, in order to detect deadlocks. It is possible to manage all locks with one facility.
It is worth keeping the lock granularity tradeoff in mind: Locking large objects means using few locks, so the locking overhead is small, but it also means reduced concurrency. Locking small objects allows higher concurrency but is more expensive in processor cycles and storage.
Logging is used to record, in nonvolatile storage, a history of transaction executions on a database. Let the persistent state of a database be the state determined by all committed actions that have been executed prior to a given instant of time. Information written to the log preserves the following write-ahead log invariant:
At any instant of time, the contents of the log plus the contents of the base are sufficient to define the persistent state of the database.
Since the information in a database is the result of executing a sequence of recoverable actions against an initially empty database, it follows that the complete log of a database defines the persistent state of the database at every instant of time including the present; the base is logically redundant. A system that does not compress all but the most recent portion of the log into a random-access base has unacceptable performance.
A log record typically records information at the same level of abstraction as the recoverable action that produced it. For instance "record 25835 field 3 now has value 10 because of transaction 4273" and "commit transaction 4273" are typical log records. Depending upon the timing of updates to the base, a log record produced by executing a recoverable action may contain the database values produced by the action (a redo log record) or the database values overwritten by the action (an undo log record).
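The shape of such log records can be sketched as follows (in Python, purely as illustration; the field names are assumptions, not part of any existing interface):

from dataclasses import dataclass
from typing import Any

@dataclass
class UpdateLogRecord:
    transaction_id: int    # e.g. 4273
    record_id: int         # e.g. 25835
    field: int             # e.g. 3
    old_value: Any         # undo information: the value overwritten by the action
    new_value: Any         # redo information: the value produced by the action

@dataclass
class OutcomeLogRecord:
    transaction_id: int
    outcome: str           # "commit" or "abort"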
As illustrated in the example above, the outcome (commit or abort) of a transaction is an event that may be recorded in a log. In a multi-database transaction coordinated through two-phase commit, there will always be some database with the property that at some instant the transaction outcome is not represented entirely within its log or base; hence the write-ahead log invariant above must be modified somewhat. Failure of transaction outcome storage has a potentially high cost since all resources locked by a multi-database transaction are tied up until the transaction outcome is established. Lampson has proposed a multi-round voting procedure using replicated commit records to solve this problem. Replicated commit records might be implemented using the logs of several independent databases.
Logs are often implemented using sequential writes to a ring buffer of pages. But the term log does not imply this implementation. For instance, the Juniper intentions lists plus shadow pages constitute a log; the global map plus data pages are a base.
Recovery is a procedure invoked after a system crash. It ensures that updates performed by a transaction that was committed before the crash take effect (eventually appear in the base), and that updates performed by a transaction aborted before the crash do not take effect. Recovery may also abort transactions that were in progress (not committed or aborted) at the time of the crash.
Action recovery is generally very specific to the action; for instance, only the code that implements record storage knows how to create a record. The recovery manager is unable to interpret the base, but understands the overall syntax of the log (i.e. how to parse it into records) and can discover the outcome of transactions. The recovery manager is therefore responsible for calling action-specific routines in a sequence, determined by the transaction outcomes and the state of the log, that is designed to bring the entire system (log plus base plus volatile data structures) to a state that will allow the processing of new transactions. An action-specific recovery routine does not understand the overall syntax of the log, but can interpret a log record that was generated by the action.
2. Recoverable actions with physical locking
File system as simple DBMS
A file system may be regarded as a very simple database management system. It supports objects such as file and page within file, and actions such as create file, read page of file, and write page of file. It is natural for these to be recoverable actions, yielding what we sometimes call a "transactional file system" (bletch.) Database management systems can build their more complicated recoverable actions on this base.
We shall not discuss the techniques for implementing recoverable file actions. Our later discussion of recoverable actions with logical locking will be relevant, however.
Buffer manager interface
A buffer manager interface is an effective way of providing transactional files to a DBMS implementation. It also seems to be a suitable basis for many other applications of files that involve both reading and writing.
A buffer manager combines the functions of (virtual) memory allocation and data transfer between files and virtual memory. A basic buffer manager call is "get page p of file f for readonly use by transaction t." This sets a read lock on the page (or perhaps the entire file) for the transaction, returns a page buffer containing a copy of the requested page, and increases the share count of the page buffer by 1. When the DBMS is through with the page, it calls the buffer manager to say "release buffer b." This merely decreases the page buffer’s share count by 1; it does not release any locks. When the buffer’s share count is 0, the buffer manager may re-use the buffer.
Since transactions do writes, the buffer manager must provide modes of access beyond readonly. The simplest way of providing modes is for the "get page" call to accept a mode parameter, which might be Read, Update, or Write (corresponding to the modes in which the page will be locked by the call.) The other way is an explicit "upgrade" call, which allows a transaction to request Update or Write access to a page buffer for which it already has Read access. There is a subtle limitation on "upgrade" calls, however. If other transactions already have Read access to a buffer that transaction t requests in Update mode, t must wait for the reading transactions to finish. This is because the transactions are sharing memory, but the readers must be isolated from the writer. This is not a problem if a "get page" call is used, since the buffer manager has the freedom to allocate a new buffer for the exclusive use of the Update transaction. In any event, when a transaction t dirties and releases a buffer, and the buffer manager decides to re-use that buffer, the buffer manager writes the buffer to the file system under transaction t.
The buffer manager also supports miscellaneous other calls. For instance, the buffer manager will supply a buffer that is not mapped to any file. The client may wish to use the buffer as scratch space, or to write in it and then have the buffer manager map it to a specified location in a file (thereby writing the file without first reading it.) To give isolation from the implementation of files (location and replication transparency), the buffer manager should provide the complete repertoire of standard file actions such as creation, deletion, and length-changing.
The first time that the buffer manager performs an action on behalf of a transaction t, the buffer manager registers itself with the file system transaction manager. This causes the buffer manager to be notified when t is committed or aborted. When t is about to be committed, the buffer manager writes all of t’s dirty pages to the file system; when t is about to be aborted, the buffer manager discards all of t’s dirty pages. The client may also give the buffer manager advance warning that it should begin writing the dirty pages of a particular transaction back to the file system.
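The interface described so far can be summarized by the following skeleton, again in Python only as a sketch; the names and signatures are assumptions for illustration, and the real interface would of course be a Mesa interface.

from abc import ABC, abstractmethod
from enum import Enum

class Mode(Enum):
    READ = 1
    UPDATE = 2
    WRITE = 3

class BufferManager(ABC):
    @abstractmethod
    def get_page(self, t, file, page, mode: Mode):
        """Lock the page (or file) for transaction t in the given mode, return a
        buffer containing a copy of the page, and increase its share count by 1."""

    @abstractmethod
    def release(self, buffer):
        """Decrease the buffer's share count by 1; no locks are released."""

    @abstractmethod
    def upgrade(self, t, buffer, mode: Mode):
        """Request Update or Write access to a buffer already held in Read mode;
        may have to wait for other readers sharing the same buffer."""

    @abstractmethod
    def note_outcome(self, t, committed: bool):
        """Called by the transaction manager: write t's dirty pages to the file
        system before commit, or discard them on abort."""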
A database management system implementation places two additional requirements on the buffer manager. The first requirement is that the buffer manager support a notion of block that may be larger than the file system’s inherent page size, and be willing to buffer and transfer entire blocks instead of pages. Naturally a buffer must still consist of contiguous virtual memory, even if it is more than one page long. System R uses a block size of 4k bytes, and this is representative of commercial systems. This holds enough records that the technique of clustering of related records onto a single block (for instance, a "parent" record and all of its children) can be effective. It also gives a large fan-out from B-Tree index blocks, so that these indexes do not get too deep. The choice of block size is inevitably a compromise, and most existing buffer managers require that all buffers have equal size. If this requirement were relaxed, say to allow the buffer size to be a function of the file being buffered, the DBMS might well use different page sizes for different data structures.
The second requirement is for the buffer manager to accept pragmatic advice for read-ahead/write-behind decisions. Sequential search through a database is common. There are always queries that cannot be accelerated with an index, such as string pattern matching requests involving record fields. It is important to take advantage of sequential file access when it occurs, and it is easier for the client to inform the buffer manager than for the buffer manager to deduce the reference pattern.
DBMS using a file system’s transactions and physical locking
The Cedar DBMS is built on a buffer manager similar to the one just described. All locking and recovery is done implicitly as part of file I/O requests. This has the following advantages:
Simplicity. It is very easy to write recoverable actions; all locking and recovery issues are hidden by the buffer manager. An action makes a change directly in a buffer page by LOOPHOLEing a pointer to some word in a buffer’s interior into the correct Mesa type, and then storing through the pointer.
Low locking overhead. Since locks are set on relatively large physical units (files or pages), not many locks are requested by most transactions.
Distributed computation. The DBMS runs entirely on a user’s workstation; no DBMS-specific code lives in the file server, so the DBMS is not a bottleneck in the system. A query that only involves pages already fetched to the workstation places no demands on the network or the file server. The file server’s transaction machinery is sufficient to keep all of the remote buffer managers consistent.
It has the following disadvantages:
Extra disk I/O. Whenever a single byte of a page is written in a transaction, the buffer manager must write the entire page to the file system.
Low concurrency. Since locks are set on relatively large physical units (files or pages), two transactions may experience a lock conflict even if they are not operating on the same object.
Extra network I/O. The DBMS interface to the network is pages. Hence the amount of network traffic generated by a query is proportional to the amount of non-cached data looked at to evaluate the query, not the size of the query’s result.
The problem of extra disk I/O only becomes serious at high rates of update transactions, so it does not seem to be a pressing issue right now. The low concurrency problem is also serious for high rates of certain classes of transactions, but may become an issue sooner for another reason. In simple interactive database applications, users experience lock conflicts as delays (seeing someone else’s think time), or as transaction abortions that refresh the screen and force the user to reconsider any tentative updates. These problems can be "solved" by forcing all interactive applications to use short transactions, and to validate information at the application level before performing updates that a user has requested based on information from a previous transaction. From a logical point of view this is a great solution, but it adds complexity to applications; it amounts to application-level cache validation. By providing higher concurrency at a low level, we may make certain classes of interactive applications much easier to build. (At the same time we might explore the idea of system support for application-level cache validation.)
Access patterns that perform poorly with physical locking
Many database management systems support high-concurrency access (reading and updating) to certain important data structures, even though this means abandoning the use of a file system’s physical locks (on files or pages) to control concurrent accesses.
A structure that needs to support high-concurrency access, at least in some situations, is a B-Tree index. If transactions T1 and T2 each make an insertion that falls into the same leaf page of a B-Tree, and each independently writes this page to the file system, the file system will detect a conflict and disallow one or the other update (abort T1 or T2). Indeed, if the only point in common between the two transactions is the file server, there is little hope of avoiding this. But if the two transactions are being processed by the same instance of a DBMS, then all of the right information should be available to perform the two insertions without conflict.
Certain record accesses require high concurrency. For instance, in most database systems the catalog (the repository for data about data) is a hot spot. It is bad if a catalog entry update by one transaction (perhaps caused by a record insertion) causes a second transaction, which has read a different entry on the same page, to abort.
A third common pattern of access that is known to have bad concurrency properties is appending records to a log. A log is a simple, compact, reliable structure that is just right for a number of applications. Logs are very important in the world of commercial data processing: all financial transaction systems keep them. A log may be abstracted to something called a sequential-dependent record file. The relationship between records in such a file is a successor (and predecessor) function, whose ordering is determined by the time of record insertion (and not by any explicit key value in the record.) If insertion is implemented by finding the last page of the file, locking it, and creating the new record there, then the end of the file becomes a hot spot and concurrency is limited. A transaction generally requires at most that all its insertions appear adjacent in the sequential-dependent file after the transaction commits; it does not care about the identity of the record that precedes the first record it inserts. Hence a high-concurrency log implementation is possible as long as information is shared among the simultaneous transactions. (The IMS "Fast Path feature" implements high-performance sequential-dependent record files, and these play a role in raising IMS’ transaction rate from 5/sec to 100/sec for some commercially important transactions.)
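As an illustration of the idea, here is a highly simplified Python sketch of a sequential-dependent record file that avoids the end-of-file hot spot by remembering each transaction’s insertions separately and fixing their position only at commit. The in-memory list stands in for the real page-level structure, and every name shown is an assumption.

import threading

class SequentialDependentFile:
    def __init__(self):
        self.records = []              # committed records, in insertion order
        self.latch = threading.Lock()  # short-duration latch, not a transaction lock
        self.pending = {}              # transaction -> its uncommitted insertions

    def insert(self, t, record):
        # No lock on the last page: just remember the insertion for t.
        self.pending.setdefault(t, []).append(record)

    def commit(self, t):
        # All of t's insertions appear adjacent; their position relative to other
        # transactions' records is decided only now, under a brief latch.
        with self.latch:
            self.records.extend(self.pending.pop(t, []))

    def abort(self, t):
        self.pending.pop(t, None)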
Database systems support high-concurrency access to these classes of shared structures because it increases throughput (by decreasing lock wait and I/Os.) This is true in spite of the fact that high-concurrency access greatly complicates the DBMS implementation and costs extra CPU cycles when it is used. In cases when high-concurrency access is not required, the DBMS will generally prefer to use the file system’s physical concurrency controls (on pages or files) directly.
Fine-grained physical locking
It is possible for the file system to allow update of a byte range within a file, rather than update of an entire page or block. This is of limited usefulness to a database system. Of the three situations cited above, the only one that it may help with is high-traffic record access. It is limited to situations in which the mapping from logical update to physical update is relatively static. It also widens the buffer manager interface significantly.
My view is that a byte-range file access system is a simple data manager that should be built upon a page- or block-access file system, using the techniques described in the next section. I doubt that the number of clients for such a system is large, however.
3. Recoverable actions with logical locking
We now consider the construction of a DBMS with logical locking. We shall build on a transactional file system, as in the previous section, but will use file system transactions in a much different way.
Using a shared transaction for base I/O
Our basic strategy is to use a single transaction s for all page I/Os to the base of a given shared database. A transaction t that performs an update to this shared database will lock the updated portion for transaction t and write a log record describing the update and identifying transaction t. At some point (before or after t commits) we make the base update in the buffer pool under transaction s.
A CheckpointTransaction[s] call, which has the effect of committing updates made by transaction s and downgrading all of s’s write locks to read locks, is useful (in systems that don’t support this as a primitive we must commit s and start a new transaction, discarding the state of the buffer pool.) We checkpoint transaction s occasionally in order to commit changes to the base and thereby bound the amount of work required at restart. These checkpoints are taken only when no update actions are in progress. This restriction means that the committed base is always action-consistent: it is the correct result of performing some sequence of actions, without reference to transactions. It is this restriction, which greatly reduces the possible states of the base at restart, that gives us a benefit from the underlying transaction mechanism. Note that to satisfy the write-ahead log invariant, the log records for all updates made to the shared structure must appear on the disk before s is checkpointed. This invariant is easy to satisfy if the file system and DBMS logs are implemented as a single log, since s’s checkpoint will force the combined log to disk.
There are two specific versions of this strategy, which we call in-place update and deferred update. The in-place update strategy is to acquire a write lock, write the log record containing both undo and redo information, and then perform the update in the buffer pool, before t is committed. Because of the continuous flow of transactions, it is not practical to exclude the possibility of checkpointing s before t commits. Thus there is never any guarantee that the base, or even a single page of the base, is consistent with respect to any particular transaction that is using it; only locks prevent the multiple transactions from seeing each other’s updates.
It is possible to avoid logging redo information in the in-place update strategy by checkpointing s immediately before committing any transaction that made updates under s. As a practical matter this is probably a poor strategy, because the amount of extra log I/O and computation for the redo record is negligible compared to the I/O cost of checkpointing s.
The deferred update strategy is to obtain an update lock, write the log record containing redo information, and then perform the update in a volatile data structure (sometimes called the shadow.) During phase one of commit, we convert the update lock to a write lock, and some time after the start of phase two we perform the update in the buffer pool. (This gives, in effect, a built-in undo for shadow updates before commit.) In this case it is possible to serialize the updates of committing transactions and synchronize these updates with checkpoints to s, so that only transaction-consistent states of the base are committed. But there may be no advantage in serializing updates; it depends on the application.
In comparing the two variants, note that in-place update requires both undo and redo recovery routines, and uses a write lock from the time of update until commit. Deferred update requires only redo recovery, which may be significantly simpler than undo and redo, and holds a write lock only during phase one. The locking advantage is probably not significant. But the disadvantage of deferred update is special code to maintain the shadow data structure, and more costly lookups since the shadow structure must be consulted.
There seem to be applications for both techniques. Of the three examples listed above, B-Tree and record update seem well suited to in-place processing. But the sequential-dependent record file is probably better implemented using deferred update, since undo is very clumsy and the shadow data structure can be simply a list of log record IDs.
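The following Python fragment sketches the two strategies side by side, under the shared-transaction assumption of this section. The lock manager, log, buffer pool, and shadow structure shown are hypothetical interfaces introduced only for illustration.

def in_place_update(t, s, lock_mgr, log, buffers, obj, new_value):
    lock_mgr.lock(t, obj, mode="write")                      # held until t commits
    old_value = buffers.read(s, obj)
    log.append(t, obj=obj, undo=old_value, redo=new_value)   # log before base update
    buffers.write(s, obj, new_value)                         # update the base in place, under s

def deferred_update(t, s, lock_mgr, log, shadow, obj, new_value):
    lock_mgr.lock(t, obj, mode="update")
    log.append(t, obj=obj, redo=new_value)                   # redo information only
    shadow[(t, obj)] = new_value                             # volatile shadow structure

def commit_deferred(t, s, lock_mgr, buffers, shadow):
    # Phase one: convert t's update locks to write locks.
    for (t2, obj) in list(shadow):
        if t2 is t:
            lock_mgr.upgrade(t, obj, mode="write")
    # Phase two: perform t's deferred updates in the buffer pool, under s.
    for (t2, obj) in list(shadow):
        if t2 is t:
            buffers.write(s, obj, shadow.pop((t2, obj)))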
It appears that no tight coupling is required between a file system that provides transactions involving file pages and a DBMS that provides transactions involving database objects. (The only form of coupling that we have introduced is the assumption that DBMS log records pertaining to a particular file are forced to the disk in the process of committing a file system transaction on that file; this is a convenience but is not essential to the logic of the scheme.) It is interesting to note that System R, which uses the in-place update strategy described above, became tightly coupled to its file system. The System R designers now say that their file system is inappropriate for large databases ( > 100 mByte), but they cannot change it without a major rewrite of System R.
The recovery manager’s algorithm
After a crash, the recovery manager first recovers the file system to a state that is transaction-consistent with respect to file actions. In the process it determines the outcome (committed, aborted, or ready = finished phase one) of each transaction that has been active since the last file system checkpoint.
The recovery manager then restarts all resource managers (subsystems that write log records and perform undo and redo), and begins to interpret the log of DBMS actions. <During its restart a resource manager might acquire exclusive locks on files that it "owns">. First it undoes aborted transactions, in reverse chronological order. This means: for each undo log record, call the action recovery undo procedure, passing it a copy of the log record. Then it redoes the ready and committed transactions, in chronological order. This means: for each redo log record, call the action recovery redo procedure, passing it a copy of the log record. At the point in the log at which "ready" or "committed" is determined, the recovery manager informs each resource manager of this fact. <Is locking off during recovery, as in System R? This is a decision that each resource manager is free to make. Most of them will probably take this approach. I don’t yet understand the issue of locking and deadlock during undo.>
Recovery routines have all of the facilities of a transactional file system at their disposal. They may create, commit and abort file system transactions. <What about writing more log records? Probably a bad idea.>
Since recovery may fail, the same action (log record) may be redone or undone several times. But it never happens that an action is first redone and later undone, or vice-versa.
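The overall pass might look like the following Python sketch, in which the log records, outcome table, and resource manager interface are assumptions made for illustration.

def recover(log_records, resource_managers, outcomes):
    # outcomes: transaction id -> "committed", "ready", or "aborted"
    # (transactions in progress at the crash are treated as aborted).
    # 1. Undo aborted transactions, in reverse chronological order.
    for rec in reversed(log_records):
        if outcomes.get(rec.transaction_id) == "aborted" and rec.undo is not None:
            resource_managers[rec.resource].undo(rec)
    # 2. Redo ready and committed transactions, in chronological order.
    for rec in log_records:
        if outcomes.get(rec.transaction_id) in ("ready", "committed") and rec.redo is not None:
            resource_managers[rec.resource].redo(rec)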
Writing action recovery procedures (undo, redo)
The most difficult part of implementing fine-grained update seems to be in designing the data structure so that recovery can be performed from any action-consistent state. An action recovery procedure (undo, redo) must be idempotent: executing it (fully or partially) any number of times must have the same effect as executing it once.
A basic implementation technique is to version-number objects that are subject to updates. Log record IDs make good version numbers in this application. Let a data object d contain the ID of the log record for the most recent update applied to d; call this d’s update ID. Since applying an update modifies the update ID, an undo log record contains the previous value of the update ID.
During recovery, a redo recovery routine will first read the update ID of the referenced data object. If the update ID is less than or equal to the log record ID of the redo log entry, then the redo is applied, otherwise it is skipped. Similar logic applies to undo.
If all updates to an object are undo-logged, and log record IDs are used as update IDs, then the combination of object plus log has a very simple structure: the object points to the most recent update, which points to the previous (committed) update, etc. The entire history of the object can be found by traversing the log in this way. Because of locking, only the most recent update is possibly uncommitted and hence candidate for undoing. Undo pops the top element from the stack of log records.
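In a Python sketch (the object and record layouts are assumptions), the test described above looks like this:

def redo(obj, rec):
    # Apply the redo unless the object already reflects a later update.
    if obj.update_id <= rec.id:
        obj.value = rec.new_value
        obj.update_id = rec.id        # re-applying is harmless: same value, same ID

def undo(obj, rec):
    # Only the most recent update can be uncommitted, so undo applies only if the
    # object still carries this record's ID as its update ID.
    if obj.update_id == rec.id:
        obj.value = rec.old_value
        obj.update_id = rec.previous_update_id   # saved in the undo log record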
<Expand on problem of undo and redo when one log record ID per object is not reasonable. Is pure redo easier?>
Logical and physical recoverable actions for a single transaction
The point of view that emerges from the preceding discussion is that a DBMS action for a transaction t should, whenever possible, be performed by simply using file actions under transaction t. In cases where this leads to unacceptable performance, a shared transaction should be used for the file actions and logical locking and recovery procedures should be associated with the DBMS action. Note however that if t commits and the system crashes before the shared transaction commits, the file system will recover t’s updates that were done as file actions, but will not recover t’s updates that were done under the shared transaction. So the limitations of "action-consistent" must be understood by those who write logical update recovery procedures. Only updates covered by the shared transaction are guaranteed to be consistent. Hence recovery will be much simpler if it is based entirely on information that is only updated under the shared transaction. Data structures that are subject to these shared updates must be designed with this local recovery property in mind.
4. Recoverable actions in a file system implementation
We noted earlier that a transactional file system may be regarded as a simple DBMS. We now wish to discover what implementation techniques carry over directly from the general DBMS case to the file system case.
Our DBMS’ are built on file system transactions. Our file system is built upon page I/Os, perhaps duplexed (pages written twice on independent media) for redundancy. A duplexed page write behaves like a simple transaction, since a failure during the first write can be undone by reading the second page, and a failure during the second write can be redone by reading from the first page. A non-duplexed page write does not behave like a transaction, because there is no (local) way to recover from a failure.
In any case, file page writes are very limited transactions. Hence there is no hope of structuring the file system implementation along the lines of our DBMS that used the transactional file system for its recovery. Instead, the file system implementation must be structured along the lines of the DBMS with a shared transaction.
Basic recoverable file system actions
Four file system actions seem representative of the problems involved in the implementation. They are create file, destroy file, read page of file, and write page of file. The action set file size is analogous to create file if the file is being extended, and analogous to destroy file if the file is being contracted.
It is natural to implement create file using an in-place update strategy. The reason is that create file must acquire resources, and to commit a create file request we must be sure that the resources are truly available. A straightforward method to ensure this is to actually carry out the action, then undo it if the transaction aborts. For similar reasons it is natural to use a deferred update strategy for destroy file. If a destroy file action is aborted, then the "destroyed" file must be preserved. A straightforward method to ensure this is to defer the action (keeping a volatile record of having done it), then do it during phase two if the transaction commits.
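A Python sketch of these two strategies follows; the volume, log, and deferred-work structures are hypothetical and stand in for the real file system machinery.

def create_file(t, volume, log, name, size):
    pages = volume.allocate(name, size)          # acquire the resources immediately
    log.append(t, action="create", name=name, pages=pages)   # undo information
    return pages

def undo_create(volume, rec):
    volume.free(rec.name, rec.pages)             # give back exactly what create acquired

def destroy_file(t, log, deferred, name):
    log.append(t, action="destroy", name=name)   # redo information
    deferred.setdefault(t, []).append(name)      # volatile record of the intent

def commit_phase_two(t, volume, deferred):
    for name in deferred.pop(t, []):
        volume.free_file(name)                   # the file actually disappears only now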
The write page of file action has several possible implementations, each with different properties. Since reads dominate writes in nearly all applications, the write action cannot be evaluated without considering the read page of file action.
An in-place update write page of file implementation requires undo logging: the entire old contents of a base page must be logged before the base page is updated. This costs a base page read if the old contents of the base page are not in the file system buffer pool. In the simplest implementation a writer must wait until (1) the old base page contents are read, if necessary, (2) the old base page contents are written into the log, and (3) the new base page contents are written. So the real-time delay for a write is considerable in the obvious implementation. This can be improved by acknowledging the write immediately and actually processing it in background; this may be considerably more complex to implement (consider the synchronization at commit time and when a transaction reads a page that it has written.) Note that here, unlike the DBMS case, it may make sense to "checkpoint the shared transaction" (force all base page writes) before committing a transaction, since this avoids the cost of redo logging. If we follow this strategy then a write that is acknowledged before it completes may cause a real-time delay at commit time. Ignoring the problem of writes that are acknowledged before they are completed, the read page of file action is straightforward.
There are two possible deferred update implementations. The first is to use shadow pages for writes, log only pointers to the shadow pages, and swing the pointers after commit. This approach means that pages that are sequential in the file are not sequential on the disk; hence sequential access performance is poor. The complexity of this approach is also relatively high due to the management of shadow space (how do you avoid scavenging the disk to reclaim shadow space after a crash?)
The alternative deferred update approach is to log the entire new contents of a page before commit, and to perform all base page updates after commit. This requires a lookup in a volatile data structure for each page read, to see if the data is in the log instead of in the base, but generally the data will be in the base or be buffered in the log. It has the attractive properties that (1) once the log write has been buffered and the volatile map entry made, a write action may return, (2) commit requires only a log force, and no synchronization with base writes, (3) sequentiality of file pages is preserved. Like the pure undo scheme, three I/Os are required in the worst case: a log write, a log read, and a base write. For short transactions the log read will be buffered in the file system buffer pool and hence will be cheap.
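A sketch of this second scheme, in Python with invented interfaces (log.buffer, log.force, log.read, base.read, base.write), might look like this:

class RedoLoggedFile:
    def __init__(self, log, base):
        self.log, self.base = log, base
        self.uncommitted = {}   # (transaction, page) -> log record ID
        self.unapplied = {}     # page -> log record ID (committed, base not yet updated)

    def write_page(self, t, page, contents):
        # (1) Once the log write is buffered and the map entry made, the write returns.
        self.uncommitted[(t, page)] = self.log.buffer(t, page, contents)

    def read_page(self, t, page):
        rec_id = self.uncommitted.get((t, page), self.unapplied.get(page))
        if rec_id is not None:
            return self.log.read(rec_id)   # for short transactions, still buffered
        return self.base.read(page)

    def commit(self, t):
        self.log.force(t)                  # (2) commit requires only a log force
        for (t2, page) in list(self.uncommitted):
            if t2 is t:
                self.unapplied[page] = self.uncommitted.pop((t2, page))

    def apply_base_updates(self):
        # (3) Runs after commit; base pages are rewritten in place, preserving sequentiality.
        for page in list(self.unapplied):
            self.base.write(page, self.log.read(self.unapplied.pop(page)))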
Important special cases can be optimized in any of these approaches. For a file server an important special case is writing a file in the same transaction that created it. In this case there is no need to defer base page updates since the file will disappear if the transaction is aborted.
Recovering low level file structures (the VAM and VFM)
A file system contains certain permanent data structures that are not visible to the client. One is the volume allocation map (VAM), a mapping from disk page address to {free, occupied}. Another is the volume file map (VFM), a map from [file unique name, page in file] to disk page address. If updates to these maps are not recoverable then a potentially costly scavenging operation may be necessary when a file system crashes.
The file system can ensure that updates to these maps are recoverable by following a write-ahead log protocol on such updates. <Expand on implementation issues.>
File properties
File properties, such as the create date, should be built on top of file system transactions, just as DBMS objects are.
Log implementation
The log must be simple and robust. It is typically implemented as a single large file, allocated in consecutive disk pages (modulo bad spots.) If protection against single-page failures is needed, then the log must be duplexed. By saving log pages on tape before they are reused, we ensure that even old data can be recovered after single-page failures.
There are several arguments in favor of merging the logically separate file system and DBMS transaction logs into a single log. These include (1) simpler synchronization of log forces, (2) reduced disk arm motion, (3) only one resource to run out of, (4) no merging required when saving log to tape. There are undoubtedly situations in which each of these turns from an advantage to a disadvantage, but on balance we favor a single log.
Graceful recovery from crashes and media failures
Note the tradeoff between the detail of logging and the (potential) cost of undo and redo. Suppose that a write page action changes a single byte of a page. For redo, it is logically sufficient to simply log the page number, byte offset, and new byte value. But in the event of a crash the page may not be readable. So in this case redo must reconstruct the previous state of the page somehow (perhaps by playing the log forward from a backup copy of the file system), and then apply the update.
A better strategy, in terms of recovery time, is to redo log the entire contents of the page, with the update applied. Then redo simply rewrites the page. The number of bytes written to the log is larger for this strategy than for the previous example. But unless either the entire old contents or entire new contents of a page are logged before a page is updated, a failure in the middle of updating the page will require a recovery whose cost may be quite high. This is the motivation for full-page logging for all page updates.
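The contrast can be stated in code; the record shapes below are illustrative assumptions only.

from dataclasses import dataclass

@dataclass
class ByteDiffRedo:
    page: int
    offset: int
    new_bytes: bytes    # small record, but redo needs a readable old page to apply it

@dataclass
class FullPageRedo:
    page: int
    contents: bytes     # the entire new page contents, with the update applied

def redo_full_page(base, rec):
    # No need to read the (possibly damaged) page: simply rewrite it.
    base.write(rec.page, rec.contents)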
<Media failure: duplexed storage case; recovery from backup and log case. Unclear how to best organize system so that media failure does not make entire system unavailable (ok to delay the transaction that recovery is in behalf of.) Expand on issues involved in changing the mapping from [file, page] to disk page, especially for VAM and VFM "files".>
5. Other issues
Remote cache consistency
<Caching reduces load on network and file server and improves response. But updates are not broadcast to all caches. So server (who does see updates) is responsible for maintaining cache consistency, through locking and transactions.>
<A plausible page-level protocol based on file version numbers and recent update history.>
<Cache consistency between two agents of the same transaction -- a client problem.>
Transaction save points
<How they are used (and not used) in System R. How to implement them in a log-based system.>
Broken locks
<How they can be used to hide transaction aborts from higher-level clients.>
<Multi-client transactions are a necessity for a scheme with broken read locks and lock releasing. Explain why.>
6. References
Jim Gray, Paul McJones, Mike Blasgen, Bruce Lindsay, Raymond Lorie, Tom Price, Franco Putzolu, and Irving Traiger, "The Recovery Manager of a Data Management System," IBM San Jose Research Report RJ 2623. (This very nice paper describes the System R recovery manager. It does not give a very coherent explanation of the relationship between System R transactions and the transactions provided by its underlying file system, or explore alternatives to the many design choices made in System R.)
Raymond A. Lorie, "Physical Integrity in a Large Segmented Database," ACM Transactions on Database Systems 2, 1 (March 1977), p. 91-104. (Describes the interface and implementation of the file system used by System R.)
Mark Brown, [Ivy]<Alpine>Doc>LockConcepts0.bravo. (Describes the scope of the lock manager for file and database systems.)
Butler Lampson, [Ivy]<CSL-Notebook>Entries>81CSLN-0001.press. (Describes replicated commit.)