DRAFT -- September 20, 1988 1:26:00 pm PDT -- DRAFT
Internal Memo
To: Distribution
From: Bob Hagmann, PARC/CSL
Subject: Yggdrasil Concurrency Control and Crash Recovery Design
Date: September 20, 1988
Introduction
This memo specifies the overall design for concurrency control and crash recovery for Yggdrasil. Much of this memo has been superseded by the use of Camelot. For example, logging, checkpoint, and fast restart are not under our control.
The remaining sections of this memo are as follows:
Requirements
These requirements have mostly been gathered from other memos.
Transactions, services and clusters
Locking
Alerters
Logging
Checkpoint
Fast restart
Interrelationship with archive
Snapshot backup
Incremental backup
Faking mirrored disks - parity disks
Requirements
Locking
- A document (object) is the natural unit of locking
  ..., but allow for finer granularity locking later
- Locks come in three flavors:
  browse lock: this lock shares with any other lock. It is the weakest of all the locks. Alerters can be fired off of browse locks when a stronger lock is granted.
  read lock: locks the object so that it cannot be modified.
  write lock: locks the object for modification. Does not share with read lock. Does share with browse lock, but commit of the transaction breaks the browse lock.
Alerters (send a message when a system event occurs)
- Simple triggers on lock, insert, delete, read, or update
- Trigger queues a message for some (probably) external server
- Some are persistent
- Uses:
Screen refresh due to update of a document being viewed
Triggering of services (e. g., automatic indexing of a new document)
Transactions
- They are supposed to be short
- There will be "Begin Transaction", "Commit Transaction", and "Abort Transaction" primitives
Issues
- Long term locking (e.g., check-in and check-out)
Checkpoints
Logging
Data that must survive crashes includes
free page map
name and attribute metadata
object roots
location of buffered copies of optical disk contents
committed transaction table
Archiving
- Backup and archiving must be coordinated so that the archive can become the only storage site for an object (or an object run)
Backup
- Fast recovery from the total failure of a single magnetic disk
- Good recovery of a failed sector
- Transaction consistent backup
- Unattended operation
Transactions, services and clusters
Yggdrasil has both services and clusters. A service is an instance of Yggdrasil. A cluster is a set of front end machines that together perform several services.
Yggdrasil will have a transaction model. Transactions will be local to a cluster; full distributed transactions will not be supported. The standard "Begin Transaction", "Commit Transaction", and "Abort Transaction" primitives will be implemented. We should plan for nested transactions, but delay their implementation.
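A minimal sketch of what these primitives might look like follows; the names and signatures are illustrative assumptions, not the actual Yggdrasil or Camelot interface.

/* Illustrative transaction primitives; names and types are assumptions. */

typedef long XactID;
typedef enum { XACT_COMMITTED, XACT_ABORTED } XactOutcome;

XactID      BeginTransaction(void);           /* transaction is local to the cluster */
XactOutcome CommitTransaction(XactID xact);   /* commit may fail, yielding an abort  */
void        AbortTransaction(XactID xact);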
Locking
The natural unit of locking is the object. This will be the basic unit of locking for the initial version of the server. We may extend this later to lock individual properties or pages of objects, so nothing in the detailed design should preclude finer locking.
When an object is locked, the following are locked:
the contents
all attributes
``out'' links
The ``in'' links are not locked, nor are any objects that the object links to. If an attribute's value is a reference, the attribute and its value are locked, but not the object the reference points to.
Locks come in three flavors:
browse lock
This lock shares with any other lock. It is the weakest of all the locks. Alerters can be fired off of browse locks when a stronger lock is granted. Browse locks break immediately if a stronger lock is needed by another transaction.
read lock
Locks the object so that it cannot be modified.
write lock
Locks the object for modification. Does not share with read lock. Does share with browse lock, but commit of the transaction breaks the browse lock.
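The sharing rules above can be summarized in a small compatibility table. The sketch below is illustrative only; the mode names and table encoding are assumptions, not part of the design.

/* Illustrative lock-compatibility check for the three lock flavors.
   Browse shares with everything; read shares with browse and read;
   write shares only with browse (and commit then breaks the browse
   lock).  All names are hypothetical. */

typedef enum { LOCK_BROWSE, LOCK_READ, LOCK_WRITE } LockMode;

static const int compatible[3][3] = {
    /*             BROWSE  READ  WRITE */
    /* BROWSE */ {    1,     1,    1   },
    /* READ   */ {    1,     1,    0   },
    /* WRITE  */ {    1,     0,    0   },
};

int
LocksCompatible(LockMode held, LockMode requested)
{
    return compatible[held][requested];
}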
Alerters
An alert is a message that is sent when some system event occurs. Only simple events related to locking will be supported. Alerters can be set on lock, insert, delete, read, or update of an object. Some small amount of client data can be included in the alerter (the system provides at least the DID and transaction ID). The message is sent either to a specific external client or to a class of servers.
Alerters can have a lifetime of a transaction, an incarnation of the server, an incarnation of the remote client, or be persistent over both client and server crash.
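To make this concrete, an alerter registration might carry roughly the following information. This is a sketch only; the field names and the event and lifetime encodings are assumptions.

/* Sketch of an alerter registration record.  All names are hypothetical. */

typedef enum {
    EVENT_LOCK, EVENT_INSERT, EVENT_DELETE, EVENT_READ, EVENT_UPDATE
} AlertEvent;

typedef enum {
    LIFE_TRANSACTION,         /* dies with the transaction            */
    LIFE_SERVER_INCARNATION,  /* dies when the server restarts        */
    LIFE_CLIENT_INCARNATION,  /* dies when the remote client restarts */
    LIFE_PERSISTENT           /* survives both client and server      */
} AlertLifetime;

typedef struct {
    AlertEvent    event;
    AlertLifetime lifetime;
    long          did;              /* object the alerter is set on           */
    char          destination[64];  /* specific client, or a class of servers */
    char          clientData[32];   /* small amount of client data; the       */
                                    /* system adds at least the DID, xact ID  */
} Alerter;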
Logging
Skip this section. We are using Camelot.
Logging is the basic recovery method. It has to be fast, have low overhead, and admit fast crash recovery. It is deceptively simple, yet complicated.
The log is a circular disk file. It is an undo-redo write-ahead log. Items in the log are typed; type handlers are registered for the do and undo actions. Items can be added to the log with or without force. With force, the procedure call does not return until the item (and all previous items) are stably recorded. It is up to the type implementor to do either logical or physical logging and recovery.
The log is an undo-redo log, but in normal operation it is intended to function as a redo log. Only when buffer space becomes critical will an undo record be written and an uncommitted buffer be written to disk. The checkpoint record will indicate whether undo processing is needed. The implementor can be fairly conservative about whether undo processing is needed (e.g., the existence of undo records may be kept at a fairly gross level, say for 10% of the log). Since undo records will only be read when the server is under great stress, which we do not expect, we can do one-pass recovery.
The details of the log format and writing will not be covered here. They should be mostly straightforward and simply (but carefully!) handled by the implementor.
In this paper, we will assume that the log will (usually) be a redo-only log. All changed pages, except for those of large newly created objects, will (almost) never be written ``home'' before transaction commit. This (often) limits the size of a transaction's updates, but memories should be large, so the restriction will not matter much in practice.
One piece of advice: waste space to get alignment. This can avoid copies, and space should not be the issue anyway.
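A rough sketch of the log interface implied above follows. The procedure names and record layout are assumptions for illustration (and are moot now that Camelot owns the log).

/* Sketch of a typed, write-ahead log interface.  All names are hypothetical. */

typedef struct {
    int  type;     /* index of the registered type handler            */
    long xact;     /* owning transaction                              */
    int  length;   /* bytes of type-specific data that follow, padded */
                   /* for alignment                                   */
} LogRecordHeader;

typedef void (*DoProc)(const LogRecordHeader *rec, const void *data);   /* redo */
typedef void (*UndoProc)(const LogRecordHeader *rec, const void *data); /* undo */

/* The type implementor registers handlers and decides whether the data
   is logical or physical. */
void RegisterLogType(int type, DoProc doIt, UndoProc undoIt);

/* Append a record.  If force is nonzero, do not return until this record
   and all previous records are stably recorded. */
void AppendLogRecord(const LogRecordHeader *rec, const void *data, int force);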
What gets logged and how?
The items marked with ?? are those that further thought or implementation decisions may change.
large object creation
Directly write the object root, attributes, and contents to their final locations on previously free pages (and duplicate them to the backup buffer). Only log updates to partial pages (physical page undo-redo) and transaction commit. The transaction can't commit until the direct writes and the backup buffer writes complete. [This is a variant of the new-file and high-water-mark trick Alpine does.]
The reason to do this is to avoid writing the data four times. If it was originally logged, then it would be written there twice, then once more to the ``home'' location, and finally to the backup buffer. By writing directly ``home'' to disk and to the backup buffer, two writes are avoided. This optimization is not worth it if the object is small.
object roots, attributes, and contents (except for large object creation)
Physical page undo-redo.
With physical page logging, we can use the log as a way to load up the memory cache quickly while doing few disk writes.
free page map
Logical run undo-redo (??)
Logical updates probably provide for more concurrency.
name metadata
Physical page undo-redo (this is just a special attribute)
Do the same thing as for object roots.
location of buffered copies of optical disk contents
Physical page undo-redo.
Fast recovery.
committed transaction table
Physical page undo-redo (??)
Fast recovery.
DID map cache
Physical page undo-redo (??)
Fast recovery.
Log wrap-around
Use Log Record Forwarding to preserve data. Avoid forwarding superseded pages.
How Alpine does logging and recovery
Redo stuff is logged after the previous copy of the data is forced to disk (isn't this exactly wrong? doesn't this do the exact wrong thing for hot spots?). Checkpoints are wimpy: they only seem to have a pointer to the oldest interesting record.
Initialize
To recover, read from the recovery file the address of the checkpoint record. Read up the checkpoint (why couldn't this just have been saved in the restart file?). Set the scan pointer to where the checkpoint says it should go.
Analysis
Scan the relevant part of the log. Observe transaction start/commit/abort records and build a transaction fate table. Look at file sizes.
Recovery
Scan the relevant part of the log. For committed transactions, apply the changes (this does disk I/O, I think). Worry about file sizes, high-water-marks, and all that.
Checkpoint
Camelot won't let data servers participate in checkpoint. Ignore this for now.
Periodically take a checkpoint at the end of an epoch (and hence epochs are defined in terms of checkpoints). Checkpoints are fairly heavyweight. They have aborted/active/committed transaction tables. They also point at the checkpoint (epoch) where recovery must start.
At the time of a checkpoint, a priority queue is built of modified buffers that have not yet been written "home" to disk. "Home" means the primary copy on disk plus the copy in the backup buffer. The system keeps a parameter for how many checkpoints to age a buffer before it is written to disk. All buffers that were modified before this time and have not been re-modified are put on the queue. The queue is ordered by the epoch of each buffer's most recent modification, oldest first.
The epoch where recovery must start is at least as old as the oldest epoch in the priority queue.
The priority queue is then processed. Processing stops at the time of the next checkpoint.
If the system is doing well, then there will not be a great deal of writing. In this case, the I/O subsystem can easily absorb the writes. The write-behind process spaces the writes evenly over the time period until the next checkpoint.
When the delay time between initiating writes would be zero, or when the system is in danger of filling its log, multiple processes will be used. Care should be taken not to overly stress the system with writes. This can cut normal-operation performance dramatically, since normal operation also needs to read pages and write the log.
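A sketch of this write-behind pass follows; the buffer and queue types, the helper procedures, and the parameter names are all hypothetical stand-ins (and, again, Camelot owns checkpointing).

/* Sketch of the checkpoint-time write-behind pass.  The Buffer type, the
   priority-queue procedures, WriteHome, and Pause are hypothetical stand-ins
   declared extern; the aging policy and the even spacing of writes over the
   inter-checkpoint interval are the points being illustrated. */

typedef struct Buffer {
    long epoch;               /* epoch of most recent modification   */
    int  dirty;               /* modified but not yet written "home" */
    /* ... page contents, disk address, backup buffer address ...    */
} Buffer;

typedef struct PriorityQueue PriorityQueue;   /* ordered oldest epoch first */
extern PriorityQueue *NewPriorityQueue(void);
extern void           Enqueue(PriorityQueue *q, Buffer *b, long key);
extern Buffer         *Dequeue(PriorityQueue *q);
extern int             QueueLength(PriorityQueue *q);
extern void            WriteHome(Buffer *b);  /* primary copy plus backup buffer */
extern void            Pause(long milliseconds);

extern long currentEpoch;
extern int  agingCheckpoints;   /* parameter: checkpoints to age a buffer */

void
CheckpointWriteBehind(Buffer *buffers[], int nBuffers, long msToNextCheckpoint)
{
    PriorityQueue *q = NewPriorityQueue();
    int i, n = 0;

    for (i = 0; i < nBuffers; i++)
        if (buffers[i]->dirty &&
            currentEpoch - buffers[i]->epoch >= agingCheckpoints) {
            Enqueue(q, buffers[i], buffers[i]->epoch);
            n++;
        }

    /* Space the writes evenly over the interval to the next checkpoint so
       the I/O subsystem can absorb them; if the delay would be zero, or the
       log is in danger of filling, switch to multiple writer processes. */
    while (QueueLength(q) > 0) {
        WriteHome(Dequeue(q));
        Pause(n > 0 ? msToNextCheckpoint / n : 0);
    }
}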
Fast restart
Camelot does restart. Ignore this for now.
Since the log is (usually) a redo log with mostly physical page logging, recovery can be blindingly fast.
Stably record the start of the log and the address of the last checkpoint. On recovery, read these up and start reading the log just past the checkpoint. Parse the log forward to find out the fate of transactions.
If undo is needed (a rare event except in memory-poor servers), then read the log backwards, processing undo items. Follow the same idea as in the next paragraph for undo.
Start at the start of the log and read it in big chunks (megabyte-ish) using double buffering with read-ahead. Parse the log forward. When a physical page redo record is found, blt it into the buffer pool of the database. That's it. Don't write it ``home'' to disk!!! If the buffer pool fills up, write buffers based on their LRU ordering in the log. This is recovery, so only write ``big'' things (unless the system is really out of room).
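A sketch of that redo scan follows. Every type and procedure in it is a hypothetical stand-in; the point is that physical-page redo records of committed transactions are block-transferred straight into the buffer pool, and nothing is written home during recovery.

#include <string.h>

/* Sketch of the one-pass redo scan.  All types and procedures except memcpy
   are hypothetical stand-ins declared extern. */

enum { PAGE_SIZE = 4096, TYPE_PHYSICAL_PAGE_REDO = 1 };

typedef struct { int type; long xact; long page; char image[PAGE_SIZE]; } RedoRecord;
typedef struct { char contents[PAGE_SIZE]; int dirty; } PoolBuffer;
typedef struct LogChunk LogChunk;     /* megabyte-ish, double buffered  */
typedef struct FateTable FateTable;   /* built during the analysis pass */

extern LogChunk   *ReadNextChunk(long *scanPointer);   /* with read-ahead */
extern RedoRecord *NextRecord(LogChunk *chunk);
extern int         Committed(const FateTable *fates, long xact);
extern PoolBuffer *GetBufferForPage(long page);   /* may evict by log-LRU order */

void
RedoScan(long scanStart, const FateTable *fates)
{
    long scanPointer = scanStart;
    LogChunk *chunk;
    RedoRecord *rec;

    while ((chunk = ReadNextChunk(&scanPointer)) != NULL) {
        while ((rec = NextRecord(chunk)) != NULL) {
            if (rec->type != TYPE_PHYSICAL_PAGE_REDO) continue;
            if (!Committed(fates, rec->xact)) continue;
            PoolBuffer *b = GetBufferForPage(rec->page);
            memcpy(b->contents, rec->image, PAGE_SIZE);  /* the blt */
            b->dirty = 1;     /* do NOT write it home during recovery */
        }
    }
}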
Logical recovery is more complicated and will require more sophisticated processing than a blt. Try to have recovery never need to go to the disk (except to read the log).
To recover the free page map, load the disk-resident copy of the free page map into memory. As updates are seen in the log, apply them to the memory bitmap. Use the default buffering for the bitmap. The bitmap should not normally be written during recovery.
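As an illustration, applying a logical free-run record to the in-memory bitmap might look like the sketch below; the record layout and names are assumptions.

/* Sketch of applying a logical free-page-run record to the in-memory bitmap
   during recovery.  The record layout and names are assumptions. */

enum { RUN_FREE = 0, RUN_ALLOCATE = 1 };

typedef struct {
    int  op;         /* RUN_FREE or RUN_ALLOCATE                */
    long firstPage;  /* first page of the run                   */
    long count;      /* number of consecutive pages in the run  */
} FreeRunRecord;

/* bitmap[] has one bit per page; 1 means free.  It is loaded from the
   disk-resident copy before the log scan and is not written back during
   recovery. */
void
ApplyFreeRun(unsigned char *bitmap, const FreeRunRecord *rec)
{
    for (long p = rec->firstPage; p < rec->firstPage + rec->count; p++) {
        if (rec->op == RUN_FREE)
            bitmap[p / 8] |= (unsigned char)(1 << (p % 8));
        else
            bitmap[p / 8] &= (unsigned char)~(1 << (p % 8));
    }
}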
Interrelationship with archive
As objects are modified and created, updates only occur to magnetic disk and main memory. Archival storage is not involved. As the magnetic disk and main memory start to fill with ``dirty'' data, some objects have to be written to the optical disk jukebox backend (tertiary storage). Once written and backed up (to quaternary storage), these objects may be removed from magnetic disk and main memory. [From YggdrasilStorageDesign.tioga, section Interoperation of optical disk subsystem with full system.]
Snapshot backup
Periodically, the disks should be read in their entirety and written to a backup device. This provides a good starting place for disaster recovery, as well as a check that the pages are all good on disk.
The scheme proposed here does not write a transaction-consistent copy of all the disks taken together. This is OK: each disk is individually transaction-consistent. All disks can have the incremental recovery applied to reach the last committed transaction state. Hence, incremental backup must work for the snapshots to be meaningful. The system can never be reloaded without also running incremental recovery.
Read the disk
Make sure that all the sectors on the disk are readable. Fix bad sectors from backup.
Get the disk into a consistent state
Write a marker in the log. Process the log starting at the end and copy committed data onto disk. Only copy the last committed value for a page as of the time of the marker write (memory-resident data structures should make this somewhat easy).
After the marker is written, newly committed transactions cannot write ``home'' to disk. Data is kept only in the log and memory. Newly created files are written to the log: they do not use the ``just write it home, don't log it'' optimization. If the log is too short, there will be lots of transaction aborts. Eventually, the server will be unusable since the log will fill up.
Maybe the server has to run in degraded mode during a snapshot: no large transactions will be possible. If the server crashes during a snapshot, the snapshot is lost.
Blast the bits at the backup device
Go like the blazes!
Start everything back up
Allow writes ``home'' to disk.
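The steps above, in order, amount to something like the following sketch; each procedure is a hypothetical stand-in for the corresponding step.

/* Sketch of the snapshot-backup sequence.  All procedures are hypothetical
   stand-ins for the steps described in the text. */

extern void VerifyAllSectorsReadable(int disk);      /* fix bad sectors from backup   */
extern void WriteSnapshotMarker(int disk);           /* marker in the log             */
extern void DisableHomeWrites(int disk);             /* updates stay in log + memory  */
extern void FlushCommittedPagesAsOfMarker(int disk); /* last committed value per page */
extern void CopyDiskToBackupDevice(int disk);        /* "go like the blazes"          */
extern void EnableHomeWrites(int disk);

void
SnapshotDisk(int disk)
{
    VerifyAllSectorsReadable(disk);
    WriteSnapshotMarker(disk);
    DisableHomeWrites(disk);     /* large new objects are logged, not written home */
    FlushCommittedPagesAsOfMarker(disk);
    CopyDiskToBackupDevice(disk);
    EnableHomeWrites(disk);
}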
Incremental backup
The deltas between snapshots are the incremental backups. Once incremental backup starts, the system is committed to completing it (unless it is reloaded from backup).
Get the backup buffer full of pages to write
Write a marker in the log. Process the log starting at the end and copy committed data into the backup buffer. Only copy the last committed value for a page as of the time of the marker write (memory-resident data structures should make this somewhat easy).
After the marker is written, newly committed transactions write their extra copy of pages to part of the backup buffer not seen by this incarnation of backup. At the end of backup, the pages just backed up are freed.
Write the backup buffer to the backup device
The trick is to optimize recovery without hurting normal operation or backup too much.
As backup occurs, the DID map is updated to reflect the backup location of data. Hence, when backup completes the DID map is capable of finding backed-up copies of pages. Note that the DID map itself is not backed up yet. This may seem anomalous, but it is really OK. If a page is lost, then the DID map is used to find it in backup. The current DID map is what you want. If a disk has to be reloaded, we can incrementally update the DID map as we run reload.
Disk backup (optical or magnetic)
This will be the first backup system.
Per disk, write deltas to the backup device so that all the bits for a disk are together in chunks. Disks are nice here, since we can position them randomly.
The backup data per disk is kept together, or in large chunks, so that it can quickly and easily be read during recovery from drive failure. Ideally, the operations staff will have a pre-formatted disk ready to go. All we have to do is stream in the backup data and write it out. Extensive buffering and write planning should make this go quite fast (e.g., recover at a good fraction of the transfer rate).
Tape backup (optical or magnetic)
This may be implemented later.
Data is streamed out to tape. No attempt is made to keep it in any order. Backup is slow.
Finish up
Write an inverse marker. Free backup buffer pages just written to backup.
Faking mirrored disks - parity disks
This is an idea that I want to record. It is motivated by the solution to a problem one of my CS 240B students at Stanford had last year. Garth Gibson talked about a similar idea when he visited us in October, 1987. Peter Crean told me that a disk manufacturer, Maxtor I think, makes a disk drive combination that does this in the controller. They package five drives together, and give you the capacity of four.
What you want to recover quickly from is the loss of a disk. Mirrored disks work, but waste hardware. The problem is to get the same effect without wasting as much hardware. We assume that writes do not take up much of the disks' time.
For every eight disks, add one more disk: the parity disk. [The value of eight is chosen here as an example; the value one is just mirrored disks.] Maintain the invariant that sector n on the parity disk can be computed by exclusive or-ing the values of sector n from the eight disks.
Initial computation of the parity disk's sectors is easy. To write page n on disk d, fetch page n on disk d and on the parity disk. Exclusive-or into the parity disk's sector the old value of page n on disk d, and then exclusive-or in the new value of page n on disk d. Atomically write the new value of page n on disk d and the new value of the sector on the parity disk. There are some subtleties, which will be ignored here, about how to do the update. We can use logging to solve these problems.
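A sketch of that update path follows, assuming a hypothetical sector size, I/O procedures, and an atomic pairwise write (which the logging mentioned above would provide).

/* Sketch of updating a data page and the parity page together.  SECTOR_SIZE,
   the I/O procedures, and the atomic-write mechanism are all assumptions. */

enum { SECTOR_SIZE = 512 };

extern void ReadSector(int disk, long sector, unsigned char *buf);
extern void AtomicWritePair(int dataDisk, int parityDisk, long sector,
                            const unsigned char *newData,
                            const unsigned char *newParity);

void
WriteWithParity(int dataDisk, int parityDisk, long sector,
                const unsigned char *newData)
{
    unsigned char oldData[SECTOR_SIZE], parity[SECTOR_SIZE];
    int i;

    ReadSector(dataDisk, sector, oldData);
    ReadSector(parityDisk, sector, parity);

    /* XOR out the old value, XOR in the new one; the invariant that the
       parity sector is the XOR of sector n on all the data disks holds. */
    for (i = 0; i < SECTOR_SIZE; i++)
        parity[i] ^= oldData[i] ^ newData[i];

    AtomicWritePair(dataDisk, parityDisk, sector, newData, parity);
}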
What happens if a disk fails? Spare in an empty disk and start reading from the other disks. Compute the value for a sector as the exclusive-or of all the other disks' sectors at the same address. Keep a horizon of where recovery has passed. If a request is received for data past the horizon, read up the needed sectors from all the disks, do the exclusive-ors, and put the result into the buffer pool. Do something similar if a write occurs. After a long time (like two hours on a gigabyte disk transferring at 1 megabyte a second with eight disks using a parity disk), the spare has been rebuilt without human intervention or hardware repair. Note that the parity disk is recovered the same way as the normal disks.
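A sketch of that rebuild loop, with hypothetical group size, sector size, and I/O procedures:

/* Sketch of rebuilding a failed disk onto a spare by XOR-ing the surviving
   disks, sector by sector.  Group size, sector size, and the I/O procedures
   are assumptions; the parity disk is rebuilt the same way as a data disk. */

enum { GROUP_SIZE = 9, SECT_SIZE = 512 };   /* 8 data disks + 1 parity disk */

extern void ReadSect(int disk, long sector, unsigned char *buf);
extern void WriteSect(int disk, long sector, const unsigned char *buf);

void
RebuildDisk(const int survivors[GROUP_SIZE - 1], int spare, long nSectors)
{
    unsigned char acc[SECT_SIZE], buf[SECT_SIZE];
    long s;
    int d, i;

    for (s = 0; s < nSectors; s++) {          /* s is the recovery horizon */
        for (i = 0; i < SECT_SIZE; i++) acc[i] = 0;
        for (d = 0; d < GROUP_SIZE - 1; d++) {
            ReadSect(survivors[d], s, buf);
            for (i = 0; i < SECT_SIZE; i++) acc[i] ^= buf[i];
        }
        WriteSect(spare, s, acc);
    }
}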