Alpine backup
CSL Notebook Entry
To: Interested parties        Date: January 15, 1982
From: Mark Brown        Location: PARC/CSL
Subject: Alpine backup        File: [Ivy]<Alpine>Doc>AlpineBackup.bravo
XEROX
Attributes: informal, technical, Database, Distributed computing, Filing
Abstract: This memo proposes a design for Alpine’s backup component.
Requirements for Alpine’s backup component
Alpine’s backup component must have the following two properties:
1) Perform backup and recovery. An Alpine server must survive any single storage medium failure without permanent loss of information. Since most files in an Alpine server are singly-recorded, the backup component must provide an extra degree of redundancy to make recovery possible.
An Alpine server must also recover from larger failures, as might be caused by a major earthquake or an operator error. In case of such failures some information will surely be lost, but we can recover to a recent (a few days old) transaction-consistent state. (Any data that is distributed across servers will be inconsistent after such a recovery, but local consistency seems preferable to no consistency at all.)
Recovery from the failure of an Alpine server should be reasonably quick, convenient and foolproof. Repairing a single smashed file page should take a couple of hours, at most; reloading an entire volume or system may take longer.
2) Consume few resources. The backup component should disrupt the normal functioning of an Alpine server as little as possible. Disruption ranges from degrading the Alpine server performance to aborting transactions unnecessarily to making the Alpine server completely unavailable. We prefer a form of backup that does not involve taking the server down. As much as possible, the backup component should use resources only during slack periods.
The backup component must tie up a tolerably small amount of hardware, such as disk or tape drives, and disk packs or tape reels.
According to our definition of the item "Perform backup and recovery", the backup component must be capable of listing the identities of all files in the file system, of retrieving the contents of any file in the file system, and so forth. Thus no matter how the backup component is implemented, its behavior cannot be distinguished from that of a file system (except for performance.)
In some environments, it is attractive to provide backup in just this way: by replicating the file system. Each disk has a mirror copy: a write goes to both disks, and a read goes to either disk. When this scheme is carefully implemented it has performance advantages over a normal file system, since reads always predominate over writes and doubling the number of disk arms improves the performance of reads. Recovery from disk errors is very straightforward with this scheme, so errors barely degrade performance and normal operation of the system can continue during recovery. But doubling the number of disk drives in a file system is usually inconsistent with the item "Consume few resources." This is why we consider other ways of doing backup.
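As a point of reference, the mirrored scheme amounts to something like the following sketch; the Disk objects and their error behavior are assumptions made only for illustration.

    # Minimal sketch of the mirrored-disk scheme described above (hypothetical
    # Disk objects with read/write methods that may raise IOError).
    class MirroredDisk:
        def __init__(self, disk_a, disk_b):
            self.disks = (disk_a, disk_b)

        def write(self, page, value):
            for d in self.disks:          # a write goes to both disks
                d.write(page, value)

        def read(self, page):
            for d in self.disks:          # a read goes to either disk
                try:
                    return d.read(page)
                except IOError:
                    continue              # recover by reading the mirror copy
            raise IOError("both copies of page %d are unreadable" % page)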
Alpine logs and backup
Alpine supports a notion of transaction to control consistent access to and updating of files. After a crash, our goal is to restore the file system to a state that reflects the execution of all committed transactions (in the order in which they executed), and does not reflect the execution of any aborted transaction. A crash aborts in-progress transactions.
Alpine implements transactions using a redo log. Alpine’s log contains a complete history of update activity: typical log records are create file F containing P pages, destroy file F, set size of file F to be P pages, and write page P of file F to be V. The updates of a transaction are recorded in the log before commit, and are reflected in the file representation only after commit. (This is a slight distortion of reality, but really is true for page writes.)
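To make the record types concrete, here is a minimal sketch; the type names and fields are illustrative assumptions, not Alpine’s actual interfaces.

    # Hypothetical sketch of the redo log record types named above; field names
    # and representation are assumptions, not Alpine's actual interfaces.
    from dataclasses import dataclass
    from typing import Union

    @dataclass
    class CreateFile:
        file: int        # file F
        pages: int       # initial size, P pages

    @dataclass
    class DestroyFile:
        file: int        # file F

    @dataclass
    class SetSize:
        file: int        # file F
        pages: int       # new size, P pages

    @dataclass
    class WritePage:
        file: int        # file F
        page: int        # page number P
        value: bytes     # new page value V (512 bytes)

    LogRecord = Union[CreateFile, DestroyFile, SetSize, WritePage]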
Though the transaction redo log is logically infinite, the server stores only a bounded tail of the infinite log; we shall call this tail the online log. The online log must be doubly recorded to protect against single storage medium failure. Without a doubly-recorded log, the server is vulnerable to the loss of a log record between the time the record’s transaction commits and the time that the record’s action (e.g. a file write) is carried out after commit.
It is also natural to consider using the online log to drive updates to the backup file system, keeping it consistent (to within some bounded delay) with the primary file system. Two copies of an online log record are required until the primary file system has been updated; one copy of an online log record is required until the backup file system has been updated; then the online log record can safely be discarded. From this viewpoint the primary and backup file systems look very similar, as we earlier argued they should. The remaining asymmetry is mainly caused by the fact that only a portion of the backup file system can be kept online at all times.
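As a minimal sketch of this retention rule (the position markers and their names are assumptions):

    # Sketch of the online-log retention rule described above.  record_pos is a
    # log position; primary_applied and backup_applied mark how far the primary
    # and backup file systems have been brought up to date (hypothetical names).
    def copies_required(record_pos: int, primary_applied: int, backup_applied: int) -> int:
        if record_pos > primary_applied:
            return 2   # needed to redo the primary file system after a crash
        if record_pos > backup_applied:
            return 1   # needed to bring the backup file system up to date
        return 0       # reflected in both file systems; safe to discard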
This is a sound way of thinking about backup, but it is far from a complete design. Consider the following two complications. First, some clients of Alpine do not require, and cannot afford, Alpine’s logging of each file page update. Some of these clients can afford the less complete protection against failures that the backup component provides. These clients are willing to accept the extra cost of logging the fact that a certain file has been updated, but not the cost of logging the new value for each page of the file. We need a strategy for backing up these files.
Second, there is the problem of actually designing data structures for the backup file system so that it can be updated with only one volume online. An obvious design is to structure the backup file system as a copy of the log records written by successful transactions. This design is optimized in favor of fast writing, which is good since we seldom read the backup file system. The problem with this design is that the size of the backup file system grows without limit, as does the time required to recover from backup. We must add more structure to the backup file system to bound its storage requirement by some small multiple of the primary file system’s.
A backup file system design
We shall assume that the backup file system consists of a set of backup volumes, which are normal Alpine volumes (except that they are not candidates for backup.) At least one of the backup volumes must be online during a period of backup activity. In the discussion below we shall assume that exactly one backup volume is online during a period of backup activity, since we must support such configurations.
We can state a couple of requirements for backup volumes:
1) The set of offline backup volumes is sufficient to perform recovery to some old transaction-consistent state. This allows for failure modes that destroy all online volumes, including online backup and online log volumes, but do not destroy any offline volumes.
(This requirement may be somewhat controversial. It has a major impact on the design. At the same time, some may question how necessary it is, since it deals with an unusual failure mode and only guarantees local consistency. We shall proceed by assuming this requirement, and later we shall reexamine it.)
2) A backup volume contains only simple data structures. This increases our confidence that recovery will really work even though it is seldom exercised.
3) The size of files to be backed up is not constrained by the size of a backup volume; it is possible to run the backup system even if the primary file system contains files that are several times larger than a single backup volume.
Item 1 implies that each backup volume is write-once, in the sense that we never mount a backup volume for writing unless we are willing to erase its contents.
Suppose that we are about to take the online backup volume offline. By item 1, the other offline volumes are sufficient to restore the system up to a certain transaction t. The online backup volume either has no effect on how many transactions can be restored, or else it allows transactions [t .. u) to be restored, for some u > t. The latter must be true at least some of the time, otherwise the backup system is not doing its job. Hence we conclude that at least some backup volumes contain a section of archived log.
An archived log of transactions [t1 .. tn) on volumes {v1, ... , vk} is a sequence S of records such that the volumes {v1, ... , vk}, in the state produced by transactions [0 .. t1), can be brought by replaying S to the same state they would reach by executing transactions [t1 .. tn). The archived log need not represent updates in a form that allows the effects of different transactions to be distinguished; for instance, an early update that is overwritten by a later update need not be represented at all. Information about individual transactions is not necessary for the backup system since it doesn’t have to recover to an arbitrary point in past history, only to some recent point.
Our problem, as noted earlier, is how to keep the archived log bounded in length. One simple technique is to include a separate mechanism as part of the backup system to periodically dump the entire primary file system. This dump need not save a transaction-consistent snapshot of the entire file system, hence it is called a fuzzy dump. We shall now define this more precisely.
Each possible page in all possible primary file systems is assigned an integer name, using some mapping from [volume ID, file ID, page number]. This defines a total ordering on pages.
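For concreteness, one hypothetical encoding into a single integer follows; the field widths are assumptions, and any order-preserving mapping would do.

    # Hypothetical page-naming function: pack [volume ID, file ID, page number]
    # into one integer, which induces a total ordering on all possible pages.
    FILE_BITS = 32     # assumed width of a file ID
    PAGE_BITS = 24     # assumed width of a page number within a file

    def page_name(volume_id: int, file_id: int, page_number: int) -> int:
        assert 0 <= file_id < (1 << FILE_BITS)
        assert 0 <= page_number < (1 << PAGE_BITS)
        return (volume_id << (FILE_BITS + PAGE_BITS)) | (file_id << PAGE_BITS) | page_number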
A fuzzy dump of pages [p .. q) on volumes {v1, ... , vk} taken during transactions [t1 .. tn) is a representation of a partial function F from [p .. q) to page values (512 bytes) such that: (1) wherever F[x] is defined, its value is the value of page x on volumes {v1, ... , vk} seen by some transaction in [t1 .. tn), (2) F[x] is defined for every page x in [p .. q) on volumes {v1, ... , vk} that existed in the instant before transaction t1 committed, and was never updated (deleted or overwritten) by a transaction in [t1 .. tn).
Note that fuzzy dumps "add" in the following way: if {d1, ... , dj} is a set of individual fuzzy dumps on {v1, ... , vk}, then the set is a fuzzy dump whose page "range" is the union of the individual page ranges, and whose transaction interval is the smallest interval containing all the individual transaction intervals.
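In code, a fuzzy dump might be represented as a partial function from page names to page values, tagged with its page range and transaction interval. The sketch below uses hypothetical names and shows the "addition" just described, for the simple case of adjacent or overlapping page ranges.

    # Hypothetical representation of a fuzzy dump: a partial function (here a
    # dictionary) from page names to 512-byte page values.
    class FuzzyDump:
        def __init__(self, page_range, txn_range):
            self.page_range = page_range    # (p, q), half-open page interval
            self.txn_range = txn_range      # (t1, tn), half-open transaction interval
            self.pages = {}                 # page name -> page value seen by some txn

        def record(self, page_name, value):
            assert self.page_range[0] <= page_name < self.page_range[1]
            self.pages[page_name] = value

        def add(self, other):
            # "Addition" of two dumps whose page ranges are adjacent or overlap:
            # the page ranges union, and the transaction interval is the smallest
            # interval containing both.
            combined = FuzzyDump(
                (min(self.page_range[0], other.page_range[0]),
                 max(self.page_range[1], other.page_range[1])),
                (min(self.txn_range[0], other.txn_range[0]),
                 max(self.txn_range[1], other.txn_range[1])))
            combined.pages = {**self.pages, **other.pages}
            return combined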
Claim: a fuzzy dump of [0 .. infinity) on volumes {v1, ... , vk} taken during transactions [tm .. tn), plus an archived log of transactions [tl .. tp) on volumes {v1, ... , vk}, where tl < tm < tn < tp, is sufficient to restore the volumes {v1, ... , vk} updated by transactions [0 .. tp). Reason: if a page can be found in the fuzzy dump, then all later updates can be found in the archive log. If a page can’t be found in the fuzzy dump, then either it does not exist or updates to it can be found in the archive log.
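A sketch of the restore procedure this claim implies (load the fuzzy dump, then redo the archived log); the file system interface and record format are assumptions.

    # Sketch of recovery from a fuzzy dump plus an archived log (hypothetical
    # interfaces).  fuzzy_dump_pages maps page names to page values; archived_log
    # is the sequence of archived log records in order.
    def restore(fuzzy_dump_pages, archived_log, file_system):
        # 1. Install every page captured by the fuzzy dump.
        for name, value in fuzzy_dump_pages.items():
            file_system.write_page(name, value)
        # 2. Redo the archived log in order: later updates overwrite dump values,
        #    and pages absent from the dump are recreated by their logged updates.
        for record in archived_log:
            file_system.apply(record)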
So taking a new fuzzy dump (a fuzzy dump with a newer start time) allows old archived log to be discarded. Note that fuzzy dumping can be continuous, just like log archiving, so it is not necessary to accumulate an entire new dump in order to discard some log.
Note that if fuzzy dumping slows down, the backup file system grows, while if log archiving slows down, the primary file system may have to stop (for lack of space for new online log entries.)
Combining logs and dumps
One unattractive aspect of this design is that there are two mechanisms involved in the backup file system: logging and dumping. The straightforward implementation of the design involves two kinds of backup volume, archive log volumes and fuzzy dump volumes. This is an operational inconvenience if the time to fill a fuzzy dump volume is comparable to the time to fill the online log, since dumping and log archiving are mutually exclusive.
The solution is to make each backup volume contain both fuzzy dump and archived log.
A (combined) backup volume contains pages [p .. q) on volumes {v1, ... , vk} taken in transactions [tl .. tp) iff it contains a fuzzy dump of [p .. q) on volumes {v1, ... , vk} taken during [tm .. tn) and an archived log of [tl .. tp) on {v1, ... , vk} where tl < tm < tn < tp.
A (combined) backup file system to transaction tp is a set of backup volumes whose union of page intervals covers [0 .. infinity) and whose union of transaction intervals is an interval ending in tp.
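Both conditions in this definition can be checked mechanically; here is a sketch, assuming each backup volume carries its page interval and transaction interval (as the bookkeeping paragraph below suggests it should).

    import math

    # Sketch of the coverage check for a combined backup file system.  Each
    # volume record is assumed to carry page_interval = (p, q) and
    # txn_interval = (tl, tp), both half-open; "infinity" is math.inf.
    def is_backup_file_system(volumes, t_end):
        if not volumes:
            return False
        # 1. The page intervals together must cover [0 .. infinity).
        covered = 0
        for p, q in sorted(v.page_interval for v in volumes):
            if p > covered:
                return False              # gap in page coverage
            covered = max(covered, q)
        if covered != math.inf:
            return False
        # 2. The transaction intervals must merge into one interval ending in t_end.
        merged_end = None
        for tl, tp in sorted(v.txn_interval for v in volumes):
            if merged_end is not None and tl > merged_end:
                return False              # gap in transaction coverage
            merged_end = tp if merged_end is None else max(merged_end, tp)
        return merged_end == t_end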
In rough terms, we plan to implement this mechanism as follows. When a new backup pack is mounted, we first archive as much online log as possible. We then determine how much fuzzy dumping to do, and start doing it. As the fuzzy dumping progresses, we write log records to synchronize the fuzzy dumping with other transactions. When fuzzy dumping finishes, we begin archiving the online log again. If the backup volume fills before enough log is archived to cover the fuzzy dumping activity, then enough fuzzy dump must be thrown away to make room for the log archiving. This probably indicates the need to increase the number of volumes in the backup file system.
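In outline, under the same assumptions about hypothetical interfaces as the earlier sketches:

    # Outline of the backup cycle described above (all names hypothetical).
    def backup_cycle(backup_volume, online_log, primary, dump_planner):
        # 1. When the new backup pack is mounted, archive as much online log as possible.
        backup_volume.append_log(online_log.archivable_records())
        # 2. Decide how much fuzzy dumping to do and do it, writing log records
        #    to synchronize the dump with other transactions.
        for page in dump_planner.next_page_range():
            online_log.write_sync_record(page)
            backup_volume.append_dump(page, primary.read_page(page))
        # 3. Resume archiving the online log.  If the volume fills before the
        #    archived log covers the dump, throw away enough of the dump to make
        #    room for the log archiving.
        backup_volume.append_log(online_log.archivable_records())
        if backup_volume.full() and not backup_volume.log_covers_dump():
            backup_volume.discard_dump_tail()
            backup_volume.append_log(online_log.archivable_records())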
Other practical considerations
Bookkeeping. The system protects the operator against manual errors. Each backup volume contains internal self-identification. The Alpine system keeps a small online index of all backup volumes and their contents (page ranges, transaction ranges.) This index can be rebuilt from the backup volumes should it be lost.
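The online index might be no more than a small table like the following; this is a hypothetical sketch, and the real representation is an implementation detail.

    # Hypothetical sketch of the online index of backup volumes.
    from dataclasses import dataclass

    @dataclass
    class BackupVolumeEntry:
        volume_label: str      # matches the self-identification written on the pack
        page_range: tuple      # (p, q): pages covered by the volume's fuzzy dump
        txn_range: tuple       # (tl, tp): transactions covered by its archived log

    # The index is a list of these entries; if it is lost, it can be rebuilt by
    # mounting each backup volume and reading its self-identification.
    backup_index = []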
Adding a new (empty) primary volume. There is no difficulty. The backup system just notices the new volume and fits it into the total ordering on pages.
Adding a new (file-containing) primary volume. The volume can be restored by first using its previous backup volumes, then the volumes from the new system.
Adding new backup volumes. How this works depends a lot upon the mechanism used to control fuzzy dumping. If the system tries to fuzzy dump a constant fraction of the primary system on each run, it will eventually have trouble when this fraction becomes close to the size of a volume. It will then ask for a new volume (i.e. it will tell the operator that he cannot re-mount the oldest offline backup volume.) The operator may wish to add a new volume in order to increase the number of days between backup mounts; he must tell the system to fuzzy dump a smaller amount each day.
Control might also work on the basis of making the total cycle equal a certain number of days, making the pack-changing interval equal an integral number of days, or some other criterion.
Backup volume fill rate. For the moment, assume that primary and backup volumes have the same capacity. Let n be the number of primary volumes’ worth of files on the primary volumes, let d be the number of days required to accumulate one volume’s worth of archive log, and let the total number of backup volumes be n + m. Then the time to cycle through the backup volumes is roughly d * m days.
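A worked example, with made-up numbers:

    # Worked example of the fill-rate estimate above; all numbers are made up.
    n = 4              # primary volumes' worth of files (covered by fuzzy dumps)
    m = 6              # backup volumes beyond n, available for archived log
    d = 2              # days to accumulate one volume's worth of archive log
    cycle_days = d * m
    print(cycle_days)  # 12: roughly how often a given backup pack is rewritten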
Structure of fuzzy dump: Separate files; Merging the archive log and fuzzy dump. Should the fuzzy dump be represented as a collection of files, one per file (or portion thereof) dumped? Or should we merge the archive log and fuzzy dump into one file, by making fuzzy dump information look like log records? This turns the backup volume into a single sequential file. But it adds complexity when the volume fills before the archive log "covers" the fuzzy dump.
Volume overflow during recovery. Since fuzzy dump records are not synchronized with archive log records, it is possible that a straightforward replay of the backup volume (load from fuzzy dump, then apply archive log) may overflow the primary volume being reloaded. This is easy to see: the fuzzy dump alone may contain more files than will fit onto the file system. This is unlikely, but not impossible. The solution is to search the archive log forward, executing all of the space-freeing commands you can find. Then restart the fuzzy dump scan at the point of the error. This is guaranteed to make progress, though the error may recur on a different file in the same volume. (I am still a bit unsure of how good the synchronization has to be between the dump and log for this to really work.)
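A sketch of the reload loop with this overflow handling; the exception and all interfaces are assumptions.

    # Sketch of reloading a primary volume with overflow handling (hypothetical
    # interfaces throughout).
    class VolumeFull(Exception):
        """Assumed to be raised when the volume being reloaded runs out of space."""

    def reload_volume(backup_volume, primary):
        dump = list(backup_volume.fuzzy_dump_pages())    # [(page name, value), ...]
        i = 0
        while i < len(dump):
            name, value = dump[i]
            try:
                primary.write_page(name, value)
                i += 1
            except VolumeFull:
                # Search the archive log forward, executing all of the
                # space-freeing commands (destroy file, shrink file) found there,
                # then restart the dump scan at the point of the error.
                for record in backup_volume.archived_log_records():
                    if record.frees_space():
                        primary.apply(record)
        # Finally, redo the archived log in full as in normal recovery.
        for record in backup_volume.archived_log_records():
            primary.apply(record)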
Immutable files. A possible optimization is to note the immutable/not immutable status of a file at the point that the file is dumped. For an immutable file the dump is guaranteed consistent, and there is no need to examine the archive log for update records!
Order of fuzzy dumping. Dumping the files with large names first reduces the size of the dump set (assuming that file names assigned by CreateFile increase with time.) The cost is that reloading a newly-created file is guaranteed to take a full scan of the backup file system.
Fuzzy dump at low level. If the efficiency of the dump process ever became an issue, it could access the disk using raw file accesses, rather than going through normal channels. This would require somewhat more careful design to ensure that the dump is properly synchronized with the log, but this could probably be accomplished by taking a file system checkpoint (a nondisruptive operation.)
Crash recovery. How do we restart the backup component after a crash? We want this to be reasonably fast; certainly it should not delay the server in coming onto the network.
Consistency of unlogged files. Can a file change from unlogged to logged?
Remote backup. We might make it possible for one Alpine server to provide backup services to another.
Changing logs. It is probably a bad idea to store log record IDs in the backup system, since a single volume may migrate to a different log. Must the backup system be able to handle multiple logs?
Removable volumes. The problem with removable volumes is that the backup system can’t always access them when it needs to. Maybe a removable volume requires its own backup volumes, or is backed up by pack copying.
More elaborate designs
The design just presented has the virtue that it meets the stated design criteria and is very simple. One important respect in which it is simple is that backup records are kept on a very coarse grain. The price we pay for this simplicity is that the same information may be duplicated in the backup system: an object may be fuzzy dumped even though a full image of it is already in the archived log. If fuzzy dumping proceeds at a uniform rate we expect that at any instant, half of the archive log records are redundant. It is just not clear that there is any simple way to eliminate this redundancy.
The objective of some more elaborate designs might be to speed up recovery. The speed of recovery is not emphasized in our design. We consider it acceptable to mount a dozen packs in sequence to restore one file, since this is seldom required. On a system with many disk drives, you might search several backup packs in parallel, but even this sounds like excessive complexity.
Recovery from disasters
The only way to give real protection against disasters is to make the failure modes of the primary and backup systems more independent. But both systems rely on the log to record recent information. So the first conclusion is that to really protect against disaster, you must keep one copy of the log remote from the file system. With present-day communication technology this is not practical for cost and performance reasons.
So the best we can hope for in case of a disaster is to roll back to some previous transaction-consistent state; the effects of recent transactions will be lost.
The most effective form of protection is to invest in a more secure physical environment for the offline portion of the backup file system. The online portion is generally vulnerable to the same destructive forces as the primary file system and the log. With relatively little effort it should be possible to protect the offline portion of the backup system from all but the most severe earthquake or well-planned attack. Yet the offline portion can be kept near enough to the primary system to make the operation of rotating backup packs easy. (I am thinking of a vault located on the first floor of this building, near the massive concrete wall.)
A more extreme form of protection is to keep two offline backup systems, one in an on-site vault and one remote.