CSL Notebook Entry

To: Interested parties	Date: January 15, 1982

From: Mark Brown	Location: PARC/CSL

Subject: Alpine backup	File: [Ivy]Doc>AlpineBackup.bravo

XEROX

Attributes: informal, technical, Database, Distributed computing, Filing

Abstract: This memo proposes a design for Alpine's backup component.

Requirements for Alpine's backup component

Alpine's backup component must have the following two properties:

1) Perform backup and recovery. An Alpine server must survive any single storage medium failure without permanent loss of information. Since most files in an Alpine server are singly-recorded, the backup component must provide an extra degree of redundancy to make recovery possible.

An Alpine server must also recover from larger failures, as might be caused by a major earthquake or an operator error. In case of such failures some information will surely be lost, but we can recover to a recent (a few days old) transaction-consistent state. (Any data that is distributed across servers will be inconsistent after such a recovery, but local consistency seems preferable to no consistency at all.)

Recovery from the failure of an Alpine server should be reasonably quick, convenient, and foolproof. Repairing a single smashed file page should take a couple of hours at most; reloading an entire volume or system may take longer.

2) Consume few resources. The backup component should disrupt the normal functioning of an Alpine server as little as possible. Disruption ranges from degrading the server's performance, to aborting transactions unnecessarily, to making the server completely unavailable. We prefer a form of backup that does not involve taking the server down. As much as possible, the backup component should use resources only during slack periods.

The backup component must also tie up a tolerably small amount of hardware, such as disk or tape drives, and disk packs or tape reels.

According to our definition of the item "Perform backup and recovery", the backup component must be capable of listing the identities of all files in the file system, of retrieving the contents of any file in the file system, and so forth. Thus no matter how the backup component is implemented, its behavior cannot be distinguished from that of a file system (except for performance).

In some environments it is attractive to provide backup in just this way: by replicating the file system. Each disk has a mirror copy: a write goes to both disks, and a read goes to either disk. When this scheme is carefully implemented it has performance advantages over a normal file system, since reads always predominate over writes and doubling the number of disk arms improves the performance of reads. Recovery from disk errors is very straightforward with this scheme, so errors barely degrade performance, and normal operation of the system can continue during recovery.
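To make the mirroring scheme concrete, here is a minimal sketch in Python (not Alpine's implementation language; the Disk and MirroredDisk names are invented for illustration). A write goes to both copies, a read is served by either copy, and a failed read of one copy is repaired from the other:

import random

class Disk:
    """A toy disk: a mapping from page number to a 512-byte page value."""
    def __init__(self):
        self.pages = {}

    def write(self, page_no: int, value: bytes) -> None:
        self.pages[page_no] = value

    def read(self, page_no: int) -> bytes:
        return self.pages[page_no]   # raises KeyError for a smashed page

class MirroredDisk:
    """Each disk has a mirror copy: a write goes to both disks, and a
    read goes to either disk, so doubling the number of disk arms
    speeds up a read-dominated load."""
    def __init__(self):
        self.copies = [Disk(), Disk()]

    def write(self, page_no: int, value: bytes) -> None:
        for copy in self.copies:
            copy.write(page_no, value)

    def read(self, page_no: int) -> bytes:
        first, second = random.sample(self.copies, 2)
        try:
            return first.read(page_no)
        except KeyError:
            # Recovery is straightforward: fetch the page from the
            # surviving copy and rewrite the bad copy in place.
            value = second.read(page_no)
            first.write(page_no, value)
            return value

In this toy model an unwritten page and a smashed page look alike; a real implementation would distinguish device errors from absent pages.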
But doubling the number of disk drives in a file system is usually inconsistent with the item "Consume few resources." This is why we consider other ways of doing backup.

Alpine logs and backup

Alpine supports a notion of transaction to control consistent access to and updating of files. After a crash, our goal is to restore the file system to a state that reflects the execution of all committed transactions (in the order in which they executed), and does not reflect the execution of any aborted transaction. A crash aborts in-progress transactions.

Alpine implements transactions using a redo log. Alpine's log contains a complete history of update activity: typical log records are "create file F containing P pages", "destroy file F", "set size of file F to be P pages", and "write page P of file F to be V". The updates of a transaction are recorded in the log before commit, and are reflected in the file representation only after commit. (This is a slight distortion of reality, but really is true for page writes.)

Though the transaction redo log is logically infinite, the server stores only a bounded tail of the infinite log; we shall call this tail the online log. The online log must be doubly recorded to protect against single storage medium failure. Without a doubly-recorded log, the server is vulnerable to the loss of a log record between the time the record's transaction commits and the time the record's action (e.g. a file write) is carried out after commit.

It is also natural to consider using the online log to drive updates to the backup file system, keeping it consistent (to within some bounded delay) with the primary file system. Two copies of an online log record are required until the primary file system has been updated; one copy is required until the backup file system has been updated; then the record can safely be discarded. From this viewpoint the primary and backup file systems look very similar, as we earlier argued they should. The remaining asymmetry is mainly caused by the fact that only a portion of the backup file system can be kept online at all times.

This is a sound way of thinking about backup, but it is far from a complete design. Consider the following two complications. First, some clients of Alpine do not require, and cannot afford, Alpine's logging of each file page update. Some of these clients can afford the less complete protection against failures that the backup component provides. These clients are willing to accept the extra cost of logging the fact that a certain file has been updated, but not the cost of logging the new value for each page of the file. We need a strategy for backing up these files.

Second, there is the problem of actually designing data structures for the backup file system so that it can be updated with only one volume online. An obvious design is to structure the backup file system as a copy of the log records written by successful transactions. This design is optimized in favor of fast writing, which is good since we seldom read the backup file system. The problem with this design is that the size of the backup file system grows without limit, as does the time required to recover from backup. We must add more structure to the backup file system to bound its storage requirement by some small multiple of the primary file system's.
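Before turning to the design, a hedged Python sketch may help fix the ideas of this section. The record shapes are patterned on the four example records above, and copies_required encodes the rule that an online-log record needs two copies until the primary file system reflects it, one copy until the backup does, and none thereafter; all names are invented for illustration:

from dataclasses import dataclass

@dataclass
class CreateFile:
    file_id: int
    pages: int

@dataclass
class DestroyFile:
    file_id: int

@dataclass
class SetSize:
    file_id: int
    pages: int

@dataclass
class WritePage:
    file_id: int
    page_no: int
    value: bytes

EMPTY_PAGE = b"\x00" * 512

def redo(record, files: dict) -> None:
    """Apply one committed log record to the file representation.
    Since updates reach the files only after commit, recovery simply
    redoes the log records of committed transactions, in order."""
    if isinstance(record, CreateFile):
        files[record.file_id] = [EMPTY_PAGE] * record.pages
    elif isinstance(record, DestroyFile):
        files.pop(record.file_id, None)
    elif isinstance(record, SetSize):
        pages = files[record.file_id]
        del pages[record.pages:]                                  # shrink
        pages.extend([EMPTY_PAGE] * (record.pages - len(pages)))  # or grow
    elif isinstance(record, WritePage):
        files[record.file_id][record.page_no] = record.value

def copies_required(seq_no: int, primary_done: int, backup_done: int) -> int:
    """How many copies of online-log record seq_no must still exist,
    given the highest record applied to the primary file system and to
    the backup file system. Records requiring zero copies form the head
    of the online log that may be discarded."""
    if seq_no > primary_done:
        return 2
    if seq_no > backup_done:
        return 1
    return 0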
A backup file system design

We shall assume that the backup file system consists of a set of backup volumes, which are normal Alpine volumes (except that they are not candidates for backup). At least one of the backup volumes must be online during a period of backup activity. In the discussion below we shall assume that exactly one backup volume is online during a period of backup activity, since we must support such configurations.

We can state a few requirements for backup volumes:

1) The set of offline backup volumes is sufficient to perform recovery to some old transaction-consistent state. This allows for failure modes that destroy all online volumes, including online backup and online log volumes, but do not destroy any offline volumes.

(This requirement may be somewhat controversial. It has a major impact on the design, and some may question how necessary it is, since it deals with an unusual failure mode and only guarantees local consistency. We shall proceed by assuming this requirement, and later we shall reexamine it.)

2) A backup volume contains only simple data structures. This increases our confidence that recovery will really work, even though it is seldom exercised.

3) The size of files to be backed up is not constrained by the size of a backup volume; it must be possible to run the backup system even if the primary file system contains files that are several times larger than a single backup volume.

Item 1 implies that each backup volume is write-once, in the sense that we never mount a backup volume for writing unless we are willing to erase its contents.

Suppose that we are about to take the online backup volume offline. By item 1, the other offline volumes are sufficient to restore the system up to a certain transaction t. The online backup volume either has no effect on how many transactions can be restored, or else it allows transactions [t .. u) to be restored, for t < u. The latter must be true at least some of the time, otherwise the backup system is not doing its job. Hence we conclude that at least some backup volumes contain a section of archived log.

An archived log of transactions [t1 .. tn) on volumes {v1, ..., vk} is a sequence S of records such that volumes {v1, ..., vk} updated by transactions [0 .. t1) can be redone using S to reach the same state as redoing transactions [t1 .. tn). The archived log need not represent updates in a form that allows the effects of different transactions to be distinguished; for instance, an early update that is overwritten by a later update need not be represented at all. Information about individual transactions is not necessary for the backup system, since it does not have to recover to an arbitrary point in past history, only to some recent point.
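The freedom this definition allows can be shown in a short Python sketch (names invented, and considering page writes only, ignoring creates and destroys for brevity): because the archived log need not distinguish transactions, a writer may coalesce the committed history, keeping just the last value written to each page, and recovery simply replays the result.

def coalesce(committed_writes):
    """Build an archived log from a committed update history.
    committed_writes is a sequence of (page_name, value) pairs in
    commit order; an early update overwritten by a later one need not
    be represented, so only the final value of each page is kept."""
    final = {}
    for page_name, value in committed_writes:
        final[page_name] = value          # later committed writes win
    return sorted(final.items())          # any convenient order

def redo_archived_log(volume_pages: dict, archived_log) -> None:
    """Redo an archived log of transactions [t1 .. tn) against volumes
    already reflecting transactions [0 .. t1); the result is the same
    state as redoing transactions [t1 .. tn) themselves."""
    for page_name, value in archived_log:
        volume_pages[page_name] = value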
Our problem, as noted earlier, is how to keep the archived log bounded in length. One simple technique is to include a separate mechanism, as part of the backup system, that periodically dumps the entire primary file system. This dump need not save a transaction-consistent snapshot of the entire file system, hence it is called a fuzzy dump. We shall now define this more precisely.

Each possible page in all possible primary file systems is assigned an integer name, using some mapping from [volume ID, file ID, page number]. This defines a total ordering on pages.

A fuzzy dump of pages [p .. q) on volumes {v1, ..., vk} taken during transactions [t1 .. tn) is a representation of a partial function F from [p .. q) to page values (512 bytes) such that: (1) wherever F[x] is defined, its value is the value of page x on volumes {v1, ..., vk} seen by some transaction in [t1 .. tn); (2) F[x] is defined for every page x in [p .. q) on volumes {v1, ..., vk} that existed in the instant before transaction t1 committed, and was never updated (deleted or overwritten) by a transaction in [t1 .. tn).

Note that fuzzy dumps "add" in the following way: if {d1, ..., dj} is a set of individual fuzzy dumps on {v1, ..., vk}, then the set is a fuzzy dump whose page "range" is the union of the individual page ranges, and whose transaction interval is the smallest interval containing all the individual transaction intervals.

Claim: a fuzzy dump of [0 .. infinity) on volumes {v1, ..., vk} taken during transactions [tm .. tn), plus an archived log of transactions [tl .. tp) on volumes {v1, ..., vk}, where tl
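To make the two definitions above concrete, here is a hedged Python sketch; the field widths and the tuple shape of a dump are invented for illustration. page_name is one possible mapping giving the total ordering on pages, and combine shows the sense in which fuzzy dumps "add":

MAX_FILES = 2**16   # invented field widths; any injective packing works
MAX_PAGES = 2**16

def page_name(volume_id: int, file_id: int, page_no: int) -> int:
    """Map [volume ID, file ID, page number] to an integer name,
    defining a total ordering on all possible pages."""
    return (volume_id * MAX_FILES + file_id) * MAX_PAGES + page_no

def combine(dumps):
    """'Add' a set of fuzzy dumps taken on the same volumes. Each dump
    is (page_ranges, (t_lo, t_hi), F), where F maps a page name to the
    512-byte value that some transaction in [t_lo .. t_hi) saw. The
    combined page 'range' is the union of the individual ranges, and
    the combined transaction interval is the smallest interval
    containing all the individual intervals."""
    t_lo = min(lo for _, (lo, _), _ in dumps)
    t_hi = max(hi for _, (_, hi), _ in dumps)
    page_ranges, f = [], {}
    for ranges, _, partial in dumps:
        page_ranges.extend(ranges)
        f.update(partial)   # a page dumped twice may carry either value
    return page_ranges, (t_lo, t_hi), f

This additivity is presumably what allows the full dump of [0 .. infinity) to be taken piecemeal, a bounded range of pages at a time, rather than in one disruptive pass.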