Heading:z18697qjk40(635)\g Outline of a possible FileStore implementationz18697y756qjk40\g Page Numbers: Yes X: 527 Y: 10.5"z18697qjk40\g CSL Notebook Entryz18697l4445y762\f5bg To: Alpine designers Date: March 18, 1981z18697l4445d2998e21(0,65535)(1,4445)(5,11684)(6,13653)\f1g3f0t2 1t0 16t6 1f1t0 5f0t7 1t0 From: Mark Brown Location: PARC/CSLz18697l4445d2998y716e25(6,13664)\f1g5f0t2 1t0 10t6 1f1t0 9f0t7 1t0 Subject: Outline of a possible FileStore implementation File: [Ivy]Doc>FileStoreImplOutline0.bravoz19579l4445d2998e25\f1g8f0t2 1t0 46t6 1f1t0 5f0t7 1f1t0 44f0 XEROX z18697l508y644e14(2116)\f2g5f0 Attributes: informal, technical, Database, Distributed computing, Filingz18697l4896d2999e10(0,4904)(1,65535)(5,65535)(6,65535)\f1g11f0 Abstract: This memo contains a sketch of how FileStore might be implemented.z18697l4896d2999e10j\f1g9f0 z18697e10(2116)\gv Prioritiesz18697e36k100(635)\bg Sequential I/O will always be very important, and Alpine should try hard to do a good job on sequential I/O. Applications that exhibit sequentiality: file transfer, external sorting, searching (pattern matching in text, no-index query processing), logs (both low and high level.)z18697e12jk40\g Random I/O can really only be optimized to the extent that it is non-random, e.g. it exhibits locality of reference and benefits from an LRU buffer manager.z18697e12jk40\g FilePageMgrz18697e36k100\bg In the proposed design, all operations on FileStore files go through the FilePageMgr interface. The interface deals in the same logical entities as Pilot's interface: files identified by unique IDs, pages within files (indexed from 0 with no "leader page"), etc. Transactions do not appear in the interface. Data is transferred across the interface in buffers that are controlled by the FilePageMgr implementation. In a machine with multiple FileStores, there needs to be only one FilePageMgr instance. An FilePageMgr "volume" corresponds to a Pilot logical volume, but it might be implemented by duplicating all writes to two independent Pilot logical volumes.z18697e12jk40\g660G6g FilePageMgr can be implemented on public Pilot interfaces. It can later be improved, on basis of experience, by using Pilot internal interfaces (perhaps by doing its own buffering and disk I/O. If the implementation assumes responsibility for buffering and disk I/O, then performance can be improved by adding buffer memory to a machine, without interacting with Pilot swapper performance.)z18697e12jk40\g System reliability can be improved by using replicated volumes, but this is transparent at the FilePageMgr interface. (In commercial systems this replication is called "mirrored disks"; high-reliability commercial systems support replication at this level, and tolerate single failures without interruption of service.)z18697e12jk40\g Basic transaction strategyz18697e36k100\bg Execute file page write, file deletion, file shortening, and file immutable-making by writing a redo log entry before the end of phase one. During phase two, carry out the operation.z18697e12jk40\g Execute file creation and file lengthening by writing an undo log entry, then executing the operation, before the end of phase one. Undo the operation if transaction aborts.z18697e12jk40\g Optimize important special cases. For example, page writes do not need to be deferred if the file was created in this transaction.z18697e12jk40\g The transaction log (See FileStoreTransCode1.bravo for more discussion of this)z18697e36k100\bg19B2f1 59f0b The log is a stream to a file; log accesses are performed through FilePageMgr. The amount of log buffering in the FilePageMgr implementation may have a large impact on the performance of update transactions.z18697e12jk40\g The log file is used as a circular buffer. Pages of log are formatted so that they can be parsed even in case of an unexpected crash. One simple scheme: each page contains a sequence number that tells the number of times that the page has been written. If a page stamped "i" is followed by a page stamped "i-1", then the page stamped "i" is the last page of the log.z18697e12jk40\g Log activity can be modelled as: there is one read pointer per transaction, a single write pointer shared by all transactions. The read pointers always lag the write pointer.z18697e12jk40\g If protection against any single-page failure is required, then the log must be duplicated on physically separate volumes.z18697e12jk40\g The log may be stored on tape for backup purposes. (Searching a lot of log tapes can be very slow; a backup tape organized for fast access is needed if recovery is to be reasonably fast.)z18697e12jk40\g An interface to the log will be provided so that other programs, such as a database management system, can write their own recoverable actions.z18697e12jk40\g The log mapz18697e36k100\bg A volatile data structure (lost in crash.)z18697e12jk40\g Maps [file, page] -> [transaction, log ID] of log record containing data written by transaction on that page.z18697e12jk40\g How various interesting operations are implementedz18697e36k100\bg Write Pagez18697e12jk40\ig Obtain page lock (U or W mode) if required.z18697l3351e6jk40\g Consult log map. If a committed update (necessarily by another transaction) to this page exists, then read data from log, write the new data through FilePageMgr, and erase the log entry.z18697l3351e6jk40\g Write log record "transaction t wrote data d on page p of file f". Not necessary to force log.z18697l3351e6jk40\g Insert entry containing log ID of this log record into the log map.z18697l3351e6jk40\g Read Pagez18697e12jk40\ig Obtain page lock (R mode) if required.z18697l3351e6jk40\g Consult log map. If a committed update (necessarily by another transaction) to this page exists, then read data from log (and as side effect, perform the write through FilePageMgr and erase the log entry.) If an update from this transaction exists, then read data from log. Otherwise read data from file.z18697l3351e6jk40\g Commit (multiple machine, two-phase case)z18697e12jk40\ig Phase one.z18697l3351e6jk40\ig Convert all of t's U mode locks to W mode locks.z18697l3704e6jk40\g Write log record "transaction t is ready". Force log.z18697l3704e6jk40\g Phase two.z18697l3351e6jk40\ig Write log record "transaction t is committed". Force log. Mark the volatile representation of t "committed".z18697l3704e6jk40\g Release or downgrade all of t's locks. (Commit call returns now, but processing continues in a background process.)z18697l3704e6jk40\g Enumerate t's log map entries. For each entry, read data from log, write data to file.z18697l3704e6jk40\g System checkpointz18697e12jk40\ig Abort any excessively long transactions.z18697l3351e6jk40\g Flush buffers of any pages that represent committed writes performed by excessively old transactions.z18697l3351e6jk40\g Write a log record that identifies (1) all active transactions, (2) the oldest log record that must be consulted to recover from the current state. Force log.z18697l3351e6jk40\g Write the restart record, making it point to the checkpoint just written. Note that restart record is duplicated to tolerate failures during write. After a crash, recovery begins by reading the restart record and finding the most recent checkpoint.z18697l3351e6jk40\g