Outline of a possible FileStore implementation
CSL Notebook Entry
To: Alpine designers    Date: March 18, 1981
From: Mark Brown    Location: PARC/CSL
Subject: Outline of a possible FileStore implementation    File: [Ivy]<Alpine>Doc>FileStoreImplOutline0.bravo
Attributes: informal, technical, Database, Distributed computing, Filing
Abstract: This memo contains a sketch of how FileStore might be implemented.
Priorities
Sequential I/O will always be very important, and Alpine should try hard to do a good job on it. Applications that exhibit sequentiality: file transfer, external sorting, searching (pattern matching in text, non-indexed query processing), and logs (both low and high level).
Random I/O can be optimized only to the extent that it is not truly random, i.e. it exhibits locality of reference and therefore benefits from an LRU buffer manager.
FilePageMgr
In the proposed design, all operations on FileStore files go through the FilePageMgr interface. The interface deals in the same logical entities as Pilot’s interface: files identified by unique IDs, pages within files (indexed from 0, with no "leader page"), etc. Transactions do not appear in the interface. Data is transferred across the interface in buffers that are controlled by the FilePageMgr implementation. In a machine with multiple FileStores, there needs to be only one FilePageMgr instance. A FilePageMgr "volume" corresponds to a Pilot logical volume, but it might be implemented by duplicating all writes to two independent Pilot logical volumes.
FilePageMgr can be implemented on public Pilot interfaces. It can later be improved, on the basis of experience, by using Pilot internal interfaces, perhaps by doing its own buffering and disk I/O. If the implementation assumes responsibility for buffering and disk I/O, then performance can be improved by adding buffer memory to a machine, without interacting with Pilot swapper performance.
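To make the shape of the interface concrete, here is a minimal sketch in Go; every name (FileID, PageNumber, Buffer, the method list) is an assumption for illustration, since the memo does not fix the actual interface:

package filestore

// Illustrative types only; the real interface would be a Mesa module.
type FileID [8]byte   // unique ID naming a file
type PageNumber int64 // pages are indexed from 0; there is no "leader page"

// A Buffer is allocated and controlled by the FilePageMgr implementation;
// clients fill or read Contents and then Release the buffer.
type Buffer struct {
	File     FileID
	Page     PageNumber
	Contents []byte
}

// FilePageMgr hides volumes, buffering, and disk I/O.
// Note that transactions appear nowhere in the interface.
type FilePageMgr interface {
	ReadPage(f FileID, p PageNumber) (*Buffer, error)
	WritePage(b *Buffer) error
	Release(b *Buffer)
	Create(f FileID, pages int64) error
	Delete(f FileID) error
	SetSize(f FileID, pages int64) error // lengthen or shorten a file
	MakeImmutable(f FileID) error
}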
System reliability can be improved by using replicated volumes, but this is transparent at the FilePageMgr interface. (In commercial systems this replication is called "mirrored disks"; high-reliability commercial systems support replication at this level, and tolerate single failures without interruption of service.)
Basic transaction strategy
Execute file page writes, file deletion, file shortening, and making a file immutable by writing a redo log entry before the end of phase one; carry out the operation itself during phase two.
Execute file creation and file lengthening by writing an undo log entry and then executing the operation, before the end of phase one. Undo the operation if the transaction aborts. (Illustrative record shapes for both cases are sketched after this list.)
Optimize important special cases. For example, page writes do not need to be deferred if the file was created in this transaction.
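The strategy above suggests roughly the following log record shapes, continuing the Go sketch; TransID and all field names are illustrative assumptions:

type TransID uint64

type RecordKind int

const (
	RedoWrite RecordKind = iota // deferred; applied during phase two
	RedoDelete
	RedoShorten
	RedoMakeImmutable
	UndoCreate   // executed at once; the record says how to undo on abort
	UndoLengthen
)

type LogRecord struct {
	Kind  RecordKind
	Trans TransID
	File  FileID
	Page  PageNumber // RedoWrite only
	Data  []byte     // RedoWrite only: the page image to install
	Pages int64      // shorten/lengthen only: the size involved
}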
The transaction log (see FileStoreTransCode1.bravo for more discussion)
The log is a stream to a file; log accesses are performed through FilePageMgr. The amount of log buffering in the FilePageMgr implementation may have a large impact on the performance of update transactions.
The log file is used as a circular buffer. Pages of log are formatted so that they can be parsed even in case of an unexpected crash. One simple scheme: each page contains a sequence number that tells the number of times that the page has been written. If a page stamped "i" is followed by a page stamped "i-1", then the page stamped "i" is the last page of the log.
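A sketch of the end-of-log scan that this stamping scheme permits, assuming the per-page stamps have already been read into a slice (findLogEnd is an illustrative name):

// findLogEnd returns the index of the last page of the log: the page
// whose circular successor carries a stamp one smaller than its own.
func findLogEnd(stamps []uint64) int {
	n := len(stamps)
	for i := 0; i < n; i++ {
		if stamps[(i+1)%n] == stamps[i]-1 {
			return i
		}
	}
	// Every page carries the same stamp: the log has just completed a
	// full pass, so the last physical page is the last page written.
	return n - 1
}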
Log activity can be modelled as follows: there is one read pointer per transaction and a single write pointer shared by all transactions; the read pointers always lag the write pointer.
If protection against any single-page failure is required, then the log must be duplicated on physically separate volumes.
The log may be stored on tape for backup purposes. (Searching a lot of log tapes can be very slow; a backup tape organized for fast access is needed if recovery is to be reasonably fast.)
An interface to the log will be provided so that other programs, such as a database management system, can write their own recoverable actions.
The log map
A volatile data structure (lost in a crash).
Maps [file, page] to the [transaction, log ID] of the log record containing the data written by that transaction on that page.
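In the running Go sketch, the log map might look like this; LogID and the entry layout are assumptions:

type LogID uint64 // position of a record within the log

type pageKey struct {
	File FileID
	Page PageNumber
}

type logMapEntry struct {
	Trans TransID // transaction that wrote the data
	Log   LogID   // log record holding the page image
}

// LogMap is volatile: it lives only in virtual memory and is lost in a
// crash; recovery reconstructs what it needs from the log itself.
type LogMap map[pageKey]logMapEntry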
How various interesting operations are implemented
Write Page
Obtain page lock (U or W mode) if required.
Consult the log map. If a committed update (necessarily by another transaction) to this page exists, then read its data from the log, write that data through FilePageMgr, and erase the log map entry.
Write the log record "transaction t wrote data d on page p of file f". It is not necessary to force the log.
Insert an entry containing the log ID of this record into the log map. (These four steps are sketched in code below.)
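One rendering of these steps, with Store, LockManager, and Log as assumed helpers that bundle the machinery the memo mentions; the sketch ignores the synchronization a real implementation would need:

type LockMode int

const (
	R LockMode = iota
	U
	W
)

type LockManager interface {
	Acquire(t TransID, f FileID, p PageNumber, m LockMode)
	ConvertUToW(t TransID) // upgrade all of t's U locks to W
	ReleaseAll(t TransID)
}

type Log interface {
	Append(r LogRecord) (LogID, error)
	Read(id LogID) (LogRecord, error)
	Force() error // wait until all appended records are on the disk
}

type Store struct {
	pages     FilePageMgr
	log       Log
	locks     LockManager
	logMap    LogMap
	committed map[TransID]bool
}

func (s *Store) WritePage(t TransID, f FileID, p PageNumber, data []byte) error {
	s.locks.Acquire(t, f, p, U) // step 1: page lock in U (or W) mode

	// Step 2: a committed update (necessarily by another transaction)
	// may still live only in the log; install it, then erase the entry.
	key := pageKey{File: f, Page: p}
	if e, ok := s.logMap[key]; ok && s.committed[e.Trans] {
		r, err := s.log.Read(e.Log)
		if err != nil {
			return err
		}
		if err := s.pages.WritePage(&Buffer{File: f, Page: p, Contents: r.Data}); err != nil {
			return err
		}
		delete(s.logMap, key)
	}

	// Step 3: "transaction t wrote data d on page p of file f".
	// The log need not be forced here.
	id, err := s.log.Append(LogRecord{Kind: RedoWrite, Trans: t, File: f, Page: p, Data: data})
	if err != nil {
		return err
	}

	// Step 4: remember where t's newest image of this page lives.
	s.logMap[key] = logMapEntry{Trans: t, Log: id}
	return nil
}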
Read Page
Obtain page lock (R mode) if required.
Consult the log map. If a committed update (necessarily by another transaction) to this page exists, then read the data from the log (and, as a side effect, perform the write through FilePageMgr and erase the log map entry). If an update from this transaction exists, then read the data from the log. Otherwise read the data from the file.
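Read Page in the same sketch, again ignoring synchronization; the uncommitted-other-transaction case cannot arise because its W lock would have blocked our R lock:

func (s *Store) ReadPage(t TransID, f FileID, p PageNumber) ([]byte, error) {
	s.locks.Acquire(t, f, p, R) // page lock in R mode

	key := pageKey{File: f, Page: p}
	if e, ok := s.logMap[key]; ok {
		if s.committed[e.Trans] || e.Trans == t {
			r, err := s.log.Read(e.Log)
			if err != nil {
				return nil, err
			}
			if s.committed[e.Trans] {
				// Side effect: install the committed data in the file
				// and erase the log map entry.
				if err := s.pages.WritePage(&Buffer{File: f, Page: p, Contents: r.Data}); err != nil {
					return nil, err
				}
				delete(s.logMap, key)
			}
			return r.Data, nil
		}
	}

	// Otherwise read the data from the file itself.
	b, err := s.pages.ReadPage(f, p)
	if err != nil {
		return nil, err
	}
	data := append([]byte(nil), b.Contents...) // copy out before releasing
	s.pages.Release(b)
	return data, nil
}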
Commit (multiple-machine, two-phase case)
Phase one.
Convert all of t’s U mode locks to W mode locks.
Write log record "transaction t is ready". Force log.
Phase two.
Write log record "transaction t is committed". Force log. Mark the volatile representation of t "committed".
Release or downgrade all of t’s locks. (The Commit call returns now, but processing continues in a background process.)
Enumerate t’s log map entries. For each entry, read the data from the log and write it to the file. (Both phases are sketched below.)
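Both phases in the sketch; the Ready and Committed record kinds are assumptions, and the background installation is shown as a goroutine for brevity:

const (
	Ready     RecordKind = 100 + iota // "transaction t is ready"
	Committed                         // "transaction t is committed"
)

func (s *Store) PhaseOne(t TransID) error {
	s.locks.ConvertUToW(t) // all of t's U locks become W locks
	if _, err := s.log.Append(LogRecord{Kind: Ready, Trans: t}); err != nil {
		return err
	}
	return s.log.Force() // the ready record must reach the disk
}

func (s *Store) PhaseTwo(t TransID) error {
	if _, err := s.log.Append(LogRecord{Kind: Committed, Trans: t}); err != nil {
		return err
	}
	if err := s.log.Force(); err != nil {
		return err
	}
	s.committed[t] = true // mark the volatile representation committed
	s.locks.ReleaseAll(t) // Commit can return to the caller here

	// Background: install t's logged writes into the files. Real code
	// must synchronize log map access with concurrent readers and writers.
	go func() {
		for key, e := range s.logMap {
			if e.Trans != t {
				continue
			}
			if r, err := s.log.Read(e.Log); err == nil {
				s.pages.WritePage(&Buffer{File: key.File, Page: key.Page, Contents: r.Data})
				delete(s.logMap, key) // a concurrent Read/Write Page may have done this first
			}
		}
	}()
	return nil
}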
System checkpoint
Abort any excessively long transactions.
Flush buffers of any pages that represent committed writes performed by excessively old transactions.
Write a log record that identifies (1) all active transactions, (2) the oldest log record that must be consulted to recover from the current state. Force log.
Write the restart record, making it point to the checkpoint just written. Note that the restart record is duplicated to tolerate failures during the write. After a crash, recovery begins by reading the restart record and finding the most recent checkpoint. (See the sketch below.)
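The checkpoint steps in the sketch; the aging policies and the restart-record encoding are left as stubs, since the memo does not pin them down:

const CheckpointRec RecordKind = 200 // names the active transactions and the oldest needed record

func (s *Store) Checkpoint() error {
	s.abortLongTransactions() // step 1 (policy stub)
	s.installOldCommitted()   // step 2: flush committed writes of old transactions (stub)

	// Step 3: write and force the checkpoint record.
	id, err := s.log.Append(LogRecord{Kind: CheckpointRec})
	if err != nil {
		return err
	}
	if err := s.log.Force(); err != nil {
		return err
	}

	// Step 4: point the restart record at this checkpoint. The restart
	// record is written in duplicate so that a crash during the write
	// leaves one intact copy; recovery reads it to find the checkpoint.
	return s.writeRestartRecord(id)
}

// Policy and encoding stubs; real versions are outside this sketch.
func (s *Store) abortLongTransactions()         {}
func (s *Store) installOldCommitted()           {}
func (s *Store) writeRestartRecord(LogID) error { return nil }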