Heading:
BFS interface, version 0
Page Numbers: Yes X: 527 Y: 10.5"
CSL Notebook Entry
To:Sierra designersDate:February 16, 1981
From:Mark BrownLocation:PARC/CSL
Subject:BFS interface, version 0File:[Ivy]<Sierra>Doc>BFSDesign0.bravo
XEROX
Attributes:informal, technical, Database, Distributed computing, Filing
Abstract:This memo proposes a design for Sierra’s BFS interface. The design is far from complete, but the facilities for multiple file system transactions have been worked out in some detail.
Introduction
This memo proposes a design for Sierra’s BFS. The BFS is a low-level file system interface, several instances of which may exist on a single machine. Some instances will represent local file systems, while another may have a stub implementation that talks to a remote file system (itself a thin veneer on top of a local BFS) via RPC. This memo does not propose an "on the wire" BFS, though that interface will also be important. We believe that the proposed BFS defines procedures that are sufficient for conversion to and from an "on the wire" representation.
The BFS provides primitives for coordinating transactions that involve multiple file systems. These are given in the sections "Transaction Coordinator" and "Transaction Worker" below. Some care has been taken to provide primitives that can take advantage of important special cases such as single file system transactions and readonly transactions. In particular, we do not take the Lampson-Sturgis paper’s "always ready" approach to transactions, since the benefits of this approach are unclear while its cost is a heavier transaction mechanism. The transaction primitives are arranged so that duplicate calls, which will happen occasionally with a cheap RPC, will not cause problems. Some aspects of the interface are designed with an eye to supporting replicated commit records as a future option, but the interface does not fully support them now.
A BFS on a workstation will not normally use its local file system to coordinate multiple file system transactions, since a crashed coordinator can tie up resources on other file systems. It is perfectly ok for a workstation’s BFS to coordinate its own transactions, or to act as a worker in a multiple file system transaction, since in these cases the only resources that it can tie up in a crash are its own.
Several planned-for or at least talked-about items are missing from the interface as proposed here. Assuming that we wish to build a file transfer protocol on top of BFS, the BFS should accept some form of prefetch hint to help it schedule transfers. Databases and fast streams may wish to deal with groups of pages (what Pilot calls "uniform swap units") rather than single pages. Databases will want to log application-specific undo and redo actions and set application-specific locks. We planned to support encryption at a low level. The BFS should perform low-level authentication, at least to the extent of verifying that it is being called by a known piece of software. We have not yet specified the means of validating file caches or manipulating file properties (read, write, and create times, string names, etc.), or any form of "broken read lock", transaction commit without releasing locks (Juniper "checkpoint"), or transaction save point.
Following the BFS interface, we sketch BFSPrivate, an interface for communication among modules that implement BFS (for instance, to hold the concrete versions of types that are opaque in BFS.) We have started to implement some of the transaction-related procedures; this code will appear in a later memo.
BFS: DEFS = BEGIN
FileSystemID: TYPE [3];
-- Unique ID of a file system.
TransID: TYPE [5];
-- Unique ID of a transaction.
-- Locks
LockMode: TYPE = MACHINE DEPENDENT
{ S, U, X, IS, IU, IX, SIU, SIX };
-- or something like that ... .
-- Brief introduction to lock modes:
-- S means "share", U means "update", X means "exclusive". S is required for reading, X is required for writing. U essentially makes a reservation to write; it is compatible with S but not U or X. (Juniper uses U locks to cover writes, converting to X locks just before commit. An asymmetrical U mode may also be useful: compatible with existing S locks but not with subsequent S requests.) These three modes are the only ones that apply to pages, since from the file system’s point of view pages have no lockable substructure.
-- I means "intention" and relates to the interaction between file and page locks. Locking a file in IX (intention exclusive) mode, for instance, says that we intend to lock pages of the file in X mode. If we lock the file in S, U, or X mode, we don’t acquire page locks at all. Hence IX is incompatible with S, U, and X but not IS, IU, or IX.
-- Combined modes like SIX lock the whole file in S mode, and also declare an intention to lock pages of the file in X mode. This is useful since it allows us to avoid getting a lock for each page we read (we expect to do more reads than writes.)
-- File actions (below) perform implicit locking to maintain serial consistency.
-- Files
FileID: TYPE [5];
-- Unique ID of a file.
File: TYPE = REF FileObject;
FileObject: TYPE;
-- Represents an "open" file, opaque to BFS client. The following is a basic concrete representation, used in describing operations below (BFS to local file system will use something like this, BFS to remote file system will use a representation containing less information):
FileObject: TYPE = RECORD [
file: FileID,
trans: Worker,
fileLockMode: LockMode ];
CreateFile: PROC [f: FileID, initialPageLength: PageLength, trans: Worker];
-- Pass in FileID so that a copy of an immutable file can be created.
DeleteFile: PROC [f: FileID, trans: Worker];
-- This could be made available in both U and X mode (see WritePage below), but it is probably not worth the trouble.
OpenFile: PROC [f: FileID, trans: Worker, initialFileLockMode: LockMode ← IS]
RETURNS [File];
-- Opens file f for access under transaction t, first locking it in the specified mode.
FileFromFileIDAndTransaction: PROC [f: FileID, trans: Worker]
RETURNS [File];
-- Finds an already-opened file. Returns NIL if file is not found.
ReadPage: PROC [f: File, p: PageIndex, pageLockMode: LockMode {S, U, X} ← S]
RETURNS [data: Page];
-- Reads data from page p of file f. Client will specify a pageLockMode stronger than S if it expects to update the page later in the transaction.
WritePage: PROC [f: File, p: PageIndex, data: Page, pageLockMode: LockMode {U, X} ← U];
-- Writes data on page p of file f. Client will specify pageLockMode = U if it wishes WritePage to return immediately without waiting for readers of page p to leave (client may still have to wait, but waiting will happen at commit time.)
GetLength: PROC [f: File, lengthLockMode: LockMode {S, U, X} ← S]
RETURNS [pageLength: PageLength];
SetLength: PROC [f: File, pageLength: PageLength, lengthLockMode: LockMode {U, X} ← U];
-- Transaction Coordinator
Coordinator: TYPE = REF CoordinatorObject;
CoordinatorObject: TYPE;
-- Volatile representative for a coordinator, opaque to BFS client. The following is a basic concrete representation, used in describing operations below (BFS to local file system will use something like this, BFS to remote file system will use a representation containing less information):
CoordinatorObject: TYPE = MONITORED RECORD [
transID: TransID,
state, recoverableState: TransState,
coordinator: FileSystemID,
workers:
LIST OF FileSystemID ];
CreateTransaction: PROC [stableCreation: BOOLEANFALSE] RETURNS [t: Coordinator];
-- Creates a Coordinator for a brand-new transaction. If stableCreation, then CreateTransaction will not return until a stable record of creating the transaction has been written. Without stableCreation, there is a small window in which a coordinator crash will cause the coordinator to forget that the transaction ever existed. Since this crash certainly aborts the transaction, stableCreation is not mandatory, but it allows clients to ignore the possibility that CoordinatorFromTransID (below) will ERROR for a recent transaction (last few hours).
CoordinatorFromTransID: PROC [t: TransID] RETURNS [Coordinator];
-- Caller asserts that transaction t was created here (we can tell by using CoordinatorFileSystemFromTransID; ERROR if not true.) Locates t’s coordinator (searching the online log and creating a new volatile coordinator if necessary) and returns it. CoordinatorFromTransID raises TransactionUnknown if t cannot be found; if TransactionUnknown is resumed then we return its result.
TransactionUnknown: SIGNAL [t: TransID] RETURNS [Coordinator];
RegisterWorker: PROC [t: Coordinator, w: FileSystemID, stableRegistration: BOOLEANFALSE] RETURNS [success: BOOLEAN];
-- Requests permission to perform actions under transaction t on file system w. The request is a noop if w is already a member of t.workers. The request is refused (returns success = FALSE) if t.state # active. If stableRegistration, then RegisterWorker will not return until the registration has been written to stable storage. Without stableRegistration, there is a small window in which a coordinator crash will cause the coordinator to forget that the worker is part of the transaction. Since this crash is guaranteed to abort the transaction, stableRegistration is not mandatory, but it allows workers to have a higher tolerance of transaction inactivity before unilaterally aborting a transaction. Note: the caller (an implementation of CreateWorker below) must not have worker monitors locked.
FinishTransaction: PROC [t: Coordinator, requestedOutcome: {commit, abort},
commitRecordPlaces:
LIST OF FileSystemID ← NIL]
RETURNS [outcome: {committed, aborted}];
-- Caller requests t to commit or abort. The COLLECTING record (containing t.transID, requestedOutcome, commitRecordPlaces, and t.workers) will be written to t.coordinator. If a crash happens before this record appears, then t will be aborted during recovery (if there is any stable record of t at all); otherwise t will try to commit or abort as indicated by requestedOutcome. The COMMIT record will written to commitRecordPlaces (NIL means t.coordinator). If t.workers = t.coordinator = commitRecordPlaces, the coordinator will delegate all responsibility to the worker, saving two synchronous disk I/Os (COLLECTING and READY).
-- Two trivial conversion procedures that have long names.
CoordinatorFileSystemIDFromTransID: PROC [t: TransID] RETURNS [FileSystemID];
-- Returns location of coordinator of transaction t ("location of coordinator" means "file system that will record the outcome"). We expect the encoding of the FileSystemID in the TransID to be relatively simple.
TransIDFromCoordinator: PROC [t: Coordinator] RETURNS [TransID];
-- Returns t.transID.
-- Transaction Worker
Worker: TYPE = REF WorkerObject;
WorkerObject: TYPE;
-- Volatile representative for a worker, opaque to BFS client. The following is a basic concrete representation, used in describing operations below (BFS to local file system will use something like this, BFS to remote file system will use a representation containing less information):
WorkerObject: TYPE = MONITORED RECORD [
transID: TransID,
state, recoverableState: TransState,
coordinator: FileSystemID,
worker: FileSystemID ];
WorkerFromTransID: PROC [t : TransID, createWorkerIfNecessary: BOOLEANTRUE] RETURNS [Worker];
-- Searches for volatile record of a worker for transaction t on this file system, and returns it if found. If not found and not createWorkerIfNecessary, returns NIL. Otherwise calls coordinator’s RegisterWorker, then creates the worker. If worker cannot be created (for instance, t is finishing), returns NIL.
PrepareWorker: PROC [w: Worker, commitRecordPlaces: LIST OF FileSystemID]
RETURNS [{ready, notReady}];
-- Caller (an implementation of FinishTranasction above) requests the worker w to get ready to finish (commit or abort.) The commit record for transaction w.transID will be stored on the indicated file systems (passing the places allows for future use of replicated commit, in which a ready worker can attempt to determine the transaction outcome itself.) If w cannot guarantee to commit when FinishTransaction is called (for instance, if w = NIL), it must respond notReady. Note that if the worker cannot be found (WorkerFromTransID[t, FALSE] = NIL) in fielding a remote call to PrepareWorker, then the worker has already finished due to previous calls on PrepareWorker and FinishWorker.
FinishWorker: PROC [w: Worker, requestedOutcome: {commit, abort}];
-- Caller (an implementation of FinishTranasction above) asserts that w is recoverably in the ready state (because of a call on PrepareWorker), and demands that w finish with the specified outcome. Note that if the worker cannot be found (WorkerFromTransID[t, FALSE] = NIL) in fielding a remote call to FinishWorker, then the worker has already finished due to a previous call on FinishWorker.
-- Another trivial conversion procedure.
TransIDFromWorker: PROC [w: Worker] RETURNS [TransID];
-- Returns w.transID.
END.--BFS
BFSPrivate: DEFS = BEGIN
FileSystemID: TYPE = MACHINE DEPENDENT RECORD [
... ];
FileID: TYPE = MACHINE DEPENDENT RECORD [
birthFileSystemID: FileSystemID,
idOnFileSystem:
LONG CARDINAL ];
-- Pilot uses a birthMachineID, which is ok, too (but may complicate the file-location machinery).
FileObject: TYPE = RECORD [
file: FileID,
trans: Worker,
fileLockMode: LockMode ];
TransactionID: TYPE = MACHINE DEPENDENT RECORD [
coordinatorFileSystemID: FileSystemID,
idOnFileSystem:
LONG CARDINAL ];
-- Note that a FileSystemID, not a MachineID, is needed (because of CoordinatorFileSystemIDFromTransID).
TransState: TYPE = MACHINE DEPENDENT
{ unknown, active, ready, collecting, committing, aborting, committed, aborted };
-- or something like that ... this is the union of coordinator and worker states.
CoordinatorObject: TYPE = MONITORED RECORD [
transID: TransID,
state, recoverableState: TransState,
coordinator: FileSystemID,
workers:
LIST OF FileSystemID ];
WorkerObject: TYPE = MONITORED RECORD [
transID: TransID,
state, recoverableState: TransState,
coordinator: FileSystemID,
worker: FileSystemID ];
PrepareAndFinishWorker: PROC [w: Worker, requestedOutcome: {commit, abort}]
RETURNS [{committed, aborted}];
-- Caller (an implementation of BFS.FinishTranasction) asserts that w is the only worker of the transaction and that w.worker = w.coordinator. Caller requests w to finish with the specified outcome, writing the COMMIT record on w.coordinator. Worker can return the opposite of the requested outcome.
END.--BFSPrivate