Heading:z18697qjk40(635)\g BFS interface, version 0z18697y756qjk40\g Page Numbers: Yes X: 527 Y: 10.5"z18697qjk40\g CSL Notebook Entryz18697l4445y762\f5bg To: Sierra designers Date: February 16, 1981z18697l4445d2998e21(0,65535)(1,4445)(5,11684)(6,13653)\f1g3f0t2 1t0 16t6 1f1t0 5f0t7 1t0 From: Mark Brown Location: PARC/CSLz18697l4445d2998y716e25(6,13664)\f1g5f0t2 1t0 10t6 1f1t0 9f0t7 1t0 Subject: BFS interface, version 0 File: [Ivy]Doc>BFSDesign0.bravoz19579l4445d2998e25\f1g8f0t2 1t0 24t6 1f1t0 5f0t7 1f1t0 33f0 XEROX z18697l508y644e14(2116)\f2g5f0 Attributes: informal, technical, Database, Distributed computing, Filingz18697l4896d2999e10(0,4904)(1,65535)(5,65535)(6,65535)\f1g11f0 Abstract: This memo proposes a design for Sierra's BFS interface. The design is far from complete, but the facilities for multiple file system transactions have been worked out in some detail.z18697l4896d2999e10j\f1g9f0 z18697e10(2116)\gv Introductionz18697e6jk40(635)\bg12B This memo proposes a design for Sierra's BFS. The BFS is a low-level file system interface, several instances of which may exist on a single machine. Some instances will represent local file systems, while another may have a stub implementation that talks to a remote file system (itself a thin veneer on top of a local BFS) via RPC. This memo does not propose an "on the wire" BFS, though that interface will also be important. We believe that the proposed BFS defines procedures that are sufficient for conversion to and from an "on the wire" representation. z18697e12jk40\g The BFS provides primitives for coordinating transactions that involve multiple file systems. These are given in the sections "Transaction Coordinator" and "Transaction Worker" below. Some care has been taken to provide primitives that can take advantage of important special cases such as single file system transactions and readonly transactions. In particular, we do not take the Lampson-Sturgis paper's "always ready" approach to transactions, since the benefits of this approach are unclear while its cost is a heavier transaction mechanism. The transaction primitives are arranged so that duplicate calls, which will happen occasionally with a cheap RPC, will not cause problems. Some aspects of the interface are designed with an eye to supporting replicated commit records as a future option, but the interface does not fully support them now. z18697e12jk40\g A BFS on a workstation will not normally use its local file system to coordinate multiple file system transactions, since a crashed coordinator can tie up resources on other file systems. It is perfectly ok for a workstation's BFS to coordinate its own transactions, or to act as a worker in a multiple file system transaction, since in these cases the only resources that it can tie up in a crash are its own. z18697e12jk40\g Several planned-for or at least talked-about items are missing from the interface as proposed here. Assuming that we wish to build a file transfer protocol on top of BFS, the BFS should accept some form of prefetch hint to help it schedule transfers. Databases and fast streams may wish to deal with groups of pages (what Pilot calls "uniform swap units") rather than single pages. Databases will want to log application-specific undo and redo actions and set application-specific locks. We planned to support encryption at a low level. The BFS should perform low-level authentication, at least to the extent of verifying that it is being called by a known piece of software. We have not yet specified the means of validating file caches or manipulating file properties (read, write, and create times, string names, etc.), or any form of "broken read lock", transaction commit without releasing locks (Juniper "checkpoint"), or transaction save point. z18697e12jk40\g Following the BFS interface, we sketch BFSPrivate, an interface for communication among modules that implement BFS (for instance, to hold the concrete versions of types that are opaque in BFS.) We have started to implement some of the transaction-related procedures; this code will appear in a later memo. z18697e12jk40\g BFS: DEFS = BEGINl3351d2998e24k792(0,3352)(1,2998)(2,2998)\f5b5f1 4B3b l3351d2998e1\g FileSystemID: TYPE [3];l3351d2998e1\g14f1 4f0 -- Unique ID of a file system.l3704d3351e1\g TransID: TYPE [5];l3351d2998e1\g9f1 4f0 -- Unique ID of a transaction.l3704d3351e1\g -- Locksl3351d2998e24\f5bg LockMode: TYPE = MACHINE DEPENDENT { S, U, X, IS, IU, IX, SIU, SIX };l3351d2998e1\g10f1 4f0 3f1 17f0 -- or something like that ... .l3704d3351e1\g -- Brief introduction to lock modes:l3704d3351e1\g -- S means "share", U means "update", X means "exclusive". S is required for reading, X is required for writing. U essentially makes a reservation to write; it is compatible with S but not U or X. (Juniper uses U locks to cover writes, converting to X locks just before commit. An asymmetrical U mode may also be useful: compatible with existing S locks but not with subsequent S requests.) These three modes are the only ones that apply to pages, since from the file system's point of view pages have no lockable substructure.l3704d3351e1\g137i11I -- I means "intention" and relates to the interaction between file and page locks. Locking a file in IX (intention exclusive) mode, for instance, says that we intend to lock pages of the file in X mode. If we lock the file in S, U, or X mode, we don't acquire page locks at all. Hence IX is incompatible with S, U, and X but not IS, IU, or IX.l3704d3351e1\g -- Combined modes like SIX lock the whole file in S mode, and also declare an intention to lock pages of the file in X mode. This is useful since it allows us to avoid getting a lock for each page we read (we expect to do more reads than writes.)l3704d3351e1\g -- File actions (below) perform implicit locking to maintain serial consistency.l3351d2998e6\g -- Filesl3351d2998e24k792\f5bg FileID: TYPE [5];l3351d2998e1\g8f1 4f0 -- Unique ID of a file.l3704d3351e1\g File: TYPE = REF FileObject;l3351d2998e1\g6f1 4f0 3f1 3f0 FileObject: TYPE;l3351d2998e1\g12f1 4f0 -- Represents an "open" file, opaque to BFS client. The following is a basic concrete representation, used in describing operations below (BFS to local file system will use something like this, BFS to remote file system will use a representation containing less information):l3704d3351e1\g FileObject: TYPE = RECORD [ file: FileID, trans: Worker, fileLockMode: LockMode ];l4057d3704e1\g12f1 4f0 3f1 6f0 CreateFile: PROC [f: FileID, initialPageLength: PageLength, trans: Worker];l3351d2998e6\bg10B2f1 4f0 -- Pass in FileID so that a copy of an immutable file can be created.l3704d3351e1\g DeleteFile: PROC [f: FileID, trans: Worker];l3351d2998e6\bg10B2f1 4f0 -- This could be made available in both U and X mode (see WritePage below), but it is probably not worth the trouble.l3704d3351e1\g OpenFile: PROC [f: FileID, trans: Worker, initialFileLockMode: LockMode _ IS] RETURNS [File];l3351d2998e6\bg8B2f1 4f0 64f1 7f0 -- Opens file f for access under transaction t, first locking it in the specified mode.l3704d3351e1\g FileFromFileIDAndTransaction: PROC [f: FileID, trans: Worker] RETURNS [File];l3351d2998e6\bg28B2f1 4f0 28f1 7f0 -- Finds an already-opened file. Returns NIL if file is not found.l3704d3351e1\g42f1 3f0 ReadPage: PROC [f: File, p: PageIndex, pageLockMode: LockMode {S, U, X} _ S] RETURNS [data: Page];l3351d2998e6\bg8B2f1 4f0 63f1 7f0 -- Reads data from page p of file f. Client will specify a pageLockMode stronger than S if it expects to update the page later in the transaction.l3704d3351e1\g WritePage: PROC [f: File, p: PageIndex, data: Page, pageLockMode: LockMode {U, X} _ U];l3351d2998e6\bg9B2f1 4f0 -- Writes data on page p of file f. Client will specify pageLockMode = U if it wishes WritePage to return immediately without waiting for readers of page p to leave (client may still have to wait, but waiting will happen at commit time.)l3704d3351e1\g GetLength: PROC [f: File, lengthLockMode: LockMode {S, U, X} _ S] RETURNS [pageLength: PageLength];l3351d2998e6\bg9B2f1 4f0 51f1 7f0 SetLength: PROC [f: File, pageLength: PageLength, lengthLockMode: LockMode {U, X} _ U];l3351d2998e6\bg9B2f1 4f0 -- Transaction Coordinatorl3351d2998e24k792\f5bg Coordinator: TYPE = REF CoordinatorObject;l3351d2998e1\g13f1 4f0 3f1 3f0 CoordinatorObject: TYPE;l3351d2998e1\g19f1 4f0 -- Volatile representative for a coordinator, opaque to BFS client. The following is a basic concrete representation, used in describing operations below (BFS to local file system will use something like this, BFS to remote file system will use a representation containing less information):l3704d3351e1\g CoordinatorObject: TYPE = MONITORED RECORD [ transID: TransID, state, recoverableState: TransState, coordinator: FileSystemID, workers: LIST OF FileSystemID ];l4057d3704e1\g19f1 4f0 3f1 16f0 94f1 7f0 CreateTransaction: PROC [stableCreation: BOOLEAN _ FALSE] RETURNS [t: Coordinator];l3351d2998e6\bg17B2f1 4f0 18f1 7f0 3f1 5f0 2f1 7f0 -- Creates a Coordinator for a brand-new transaction. If stableCreation, then CreateTransaction will not return until a stable record of creating the transaction has been written. Without stableCreation, there is a small window in which a coordinator crash will cause the coordinator to forget that the transaction ever existed. Since this crash certainly aborts the transaction, stableCreation is not mandatory, but it allows clients to ignore the possibility that CoordinatorFromTransID (below) will ERROR for a recent transaction (last few hours).l3704d3351e1\g505f1 5f0 CoordinatorFromTransID: PROC [t: TransID] RETURNS [Coordinator];l3351d2998e6\bg22B2f1 4f0 14f1 7f0 -- Caller asserts that transaction t was created here (we can tell by using CoordinatorFileSystemFromTransID; ERROR if not true.) Locates t's coordinator (searching the online log and creating a new volatile coordinator if necessary) and returns it. CoordinatorFromTransID raises TransactionUnknown if t cannot be found; if TransactionUnknown is resumed then we return its result. l3704d3351e1\g110f1 5f0 TransactionUnknown: SIGNAL [t: TransID] RETURNS [Coordinator];l3351d2998e6\bg18B2f1 6f0 14f1 7f0 RegisterWorker: PROC [t: Coordinator, w: FileSystemID, stableRegistration: BOOLEAN _ FALSE] RETURNS [success: BOOLEAN];l3351d2998e6\bg14B2f1 4f0 55f1 7f0 3f1 5f0 2f1 7f0 11f1 7f0 -- Requests permission to perform actions under transaction t on file system w. The request is a noop if w is already a member of t.workers. The request is refused (returns success = FALSE) if t.state # active. If stableRegistration, then RegisterWorker will not return until the registration has been written to stable storage. Without stableRegistration, there is a small window in which a coordinator crash will cause the coordinator to forget that the worker is part of the transaction. Since this crash is guaranteed to abort the transaction, stableRegistration is not mandatory, but it allows workers to have a higher tolerance of transaction inactivity before unilaterally aborting a transaction. Note: the caller (an implementation of CreateWorker below) must not have worker monitors locked.l3704d3351e1\g185f1 5f0 FinishTransaction: PROC [t: Coordinator, requestedOutcome: {commit, abort}, commitRecordPlaces: LIST OF FileSystemID _ NIL] RETURNS [outcome: {committed, aborted}];l3351d2998e6\bg17B2f1 4f0 73f1 7f0 16f1 3f0 2f1 7f0 -- Caller requests t to commit or abort. The COLLECTING record (containing t.transID, requestedOutcome, commitRecordPlaces, and t.workers) will be written to t.coordinator. If a crash happens before this record appears, then t will be aborted during recovery (if there is any stable record of t at all); otherwise t will try to commit or abort as indicated by requestedOutcome. The COMMIT record will written to commitRecordPlaces (NIL means t.coordinator). If t.workers = t.coordinator = commitRecordPlaces, the coordinator will delegate all responsibility to the worker, saving two synchronous disk I/Os (COLLECTING and READY). l3704d3351e1\g46f1 10f0 329f1 6f0 44f1 3f0 173f1 10f0 5f1 5f0 -- Two trivial conversion procedures that have long names.l3351d2998e6\g CoordinatorFileSystemIDFromTransID: PROC [t: TransID] RETURNS [FileSystemID];l3351d2998e6\bg34B2f1 4f0 14f1 7f0 -- Returns location of coordinator of transaction t ("location of coordinator" means "file system that will record the outcome"). We expect the encoding of the FileSystemID in the TransID to be relatively simple. l3704d3351e1\g TransIDFromCoordinator: PROC [t: Coordinator] RETURNS [TransID];l3351d2998e6\bg22B2f1 4f0 18f1 7f0 -- Returns t.transID. l3704d3351e1\g -- Transaction Workerl3351d2998e24k792\f5bg Worker: TYPE = REF WorkerObject;l3351d2998e1\g8f1 4f0 3f1 3f0 WorkerObject: TYPE;l3351d2998e1\g14f1 4f0 -- Volatile representative for a worker, opaque to BFS client. The following is a basic concrete representation, used in describing operations below (BFS to local file system will use something like this, BFS to remote file system will use a representation containing less information):l3704d3351e1\g WorkerObject: TYPE = MONITORED RECORD [ transID: TransID, state, recoverableState: TransState, coordinator: FileSystemID, worker: FileSystemID ];l4057d3704e1\g14f1 4f0 3f1 16f0 WorkerFromTransID: PROC [t : TransID, createWorkerIfNecessary: BOOLEAN _ TRUE] RETURNS [Worker];l3351d2998e6\bg17B2f1 4f0 40f1 7f0 3f1 4f0 2f1 7f0 -- Searches for volatile record of a worker for transaction t on this file system, and returns it if found. If not found and not createWorkerIfNecessary, returns NIL. Otherwise calls coordinator's RegisterWorker, then creates the worker. If worker cannot be created (for instance, t is finishing), returns NIL.l3704d3351e1\g163f1 4f0 142f1 4f0 PrepareWorker: PROC [w: Worker, commitRecordPlaces: LIST OF FileSystemID] RETURNS [{ready, notReady}];l3351d2998e6\bg13B2f1 4f0 33f1 7f0 15f1 7f0 -- Caller (an implementation of FinishTranasction above) requests the worker w to get ready to finish (commit or abort.) The commit record for transaction w.transID will be stored on the indicated file systems (passing the places allows for future use of replicated commit, in which a ready worker can attempt to determine the transaction outcome itself.) If w cannot guarantee to commit when FinishTransaction is called (for instance, if w = NIL), it must respond notReady. Note that if the worker cannot be found (WorkerFromTransID[t, FALSE] = NIL) in fielding a remote call to PrepareWorker, then the worker has already finished due to previous calls on PrepareWorker and FinishWorker.l3704d3351e1\g432G13f1 3f0 1g91f1 5f0 4f1 3f0 FinishWorker: PROC [w: Worker, requestedOutcome: {commit, abort}];l3351d2998e6\bg12B2f1 4f0 -- Caller (an implementation of FinishTranasction above) asserts that w is recoverably in the ready state (because of a call on PrepareWorker), and demands that w finish with the specified outcome. Note that if the worker cannot be found (WorkerFromTransID[t, FALSE] = NIL) in fielding a remote call to FinishWorker, then the worker has already finished due to a previous call on FinishWorker. l3704d3351e1\g261f1 5f0 4f1 3f0 -- Another trivial conversion procedure.l3351d2998e6\g TransIDFromWorker: PROC [w: Worker] RETURNS [TransID];l3351d2998e6\bg17B2f1 4f0 13f1 7f0 -- Returns w.transID. l3704d3351e1\g l3351d2998e1\g END.--BFSl3351d2998e1\f1b3f0B3f5b3f0B BFSPrivate: DEFS = BEGINl3351d2998e24k792\f5b12f1 4B3b l3351d2998e1\g FileSystemID: TYPE = MACHINE DEPENDENT RECORD [ ... ];l3351d2998e1\g14f1 4f0 3f1 24f0 FileID: TYPE = MACHINE DEPENDENT RECORD [ birthFileSystemID: FileSystemID, idOnFileSystem: LONG CARDINAL ];l3351d2998e1\g8f1 4f0 3f1 24f0 52f1 13f0 -- Pilot uses a birthMachineID, which is ok, too (but may complicate the file-location machinery).l3704d3351e1\g FileObject: TYPE = RECORD [ file: FileID, trans: Worker, fileLockMode: LockMode ];l3351d2998e1\g12f1 4f0 3f1 6f0 l3351d2998e1\g TransactionID: TYPE = MACHINE DEPENDENT RECORD [ coordinatorFileSystemID: FileSystemID, idOnFileSystem: LONG CARDINAL ];l3351d2998e1\g15f1 4f0 3f1 24f0 58f1 13f0 -- Note that a FileSystemID, not a MachineID, is needed (because of CoordinatorFileSystemIDFromTransID).l3704d3351e1\g TransState: TYPE = MACHINE DEPENDENT { unknown, active, ready, collecting, committing, aborting, committed, aborted };l3351d2998e1\g12f1 4f0 3f1 17f0 -- or something like that ... this is the union of coordinator and worker states.l3704d3351e1\g CoordinatorObject: TYPE = MONITORED RECORD [ transID: TransID, state, recoverableState: TransState, coordinator: FileSystemID, workers: LIST OF FileSystemID ];l3351d2998e1\g19f1 4f0 3f1 16f0 94f1 7f0 WorkerObject: TYPE = MONITORED RECORD [ transID: TransID, state, recoverableState: TransState, coordinator: FileSystemID, worker: FileSystemID ];l3351d2998e1\g14f1 4f0 3f1 16f0 l3351d2998e1\g PrepareAndFinishWorker: PROC [w: Worker, requestedOutcome: {commit, abort}] RETURNS [{committed, aborted}];l3351d2998e6\bg22B2f1 4f0 48f1 7f0 -- Caller (an implementation of BFS.FinishTranasction) asserts that w is the only worker of the transaction and that w.worker = w.coordinator. Caller requests w to finish with the specified outcome, writing the COMMIT record on w.coordinator. Worker can return the opposite of the requested outcome.l3704d3351e1\g212f1 6f0 l3351d2998e1\g END.--BFSPrivatel3351d2998e1\f1b3f0B3f5b10f0B