New file server
CSL Notebook Entry
To: CSL                                 Date: December 22, 1980
From: Mark Brown                        Location: PARC/CSL
Subject: New file server                File: [Ivy]<Sierra>Doc>NewFS.bravo
XEROX
Attributes: informal, technical, Database, Distributed computing, Filing
Abstract: This memo outlines a plan to build a new file server to satisfy the projected needs of the Cedar File System and the Cedar Database Management System.
Abstract
It is time to consider building a new file server to satisfy the projected needs of the Cedar File System and the Cedar Database Management System. We feel that such a server can be developed quickly by (1) building it on a virtual memory (Cedar Kernel), a garbage collector (Cedar Mesa), and an existing file system (Pilot’s), (2) keeping its processing strategies simple, and (3) getting good performance by running on a Dorado.
Goals
A new file server constructed in CSL must serve the needs of two clients: the Cedar File System (Cedar FS) and the Cedar Database Management System (Cedar DBMS).
The Cedar FS will support a uniform address space for files (i.e. given an 80-bit file ID the system will provide access to the file, regardless of its location), management of the local disk as a cache for recently-used files, increased file availability via replication, and some form of directory system for mapping user-sensible descriptions of files into file IDs. (The Cedar FS will provide access to files stored on an IFS, as well as whatever new server is developed).
The Cedar DBMS supports shared, structured databases. Concurrent access to a database is allowed; atomic transactions help in maintaining database consistency. Database actions are executed on personal computers, and may manipulate a database that is spread across several file servers. (The Cedar DBMS currently uses Juniper for its file storage, locking, and transactions).
Requirements for the Cedar FS are:
1) fast whole-file transfers.
2) some form of time stamp for most-recent update (to allow the local disk cache to be validated efficiently; see the sketch following this list).
3) immutable files, i.e. the ability to copy a file while preserving its unique ID and time stamp for most-recent update.
4) locking (at granularity of files) and transactions, extending across multiple machines (to allow Gifford’s replication scheme to work). Also support of file access without transactions (in cases where transactions are too expensive).
5) protection via access control lists.
6) backup and archiving facilities.
7) capacity for a large number of simultaneous connections (30 - 50).
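To illustrate requirement 2: the Cedar FS local disk cache can remember the most-recent-update time stamp along with each cached file, and validate an entry with a single cheap query before deciding whether to retransfer. The sketch below is in Python rather than Cedar Mesa, and the update_time and retrieve operations are placeholders, not a proposed server interface.

    class LocalFileCache:
        def __init__(self, server):
            self.server = server
            self.entries = {}          # file ID -> (update time stamp, local copy)

        def open(self, file_id):
            cached = self.entries.get(file_id)
            current = self.server.update_time(file_id)    # one cheap request
            if cached is not None and cached[0] == current:
                return cached[1]                          # cache is valid: no transfer
            local_copy = self.server.retrieve(file_id)    # whole-file transfer
            self.entries[file_id] = (current, local_copy)
            return local_copy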
Requirements for the Cedar DBMS are:
1) locking (at granularity of files or pages) and transactions, extending across multiple machines (to allow shared, recoverable databases). Also support of file access without transactions (in cases where transactions are too expensive).
2) fast random access to pages, and fast whole-file transfers. Support of page sizes larger than 512 bytes (this is a random-access performance requirement; it does not imply that disk pages must be large).
3) potential for creating a database server by installing database-specific programs, written in Cedar Mesa, on the file server (perhaps allowing simultaneous operation as both file server and database server).
4) recovery from media failures.
5) protection via access control lists.
Why not build a specialized database server and use IFS for normal files?
It is reasonable to ask what use a DBMS has for the notion of "file". After all, a file is just a relatively simple data structure, imposed upon a disk volume. A database imposes more complex structures. Why not have the DBMS manage its disk directly?
The most radical form of this question is: why not dispense with file systems altogether, and vend equivalent services through the DBMS? This is certainly possible in principle. But one of the attractive features of files is that people have confidence in them: they are willing to store a file on Ivy without taking precautions against the possibility that the file will be lost or damaged. They believe in the simple backup and protection mechanisms, which can be simple precisely because files are simple in structure. Only experience can breed similar confidence in databases. (In the meantime, we can free users from some of the irrelevant details concerning files by supporting file-valued database attributes.)
Another motivation for files as an intermediate layer is that there is no such thing as a universal database management system. While the Cedar DBMS is attempting to satisfy a wide range of applications’ requirements, we still expect specialized database managers to be written in the future (consider the needs of ISL applications in storing huge images and retrieving portions of them). It will be much easier to produce these systems on top of a file server with support for transactions.
Given that files are going to be part of our environment for a while, it is desirable that files and databases understand the same notions of transactions and locking. If their transactions are not managed in a compatible way, it will be impossible to guarantee consistency when a file and a related database are both updated. If their locks are not managed in a compatible way, it will be impossible to recover from deadlock situations spanning both databases and files (except via timeouts).
One potential problem with building a DBMS on a file system comes in scavenging the file system: the file system’s scavenger is ignorant of the form of database structures and thus does a dumb job when scavenging databases (imagine an Alto file system scavenger that is ignorant of file directories). But if the file system keeps logs then recovery, even from most media failures, can be performed from the logs instead of by scavenging. Running the scavenger is required only in desperate situations.
How do we get there from here?
Assuming that we want to build a new transactional file server for Cedar, there are three plausible starting points: the Alto file system, Juniper, and Pilot.
The Alto base has served well in IFS. But the IFS development is all in BCPL, and hence much of it cannot be used directly. The Alto file system does not support efficient random access; this has been grafted on in a different way by each client who needed it. There may be other negative factors in the Alto/Mesa file system that I am not aware of.
The only part of Juniper that seems promising to adopt is the disk division, but it is organized around the use of shadow pages, which in the light of experience was probably a mistake. Also, only T-300 disks can be handled by the disk driver (Karen estimates 1-2 months of work to make T-80s work too).
We have relatively little experience with Pilot. But the Cedar Kernel is based on Pilot, so CSL will soon have the experience, and Pilot is being stressed by various product groups that have difficult performance requirements. Pilot’s basic file system structures (page labels, volume file map, and volume allocation map) seem sound.
In what follows we shall assume a Pilot-based development.
What can we learn from the Juniper experience?
Despite the fact that we do not intend to use any Juniper code, we can learn a lot from the Juniper implementors’ experiences.
A well-developed language. Juniper’s development was hampered by the primitive state of Mesa’s development at the time the project was started. Juniper also did not track developments in Mesa well: for instance, the conversion from message-passing to monitor-style interprocess communication was made without realizing the true impact of the change (it introduced deadlocks). To the extent that we write a new file server while the Cedar Mesa language is under development, this sort of thing is still a potential problem. But the language as it stands is quite an improvement over Mesa 2.0.
A big enough machine. Juniper’s development was hampered by having to wedge the system onto an Alto I (later an Alto II with XMesa). IFS shows that a very capable file server can run on an Alto, but a usable new system can be produced more quickly if it is run on a machine where the number of procedure calls required to perform a function is not a factor. Once the system runs acceptably on a Dorado, it should be possible to tune it to run on a Dolphin also.
Multi-machine transactions. The latest version of the Lampson-Sturgis paper gives a very simple structure for multi-machine transactions, expressed in terms of remote procedure calls and monitors. The amount of mechanism required to make single-machine transactions into multi-machine transactions is small, and it introduces no changes in the system structure. Thus it should be possible to concentrate on building a single-machine file server, with confidence that adding multi-machine capability later will be easy.
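To make the claim concrete, here is the two-phase shape of that structure, sketched in Python rather than Cedar Mesa. The class names, record formats, and the in-memory Log stand-in are illustrative, not taken from the Lampson-Sturgis paper. Each Worker stands for one machine's transaction machinery; the single-machine case is simply a coordinator with one worker.

    class Log:
        """Stand-in for a stable, crash-surviving log on one machine."""
        def __init__(self):
            self.records = []
        def force(self, recs):
            self.records.extend(recs)      # a real log flushes to the disk here

    class Worker:
        """One machine's transaction machinery; in the real system the calls
        below would arrive as remote procedure calls."""
        def __init__(self):
            self.log = Log()
            self.intentions = {}           # tid -> deferred disk writes

        def prepare(self, tid):
            # Force the intentions and a "ready" record; after this the worker
            # can commit or abort as directed, even across a crash.
            self.log.force([("intend", tid, self.intentions.get(tid, [])),
                            ("ready", tid)])
            return "ready"                 # a real worker may answer "abort"

        def commit(self, tid):
            self.log.force([("commit", tid)])
            self.intentions.pop(tid, None) # a real worker now applies the writes

        def abort(self, tid):
            self.log.force([("abort", tid)])
            self.intentions.pop(tid, None)

    class Coordinator:
        """Runs on the machine where the transaction originated."""
        def __init__(self, workers):
            self.log, self.workers = Log(), workers

        def end_transaction(self, tid):
            # Phase 1: every worker must force its log and answer "ready".
            if all(w.prepare(tid) == "ready" for w in self.workers):
                # Phase 2: this commit record is the single event that
                # decides the outcome of the whole transaction.
                self.log.force([("commit", tid)])
                for w in self.workers:
                    w.commit(tid)
                return "committed"
            for w in self.workers:
                w.abort(tid)
            return "aborted"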
Interface; cooperating servers on same machine. Juniper contains a lot of internal complexity that derives from the fact that the directory server is "almost" like a remote client. The boundaries of the system were not drawn carefully enough, and the supposed client interface (User Pine) was so difficult to use that it was "papered over" in several incompatible ways for the directory server, FTP server, etc. Changes in this interface affected (at least in terms of compilation dependency) nearly the entire system; changing POINTER to LONG POINTER in this interface changes the format of requests on the wire and requires changes in the transaction division! In building a new server, it is important to define its full interface explicitly. It must be possible to write servers that cohabit the machine without causing monitor deadlocks or scheduling problems.
Shadow pages. Using shadow pages in the implementation of transactions has several drawbacks. It eliminates the possibility of allocating a file as a run of consecutive pages on the disk; this not only leads to bad sequential-access performance, but also means that the description of a file is fairly large (say 0.5% of the file size, for Alto-sized pages) and competes with file pages for buffer space. Shadows contribute to the following Juniper anomaly: nonoverlapping (byte-range) writes to a single page are incompatible. In a database server, a log will be necessary for recovery in any case, since shadow pages operate at the wrong granularity. Logs support a general-purpose user checkpointing facility in a straightforward way, and are useful in recovering from media failures.
Non-transactional access. Some applications will want to supply their own forms of recovery (e.g. Eric Schmidt’s database loader). They should not have to pay the price of transactions. (I feel that with a careful implementation it should be possible to make the cost low enough that everyone will use transactions, but developing such a careful implementation will take time).
Parts list for a "simple" file system
Communication. Use RPC where possible, including setting up and tearing down streams. A more specialized stream protocol will still be needed for efficient whole-file transfers, especially through gateways or over slow lines.
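Purely for flavor, here is a hypothetical sketch of the sort of client-visible operations the RPC layer might carry; none of these names or signatures is settled, and the real interface would be a Cedar Mesa interface, not Python. Whole-file transfers would bypass this page-at-a-time interface in favor of the stream protocol.

    from dataclasses import dataclass

    @dataclass
    class FileID:
        bits: int              # the Cedar FS plans on 80-bit unique file IDs

    class FileServer:
        """Operations a client would invoke by remote procedure call."""
        def begin_transaction(self) -> int: ...
        def read_pages(self, tid: int, file: FileID, first: int, count: int) -> bytes: ...
        def write_pages(self, tid: int, file: FileID, first: int, data: bytes) -> None: ...
        def set_size(self, tid: int, file: FileID, pages: int) -> None: ...
        def commit(self, tid: int) -> None: ...
        def abort(self, tid: int) -> None: ...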
Files. Pilot will be used to manage volumes and files. We may use a few internal Pilot interfaces if this is truly necessary to meet performance or other goals.
Disk driver, allocator. Come free with Pilot files.
Scavenger. The Pilot scavenger is a start, but it is known to be inadequate. Recovery from the log is the harder part.
Log, recovery manager. Keeping the log on a separate volume will reduce arm contention, but is not necessary during development (we can use Pilot logical volumes). Later we may duplex the log; for now it is enough to duplex the "root", i.e. the pointer to the most recent checkpoint.
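For concreteness, one way the duplexed root could work: the pointer to the most recent checkpoint is written to two fixed, well-known pages, each carrying a version number and a checksum, so a crash in the middle of one write still leaves a readable copy. The page addresses, record layout, checksum, and disk object below are assumptions for illustration, not Pilot's mechanism.

    import struct, zlib

    ROOT_PAGES = (0, 1)                  # two fixed, well-known page addresses

    def write_root(disk, version, checkpoint_addr):
        body = struct.pack(">QQ", version, checkpoint_addr)
        record = body + struct.pack(">I", zlib.crc32(body))
        for page in ROOT_PAGES:          # write the two copies one at a time
            disk.write_page(page, record)

    def read_root(disk):
        best = None
        for page in ROOT_PAGES:
            rec = disk.read_page(page)
            body, crc = rec[:16], struct.unpack(">I", rec[16:20])[0]
            if zlib.crc32(body) != crc:
                continue                 # this copy was torn by a crash; skip it
            version, checkpoint_addr = struct.unpack(">QQ", body)
            if best is None or version > best[0]:
                best = (version, checkpoint_addr)
        return best                      # (version, address of newest checkpoint)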
Locks. Support file and page locks; no coalescing of adjacent pages’ locks. Perhaps eliminate Juniper’s "broken lock" idea. Maybe use "lock escalation" (when a large number of pages in a file are locked, trade these locks for a whole-file lock to reduce overhead). Certainly allow exclusive-mode locks. Use deadlock detection (timeouts only as a last resort). Jim Gray’s notes describe a working lock implementation very explicitly.
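A rough sketch of such a lock manager follows, with exclusive locks only and an arbitrary escalation threshold (both simplifications are mine, not taken from Gray's notes). A real implementation would also maintain a waits-for graph for deadlock detection where this sketch simply refuses the request.

    ESCALATE_AT = 100              # page locks in one file before trading up

    class LockManager:
        def __init__(self):
            self.owner = {}        # (file ID, page or None) -> owning transaction
            self.pages_held = {}   # (transaction, file ID) -> set of locked pages

        def lock_file(self, tid, file_id):
            # Exclusive whole-file lock: refused while any other transaction
            # holds the file or any of its pages.
            if any(f == file_id and owner != tid
                   for (f, _), owner in self.owner.items()):
                return False
            self.owner[(file_id, None)] = tid
            return True

        def lock_page(self, tid, file_id, page):
            if self.owner.get((file_id, None), tid) != tid:
                return False       # another transaction holds the whole file;
                                   # the caller must wait, and a real manager
                                   # adds a waits-for edge and checks for deadlock
            if self.owner.setdefault((file_id, page), tid) != tid:
                return False       # page lock held by another transaction
            held = self.pages_held.setdefault((tid, file_id), set())
            held.add(page)
            if len(held) >= ESCALATE_AT and self.lock_file(tid, file_id):
                for p in held:     # escalation: one file lock replaces many
                    del self.owner[(file_id, p)]
                held.clear()
            return True

        def release_all(self, tid):        # called at commit or abort time
            self.owner = {k: o for k, o in self.owner.items() if o != tid}
            self.pages_held = {k: s for k, s in self.pages_held.items()
                               if k[0] != tid}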
Transactions. Using Butler's latest algorithm, the multi-machine case seems to be a relatively simple extension if planned from the start. The algorithm in the paper is always "ready" (it forces the log before acking any action); we don't want that for performance reasons, but it is inessential to the paper. It should be possible to access a file without a transaction (perhaps as a protected mode). If we choose not to support "broken locks", we should be able to eliminate Juniper's notion of multi-client transactions.
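The performance point is essentially one of buffering: log records for a transaction accumulate in memory and are forced to the disk only when the transaction prepares or commits, rather than before acknowledging each individual action. A minimal sketch (stable_log is an assumed crash-surviving log):

    class BufferedLog:
        def __init__(self, stable_log):
            self.stable_log = stable_log   # the crash-surviving log on disk
            self.buffer = []               # records not yet forced to the disk

        def append(self, record):
            self.buffer.append(record)     # the action is acked with no disk IO

        def force(self):
            # Called when the transaction prepares or commits: everything
            # buffered reaches the disk before "ready" or "commit" is answered.
            self.stable_log.write(self.buffer)
            self.buffer.clear()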
Directory. The need in this area depends upon the plans for the Cedar FS directory system. Preference would be to build no directory system, or to use an instance of the Cedar DBMS as a simple directory system.
Authentication. Use the network’s services for access control lists.
Space accounting. Some notion of file ownership is required. Enforcement of quotas need not be done every minute, if that makes things easier. An archive system makes it possible to bring an owner back under quota by forced archiving.
Logging and checkpoints
There are several strategies for implementing transactions via logs. We shall describe the redo log scheme here. There may be advantages to the more elaborate undo-redo logging scheme, but we do not consider it here because we want to see how simple an adequate scheme can be. The advantage of a redo log over an undo log is primarily in its better real-time response to write requests; redo requires a minimum of disk IO before commit, at the cost of more internal storage for volatile transaction state. Undo complicates life in other ways: it makes it more difficult to perform a checkpoint (see below), and introduces a scheduling problem since deadlock between undoers cannot be tolerated.
Any client request that would change the state of the disk is deferred until after the transaction has committed. Instead, a record of the change is written in the log. For instance, a write page request would write the file ID, the page number in the file, the data to be written, and other bookkeeping information into the log. Then a volatile record of the change is written into a random-access data structure.
To commit the transaction, a commit record is written to the log, and all dirty pages of the log are forced to the disk (the page containing the commit record is written last). The volatile record of the transaction is then used to actually make the changes to the disk. To recover from a crash, the log is replayed from the most recent checkpoint to rebuild the volatile records of committed transactions, and then commit processing is carried out as before.
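In outline, the scheme looks something like the following sketch (in Python for brevity; the record formats and the disk and log objects are placeholders for the real modules):

    class RedoTransactionManager:
        def __init__(self, disk, log):
            self.disk, self.log = disk, log
            self.intentions = {}             # tid -> list of deferred writes

        def write_page(self, tid, file_id, page, data):
            # Nothing touches the file on the disk yet: the change is recorded
            # in the log and in a volatile structure keyed by transaction.
            self.log.append(("write", tid, file_id, page, data))
            self.intentions.setdefault(tid, []).append((file_id, page, data))

        def commit(self, tid):
            # The commit record is written last; forcing the log up to and
            # including it is the act that commits the transaction.
            self.log.append(("commit", tid))
            self.log.force()
            for file_id, page, data in self.intentions.pop(tid, []):
                self.disk.write(file_id, page, data)   # now do the real writes

        def recover(self, checkpoint):
            # Replay the log from the most recent checkpoint to rebuild the
            # volatile records, then redo the writes of committed transactions.
            self.intentions.clear()
            committed = set()
            for record in self.log.read_from(checkpoint):
                if record[0] == "write":
                    _, tid, file_id, page, data = record
                    self.intentions.setdefault(tid, []).append((file_id, page, data))
                elif record[0] == "commit":
                    committed.add(record[1])
            for tid in committed:
                for file_id, page, data in self.intentions.pop(tid, []):
                    self.disk.write(file_id, page, data)
            self.intentions.clear()          # uncommitted work is simply discarded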
Checkpoints are used to limit the amount of work required to recover from a crash. At a minimum, a checkpoint records information on the disk which helps to locate the end of the log at restart. It also snapshots internal file system data structures (the volume file map and volume allocation map) so that they can be recovered without scavenging the entire file system.
The simplest way to perform a checkpoint is to finish all transactions that are in the process of doing disk writes to the structures being snapshotted (i.e. have committed and are creating, destroying, or changing the number of pages in a file), then disallow all such activity while writing the changes into the log. This does not prevent transactions from committing while the checkpoint is in progress; it just means that more volatile state must be maintained until the checkpoint is over. (Note that with an undo log, a transaction's progress is halted during a log checkpoint if it attempts to create or extend a file).
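A sketch of this checkpoint discipline, using a Python condition variable where the real system would use a Mesa monitor; append_forced and write_root are placeholders (write_root could be the duplexed-root writer sketched earlier):

    import threading

    class Checkpointer:
        def __init__(self, log, write_root):
            self.log, self.write_root = log, write_root
            self.lock = threading.Condition()
            self.writers = 0             # committed transactions now updating
            self.blocked = False         # the VFM/VAM (create, delete, extend)

        def begin_structure_update(self):    # called by committing transactions
            with self.lock:
                while self.blocked:
                    self.lock.wait()
                self.writers += 1

        def end_structure_update(self):
            with self.lock:
                self.writers -= 1
                self.lock.notify_all()

        def checkpoint(self, vfm, vam, version):
            with self.lock:
                self.blocked = True          # disallow new VFM/VAM updates...
                while self.writers > 0:      # ...and let those in progress finish
                    self.lock.wait()
            try:
                addr = self.log.append_forced(("checkpoint", vfm, vam))
                self.write_root(version, addr)   # the duplexed root points here
            finally:
                with self.lock:
                    self.blocked = False
                    self.lock.notify_all()

Ordinary commits that do not touch the snapshotted structures are never delayed by this; only creations, deletions, and size changes wait for the checkpoint to finish.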
Pilot supports a "scavenge bit" associated with the VAM and VFM (the bit is set before any change is made to these structures, and is cleared when the change is complete and consistent on the disk). If we feel that the likelihood of a trashed VAM or VFM is small enough, we could take checkpoints without saving VAM and VFM in the log. This would make checkpoints much cheaper, hence they could be taken more frequently and the amount of log to be processed in recovery would be smaller. During recovery, volumes whose scavenge bit is set would require scavenging. A T-300 can be read in 5 minutes, so in principle a scavenger could fix one volume in roughly this amount of time.
Checkpoint and recovery will be an area of tension between Pilot and our needs. We may find that the best solution is to write our own VAM and VFM managers. These are relatively small modules in Pilot.