Inter-Office Memorandum
To: Sierra designers                      Date: February 6, 1981
From: Mark Brown                          Location: Palo Alto
Subject: Minutes of February 5 meeting    Organization: CSL
XEROX
Filed on: <Sierra>Doc>2-5-81.minutes
Attending: Birrell, Boggs, Brown, Levin, Kolling, Schroeder, Taft.
This memo combines minutes of the meeting with a further development of the design sketched there.
The following is a list of "principles" that we seemed to agree upon during the meeting. In some cases I have extrapolated a bit on what was actually said.
1) A workstation will be able to function even when it cannot connect to a file server. This allows both for file server failures and for configurations that do not include a file server.
2) Immutable files will be supported; we expect many files to be immutable. Most databases will not be immutable.
3) An encryption-based protection mechanism will be provided at a low level. The actual encryption technique used will be a function of the hardware and microcode support available; the initial encryption technique may be very weak, but eventually the DES should be supported.
4) We shall design a basic file system (BFS) that manages disks. The BFS will be instantiated (at least) once on each workstation, to give local file storage, and (at least) once on each server machine. The BFS will not provide string file names, file locating services, file replication, or coordination of multi-machine transactions. The BFS will perform low-level authentication to ensure that only "trusted" software calls it. (A sketch of the BFS/UFS split appears after this list.)
5) A file server will provide services by exporting the BFS interface for remote access via RPC.
6) The universal file system (UFS) provided to applications running on a workstation is implemented as a layer on top of some number of instances of BFS, either remote or local. The UFS provides string file names, file locating services, and coordination of multi-machine transactions. It might provide replication transparency; this is mainly an issue for mutable files, because of the consistency problem. It manages the local BFS as a cache of files stored on file servers. A workstation has the option of acting as a file server; a file server has the option of including the UFS layer (in order to provide it to other programs that run on the server and export services to the network).
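To make the BFS/UFS split in points 4 through 6 concrete, here is a minimal sketch in present-day Python. It is illustrative only: every name in it (BFS, UFS, write_page, and so on) is invented for this memo, the "disk" is an in-memory dictionary, and the remote case is reduced to a table of stubs standing in for RPC exports of the BFS interface (point 5).

    import uuid

    class BFS:
        """Basic file system: manages one disk; no string file names."""
        def __init__(self):
            self.files = {}                 # opaque file ID -> list of pages

        def create(self):
            fid = uuid.uuid4()              # an ID, deliberately not a name
            self.files[fid] = []
            return fid

        def write_page(self, fid, page_no, data):
            pages = self.files[fid]
            pages.extend([None] * (page_no + 1 - len(pages)))
            pages[page_no] = data

        def read_page(self, fid, page_no):
            return self.files[fid][page_no]

    class UFS:
        """Universal file system: string names, locating, multi-BFS layer."""
        def __init__(self, local_bfs, remote_bfss):
            self.local = local_bfs          # doubles as a cache of server files
            self.remotes = remote_bfss      # server name -> BFS stub (via RPC)
            self.directory = {}             # string name -> (server, file ID)

        def create(self, name, server=None):
            bfs = self.remotes[server] if server is not None else self.local
            fid = bfs.create()
            self.directory[name] = (server, fid)
            return fid

        def locate(self, name):
            return self.directory[name]     # the file locating service

Note that nothing in BFS mentions names, replication, or coordination; a workstation that cannot reach a server still has a working local BFS (point 1), and a server that wants to run network services itself simply instantiates the UFS layer as well.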
The following issues were not discussed at the meeting, but perhaps should be discussed now.
a) Relationship between BFS, Pilot, and the Cedar Kernel. Clearly BFS needs virtual memory for its volatile data structures and needs to run and swap code stored in files. This suggests that BFS lives entirely on top of Pilot. This seems to be an expensive approach, given that the BFS will not need mapped I/O (see c below). BFS may also wish to redesign some parts of the Pilot file system, e.g. the volume file map structure. The BFS code will presumably not actually be stored in Pilot files, but in Cedar Kernel files. Specification of the Cedar Kernel interface to local and remote files might clarify the picture somewhat.
b) Use of collectable storage by BFS. We would like to do this, but the collector does file I/O in accessing symbols. I think this will reduce to the code swapping issue above, if and when some planned compiler and binder changes are made.
c) BFS will not find mapped I/O especially advantageous for accessing recoverable files and their logs, since BFS requires a certain degree of control over the sequencing of writes. For instance, in a recovery scheme based on redo logs, no updates can be made to a file until phase two of commit. In writing a sequence of log pages, a page containing a commit record must be written after all pages containing previous log records, and a process must be notified as soon as the commit record has been written.
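To illustrate the ordering constraint in (c), here is a small sketch in Python (the class and its methods are invented for this memo; a real BFS would drive the disk directly rather than go through a file). The point is that every earlier log page must be durably on disk before the page holding the commit record, and the waiting process is notified the moment the commit record is durable.

    import os, threading

    class RedoLog:
        def __init__(self, path):
            self.f = open(path, "ab")

        def force(self, page):
            """Write one log page and wait until it is truly on the disk."""
            self.f.write(page)
            self.f.flush()
            os.fsync(self.f.fileno())      # nothing may be reordered past this

        def commit(self, log_pages, commit_record, committed):
            for page in log_pages:         # all previous log records first...
                self.force(page)
            self.force(commit_record)      # ...then the commit record
            committed.set()                # notify as soon as it is durable

    committed = threading.Event()
    log = RedoLog("bfs.log")
    log.commit([b"update A", b"update B"], b"COMMIT t1", committed)
    committed.wait()    # phase two: only now may the file itself be updated

Mapped I/O gives no control over when, or in what order, dirty pages reach the disk, which is exactly the control the force calls above exercise.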
d) The design allows for multiple local BFSs per machine. This leads to a number of issues. First, BFSs (actually their logs) need unique names, so that the transaction coordinator can name, and communicate with, its participants. Juniper identified BFS log (intentions) ID = pack ID, and kept the log on the same volume as the files being logged; many systems use one BFS log per machine (hence BFS ID = machine ID), since most crashes bring down an entire machine. We need to define the name space for BFSs, and the mapping from BFS ID to machine ID; one possible shape is sketched below.
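A sketch of that shape in Python (the two-part ID and its fields are invented; this follows the one-log-per-machine style rather than Juniper's BFS ID = pack ID):

    from collections import namedtuple

    # A BFS (log) ID that embeds the machine ID, so the coordinator can
    # derive whom to talk to from the ID alone. Juniper's choice, BFS ID =
    # pack ID, instead survives moving a pack to another machine, but then
    # a separate ID-to-machine map must be maintained.
    BFSId = namedtuple("BFSId", ["machine", "log"])

    def machine_of(bfs_id):
        """The mapping from BFS ID to machine ID, trivial in this scheme."""
        return bfs_id.machine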
Second, there is the issue of where to "stand" during recovery. In a design where the server BFS runs on a machine containing a standard Pilot file system, recovery has the use of this file system from the start of recovery (if the Pilot system is damaged we must scavenge it, but given the way it is used it is unlikely to be damaged). In a server containing a BFS for swapping and code, and a second BFS for files, we need to recover the first BFS in order to run the recovery program. In particular, recovery of non-BFS actions (e.g. database actions that use specialized locking and updating) requires the BFS to call non-BFS procedures that have been registered with BFS. So recovery of at least one BFS per machine should be simple enough not to require a BFS for code and swapping.
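The registration mentioned above might look like the following sketch (Python; the record format and all names are invented). The essential property is that the registered procedures, and the recovery manager itself, must not live in files managed by a BFS that is still being recovered.

    class RecoveryManager:
        def __init__(self):
            self.pages = {}        # the BFS's own page store (stand-in)
            self.handlers = {}     # action type -> registered recovery proc

        def register(self, action_type, proc):
            # Non-BFS clients (e.g. a database with specialized locking
            # and updating) register before recovery runs.
            self.handlers[action_type] = proc

        def recover(self, log):
            for rec in log:
                if rec["type"] == "bfs-write":
                    self.pages[rec["page"]] = rec["data"]   # ordinary redo
                else:
                    self.handlers[rec["type"]](rec)         # call out

For example, a database would call register("db-action", its recoverer) during initialization, before the BFS replays its log.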
e) (The following is just a list of low-level issues to keep in mind, it is not specific to the level of BFS design discussed above.) Juniper has the desirable property that a physical volume is self-contained, and hence can be moved from a broken machine to a working machine. In a log-based system there is a performance benefit in making the log volume separate from other volumes, since log I/O is sequential. Our system must be designed to operate, perhaps in a degraded mode, with one or more disk drives down. This will involve specialized recovery procedures to, for instance, process the log multiple times so that a number of volumes can be recovered in sequence on the same drive.
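A sketch of the multiple-pass idea (Python; the drive object and record format are invented): with a single working drive, each volume is mounted on it in turn and the whole log is scanned once per volume, applying only the records for the volume currently mounted.

    def recover_in_sequence(log, volumes, drive):
        """Recover several volumes on one working drive, one at a time."""
        for vol in volumes:
            pages = drive.mount(vol)       # put this pack on the good drive
            for rec in log:                # one full pass over the log
                if rec["volume"] == vol:
                    pages[rec["page"]] = rec["data"]
            drive.unmount(vol)

The cost is one log scan per volume, but it lets a server limp along, and recover, with most of its drives down.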
Another issue to keep in mind is on-the-fly backup, i.e. the ability to capture a transaction-consistent state while the system is running. On the systems that I am aware of, this involves both undo and redo of the "fuzzy dump" taken while the system is running. We should have a strategy that does not involve undo, or else we should allow both undo and redo everywhere. A plausible strategy is to use a background process to copy a BFS (probably to tape). When the copying is finished, it is necessary to redo all writes that went to already-copied pages. This can be done by sorting the log and merging in the changes.
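A sketch of the redo-only strategy (Python; all names invented, and the log is simplified to a list of page writes). The copier remembers the log position at which it copied each page; afterwards it redoes, in log order, every write to a page made at or after that position. Redoing a write that the copy already reflects is harmless, so no undo is needed.

    def fuzzy_dump(pages, read_page, log_position, log):
        """Copy a running BFS, then patch the copy using only redo."""
        backup, copied_at = {}, {}
        for page in pages:
            copied_at[page] = log_position()   # the log grows while we copy
            backup[page] = read_page(page)

        # Sort the missed writes by (page, log position) and merge them in;
        # the latest write to each page wins.
        missed = [(rec["page"], pos, rec["data"])
                  for pos, rec in enumerate(log)
                  if rec["page"] in copied_at and pos >= copied_at[rec["page"]]]
        for page, pos, data in sorted(missed):
            backup[page] = data
        return backup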
An aid to this process (and others) that is used in many systems is to attach a "log sequence number" to each page on the disk, giving the ID of the log record that represents the most recent update to the page. This is logically part of the page data (not the header or label), which would have to be expanded so that pages would still contain 256 words of true data. I’m not sure what to do with log sequence numbers when a pack is moved between BFSs.
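The idea in miniature (Python; the field names are invented): each page carries the log sequence number of the last update applied to it, so redo can decide page by page whether a log record is already reflected on the disk.

    class Page:
        def __init__(self):
            self.lsn = 0         # ID of the log record last applied here
            self.data = b""      # the 256 words of true data

    def redo(page, record):
        """Apply a log record only if this page has not already seen it."""
        if record["lsn"] > page.lsn:
            page.data = record["data"]
            page.lsn = record["lsn"]
        # else: the update is already on the disk; skip it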