CSL Notebook Entry

To: CSL
Date: September 17, 1981
From: Alpine designers (M. Brown, K. Kolling, E. Taft)
Location: PARC/CSL
Subject: Alpine file server overview
File: [Ivy]Doc>AlpineOverview.bravo

XEROX

Attributes: informal, technical, Alpine, Database, Distributed computing, Filing, Remote procedure calls

Abstract: The Alpine project will implement a file server. This document describes the facilities that Alpine will provide and gives some information on how Alpine will be implemented.

1. Introduction

The Alpine project will implement a file server. This document describes the facilities that Alpine will provide and gives some information on how Alpine will be implemented.

This memo takes the CSL context for granted. The reader should know what an IFS is and what a Juniper transaction is. To learn about the latter, read the CSL report by Israel, Mitchell, and Sturgis, "Separating Data From Function in a Distributed File System," CSL-78-5, September 1978.

2. Why a new file server?

CSL's style of file server usage is changing. The type of use that is on the decline is exemplified by the Alto FTP program and Tajo's FileTool: direct user interaction with a file server to accomplish file transfers between a workstation and a server. An IFS is well suited to this type of use.

We can see two styles of file access that will likely predominate in the future: access through a universal file system and access through database systems.

A universal file system would support a uniform address space for files, in which the unique identifier of a file is independent of where the file is stored. This means (or at least makes it likely) that any permanent file will be stored on a file server. To implement this scheme with acceptable performance, the universal file system will manage the local disk of a workstation as a cache of recently-used files.
To increase a file's availability, the system may replicate it in several servers. This is especially attractive in the case of an immutable file (a file that is written once and never updated thereafter). The directory system (that implements the mapping from file string name to file unique identifier) of a universal file system should be a distributed database, since it has no logical connection to any particular file server or workstation.

An IFS is not well suited to support a universal file system. One reason is that FTP connections are expensive (an IFS supports at most nine of them). The file transfer rate from IFS could be much higher without saturating the disk or the Ethernet (the Alto is slow and its memory is small). IFS also does not support transactions, which are useful for consistent replication of files.

Database systems support shared, structured databases. A database system may run on a workstation or may run as a server; the server architecture has several advantages. One advantage is that a server can protect records or even fields of records in a database, rather than protecting information at the granularity of files. Another advantage is that a server can perform concurrency control on the basis of logical database units, rather than files or pages, thereby increasing the allowable degree of concurrency. In either case, the database system performs both sequential access to large sections of a file and random access to pages (or small blocks of pages). A database system tends to allocate files in large units (disk extents, which are groups of adjacent cylinders), and wants logically adjacent pages to be physically adjacent most of the time.

An IFS is not well suited to support database systems; this was never a goal of IFS.
Since IFS is based on the Alto operating system, its facilities for random file access (through Leaf) are an add-on. IFS does not support transactions, so each database system using IFS must implement its own facilities for concurrency control and recovery.

In the future, many CSL application programs will deal primarily with database systems instead of file systems. But it will take time to develop database systems in CSL that are sufficiently fast and reliable for this heavy use. Even in a database-oriented CSL, files are a convenient means of communication with the outside world. Hence a new file server should support the uses of files that are common today (source and object code, documents, images) as well as the needs of future database systems.

There are several reasons why it makes sense to build a new file server rather than converting Juniper to run in the Cedar/Pilot world. Converting Juniper would not be easy; a new system can be structured from the start to take advantage of virtual memory and a standard storage allocator. Juniper was not designed to support fast sequential transfers, which we now feel are important even in a database environment. Now that CSL has some experience with databases, we can design a file server that supports databases more effectively than Juniper does, while eliminating some Juniper features that database systems are not likely to need.

3. Alpine's scope

Alpine's scope is determined by the projected needs of CSL, as described above, and the interests and energies of the implementors. Here are our conclusions.

Included

Transactions. Alpine must implement atomic transactions, to support file replication and database systems. Transactions must be able to span multiple machines.

Access control. Alpine must implement some form of control over access to files.
These access control facilities should be simple; there is little motivation for elaborate access control at the server level if the universal file system and database systems will be implementing their own access control policies.

Disk space accounting. Alpine must implement some form of control over the allocation of disk space to various users and projects. In the long term we expect that the main client of the space accounting facilities will be higher-level facilities rather than individual users, since users won't want to be bothered with knowing where their files are actually stored.

Logical locking and recovery for database systems. Alpine must support the many database systems that will be written over the next few years. The simpler systems will rely on the locking and recovery facilities that are supplied implicitly by Alpine transactions, but other systems will want to perform their own concurrency control to improve performance.

Capacity, speed, reliability, availability. Alpine must meet performance goals of various kinds. A server should be able to maintain many (say 40 or 50) connections at once; the cost of an inactive connection must be low. The speed of both sequential and random (single page or short block of pages) transfers should be good, making effective use of a large processor memory for buffering; the special case of whole-file transfers will continue to be important for some time to come.

Alpine must support server configurations that survive any single failure of the disk storage medium without any long-term loss of information from committed transactions.
This degree of redundancy should not be a requirement for all servers, however.

Since storage medium failure is rare, recovery from it can be relatively slow (a few hours); recovery from crashes caused by software bugs should be much faster (5-10 minutes). It should be possible to operate a server in a degraded mode when portions of it fail (say one volume or drive); it should be possible to move a logical disk volume from one server to another, and to store a volume offline.

Workstation file system. The Alpine project is constructing a file server, but there is no intrinsic reason why the system that is produced cannot run on a workstation. (The workstation must still contain a standard Pilot volume for code swapping, virtual memory, etc.) Allowing a workstation to contain an Alpine file system would support local databases. The workstation might also attempt to provide Alpine file services to the network, allowing a small system configuration that does not contain a dedicated file server. A workstation-based file system may be limited in some ways, for instance by the lack of independent disk volumes to improve performance, reliability, and availability, and the lack of an operations staff to perform backup tasks. When the requirements of a shared file server conflict with the requirements of a workstation file system, Alpine's design favors the shared file server.

Deferred

Archiving. A system for archiving files from the primary disk storage of a file server and for bringing archived files back again seems very valuable, and we should plan for it. But implementing an archive system is a major task and IFS has been successful without doing it, so Alpine should defer it until higher-priority goals have been reached.
We do feel that the Alpine file organization (low-level naming by unique IDs, with high-level naming by a separate location-independent directory system) eliminates many of the problems that would make it difficult to implement a satisfactory archive system for IFS.

The requirements of archiving should have an impact on the design of our database systems. If optical disks become a reality it may become possible to hold the entire "archive" online, which will change the nature of archive systems. If a means of file replication is provided on top of Alpine, this might also serve as the archiving mechanism.

Excluded

Location-transparent file access. Alpine will not implement services such as volume or file location. A client of the Alpine interface will be aware at all times that he is communicating with a single server. We expect a more civilized interface to be built on top of this.

Directory system. Alpine will not implement a directory system. As we argued previously, directories should not be localized to particular servers but should instead be replicated and distributed across servers and workstations.

FTP access. The FTP protocol requires a directory on the file server.

It is possible that a single-server directory system and an FTP server might need to be built as part of the transition to Alpine. This will depend upon how fast other projects, particularly a universal file system, are able to progress. Some clients, such as the Cedar database management system, will be able to use Alpine immediately, without a directory system or FTP.

Continuous availability. Our environment does not demand continuous availability of a file server; we can tolerate scheduled downtime during off hours, and small amounts of downtime due to crashes during working hours.
If better availability has significant cost, we don't need it.

Guaranteed real-time response. Alpine's emphasis is on reliability and good average-case performance, and not on the guaranteed real-time response that an audio file server must provide.

4. Alpine's implementation strategies

Alpine uses Pilot to handle its disks. This means that a disk volume used by Alpine has Pilot disk format, as well as a higher-level structure defined by Alpine; Pilot disk utilities, such as the formatter and scavenger, work on Alpine volumes. Alpine does not use Pilot's implementation of transactions.

Alpine implements transactions using an online disk log. Most operations, such as writing a file page, are recorded in the log but not executed until after a transaction commits. Other operations, such as creating a file, are recorded in the log and executed immediately, and later undone if the transaction aborts. By writing the log to two independent log volumes, we can ensure that the effects of a transaction survive any single failure of storage medium while the transaction is being carried out; the log is then saved to tape for use in backup. The time required to restore lost information from log tapes is bounded by periodically copying entire volumes to backup packs; all of a server's volumes need not be copied at one time.

Using the log and deferring updates until commit is not free (though having lots of primary memory makes it more so), and we expect that some clients will not require it. Alpine offers a mode of file access in which updates are not protected by the log. A possible client of this mode is the file replication machinery, which might choose to log updates to the primary copy of a file only, and use this copy to fix the others if they fail. In case the server crashes while such an unprotected update is in progress, the file being updated is marked bad.
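
The logged behavior described above (page writes deferred until commit, file creation executed at once and undone on abort) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Alpine's implementation: the names ToyServer, create_file, write_page, commit, and abort are hypothetical, the log is an in-memory list rather than a disk volume, and locking, crash recovery, and the duplicate log volumes are all omitted.

```python
class ToyServer:
    """Hypothetical toy model: deferred page writes, eager file creation."""

    def __init__(self):
        self.files = {}  # file id -> {page number: page contents}
        self.log = []    # append-only list of log records (tuples)

    def create_file(self, txn, file_id):
        # Creation is logged and executed immediately; abort undoes it.
        self.log.append((txn, "create", file_id))
        self.files[file_id] = {}

    def write_page(self, txn, file_id, page, data):
        # A page write is only logged; the file is untouched until commit.
        self.log.append((txn, "write", file_id, page, data))

    def commit(self, txn):
        self.log.append((txn, "commit"))
        # Carry out this transaction's deferred writes, in log order.
        for rec in self.log:
            if rec[0] == txn and rec[1] == "write":
                _, _, file_id, page, data = rec
                self.files[file_id][page] = data

    def abort(self, txn):
        self.log.append((txn, "abort"))
        # Undo the operations that were executed eagerly, newest first.
        for rec in reversed(self.log):
            if rec[0] == txn and rec[1] == "create":
                self.files.pop(rec[2], None)


server = ToyServer()
server.create_file("t1", "f")
server.commit("t1")
server.write_page("t2", "f", 0, b"hello")  # logged, not yet in the file
server.commit("t2")                        # now applied to the file
```

In this sketch an aborted transaction's page writes simply never reach the file, so only the eagerly executed operations need an undo step.
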
Alpine also optimizes a case (updating a file past the highest page written in any earlier transaction) that includes FTP-style transfers. This optimization makes logging and file update occur in parallel, instead of deferring the file update until after commit.

Alpine clients are identified by RName and authenticated through Grapevine. Alpine uses Grapevine to help implement file access control.

The core of Alpine is a file system that exports a simple interface, usable by clients on its own machine. One such client is a module generated by the Cedar remote procedure call (RPC) facility. To access a remote Alpine server, a client on a remote machine must load a stub implementation of the Alpine file system interface; the RPC facility also generates this module. RPC will support authenticated, encrypted conversations over the Ethernet, with especially good performance for calls on the local net. The Cedar RPC facility was designed to support Alpine as one of its first clients.

5. Alpine objects

In this section we shall informally describe the primary objects that are visible to a client of Alpine. These are the persistent objects server, log, volume, file, owner, volume group, and transaction, and the volatile object open file. In places where it clarifies things we shall mention some aspects of the implementation.

A client of an Alpine server, meaning a program that calls through Alpine's interface, is identified by the RName of an individual.

An Alpine server consists of a single log L and a set V of (logical disk) volumes. All recovery information that is recorded for volumes in V is written to L. A server is identified by the unique ID of its log L.

We generally expect a machine running Alpine to contain a single Alpine server.
Two servers on the same machine communicate as if they were remote from each other (e.g. they follow the distributed two-phase commit protocol if a transaction whose coordinator is on one server makes updates on both servers). One can imagine situations in which one log is a bottleneck and configuring multiple servers on one machine is the best way to solve the problem, but these situations are likely to be rare in our environment.

A log is a Pilot logical volume containing a single large file that is managed as a ring buffer of pages. The unique ID of a log is the unique ID of the Pilot logical volume containing it. Pages of the log are always written in ascending order according to page number, until the log fills and writing starts again from the beginning. Information in a log is used for online recovery from server crashes and for undoing the effects of aborted transactions. In server configurations that include backup facilities, each log page is recorded on tape before the disk copy is overwritten; this tape is called the archive log. (The archive log actually includes only the log records that are relevant to media recovery.)

A volume is a Pilot logical volume with some added higher-level structure. The unique ID of a volume is the unique ID of the Pilot logical volume. We expect most volumes to be entire disk packs.

For an Alpine server to tolerate all possible single failures of storage media, it must write two independent copies of the log (consider a failure of the log between the time of commit and the time that the committed transaction's writes are propagated to a file). These log copies must be stored on different packs, but each copy may be stored on the same pack as some logical volume. Logging is most efficient if two drives are dedicated to the log copies, since in this case the speed of logging is not limited by disk arm motion. (Continuous logging at high rates actually requires three drives, so that one drive can be dumping the log to tape as the other two accumulate it.
We do not plan to support this feature.)

A volume can be moved from one server to another. This operation requires a "volume quiesce", which means aborting all transactions that request access to the volume, performing all committed updates to the volume, and taking the volume offline. (A quiesced volume may be stored offline.) In principle, a volume quiesce is not required if a copy of the log is stored on the same pack as the volume, but it is not clear that the option of moving a volume without a quiesce is worth supporting.

A file is a set of property values and a sequence of 512-byte pages, indexed from zero. A file is implemented from a Pilot file (some number of "leader pages", invisible to the client, may be used to store property values) and a file's unique ID is the Pilot file's ID. A file is contained in a volume.

The set of file properties is fixed (not expandable). The set includes the Pilot file attributes type and immutable, and other properties such as a string name, a byte length, a create time, and so on.

One of a file's properties is its owner. An owner is an entity identified by an RName, such as "McCreight.pa" or "CedarImplementors^.pa". The disk space occupied by a file is charged against its owner's space quota.

Two other file properties are its read list and its modify list. Each of these is a list of RNames, such as (CSL^.pa Wick.pa). For implementation simplicity these lists are limited to contain at most two RNames, except that the owner of a file, and the special RName "*" meaning "world", do not count against the two-RName limit. An Alpine client may read a file if he is contained in (the closure of) one element of its read list. An Alpine client may read or modify a file if he is contained in (the closure of) one element of its modify list.

A server often contains several volumes that are equivalent from the point of view of its users. In particular, a user may not care which volume he creates a file on.
For this reason, each server's volumes are partitioned into one or more volume groups. A group contains one or more volumes, and every volume belongs to exactly one group. Disk space quotas for owners are maintained for volume groups rather than individual volumes. A create file call may specify a volume group, instead of a specific volume; the server decides which volume to create the file on, and informs the client. We expect an entire volume group to go online or offline together. In some cases it will be possible to operate with some volumes of a group offline.

A transaction is identified by the ID of a server (log) and a unique sequence number on that server. The server named in the transaction ID is the coordinator of the transaction. In principle, this server can always respond to queries of the form "did transaction t commit or abort?" In practice, the server will respond "don't know" if the transaction is very old (no record of it is online). We are interested in the possibility of having the coordinator store replicated commit records on other servers, but do not plan to implement this feature right away.

An open file is essentially an association between a file, a client, and a transaction. All calls on an Alpine server that access a file require that the file be opened. Access control and file-level locking are performed at file open time.

Important omissions

We have not described the Alpine objects that an ambitious database system might use in addition to files: locks and log records. This is because to do so would greatly enlarge this document. [Ivy]Doc>LockConcepts*.bravo (* = 0, 1, ...) describes locks, and [Ivy]Doc>DBMSRecovery.bravo describes logging (but less definitively).

6. Where to learn more about Alpine

The memos [Ivy]Doc>FileStore*.bravo (* = 0, 1, ...) describe successive iterations of the public interfaces to Alpine. Read the most recent version of this document to learn the specifics of the interfaces as they exist today. The public Cedar interfaces themselves (AlpineFile, AlpineAccess, AlpineTransaction) are stored on [Ivy]Defs>. A description of the interfaces from the implementation's point of view is given in [Ivy]Doc>FileStoreInternal*.bravo.
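
As an illustration of the read list and modify list rule from section 5, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the groups dictionary merely stands in for Grapevine group membership, and the RNames in the example are invented. This memo does not say whether a file's owner has implicit access, so the sketch grants access only through the two lists.

```python
def in_closure(client, name, groups, seen=None):
    """True if name is "*", is the client, or is a group whose closure holds the client."""
    if name == "*" or name == client:
        return True
    seen = set() if seen is None else seen
    if name in seen:  # guard against cyclic group definitions
        return False
    seen.add(name)
    # Names that are not groups have no members, so the search stops there.
    return any(in_closure(client, member, groups, seen)
               for member in groups.get(name, ()))


def may_modify(client, modify_list, groups):
    """Modify access: the client is in the closure of some modify-list element."""
    return any(in_closure(client, name, groups) for name in modify_list)


def may_read(client, read_list, modify_list, groups):
    """Read access: granted through the read list or through the modify list."""
    return (any(in_closure(client, name, groups) for name in read_list)
            or may_modify(client, modify_list, groups))


# Hypothetical membership data, standing in for a Grapevine lookup.
groups = {"Project^.pa": ["Alice.pa", "Sub^.pa"], "Sub^.pa": ["Bob.pa"]}
bob_can_read = may_read("Bob.pa", ["Project^.pa"], [], groups)
bob_can_modify = may_modify("Bob.pa", [], groups)
```

Because modify access implies read access, may_read consults both lists, while may_modify consults only the modify list.
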