Pros and cons of multi-pack Alpine volumes
CSL Notebook Entry
To: Alpine designers        Date: August 7, 1981
From: Mark Brown        Location: PARC/CSL
Subject: Pros and cons of multi-pack Alpine volumes        File: [Ivy]<Alpine>Doc>MultiPackVolumes.bravo
XEROX
Attributes: informal, technical, Database, Distributed computing, Filing
Abstract: This memo discusses the strengths and weaknesses of multi-pack Alpine volumes. We conclude that forcing an Alpine volume to be contained in a single disk pack is a good idea.
Multi-pack volumes
The design of Pilot includes a very general notion of logical (disk) volume. A file is contained in a logical volume (cannot be spread across several logical volumes.) A logical volume may span multiple disk packs (Pilot physical volumes.) The Rubicon release supports logical volumes that are contained in a single physical volume, but the Pilot implementation seems to be sufficiently general that making multi-pack volumes work will not be a big deal. (I am not sure on this point, but I believe that one of the original motivations for multi-pack volumes in Pilot was to support floppy disks as Pilot volumes; floppy disk Pilot volumes are now being phased out of Pilot.)
The question of whether Alpine volumes should span multiple disk packs is still open, and there are arguments on both sides. As usual, implementing an abstraction (such as arbitrarily large disk volumes) at a low level in a system simplifies the higher levels of the system in some ways. The usual danger is that the abstraction hides too much from the higher levels, and may make it difficult for the higher levels to meet their goals (of efficiency, reliability, or whatever.)
Advantages of multi-pack volumes
A file can be larger than a pack. Multi-pack volumes give more flexibility in allocating pages to files. For files that are small relative to a pack, the added flexibility is probably negligible, however. When a file becomes a significant fraction of a disk pack in size, the flexibility may be of more use (consider trying to fit three files onto two packs, where each file is about 2/3 the size of a pack.)
Though the abstraction of huge files is nice, few clients will take advantage of it. As we shall discuss below, the clients that do may wish to control the allocation of their file pages to individual packs.
The larger a volume is, the fewer volumes there will be. If there is only one volume in the world, then it is easy to locate a file: you just look on the volume. As the number of volumes in the world grows, the need grows for volume-location facilities. Clients must store not only the unique ID of a file, but a unique ID of the volume it resides on. This is the way of the future; large volumes only postpone the need for implementing this particular aspect of the future.
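To make the client's burden concrete, here is a minimal sketch (rendered in Python purely for illustration; the record and procedure names are made up and are not part of any Alpine interface) of what it means for a client to name a file in a multi-volume world:

    # Hypothetical sketch: naming and locating a file when there are many volumes.
    # FileRef, volume_directory, and locate_file are illustrative names only.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FileRef:
        volume_id: str   # unique ID of the volume holding the file
        file_id: str     # unique ID of the file within that volume

    def locate_file(ref: FileRef, volume_directory: dict) -> tuple:
        # With a single volume this lookup is trivial; with many volumes the
        # client first needs a volume-location facility that maps
        # volume_id -> server before it can reach the file.
        server = volume_directory[ref.volume_id]
        return server, ref.file_id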
A multi-pack volume makes it natural for there to be a single database of owner space quotas and allocations for the entire set of packs that make up a volume. If all packs are equivalent, then users should not be bothered with separate quotas on a per-pack basis.
We can delay exercising Pilot’s code for multiple logical volumes. We know that some aspects of the Pilot implementation give poor performance when many logical volumes are accessed concurrently. By having fewer volumes we can avoid stressing these areas of Pilot until they improve. (But note that even in a minimum server there will be two active volumes: the system volume, used for swapping, and the Alpine volume.)
Advantages of single-pack volumes
Single-pack volumes are available now. Pilot does not yet implement multi-pack logical volumes. For multi-pack logical volumes to become a practical proposition, Pilot would not only have to implement them, but would also have to support expanding an existing volume by adding packs.
A drive may fail, or a single pack may be destroyed, without taking down the entire system for a long period of time. Pilot will not access a logical volume if one of its packs is offline.
Clients can take advantage of the correspondence between volumes and disk arms. If the file system is one huge Pilot volume, we have no way of influencing Pilot’s placement of file pages on packs. Demanding clients want their active files to be equally distributed across packs to reduce arm contention. Clients with really large random-access files generally perform a hashing step to select the disk pack containing a piece of information. Really large sequential files aren’t a practical proposition.
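As a concrete sketch of the hashing step mentioned above (the function name and the choice of hash are assumptions, and Python is used only for illustration):

    # Illustrative only: a client-side hashing step that spreads a large
    # random-access file across several single-pack volumes to reduce arm contention.
    import zlib

    def pack_for_key(key: str, packs: list) -> str:
        # crc32 is used because it is stable from run to run, unlike Python's
        # built-in hash() for strings.
        return packs[zlib.crc32(key.encode()) % len(packs)]

    packs = ["pack-A", "pack-B", "pack-C"]
    print(pack_for_key("customer-4711", packs))   # always picks the same pack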
Separate packs are less likely to require a huge scavenge at restart. If a crash occurs when Pilot is updating the volume file map or volume allocation map, then the first step of crash recovery is to rebuild these maps from scratch, using the Pilot scavenger. (Longer term we can eliminate this use of the scavenger by logging VAM and VFM updates, but this is a change to Pilot and the payoff is uncertain.) This rebuilding takes time proportional to the number of pages in the logical volume. Bounding the size of a logical volume by the size of a pack reduces the amount of work that is likely to be required at restart in a multi-pack system, since it is unlikely that more than one volume is being updated at the instant of a crash.
A pack may be a useful granularity for taking image dumps. An image dump is a copy of a file system for the purpose of backup. (The second component of the backup system is the archive log, which is a compressed version of the online disk log, generally written to tape.) The design of an image dump facility must consider (1) the operational difficulty of taking a dump (does Ron Weaver have to attend to the dump for minutes, or for hours? is the system offline during a dump, and if so for how long?), (2) the cost of the dump, in extra hardware and media and in computation during system operation, (3) the complexity of implementing the dump, (4) the complexity of using the dump (what procedure is required to find a particular page? all pages on a disk pack?).
One choice to be made in taking an image dump is the granularity at which the dump is guaranteed to be transaction-consistent. At one extreme, we guarantee that the whole file system is transaction-consistent in the dump. This requires that the entire system be quiesced long enough for the whole file system to be dumped. For a large system an image dump takes hours. At the other extreme, we guarantee that individual pages are transaction-consistent. This allows the system to process transactions concurrently with the dump. Intermediate positions are to guarantee that individual files or volumes are transaction-consistent.
If a logical volume spans multiple packs, then it may be difficult to organize the image dump of the logical volume in a way that makes it possible to restore from the failure of a single pack. Consider that in general, the pages of a file may be sprinkled across all packs of the logical volume containing the file. Also consider that between the time an image dump is taken and the time it is used, a file may shrink and then grow, causing its pages to move from one pack to another. This argues that the image dump of a multi-pack logical volume cannot be accessed by pack; it must be accessed by logical entities such as file and page within file. Hence the dump must be organized more-or-less like a file system, including a volume file map. But this does not help in determining the identity of the pages lost in a pack failure -- this will require a specialized scavenging of the remaining packs, combined with a complex analysis of the image dump.
The "operational difficulty" criterion argues in favor of dumping to disk. Such a dump can be faster than a dump to tape (the disk transfer rate is 6 times that of a tape, and tape transfers must go over the ether to the Alto tape server.) A 2400 foot, 1600 byte per inch tape holds about 40 m bytes, so one T-300 dump requires 8 tapes -- a lot of tape handling. It would be tempting to attempt to optimize the dump by dumping only those pages or files that have changed since the last dump, but this greatly increases the complexity of accessing the dump.
One possible strategy for taking image dumps is to quiesce a single disk pack at a time and dump it at full disk speed to a backup disk volume. Not all packs would have to be dumped on the same day. The time to read a 300 megabyte pack at full disk speed is 5 minutes; it should be possible to back up a pack in double this time, since the processing time per page is minimal. In a system with two or more drives, an extra drive is not required for taking dumps (though a backup drive is probably advisable in any case.) The copying program would have to understand Pilot volume format, including the bad spot table. It could invoke the scavenger on the copy to build the volume file map and volume allocation map. With a disk-based image dump and one pack per logical volume, locating a particular file page or the entire volume is very easy.
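A back-of-envelope check of the figures in the last two paragraphs (the constants come from the text above; the script is just arithmetic, written in Python for convenience):

    # Rough check of the dump arithmetic quoted above.
    import math

    tape_inches = 2400 * 12                      # 2400-foot reel
    raw_tape_bytes = tape_inches * 1600          # 1600 bytes/inch -> ~46 megabytes
    usable_tape_megabytes = 40                   # ~40 megabytes after inter-record gaps

    t300_megabytes = 300
    tapes_per_pack = math.ceil(t300_megabytes / usable_tape_megabytes)   # -> 8 tapes

    # Disk-to-disk dump: reading a 300 megabyte pack at full speed takes ~5
    # minutes, so with minimal per-page processing a pack backs up in ~10 minutes.
    dump_minutes = 2 * 5

    print(raw_tape_bytes, tapes_per_pack, dump_minutes)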
A proposal
Alpine restricts a logical volume to be contained in a single disk pack.
A FileStore interface instance (a handle, if the RPC folks get their way) is identified with a server, not a volume. Some interface procedures then require a volume parameter (but note that the volume is implicit in the OpenFileHandle, so that operations on open files do not require a volume parameter.)
We introduce the notion of a volume group, a group of volumes (on a single server) that are considered equivalent for the purposes of storage allocation. A volume group has a single associated owner database for storing aggregate disk quotas and allocations for the packs in the group. FileStore contains two versions of CreateFile: CreateFileOnVolume and CreateFileOnVolumeGroup. For CreateFileOnVolumeGroup, the server chooses the volume to create the file on, and returns the identity of the volume to the client. Note that the CreateFile calls take an initialPageCount, which is potentially useful in deciding where to allocate the file.
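A sketch of what the two CreateFile variants might look like (the procedure names and the initialPageCount argument come from the text above; everything else about the signatures, and the rendering in Python, is an assumption made for illustration):

    # Illustrative sketch of the proposed FileStore interface; not actual Alpine code.
    from typing import Protocol, Tuple

    class FileStore(Protocol):
        # An interface instance is identified with a server, so procedures that
        # are not given an open-file handle take an explicit volume (or volume
        # group) parameter.
        def CreateFileOnVolume(self, volume: str, initialPageCount: int) -> str:
            """Create a file on a specific volume; returns the new file's ID."""
            ...

        def CreateFileOnVolumeGroup(self, volumeGroup: str,
                                    initialPageCount: int) -> Tuple[str, str]:
            """Let the server choose a volume in the group; returns the new
            file's ID and the identity of the chosen volume."""
            ...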
The implementation of volume groups will be to store the owner database on one of the volumes of the group. If this volume is unavailable, then the server cannot create (or change the length of) files. It can still read and write files. If another volume of the volume group is unavailable, then the server operates normally except that no operations are allowed on files contained in the missing volume. This means, for instance, that a client cannot delete a file on a down volume and re-use the space quota by creating a new file somewhere else.
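The availability rules in the previous paragraph reduce to two simple checks. The sketch below only illustrates those rules; the data layout and names are invented, and the Python rendering is for concreteness:

    # Sketch of the volume-group availability rules described above.
    from dataclasses import dataclass, field

    @dataclass
    class VolumeGroup:
        owner_db_volume: str                       # volume holding the owner database
        online: set = field(default_factory=set)   # volumes currently available

        def can_create_or_resize(self) -> bool:
            # Creating a file or changing its length updates the owner database,
            # so the volume holding that database must be available.
            return self.owner_db_volume in self.online

        def can_operate_on(self, file_volume: str) -> bool:
            # Other operations need only the file's own volume; reads and writes
            # of existing files still work when the owner-database volume is down.
            return file_volume in self.online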
We support dismountable volume groups. We expect most of these to consist of a single volume.