Alpine buffer manager design, version 0
CSL Notebook Entry
To: Alpine designers        Date: March 12, 1981
From: Mark Brown        Location: PARC/CSL
Subject: Alpine buffer manager design, version 0        File: [Ivy]<Alpine>Doc>BufferDesign0.bravo
XEROX
Attributes:informal, technical, Database, Distributed computing, Filing
Abstract:This memo proposes the scope of the Alpine buffer manager, and makes some observations on how it might be implemented using Pilot.
Introduction – responsibility of buffer manager
For the sake of this discussion, we consider the "buffer manager" to be a component of Alpine that lives below the FileStore interface. It does not present a generally-usable interface such as VMDefs. Instead, it provides an interface to logical files on the disk, without transactions. That is, a typical call might be "return a pointer to a buffer containing page p of file f," not "return a pointer to a buffer containing the page at disk address d," or "read page p of file f under transaction t."
Appropriate buffering is a key to good file system performance. Without it, the average time to service a page request is likely to be at least the average rotational latency of the disk, if not the average seek time. With buffering we can do much better than this.
One function of buffering is to keep frequently referenced pages of a file resident. Examples of files within the file system itself that are candidates for this type of buffering are the volume file map (VFM) and volume allocation map. (Pilot buffers only one page of the VFM, and may suffer for it.) This buffering has the biggest payoff when a file is widely shared; a file that is maintained by a single application may be buffered by that application, which can make more intelligent policy decisions than the file system can. (But a sequence of incarnations of the same application has nearly the same effect as simultaneous sharing, if the various instances do not communicate.)
A pool of pages with least-recently used page replacement would keep frequently referenced pages resident, but this is not the best buffering strategy. For many common patterns of access, LRU is exactly the opposite of the correct replacement strategy. A page that has just been read from a stream connection is very unlikely to be read again in the near future; hence it should be recycled immediately. The buffer manager’s replacement policy should prevent a single file’s pages from flooding the buffer pool due to a stream transfer; the amount of buffering required to keep a stream going is not extremely large.
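To make the point above concrete, here is a minimal sketch in present-day Python (purely illustrative; the class and method names are invented for this memo) of a pool that keeps hot pages resident under LRU but recycles pages read as part of a sequential scan immediately, so a stream transfer cannot flood the pool:

```python
from collections import OrderedDict

class BufferPool:
    """Toy page cache: LRU for random reads, immediate recycling
    for pages the client declares to be part of a sequential scan."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lru = OrderedDict()          # page id -> contents, oldest first

    def _evict_if_full(self):
        if len(self.lru) >= self.capacity:
            self.lru.popitem(last=False)  # drop least recently used

    def read(self, page, fetch, sequential=False):
        if page in self.lru:
            self.lru.move_to_end(page)    # hot page stays resident
            return self.lru[page]
        data = fetch(page)                # simulated disk read
        if sequential:
            return data                   # use and toss: never cached
        self._evict_if_full()
        self.lru[page] = data
        return data

pool = BufferPool(capacity=2)
reads = []
fetch = lambda p: (reads.append(p), f"page{p}")[1]
pool.read(1, fetch)                       # hot page, now cached
pool.read(2, fetch)
for p in range(100, 110):                 # long sequential scan...
    pool.read(p, fetch, sequential=True)
pool.read(1, fetch)                       # ...does not flush page 1
```

The second read of page 1 is a hit even though ten scan pages passed through the pool in between.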
A second role of the buffer manager is to perform read-ahead (read pages that have not yet been requested) and write-behind (delay a page write to a more opportune time, when a number of pages can be written.) Using Pilot on a Dolphin, it is necessary to express read-ahead and write-behind in some way that translates into a single disk command to transfer a run of consecutive pages from the disk to a run of consecutive pages of virtual memory. If these runs of pages are not used at the disk face, then a rotation of the disk will be lost between consecutive page transfer commands. The buffer manager must also be willing to perform certain requests without reordering them, for instance the page writes to a log.
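The requirement that a run of consecutive pages become a single disk command suggests coalescing pending requests before issuing them. A hedged sketch (the function name is invented):

```python
def coalesce_runs(pages):
    """Group page numbers into maximal runs of consecutive pages;
    each (firstPage, count) pair would map to one disk command."""
    runs = []
    for p in sorted(pages):
        if runs and p == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)  # extend current run
        else:
            runs.append((p, 1))                        # start a new run
    return runs

runs = coalesce_runs([7, 3, 4, 5, 9, 8])  # two runs: 3..5 and 7..9
```

Six pending page writes collapse into two commands instead of six, saving the lost rotations between per-page transfers.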
The buffer manager should have the ability to determine adaptively the number of buffer pages to assign to a particular file, based on the rate of recent activity on the file and the overall load on the system. The number of real memory pages assigned to the buffer pool might also vary with the level of file system activity (this is mainly useful if some other program coexists with the file system on a machine.)
The buffer manager is in a good position to detect thrashing. It can take some actions on its own to reduce thrashing (for instance, reducing its use of read-ahead and write-behind), but it should also communicate the situation to a higher-level system component that can reduce the degree of concurrency by delaying the creation of new transactions or aborting existing transactions.
The buffer manager should not know about transactions. Mutual exclusion of writers to a particular buffer page is provided by the lock mechanism, not by page buffer monitors. The transaction manager can provide pragmatic suggestions to the buffer manager, like any other client.
The buffer manager’s job is greatly simplified if it deals only with a single size of buffer. The size of these buffers should be a compile-time constant that is some multiple of the disk page size, so that speed may be traded for memory fragmentation by increasing the buffer size. (Unfortunately, the lack of scatter / gather in the disk face and the need to process runs in a single disk face command makes fixed size buffers less attractive.)
The buffer manager cannot be expected to make perfect decisions. It will always be possible to construct perverse worst-case scenarios to make its policies look bad. We should strive for an understandable set of policies that works well on the average, given the expected reference pattern.
Basic assumptions of buffer manager design
Pages that are consecutive in the file are (generally) consecutive on the disk. (It helps if clients, such as the file transfer program, allocate an entire file with one call instead of extending the file a page at a time. Databases should extend files by many pages at a time.)
Sequential access to a file is very common. For instance, sequential access is used for file copying, log reads and writes, sorting, searching, etc.
Both sequential and random access to a database file is common. Sequential access is used for searching without index structures, and for log reads and writes; random access is used for searching with index structures. Database files are structured into database pages that are some fixed multiple of the standard page size. (Packaged code segments are similar to database pages: access to one page implies probable access to the others, based on prior analysis of the program. But the size of code segments is variable.)
Some pages of some files are "hot" (are requested frequently for reading or writing.)
Interface
The following is obviously incomplete, but does give a flavor of what the buffer manager might look like.
FSBuffer: DEFS = BEGIN
pagesPerBuffer: CARDINAL = n; -- 1, 2, 4, and 8 are likely candidates.
bytesPerBuffer: CARDINAL = pagesPerBuffer * Environment.bytesPerPage;
PageRun: TYPE = RECORD [
  firstPage: PageNumber,
  count: CARDINAL ← 1];
-- Identifies a run of pages within a file ([firstPage .. firstPage+count)).
BufferRun: TYPE = LIST OF Buffer;
-- Describes a run of pagesPerBuffer * LENGTH[BufferRun] consecutive buffered file pages.
Buffer: TYPE = REF BufferObject;
BufferObject: TYPE = RECORD [
  pageNumber: PageNumber,
  buffer: LONG POINTER TO BufferBlock,
  dirty: PRIVATE BOOLEAN,
  shareCount: PRIVATE CARDINAL,
  ... ];
BufferBlock: TYPE = PACKED ARRAY [0..bytesPerBuffer) OF CHARACTER;
File: TYPE = REF FileObject;
FileObject: TYPE;
-- Identifies an open file. The reason for open files is to associate state with a particular client of a file (one client might open the file for sequential access, another for random access), and to allow the buffer manager to find the internal state of a file quickly. Pragmatic advice can also be given when the file is used; the proper balance between the two seems to depend on the client.
OpenFile: PROC [f: FileID, mode: {random, sequentialRead, sequentialWrite, log}] RETURNS [File];
-- Opens f in the specified mode. The mode does not restrict the operations allowed on the file, but does give the buffer manager information for policy decisions. The log mode also has semantic content: in a log file, page writes are guaranteed to happen in ascending order.
CloseFile: PROC [f: File];
-- Releases f, which becomes invalid for further use. Has no effect on pages of f in the buffer pool.
ReadAhead: PROC [f: File, p: PageRun];
-- Notifies the buffer manager that the indicated pages are likely to be read soon.
ReadPages: PROC [f: File, p: PageRun] RETURNS [BufferRun];
-- Reads data from pages [p.firstPage .. p.firstPage+p.count) of file f into buffers, and returns them as a BufferRun.
UsePages: PROC [f: File, p: PageRun] RETURNS [BufferRun];
-- Allocates a buffer of length p.count pages, and associates it with the pages [p.firstPage .. p.firstPage+p.count) of file f. No I/O from these pages occurs. (Note that I/O may occur for pages that are part of the buffer but not part of the page run, if the requested run is not aligned with the buffer boundaries. With a bit of extra hair in the buffer implementation this I/O can be avoided. The real question is the extent to which the client of this interface wishes to be ignorant of the buffer size.)
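The alignment question raised in the comment above can be made concrete. A hedged sketch (Python, names invented; pagesPerBuffer assumed to be 4) of the arithmetic: which whole buffers cover a page run, and which pages inside those buffers fall outside the run and would be read from disk purely for alignment:

```python
PAGES_PER_BUFFER = 4  # stand-in for pagesPerBuffer

def covering_buffers(first_page, count):
    """Buffers (by index) overlapping [first_page, first_page+count),
    plus the pages inside those buffers but outside the run -- the
    pages UsePages might transfer only because of alignment."""
    lo = first_page // PAGES_PER_BUFFER
    hi = (first_page + count - 1) // PAGES_PER_BUFFER
    buffers = list(range(lo, hi + 1))
    covered = range(lo * PAGES_PER_BUFFER, (hi + 1) * PAGES_PER_BUFFER)
    extra = [p for p in covered
             if not first_page <= p < first_page + count]
    return buffers, extra

misaligned = covering_buffers(5, 4)  # run 5..8 straddles two buffers
aligned = covering_buffers(4, 4)     # run 4..7 fits one buffer exactly
```

A run of four pages starting at page 5 touches two buffers and drags in four extra pages; the same run aligned at page 4 drags in none.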
ShareBuffer: PROC [b: Buffer];
-- Bumps the share count of the buffer (also incremented by ReadPages and UsePages, and decremented by ReleaseBufferRun.)
MarkBufferWritten: PROC [b: Buffer];
-- Indicates that b has been dirtied, and will eventually need to be written to the disk. Does not alter b’s share count.
ReleaseBufferRun: PROC [b: BufferRun, deactivate: {no, yes, writeBehind}, wait: BOOLEAN];
-- Indicates that the caller is through with the given buffers (decrements share count.) If wait is TRUE and a buffer has been marked written, then the call returns only after the pages in that buffer have been written to disk; otherwise the call returns immediately. If deactivate = yes then the caller is unlikely to reference the given buffers in the near future; if deactivate = writeBehind then the caller expects to deactivate other nearby buffers in the near future.
WriteBehind: PROC [f: File, p: PageRun];
-- Notifies the buffer manager that the indicated pages are likely to have been written in the recent past, and are candidates for transfer to the disk now.
END.--FSBuffer
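To give a flavor of the intended call discipline, here is a hedged sketch in present-day Python of a client's read-modify-release sequence against a stub of the interface. The stub is an invented in-memory stand-in, not the real implementation; only the call pattern (pin, dirty, release-with-wait) mirrors the Mesa interface above:

```python
class StubBufferManager:
    """In-memory stand-in for the FSBuffer interface, just enough
    to show the expected client call discipline."""
    def __init__(self):
        self.store = {}        # (file, page) -> contents on "disk"
        self.dirty = set()
        self.share = {}

    def read_pages(self, f, first, count):
        run = [(f, p) for p in range(first, first + count)]
        for b in run:
            self.share[b] = self.share.get(b, 0) + 1  # pin each buffer
        return run

    def mark_buffer_written(self, b):
        self.dirty.add(b)      # will eventually need a disk write

    def release_buffer_run(self, run, wait=False):
        for b in run:
            self.share[b] -= 1                 # unpin
            if b in self.dirty and wait:
                self.store[b] = "written"      # forced out before return
                self.dirty.discard(b)

mgr = StubBufferManager()
run = mgr.read_pages("f", 0, 2)          # pin pages 0..1 of file f
mgr.mark_buffer_written(run[0])          # client dirtied the first page
mgr.release_buffer_run(run, wait=True)   # unpin; dirty page forced to disk
```

When the release returns, the dirtied page is on "disk" and both share counts are back to zero.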
Implementation
Reading / writing runs is very important. To do this, it is necessary to allocate contiguous VM for the buffer (the alternative, extending the disk face to allow scatter read / gather write, is impractical.) Dynamically allocating contiguous VM through the Space interface will be unacceptably expensive. There seem to be a couple of plausible designs: (1) keep buffers in various popular sizes (1 page, 4 page, 8 page.) This has the advantage of allowing us to run on top of Pilot initially; no private interfaces are needed. (2) Implement our own simple space machinery. Pilot will stay out of the way if we simply Create a space but never Map it to anything. We may then obtain some real memory from Pilot, and manipulate the map to achieve desired effects. A compacting virtual memory allocator is possible, since we have control of all pointers most of the time.
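Design (1) amounts to segregated free lists, one per popular buffer size. A hedged sketch (Python; the 512-byte page size and all names are assumptions for illustration): a request is rounded up to the next popular size, so no contiguous-VM allocation happens per request:

```python
class SizedBufferPools:
    """Design (1): pre-allocated free lists of 1-, 4-, and 8-page
    buffers; requests round up to the next popular size."""
    SIZES = (1, 4, 8)   # pages, as suggested in the text

    def __init__(self, buffers_per_size=4):
        # 512 bytes per page assumed for illustration
        self.free = {s: [bytearray(512 * s) for _ in range(buffers_per_size)]
                     for s in self.SIZES}

    def allocate(self, pages):
        size = next(s for s in self.SIZES if s >= pages)  # round up
        return size, self.free[size].pop()

    def release(self, size, buf):
        self.free[size].append(buf)

pools = SizedBufferPools()
size, buf = pools.allocate(3)   # a 3-page request gets a 4-page buffer
```

The cost of rounding up is internal fragmentation (one wasted page here); the gain is that every buffer is contiguous VM allocated once, at startup.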
Extremely long disk commands are bad for overall system performance. They have the potential of delaying many shorter commands. Long disk queues are a symptom of excessively long commands. A possible rule of thumb is to let no command get so long that its fixed overhead becomes less than 5-10% of the total command time.
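The 5-10% rule of thumb yields a concrete bound on run length. A small worked example (Python; the 20 ms fixed overhead and 1 ms per-page transfer time are illustrative assumptions, not measurements):

```python
def max_run_pages(fixed_ms, per_page_ms, min_overhead_fraction):
    """Longest run for which fixed command overhead remains at least
    the given fraction f of total command time:
       fixed / (fixed + n*per_page) >= f
       =>  n <= fixed * (1 - f) / (f * per_page)"""
    n = fixed_ms * (1 - min_overhead_fraction) / (min_overhead_fraction * per_page_ms)
    return int(n)

# Assumed timings: 20 ms fixed overhead (seek + rotation),
# 1 ms per page transferred, 5% minimum overhead fraction.
limit = max_run_pages(20.0, 1.0, 0.05)
```

Under these assumed timings the bound is generous (hundreds of pages), so in practice the limit would more likely come from buffer memory or from fairness to queued shorter commands.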
Deferred design ideas in or below the buffer manager
Listed below are a few ideas of the flavor "we certainly don’t need X right away, and X may never really be important enough to merit doing, but let’s try not to preclude X’s happening at some future time."
In Pilot, allocating any page requires a label read followed by a label write. The Pilot designers felt that it was important to keep the read and write close together in time, to reduce the likelihood of foul-ups. In our system we should be able to eliminate one I/O by combining the label write with the data write.
The Pilot page allocator works like this: if you request a file of length n, it finds the first n free pages and makes a file out of them. It may be worth expending more effort to make files contiguous. Maybe a file create request should indicate how important contiguity is. It might be enough to simply allocate on a coarser grain for some files.
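The difference between Pilot's first-n-free policy and a contiguity-preferring one can be sketched as follows (Python, purely illustrative; the free map is a boolean-per-page stand-in for the volume allocation map):

```python
def allocate(free_map, n, prefer_contiguous=True):
    """free_map: list of booleans, True = free page.  Pilot's policy
    takes the first n free pages; with prefer_contiguous we first
    look for a single free extent of length n."""
    if prefer_contiguous:
        run = 0
        for i, free in enumerate(free_map):
            run = run + 1 if free else 0
            if run == n:
                return list(range(i - n + 1, i + 1))   # whole extent
    picked = [i for i, free in enumerate(free_map) if free][:n]
    return picked if len(picked) == n else None        # first n free

# free pages: 0, 1 free; 2 in use; 3, 4, 5 free
fm = [True, True, False, True, True, True]
contig = allocate(fm, 3)             # prefers the extent 3..5
first_fit = allocate(fm, 3, False)   # Pilot-style: 0, 1, 3
```

The first-fit result splits the file across the used page at 2, which is exactly what defeats run transfers later.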
The Pilot VFM machinery may not meet our needs. The current implementation (1) spreads VFM pages uniformly around the disk, so that a sequence of VFM page reads probably causes a seek for each read, and (2) buffers only a single page of VFM at any time, even if the system includes multiple volumes. The cheapest improvement is to expand VFM buffering to multiple pages; in the context of Alpine it would be natural to buffer the VFM in the buffer manager, as long as this does not introduce deadlocks. Another possibility is to make the VFM mappable like any other file, so that Pilot can do the buffering in a simple way. This requires a structure in the logical volume leader page to hold together the VFM’s file; this would lead to either some ad-hoc overflow scheme or a bound on how fragmented the VFM can get (compressing the VFM could be done by the scavenger.)
In-situ reorganization (cleanup) of the disk is a known hard problem (it was the "gamma" problem in Simonyi’s thesis.) Having this could be a big performance win, especially if we can identify and cluster high-traffic files. Doing reorganization may involve the buffer manager since it deals with "logical" files, not raw disk pages, at the interface.