Heading:z18697qjk40(635)\gAlpine buffer manager design, version 0z18697y756qjk40\gPage Numbers: Yes  X: 527  Y: 10.5"z18697qjk40\gCSL Notebook Entryz18697l4445y762\f5bgTo:	Alpine designers	Date:	March 12, 1981z18697l4445d2998e21(0,65535)(1,4445)(5,11684)(6,13653)\f1g3f0t2 1t0 16t6 1f1t0 5f0t7 1t0From:	Mark Brown	Location:	PARC/CSLz18697l4445d2998y716e25(6,13664)\f1g5f0t2 1t0 10t6 1f1t0 9f0t7 1t0Subject:	Alpine buffer manager design, version 0	File:	[Ivy]<Alpine>Doc>BufferDesign0.bravoz19579l4445d2998e25\f1g8f0t2 1t0 39t6 1f1t0 5f0t7 1f1t0 36f0XEROX       z18697l508y644e14(2116)\f2g5f0Attributes:	informal, technical, Database, Distributed computing, Filingz18697l4896d2999e10(0,4904)(1,65535)(5,65535)(6,65535)\f1g11f0Abstract:	This memo proposes the scope of the Alpine buffer manager, and makes some observations on how it might be implemented using Pilot.z18697l4896d2999e10j\f1g9f0z18697e10(2116)\gvIntroduction  responsibility of buffer managerz18697e36k100(635)\bgFor the sake of this discussion, we consider the "buffer manager" to be a component of Alpine that lives below the FileStore interface.  It does not present a generally-usable interface such as VMDefs.  Instead, it provides an interface to logical files on the disk, without transactions.  That is, a typical call might be "return a pointer to a buffer containing page p of file f," not "return a pointer to a buffer containing the page at disk address d," or "read page p of file f under transaction t."z18697e12jk40\gAppropriate buffering is a key to good file system performance.  Without it, the average time to service a page request is likely to be at least the average rotational latency of the disk, if not the average seek time.  With buffering we can do much better than this.z18697e12jk40\gOne function of buffering is to keep frequently referenced pages of a file resident.  
Examples of files within the file system itself that are candidates for this type of buffering are the volume file map (VFM) and volume allocation map.  (Pilot buffers only one page of the VFM, and may suffer for it.)  This buffering has the biggest payoff when a file is widely shared; a file that is maintained by a single application may be buffered by that application, which can make more intelligent policy decisions than the file system can.  (But a sequence of incarnations of the same application has nearly the same effect as simultaneous sharing, if the various instances do not communicate.)

A pool of pages with least-recently-used page replacement would keep frequently referenced pages resident, but this is not the best buffering strategy.  For many common patterns of access, LRU is exactly the opposite of the correct replacement strategy.  A page that has just been read from a stream connection is very unlikely to be read again in the near future; hence it should be recycled immediately.  The buffer manager's replacement policy should prevent a single file's pages from flooding the buffer pool due to a stream transfer; the amount of buffering required to keep a stream going is not extremely large.

A second role of the buffer manager is to perform read-ahead (read pages that have not yet been requested) and write-behind (delay a page write to a more opportune time, when a number of pages can be written.)  Using Pilot on a Dolphin, it is necessary to express read-ahead and write-behind in some way that translates into a single disk command to transfer a run of consecutive pages from the disk to a run of consecutive pages of virtual memory.  If runs of pages are not used at the disk face, then a rotation of the disk will be lost between consecutive page transfer commands.
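The run-based transfers just described amount to coalescing consecutive page numbers into (firstPage, count) pairs, each of which can become a single disk command.  A minimal sketch, with a hypothetical helper name (this is not Pilot's actual disk face):

```python
def coalesce_runs(pages):
    """Group page numbers into (firstPage, count) runs so each run can
    be issued as a single disk command.  Illustrative sketch only."""
    runs = []
    for p in sorted(pages):
        if runs and p == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)  # extend current run
        else:
            runs.append((p, 1))                        # start a new run
    return runs

assert coalesce_runs([3, 4, 5, 9, 10, 12]) == [(3, 3), (9, 2), (12, 1)]
```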
The buffer manager must also be willing to perform certain requests without reordering them, for instance the page writes to a log.

The buffer manager should have the ability to determine adaptively the number of buffer pages to assign to a particular file, based on the rate of recent activity on the file and the overall load on the system.  The number of real memory pages assigned to the buffer pool might also vary with the level of file system activity (this is mainly useful if some other program coexists with the file system on a machine.)

The buffer manager is in a good position to detect thrashing.  It can take some actions on its own to reduce thrashing (for instance, reducing its use of read-ahead and write-behind), but it should also communicate the situation to a higher-level system component that can reduce the degree of concurrency by delaying the creation of new transactions or aborting existing transactions.

The buffer manager should not know about transactions.  Mutual exclusion of writers to a particular buffer page is provided by the lock mechanism, not by page buffer monitors.  The transaction manager can provide pragmatic suggestions to the buffer manager, like any other client.

The buffer manager's job is greatly simplified if it deals only with a single size of buffer.  The size of these buffers should be a compile-time constant that is some multiple of the disk page size, so that speed may be traded for memory fragmentation by increasing the buffer size.  (Unfortunately, the lack of scatter / gather in the disk face and the need to process runs in a single disk face command makes fixed-size buffers less attractive.)

The buffer manager cannot be expected to make perfect decisions.  It will always be possible to construct perverse worst-case scenarios to make its policies look bad.
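One understandable shape for the replacement policy discussed earlier (LRU, except that a page just consumed by a stream read is recycled immediately rather than allowed to flood the pool) is an LRU queue that honors a deactivate hint.  The sketch below is an assumption about how such a policy might look, not a description of any actual Alpine code:

```python
from collections import OrderedDict

class ReplacementQueue:
    """Sketch of a replacement policy: normally LRU, but a buffer
    released with a deactivate hint (e.g. after a stream read) becomes
    the *first* victim instead of the last.  Hypothetical names."""
    def __init__(self):
        self.q = OrderedDict()  # key -> buffer; front of queue = next victim

    def release(self, key, buf, deactivate=False):
        self.q.pop(key, None)
        self.q[key] = buf                        # most-recently used: back
        if deactivate:
            self.q.move_to_end(key, last=False)  # recycle immediately: front

    def victim(self):
        return self.q.popitem(last=False)[0]

rq = ReplacementQueue()
rq.release(("f", 1), b"...")
rq.release(("g", 9), b"...", deactivate=True)  # stream page: toss it first
assert rq.victim() == ("g", 9)
```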
We should strive for an understandable set of policies that works well on the average, given the expected reference pattern.

Basic assumptions of buffer manager design

Pages that are consecutive in the file are (generally) consecutive on the disk.  (It helps if clients, such as the file transfer program, allocate an entire file with one call instead of extending the file a page at a time.  Databases should extend files by many pages at a time.)

Sequential access to a file is very common.  For instance, sequential access is used for file copying, log reads and writes, sorting, searching, etc.

Both sequential and random access to a database file are common.  Sequential access is used for searching without index structures, and for log reads and writes; random access is used for searching with index structures.  Database files are structured into database pages that are some fixed multiple of the standard page size.  (Packaged code segments are similar to database pages: access to one page implies probable access to the others, based on prior analysis of the program.  But the size of code segments is variable.)

Some pages of some files are "hot" (are requested frequently for reading or writing.)

Interface

The following is obviously incomplete, but does give a flavor of what the buffer manager might look like.

FSBuffer: DEFS = BEGIN

pagesPerBuffer: CARDINAL = n;  -- 1, 2, 4, and 8 are likely candidates.
bytesPerBuffer: CARDINAL = pagesPerBuffer * Environment.bytesPerPage;

PageRun: TYPE = RECORD [firstPage: PageNumber, count: CARDINAL ← 1];
-- Identifies a run of pages within a file ([firstPage .. firstPage+count).)

BufferRun: TYPE = LIST OF Buffer;
-- Describes a run of pagesPerBuffer * LENGTH[BufferRun] consecutive buffered file pages.

Buffer: TYPE = REF BufferObject;
BufferObject: TYPE = RECORD [
  pageNumber: PageNumber,
  buffer: LONG POINTER TO BufferBlock,
  dirty: PRIVATE BOOLEAN,
  shareCount: PRIVATE CARDINAL,
  ... ];
BufferBlock: TYPE = PACKED ARRAY [0..bytesPerBuffer) OF CHARACTER;

File: TYPE = REF FileObject;
FileObject: TYPE;
-- Identifies an open file.  The reason for open files is to associate state with a particular client of a file (one client might open the file for sequential access, another for random access), and to allow the buffer manager to find the internal state of a file quickly.  Pragmatic advice can also be given when the file is used; the proper balance between the two seems to depend on the client.

OpenFile: PROC [f: FileID, mode: {random, sequentialRead, sequentialWrite, log}] RETURNS [File];
-- Opens f in the specified mode.  The mode does not restrict the operations allowed on the file, but does give the buffer manager information for policy decisions.  The log mode also has semantic content: in a log file, page writes are guaranteed to happen in ascending order.

CloseFile: PROC [f: File];
-- Releases f, which becomes invalid for further use.
-- Has no effect on pages of f in the buffer pool.

ReadAhead: PROC [f: File, p: PageRun];
-- Notifies the buffer manager that the indicated pages are likely to be read soon.

ReadPages: PROC [f: File, p: PageRun] RETURNS [BufferRun];
-- Reads data from pages [p.firstPage .. p.firstPage+p.count) of file f into some buffer, and returns it.

UsePages: PROC [f: File, p: PageRun] RETURNS [BufferRun];
-- Allocates a buffer of length p.count pages, and associates it with the pages [p.firstPage .. p.firstPage+p.count) of file f.  No I/O from these pages occurs.  (Note that I/O may occur for pages that are part of the buffer but not part of the page run, if the requested run is not aligned with the buffer boundaries.  With a bit of extra hair in the buffer implementation this I/O can be avoided.  The real question is the extent to which the client of this interface wishes to be ignorant of the buffer size.)

ShareBuffer: PROC [b: Buffer];
-- Bumps the share count of the buffer (also incremented by ReadPages and UsePages, and decremented by ReleaseBufferRun.)

MarkBufferWritten: PROC [b: Buffer];
-- Indicates that b has been dirtied, and will eventually need to be written to the disk.  Does not alter b's share count.

ReleaseBufferRun: PROC [b: BufferRun, deactivate: {no, yes, writeBehind}, wait: BOOLEAN];
-- Indicates that the caller is through with the given buffers (decrements share count.)  If wait is TRUE and a buffer has been marked written, then the call will return after the pages in the buffer have been written to disk; otherwise the call returns immediately.  If deactivate = yes then the caller is unlikely to reference the given buffers in the near future; if deactivate = writeBehind then the caller expects to deactivate other nearby buffers in the near future.

WriteBehind: PROC [f: File, p: PageRun];
-- Notifies the buffer manager that the indicated pages are likely to have been written in the recent past, and are candidates for transfer to the disk now.

END.  -- FSBuffer

Implementation

Reading and writing runs is very important.  To do this, it is necessary to allocate contiguous VM for the buffer (the alternative, extending the disk face to allow scatter read / gather write, is impractical.)  Dynamically allocating contiguous VM through the Space interface will be unacceptably expensive.  There seem to be a couple of plausible designs: (1) keep buffers in various popular sizes (1 page, 4 pages, 8 pages.)  This has the advantage of allowing us to run on top of Pilot initially; no private interfaces are needed.  (2) Implement our own simple space machinery.  Pilot will stay out of the way if we simply Create a space but never Map it to anything.  We may then obtain some real memory from Pilot, and manipulate the map to achieve the desired effects.  A compacting virtual memory allocator is possible, since we have control of all pointers most of the time.

Extremely long disk commands are bad for overall system performance.  They have the potential of delaying many shorter commands.  Long disk queues are a symptom of excessively long commands.
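The tension here is that long runs amortize the disk's fixed per-command cost (seek plus rotational latency), while overly long commands monopolize the disk.  A simple cost model makes this concrete; the timing constants below are illustrative assumptions, not measured Dolphin figures:

```python
def overhead_fraction(run_pages, fixed_overhead_ms=20.0, per_page_ms=2.0):
    """Fraction of a disk command's total time that is fixed overhead
    (seek + rotational latency).  The constants are assumed values
    chosen only to illustrate the shape of the curve."""
    total = fixed_overhead_ms + run_pages * per_page_ms
    return fixed_overhead_ms / total

# Longer runs amortize the fixed cost, but past some length the
# command is almost pure data transfer and ties up the disk:
assert overhead_fraction(1) > 0.9     # one page: mostly overhead
assert overhead_fraction(100) < 0.1   # long run: overhead is negligible
```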
A possible rule of thumb is to let no command get so long that its fixed overhead becomes less than 5-10% of the total command time.

Deferred design ideas in or below the buffer manager

Listed below are a few ideas of the flavor "we certainly don't need X right away, and X may never really be important enough to merit doing, but let's try not to preclude X's happening at some future time."

In Pilot, allocating any page requires a label read followed by a label write.  The Pilot designers felt that it was important to keep the read and write close together in time, to reduce the likelihood of foul-ups.  In our system we should be able to eliminate one I/O by combining the label write with the data write.

The Pilot page allocator works like this: if you request a file of length n, it finds the first n free pages and makes a file out of them.  It may be worth expending more effort to make files contiguous.  Maybe a file create request should indicate how important contiguity is.  It might be enough to simply allocate on a coarser grain for some files.

The Pilot VFM machinery may not meet our needs.  The current implementation (1) spreads VFM pages uniformly around the disk, so that a sequence of VFM page reads probably causes a seek for each read, and (2) buffers only a single page of VFM at any time, even if the system includes multiple volumes.  The cheapest improvement is to expand VFM buffering to multiple pages; in the context of Alpine it would be natural to buffer the VFM in the buffer manager, as long as this does not introduce deadlocks.  Another possibility is to make the VFM mappable like any other file, so that Pilot can do the buffering in a simple way.
This requires a structure in the logical volume leader page to hold the VFM's file together; it would lead to either some ad-hoc overflow scheme or a bound on how fragmented the VFM can get (compressing the VFM could be done by the scavenger.)

In-situ reorganization (cleanup) of the disk is a known hard problem (it was the "gamma" problem in Simonyi's thesis.)  Having this could be a big performance win, especially if we can identify and cluster high-traffic files.  Doing reorganization may involve the buffer manager, since it deals with "logical" files, not raw disk pages, at its interface.
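One ingredient of such a reorganization, identifying high-traffic files and clustering them, can be sketched as a layout policy.  This is a toy illustration under assumed inputs (per-file lengths and access counts); it says nothing about how Alpine would actually gather those statistics or move pages:

```python
def cluster_layout(file_lengths, access_counts):
    """Sketch of one reorganization policy: lay files out contiguously,
    hottest first, so high-traffic files cluster near each other.
    All inputs and names are hypothetical."""
    order = sorted(file_lengths, key=lambda f: -access_counts.get(f, 0))
    layout, next_page = {}, 0
    for f in order:
        layout[f] = (next_page, file_lengths[f])  # (firstPage, count)
        next_page += file_lengths[f]
    return layout

layout = cluster_layout({"log": 100, "vfm": 10, "tmp": 50},
                        {"vfm": 900, "log": 500, "tmp": 3})
assert layout["vfm"] == (0, 10)     # hottest file is placed first
assert layout["log"] == (10, 100)
assert layout["tmp"] == (110, 50)
```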