VMDoc.tioga
Doug Wyatt, December 11, 1986 3:19:02 pm PST
Rick Beach, March 13, 1987 11:45:37 am PST
VM
CEDAR 7.0 —
VM
The Cedar Virtual Memory System
Roy Levin, et. al.
© Copyright 1985, 1986 Xerox Corporation. All rights reserved.
This document describes VM, the interface to Cedar's virtual memory. It also includes some of the design decisions that influenced the VM implementation.
Keywords: memory allocation, page fault, performance, tuning, virtual memory
XEROX  Xerox Corporation
   Palo Alto Research Center
   3333 Coyote Hill Road
   Palo Alto, California 94304

1. The VM Interface
Basic Types
The fundamental entities in which the VM traffics are page numbers and page counts, defined by the following types.
PageNumber: TYPE = PrincOps.PageNumber;
Implementation note: The language doesn't allow us to express this precisely the way we would like: INT[0..2^24).
PageCount: TYPE = PrincOps.PageCount;
Implementation note: The language doesn't allow us to express this precisely the way we would like: INT[0..2^24].
Their base representation (in PrincOps) is as INTs, but in reality they are non-negative quantities. Although they have 32-bit representations, they are actually limited by the addressing capabilities of the architecture. A virtual address is 32 bits, and since a page is 256 words (see wordsPerPage, below), the maximum page count is 232/28 = 224.
Many of the virtual memory operations manipulate a sequence of consecutive pages in the address space; hence it is convenient to define the following:
Interval: TYPE = RECORD [page: PageNumber, count: PageCount];
nullInterval: Interval = [page: 0, count: 0];
Page State Manipulation
Each page in the virtual memory has a client-visible state, called its PageState. Most of the operations in this interface alter the PageState(s) of the page(s) passed as parameters to them. For the convenience of the client (and occasionally to improve performance), operations generally apply to an Interval rather than an individual page. Any operation defined on an Interval manipulates the pages of the interval in an undefined order. If the operation completes without error, each page will have had the operation applied to it exactly once. However, if an error is reported (e.g., AddressFault), no assumptions can be made about the PageStates of the pages in the interval (other than the information implicit in the error report). In particular, the operation may have been applied successfully to some pages in the interval and not to others.
PageState: TYPE = RECORD [
dataState: DataState,
readOnly: BOOL, -- altered only by MakeReadOnly and MakeReadWrite
hasRealMemory: BOOL, -- altered only by SwapIn and MakeUndefined
needsCleaning: BOOL, -- a hint for clients of "Clean"
pinCount: INT -- altered only by Pin and Unpin
];
DataState: TYPE = {
none, undefined, unchanged, changed};
This type describes the state of the contents of a virtual memory page, independent of whether it presently resides in real memory or in backing storage. The responsibility for maintenance of this state is shared between the implementation of this interface and the client. "none" means that the virtual page has no associated backing storage; i.e., it is unallocated. "undefined" means that no location of the page has a well-defined value. This is the initial state of pages obtained via Allocate[...]. Whenever a location in an allocated virtual memory page is altered, the page data enter the "changed" state. This state is exited only by explicit use of SetDataState; in particular, the "unchanged" state can be reached only by invoking MakeUnchanged.
It should be noted that the unchanged/changed distinction is not the same as the hardware-supported clean/dirty distinction. The latter is not visible to the client of this interface (although the "needsCleaning" bit is a close approximation).
State: PROC [page: PageNumber] RETURNS [state: PageState];
Explicit changes to a page's dataState are made using Free, Kill, MakeChanged, and MakeUnchanged.
SetDataState: PRIVATE UNSAFE PROC [interval: Interval, dataState: DataState];
This procedure appears here only to satisfy the inline definitions of Free, Kill, MakeChanged, and MakeUnchanged.
In most cases, the PageState of a virtual memory page changes only as the result of an operation on that page. (In this context, "operation" means a procedure in this interface or a memory access within the addresses bounded by the page. Thus, a successful store into a page is construed as an operation that changes the page's dataState to "changed".) However, there is one important exception. The "hasRealMemory" field of a page's PageState may change asynchronously with operations on the page. This is because the virtual memory system employs a replacement algorithm that occasionally alters the correspondence between virtual and real memory. When real memory is required by a virtual page for which "hasRealMemory" is FALSE, the replacement algorithm considers pages for which "hasRealMemory" is TRUE and chooses a suitable victim. The requestor's "hasRealMemory" state becomes TRUE; the victim's becomes FALSE. The replacement algorithm's decision process can be affected by certain operations in this interface (e.g., Pin, Age).
Virtual Memory Allocation
Virtual memory is divided into four (disjoint) partitions:
VMPartition: TYPE = {lowCore, pda, mds, normalVM};
The "normalVM" partition describes regular virtual memory and is the one that most clients will use. The remaining three partitions are rather special and are intended for use by knowledgeable clients providing low-level I/O and architectural support facilities; such clients will necessarily operate below the safe language level.
LogPageCount: TYPE = NAT[0..23--LogBase2[PageCount.LAST]-1--];
Allocate: PROC [count: PageCount, partition: VMPartition ← normalVM, subRange: Interval ← nullInterval, start: PageNumber ← 0, alignment: LogPageCount ← 0, in64K: BOOLFALSE] RETURNS [interval: Interval];
Allocate searches the specified partition of virtual memory for an interval such that
(1) for each page in "interval", State[page].dataState = none,
(2) interval.count = count,
(3) if subRange.count >= count, then "interval" is contained in "subRange",
(4) interval.page MOD 2alignment is zero, and
(5) if start is non-zero, then interval.page-start is minimal, and
(6) if in64K is TRUE, then the interval does not cross a 64K word boundary.
If such an interval is found, it is returned and, for every page in "interval",
State[page] = [dataState: undefined, readOnly: FALSE, hasRealMemory: FALSE, pinCount: 0]
If no such interval can be found, CantAllocate is raised, passing a interval "bestInterval" such that
(1) for each page in "bestInterval", State[page].dataState = none,
(2) bestInterval.count < count with count-bestInterval.count minimal, and
(3) if subRange.count >= count, then "bestInterval" is contained in "subRange",
Note that "bestInterval" has no guaranteed alignment properties.
Performance note: if in64K is TRUE (and partition = normalVM), the allocator adopts a "clumping" policy that will tend to segregate storage allocated with in64K = TRUE from other storage. The allocator will put aside "clumps" of 64K words aligned on 64K word boundaries and satisfy all requests with in64K = TRUE from these clumps. This is intended to be used for "long-term" storage, like code segments and display bitmaps, in order to cut down on virtual memory fragmentation.
CantAllocate: ERROR [bestInterval: Interval];
This error is raised by Allocate and gives the largest available interval within the requested one.
Free: UNSAFE PROC [interval: Interval]
= TRUSTED INLINE {SetDataState[interval, none]};
Free makes previously allocated memory again available for allocation. If, for any page in "interval", State[page].dataState = none, AddressFault is raised.
Procedures that may affect the virtual-real memory correspondence
SwapIn ensures that real memory is associated with the specified interval of virtual memory. Common variants (Touch and Pin) are defined separately for convenience and clarity.
SwapIn: UNSAFE PROC [interval: Interval, kill: BOOLFALSE, pin: BOOLFALSE, nextPage: PageNumber ← 0];
The "kill" parameter permits an optimization of the client's code: if TRUE, the equivalent of SetDataState[interval, undefined] is first performed. The subsequent behavior of SwapIn for each page in "interval" is as follows:
If State[page].dataState = none, AddressFault is raised.
If State[page].hasRealMemory = FALSE (i.e., the page does not already have associated real memory):
A real memory page is assigned, causing State[page].hasRealMemory to become TRUE. In doing so, other virtual memory pages with associated real memory and for which State[otherPage].pinCount = 0 may have their real memory reclaimed (after ensuring that its data are written to backing storage).
If State[page].dataState is "unchanged" or "changed", the real memory page is set to the contents of the corresponding backing storage page (i.e., it is read from the disk).
State[page].dataState is unaffected by this procedure.
If "pin" is TRUE, State[page].pinCount is incremented by one.
The implementation may perform certain optimizations that cause the actual order of state changes to differ from the sequence implied by this description. In particular, disk transfers may be batched, and if an address fault is detected within the interval, there is no guarantee that the preceding pages in the interval will have reached the states implied by the above description. Furthermore, if "kill" is TRUE, the implementation may optimize the real memory management to avoid freeing and reallocating memory.
The "nextPage" parameter is a hint to be used in optimizing the disk arm; if it is non-zero, the disk driver will be informed that the likely subsequent disk operation will be to the backing storage page associated with "nextPage".
The primary clients of SwapIn will be the PageFault process, which will call with kill and pin FALSE, and the File system, which will typically call with kill = reading and pin = TRUE before initiating an i/o operation.
Kill, MakeUndefined: UNSAFE PROC [interval: Interval]
= TRUSTED INLINE {SetDataState[interval, undefined]};
If, for any page in "interval", State[page].dataState = none, AddressFault is raised. Otherwise, for each page in "interval", State[page].dataState is set to "undefined" and State[page].hasRealMemory becomes FALSE (i.e., any associated real memory is reclaimed).
Touch: PROC [interval: Interval]
= TRUSTED INLINE {SwapIn[interval: interval]};
If, for any page in "interval", State[page].dataState = none, AddressFault is raised. Otherwise, for each page in "interval", State[page].hasRealMemory becomes TRUE, causing memory accesses to the page to proceed without delay. Since the real memory has not been pinned (see below), the real memory may be reclaimed by the replacement algorithm at any time.
This procedure would generally be used by a client who plans to reference an interval in the near future and wants to avoid the delay of a page fault, e.g., Process.Detach[FORK Touch[interval]].
Virtual memory pages can be "pinned" in real memory; that is, their contents can be forced to reside in main storage until they are subsequently unpinned. Since pinning reduces the pool of pages that can participate in the replacement algorithm, it should be used only when logically required and not to try to reduce swapping (it is likely to have just the opposite effect). Consequently, only specialized clients should use Pin and Unpin.
Pin: PROC [interval: Interval]
= TRUSTED INLINE {SwapIn[interval: interval, pin: TRUE]};
For each page in "interval", the following actions occur:
If State[page].dataState = none, AddressFault is raised.
If State[page].hasRealMemory is FALSE, a page of real memory is acquired, causing State[page].hasRealMemory to become TRUE. Then, unless State[page].dataState is "undefined", the real memory page is set to the contents of the corresponding backing storage page (i.e., the page is read from disk).
State[page].pinCount is incremented by one.
Unpin: PROC [interval: Interval];
For each page in "interval", the following actions occur:
If State[page].dataState = none, AddressFault is raised.
If State[page].pinCount is greater than zero, it is decremented by one.
Upon completion of Unpin, any page in "interval" for which State[page].pinCount = 0 becomes eligible for consideration by the replacement algorithm.
Procedures that do not affect the virtual-real memory correspondence
The following two procedures alter the "readOnly" field of a PageState. These procedures affect only the virtual memory system's data structures; they do not cause any I/O operations to occur as side-effects.
MakeReadOnly: PROC [interval: Interval];
If, for any page in "interval", State[page].dataState = none, AddressFault is raised. Otherwise, for each page in "interval", State[page].readOnly becomes TRUE. A subsequent attempt to store into any page of the interval will cause WriteProtectFault to be raised.
MakeReadWrite: PROC [interval: Interval];
If, for any page in "interval", State[page].dataState = none, AddressFault is raised. Otherwise, for each page in "interval", State[page].readOnly becomes FALSE.
The following two procedures alter the "dataState" field of a PageState. These procedures affect only the virtual memory system's data structures; they do not cause any I/O operations to occur as side-effects.
MakeUnchanged: PROC [interval: Interval]
= TRUSTED INLINE {SetDataState[interval, unchanged]};
If, for any page in "interval", State[page].dataState = none, AddressFault is raised. Otherwise, State[page].dataState is set to "unchanged".
Note: MakeUnchanged can be used by a client to detect when the contents of a virtual memory page have been altered, since any store causes the page's dataState to become "changed".
MakeChanged: PROC [interval: Interval]
= TRUSTED INLINE {SetDataState[interval, changed]};
If, for any page in "interval", State[page].dataState = none, AddressFault is raised. Otherwise, State[page].dataState is set to "changed".
Performance Accelerators
The procedures described here have no visible effect on the PageState and therefore are not formally necessary. However, they do affect the performance of the virtual memory implementation and are intended for use when the client has particular knowledge of the virtual memory reference patterns.
Clean: PROC [interval: Interval];
If, for any page in "interval", State[page].dataState = none, AddressFault is raised. Otherwise, this procedure ensures that the backing storage for each page in the interval contains the same information as the associated real memory (if any).
Age: PROC [interval: Interval];
If, for any page in "interval", State[page].dataState = none, AddressFault is raised. Otherwise, this procedure increases the likelihood that the backing storage associated with pages in the interval will be selected for replacement. However, this procedure does not initiate any I/O operations.
ForceCleaning: PROC ;
A call to this procedure will force the VM laundry process to clean some memory.
Errors
The following two errors generally arise from erroneous memory accesses to virtual memory. AddressFault means that an unallocated page (i.e., one for which State[page].dataState = none) has been referenced. (The name AddressFault is intended to suggest that the program performing the access has an incorrect address.) Following this model, the operations in this interface raise AddressFault if any page they are asked to manipulate is unallocated. WriteProtectFault is raised by an attempt to store into a page for which State[page].readOnly is TRUE. No operations in this interface raise WriteProtectFault explicitly.
AddressFault: ERROR [address: LONG POINTER];
WriteProtectFault: ERROR [address: LONG POINTER];
CantDoIO is raised by SwapIn (including its derivatives Touch and Pin) and Clean when the input/output operation from/to backing storage cannot be performed. The "reason" parameter is a generic classification of the cause of the error.
CantDoIO: ERROR [reason: IOErrorType, page: PageNumber];
IOErrorType: TYPE = {software, hardware};
The Laundry process can have trouble cleaning a page. It records the error, and returns it when LastLaundryError is called. A value of [software, 0] returned means no errors have been encountered.
LaundryError: TYPE = RECORD [
errorType: IOErrorType, -- error classification
page: PageNumber -- page that gave trouble
];
LastLaundryError: PROC RETURNS [LaundryError];
Utilities
It is frequently necessary to convert between counts in different units (pages, words, bytes) and between counts and addresses. The following definitions and procedures are copied from the PrincOps and PrincOpsUtils interfaces to avoid forcing clients to acquire those interfaces for these common operations.
wordsPerPage: NAT = PrincOps.wordsPerPage;
bytesPerPage: NAT = PrincOps.bytesPerPage;
Implementation note: These implementations are temporary ones (see PrincOpsUtils) that should be updated to the commented-out versions when the Trinity instruction set is adopted.
PagesForWords: PROC [words: INT] RETURNS [pages: PageCount]
= INLINE {
RETURN[Basics.DBITSHIFT[words, -logWordsPerPage]]
longA: Basics.LongNumber = [li[li: words+wordsPerPage-1]];
longB: Basics.LongNumber = [bytes[lh: longA.hl, ll: longA.lh, hh: 0, hl: longA.hh]];
RETURN[longB.li]
};
WordsForPages: PROC [pages: PageCount] RETURNS [words: INT]
= INLINE {
RETURN[Basics.DBITSHIFT[pages, logWordsPerPage]]
longA: Basics.LongNumber = [li[li: pages]];
longB: Basics.LongNumber = [bytes[lh: longA.ll, ll: 0, hh: longA.hl, hl: longA.lh]];
RETURN[longB.li]
};
PagesForBytes: PROC [bytes: INT] RETURNS [pages: PageCount]
= INLINE {
RETURN[Basics.DBITSHIFT[bytes, -logBytesPerPage]]
n1: Basics.LongNumber = [li[li: bytes+1]];
n2: Basics.LongNumber = [num[
lowbits: Basics.BITSHIFT[n1.lowbits, -1] +
(IF Basics.BITAND[n1.highbits, 1] = 1 THEN 100000B ELSE 0),
highbits: Basics.BITSHIFT[n1.highbits, -1]
]];
RETURN[PagesForWords[n2.li]]
};
BytesForPages: PROC [pages: PageCount] RETURNS [bytes: INT]
= INLINE {
RETURN[Basics.DBITSHIFT[pages, logBytesPerPage]]
words: INT = WordsForPages[pages];
RETURN[words+words]
};
AddressForPageNumber: UNSAFE PROC [page: PageNumber] RETURNS [address: LONG POINTER]
= INLINE {
RETURN[Basics.DBITSHIFT[page, logWordsPerPage]]
longA: Basics.LongNumber = [li[li: page]];
longB: Basics.LongNumber = [bytes[lh: longA.ll, ll: 0, hh: longA.hl, hl: longA.lh]];
RETURN[LOOPHOLE[longB.lc]]
};
PageNumberForAddress: PROC [address: LONG POINTER] RETURNS [page: PageNumber]
= INLINE {
RETURN[Basics.DBITSHIFT[address, -logWordsPerPage]]
longA: Basics.LongNumber = LOOPHOLE[address];
longB: Basics.LongNumber = [bytes[lh: longA.hl, ll: longA.lh, hh: 0, hl: longA.hh]];
RETURN[longB.li]
};
Special-purpose Facilities
The following zone provides a simple way for obtaining small blocks of storage from the first 64K of the virtual address space. All such memory is permanently resident. At present, only lowCore.NEW is supported; lowCore.FREE doesn't actually reclaim any storage. The block of memory returned by lowCore.NEW is guaranteed to be 16-word aligned.
lowCore: UNCOUNTED ZONE;
The following procedure also obtains small blocks of storage from the first 64K of the virtual address space. It returns a RELATIVE POINTER, which is required by some Faces; this pointer is relative to PrincOps.first64K.
AllocateLowCore: PROC [words: CARDINAL, alignment: PrincOps.Alignment ← a1]
RETURNS [PrincOps.Base RELATIVE POINTER];
2. VM Design Decisions
The following material is from "Some Design Assumptions and Decisions for a New Virtual Memory System," dated January 17, 1983 4:14 pm.
A read/write file system model is to be used
We have some experience with the mapped file model and believe that it does not offer any significant advantages over the more traditional read/write model for file access. A mapped file approach may reduce the total amount of backing storage required for virtual memory, but the code to implement it seems to be more complex. A read/write model tends to require more backing storage but reduces or eliminates some sources of implementation complexity (e.g., residency requirements/strategies for backing storage address tables, swap buffers, file size modification)
The entire virtual memory can be backed by a single disk file
The virtual memory is backed by a large, anonymous disk file. Our disks are large enough to accommodate a 24-bit address space (64K pages of backing storage; about 32 megabytes) on a Dorado and a 22-bit address space (16K pages; 8 megabytes) on a Dolphin. (22 bits should be acceptable on a Dandelion with an SA4000 or Quantum disk; perhaps even 23 bits would fit.)
In principle, the mapping from virtual page number to disk page number is a simple numerical calculation. In practice, one cannout assume that it will be possible to allocate a file of 16K-64K pages completely contiguously, since most disks have one or more bad spots. Thus, the mapping will be implemented with a (presumbably short) run table.
Coarse-grain synchronization of virtual memory operations is acceptable
A single monitor will synchronize all virtual memory manipulations. In particular, the monitor will be held during fault-handling. This means that during an I/O transfer initiated by either a page-fault or a swapout, no access to the virtual memory data structures will be permitted. This eliminates the need for selective locking of virtual address ranges at the expense of some potential parallelism. We believe the actual lost parallelism will be slight.
Simple "seek-ahead" provides adequate I/O overlap with fault-handling
Page faults are handled in a simple, synchronous way. There is a single PageFault process that is awakened by the occurrence of a fault, calls into the virtual memory system to swap in the requested page, and waits for the I/O to complete. No queuing facilities are provided at the disk level. A simple "seek-ahead" scheme is employed: when a disk request is submitted, it may include a disk address to which a seek is to be performed at the completion of the requested operation. The idea is that, if the PageFault queue has multiple requests in it, the disk address for the second faultee's page is supplied when the first faultee's page is requested. Therefore, when the I/O transfer completes, the software can restart the first faultee and initiate the I/O for the second faultee while the disk is already seeking to the required address. Since all of virtual memory is backed by storage on a single disk, there is no opportunity for overlapping transfers with each other. "Seek-ahead" would seem to provide most of the concurrency necessary for adequate performance.
Swap units are not a fundamental requirement
Pilot's swap unit notion introduces substantial complexity at the level of the swapper. We propose that the lowest level swapping facilities traffic in intervals of virtual memory, but that the PageFault process, at least initially, only request single pages at a time. Should it become useful to do so, additional data structures can be provided to permit the PageFault process to request appropriate swap units.
Multiple drives must be supported
The disk facilities must be able to support overlapped transfers to multiple drives, since this will be an important performance consideration for Alpine. However, the virtual memory system has no need for this facility; it is provided at the "disk channel" level.
A "real memory only" version of the virtual memory system is useful
It is useful to be able to write utility programs that massage the disk (e.g., formatter, scavenger) using the same virtual memory system as one uses for regular programs. The only difference is that there is no presumption that backing storage is available, and consequently the virtual memory system is constrained to operate within the available real memory and without swapping. This is the UtilityPilot model. Such a facility naturally falls out of the proposed design.
Cedar "safe" storage should be available at the lowest possible level
Adequate functionality should be provided in the virtual memory system to permit the file system to be a Cedar program. This simplifies the task of writing the file system and, in conjunction with the "real memory only" version of the system, permits utility programs to be Cedar programs as well, and hence more easily maintained.