mem.tioga  Bland, Sindhu, Vuillemin  March 7, 1986
Dragon-86-04

4.0 Overview

The Dragon system can support up to four gigabytes of shared memory. It has multiple virtual address spaces and a paged virtual memory with 32-bit virtual and real addresses. A modest protection scheme divides each space into two segments, kernel and user. A two-level cache architecture and special support for high-speed I/O assure good performance with up to 16 processors per Dragon system.

Figure 4.0 Memory System Overview

4.1 The Two Level Cache Architecture

The Dragon system associates caches with each processor. Each cache contains relevant copies of addresses and data from main memory. When a processor issues a Read(A), one of its caches responds with the data D if it has that data (the hit case); otherwise (the miss case), the cache fetches D from the next level of the memory hierarchy and keeps it for further reference. Since the caches are smaller than main memory, a cache must usually victimize (delete) an entry on every miss. The global effect is to speed up most reads and to reduce traffic on the shared line to memory, thus allowing a single M bus to service multiple caches effectively.

Dragon's memory is implemented as a hierarchical cached memory. The root of this hierarchy is main memory, which is connected at the next lower level to a global communication M bus. Each Dragon processor is connected at the next higher level to two private processor caches, called the small caches. One of the small caches, the IFU cache, is for instructions; the other, the EU cache, is for data. The small caches are connected above to an internal M bus, called the MI bus. Each MI bus is connected above to another type of cache, the big cache. A given big cache is connected above to an M bus (which can be the main M bus in a two-level system, or another MI bus in deeper hierarchies). Thus, the hierarchical cached memory is a tree: the root is the main memory, the internal nodes are the big caches, and the leaf nodes are the small caches.

The small caches use virtual addresses in communication with their processor and real addresses for communication over the MI bus. The big cache uses real addresses for all communication. The big cache itself contains control logic as well as the real address, owner, shared, and exists-below bits for each of its entries; the actual data for the entries is contained in a separate 64K RAM attached to the big cache. An arbiter for the MI bus sits inside the big cache. It merges all memory requests with the constraint that it does not reorder requests from any single processor.

The address for a given piece of data is in two parts: the high-order 22 bits specify the real or virtual page number, and the low-order 10 bits specify the offset within that page; this offset is the same for real and virtual addresses. Caches first match against the entire 32-bit address field. If the 32-bit address misses, the cache matches against the page portion (the high-order 22 bits) of the address. If the page hits, the real address is formed by concatenating the high-order 22 bits of the entry's real address with the low-order 10 bits of the original requested address. Thus a cache can form the real address of a requested piece of data and use it to retrieve that data from a higher level of the memory hierarchy.
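This two-step match is easy to express in code. The following is a minimal C sketch under simplifying assumptions (a single entry per cache; the names CacheEntry and Lookup are illustrative, not from the Dragon design); it shows only the matching logic, not the bus traffic a real miss generates.

    #include <stdint.h>
    #include <stdbool.h>

    #define OFFSET_BITS 10                 /* low-order 10 bits: offset in page */
    #define PAGE(a)   ((a) >> OFFSET_BITS) /* high-order 22 bits: page number */
    #define OFFSET(a) ((a) & ((1u << OFFSET_BITS) - 1))

    typedef struct {
        uint32_t va;     /* virtual address of the cached word */
        uint32_t ra;     /* corresponding real address */
        uint32_t data;
        bool     valid;
    } CacheEntry;

    /* First match the full 32-bit address; on a miss, match the page
       portion alone and, if it hits, form the real address from the
       entry's real page and the offset of the requested address. */
    bool Lookup(const CacheEntry *e, uint32_t va, uint32_t *data, uint32_t *ra)
    {
        if (!e->valid)
            return false;
        if (e->va == va) {                 /* full hit: data is present */
            *data = e->data;
            return true;
        }
        if (PAGE(e->va) == PAGE(va))       /* page hit: form real address */
            *ra = (PAGE(e->ra) << OFFSET_BITS) | OFFSET(va);
        return false;                      /* caller fetches via *ra */
    }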
4.2 The Cache Consistency Problem

The cache consistency problem is for the memory system to prevent multiple copies of a datum in distinct caches from having different values. This problem arises because Dragon caches contain shared data that may be written into. The solution, implemented in hardware, uses the snoopy cache scheme described below. This scheme guarantees that requests from all processors for memory are serialized by the Arbiter. The Arbiter merges all requests with the constraint that it does not reorder requests from any single processor.

Figure 4.1 Organization - The Snoopy Caches

Definitions:

virtual address  This 32-bit address is used in communication with the processor.

real address  This 32-bit address is used in communication on the MI bus and M bus.

exists below  This bit is set to true if and only if a cached copy of that address exists in the subtree below that node. Exists below is not explicitly stored in main memory; it defaults to false (0) in the small caches.

owner  The owner bit is set to true on all copies of an address A located on the unique path from main memory to the cache of the last processor that wrote into A. If owner(A)=1 in a cache, then owner(A) must be 1 in the cache at the next higher level of the memory hierarchy, and at most one copy of A at the next lower level of the memory hierarchy can have owner(A)=1. Owner defaults to 1 in main memory for all addresses.

shared line  The shared line is used to communicate information about shared data. There is one shared line for each level of the memory hierarchy.

shared  For a given data entry, shared is true if more than one copy of the data exists at that level of the memory hierarchy. In other words, shared becomes true as soon as two copies of the address exist in caches which are not on a path between the root of the hierarchical cached memory and a leaf. However, if shared is true, other caches may or may not still contain the data, because the end of sharing is not discovered immediately. The end of sharing is detected in the following manner: a cache has owner and shared set, so it broadcasts its writes to the M bus; no other cache sets the Shared Line true to indicate that it also contains the address being written; the writing cache notes that the Shared Line is false and subsequently sets its own shared bit to false.

quad  A quad is four contiguous 32-bit words aligned in physical address space such that the address of the first word is 0 MOD 4.

4.2.1 Reading and Writing Data

In the following description of snoopy reads and writes, the word master describes the device that has control. An action initiated by a master will typically involve at least one other device on the bus; such a device is called a slave.

Reads

The ReadQuad command reads 4 words of data into the master from memory or another cache. The specific word of data requested by the master is read first; the remaining words of the quad are read in subsequent cycles. The ReadQuad command is issued by a cache when it cannot supply the address specified by its processor. The requested address is placed on the bus. All the other caches snoop; that is, they all match against the address specified. If another cache contains the requested address, it sets the Shared Line to true. Any cache containing the address specified copies the value of the Shared Line into its shared bit. If the owner bit is set in one of the matching caches, that cache supplies the data to the M bus itself (since the memory may not know the current value). Otherwise, the data is provided by the next level of the memory hierarchy.
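The snooping on a ReadQuad can be summarized in code. The following is a minimal C sketch that models one level of the hierarchy as an array of single-line caches and the bus as a loop (Line, NCACHES, and the quad-at-once transfer are illustrative simplifications; real snooping is parallel hardware, and the requested word is returned first).

    #include <stdint.h>
    #include <stdbool.h>

    enum { QUAD_WORDS = 4, NCACHES = 8 };

    typedef struct {
        uint32_t quadAddr;            /* address of word 0 (0 MOD 4) */
        uint32_t data[QUAD_WORDS];
        bool     valid, shared, owner;
    } Line;

    /* The master issues ReadQuad(quadAddr); all other caches snoop.
       Matching caches raise the shared line; an owner supplies the data
       in place of memory; every holder copies the shared line into its
       shared bit. */
    void ReadQuad(Line cache[NCACHES], int master, uint32_t quadAddr,
                  const uint32_t memory[QUAD_WORDS])
    {
        bool sharedLine = false;
        const uint32_t *source = memory;  /* default: next level supplies data */

        for (int i = 0; i < NCACHES; i++) {
            if (i == master || !cache[i].valid || cache[i].quadAddr != quadAddr)
                continue;
            sharedLine = true;            /* "I hold this address too" */
            if (cache[i].owner)
                source = cache[i].data;   /* memory's copy may be stale */
            cache[i].shared = true;       /* snooper records the shared line */
        }

        cache[master].quadAddr = quadAddr;
        cache[master].valid  = true;
        cache[master].owner  = false;     /* reading does not confer ownership */
        cache[master].shared = sharedLine;
        for (int w = 0; w < QUAD_WORDS; w++)
            cache[master].data[w] = source[w];
    }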
Writes - Single Words

The WriteSingle command is used to write a single word to other caches that share the quad containing the word being written. It is never used to write single words to main memory. Each time a processor writes a single word of a quad, if that quad is shared, the WriteSingle 1) updates all other copies of the word at that level of the memory hierarchy and causes their owner bits to be set to false, and 2) writes through, that is, updates the copy of the word in the cache at the next level of the memory hierarchy.

Writes - Quad Words

The WriteQuad command indicates that a cached entry is to be saved in the next higher level of the memory hierarchy because it has been victimized at the current level. Only quads with owner set need to be saved. Unlike the ReadQuad command, which may specify any of the four words of a quad to be read first, the WriteQuad command always specifies the address of the first word of the quad; in other words, the address specified by the write must be 0 MOD 4. When a cache issues the WriteQuad command, all other caches on the bus match against the quad address to determine whether they hold a copy of the quad. All caches with matching quads, except the one supplying data to the bus, load new values from the bus and clear their owner bits.

The above definitions imply a lazy updating scheme. A Write(A,D) into a word which has Owner(A)=1 and Shared(A)=0 is performed only locally in that cache, leaving an out-of-date copy above in memory. Consequently, the cache which has Owner(A)=1 for address A is responsible for providing the answer to a Read(A) on the bus. A Write(A,D) which does not have both Owner(A)=1 and Shared(A)=0 is handled by writing through to the next level of the memory hierarchy. The global effect of this scheme is to keep consistent copies of the latest value of address A in the lowest common subtree containing A. Equivalently, the value of address A in a given memory node may be out of date iff Owner(A)=1, Shared(A)=0, and ExistsBelow(A)=1. (A C sketch of this write policy follows the assertions below.)

Six assertions summarize the semantic intent of Dragon's hierarchical cached memory.

Figure 4.2 The Hierarchical Cached Memory

Notes:

The term sister describes addresses found in more than one cache at the same level of the memory hierarchy, where each has shared set. At most one of two sisters can be owner.

The term shared is used when a copy of a given address exists in more than one cache at the same level of the memory hierarchy. As soon as two copies of the address exist in caches which are not on a path between the root of the hierarchical cached memory and a leaf, the data is shared. In the diagram above, Small Cache B and Big Cache A cannot share data; Small Cache B and Big Cache D can share data.

Assertions:

1. Every address in C is also present in the parent of C.
2. The parent copy of any address has ExistsBelow set.
3. The parent of an address with owner = 1 also has owner = 1.
4. Two sister addresses both have shared set.
5. The child of a shared address is also a shared address.
6. Two copies of an address which are marked shared have identical data values.
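As referenced above, the write policy implied by the lazy updating scheme reduces to a few lines at a single cache. This is a minimal C sketch; WriteSingleOnBus is a hypothetical stand-in for issuing a WriteSingle (which updates sister copies, clears their owner bits, and writes through to the next level) and sampling the shared line, and the Entry type is illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t addr, data;
        bool owner, shared;
    } Entry;

    /* Hypothetical: broadcast a WriteSingle and return the value of
       the shared line observed during the broadcast. */
    extern bool WriteSingleOnBus(uint32_t addr, uint32_t data);

    void Write(Entry *e, uint32_t data)
    {
        if (e->owner && !e->shared) {
            e->data = data;        /* purely local; the copy above goes stale */
            return;
        }
        e->data  = data;
        e->owner = true;           /* the last writer becomes the owner */
        e->shared = WriteSingleOnBus(e->addr, data);
                                   /* a false shared line here ends sharing */
    }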
4.3 Deadlock Avoidance

Deadlock occurs when each of two transactions cannot complete because both are waiting to get control of a bus that is held by the other. This situation arises when two processors each contain a given shared address and each wants to update that address at the same time. In the example below, neither transaction can complete: Processor 1 is trying to obtain the MI bus in order to update address A with data X, while Processor 2 is trying to obtain the M bus in order to update address A in both of Processor 1's caches.

Figure 4.3 The Deadlock Problem

The deadlock problem is solved by dividing small cache transactions into two categories:

global  Transactions for which the bus must always be acquired, for example WriteSingle.

others  Transactions for which it is not always possible to determine in advance whether the bus will be needed.

In the case of global transactions, a processor must acquire the bus before initiating the transaction. For other transactions, if a conflict arises, the transaction must be aborted and restarted.

4.4 Virtual Memory

The Dragon Map Processor is a single logical device that implements the mapping from virtual to real addresses and provides per-page memory protection. Its main features are: fast response to mapping requests (300 ns on the average); the use of a Map Table that consumes a small (< 1%), fixed fraction of real memory; the support of multiple virtual address spaces that are permitted to share memory; and an organization in which both function and performance can be enhanced relatively easily after hardware implementation. In spite of these features, the design is simple.

The following definitions and abbreviations are used in describing the Dragon Map Processor:

page  A page is a contiguous chunk of memory 2^s 32-bit words long, aligned on a 2^s-word boundary; in the current implementation, s is fixed at 10.

va  Va is an abbreviation for virtual address, a 32-bit quantity.

ra  Ra is an abbreviation for real address, a 32-bit quantity.

vp  Vp is an abbreviation for virtual page, a (32-s)-bit quantity.

rp  Rp is an abbreviation for real page, a (32-s)-bit quantity.

offset  Offset is an s-bit quantity.

fg  Fg designates the four flag bits that are kept for each page.

order  Order is a 3-bit field used to communicate faults associated with address mapping.

aid  Aid is an abbreviation for address space id, which is used to denote address spaces.

pn  Pn is an abbreviation for processor number. Processors are identified by their processor number.

If ||x|| denotes the number of bits in x, the following relationships hold among the quantities defined above:

  ||rp|| + ||fg|| + ||order|| = 32
  ||aid|| + ||vp|| = 32

The expressions

  va = vp | offset
  ra = rp | offset

are shorthands indicating that vp and offset are the virtual page and offset corresponding to va, and that rp and offset are the real page and offset corresponding to ra. And the expression

  selector ? ifOne : ifZero

uses the bits of selector to choose corresponding bits from either ifOne or ifZero to form a result, as follows:

  IF selector[bit] = 1
    THEN result[bit] ← ifOne[bit]
    ELSE result[bit] ← ifZero[bit];
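This per-bit choice is simply a bitwise multiplexer; in C it collapses to two masks and an OR. A minimal sketch follows (the function name Select is chosen for illustration); the bypass-mapping sketch after Figure 4.5 reuses it.

    #include <stdint.h>

    /* selector ? ifOne : ifZero -- where selector has a 1, take the bit
       from ifOne; where it has a 0, take the bit from ifZero. */
    uint32_t Select(uint32_t selector, uint32_t ifOne, uint32_t ifZero)
    {
        return (selector & ifOne) | (~selector & ifZero);
    }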
4.4.1 Addressing Architecture

At any time there is exactly one address space associated with each processor; that address space is said to be loaded on that processor. The correspondence between processors and address spaces is kept in an aid (address space identification) table that is common to all processors.

The address space of each processor is divided into two non-overlapping parts: a 2^28-word shared region at the low end of the space, and a (2^32 - 2^28)-word switched region in the remainder. If the address to be mapped lies in the shared part, it always maps in address space 0. If the address to be mapped lies in the switched part, the address space used is the one currently associated with the processor.

Address translation takes place in two steps. The first step determines the address space (identified by its aid) in which to perform the translation. The second step does the actual translation: the address space id and the virtual address are mapped into a real address using the Map Table.

It is possible to bypass the above mapping. Bypassing provides a mechanism for processors to access the Map Table and map-fault handling code without getting map faults in the process; it is also useful in turning off mapping altogether. Bypassing is activated whenever a processor makes a reference to a portion of virtual address space called the map bypass area. This area appears at the same location in all address spaces and is defined by three (32-s)-bit quantities: BypassMask, BypassPattern, and BypassBase. BypassMask and BypassPattern determine the bypass area's location and size in virtual memory as follows: a va is defined to be in the bypass area if the bits of the va under BypassMask match the corresponding bits of BypassPattern. BypassBase determines the starting location of the area in real memory. The real address produced by bypass mapping is ra = rp | offset, where rp = BypassMask ? BypassBase : vp. The mask, pattern, and base can all be modified. (A C sketch of the bypass check and mapping appears after Figure 4.5.)

This architecture is designed to facilitate sharing between address spaces. The shared area provides a restricted form of sharing in which all of the vp's sharing a particular rp must be identical; this sharing is also all-or-nothing in that a page is shared by all spaces if it is shared by any. It is expected that all but a few cases of sharing will be covered by this restricted mechanism. The architecture also permits more general sharing, in which the vp's sharing a particular rp are unrestricted and each page may be shared by any subset of the spaces. (See Figure 4.4 below.)

Figure 4.4 Virtual to Real Address Mapping

4.4.2 Protected Kernel

As the previous section indicated, the address space of each processor is divided into two non-overlapping regions: a 2^28-word shared region at the low end of the space, and a (2^32 - 2^28)-word switched region in the remainder. The shared region at the low end of each space constitutes a protected kernel. Having this protected kernel present in every address space means that it is never necessary to change address spaces in order to access the protected region; this assures that access to the protected kernel is fast.

The protected kernel is implemented by a user/kernel mode bit; protection checks are based on the value of this bit. Certain privileged instructions, for example the Store to Internal Processor Register instruction, as well as instructions used to control transfers into and out of kernel mode, are illegal in user mode. In addition, it is possible to read, but not to write into, the kernel region while in user mode. (See Figure 4.5 below.)

Figure 4.5 Addressing Protection
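Returning to the bypass mechanism of Section 4.4.1: the check and the mapping can be sketched in C using the Select function above. The names InBypassArea and BypassMap are illustrative, and the (32-s)-bit page quantities are kept in the low bits of 32-bit words.

    #include <stdint.h>
    #include <stdbool.h>

    #define OFFSET_BITS 10   /* s = 10 in the current implementation */

    extern uint32_t Select(uint32_t selector, uint32_t ifOne, uint32_t ifZero);
    extern uint32_t BypassMask, BypassPattern, BypassBase;  /* (32-s)-bit */

    /* A va is in the bypass area if its bits under the mask match the
       corresponding bits of the pattern. */
    bool InBypassArea(uint32_t vp)
    {
        return ((vp ^ BypassPattern) & BypassMask) == 0;
    }

    /* ra = rp | offset, where rp = BypassMask ? BypassBase : vp */
    uint32_t BypassMap(uint32_t va)
    {
        uint32_t vp     = va >> OFFSET_BITS;
        uint32_t offset = va & ((1u << OFFSET_BITS) - 1);
        uint32_t rp     = Select(BypassMask, BypassBase, vp);
        return (rp << OFFSET_BITS) | offset;
    }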
4.4.3 System Organization

The Map Processor consists of three components: a custom VLSI Map Cache, a Map Cache Controller, and a Map Table stored in main memory.

The Map Cache is a performance accelerator that sits between the processor's caches (the small caches and the big cache in Figure 4.0) and the Map Table. It contains around 2500 mapping entries and allows mapping requests that hit to be serviced in three cycles, or 200 ns. The processor's caches themselves keep a limited number of entries (~50). Simulation studies show a 0.5% miss rate for the small IFU cache and a 1.0% miss rate for the small EU cache, so in most cases the processor will send a virtual address to its small cache and the cache will return the corresponding data, as illustrated in Figure 4.6.

Figure 4.6 Processor's Caches Hold Requested Data
Note: For simplicity, the processor is shown on the M bus, not the MI bus.

When the processor's caches do not contain the data requested, the big cache sends a mapping request to the Map Cache. A Current Space Table in the Map Cache uses the processor number and the virtual page to determine the address space id; the aid and virtual page then select the real page number and flags among its entries. The Map Cache returns the entry if it is present, and signals a map fault if it is not. (See Figure 4.7 below.)

Figure 4.7 Map Cache Holds the RP for the Data Requested
Note: For simplicity, the processor is shown on the M bus, not the MI bus.

The Map Cache implements a number of other operations, including ones to flush entries, to manipulate the control bits for entries in the processor's caches, to switch address spaces, and to control the location and size of the map bypass area. All communication with the Map Cache occurs over the M bus.

The Map Cache Controller is a fictitious device whose functionality is implemented by ordinary Dragon processors. Whenever the Map Cache does not contain the data requested, the processor whose cache needs the data handles the map fault. This processor plays the role of the Cache Controller: it fetches the missing entry from the Map Table and sends it to the Map Cache, where it is entered. (See Figure 4.8.) In addition to servicing misses, the controller also implements a complete set of operations for manipulating the Map Table. The code for these operations is available to all processors and can be executed by any one of them.

Figure 4.8 RP for Requested Data Found in Map Table

The final component of the Map Processor is the Map Table. It serves as the repository of mapping information for all address spaces, mapping (aid, vp) to rp and fg. For each real page the table maintains four flag bits used for paging and memory protection; in high-to-low order these are: shared, wtEnable, dirty, and spare. The Map Table is structured as a hash table indexed by (aid, vp). Since there may be more than one mapping entry per physical page, it is possible for the Map Table to overflow. The number of entries in the table will be 1.5 times the number of real pages in main memory, making overflow possible but extremely unlikely. The Map Table handles the overflow problem by using the all-or-nothing rule: either all of the vp's for a given rp are in the table or none are.
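A lookup in such a table can be sketched as a probe of an open hash table keyed by (aid, vp). This is a minimal C sketch; the table size, hash function, probing strategy, and field widths are illustrative, and insertion, deletion, and the flag encodings are omitted.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t aid, vp;    /* key: address space id and virtual page */
        uint32_t rp;         /* real page */
        uint8_t  fg;         /* flags: shared, wtEnable, dirty, spare */
        bool     occupied;
    } MapEntry;

    #define TABLE_SIZE 4096  /* in practice ~1.5x the number of real pages */
    static MapEntry table[TABLE_SIZE];

    static uint32_t Hash(uint32_t aid, uint32_t vp)
    {
        return (aid * 0x9E3779B9u ^ vp) % TABLE_SIZE;  /* any good mix works */
    }

    /* Linear probing; the table is kept under-full, so every probe chain
       ends at an unoccupied slot. Under the all-or-nothing rule a miss
       here is definitely a page fault, never a dropped entry. */
    bool MapLookup(uint32_t aid, uint32_t vp, uint32_t *rp, uint8_t *fg)
    {
        for (uint32_t i = Hash(aid, vp); table[i].occupied;
             i = (i + 1) % TABLE_SIZE) {
            if (table[i].aid == aid && table[i].vp == vp) {
                *rp = table[i].rp;
                *fg = table[i].fg;
                return true;
            }
        }
        return false;        /* page fault: handler fetches page from disk */
    }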
With this rule, a Map Table miss definitely indicates a page fault. The page fault handler then locates a free page in main memory, victimizing a page if necessary. (In this context, main memory is treated as a cache for disk storage.) The page fault handler looks in another table, maintained by the operating system, which maps virtual pages into disk addresses. It reads the appropriate page from disk into main memory and adds the new entry to the Map Table.

4.5 Display Memory - High Speed I/O

The Dragon will have a 1K by 1K color display with 24 bits per pixel. The display will be refreshed 60 times per second, thus requiring a bandwidth of about 1400 megabits/sec. This high-speed I/O requirement is met by providing the display with its own display memory and display processor. (See Figure 4.9 below.) Display memory is distinguished from main memory by assigning it, at system design time, a set of real addresses disjoint from those of main memory. Monitor locks, implemented in software, assure that at any given time a given set of words may be accessed by either the display processor or a Dragon processor, but not by both at the same time.

Figure 4.9 Display Memory

4.5.1 Data Inconsistency

Problems of data consistency exist between display memory, main memory, and the caches for two reasons.

First, because the memory system for Dragon processors uses a lazy updating scheme, updates may be performed only locally, leaving out-of-date copies at higher levels of the memory hierarchy. Memory addresses that contain different data at different levels of the memory hierarchy are called dirty. A problem exists when Dragon processors create dirty entries for memory addresses corresponding to display memory. The launder operation solves this problem. Launder writes out a selected set of dirty entries from a cache to memory; the entries written are not deleted from the cache. Thus launder provides a mechanism for re-establishing consistency between the caches and memory.

Second, in addition to the inconsistency caused by the lazy updating scheme, data will also become inconsistent as both the display processor and Dragon processors write into display memory. The launder operation can handle this problem if it is expedient to keep the data referring to display memory cached. A second operation, flush, provides a mechanism for Dragon processors to hand their data over to memory. Flush first launders a selected set of dirty entries and then deletes them from the Dragon processors' caches. If a Dragon processor tries to access this data again, it will get a map fault, and the system must ensure that the monitor lock acquired by the display processor is no longer held before permitting the Dragon processor to access the data.
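The relationship between launder and flush can be made concrete with a short sketch. This C fragment is illustrative only: it models dirtiness with an explicit flag rather than the owner bit of the snooping protocol, and WriteBack and Selected are hypothetical stand-ins for the actual writeback and for the caller's choice of which entries to process.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t addr;
        uint32_t data[4];                  /* one quad */
        bool     valid, dirty;
    } Entry;

    extern void WriteBack(const Entry *e); /* hypothetical writeback to memory */
    extern bool Selected(const Entry *e);  /* hypothetical selection predicate */

    /* Launder: write selected dirty entries to memory, keeping them cached. */
    void Launder(Entry cache[], int n)
    {
        for (int i = 0; i < n; i++)
            if (cache[i].valid && cache[i].dirty && Selected(&cache[i])) {
                WriteBack(&cache[i]);
                cache[i].dirty = false;    /* cache and memory agree again */
            }
    }

    /* Flush: launder, then delete the entries, handing them over to memory. */
    void Flush(Entry cache[], int n)
    {
        Launder(cache, n);
        for (int i = 0; i < n; i++)
            if (cache[i].valid && Selected(&cache[i]))
                cache[i].valid = false;    /* later access misses or map-faults */
    }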