Alpine lock manager concepts, version 0
CSL Notebook Entry
To: Alpine designers		Date: February 24, 1981
From: Mark Brown		Location: PARC/CSL
Subject: Alpine lock manager concepts, version 0		File: [Ivy]<Alpine>Doc>LockConcepts0.bravo
XEROX
Attributes: informal, technical, Database, Distributed computing, Filing
Abstract: This memo proposes the scope of the Alpine lock manager. It is largely an interpretation and restatement of several papers on the IMS and System R lock managers. It does not propose an interface; see LockDesign*.bravo.
Papers on locking
I have a copy of each paper listed here. This memo should make it unnecessary for everyone to read all of these papers, but I have found them useful. I have read some of them much more carefully than others.
J. N. Gray, R. A. Lorie, G. R. Putzolu, I. L. Traiger, "Granularity of Locks and Degrees of Consistency in a Shared Data Base," in G. M. Nijssen (ed) Modelling in Data Base Management Systems, North Holland, 1976. (Describes the System R lock manager.)
J. N. Gray, "Notes on Data Base Operating Systems," IBM San Jose Research Report RJ2188, February 1978, Section 5.7. Also appeared in Operating Systems - an Advanced Course, Springer Lecture Notes in Computer Science 60, 1978. (There is a lot of duplication between this and the preceding paper.)
J. N. Gray, "Experience with the System R lock manager," slides for an oral presentation given in Berkeley, 1980. (Describes the frequency of wait and deadlock, endorses update mode locks.)
H. F. Korth, "Locking protocols: general lock classes and deadlock freedom," draft of Princeton University Ph.D. dissertation, 1981. (A clean treatment of a System R style lock manager. Describes how to derive compatibility properties of locks systematically, and gives a variable-granularity, deadlock-free locking protocol. But this protocol only prevents deadlock due to lock contention; see below.)
R. Obermarck, "IMS/VS program isolation feature," IBM San Jose Research Report RJ2879, July 1980. (Describes the IMS lock manager.)
R. Obermarck, "Global deadlock detection algorithm," IBM San Jose Research Report RJ2845, June 1980. (Describes the System R* distributed deadlock detection algorithm.)
R. Obermarck, "Deadlock detection for all resource classes," IBM San Jose Research Report RJ2955, October 1980. (Points out that deadlocks occur due to contention for resources that aren’t covered by the usual sort of lock: network connections and heap storage are examples. Then gives an algorithm for detecting deadlocks over all these resources.)
A top-down view of locks
Suppose that a program P performs a sequence of operations on a collection of files F. These files satisfy some application-specific consistency constraints. P is written so that if F satisfies the consistency constraints, then running P will bring F to another state that satisfies the consistency constraints.
Now consider that several programs may access the file system simultaneously. Individual operations on files (read page, write page) can be made atomic by using monitors (for instance, a monitor can keep one program from seeing a partially completed page write performed by another program.) Transactions are used to make sequences of operations atomic, so that programs P and Q can execute simultaneously and yet each can preserve its consistency constraints. This is done by ordering the operations of different transactions so that their execution is equivalent to some serial execution of the transactions (P before Q, or P after Q.) When such an ordering is imposed we say that the schedules of the two transactions are serializable. It is not necessary to actually run one transaction before the other: if P and Q both read a page and neither updates it, then both reads may complete before either transaction finishes.
Locking is one technique for serializing the schedules of concurrent transactions. Consider the example of reading and writing a page. Suppose that we impose these rules on transactions that wish to read or write: (1) Before reading a page, obtain a read lock on it; before writing obtain a write lock. (2) At no time may a write lock coexist with any other lock (read or write) on a page, but a transaction may convert its read lock to a write lock without releasing the read lock. (3) No transaction may release any lock before it commits or aborts; after committing or aborting it must release all its locks. It can be proved that these rules allow only serializable schedules.
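The three rules can be made concrete with a small sketch. The following Python model is mine, not an Alpine interface: all names (Transaction, lock, holders) are invented here, and it covers only rules (1) through (3) for a single entity type.

```python
# Minimal model of the read/write locking rules above; illustrative only.

class Transaction:
    def __init__(self, name):
        self.name = name
        self.locks = {}          # page -> "read" or "write"
        self.finished = False    # set by commit or abort

    def _check_compatible(self, page, mode, holders):
        # Rule (2): a write lock may not coexist with any lock held by
        # another transaction; read locks may be shared.
        for t, m in holders.get(page, {}).items():
            if t is not self and (mode == "write" or m == "write"):
                raise RuntimeError(f"{self.name} must wait for {t.name}")

    def lock(self, page, mode, holders):
        # Rule (1): lock before use.  Rule (3): no new locks after finish.
        assert not self.finished, "two-phase rule violated"
        self._check_compatible(page, mode, holders)
        # Converting read -> write keeps the lock; write is stronger.
        cur = self.locks.get(page)
        new = "write" if "write" in (cur, mode) else "read"
        self.locks[page] = new
        holders.setdefault(page, {})[self] = new

    def commit(self, holders):
        # Rule (3): release every lock, all at once, at commit.
        self.finished = True
        for page in self.locks:
            del holders[page][self]
        self.locks.clear()

holders = {}                       # page -> {transaction: mode}
p, q = Transaction("P"), Transaction("Q")
p.lock("page1", "read", holders)
q.lock("page1", "read", holders)   # shared read: both granted
p.commit(holders)
q.lock("page1", "write", holders)  # conversion succeeds once P is gone
```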
While the example just given is simple, it contains most of the features of more complicated locking protocols that are used in practice. Rule (1) is typical; we must lock an entity before using it. Locking protocols that allow locking of the same entity at different granularities (locking a page implicitly by locking the file containing it) have more elaborate versions of rule (1). Rule (2) governs the compatibility of different lock requests for the same entity; the exact compatibility rule depends on the semantics of the operations that are allowed by each type of lock. Rule (3) is called the two phase requirement: a transaction consists of two phases, a growing phase in which locks are acquired and none are released, and a shrinking phase in which locks are released and none are acquired. The two phase requirement is part of all locking protocols that produce serializable schedules.
A bottom-up view of locks
Locks are used to maintain the consistency of shared data under concurrent access. The lock manager (i.e. the module that implements locks) is a low-level scheduler. It decides who must wait, but not who should run among the runnable processes. Other system components provide load control and other processor scheduling functions.
The basic lock call is "lock entity E in mode M for use by transaction T". An example of an entity is a file page; an example of a mode is "read" or "write". We say that the request is granted if the above call returns with the lock set. A program holding a lock in a certain mode has permission to operate on the locked entity in a way specified by the mode. Not all requests are granted right away; the caller may have to wait for another transaction to release a lock. Whether a lock request is granted may be a function of the request’s mode M, the modes of existing locks on entity E (for transaction T and for other transactions), and the modes of requests already made but not yet granted for locks on entity E.
To lock an entity we must name it. From the lock manager’s point of view, the simplest form of name is a fixed-length string of bits. The lock manager is not concerned with the mapping between names and entities, and knows little or nothing about the structure of names. This allows new clients to use the lock manager without changing the lock manager or its interface.
Monitors are used to implement locks. Data structures whose sharing is controlled by locks need not be protected by monitors. (So in terms of the previous top-down discussion, locks may be used both for guaranteeing the atomicity of whole transactions and for providing the atomicity of their individual operations.) The primary differences between locks and monitors are (1) monitors only allow exclusive access to the monitored structure, while locks allow other policies such as multiple readers, (2) a monitor deadlock is a fatal error, while a lock deadlock can be resolved by rolling back one of the transactions involved.
Because locks are not integrated into the language as monitors are, locks require extra care on the part of the programmer. Clients who are programmed in terms of locks must obey a lock protocol: they must lock the entities that they intend to access before accessing them. The lock manager only ensures that if all its clients obey the lock protocol, then they will not interfere with each other in unexpected ways. (The same is really true about monitors, but the language makes it convenient to follow the correct monitor protocol and the compiler catches certain blunders.) The lock manager is also responsible for detecting deadlocks, and for breaking them by asking the recovery manager to roll back a transaction (releasing locks in the process.) (Note that it is the locks themselves that make backup practical by isolating transactions from one another and hence avoiding cascading backups.)
There are many natural forces that push complexity into the lock manager. We add logically redundant lock modes, such as update mode, to describe our application more precisely and allow higher concurrency and fewer deadlocks. We add a variable-granularity lock protocol (which from the lock manager’s point of view reduces to the intention lock modes) to reduce the number of calls on the lock manager, since lock calls can amount to a significant fraction of the time in a system. Fortunately, adding a new lock mode has no impact on the complexity of the lock manager. We may also complicate the lock manager by giving it advice on scheduling decisions, for instance who to abort first in a deadlock situation.
An added complexity in System R is lock escalation. When a transaction acquires a certain number of locks, System R attempts to trade the most frequently-occurring locks (e.g. page locks on a particular file) for a single lock that covers them all. This trades concurrency for space, but it has worked well, and the code to implement escalation is straightforward. It does add some complexity and may widen the interface.
An added complexity in the Juniper system is range locks. In Juniper, all locks are considered to cover a range of entities. This allows locks to be set on ranges of bytes in a file. Range locks require that the lock data structure represent the linear ordering among certain locks; this means that at some point a list or tree representation must be used instead of simple hashing.
Lock modes and conversions
Locking an entity in a specific mode grants the holder of the lock permission to perform certain operations on the entity. By considering the potential interactions among these operations, we can derive the lock compatibility predicate
Compat: PROC [request: LockMode, existing: LockMode] RETURNS [compatible: BOOLEAN].
If Compat[x, y], then a request for a lock of mode x can be granted even when there are existing locks of mode y for other transactions. It is clear that Compat[Read, Write] = FALSE is necessary, and that Compat[Read, Read] = TRUE is ok (we could make it FALSE too, but that would reduce concurrency. In general, compatible actions must commute.)
A conversion is a request by a transaction T to lock an entity E that is already locked by T. If T locks E in mode y, and there are no other locks on E, then clearly we should allow T to lock E in mode x even if ~Compat[x, y]. This explains why we treat conversions as being different from other requests. In System R, conversions are always granted immediately if possible, even if other requests are waiting. (This is a violation of fair scheduling, but favoring a transaction T that already holds a lock is probably a good idea: granting other requests can only make it harder for T to proceed and eventually release the lock it has.)
Conversions are governed by a supremum function
Sup: PROC [request: LockMode, existing: LockMode] RETURNS [LockMode].
To honor a request for mode x when the existing mode is y, the lock manager converts the existing lock to mode Sup[x, y].
Compat and Sup contain all of the mode-specific knowledge in the lock manager.
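For illustration, here are plausible Compat and Sup tables for Read, Update, and Write modes. The values are my reconstruction in the spirit of the System R papers cited above, not tables from any Alpine design; note that with a totally ordered mode set, Sup reduces to a maximum under the strength order.

```python
# Illustrative Compat and Sup tables; not Alpine's actual tables.
# Update mode is compatible with existing Read locks but, unlike Read,
# excludes other Update requests (only one intending writer at a time).

MODES = ["Read", "Update", "Write"]

COMPAT = {  # COMPAT[request][existing]
    "Read":   {"Read": True,  "Update": True,  "Write": False},
    "Update": {"Read": True,  "Update": False, "Write": False},
    "Write":  {"Read": False, "Update": False, "Write": False},
}

# Sup[request, existing]: the weakest mode at least as strong as both,
# under the strength order Read < Update < Write.
STRENGTH = {m: i for i, m in enumerate(MODES)}

def sup(request, existing):
    return max(request, existing, key=STRENGTH.get)

assert sup("Read", "Update") == "Update"
assert not COMPAT["Write"]["Read"]
```

With a richer mode set (the intention modes below), the modes form only a partial order, and Sup must be tabulated rather than computed as a maximum.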
Variable-granularity locking
A program that performs a sequential scan of a file wants to lock the whole file before it starts, thereby saving the space and time required for one lock per page (or one lock per tuple.) A program that does random-access into a file via an index structure wants to lock only the data that it depends on, so that it can run concurrently with other programs doing the same thing. This is the motivation for locking at different granularities.
Variable-granularity locking requires a more elaborate locking rule than "lock an entity before using it," since an entity might be locked implicitly by a lock on a larger entity that contains it. Therefore to lock, say, a page of a file, we first lock the file in an intention mode, then lock the page. The compatibility predicate values for the intention modes ensure that consistency is maintained. As a result, the lock manager "knows" nothing about variable-granularity locking; it just increases the number of lock modes a bit.
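For concreteness, the compatibility matrix for the intention modes of the Gray, Lorie, Putzolu, and Traiger paper cited above (IS and IX are intention share and intention exclusive; SIX is share with intention exclusive) is standard; the Python encoding below is mine.

```python
# Intention-mode compatibility matrix from the "Granularity of Locks"
# paper.  To read page P of file F: lock F in IS, then P in S.
# To write P: lock F in IX, then P in X.
COMPAT = {
    #       existing: IS     IX     S      SIX    X
    "IS":  {"IS": 1, "IX": 1, "S": 1, "SIX": 1, "X": 0},
    "IX":  {"IS": 1, "IX": 1, "S": 0, "SIX": 0, "X": 0},
    "S":   {"IS": 1, "IX": 0, "S": 1, "SIX": 0, "X": 0},
    "SIX": {"IS": 1, "IX": 0, "S": 0, "SIX": 0, "X": 0},
    "X":   {"IS": 0, "IX": 0, "S": 0, "SIX": 0, "X": 0},
}

# A reader (IS on the file, S on one page) and a writer of a different
# page (IX on the file, X on that page) can run concurrently:
assert COMPAT["IX"]["IS"]      # file-level intentions are compatible
# ...but a whole-file scan (S on the file) excludes any writer:
assert not COMPAT["S"]["IX"]
```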
Sets of locks
Clients of the lock manager may wish it to implement multiple sets of locks for a given transaction. IMS, System R, and the DBTG data manipulation language do this. The applications seem to be:
(1) Even when transactions use two phase locking, it is sometimes useful to obtain some locks for a short period and then release them. (For instance, in System R "short locks" are used on B-Tree pages to guarantee the consistency of update; a "long lock" is used to cover the update as a whole.) To avoid double bookkeeping, the lock manager should maintain the locks as a set and release them with a single call.
(2) System R supports a notion of save point for a transaction. This is a checkpoint, i.e. a point to which a transaction can roll back in case of lock conflict. If the lock manager knows about sets of locks, its deadlock detection can pick a set of locks to abort rather than a transaction. If this set corresponds to a save point, the transaction can back up just that far.
In any case, it seems essential to have a lock manager call that releases all locks held by a given transaction. Another call that may be useful is one that enumerates a transaction’s locks and allows the caller to downgrade them (from write to read); this is what happens during a Juniper "checkpoint" (commit while preserving the transaction and holding locks.) So a set structure will be maintained on a per-transaction basis. It may not be much extra effort to allow multiple sets per transaction.
Scheduling competing requests
The System R lock manager uses the following scheduling strategy. If there are no waiters from other transactions for a new request (lock not already held in any mode by the transaction) and it can be granted immediately, then it is granted; otherwise the request is placed at the end of the queue of waiters for this lock. If the request is a conversion (the lock is already held in some mode by the transaction) and it can be granted immediately, then it is granted (regardless of whether anyone is waiting); otherwise, the request is placed after any waiting conversions in the queue, but before any waiting non-conversions. (Note that a waiter can change from being a non-conversion to being a conversion while it is waiting, if another request from the same transaction is granted. In general, waiting requests from a single transaction are kept consecutive in the queue, and are ordered not by insertion time but by strength (the partial order defined by the Sup function), with the weakest mode first).
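The queue discipline just described might be sketched as follows. This Python fragment is mine, not System R code, and it omits the finer point about ordering a single transaction's waiting requests by strength.

```python
# Sketch of the System R waiter-queue placement rule; illustrative only.
# Each queue element is (transaction, mode, is_conversion).

def enqueue(queue, txn, mode, holders):
    """Place a request that cannot be granted immediately."""
    if txn in holders:
        # Conversion: goes after any waiting conversions, but ahead
        # of all waiting non-conversions.
        i = 0
        while i < len(queue) and queue[i][2]:
            i += 1
        queue.insert(i, (txn, mode, True))
    else:
        # New request: goes at the end of the queue.
        queue.append((txn, mode, False))

queue = []
holders = {"T1"}                 # T1 already holds a lock on the entity
enqueue(queue, "T2", "write", holders)  # new request waits at the end
enqueue(queue, "T1", "write", holders)  # conversion jumps ahead of T2
assert [t for t, _, _ in queue] == ["T1", "T2"]
```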
Deadlock detection
The System R lock manager performs periodic deadlock detection by constructing the waiting-for graph and looking for cycles. (T1 is waiting for T2 if T1 is waiting for a lock on E in mode x, T2 holds a lock on E in mode y, and ~Compat[x, y].) It constructs the graph only if there are two or more waiters in the system. The graph is constructed from scratch each time, rather than being updated incrementally, since we expect deadlock to be rare even when compared with waiting (which is rare compared with locking.) Once a cycle is found, a victim is selected and aborted; in the absence of other information, it makes sense to select the cheapest victim. System R estimates the "cost" of a transaction to be the number of bytes of log it has written.
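A minimal sketch of such a detector (in Python, with names of my own choosing): build the waiting-for graph, search it for a cycle, and pick the cheapest transaction on the cycle as the victim.

```python
# Periodic deadlock detection by cycle search; illustrative only.

def find_cycle(waits_for):
    """waits_for[t] = set of transactions t is waiting for.
    Returns a list of transactions forming a cycle, or None."""
    for start in waits_for:
        stack = [(start, iter(waits_for.get(start, ())))]
        on_path = [start]
        while stack:
            node, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                stack.pop()
                on_path.pop()
            elif nxt in on_path:
                return on_path[on_path.index(nxt):]   # found a cycle
            else:
                stack.append((nxt, iter(waits_for.get(nxt, ()))))
                on_path.append(nxt)
    return None

cycle = find_cycle({"T1": {"T2"}, "T2": {"T1"}, "T3": {"T1"}})
# Victim selection: cheapest transaction, cost estimated as log bytes
# written (the figures here are invented for the example).
log_bytes = {"T1": 5000, "T2": 120, "T3": 900}
victim = min(cycle, key=log_bytes.get)
assert victim == "T2"
```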
The paper by Obermarck discusses the elaborations on this that are used by System R* in a distributed, slow-communications environment. We may want to do something different in a distributed, fast RPC environment.
Timeouts must be used to abort long waiters, since the lock manager cannot detect all deadlocks. (A special problem in our environment is the detection of abandoned transactions, so they do not impede useful transactions for too long. This is not a simple problem in the Mesa world since it is hard to tell a crashed machine from a machine in the debugger. In Cedar the problem may be simpler. In any case, this is not a problem of the lock manager.)
Instrumentation
It seems quite useful to build an inexpensive statistics-gathering facility into the lock manager, and leave it on all the time. Knowing the frequency of wait and deadlock will be important in understanding overall system performance.
It would be nice to be able to communicate information on the time spent in lock wait back to individual clients (transactions) as well. Since the accuracy of this information does not need to be great, it should be possible to gather it cheaply.
Statistics on the System R lock manager
Time performance. In System R a lock/unlock pair takes 300 (IBM 370) instructions in the no-wait case, 5000 instructions in the wait case, 310 on the average (hence System R waits once per 500 calls to Lock). The CPU overhead for locking is typically 10% for transactions that do random access via B-Trees (fine-grained locks), and less than 1% for transactions that do mainly sequential processing (page or file locks).
Space performance. A System R lock occupies 40 bytes.
Program size. The System R lock manager is 3000 lines of PL/S, including the deadlock analyzer. System R was not conceived as a distributed system (though transactions involving multiple System R sites may be executed), and it does not detect multi-system deadlocks.