*start* 00432 00024 US Date: 4 Jan. 1982 5:52 pm PST (Monday) From: Kolling.PA Subject: recovery To: mbrown, taft cc: kolling If we can lose the data on a pack permanently, then recovery has to operate on a per volume basis; otherwise losing a pack permanently will also permanently block access to all the other volumes (just on that server? or in the world?) that were involved with transactions touching that pack..... Karen *start* 02253 00024 US Date: 6-Jan-82 16:04:13 PST (Wednesday) From: MBrown.PA Subject: Recovery in the face of volume offline To: Kolling, Taft cc: MBrown I've been giving this some thought, and I feel that the approach I suggested in our meeting is basically ok (taking multiple passes over the log until each volume has been online at least once during a successful recovery.) From the log's point of view, the REAL goal of recovery is to allow log space to be re-used. So we come to the question, how does the log tell that all the volumes have been recovered, if no single recovery finds all volumes online? Recall that our current approach is to write a "completed transaction T" record to the log at some time after all updates of T have made it to disk (or conversely, all premature updates of an aborted transaction T have been undone.) In order to fully automate the recovery process, this record would have to be expanded to include a list of the volumes for which T is complete (or not complete.) (In an extreme design, we write such a record for each disk write under a transaction; breaking it down by volume is a step in this direction.) I am not convinced that even this level of additional complexity is warranted. Instead, the system can be provided, at restart, with a list of volumes that have already been recovered. This list might even be typed in by the operator. Any recovery action that references a volume in this list is considered to succeed without attempting to execute the action. When the system fails in a recovery action due to a volume being offline, it adds that volume to an internal list and proceeds; it types the name of the volume at the end of recovery, but does not write any "completed transaction T" records as a result of recovery. An additional complication derives from the possibility that a transaction in the "ready" state has made updates to both online and offline volumes. In this case, I see no alternative but for the operator to manually force the transaction outcome. In summary: having too few drives is a bad deal. For a real server, we should always have a drive for doing backup; perhaps that drive may be appropriated for the time required to make a clean recovery. --mark *start* 00542 00024 US Date: 6 Jan. 1982 4:57 pm PST (Wednesday) From: kolling.PA Subject: Re: Recovery in the face of volume offline In-reply-to: MBrown's message of 6-Jan-82 16:04:13 PST (Wednesday) To: MBrown cc: Kolling, Taft Are we guaranteed (essentially) that we will never permanently lose a pack? (Are we writing the log in two places?) Having the operator type in the list of volumes that have already been recovered makes me extremely nervous. One slip of the hand and the whole server's data is potentially invalid. Karen *start* 01062 00024 US Date: 6 Jan. 1982 5:39 pm PST (Wednesday) From: kolling.PA Subject: Re: Recovery in the face of volume offline In-reply-to: kolling's message of 6 Jan. 1982 4:57 pm PST (Wednesday) To: kolling cc: MBrown, Taft I'm a little confused here. Does the following work: The system starts recovery.
Whenever it can't access a volume, it adds that volume to its missing volume list (kept in volatile storage). When it reaches the end of the log, if there are any volumes in the missing volume list it prompts the operator to mount as many of those volumes as s/he can, and then sweeps the log again, only writing to the missing volumes. If any are still missing at the end of the log, it prompts and cycles again, etc. until recovery really completes. Then the operator never has to specify anything and can't mess up by mounting the wrong volume, since presumably the volumeID is obtainable from the volume by the software. (The completed transaction T records can get written if there are no entries in the missing volume list.) Karen *start* 02276 00024 US Date: 7-Jan-82 10:07:58 PST (Thursday) From: MBrown.PA Subject: Re: Recovery in the face of volume offline In-reply-to: kolling's message of 6 Jan. 1982 5:39 pm PST (Wednesday) To: kolling cc: MBrown, Taft You are not confused; something in the spirit of this proposal is certainly possible. Whatever we do, I think it will pay to minimize the distinction between the first pass of recovery (in your proposal, the first pass tries to access all volumes that are mentioned anywhere in the log) and later passes (in your proposal, later passes try to access only volumes that are part of an exception list), since this distinction may be difficult to isolate in the recovery code (unless we add a distinguished field to each log record that tells what volume it references, so that the recovery manager can decode this much of every log record just as it can presently decode the type and transaction fields.) My prejudice is also against a design that assumes that the server stays up continuously during the multiple retries, since implementing this sounds like added complexity and I think the case we are discussing is not very common. That is why I proposed a solution in which when recovery fails, it really fails (the server types a message, perhaps into a file, and dies), and when it finally succeeds the server just comes up and stays up. Questions at this level are probably best deferred until nearer to implementation. If we can agree now NOT to log individual transaction-volume completions as they occur in recovery, then I am happy for now. I think that would be an unjustified additional complexity. I think the discussion of media recovery ("do we ever lose a pack?") is separate from this issue ("do we ever have too few drives?"). The short answer to the media recovery question has to be: if the log is intact, we use it (along with the backup system, whatever it may be); otherwise we go to the backup system alone, and lose the guarantee of transaction-consistency. This is true whether the log is doubly-recorded or not: doubly-recording only makes it less likely that backup needs to be consulted for updates that were protected by the log. It is time to do some more design of the backup system, and I am working on it. --mark *start* 00352 00024 US Date: 8 Jan. 1982 12:49 pm PST (Friday) From: Kolling.PA Subject: pilot question To: levin cc: kolling The manual says "Any attempt to Kill a uniform swap unit is ignored." This seems to imply that there is no way to implement UsePages for a space that is divided into uniform swap units. Is that right? If so, why? Karen *start* 00571 00024 US Date: 8 Jan. 1982 1:32 pm PST (Friday) From: Levin.PA Subject: Re: pilot question In-reply-to: Your message of 8 Jan.
1982 12:49 pm PST (Friday) To: Kolling The implementation of uniform swap units is a kludge: there is NO information stored on a per-swap-unit basis. Thus, Pilot can't remember that one swap unit of a space is dead while another is alive (dead = disk contents uninteresting). I don't know what "UsePages" refers to, so I can't tell you whether it can be implemented for a space that is divided into uniform swap units. Roy *start* 00524 00024 US Date: 11 Jan. 1982 1:42 pm PST (Monday) From: Kolling.PA Subject: one page stuff in FPM To: mbrown, taft cc: kolling About the pinning problem regarding the log file (unpinning a space not guaranteeing that the pages in that space are written out in order): since we are toying with the idea of special one page vm spaces for the leader pages of files, maybe we could use these spaces for log writes? I.e., something like ReadOnePage/UseOnePage which would guarantee a one page vm space? Karen *start* 03047 00024 US Date: 11 Jan. 1982 6:25 pm PST (Monday) From: MBrown.PA Subject: Pilot mapped files To: Levin cc: Kolling, Taft, MBrown Roy, We are now trying to build Alpine's internal low level file system, FilePageMgr, using Pilot. We are more concerned than most Pilot clients, I suspect, with the precise semantics of mapped files: under what conditions can Pilot decide to write a page? Does Pilot ever write out a swap unit that is not dirty? If a space is divided into uniform swap units and only some of them are dirty, are only the dirty ones written? Consider the following approach to managing the log, which is designed to place as few demands on Pilot as possible. Each log page contains a bit that is reserved to say whether the page is "valid" or not. The first store into a page (really, into a page of VM that is mapped to the log page) marks the page invalid. Subsequent stores into the page write log records, without altering the state of the "valid" bit. After all log records have been written, the page is marked valid. Then the log manager proceeds in the same manner on subsequent pages. When the time comes to do a synchronous log write, the log manager performs a Space.ForceOut on each space whose written/unwritten status is unknown. The pages go out in no particular order, but a sequence number on each page (incremented each time the log wraps around) ensures that this does not matter. One attraction of this design is that it avoids our need to pin things in real memory. The only potential problem that I can see in this scheme comes if Pilot decides for some reason to rewrite an already-written page, and the redundant write fails in the middle. (This is a rather low probability event in any case, but we'd like to avoid it if we can.) I can see no reason for Pilot to do this, since the page will never be dirty after it is first written in a valid state. But uniform swap units have always been something of a mystery to me, so I'd like to check on this. If necessary we can just avoid uniform swap units: we aren't creating/destroying spaces on the fly, mainly just changing their mappings, forcing them out, etc. A slightly related question concerns the note in the Pilot Programmer's manual, page 130, on Space.Deactivate. The note seems to imply that Deactivate triggers disk write activity if the space is presently in real memory and dirty, but does not cause the real memory to be freed.
This is just what you want for doing a forced log write that spans multiple swap units: first Deactivate each one (generate the disk commands), then ForceOut each one (wait for the commands to finish.) (This behavior of Deactivate may not be what the rest of Alpine wants, however.) Another question concerns the implementation of dual logging. I wonder whether it is better to use mapped I/O and two sets of buffers, or only one set of buffers with CopyOut. The main problem I see with CopyOut is that there is no asynchronous version, analogous to Space.Deactivate. --mark *start* 01063 00024 US Date: 12 Jan. 1982 9:10 am PST (Tuesday) From: Levin.PA Subject: Re: Pilot mapped files In-reply-to: MBrown's message of 11 Jan. 1982 6:25 pm PST (Monday) To: MBrown cc: Kolling, Taft
1) I believe that Pilot will never write a clean swap unit to disk. A swap unit is clean if every page in it is clean, otherwise it is dirty. Uniform swap units are the same as non-uniform ones; if any page is dirty, the entire swap unit is considered to be dirty and consequently the whole thing is written out when the time comes.
2) The note on page 130 of the Pilot Programmer's manual says "Deactivate causes the space to be asynchronously swapped out after rewriting any dirty pages." Thus the real memory IS freed after the write completes. However, the cost of a ForceOut on a space that is not in real memory (e.g., was just deactivated) is minimal.
3) The difference between CopyOut and ForceOut (i.e., read/write vs. mapped I/O for logging) is minimal. As for asynchrony, you can always FORK CopyOut. Am I missing something? Roy *start* 01295 00024 US Date: 12 Jan. 1982 9:38 am PST (Tuesday) From: MBrown.PA Subject: Re: Pilot mapped files In-reply-to: Levin's message of 12 Jan. 1982 9:10 am PST (Tuesday) To: Levin cc: MBrown, Kolling, Taft
1) Good. I also seem to recall that Pilot won't write out a pinned swap unit, even if it is dirty and ForcedOut.
2) It sounds as though we would have to FORK parallel ForceOut calls to get the behavior that I wanted from Deactivate. The point is that once the log has been written we typically turn right around and read it (carry out intentions); hence we don't want the space to be swapped out, just cleaned up.
3) It is true that for writing the log CopyOut would work fine, since the log is written in a very systematic way. This would certainly eliminate all of the paranoia about Pilot's possibly doing the wrong thing with the log. But we are trying to implement both the log and normal files through the FilePageMgr interface. Using Pilot's mapped interface has the advantage that the dirty bits of the map can be used to identify the swap units of a space that have been dirtied, rather than having the client (who knows, but can make mistakes) supply this information. Maybe we'll have to bite the bullet and add a separate set of calls for the log. --mark *start* 00543 00024 US Date: 12 Jan. 1982 10:22 am PST (Tuesday) From: Levin.PA Subject: Re: Pilot mapped files In-reply-to: MBrown's message of 12 Jan. 1982 9:38 am PST (Tuesday) To: MBrown cc: Kolling, Taft
1) Yes, a ForceOut of a pinned space is a no-op.
2) As you might expect, there is an internal interface that cleans up a space without releasing the memory behind it. If it matters enough to you, I could easily add a Space.Clean operation. However, this sounds like a performance optimization that can wait indefinitely. Roy *start* 00709 00024 US Date: 12 Jan.
1982 12:24 pm PST (Tuesday) From: kolling.PA Subject: more Pilot questions To: levin cc: kolling Assuming an n-page parent space divided entirely into one-page uniform swap units:
1. Does Activate on the parent space gen one disk command or n?
2. Assume the middle m pages are dirty. Is there any dance I can do to cause a write of the contiguous dirty pages in one disk command, without writing non-dirty pages? I.e., we need to avoid writing non-dirty pages to avoid touching unlogged pages, but sequential writes will be very slow if it takes m revolutions to write m pages (or is Pilot fast enough so that it wouldn't need a rev between each command?) Karen *start* 00208 00024 US Date: 12 Jan. 1982 12:41 pm PST (Tuesday) From: Levin.PA Subject: Re: more Pilot questions In-reply-to: Your message of 12 Jan. 1982 12:24 pm PST (Tuesday) To: kolling 1) n. 2) No. *start* 00497 00024 US Date: 12-Jan-82 12:56:23 PST (Tuesday) From: Kolling.PA Subject: Activate To: MBrown cc: Kolling Roy says Activate doesn't do one command either, but that it's not worth sweating this stuff because in Klamath it will all be different (work right) anyway. Also, whether it manages to get contiguous pages without a rev in between is a function of the machine. I'm going to fiddle with getting some timings a bit anyway, partly to learn more about running in Pilot. Karen *start* 01653 00024 US Date: 21 Jan. 1982 6:49 pm PST (Thursday) From: Kolling.PA Subject: I don't understand why the numbers are coming out like this. To: mbrown, taft cc: kolling Here's my basic loop:
FOR index: CARD IN [0..1001) DO
  random: CARD ← RandomCard.Random[]/fudge;
  IF dirtyPage0 THEN pntr↑ ← 0;
  IF dirtyPage1 THEN pntr1↑ ← 0;
  IF dirtyPage2 THEN pntr2↑ ← 0;
  IF dirtyPage3 THEN pntr3↑ ← 0;
  IF dirtyPage4 THEN pntr4↑ ← 0;
  IF dirtyPage5 THEN pntr5↑ ← 0;
  UNTIL random = 0 DO random ← random - 1; ENDLOOP;
  IF index # skipIndex THEN PerfStats.Start[timer];
  Space.ForceOut[space];
  IF index # skipIndex THEN PerfStats.Stop[timer];
ENDLOOP;
NOTES: 0..1001 and skipIndex are to control skipping the first time thru the loop. fudge = LAST[CARD]/3000 gives an average random countdown of 24 ms. (a rev = about 20 ms). I know from measurements that the time to do RandomCard.Random[]/fudge and dirtying the pages is epsilon. All the measurements discussed here are for a 2 page space with the first page dirtied each time thru the loop, no uniform swap units underneath.
RESULTS: With the random stuff out of the loop, aver = 24, max = 143, min = 16. With the random stuff in and fudge = LAST[CARD]/1500 (n.b., half a rev), aver = 29, max = 144, min = 16.
QUESTIONS: What kind of synchronization can explain the average time with no randomization SLIGHTLY GREATER than a rev? (Remember that when no page dirtying was done, the aver time was 3 or 4 ms.) Why does the average time go up when I put in an average delay of half a rev?? Why do I have the hard minimum of 16ms (instead of 4 ms) over several runs? *start* 01232 00024 US Date: 22 Jan. 1982 10:31 am PST (Friday) From: MBrown.PA Subject: Re: I don't understand why the numbers are coming out like this. In-reply-to: Kolling's message of 21 Jan. 1982 6:49 pm PST (Thursday) To: Kolling cc: mbrown, taft The random delay should be uniformly distributed with a range equal to some multiple of the disk revolution time. By choosing a large multiple (100 revolutions, say) you can make the experiment less sensitive to how well your busy-wait loop matches the actual rotation time.
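(To make the measurement discipline being suggested here concrete, a rough Python sketch of the same loop structure follows: a delay drawn uniformly over many revolutions, a timed flush, and the first iteration discarded. The 20 ms revolution time, the force_out stand-in, and the iteration count are placeholders, not part of the original Pilot experiment.)

```python
import random
import time

REV_TIME = 0.020               # assumed disk revolution time: 20 ms, from the discussion above
DELAY_SPAN = 100 * REV_TIME    # spread the random delay over ~100 revolutions, as suggested

def force_out():
    """Placeholder for the Space.ForceOut being timed; here it just sleeps one revolution."""
    time.sleep(REV_TIME)

def measure(iterations=20):
    """Time force_out() repeatedly, desynchronizing each trial from the disk rotation."""
    samples = []
    for i in range(iterations + 1):
        # Uniform delay over many revolutions, so the flush starts at a random rotational phase.
        time.sleep(random.uniform(0.0, DELAY_SPAN))
        start = time.perf_counter()
        force_out()
        samples.append(time.perf_counter() - start)
    samples = samples[1:]      # skip the first iteration, as the Mesa loop does with skipIndex
    return min(samples), sum(samples) / len(samples), max(samples)

if __name__ == "__main__":
    print("min/avg/max seconds:", measure())
```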
You may want to use the random long integer generator to get these long delays. In your test, the delay was uniformly distributed between 0 and 24 ms, but the disk rotates in 20 ms. So 1/6 of the time the random delay is just slightly greater than one revolution; that would tend to bias the results toward higher values. Unfortunately, this theory does not explain the large minimum value that you observe. I think we'll have to call in Maxwell to learn how to use some of the performance tools; it is possible to get a complete trace of I/O activity with something called "Ben", I think. I think you should also try putting in larger countdowns, and running the same test on Dorado. --mark *start* 01319 00024 US Date: 22 Jan. 1982 12:20 pm PST (Friday) From: kolling.PA Subject: Re: I don't understand why the numbers are coming out like this. In-reply-to: MBrown's message of 22 Jan. 1982 10:31 am PST (Friday) To: MBrown cc: Kolling, taft I now believe that over a large range of delay times the average should be slightly greater than one revolution time, because: After initial stabilization occurs, if there is no delay, the computation will start "right after" the write completes and it will finish (remember it is "small") in time to get that page on that revolution, so the total time = exactly, more or less, one rev. Start adding a delay after the write completion that is a small time compared to a rev, and the total measured time will decrease as the delay increases, until the delay gets large enough to make the computation start late enough to just miss being able to write the page in that rev. So as the delay increases, the measured time for the forceout will vary from the worst case (comp time + a little less than 1/28 rev + one rev) to the best case (comp time + 1/28 rev), for an average of comp time + 1/28 rev + 1/2 rev, approx. This still doesn't explain the large minimum value or the isolated case where the measured time is very long, both of which I am still looking at. *start* 00380 00024 US Date: 22 Jan. 1982 12:36 pm PST (Friday) From: kolling.PA Subject: Re: I don't understand why the numbers are coming out like this. In-reply-to: kolling's message of 22 Jan. 1982 12:20 pm PST (Friday) To: kolling cc: MBrown, taft oops, make that "over a large range of delay times the average should be slightly greater than one HALF a revolution time". *start* 00947 00024 US Date: 25 Jan. 1982 4:58 pm PST (Monday) From: kolling.PA Subject: How fast can Pilot talk? To: fiala, lauer cc: mbrown, kolling, taft Anyone wish to claim enough knowledge of Pilot to answer the following (I've already talked to Roy and Ed Taft): I have a file which is contiguous on the Dorado disk, a space 4 pages long, and a loop as follows:
dirty all pages: 0, 1, 2, 3
ForceOut
and I measure the average, max, and min times (over 1000 iterations) it takes to do the ForceOut.
If there are one-space uniform swap units under the space the times are: average: 62 ms, max 96 ms, min 54 ms
If there are NO one-space uniform swap units under the space the times are: average: 13 ms, max 22 ms, min 5 ms
There is a random delay in the loop to avoid synchronization. Question: Even though the uniform swap units cause 4 disk commands to be generated, should it really take 4 revs to get these four pages? Karen *start* 00568 00024 US Date: 26 Jan. 1982 9:19 am PST (Tuesday) From: MBrown.PA Subject: Re: How fast can Pilot talk? In-reply-to: kolling's message of 25 Jan.
1982 4:58 pm PST (Monday) To: kolling cc: mbrown I have been reading the implementation of ForceOut, and the top level of it looks reasonable (it initiates all the I/O, then waits for it all to complete.) I think that by stopping the world at crucial instants and poking around (with CoPilot) we should be able to learn more about this phenomenon. Let's run your test on my machine and do that. --mark *start* 01017 00024 US Date: 27 Jan. 1982 8:30 am PST (Wednesday) From: Ladner.PA Subject: Re: How fast can Pilot talk? In-reply-to: Kolling's message of 25 Jan. 1982 4:58 pm PST (Monday) To: Kolling cc: MBrown, Taft, SDD-Pilot↑ I don't know that much about the Dorado and the speeds of its components, and I'm not sure I understand the question, but see if this helps ... On a Dolphin or Dandelion, if 4 disk commands are issued for consecutive sectors on the disk, 4 revs would be required to do the I/O. The time to process a disk command exceeds the inter-sector gap time by orders of magnitude. Pilot provides no disk command chaining for these two machines, and neither does the microcode. (Does this answer your question?) Uniform swap units seem to add an onerous amount of overhead more or less errr... uhhhh... uniformly across all space operations. To understand why that is, and in particular, why 4 disk commands were issued, I will have to let the resident expert on uniform swap units answer. *start* 01797 00024 US Date: 27 Jan. 1982 9:00 am PST (Wednesday) From: MBrown.PA Subject: Re: How fast can Pilot talk? In-reply-to: Ladner's message of 27 Jan. 1982 8:30 am PST (Wednesday) To: Ladner cc: Kolling, Taft, Levin, SDD-Pilot↑, MBrown We understand this phenomenon now, through a combination of inspecting counts kept in the disk channel, typing control-swat at an opportune moment, and reading code. In response to your suggestions: the Dorado chains disk commands in microcode, and is capable of transferring consecutive sectors with separate commands. This is demonstrated in the Dorado world-swap code, which runs the disk at full speed. We expected Pilot to generate the separate disk commands in the case we were testing; we tested the case because we were curious about the performance of this, compared to swapping the entire space as a unit. Now, the answer: the implementation of ForceOut is synchronous on a per-swap-unit basis when uniform swap units are involved. Forcing out a space that has been divided into uniform swap units causes the uniform swap units to be enumerated, and for each dirty swap unit a write is initiated. Unfortunately, this write must complete before the next write on a swap unit of that space can be initiated. The reason is that the entire enclosing space (really swap unit) is checked-out (in CachedRegion terms) when initiating the write, and not checked-in until the write completes. This holds up the next command that is trying to do its check-out. In the course of thinking about this problem, we also noticed that storage for disk commands is statically allocated; in our Pilot, storage for two disk commands is allocated. So we would have missed at least one revolution anyway, waiting to re-use the disk commands. --mark *start* 01060 00024 US Date: 27-Jan-82 9:15:13 PST (Wednesday) From: Knutsen.PA Subject: Re: How fast can Pilot talk? In-reply-to: Ladner's message of 27 Jan. 1982 8:30 am PST (Wednesday) To: Ladner cc: Kolling, MBrown, Taft, SDD-Pilot↑ Reply-To: Knutsen.PA To continue Ladner's response... Why were 4 disk commands issued?
Besides the fact that the microcode does not typically support command chaining, there is: "Spaces are the principal units with which virtual memory is swapped" -- Pilot Programmer's Manual. If you want it to swap as one unit, don't subdivide it into smaller swap units. Why is there "an onerous amount of [compute] overhead for uniform swap units"? In the current version of Pilot, uniform swap units are a wart on the swapper. They were added as an "efficiency improvement" long after Pilot was designed and do not live comfortably within the data structures that Pilot uses to manage swapping. This will be fixed in the Pilot redesign -- in fact, uniform swap units will then be the most efficient kind of swap unit. Dale *start* 00840 00024 US Date: 27-Jan-82 16:08:21 PST (Wednesday) From: kolling.PA Subject: this is what I needed to know: To: mbrown cc: kolling
1. Will any Alpine files (including the log file) be leader-page-less? (If so, FPM has to think before it rejects page 0 in PageRun requests to the "normal" ReadAhead/Read/UsePages procedures.)
2. What is it that you are proposing for log files? If they use FPM to get a VMPageRun, and they intend to do the CopyOut themselves before calling FPM.ReleaseVMPageRun, they have to know how to convert the VMPageRun to a space. Or are they to call FPM.CopyOut? Or do all their operations themselves without using FPM at all? (I think I prefer calls to FPM.CopyOut[vm: VMPageRun, fileID: FileID, pageNumber: PageNumber] so if something goes wrong, we can trap all the I/O in one place.) Karen *start* 00422 00024 US Date: 27 Jan. 1982 4:34 pm PST (Wednesday) From: Kolling.PA Subject: By the way To: mbrown cc: kolling What is it about "the swap unit decision" that "impacts the log implementation"? Are you worried about it wiping out a page after the last page you've dirtied, so there would be a problem in recovery? How were you planning on handling the mirroring anyway, by two sequences of calls to FPM? *start* 01469 00024 US Date: 26 Jan. 1982 3:48 pm PST (Tuesday) From: Taft.PA Subject: FilePageMgr, version 4 To: Kolling cc: MBrown, Taft Basically it looks all right to me. However, I think certain details should be specified before implementation begins -- in particular, signals and errors. The exceptions signalled by each operation should be specified in the style of the Alpine public interfaces. For example, in implementing the FileStore operations, I need to know what happens if I read a nonexistent file page. Will I get notified in a clean way when this happens? Or do I have to first obtain the file's size and check each incoming request in order to prevent an uncatchable AddressFault from happening? I assume that types that aren't defined in FilePageMgr, such as PageRun, are obtained from AlpineEnvironment. I'm slightly nervous about having main-line operations such as ReadPages having to allocate collectable storage on every call (in order to return a LIST OF VMPageSet). Perhaps my nervousness is unwarranted; I'd welcome being reassured on this point. Why is ShareVMPageRun useful? Delete, DeleteImmutable, and SetSize raise an "error if any of the file is currently mapped". I assume that by "mapped" you mean referred to by one or more VMPageRuns with nonzero share counts. The FilePageMgr should take care of flushing VMPageRuns whose share counts have gone to zero, synchronizing with any deferred writing activity, etc.
Ed *start* 01177 00024 US Date: 26-Jan-82 17:20:43 PST (Tuesday) From: kolling.PA Subject: Re: FilePageMgr, version 4 In-reply-to: Taft's message of 26 Jan. 1982 3:48 pm PST (Tuesday) To: Taft cc: Kolling, MBrown
1. Yes, signals and errors will be in, but I probably won't know what all of them are until the implementation is underway. Neither do the "Alpine public interfaces" at this point; if they did, I could have finished the error handling in AccessControl already, sigh.
2. Yes to AlpineEnvironment.
3. Likely I should keep a pool of storage for the VMPageRuns; it will be on my list.
4. I don't know about ShareVMPageRun, that's something Mark put in at one point.
5. "Delete, DeleteImmutable, and SetSize raise an error if any of the file is currently mapped. I assume that by 'mapped' you mean referred to by one or more VMPageRuns with nonzero share counts." Yes.
6. "The FilePageMgr should take care of flushing VMPageRuns whose share counts have gone to zero, synchronizing with any deferred writing activity, etc." Yes, this is what happens; I thought the stuff at the end of the memo and at ReleaseVMPageRun implied this, but maybe not. Karen *start* 02478 00024 US Date: 28 Jan. 1982 10:25 am PST (Thursday) From: MBrown.PA Subject: Re: FilePageMgr, version 4 In-reply-to: kolling's message of 26-Jan-82 17:20:43 PST (Tuesday) To: kolling cc: Taft, MBrown I think it is ok for ReadPages et al to perform allocations. The units being allocated are fixed-length, meaning that the implementation of FilePageMgr can create its own quantum zone to make allocations more efficient for objects of this type. I think it is worth deferring any other optimizations in this area in the interest of keeping the interface clean. I am not sure I have a specific need for ShareVMPageRun; I think one will arise eventually. It is a trivial procedure to implement (PageSets already need to have share counts because two calls on ReadPages can reference the same PageSet.) It is ok by me to eliminate this procedure now, and add it back in if and when the need arises. For my part, I don't see the immediate need for ForceOutFile, but am willing to take it on faith that we'll find an application later. Pilot will never implement ReplicateImmutable. Pilot has changed its entire philosophy of file naming; file IDs are only guaranteed unique when qualified by a volume ID. This means that some higher-level directory structure is needed to keep track of the fact that two immutable files have the same contents; a unique ID comparison says nothing. All of this implies that (1) there is no need for ReplicateImmutable in FilePageMgr, and (2) all FileIDs in FilePageMgr should become [volume ID, file ID] pairs. (Since this is getting rather bulky to pass around and re-represent everywhere, we may wish to have a centralized data structure, shared by the OpenFileMap and the FilePageMgr, that turns long file names into "atoms".) Pilot's handling of multiple logical volumes right now is very poor (you can't point it to a specific volume when you call Map), but Trinity will give a moderate improvement in this, and Klamath should do it right. We should think a bit harder about how page I/Os and file size changes will be synchronized. If transaction A updates pages [10 .. 20) and commits, then transaction B sets the file size to 15, how does SetSize interact with the deferred updates?
This is not mainly a FilePageMgr question, since at its level there are only two options: report an error or wait for the error condition (the writer) to go away. But it would be nice to have a handle on this problem. --mark *start* 00578 00024 US Date: 3-Feb-82 18:39:29 PST (Wednesday) From: Kolling.PA Subject: ReleaseVMPageRun questions To: mbrown cc: kolling
1. We've been saying things like "sequential access clients will release pages with the write behind/deactivate option, random access clients will release pages without the deactivate option", etc. but I don't see any way for people to do this through AlpineFile.....
2. What type of clients (of FPM) do you expect to select what combinations of write = {writeAndWait, writeButDontWait, writeBehind} and deactivate: BOOLEAN? Karen *start* 00650 00024 US Date: 5 Feb. 1982 4:03 pm PST (Friday) From: MBrown.PA Subject: Re: FileHandles In-reply-to: Kolling's message of 5-Feb-82 13:52:40 PST (Friday) To: Kolling, Taft cc: MBrown I wonder whether it might turn out to be less work overall to give AccessControl a side door for reading and writing pages of the owner database file without opening it (i.e. give it access at a level where it just passes in a TransHandle and FileHandle to perform read and write on a file, bypassing the open file map and client map and transaction map.) A lot depends on how neatly this fits in with Ed's implementation of these actions. --mark *start* 00390 00024 US Date: 5 Feb. 1982 4:11 pm PST (Friday) From: kolling.PA Subject: Re: FileHandles In-reply-to: MBrown's message of 5 Feb. 1982 4:03 pm PST (Friday) To: MBrown cc: Kolling, Taft As long as AC IO goes thru the locking mechanism, I don't care what door I use. How much locking I need depends on how much concurrency is removed, which we haven't decided yet, I think. *start* 00410 00024 US Date: 4 Feb. 1982 12:06 pm PST (Thursday) From: Kolling.PA Subject: Re: Pilot Redesign In-reply-to: MBrown's message of 4 Feb. 1982 10:28 am PST (Thursday) To: MBrown cc: Taft, Kolling I'm wondering if I should remove the references to immutable files from the FPM interface now. If we support immutability ourselves, wouldn't it be done at a higher level as a file property? Karen *start* 00811 00024 US Date: 5-Feb-82 15:37:01 PST (Friday) From: Taft.PA Subject: Log.RecordType To: MBrown cc: Kolling, Taft I suggest that the Log.RecordType enumeration be decentralized, along the lines of Pilot's FileType. That is, in Log.mesa you say something like:
RecordType: TYPE = RECORD [CARDINAL];
TransactionRecordType: TYPE = RecordType [0..99];
FileRecordType: TYPE = RecordType [100..199];
AccessControlRecordType: TYPE = RecordType [200..299];
... etc.
Then in individual, more private defs files, you say things like:
workerBegin: TransactionRecordType = [7];
workerReady: TransactionRecordType = [8];
... etc.
It would be nice if this could be done with subranges of enumerated types rather than CARDINALs, but I don't think that's possible; I'll ask Satterthwaite to make sure. Ed *start* 01912 00024 US Date: 5 Feb. 1982 4:51 pm PST (Friday) From: Taft.PA Subject: Satterthwaite's answers To: MBrown, Kolling From these I conclude that (1) the FileMap.Object record will have to be declared in the interface, and (2) we will have to use the Pilot FileType approach for decentralizing Log.RecordType (if we do it at all). --------------------------- Date: 5 Feb. 1982 3:57 pm PST (Friday) From: Satterthwaite.PA Subject: Re: Inline entry procedures In-reply-to: Your message of 5 Feb.
1982 10:45 am PST (Friday) To: Taft cc: Satterthwaite The inline has to be able to see the declaration of the monitor lock. So the best you can do is something like
Opaque: TYPE [n];
SemiOpaque: TYPE = MONITORED RECORD [guts: Opaque]
but this introduces more problems than it solves, so I don't recommend it. Ed --------------------------- Date: 5 Feb. 1982 4:08 pm PST (Friday) From: Satterthwaite.PA Subject: Re: Decentralized enumerated types In-reply-to: Your message of 5 Feb. 1982 3:46 pm PST (Friday) To: Taft You can use an enumerated type, but again the cure is usually worse than the disease:
RecordType: TYPE = MACHINE DEPENDENT {
  firstTransaction (0), firstFile (100), firstAccessControl (200), (1023)}  -- guarantee enough bits
....
TransactionType: TYPE = RecordType[firstTransaction..firstFile);
workerBegin: TransactionType = LOOPHOLE[7];
workerRead: TransactionType = NEXT[workerBegin];
Unfortunately, this approach does not give workerBegin the nice scoping properties of identifiers declared in the original enumeration. About the only advantage is that there are likely to be fewer values of the enumerated type floating around, and the type doesn't have quite so many operations as CARDINAL. The Pilot folks get most (but not all) of this by using RECORD [CARDINAL]. Ed ------------------------------------------------------------ *start* 02433 00024 US Date: 8 Feb. 1982 2:03 pm PST (Monday) From: Taft.PA Subject: Why LogMap interlocks are not needed To: MBrown, Kolling cc: Taft I've completed a first cut at implementing ReadPages/WritePages and at specifying some of the internal interfaces they depend on. I'm now convinced that no locking is required in the LogMap (except the LogMap monitor itself, for maintaining internal consistency), at least for page-level operations. This depends on doing things in the right order, however. When beginning a new operation, after setting locks, ReadPages and WritePages consult the LogMap. For each overlapping committed intention, it carries out the intention and then deletes the entry from the LogMap. Only then does it actually do the requested operation, perhaps adding new intentions to the LogMap. What happens if two concurrent clients in the same transaction both see the same uncommitted intention in the LogMap? That's simple: they both carry it out. There's nothing logically wrong with that. The only complication arises from the following sequence of events:
Client A: finds uncommitted intention in LogMap.
A: starts to carry out intention.
B: finds uncommitted intention in LogMap.
B: starts to carry out intention.
B: finishes carrying out intention.
B: deletes intention from LogMap.
B: puts a new intention for its own write into the LogMap.
A: finishes carrying out intention (the old one).
A: deletes intention from LogMap, thereby wiping out B's new intention.
There is an easy way to prevent this. The LogMap Delete operation requires the client to uniquely identify the LogMap entry to be deleted; that is, in addition to the EntityKey (FileID and PageNumber in this case), the client provides some unique identification (either the RecordID or the TransID will do; I prefer the RecordID, since the client presumably knows it already). The Delete operation will do nothing if the unique identification doesn't match. This eliminates bad interactions between committed and uncommitted intentions in the LogMap.
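(A small Python sketch of the guarded Delete just described: each entry is keyed by the EntityKey and remembers the RecordID that created it, and Delete is a no-op unless the caller's identification matches. The class and method names are invented for the illustration; only the idea comes from the message.)

```python
class LogMap:
    """Toy intention map keyed by (fileID, pageNumber); each entry carries the
    RecordID of the log record that created it."""

    def __init__(self):
        self.entries = {}          # (fileID, pageNumber) -> (recordID, intention)

    def insert(self, key, record_id, intention):
        self.entries[key] = (record_id, intention)

    def lookup(self, key):
        return self.entries.get(key)

    def delete(self, key, record_id):
        # Guarded delete: do nothing unless the entry is still the one the
        # caller carried out (identified by its RecordID).
        current = self.entries.get(key)
        if current is not None and current[0] == record_id:
            del self.entries[key]

# The race from the message, replayed:
log_map = LogMap()
key = ("fileA", 17)                      # (FileID, PageNumber)
log_map.insert(key, record_id=1, intention="old intention")

a_seen = log_map.lookup(key)             # A finds the uncommitted intention ...
b_seen = log_map.lookup(key)             # ... and so does B
# B carries it out, deletes it, and registers its own write:
log_map.delete(key, record_id=b_seen[0])
log_map.insert(key, record_id=2, intention="B's new intention")
# A finishes carrying out the *old* intention and tries to delete it;
# the guard keeps A from wiping out B's new entry:
log_map.delete(key, record_id=a_seen[0])
assert log_map.lookup(key) == (2, "B's new intention")
```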
There can't possibly be interactions between two uncommitted intentions belonging to different transactions, since a Lock must be obtained before registering such intentions. If two uncommitted intentions for the same transaction are registered, the later one overrules the earlier one; we've agreed that this is ok. There are no other cases I can think of. Ed *start* 00984 00024 US Date: 8 Feb. 1982 2:21 pm PST (Monday) From: Taft.PA Subject: Lock To: MBrown cc: Kolling, Taft
1. The Lock interface needs an operation for finding out about existing locks. Otherwise there is no way to implement AlpineFile.UnlockPages, which is specified to remove any read locks set by the current transaction but not to disturb any other kinds of locks.
2. I don't understand the purpose of the reference counts described in the comments in Lock.mesa.
3. LockNoWait should be abolished, and its function replaced by a "wait: BOOLEAN" argument to Lock, which, if TRUE, causes Lock to WAIT in the Transaction monitor. If the caller needs to back out of monitors before waiting, it can first call Lock with wait=FALSE; if this fails, back out of its monitors and then call Lock with wait=TRUE (if appropriate). [So far I have not needed to call Lock from within any monitors -- though I am calling it with Work in progress, which I believe is ok.] Ed *start* 00841 00024 US Date: 3-Feb-82 12:12:44 PST (Wednesday) From: MBrown.PA Subject: ReadPages outline To: Taft cc: Kolling, MBrown
ReadPages:
  OpenFileID -> OpenFileHandle (Enter/Exit OFM monitor)
  OpenFileHandle -> FileHandle, TransHandle, Conversation (Enter/Exit OFM)
  Check caller's Conversation
  StartWorking[t] (Enter/Exit TransObject)
  Acquire locks if necessary (Enter/Exit Locks, perhaps TransObject)
  FileHandle -> LogMap (Enter/Exit LogMap)
  data in log or base? do deferred updates
  SELECT
    log =>
      do read from log
      update if necessary (Enter/Exit FileMap)
      return value read
    base => return value from FPM (Enter/Exit FileMap)
  StopWorking[t] (Enter/Exit TransObject)
OpenFileMap -> FileObject
FileObject contains VolumeID, FileID, nOpenFileHandles (incoming), -> LogMap, -> FPM map
*start* 00968 00024 US Date: 8 Feb. 1982 4:55 pm PST (Monday) From: Taft.PA Subject: Why LogMap interlocks may be needed after all To: MBrown, Kolling cc: Taft Mark correctly points out that if my scenario is carried a bit further it will cause trouble (new actions are marked by =>):
Client A: finds uncommitted intention in LogMap.
A: starts to carry out intention.
B: finds uncommitted intention in LogMap.
B: starts to carry out intention.
B: finishes carrying out intention.
B: deletes intention from LogMap.
B: puts a new intention for its own write into the LogMap.
=> B: commits its transaction.
=> C: carries out B's intention (C is a third client or a background process).
A: finishes carrying out intention (the old one).
Result: the later intention is clobbered by the earlier one, which is clearly wrong. I'd like to find a way out of this, short of locking individual LogMap entries (as Mark proposed originally); but I haven't found one yet. Ed *start* 01473 00024 US Date: 10 Feb. 1982 9:32 am PST (Wednesday) From: Taft.PA Subject: Locking file properties To: MBrown, Kolling cc: Taft I currently see no difficulty in locking file properties individually, as opposed to locking the entire leader page when any property is accessed. Besides setting individual locks, all that's required is to log changes to properties individually and to serialize the actual leader page modifications with some monitor.
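(A minimal Python sketch of the arrangement proposed here: locks are taken on individual (file, property) pairs, each property change is logged individually, and a single monitor serializes the physical leader-page update. The lock-table and log representations are invented for illustration; in Alpine the property lock would be held until the transaction commits, not released at the end of the call as it is in this toy version.)

```python
import threading

class PropertyLocks:
    """Toy lock table keyed by (fileID, property) -- one lock per file property,
    rather than one lock covering the whole leader page."""
    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}

    def acquire(self, file_id, prop):
        with self._guard:
            lock = self._locks.setdefault((file_id, prop), threading.Lock())
        lock.acquire()

    def release(self, file_id, prop):
        self._locks[(file_id, prop)].release()

locks = PropertyLocks()
leader_page_monitor = threading.Lock()   # serializes the actual leader-page modification
log = []                                 # stand-in for logging each property change individually

def set_property(file_id, prop, value, leader_page):
    locks.acquire(file_id, prop)             # lock just this property of this file
    try:
        log.append((file_id, prop, value))   # log the change to this one property
        with leader_page_monitor:            # serialize the physical update of the leader page
            leader_page[prop] = value
    finally:
        locks.release(file_id, prop)         # released early only to keep the sketch short

page = {}
set_property("fileA", "highWaterMark", 42, page)
set_property("fileA", "versionNumber", 7, page)
print(page, log)
```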
Unless you see something wrong with this, I propose to go ahead and do this. What led me to think about file properties was the realization that advancing the high water mark (when writing on newly-allocated pages of a file) of course requires setting an update lock on the high water mark property. I think locking the entire leader page in this case could cause serious loss of concurrency. Also, during sequential writes, the high water mark is advanced during each write. So it seems undesirable to log each update to the high water mark. Instead, I should keep the uncommitted high water mark in volatile storage associated with the file, and only log the new high water mark during phase one of commit. This seems like a good application for a File x Transaction map, which we discussed briefly last week and decided to drop for the time being. Of course, I can also record this information in the LogMap, though doing so does stretch the semantics of the LogMap a bit. What do you think? Ed *start* 01166 00024 US Date: 10 Feb. 1982 3:37 pm PST (Wednesday) From: MBrown.PA Subject: Re: Locking file properties In-reply-to: Taft's message of 10 Feb. 1982 9:32 am PST (Wednesday) To: Taft cc: MBrown, Kolling The other possible high-contention property of a file is the version number. I could be convinced that databases can bypass the high water mark stuff altogether (or even that high water marks are not worth the trouble), but I am pretty sure that version numbers are worth having. So I think your design is a good one. By the way, Gifford says that for the purposes of his algorithms, it is important to define a file's version number to be the number of committed update actions performed on the file -- not the number of committed TRANSactions that updated the file. Also, a transaction in progress should see the version number as the stable version number of the file, plus the number of update actions already executed on the file for this transaction. (This has to do with transactions in which a write quorum is established several times independently.) The file x transaction map seems necessary for this information, too. --mark *start* 02396 00024 US Date: 11 Feb. 1982 4:06 pm PST (Thursday) From: kolling.PA Subject: FPM To: mbrown, taft cc: kolling I believe this is the functionality we arrived at in today's meeting:
1. For Release (ignoring the modifications necessary if there is another client): options: {writeIndividually, writeBatched, clean}, waitForWrite, reuse: BOOL.
clean => IF reuse THEN lru ELSE mru.
writeBatched, wait => ForceOut after the completion of which {IF reuse THEN lru ELSE mru}.
writeBatched, dontWait => A demon will cause the writes at some appropriate time, after the completion of which writes {IF reuse THEN lru ELSE mru}.
writeIndividually, wait => ForceOut, after the completion of which mru.
writeIndividually, dontWait => Mru. (leave write to the swapper).
I propose to ignore waitForWrite if clean. I also propose to ignore reuse if writeIndividually since it requires implementation effort if dontWait is set. We don't expect any client to be requesting these. Question: should "writeBatched, wait" start and wait for completion of the writes for other pages of that file that are marked for writeBatched, or does it just apply to the pageRun being released? Below is the association we expect between clients and options. (I'm a little fuzzy about the log stuff).
                   write option:      wait option:  reuse option:
sequential read:   clean              FALSE         TRUE
sequential write:  writeBatched       FALSE         TRUE
random read:       clean              FALSE         FALSE
random write:      writeIndividually  FALSE         FALSE
log:               writeIndividually  TRUE          FALSE
                   writeBatched       TRUE          FALSE
                   clean              FALSE         FALSE
2. Because FPM will never take a chunk out of its fpm FileObject map until the chunk has been Space.Unmapped, we can't be messed up by the client giving us incorrect hints such as clean when not clean, except for performance problems.
3. Any request for a page > eof will cause an Error and no data returned.
4. Yes, page 0 will be interpreted as the first data page everywhere in FPM, at Ed's suggestion. Karen *start* 01353 00024 US Date: 12-Feb-82 10:40:14 PST (Friday) From: MBrown.PA Subject: Alpine Volatile Maps To: Taft cc: Kolling, MBrown This looks pretty good. Some detail observations, which may or may not be valid:
1. Open file object "File lock" means "an element of {fileLocks, pageLocks}". The FileInstanceObject holds the lock mode (R, W, IR, etc) in which the whole file is held by the transaction. The mode held there is allowed to be weaker than the true mode.
2. Log map: the TransHandle of the updating transaction must be bundled together with the committed size and uncommitted size. If the leader page is updated as a unit, then it needs a special slot in the log map, but the rest of the log map is just the intentions tree.
3. Now I remember the other application for releasing locks: releasing a read lock on a file version number. Actually, this is no different than the problem of releasing the read lock on the leader page that is obtained implicitly by Open (at least in my naive implementation of Open) in order to read the access control lists. We need to deal with this.
4. If the tree of intentions is represented as a balanced binary tree, then an additional 5 words per node are required (two REFs and a balance bit.) I think it is too early to be preoccupied with the space required for this data structure. --mark *start* 00833 00024 US Date: 12-Feb-82 16:27:04 PST (Friday) From: Kolling.PA Subject: question To: mbrown cc: kolling In RedBlackTreeRefImpl, it says "it is now unsafe for multiple processes to be inside of an instance of this module, even if they operate on distinct tables (it was always unsafe to perform concurrent operations on a single table)." I think the FPM routines will generally look like this:
mumble get object monitor protecting the fpm area of this FileObject, where the fpm area = map + other stuff.
mumble OrderedSymbolTableRef.Lookup etc.
mumble release object monitor protecting the fpm area.
mumble.
This seems to imply that: the object monitor stuff should come out of RedBlackTreeRefImpl, and there should be a module monitor for it, yes? *start* 00636 00024 US Date: 8 Feb. 1982 4:14 pm PST (Monday) From: Kolling.PA Subject: Re: Pilot mapped files In-reply-to: Levin's message of 12 Jan. 1982 10:22 am PST (Tuesday) To: Levin cc: kolling A while back, you mentioned: "there is an internal interface that cleans up a space without releasing the memory behind it. If it matters enough to you, I could easily add a Space.Clean operation. However, this sounds like a performance optimization that can wait indefinitely." I would really like Space.Clean to implement a writeAndDontWait option. Any reason I can't just call that internal interface (which is it?)?
Karen *start* 00755 00024 US Date: 29 May 1981 3:35 pm PDT (Friday) From: Taft.PA Subject: FilePageMgr To: MBrown, Kolling cc: Taft It seems to me that the FileStoreID argument can be removed throughout. The Pilot interfaces identify files entirely by FileID, and the FilePageMgr should do likewise. I can't think of any way in which passing the FileStoreID will help the FilePageMgr do its work. By the way, Pilot seems to require that there be only one instance of a file with a given FileID on any machine. If we want to replicate a file with the same ID on multiple volumes, this will cause trouble. The highest level at which you get to specify a Volume as well as a FileID is SubVolume.StartIO, which I don't think we want to be messing with! Ed *start* 02791 00024 US Date: 16 Feb. 1982 10:35 am PST (Tuesday) From: MBrown.PA Subject: Alpine style conventions To: Kolling, Taft cc: MBrown Questions of Cedar/Mesa coding style are bound to arise as we start to write Alpine. There is no way that we'll achieve absolute uniformity in programming style across three individuals, but we should try for as much commonality as is practical. Let me propose the following point of style: Minimize the use of OPEN. Attitudes about OPEN vary widely, but it seems pretty clear that wide use of OPEN can make code difficult to read. Ed Satterthwaite, for instance, rarely uses OPEN except in the following situation: a program module manipulates a set of record types so widely that a programmer has no hope of understanding the program without understanding the record types. Then Ed may OPEN the definitions module containing the record types. Ed almost never OPENs a definitions module in order to get procedures. If he needs many procedures from an interface with a long name, he will import the interface with an explicit short name (IMPORTS I: Inline, S: Space, ... .) Sometimes he gets record types without OPEN by defining the same names in the local scope (TransID: TYPE = AlpineEnvironment.TransID; ... .) Another way to minimize the need for OPEN is to define clusters of operations associated with types, invoking the operations with the "object" notation (handle.Operation[args].) Careful design of defs modules may be required in order to define useful clusters. With clusters, it is possible to achieve very uniform and terse naming conventions. For instance, suppose that a TransID can be derived from a Foo.Handle. Then in interface Foo, define TransID: PROC [self: Handle] RETURNS [AlpineEnvironment.TransID]; and if f is a Foo.Handle, write f.TransID to call the TransID procedure in Foo. In my own programming I prefer to avoid OPEN altogether. I find that by: 1) defining commonly used types in the local scope using "TYPE =", 2) using object notation wherever possible, 3) where that is inappropriate, defining short names (generally one or two characters and all caps) for instances of frequently-used long interface names (IMPORTS S: Space), I never feel the need for OPEN. This also removes the need for USING in my DIRECTORY. USING lists are informative, but in my experience they add too much overhead to the program development process (compilation errors, with generally uninformative messages) to make them productive when actively developing a system. There are tools for adding USING lists to modules after the fact, so once a section of Alpine becomes more static the lists can easily be inserted. We can expect the tools for this to improve as Cedar really gets going. --mark *start* 01551 00024 US Date: 26 Aug.
1981 5:18 pm PDT (Wednesday) From: Taft.PA Subject: SIGNAL, etc. In-reply-to: MBrown's message of 25 Aug. 1981 4:56 pm PDT (Tuesday) To: MBrown cc: Kolling, Taft Your suggested treatment of SIGNALs across a remote interface seems like the only reasonable one from the point of view of robustness of the server -- independent of whether or not RPC is eventually able to support SIGNALs in all their generality. I think you are effectively proposing that RETURN WITH ERROR be the only allowed use of SIGNALs across a remote interface. It seems like the standard Mesa semantics of RETURN WITH ERROR are exactly what we want. That is, it unwinds the callee's stack; but catching the SIGNAL, passing parameters, etc., are done in the conventional way from the client's point of view. Can we get the RPC guys to give us these semantics, perhaps even with the same syntax? As you say, for debugging purposes this treatment may be somewhat of a nuisance. But I think there is an easy solution. The top-level procedures in the server (i.e., the ones exporting the remote interface) will have to catch all client programming errors and abstraction failures coming up from below and turn them into RETURN WITH ERROR on remotely exported SIGNALs. The trick is to have the top-level procedures catch these errors conditionally, based on some global debugging switch. If you turn on the debugging switch, these errors will not be caught, and the server will land in CoPilot with the server's stack still intact. Ed *start* 00563 00024 US Date: 22-Feb-82 12:36:10 PST (Monday) From: Kolling.PA Subject: There's one thing I noticed To: mbrown cc: taft, kolling in the new FilePageManager algorithm. Remember that it now doesn't do anything explicit about writes until the last user releases the chunk. This means that the chunk is seen as being dirty only if the last user was a dirtier. Consequently, for seq. access dirty chunks may be dumped in the lru list as lru, making subsequent people wait. Shall I ignore this possibility, or keep a dirty bit to handle it? Karen *start* 00474 00024 US Date: 22 Feb. 1982 1:14 pm PST (Monday) From: Taft.PA Subject: Re: There's one thing I noticed In-reply-to: Kolling's message of 22-Feb-82 12:36:10 PST (Monday) To: Kolling cc: mbrown, taft This doesn't seem like a very likely problem. But if it turns out to be, you can always get the truth about whether a page is dirty by looking at the hardware dirty bit. There is a "friends" interface called PageMap that will get you this information. Ed *start* 02056 00024 US Date: 16 Feb. 1982 7:08 pm PST (Tuesday) From: Taft.PA Subject: Handles and objects once again To: MBrown, Kolling cc: Taft I've converted AlpineClient, File, Lock, Log, Transaction, and VolatileMaps to use the latest conventions for defining handles and objects. After some oscillation, we seem to have settled on the following: When "public internal" defs files (i.e., those exported from one section of Alpine to another) need to refer to each other's definitions, they instead obtain the definitions from a common place, namely AlpineInternal (or AlpineEnvironment, if also seen by Alpine clients). This is straightforward for all definitions besides objects. To enable object notation to work, a RECORD [...] must be interposed. 
For example, we have the following in AlpineInternal:
TransHandle: TYPE = REF TransObject;
TransObject: TYPE;
and in TransactionMap:
Handle: TYPE = RECORD[AlpineInternal.TransHandle];
Any other "public internal" defs file that needs to refer to a TransactionMap.Handle instead refers to an AlpineInternal.TransHandle; for example, in FileMap:
GetTransHandle: PROC [FileHandle] RETURNS [AlpineInternal.TransHandle];
SetTransHandle: PROC [FileHandle, AlpineInternal.TransHandle];
However, in ALL other contexts (private defs files and ALL implementation modules), one should refer to the "home" defs file rather than to AlpineInternal; e.g., trans: TransactionMap.Handle; This enables you to use object notation for operations on the handles; e.g., trans.StartWork[...]; The "home" and AlpineInternal types are inter-assignable where necessary, but in the "home ← AlpineInternal" direction you must write extra brackets (a temporary restriction until some future improvements are made in the Cedar compiler's handling of type clustering). I think this happens only when calling procedures that return handles as results. That is, you need brackets in:
trans: TransactionMap.Handle ← [fileHandle.GetTransHandle[]];
but not in:
fileHandle.SetTransHandle[trans];
*start* 03073 00024 US Date: 1 Feb. 1982 5:29 pm PST (Monday) From: MBrown.PA Subject: Concurrency within a transaction To: Kolling, Taft cc: MBrown We had a short discussion of this topic at our last meeting; I want to capture this in written form, and amplify it somewhat if I can. I would like us to work on this issue since it serves as a useful forcing function: if we understand the design well enough to believe that concurrency within a transaction really works, then we have made progress independent of whether or not we finally choose to implement this feature.
1) Access control impl is programmed as much as possible like a client of Alpine. It makes use of Alpine's page locking service to serialize actions involving a page. If two processes working for the same transaction can possibly call access control impl in a way that triggers an owner database access, then the page locks will not serialize them. (There may be other examples of this problem within access control impl.)
2) When the lock manager selects a victim in a deadlock cycle, then in the presence of multiple processes it cannot "break" the lock immediately. Instead it must be content to mark the transaction "dying", and to prod any processes in lock wait for the transaction. A prodded process must then inspect its transaction state, and if it is dying must back out of the action that requested the lock. No lock may be broken at this point because other processes may be executing under the assumption that they hold the lock. The way to get the locks released is to abort the transaction. We can arrange this by (1) not allowing new client actions to start (with the exception of FinishTransaction), (2) making sure that all actions complete in a finite amount of time, and (3) forking a call to FinishTransaction[abort] on the offending transaction. (The client may also call this when he notices the problem; the implementation is prepared for the parallel calls.)
3) We would like to document the semantics of concurrent actions within a transaction. In our discussions we have taken the position that the effect of two parallel writes to a page should be to give the page one value or the other, not a mixture of the two (and similarly for any mixture of reads and writes.)
This seems to imply that all accesses to normal file pages (obtained from FilePageMgr) must be protected by the transaction monitor. By using inline entry procedures, the overhead of this can be made rather small (about 36 Dorado cycles, or 2 microseconds, compared to 50 microseconds to BLT a page.) Notice the following: if the monitor you acquire to do a page access is the same as the monitor you acquire to do other serialization within a transaction (e.g. to avoid having concurrent calls to access control impl) then there is a danger of deadlock. This means that either two monitors per transaction are needed, or that all synchronization of nontrivial operations must be done by manipulating semaphores within the transaction state, rather than by holding the monitor. --mark *start* 00247 00024 US Date: 4-Mar-82 16:11:46 PST (Thursday) From: Kolling.PA Subject: suggestion To: mbrown cc: kolling Maybe the thing to say is that "No change in the mapping state of a chunk is allowed while the chunk is on the lru list." *start* 00550 00024 US Date: 8-Mar-82 13:09:20 PST (Monday) From: Kolling.PA Subject: Re: Read/WritePages limit? In-reply-to: Taft's message of 8 March 1982 12:53 pm PST (Monday) To: Taft, mbrown cc: Kolling If FPM can be asked for a gigantic number of pages in one run the vm allocated to FPM will run out. Do we actually want to tolerate requests for 256 pages at a shot? What size do you envision for FPM's vm pool? (Note that this is different from the previously discussed cache problem where a client declares a seq. access file random.) *start* 01236 00024 US Date: 8 March 1982 6:02 pm PST (Monday) From: Taft.PA Subject: Re: Read/WritePages limit? In-reply-to: Kolling's message of 8-Mar-82 13:09:20 PST (Monday) To: Kolling cc: Taft, mbrown If you want to enforce a smaller limit on the length of a run, then export it through the FilePageMgr interface and I will abide by it (by cutting up my client's request into smaller runs when necessary). Perhaps there is a better way to arrange the interface to ReadPages (and friends). Instead of returning a LIST OF VMPageSet, it should return a single VMPageSet describing some initial interval of the requested PageRun. The caller makes use of that, releases it, and calls ReadPages again on the remainder of the PageRun if necessary. This arrangement gives FilePageMgr complete freedom over how to break up the requested PageRun, and permits it to deal uniformly with all considerations of inconveniently-mapped files, maximum run length, and maximum number of pages mapped simultaneously. It also eliminates the need to cons up a LIST during every call; for good performance, it seems desirable to eliminate unnecessary allocations. If you do make this change, the corresponding change in my code is trivial. Ed *start* 00917 00024 US Date: 8-Mar-82 18:49:21 PST (Monday) From: Kolling.PA Subject: Re: Read/WritePages limit? In-reply-to: Taft's message of 8 March 1982 6:02 pm PST (Monday) To: mbrown cc: Kolling, Taft Ed and I discussed his message. How do you feel about FPM returning just the chunk containing the beginning of the requested PageRun? He says that since he would have to deal with the run length limit anyway, we might as well do it this way as it would save the LIST stuff at the expense of more procedure calls. This would solve the run length problem, but not the cache flooding (since that happens with useCount = 0), but that's okay as there is already a simple count catch in there for that. 
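(A Python sketch of the client loop implied by this arrangement: ReadPages hands back only a chunk covering some initial interval of the requested run, and the caller uses it, releases it, and asks again for the remainder. The interface shown, including the four-page chunk size, is invented for the illustration and is not the real FilePageMgr.)

```python
from typing import NamedTuple

class VMPageSet(NamedTuple):
    first_page: int      # first file page covered by this chunk
    count: int           # number of pages actually mapped
    data: bytes          # stand-in for the mapped VM

def read_pages(file_id, first_page, count):
    """Toy FilePageMgr.ReadPages: returns one chunk covering some initial
    interval of the request (here at most 4 pages), never the whole run."""
    got = min(count, 4)
    return VMPageSet(first_page, got, bytes(512 * got))

def release(page_set):
    pass                 # stand-in for FilePageMgr.ReleaseVMPageRun

def read_run(file_id, first_page, count):
    """Client loop: keep asking for the remainder until the whole run is covered."""
    data = b""
    page = first_page
    remaining = count
    while remaining > 0:
        chunk = read_pages(file_id, page, remaining)
        data += chunk.data          # use the chunk ...
        release(chunk)              # ... then release it before asking again
        page += chunk.count
        remaining -= chunk.count
    return data

print(len(read_run("fileA", first_page=0, count=10)))   # 10 pages * 512 bytes each
```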
Also, by "inconveniently-mapped", Ed didn't mean the SetLength when eof is mapped with useCount # 0 problem, but rather was postulating that I was melding chunks together for clients (which I don't). Karen