WatchDoc.tioga
Russ Atkinson (RRA) March 28, 1986 6:30:45 pm PST
Watch Documentation
CEDAR 6.0 — FOR INTERNAL XEROX USE ONLY
A tool for watching the system
Russ Atkinson
© Copyright 1984, 1985, 1986 Xerox Corporation. All rights reserved.
Abstract: Watch is a program that maintains and displays statistics on selected system resources and events. Watch also has an automatic power off feature, which will power off the machine provided that there is no load on the machine.
Keywords: Cedar, performance, power off
XEROX  Xerox Corporation
   Palo Alto Research Center
   3333 Coyote Hill Road
   Palo Alto, California 94304

1. Introduction
Watch is a program that maintains and displays statistics on selected system resources and events. This information is grouped into normal and extended items of information, which are displayed in the normal and extended areas of the Watch display. Normal Cedar users need not know anything about the extended information. Watch also has an automatic power off feature, which will power off the machine provided that there is no load on the machine.
As a Cedar component, Watch is described by Watch.df. The tool is named Watch.bcd.
For client programs that need information that Watch gathers, some of the more interesting numbers are exported through WatchStats.
Watch places a relatively small load on most machines (less than 2% on a Dorado). Its load can be reduced further by increasing the sample interval (see below). When the display is not needed, Watch can be made iconic, in which state it uses almost no resources.
2. Normal area
The normal area contains the information of interest to most users. It gives a brief indication of how the system's resources are being spent across activities.
The top half of the normal area is organized into three lines of information, with a label at the left of the line, and a bar graph line at the right. The items displayed are:
Words - The number gives the number of words allocated by the SafeStorage manager. The bar graph line gives the number of words per second allocated, using a logarithmic scale.
CPU Load - The bar graph shows the percentage of the processor that is not idle. The percentage idle is derived from how fast a special idle process is incrementing a counter. The number just after the label is an exponentially decaying average of the CPU load.
DiskIO - The number gives the number of disk IO requests (the sum of read and write requests). The bar graph gives the percentage of the time that the disk IO queue had a request on it.
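The two derived displays described above, the logarithmic Words bar and the CPU Load percentage taken from an idle counter, can be illustrated with a short Python sketch. This is not Watch's code: the function names, the full-scale rate of 1,000,000 words per second, the bar width of 100 units, and the idea of calibrating against a fully idle counter rate are all assumptions made for illustration.

```python
import math


def log_bar_length(words_per_second: float,
                   max_rate: float = 1_000_000.0,  # assumed full-scale rate
                   width: int = 100) -> int:       # assumed bar width
    """Map an allocation rate onto a bar of `width` units on a log scale."""
    if words_per_second <= 1.0:
        return 0  # log10 of 1 or less would give a zero or negative bar
    fraction = math.log10(words_per_second) / math.log10(max_rate)
    return min(width, round(fraction * width))


def cpu_load_percent(idle_count_delta: int, fully_idle_delta: int) -> float:
    """Estimate percent busy from how far the idle counter advanced.

    `fully_idle_delta` is the (assumed, calibrated) advance the counter
    would make over the same period on a completely idle machine.
    """
    if fully_idle_delta <= 0:
        raise ValueError("fully_idle_delta must be positive")
    idle_fraction = min(1.0, idle_count_delta / fully_idle_delta)
    return 100.0 * (1.0 - idle_fraction)
```

The logarithmic scale lets a rate of a thousand words per second and a rate of a million both remain readable on the same bar, which a linear scale would not.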
The bottom half of the normal area is also organized into three lines of information, although the structure is more heterogeneous. The lines displayed are:
Free - This line is quite important, because it gives information about the remaining system resources. When these numbers become small, the user is advised to put the machine in a clean state (save edits and so forth), and rollback or reboot. The numbers are:
disk - gives the number of disk pages remaining (a page is 512 bytes). As a rule, this number should not be allowed to fall much below 1000 pages. If this number goes below 512 pages, consider deleting some less important files.
gfi - gives the number of global frame indexes remaining. A CedarMesa module can consume from 1 to 4 gfi's, although 1 is most common. A program will consume gfi's roughly according to the number of modules in it. If this number goes below 20 the user should consider a rollback.
mds - gives the number of pages of MDS remaining. Global and local frames consume MDS, so loading programs is the main cause of MDS exhaustion. If this number goes below 20 the user should consider a rollback.
VM - gives the number of pages of VM (Virtual Memory) remaining. The main cause of VM disappearance is SafeStorage allocation. If this number goes below 500 the user should consider a rollback.
VM run - gives the largest number of contiguous pages of VM remaining. The main cause of VM run disappearance is fragmentation. If this number goes below 100 the user should consider a rollback. Because this number is more expensive to sample than the others, it is only sampled every 30 seconds, so it can report an older sample than VM. The user can override this long sampling interval through the user profile entry Watch.LongPause. To reduce CPU load, this number is not sampled when the machine is idle or when the Watch tool is iconic.
GC - This line has two buttons controlling garbage collection (see below), and a status item indicating why the garbage collector is busy (if it is) or how many words and objects the collector reclaimed the last time it completed.
Sample - This line has two buttons controlling Watch sampling (see below), and a status item indicating which file the file system is fetching or has most recently fetched from a remote file server. While the file system is fetching the file, the background of the label is black; otherwise it is white.
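The advice attached to the Free line can be summarized as a table of thresholds. The Python sketch below encodes the limits quoted above (512 disk pages, 20 gfi, 20 mds, 500 VM, 100 VM run); the function name, the dictionary layout, and the wording of the messages are hypothetical and are not part of Watch or WatchStats.

```python
# Limits below which the text advises a rollback.
ROLLBACK_THRESHOLDS = {"gfi": 20, "mds": 20, "VM": 500, "VM run": 100}
# Limit below which the text advises deleting files.
DISK_LOW_PAGES = 512


def free_resource_advice(free: dict[str, int]) -> list[str]:
    """Return one advice string per resource that is below its limit.

    `free` maps the labels on the Free line to their current values;
    labels that are missing from `free` are simply not checked.
    """
    advice = []
    if "disk" in free and free["disk"] < DISK_LOW_PAGES:
        advice.append("disk: consider deleting some less important files")
    for name, limit in ROLLBACK_THRESHOLDS.items():
        if name in free and free[name] < limit:
            advice.append(f"{name}: consider a rollback")
    return advice
```

For example, a sample with 400 free disk pages and 450 free VM pages would produce two pieces of advice, one for disk and one for VM.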
3. Controlling buttons
Watch has several buttons in the normal area that allow simple control of system parameters. All of the buttons have rectangular borders.
GC interval - This button governs the number of words to allocate between automatic garbage collections. Left-clicking this button doubles the number, and right-clicking this button halves the number. The default GC interval is 16000 words, although the user can override this through the user profile entry Watch.GCInterval.
GC - Left-clicking this button initiates a background incremental garbage collection. Right-clicking this button causes a trace and sweep collection (which is more thorough, but significantly slower).
Sample - Any click of this button will force all of the numbers in Watch to be sampled immediately. The window size is also set: left-click for small (the Sample button is the last line), middle-click for medium (a number of lines specified by the user profile entry Watch.middleLines {default 2}), and right-click for large (all numbers shown).
interval - This button governs the number of seconds between samples. As for the GC interval button, left-clicking this button doubles the number, and right-clicking this button halves the number. The default sample interval is 2 seconds, although the user can override this through the user profile entry Watch.SamplePause.
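The doubling and halving behavior shared by the GC interval and interval buttons can be sketched in Python as follows. The function name, the string encoding of the mouse button, and the lower bound of 1 are assumptions; the text does not say what Watch does when a value of 1 is halved.

```python
def adjust_interval(current: int, button: str, minimum: int = 1) -> int:
    """Return the new interval after a click.

    "left" doubles the value and "right" halves it, as described for
    the GC interval and sample interval buttons. Any other button
    leaves the value unchanged. The `minimum` floor is an assumption.
    """
    if button == "left":
        return current * 2
    if button == "right":
        return max(minimum, current // 2)
    return current
```

Starting from the default GC interval of 16000 words, one left-click gives 32000; starting from the default sample interval of 2 seconds, one right-click gives 1.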
4. Automatic power off
To conserve electrical power, Watch will power off the machine under the following conditions:
1. The machine has been placed in idle (using the Idle button).
2. Enough continuous no load time has passed (Watch.idleMinutesTilPowerOff ← 10).
Minimum is 5 minutes. Setting a very large value (like 1000000) will prevent Watch from powering off the machine at all. Do not use a number > 2147483647.
3. It is after a specified time of day (Watch.powerOffAfter ← 1900).
4. It is before a specified time of day (Watch.powerOffBefore ← 700).
24 hour time specs are given for ease of implementation and explanation.
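The time-of-day conditions can be read as a window that, with the default values, wraps past midnight (from 1900 in the evening until 700 the next morning). The Python sketch below takes that reading; it is an interpretation, not Watch's code. The function names, the treatment of the boundary times, and the clamping of the idle time to the 5 minute minimum are assumptions.

```python
def in_power_off_window(now_hhmm: int,
                        power_off_after: int = 1900,
                        power_off_before: int = 700) -> bool:
    """Return True if `now_hhmm` (24 hour time, e.g. 2330) is in the window.

    Assumption: when the "after" time is later in the day than the
    "before" time, the window wraps past midnight.
    """
    if power_off_after <= power_off_before:
        # Window lies within a single day.
        return power_off_after <= now_hhmm < power_off_before
    # Window wraps past midnight.
    return now_hhmm >= power_off_after or now_hhmm < power_off_before


def effective_idle_minutes(requested: int) -> int:
    """Apply the documented 5 minute minimum (clamping is an assumption)."""
    return max(5, requested)
```

With the defaults, 2300 and 300 fall inside the window, while noon (1200) does not.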
The definition of "no load" is that all of the following are true:
1. The exponentially decaying average of CPU load is less than the cutoff (default 5%). This average (rounded to the nearest percent) appears to the right of "CPU Load" in the Watch tool.
5% is an observed good value for no CPU usage on Dorados, after factoring out the load of Watch itself and occasional Ethernet broadcast packet handling. Although this figure may not be good for other machines, we are really only concerned about Dorados as far as power usage is concerned. The exponential decay allows roughly 10% change in the average per second, which ensures that any significant spikes will go over the limit.
2. No words have been allocated from SafeStorage.
3. No disk transfers have taken place.
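The "no load" test can be sketched in Python as an exponentially decaying average combined with the two zero-activity checks. The formula below is one common way to obtain "roughly 10% change in the average per second"; the actual formula used by Watch is not given here, so the weighting, the function names, and the parameter names are assumptions.

```python
def decayed_average(previous: float, sample: float, seconds: float,
                    change_per_second: float = 0.10) -> float:
    """Move `previous` toward `sample` by about 10% per elapsed second.

    The weight is compounded over `seconds`, so a longer sample
    interval moves the average further in one step.
    """
    weight = 1.0 - (1.0 - change_per_second) ** seconds
    return previous + weight * (sample - previous)


def no_load(cpu_average: float, words_allocated: int, disk_transfers: int,
            cutoff_percent: float = 5.0) -> bool:
    """True when all three documented "no load" conditions hold."""
    return (cpu_average < cutoff_percent
            and words_allocated == 0
            and disk_transfers == 0)
```

Under this weighting, a single one-second spike to 100% load on an otherwise idle machine raises the average from 0 to about 10%, which is over the default 5% cutoff.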
Note that Watch.idleMinutesTilPowerOff, Watch.powerOffAfter, and Watch.powerOffBefore are settable from the interpreter, and are also user profile options under the same names. Here are the profile options, showing the defaults:
Watch.idleMinutesTilPowerOff: 10
Watch.powerOffAfter: 1900
Watch.powerOffBefore: 700
Some machines do not handle power off well. If this is the case, the machine profile (but not the user profile) should contain the entry:
Watch.powerOffInhibit: TRUE
The machine profile is stored in the same format as the user profile, but uses the file name machineName.machineProfile, where machineName is the name of the machine. See UserProfileDoc.tioga for more information.
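The way these entries combine can be sketched in Python. The sketch assumes each profile entry is a single "name: value" line, as in the examples above; the real profile format is described in UserProfileDoc.tioga and may allow more than this. The function names are hypothetical.

```python
# Defaults as listed in the text.
DEFAULTS = {
    "Watch.idleMinutesTilPowerOff": 10,
    "Watch.powerOffAfter": 1900,
    "Watch.powerOffBefore": 700,
}


def parse_profile(text: str) -> dict[str, str]:
    """Parse "name: value" lines (assumed format) into a dictionary."""
    entries = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            entries[key.strip()] = value.strip()
    return entries


def power_off_settings(user_profile: str, machine_profile: str = "") -> dict:
    """Combine user profile values, defaults, and the machine inhibit flag."""
    user = parse_profile(user_profile)
    settings = {key: int(user.get(key, default))
                for key, default in DEFAULTS.items()}
    # The inhibit entry is read from the machine profile only.
    machine = parse_profile(machine_profile)
    inhibit = machine.get("Watch.powerOffInhibit", "FALSE")
    settings["Watch.powerOffInhibit"] = inhibit.upper() == "TRUE"
    return settings
```

Reading the inhibit flag only from the machine profile mirrors the statement that it belongs there and not in the user profile.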
5. Extended area
Except for the first two lines and a few minor exceptions, the extended items are merely samples of numbers available through system interfaces. The items are grouped as follows:
Flushing - indicates which file the file system is deleting from its cache of remote files (when the background is black), or has most recently deleted (if the background is white).
Storing - indicates which file the file system is storing to a remote server (when the background is black), or has most recently stored (if the background is white).
FS - gives the cumulative number of file system fetches, flushes, and stores.
Disk - The % busy number is the number displayed in the DiskIO bar graph line. The other numbers are derived from Disk.GetStatistics, and give the cumulative numbers of read requests and pages read, and of write requests and pages written.
VM - This group reflects some of the numbers available through VMStatistics.
faults (VMStatistics.pageFaults) - cumulative number of page faults.
readOnly (VMStatistics.readOnlyPages) - number of pages that may not be written.
pinned (VMStatistics.pinnedPages) - number of pages that must remain associated with real memory.
conflicts (VMStatistics.checkoutConflicts) - number of times that a VM operation was attempted on a page in use by an IO operation.
Replacement - This group reports some of the numbers available through VMStatistics that measure the performance of the replacement algorithm.
passes (VMStatistics.rmAllocPasses) - number of times the replacement algorithm has cycled through real memory.
pages (VMStatistics.rmReclamations) - total number of pages of real memory reclaimed since startup.
free (VMStatistics.rmFreeList) - number of real pages reclaimed from the free list.
old (VMStatistics.rmOldClean) - number of real pages reclaimed that were clean and not recently referenced.
new (VMStatistics.rmNewClean) - number of real pages reclaimed that were clean and recently referenced. If the number of these pages is a significant fraction of the total, either the working set is too large for the available real memory or the replacement algorithm isn't doing its job properly.
dirty (VMStatistics.rmDirty) - number of real pages reclaimed that were dirty. If the number of these pages is a significant fraction of the total, the laundry process isn't doing its job properly.
Laundry - This group reports some of the numbers available through VMStatistics that measure the performance of the laundry process.
passes (VMStatistics.rmCleanPasses) - number of times the laundry process has cycled through real memory.
wakeups (VMStatistics.laundryWakeups) - number of times the laundry process woke up for any reason.
pages (VMStatistics.pagesCleaned) - total number of pages cleaned by the laundry process.
panic (VMStatistics.panicLaundryWakeups) - number of times the laundry process woke up because no clean memory was available to the replacement algorithm.
panicPgs (VMStatistics.pagesCleanedPanic) - number of pages cleaned by the laundry process during panic wakeups.
useless (VMStatistics.uselessLaundryWakeups) - number of times the laundry process woke up and was unable to perform any transfers.
cleanCalls (VMStatistics.laundryCleanCalls) - number of calls on VM.Clean from the laundry process.
SwapIn - This group reports some of the numbers available through VMStatistics that measure the performance of VM.SwapIn.
calls (VMStatistics.swapInCalls) - total number of calls on SwapIn.
vRuns (VMStatistics.swapInVirtualRuns) - number of logical disk operations triggered by SwapIn.
pRuns (VMStatistics.swapInPhysicalRuns) - number of physical disk operations triggered by SwapIn.
pages (VMStatistics.swapInPages) - total number of pages SwapIn was asked to swap in.
alreadyIn (VMStatistics.swapInAlreadyIn) - number of pages SwapIn was asked to swap in that were already in.
undef (VMStatistics.swapInNoRead) - number of pages SwapIn was asked to swap in, but avoided reading because their data state was "undefined".
read (VMStatistics.swapInReads) - number of pages actually read by SwapIn.
dirtyVictims (VMStatistics.swapInDirtyVictims) - number of dirty victims SwapIn was forced to remove synchronously because the laundry process wasn't keeping up.
cleanFailed (VMStatistics.swapInFailedToCleanVictims) - number of dirty victims SwapIn tried to swap out but was unable to do so.
FileStatistics - This group reports some of the numbers available through FileStats that measure the performance of the file system. Each line gives the name of a basic operation, followed by the number of calls on that operation, followed by the total number of pages involved in all calls to that operation, followed by the total number of milliseconds that operation was busy.
Open - measures calls to open a file. The number of pages is the number of pages in the files opened.
Create - measures calls to create a file. The number of pages is the initial number of pages in the files created.
Delete - measures calls to delete a file.
Extend - measures calls to make a file larger.
Contract - measures calls to make a file smaller.
Read - measures calls to read pages from a file.
Write - measures calls to write pages to a file.
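Because most of the extended numbers are cumulative counters, they are often easier to interpret as ratios. The Python sketch below shows two such derived figures: the share of reclaimed pages that were "new" or "dirty" (the two cases flagged in the Replacement group), and the average pages and milliseconds per call for a FileStatistics line. The function names are hypothetical, and Watch itself does not display these ratios.

```python
def reclaim_fractions(pages: int, new: int, dirty: int) -> dict[str, float]:
    """Fraction of all reclaimed pages that were "new" or "dirty".

    `pages`, `new`, and `dirty` correspond to the Replacement items
    of the same names.
    """
    if pages == 0:
        return {"new": 0.0, "dirty": 0.0}
    return {"new": new / pages, "dirty": dirty / pages}


def per_call_averages(calls: int, pages: int,
                      milliseconds: int) -> tuple[float, float]:
    """Average (pages per call, milliseconds per call) for one operation."""
    if calls == 0:
        return (0.0, 0.0)
    return (pages / calls, milliseconds / calls)
```

What counts as a "significant fraction" is left to the reader; the text gives no numeric threshold, so none is built into the sketch.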