GVSupport.tioga
Copyright © 1987 by Xerox Corporation. All rights reserved.
Wes Irish, May 8, 1987 2:24:16 pm PDT
GRAPEVINE SUPPORT
Grapevine Support
Andrew D. Birrell
This document describes informally some procedures and techniques that may be used by the central administrators of the Grapevine system. It is highly unlikely that this will be useful to anyone else. The document is in reality the concatenation of my present recollections of what we have been doing as problems arose. The information and recommendations should be treated with suspicion.
File: [Indigo]<Grapevine>Docs>GVSupport.tioga
February 14, 1984 4:58 pm
Last edited by RWeaver -- May 3, 1984 06:11 am PDT
Last edited by Hal Murray -- July 25, 1984 3:42:47 am PDT
Last edited by Wes Irish -- May 8, 1987 2:23:03 pm PDT
XEROX   Xerox Corporation
    Palo Alto Research Center
    3333 Coyote Hill Road
    Palo Alto, California 94304

For Internal Xerox Use Only
Introduction
Grapevine was designed to allow decentralized administration. To a large extent this succeeded, but there are some things that must be centralised (like running the GV registry or making new versions of the software), and there are some things that the various local administrators are unqualified to do (like patching disks to recover from disk errors), and there are some things that we don't trust them to do (like purging dead names before the normal 14-day timeout), and there are some things where we got it wrong (like adding or removing registries, or various aspects of archived mailboxes). So the central administrators are needed fairly often. The local administrators have always been friendly and helpful, and they are quite well trained at calling for help before they produce a disaster. Most of them are quite non-technical.
Make use of the local administrators. If you don't, you will spend absurd amounts of time running Grapevine. It is often tempting to do something yourself because it's easier than sending a message to the local administrator, but remember that whatever problem you're solving will recur someday and life will be easier if the local administrators learn to deal with the problem. Mike Schroeder and I ended up spending a total of about a day or two per month on Grapevine (averaged - it can be quite bursty). Ron Weaver is quite helpful and experienced at providing non-technical assistance.
Access Controls
Local administrators get their privileges by being on the owners or friends acls of the appropriate registry. The precise rights endowed by those two acls are described in the Grapevine Interface protocol specification, or look in the source module ACL.mesa. There doesn't seem to be a useful distinction between those acls, so I tend to ignore the friends acl. Note that registry owners/friends cannot change these acls (nor the membership lists of those groups!). We've been maintaining a distribution list GVAdmin^.pa for sending messages to the local administrators. Remember that you can also send messages to names such as Owners-ES.gv.
There is a list LogReaders^.ms which permits people to read the file GV.log from a server; this isn't particularly useful. At present the log files (but not the archive files) on IFS's are publicly readable; this is probably a bad idea but I haven't had the energy to go fix it. Transport^.ms is the major acl defining who are the central administrators. This list is the acl checked by the viticulturists' entrance Enable command. It is also an owner of GV.gv and therefore has complete access to the database. It is also used by the built-in FTP server to allow reading or writing files on the local file system. The individual Wizard.gv can also use the Enable command and is also an owner of GV.gv. This is so that you can acquire privileges on a server without needing to access a group membership list (handy if the server's disk is entirely full, or if enough servers are down that it's hard to perform the recursive membership check on Transport^.ms). If you are logged in as an individual in the GV registry, you can read someone's password by using Maintain. It's also possible to log in as a server; this is occasionally useful for using Lily to look at a server's mailbox. The group IFSAccounts^.ms has owners rights to the DMS directory on each IFS, and so can manipulate the log files and archive files there. Each mail server is a member of that list, and central administrators may want to add themselves (or maybe you should just add Transport^.ms to it).
The individual DeadLetter.ms receives a message whenever a server returns a message to its sender. The name PostMaster.pa is used by Arpanet people to ask about our mail system. The name LaurelSupport.pa is intended for user enquiries about Laurel, but they also use that for enquiries about Grapevine.
The group GVSupport^.pa is on DeadLetter.ms, PostMaster.pa, LaurelSupport.pa and Transport^.ms, and also protects access to the Grapevine directories on [Ivy] and [Indigo], where the sources are kept.
The group LaurelImp^.pa protects access to the DMS directory on Indigo (where Laurel's sources are kept) and the Laurel release directory on Ivy.
The group ExpressMail^.ms controls which lists will be given slower processing for message delivery. This is used to avoid overload by things like Junk^.pa. You may occasionally want to add to this list.
Central System Management
This section describes how to make those changes to the system that are performed by the central administrators, namely adding new servers and changing the set of registries in the world or on a particular server. We also keep central control over configuration decisions about where people's mailboxes should be, although we get the local administrators to implement those decisions. They should be reminded that it's a bad idea to move lots of mailboxes at one time (since it can severely overload the servers).
Historical Enquiries
The servers keep copious log files, backed up in 40 files of 120*512 bytes each. Occasionally a local administrator will ask questions about some misuse of the system or of someone's password. Quite often the identity of the offender will be obvious if you look at the appropriate log files. They can also be handy for tracking down the cause of multi-server problems.
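For a sense of scale, the log capacity implied by those figures can be worked out directly. This is a small arithmetic sketch in Python, not part of the Grapevine software; the variable names are invented, and reading "120*512" as 120 pages of 512 bytes is an assumption.

```python
# Log capacity implied by "40 files of 120*512 bytes each".
LOG_FILES = 40
PAGES_PER_FILE = 120   # assumption: the "120" counts 512-byte pages
BYTES_PER_PAGE = 512

bytes_per_file = PAGES_PER_FILE * BYTES_PER_PAGE
total_bytes = LOG_FILES * bytes_per_file

print(bytes_per_file)  # 61440
print(total_bytes)     # 2457600
```

So each server retains roughly 2.4 million bytes of log before the oldest file is reused.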
Initializing a New Server named "NewServer"
0. Create a files only directory named "DMS" on a Leaf server near to the Grapevine server's permanent host. Give it a 5000 page quota. Make all permissions available to anyone in the group IFSAccounts^.ms. (Skip if account already exists for another Grapevine server.)
1. Register "NewServer" in the Pup network directory with the net#host# of its permanent host.
2. Include "NewServer" on the right-hand side for the definition for "GrapevineRServers" in the Pup network directory. (There may be a problem here: the right hand side is limited to 20 names. Omitting this step will make the server slightly harder for clients to find.)
3. Create an individual named NewServer.gv with three inbox sites, the correct password, and a connect site of the net#host# where you will first start up the server.
4. Create an individual named NewServer.ms with three inbox sites, the correct password, and a connect site of the net#host# where you will first start up the server.
5. Create a group named Archive-NewServer.ms with a member "[LeafServer]<DMS>", where "LeafServer" is the IFS used in step 0.
6. Create an individual named Log-NewServer.ms with no password and a connect site set to "[LeafServer]<DMS>Log>".
7. Add NewServer.ms as a member of MailDrop.ms.
7a. Wait a while for most existing Grapevine servers to get all these updates.
8. Add NewServer.gv as a member of all registries that the new server is to contain replicas of. This will include gv, ms, internet, auto, and foreign.
9. Add NewServer.ms as a member of IFSAccounts^.ms.
10. Wait a while for most existing Grapevine servers to get all these updates.
11. Find a vacant Dorado with an Alto partition and use CopyDisk to install the primeval server disk from [Indigo]<Grapevine>PrimevalServer.bfs. Boot this, and install the operating system to rename the file system to be NewServer. Delete the file heap.segments if it exists. Delete MBX* and SL* too.
12. Start the server running by typing "@server.cm".
13. Type the passwords asked for and proceed from the debugger once (uncaught signal "InitialisingHeap" from HeapRestart) as required.
14. When it's finally open for business, wait for all accumulated data base updates to complete, then stop the server and CopyDisk its disks to disks on the permanent host. Restart the server there and then delete the file heap.segments from your original of the startup disks. Recently, we've done the CopyDisk to the remote site by storing the disk image on Indigo then getting the remote people to suck the disks from there with CopyDisk. This requires less intervention from this end.
15. Restore the Dorado's Alto partition to a sensible state (as described in the "Server Building" section).
Note. If you fail after step 12, you must delete the file heap.segments before retrying.
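The names created or modified in steps 3 to 9 follow a fixed pattern, so they can be written out as a checklist before starting. The following is a minimal sketch in Python, not part of the Grapevine software; the function name and the dictionary layout are invented for illustration, and the default registry list is simply the one given in step 8.

```python
def new_server_checklist(server, leaf_server,
                         registries=("gv", "ms", "internet", "auto", "foreign")):
    """Return the names touched by steps 3 to 9 for a new server (hypothetical helper)."""
    dms = f"[{leaf_server}]<DMS>"
    return {
        # Steps 3, 4 and 6: individuals to create.
        "individuals": [f"{server}.gv", f"{server}.ms", f"Log-{server}.ms"],
        # Step 5: the archive group and its single member.
        "archive_group": (f"Archive-{server}.ms", dms),
        # Step 6: connect site of the log individual.
        "log_connect_site": f"{dms}Log>",
        # Steps 7 and 9: groups that gain <server>.ms as a member.
        "ms_member_of": ["MailDrop.ms", "IFSAccounts^.ms"],
        # Step 8: registries that gain <server>.gv as a member.
        "gv_member_of_registries": list(registries),
    }


checklist = new_server_checklist("NewServer", "LeafServer")
print(checklist["log_connect_site"])  # [LeafServer]<DMS>Log>
```

The sketch only generates names. It does not capture the waits in steps 7a and 10, which still depend on your judgement of when the updates have propagated.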
Rebuilding a Server named "NewServer" after Burning its Disks
Perform step 11 of the above recipe. Set the connect sites for NewServer.gv and NewServer.ms to the net#host# of that Dorado. Perform steps 12, 13, 14 and 15 of the above recipe.
Moving a Server
You can move an existing server to a new address quite casually. It will automatically notice its new address and update the database. If the server is a Dandelion, you'll first need to update the Pup network directory to map the processor ID to the new Pup address. There is a potential race in the newly moved server finding its local mailbox - you can fix the race by using Maintain to set the correct connect site for the mail server in its local version of the MS registry.
Creating an Entirely New Registry "NewReg"
1. Create the group NewReg.gv
2. Set its remark field to something informative
3. Add appropriate registration servers as members of the group. Remember to use full RNames such as Cabernet.gv.
4. Wait until you're sure that each of those servers has received an update giving it the final membership list of NewReg.gv (so they all know who to send updates to).
5. Add appropriate administrators as owners of the group NewReg.gv and tell the administrators that it's all ready.
Destroying a Registry "OldReg" Completely
Don't just delete the group OldReg.gv; this should work, but it apparently crashes each server that knows about the registry. So first remove all members from the group, then wait for those updates to propagate to all the concerned servers, then delete the group.
Removing a Registry "OldReg" from an Existing Server
Easy: just remove the server from being a member of the group OldReg.gv.
Adding an Existing Registry to an Existing Server
Hard. The difficulties arise from the problem of synchronizing the newly added server fetching the registry with people making updates to the registry. Really, we should have implemented the midnight database comparison and repair demon and been less paranoid about the database being temporarily out of step. However, here's what you do. Chat to the server in question, log in and enable. Then use the AddRegistry command. This leads you by the hand through a delicate process. I'm not sure what happens if there's a crash in the middle of it. In case of doubt, look at the procedure DoAddRegistry in the source module Enquiry.mesa. Approximately, the server turns itself off, then adds itself to the database in its local copy only, then gets you to add it in everyone else's copy (using some other server), then when that update has propagated (your judgement) it fetches the registry from some other server. The entire exercise is quite sensitive to network or server failures. Sigh!
Software Support
Remember that the Grapevine code is basically an Alto program. The sources for the main part of the server (Server.bcd) are Pilot compatible (approximately). There are Pilot versions of VM and some modules in Server.bcd so that we can run on Dandelions. I don't know anything about building, debugging or running the Dandelion version. The software is very close to running out of MDS memory, so be very careful about using any more!
Components
For the Alto, parts of the software are packaged separately. Different machines use different ones, depending on which servers they run. The servers are started by an Alto executive command line such as
SetTime; Grapevine.image/k TinyPup/l VM/l Server/l MaintainCore/l LilyCore/l Sponge/l;
The "/k" releases memory that would otherwise be reserved for debugger bitmap. The "/l" says to load with code links instead of frame links (saves MDS memory). The "SetTime" ensures that the system knows the time. Beware: if there are no time servers running, this asks the operator to type in the time, and the operator often gets it wrong. The command line is stored on each server as the file Server.cm. MaintainCore and LilyCore are optional. Server may be replaced by MServer. Server and LilyCore may be replaced by Lily.
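The substitution rules just described can be summarised as a small command-line builder. This is a sketch in Python for illustration only; the function and its parameters are invented, and it makes no attempt to check which combinations of components actually work together on a real server.

```python
def server_command_line(kind="Server", maintain=True, lily=True):
    """Build an Alto executive command line for one of the described configurations.

    kind is "Server", "MServer" or "Lily"; Lily replaces both Server and LilyCore.
    (Hypothetical helper: the switches come from the text, the function does not.)
    """
    if kind not in ("Server", "MServer", "Lily"):
        raise ValueError(f"unknown kind: {kind}")
    parts = ["Grapevine.image/k", "TinyPup/l", "VM/l", f"{kind}/l"]
    if maintain:
        parts.append("MaintainCore/l")   # optional
    if lily and kind != "Lily":
        parts.append("LilyCore/l")       # optional, and redundant with Lily
    parts.append("Sponge/l")             # kept last, as in the example line
    return "SetTime; " + " ".join(parts) + ";"


print(server_command_line())
# SetTime; Grapevine.image/k TinyPup/l VM/l Server/l MaintainCore/l LilyCore/l Sponge/l;
```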
Grapevine.image
Use this instead of Mesa.image. It lives on [Indigo]<Grapevine>Image>*. Unmodified parts are still on [Ivy]<Mesa>*. It allows more processes than Mesa.image.
TinyPup
The Alto Mesa Pup package, with some bug fixes. It is TinyPup.bcd and it and the modified modules live on [Indigo]<Grapevine>Pup>*. The unmodified modules are still on [Ivy]<Mesa>Pup>*.
VM
Disk access and file system. It is VM.bcd and lives on [Indigo]<Grapevine>VM>*.
GrapevineUser
This is available as a separate package because it is bound in to Laurel. It is GrapevineUser.bcd and lives on [Indigo]<Grapevine>User>*. It implements server-independent access to the protocols. It is included in Server.bcd, MServer.bcd and Lily.bcd.
Server, MServer
This contains the main body of Grapevine. Use Server.bcd if you want to run both a registration server and a mail server. Use MServer.bcd if you want to run just a mail server. Lives on [Indigo]<Grapevine>MS>*.
Lily(Core)
Implements the Lily server (also known as Ernestine; "Lily" as in "Lily Tomlin"). LilyCore.bcd can be loaded with Server.bcd or MServer.bcd. Use Lily.bcd for a stand-alone server. Lives on [Indigo]<Grapevine>Lily>*. I recommend resisting strongly any suggestions for improving this.
Maintain(Core)
The grotty teletype-style Maintain interface. MaintainCore.bcd can be loaded if you're using Server.bcd (it then allows use of Maintain if you're enabled in the viticulturists' entrance chat connection). MaintainCore.bcd should be loaded if you're using Lily.bcd. It is also included in Maintain.laurel for loading in Laurel 6. Lives on [Indigo]<Grapevine>Maintain>*. I recommend resisting strongly any suggestions for improving this.
Sponge
This silly thing is "DO WAIT c ENDLOOP". It is always loaded last to use up the original process because returning to the image file calls StopMesa[]. Lives on [Indigo]<Grapevine>MS>*.
Building a New Version
First make sure you are a member of LaurelImp^.pa. That gives you owner rights to the Grapevine directories on Ivy and Indigo.
Then find a public Dorado with an Alto partition that you're willing to destroy. Then use CopyDisk to copy from [Ivy]<Grapevine>Build.bfs to BFS. Then boot the Alto partition. I think it is password protected, and I forget the password - it's probably either "Viticulture" or "Botrytis". If necessary you can break in from the Alto NetExec using the NewOS command. There are several command files for building or manipulating the various components. There are files such as source.server, source.lily, source.maintain, object.server, files.server, etc. For example, to rebuild Server.bcd you would say
Compile @source.server@;Bind Server/c
Note the "/c" to cause code copying. I believe the binding step standardly produces one warning about code/frame links. There are different command files for VM (different implementor!) - look at *.cm. When you're finished, back up the modified files on [Indigo]<Grapevine> by saying something like
FTP Indigo Dir/c Grapevine>ms st/u @files.server
Then use CopyDisk to copy the disk image back from BFS to [Ivy]<Grapevine>Build.bfs!1. Use the explicit version number to avoid creating another copy on Ivy (the disk quota won't let you anyway!). Note that the disk image is kept on Ivy and the source backups are on Indigo. This is deliberate - I don't entirely trust even IFS backups.
Finally, it's polite to restore some sort of Alto partition on the Dorado. Standardly, this comes from [Indigo]<BasicDisks>Mesa6-14.bfs. Then use the Alto Executive's Install command to extend the file system to use both model 44's.
Distributing a New Version
I don't know how to do this onto Dandelions. For Altos, use the FTP server provided as part of the running Grapevine server. Be careful: the running server is swapping code from the BCD presently on its disk. So first store the new version under a new file title, then rename the old version as something else, then rename the new version as the real file title. For example, store a new Server.bcd as remote file NewServer.bcd then rename remote file Server.bcd to be remote file OldServer.bcd and rename remote file NewServer.bcd to be remote file Server.bcd. You mustn't delete OldServer.bcd until you've restarted the server. Usually I don't bother to delete OldServer.bcd at all, so you will generally find that file already exists and has to be deleted before you do the first rename. The FTP server supports enumeration with the pattern "*" only. I don't guarantee what will happen if the disk gets full, so try to be consistent about the file title OldServer.bcd so that you know to delete it next time.
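The order of operations matters, so here it is as a sketch. This is Python operating on a dictionary that stands in for the remote file system; it is not an FTP client and not part of Grapevine, and the function name is invented. The point it illustrates is that the file the running server is swapping from is only ever renamed, never overwritten or deleted.

```python
def install_new_version(remote, title, new_contents):
    """Swap a new file into place on a simulated remote disk (hypothetical helper).

    remote maps file titles to contents; title is e.g. "Server.bcd".
    """
    stem, ext = title.rsplit(".", 1)
    new_title = f"New{stem}.{ext}"
    old_title = f"Old{stem}.{ext}"

    remote[new_title] = new_contents        # 1. store under a new file title
    remote.pop(old_title, None)             # 2. delete the leftover Old* from last time
    remote[old_title] = remote.pop(title)   # 3. rename the running version out of the way
    remote[title] = remote.pop(new_title)   # 4. rename the new version to the real title
    # Old* is deliberately left behind: it must survive until the server restarts.


disk = {"Server.bcd": b"running", "OldServer.bcd": b"previous"}
install_new_version(disk, "Server.bcd", b"new")
print(sorted(disk))  # ['OldServer.bcd', 'Server.bcd']
```

After the call, OldServer.bcd holds the version the server is still running from, which is why it must stay until after the restart.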
Once you have the software installed, chat to the Viticulturists' entrance to restart the server. Log in, enable, and use the Restart command. This defaults the new command line to be "@Server", which is generally what you want. We rarely need to involve the local administrators in this exercise, but it's best to warn them so that they don't worry when it looks as if their server just stopped.
Server Support
This section is about how to deal with ailing servers. Of course, I don't remember all the things that have happened. Usually when something goes wrong you get a phone call from an administrator or an irate user. Sometimes they don't notice for a day or more. Usually the incident is restricted to a single server, but occasionally it is more widespread.
Be aware of the possibility of some systematic software bug crashing multiple (or all!) servers. Also, be aware of the rolling blackout effect, where a problem in one server causes load shedding to another server, which in turn becomes overloaded and sheds onto other servers. This can result in the entire system becoming clogged, and that's very difficult to get out of (see the sub-section on restricted start-ups for some hints).
Many times when it looks as if Grapevine is performing badly or not at all the real cause is that the internet is being flakey. Occasionally some internet links are sufficiently marginal that the servers think they are alternating between up and down, and this can cause very large loads on the servers trying to forward mail to each other. This also causes large loads on the slow links, which aggravates the original problem. Anything going through Polaris is liable to this (because it's the center of a star), as is traffic from Palo Alto to El Segundo (because of the multitude of hops on some of those routes).
Starting and Stopping
The servers are all configured so that they are started by getting to the Alto executive and typing "@Server.cm". They can be stopped safely by typing shift-Swat. Just pressing the boot button may stop them in the middle of a disk transfer and so cause a disk error to be reported on a subsequent restart. If the server is in the debugger, remember to get out by typing "K". Some people try typing "Q", but that just kills the current process.
Database Files
All the interesting information used by a server is kept in ordinary files, without any complexities of disk addresses, etc. So in principle a server could be moved just by moving all the files (and re-installing the OS, Swat, debugger, etc.). I have never tried this. All the BTree files are recalculated on every restart. The files slQueue.*, heap.data and mbx.mailboxes are "truth".
The Viticulturists' Entrance
See the Grapevine Interface document for a summary of the commands. The interesting ones are not available until you are logged in and enabled. Some commands are designed to provide temporary relief from overload. Use SetPolicyControls to reject offered load or to suspend processing of message queues or reading of update messages. If the server has run out of free heap, use SetPolicyControls to stop things getting any worse, then use ForceArchive to archive a mailbox. Then, once you have some free heap, use SetArchiveDays and ForceBackgroundProcess to run the background mailbox archiver. The default is to archive mailboxes that haven't been emptied for seven days, so archiving with a four day threshold should yield quite a lot of free heap (once the compactor gets around to it). Note that SetArchiveDays applies only to the next run of the archiver process. Remember that the policy controls may prevent the archiver from running (particularly the one about allowing only one background process at a time). After things are looking better, determine why you were short of free heap. It may be that the server has too many mailboxes or too many registries, or it may be a transient that you're willing to accept (unlikely).
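To see why lowering the threshold helps, consider which mailboxes each setting would select. The following is a sketch in Python with made-up mailbox names and ages; it is not the server's archiver, and treating the threshold as "at least N days" is an assumption about the exact comparison.

```python
def mailboxes_to_archive(days_since_emptied, threshold_days=7):
    """Return names of mailboxes not emptied for at least threshold_days (illustrative)."""
    return sorted(name for name, days in days_since_emptied.items()
                  if days >= threshold_days)


# Hypothetical inboxes and the number of days since each was last emptied.
inboxes = {"A.reg": 2, "B.reg": 5, "C.reg": 9}
print(mailboxes_to_archive(inboxes))     # ['C.reg']
print(mailboxes_to_archive(inboxes, 4))  # ['B.reg', 'C.reg']
```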
The ObserveLog command is useful for watching how a server is getting on (and it may be convenient to use a second chat connection while you're doing that). There are log entries for the free heap at 5% increments (1% if it's less than 10% free). DisplayStatistics and DisplayPolicyControls are also useful for monitoring the server.
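When watching the log it helps to know where the next free-heap entry should appear. The sketch below, in Python, is only one reading of the "5% increments, 1% below 10%" rule and is not taken from the server sources; both function names are invented.

```python
def heap_log_mark(free_percent):
    """Return the logging mark (whole percent) at or below a free-heap reading.

    Assumption: marks fall every 5% at 10% free or more, and every 1% below that.
    """
    step = 1 if free_percent < 10 else 5
    return int(free_percent // step) * step


def crosses_mark(previous, current):
    """True if moving from previous to current free heap would pass a logging mark."""
    return heap_log_mark(previous) != heap_log_mark(current)


print(heap_log_mark(23))       # 20
print(crosses_mark(23, 21))    # False
print(crosses_mark(21, 19))    # True
print(crosses_mark(9.5, 8.7))  # True
```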
Restricted Start-ups
Sometimes the server is sufficiently overloaded, and the offered load is so great, that by the time it's restarted it has already got itself into more trouble before you can log in to the viticulturists' entrance. You can prevent this by getting someone local to type control-v while it's loading Server.bcd. Then it will ask for a password ("Viticulture"). Then it will set all policy controls (except the viticulturists' entrance) to "not allowed" before continuing with the restart. Remember to set them all back to "allowed" when you're finished. The last time we had a global Grapevine crash, this was the only way to recover. The first server to restart (Cabernet) was accepting connections at one every two seconds for the first hour, and was rejecting vastly more requests than that.
Errors
There aren't many bugs in the existing software. The ones that exist happen every thousand hours or so, and we tend to ignore them. In fact, the local administrators tend to ignore them and just restart the server. The Alto Mesa debugger is available if necessary, but it's very painful (and runs in only 64K). Often, the debugger needs to be re-installed; type
Runmesa Xdebug Fetch
There are some common symptoms. Note that the cursor tracking is not high priority, so it can freeze just because of a CPU loop elsewhere. You can usually tell a CPU loop from the "Idle Time" statistic on the screen display. There are occasional deadlocks because of too few VM buffers (because of the way that deadlock is implemented, it looks like a CPU loop). There are occasional deadlocks because of too few PUP buffers ("Free PUPs" on the screen). There is a bug that causes an uncaught signal from the BTree package. You can tell if it died on the way to a broken debugger by whether the cursor still has the "grapes" or not.
Disk Errors
Sad. The administrators have standing orders that if they run the scavenger they should tell it not to modify the disk. This is to avoid losing the information about what page is bad. They are not very good about following this rule. Often, the best way to proceed is to let the scavenger fix things up, then restart and see what the restart sequence complains about. If it doesn't complain, things are probably ok. If it does complain, you have a choice. You can try to patch the data structures back together with the remote disk patching tool, or you can cut your losses and reconstruct the server from scratch (thereby losing whatever mail was buffered there). Patching is very intensive in the time of the experts, and may not be worth it. Historically, we have patched things back together.
Administrative Support
This section is about how to assist the local administrators when something goes wrong with the system, other than blatantly broken servers.
"Turkey.reg can't read his mail"
This is probably the most common problem. Sometimes it's not really our fault. This includes bugs in Laurel or Hardy, or disk full on the user's workstation. (With Laurel, if you click NewMail with the right button, it will delete messages from the inbox one at a time, so you can get as many messages as will fit the disk, process them, then get some more. This is useful if the user has a vast number of messages to read.) Most other occurrences are related to reading archived mail. Usually this is just because the archive IFS is down, or an internet link to it is down.
The cause of real difficulty in reading mail is almost always something to do with archive files. There are several things the IFS administrator can do to cause this, so check them first. The IFS might not have its Leaf server enabled (see whether the IFS version number is followed by an "L"). The IFS access controls may have been mangled to prevent the server accessing the files. Try chatting to the IFS and logging in as the mail server (e.g. "Cabernet.ms") and connecting to the DMS directory. Check the file protections on the archive files. Remember that the IFS administrator can (wrongly) configure the IFS to avoid using Grapevine for access control lists.
The administrator has been known to delete archive files from the IFS. Sometimes the DMS directory gets restored (incompletely) from backup. Sometimes archive files actually get corrupted. In these cases, information has been lost and cannot be recovered. There is a tool "Vulture" for patching archive files, but I don't think it is worth the trouble. Instead, get the user to read as many messages as possible using Lily. Then give the user a copy of the actual archive file(s) if desired (as text files). Then delete the user. This will cause the servers to abandon any attempt at reading his inbox and just flush it. Then recreate the user. (This will require the actions described below under "I just deleted Turkey.reg").
Emptying the mailbox of Turkey.reg
This requires a little care, because the wrong actions can cause vast quantities of messages to be brought onto the server for return-to-sender processing. (This was the cause of the SOSP7 disaster.) To prevent this, delete the individual entirely - this will cause the server to just flush the mailbox with no attempt at remailing. Just removing all mailbox sites is usually wrong - it will cause remailing to be attempted. Note that the return-to-sender logic will discard a message that's more than seven days old without telling anyone (although it writes a log entry).
"I just deleted Turkey.reg"
Normally, a name can't be recreated until the "dead" entry is purged about fourteen days later. This can be circumvented manually. First, wait until every server for the registry believes the entry is dead (use SetServer and TypeEntry in Maintain to check this). Also be sure that any other updates for the name have finished propagating. Note that you can't do this unless all the servers for the registry are accessible. Then chat to each server in turn, log in and enable, and give the ForcePurge command. Then it's ok to recreate the name.
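The precondition for ForcePurge can be stated as a simple check over what each server reported. This is a Python sketch, not a Grapevine tool; the function, the state strings and the server names are invented, and the states would in practice be read by hand with SetServer and TypeEntry.

```python
def safe_to_force_purge(entry_state_by_server):
    """True only if every server for the registry answered and reports the entry dead.

    entry_state_by_server maps a server name to "dead", "alive", or None
    (None standing for a server that could not be reached). Hypothetical helper.
    """
    return bool(entry_state_by_server) and all(
        state == "dead" for state in entry_state_by_server.values()
    )


print(safe_to_force_purge({"ServerA.gv": "dead", "ServerB.gv": "dead"}))  # True
print(safe_to_force_purge({"ServerA.gv": "dead", "ServerB.gv": None}))    # False
```

The check says nothing about other updates for the name that may still be propagating; that remains a matter of judgement.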
Orphan archive files
Sometimes the server will be unable to delete an archive file. In that case, it just forgets about the file. You can verify that an archive file is an orphan by using the DisplayInboxes command of the viticulturists' entrance on the corresponding mail server. If that mail server believes the user has no archived messages, then you may manually delete the file(s).
LaurelSupport.pa
Some users think Laurel support means Grapevine support, so you should be on the LaurelSupport.pa list. Ignore messages that are purely about Laurel - someone else will handle them (John White at present).
DeadLetter.ms
This name receives a message whenever something gets returned to sender, or whenever a distribution list owner is notified about invalid names on the list. These messages can generally be ignored. This name receives messages that can't be returned anywhere else because their return-to name is invalid - these are quite rare. This name is also the return-to name for registration server and mail server internal mail. It is also a member of Postmaster.pa, which occasionally receives enquiries or comments from the ArpaNet about our mail service - try to be polite!
Large numbers of returned messages
There is usually a steady trickle of messages to deadletter (about 10 or 20 per day). If there is a sudden burst, try to understand why. Most frequently they are because of message time-outs because some internet link or server was down for too long (more than two days, excluding weekends).
Returned RS internal mail
Each occurrence of this means some update has failed to propagate. The text part of the message says what name the update was for, and you get told which server the message was from and to. There are often multiple messages for one name. For each name whose update did not propagate, you must cause a propagation from the server that originated the update. Do this with Maintain (using SetServer for the appropriate server), by using SetRemark or SetConnect. These always cause a full update to propagate (AddMember and so on cause only part of the entry to be propagated).
Returned MS internal mail
Most of these are not important. The only ones that matter are if the message was addressed to a mail server that no longer should have an inbox for the individual. In that case, the message was intended to provoke remailing from the inbox. You need to trigger this manually by adding that inbox site for the individual then removing it again.
Tools
Over the years we have accumulated some hacks that assist the central administrators. The programming quality is typically quite low. These are all Cedar programs. DBCompare, GVWatcher, InvertDLs and DLMap live on [Indigo]<Grapevine>Tools> and they are contained in GVTools.df. The others live on [Indigo]<Grapevine> somewhere.
DBCompare
Compares the database on one server with copies elsewhere. Useful for recovering from inconsistencies. If you find an inconsistency, fix it by doing a SetRemark or SetConnect on whichever server has the better version of the entry (or on both servers if you can't tell which is better).
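The idea behind the comparison can be shown in a few lines. This is a Python sketch over plain dictionaries, not the Cedar DBCompare program; the function name and the sample entries are invented, and a real entry has far more structure than a single string.

```python
def compare_databases(copy_a, copy_b):
    """Return {name: (entry_in_a, entry_in_b)} for every name whose entries differ.

    A name missing from one copy shows up with None on that side. Illustrative only.
    """
    differences = {}
    for name in sorted(set(copy_a) | set(copy_b)):
        a, b = copy_a.get(name), copy_b.get(name)
        if a != b:
            differences[name] = (a, b)
    return differences


# Hypothetical copies of one registry held by two servers.
on_a = {"Turkey.reg": "remark v2", "Junk^.reg": "members: 3"}
on_b = {"Turkey.reg": "remark v1", "Junk^.reg": "members: 3"}
print(compare_databases(on_a, on_b))  # {'Turkey.reg': ('remark v2', 'remark v1')}
```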
DBPurge
Enumerates the entire registration database looking for group members/owners/friends that are "dead" names, and removes them. This is the best solution to the problem of someone being deleted and not being removed from lists in another registry. The plan was to run this as a once per day process on the Cedar server-server. Alternatively, get Ron Weaver to run it overnight on his machine. If you arrange for this to run systematically, you will reduce the number of return-to-sender messages, and you can tell local administrators to stop doing RemoveAllMemberships when they delete a name. (They would love to stop doing this!)
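In outline the purge is a single pass over every group. The following Python sketch works on an in-memory dictionary rather than the real registration database and is not the DBPurge source; the function name, data layout and sample names are all invented.

```python
def purge_dead_names(groups, dead):
    """Remove dead names from each group's members, owners and friends.

    groups maps a group name to {"members": [...], "owners": [...], "friends": [...]}.
    Returns a report of what was removed. Hypothetical helper.
    """
    removed = {}
    for group, lists in groups.items():
        for kind in ("members", "owners", "friends"):
            gone = [n for n in lists.get(kind, []) if n in dead]
            if gone:
                lists[kind] = [n for n in lists[kind] if n not in dead]
                removed.setdefault(group, {})[kind] = gone
    return removed


groups = {"Junk^.reg": {"members": ["Turkey.reg", "Other.reg"], "owners": ["Turkey.reg"]}}
print(purge_dead_names(groups, {"Turkey.reg"}))
# {'Junk^.reg': {'members': ['Turkey.reg'], 'owners': ['Turkey.reg']}}
```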
GVWatcher
Chats to every server and gets its various statistics. This is quite quick and is a good way to get a feel for whether everything is running ok. It's very simple and should be easy to convert to Pilot.
InvertDLs
Ron Weaver runs this periodically. It runs for several hours and produces files saying what lists everyone is in.
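The inversion itself is simple once the lists are in hand; the hours go into enumerating them. A minimal sketch in Python, assuming the lists have already been read into a dictionary from list name to members (the function name and the example names are hypothetical, and this is not the Cedar code):

```python
from collections import defaultdict

def invert_lists(lists):
    """Turn {list name: [members]} into {member: sorted [list names]}."""
    memberships = defaultdict(set)
    for list_name, members in lists.items():
        for member in members:
            memberships[member].add(list_name)
    # Sort both levels so the output files are stable from run to run.
    return {member: sorted(names) for member, names in sorted(memberships.items())}
```

For example, a member of both a hypothetical Staff^.pa and Lunch^.pa would appear once in the result, with both list names against it.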
DLMap
Ron Weaver runs this periodically. It runs for several hours and produces files saying what distribution lists exist and some information about each list.
Vulture
This is a utility for patching archive files. It has only been used once and has not been rebuilt for the current version of Cedar. I have never used it. Incidentally, a file containing all zeroes looks like an archive file all of whose messages have been deleted.
Remote Disk Patching
This is a server module for running on the target disk, and a user interface for inside Cedar. It understands about Heaps, and allows patching of heaps or any other file on the target disk. It has not been rebuilt for the current version of Cedar. I have never used it. I believe there is a documentation file with it.
Beware, "well known" facts, and other odds and ends
This section is a collection of things everybody who works with Grapevine should know.
Flushing a mailbox
In this context, flush is a reserved word. If an IFS crash destroys an archived mail file, the server gets all confused and the user sees "failed". The only simple way out of this is to flush the mailbox. To do that, don't delete the user, but just remove all of his mailboxes. If you only remove the mailbox at the sick server, that server will bang its head against the IFS trying to forward the mail in the missing file to the other server(s). After the servers have deleted the mailboxes, add them back to the user. (I don't know any easy trick for how long to wait.)
Moving Mailboxes
Moving mailboxes is tricky. If you move too many messages too rapidly, you can get the servers all tangled up. Don't move any mailboxes with archived mail. It's just too risky. Either flush the mailboxes, or get the users to read their mail.
Recovery of a GV server...
The following is from a mail message I sent out a while back:
If a GV server crashes, one of two things will happen:
A) Just boot it and everything starts up fine, normal glitch.
B) Some problem prevents it from restarting, usually a disk related error.
Case A is trivial. This message describes what to do in case B. There is another reason why you might want to go through this procedure: if the server is running OK but is racking up a number of disk errors, it might be headed towards a crash. You can follow this procedure in order to reformat and/or replace the disk before it dies...
Try to save all of the important data files from the Grapevine volume to an IFS before doing much else. Do this by running in Copilot and opening the grapevine volume:
>open grapevine
and then push it onto the search path:
>push <grapevine>
Now you can use the file tool to put the stuff on the IFS. (NB: you need lots of room.)
The important files are:
Heap.* (3 files)
MBX.Mailboxes
SLQueue.* (about 5 of these)
If you store all of these successfully then you are OK. If you can't save them all then you might be able to recover with some additional work if the critical ones are OK. Critical ones are ... If something catastrophic happened that causes you to lose everything then you're SOL. Bring up the server as if it were a new server...
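Before touching the hardware, it is worth checking the saved copies against the list of important files above. The following Python sketch is one way to do that from a listing of the file names stored on the IFS. The function name and the way the listing is obtained are hypothetical, and because the number of SLQueue files is only "about 5", the sketch assumes that at least one is enough to pass the check.

```python
from fnmatch import fnmatchcase

# Pattern -> minimum number of matching files expected.
# The SLQueue minimum of 1 is an assumption; the memo only says "about 5".
EXPECTED = {"Heap.*": 3, "MBX.Mailboxes": 1, "SLQueue.*": 1}

def missing_files(saved_names):
    """Return (pattern, found, minimum) for each pattern that falls short."""
    problems = []
    for pattern, minimum in EXPECTED.items():
        found = sum(1 for name in saved_names if fnmatchcase(name, pattern))
        if found < minimum:
            problems.append((pattern, found, minimum))
    return problems
```

An empty result means every expected file pattern was matched often enough; anything else names what still needs saving.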
Once the files are saved, do whatever you need to do to make the server OK again, i.e., reformat the disk, replace the disk, replace boards...
Once the hardware is fixed then set up the machine as if you were getting ready to bring up a new GV server, but don't boot it to the GV volume yet. Once this has been done boot Copilot. From Copilot open grapevine for ReadWrite:
>open grapevine/w
and then push it onto the search path:
>push <grapevine>
(If at this point it yells something about no root directory then you will need to boot GV with the i switch (this is just so that you can stop it more easily). This should create the needed root directory on the grapevine volume. In a minute or two the grapevine volume will ask you about heap size; Shift-Stop back to Copilot. Now try to open and push as before.)
Retrieve all of the files from the IFS that you saved earlier. Now close the grapevine volume:
>close grapevine
and then reboot Copilot. (I'm not sure if this is really necessary but you want to MAKE SURE that Copilot doesn't have the grapevine volume open for write anymore...)
Now boot grapevine without any switches. At this point it should start running and you're home free.
Note: Unlike when bringing up a "new" GV server, don't change the password when following this procedure. The server will remember its old one. (And it will get very confused if you do try to change it.)
Now, what did I miss?..