A Proposal for Compute Servers in Cedar
Robert Hagmann
This document is a collection of ideas about how to do computation on multiple machines in Cedar. It is not complete, and suggestions are actively solicited. This is a working document that has very little polish, so please forgive its incomplete and uneven presentation. No resources have, as yet, been committed to this effort.
Introduction
Xerox PARC has used the workstation/server model for computation for over a decade. In that time, very little has been done, with the brief exception of the "Worm" four or five years ago, to extend this model to provide for cooperation between equals (instead of workstations and servers), load sharing, process migration, and the execution of large computations. In addition, the Dragon project should give us multiple-processor workstations in the next few years. We should gain more experience in the use of multi-processor computations before these machines are available.
We are rich in total compute power, but often much of this power cannot be tapped by a user. If a computation cannot easily fit on the user's workstation, then the user has to use a more powerful workstation (a public Dorado), adjust the problem (e.g., accept inexact solutions), or change work habits (work on a public Dorado or at night).
In the product world, there is also the notion of a personalized workstation. Again, there is no facility to increase the computational power available to a user beyond that provided by the workstation. Even a workstation powerful enough to perform any single computation can saturate and become incapable of doing other useful work.
We do no load sharing or process migration. All of our inter-machine communication is in the server/client paradigm; two processes cannot easily communicate if they are on different machines unless one is willing to act as a server. This means that a machine may export an RPC-style interface, but it must then act as a server for that interface and respond to all calls on it. An RPC call always starts up a new process; therefore, application-dependent code must be written to identify which of the many computations running on the machine the call is for, and to synchronize the RPC call with it in an ad hoc manner.
The goal of this work is to provide a framework for others to build applications and do experiments. It will not be a solution to, say, the problem of restarting distributed programs. However, the primitives should be there for experimentation, and a simple default form of recovery will be provided.
Basic Idea
We will have several Cedar machines (the compute server cluster) that are not personal machines. Originally, these will probably be DLions that will be augmented by public and personal Dorados at night. In the future, Dorados and Dragons would be added. These machines may have special properties such as large memories, special hardware, or faster processors.
Clients can request that certain commands (similar to Commander registered commands) be performed on the cluster. These commands can be run with the cooperation of the workstation (mostly for file storage), or can be run detached from the workstation, relying only on the file servers for data storage. The latter mode of operation will make it possible for non-Cedar users to access the cluster.
To see how this might work, consider the case of running a compile on the compute cluster with the cooperation of the workstation. By running a bcd on the workstation, a registered command "Compile" can be added to the CommandTool. The user then just types "Compile Foo" to a CommandTool, and a remote compile would occur. The result bcd would appear on the local workstation's file system. Viewers to the error log and other such processing would run on the workstation and be identical to current operations.
One design goal is to use as many current programs as possible with little change except for rebinding. These programs typically swamp the workstation, but are usually coded in a sequential style. An example of this is the Cedar Compiler. Other programs will be developed by users to take advantage of the fact that there is more than one compute server in the cluster; these will attempt to divide the work among the machines in the cluster with the assistance of the compute server software.
There will be a set of machines that will always be compute servers. In addition, pool machines may be added to the cluster (usually at night). Provisions are made for personal machines to participate in the cluster to a limited extent, but the protocol for use and allowable operations on these machines will have to be determined.
Issues about machine and process crash, protection, resource location, process communication and coordination, naming, process checkpointing, process status and progress, and resource consumption on personal machines will all be addressed in the system. Most of these are described in the next section.
Implementation Details
This section deals with how the server cluster would operate. It is moderately detailed and requires some familiarity with Cedar.
Types of Programs that can be served
There are two classes of programs that this service is designed to run. The first is the existing sequential program that basically performs some command (in the Cedar sense of command). The command should be compute- or memory-intensive, and be willing to communicate with the client in a restricted fashion. The command receives a ROPE (a string in Cedar), and returns a ROPE. A stream or two will be opened between the client and the remote process. So far this looks quite similar to the parameters passed to a command. Finally, the file system is used as the means of bulk communication. Thus, files are the primary means of input and output for services running on the cluster. Files may be stored from/to the workstation (for Cedar workstations), or from/to file servers. Good examples of existing programs that fit this model are the compiler, binder, typesetter, scan conversion, dithering, and some DA software. Additional processes can be FORKed on the server, but they will run on the same machine.
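As a rough illustration, a command in this first class might be registered with the compute server package as a procedure of approximately the following shape. The interface name ComputeCommandServices and the exact parameters are placeholders of mine, not part of any settled design:
ComputeCommandServices.CommandProc: TYPE = PROC [args: ROPE, in, out: STREAM] RETURNS [result: ROPE]
ComputeCommandServices.RegisterCommand [commandName: ROPE, proc: ComputeCommandServices.CommandProc]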
A second class of software is that which does not have to be sequential. The programmer has coded the algorithm to take advantage of some large-grain parallelism. The programmer, together with facilities provided by the compute server package, explicitly starts up parallel computation and participates in its coordination. ROPEs, streams, and files should handle the bulk of the communication between processes, but assistance will be provided to help parts of the computation find each other (i.e., import the remote interface, in RPC terminology) and communicate via RPC.
Reasonable precautions will be taken by the system when adding a new command. The bcd file will be bound together with some support routines to detect use of improper interfaces. This is not a guarantee that improper activity cannot occur. Trojan horses, processes that manipulate the binary image of objects and memory, circular data structures, non-robust programs (?), FORKed processes that never terminate, and excessive forking are all examples of commands that are not "good citizens"; they will be eliminated when they are found.
Commands should have last-one semantics: the command can be partly or completely executed a number of times, and the final complete execution provides the desired result. This is fine for most commands (e.g., the compiler), but not for others (e.g., the standard database update query "give everyone a 10% raise").
How the System Finds a Server to Use
Each local computing area (e.g., CSL) will have a "Summoner" entry in Grapevine. By looking up the entry (e.g., "PaloAlto.Summoner"), the machine ID to connect to can be obtained (e.g., "3#35#"). This machine is the controller of the compute service for that area's cluster. The RPC interface for the controller is then imported (this is RPC terminology) from this machine. This enables RPC calls to the controller.
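In rough outline, the binding might proceed as follows; the names here (including the Summoner interface itself) are placeholders standing in for the real Grapevine and RPC machinery, not actual calls:
controllerID: ROPE ← LookupInGrapevine ["PaloAlto.Summoner"]    -- placeholder for the Grapevine name lookup; yields, e.g., "3#35#"
controller: Summoner.ControllerHandle ← Summoner.ImportController [machineID: controllerID]    -- placeholder for the RPC import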
The command can run with the cooperation of the workstation or in a detached mode. In the first case, the RPC call starts the binding process to find a compute server. The controller selects a machine from its pool for the client to use. The machine ID is sent back to the client's machine. The interface from the server machine is then imported in preparation for the computational requests to be made. The server is now located, and the command is ready for startup.
For detached mode, the command request is passed to the controller via an RPC call. When this call returns, the request is completely in the control of the controller. No further communication with the workstation is needed. The controller selects a compute server, imports the interface, and starts up the command. The controller takes responsibility for restarting commands on crashed servers.
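A detached request might then amount to a single call of roughly this form; the interface and parameter names are placeholders that simply mirror the specification described under Process Communication below:
Summoner.RequestDetached [commandName: ROPE, dfFile: ROPE, resultServer: ROPE, resultDirectory: ROPE, arguments: ROPE, userName: ROPE, password: ROPE] RETURNS [requestID: ROPE]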
None of the above should be visible to the client or user. A computation is requested by the client, and all the lookups and bindings happen automatically inside the support code.
There will be primary and secondary machines for the controller. User profile entries will tell these machines who they are. If either of these machines fails to connect to the controller given by Grapevine, it has the right to declare itself the controller and register that fact with Grapevine. When the primary or secondary machine is up, it will regain the controller function from any other machine, and the primary will regain it from the secondary.
The controller will also be called by the compute server machines to record their hardware characteristics and performance statistics. Other information sent includes the commands that the server is willing to accept, and any restrictions imposed at the server (typically for a personal machine that is temporarily a server). This information will be stored on a file server to allow a smoother transition if the controller crashes.
Process Initiation
The compute server software supports two models of computation. The first is the process model, described in this section, and the second is the object model.
From the above, the client's machine or the controller now knows the machine ID of the server to use to compute. The RPC interface for the server is then imported, and a call is made to start up the service. A ROPE is passed as an argument, streams are established, and provisions are made to handle FS requests. For detached computations, input and output streams will be redirected from/to files.
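For concreteness, the startup call from the client's machine to the chosen server might look roughly like the following; ComputeProcessServices, its parameters, and the ChildHandle type are placeholders of mine, not a committed interface:
ComputeProcessServices.StartCommand [commandName: ROPE, args: ROPE, in, out: STREAM ← NIL, restartOnCrash: BOOL ← FALSE] RETURNS [child: ComputeProcessServices.ChildHandle]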
When the request reaches the server, it may have to "Run and Start" (load into memory and initialize) some programs that it finds in a special directory. Once this has occurred, the command can be started. Servers are permitted to refuse to do computations. The client then must ask the controller again for a machine to run the computation.
Object Model Computation
To do object-based computation, we have to be able to do the following things: instantiate the object type, create new objects, manipulate the objects by making procedure calls on them, destroy existing objects, and uninstantiate the object type.
To assist in this, the server will provide a facility to start up the type on a (possibly) remote machine. This will instantiate the service on a machine and return a handle to it. This handle can be freely passed around in the computation (e.g., sent as a parameter in an RPC call), and used from any machine in the cluster.
When the service starts, it must register four procedures with the compute server. These procedures are for "create", "actions", "information", and "destroy". Once registration occurs, the type is open for business, and the original startup call to instantiate the type can return. On the machine that instantiated the type, object registration, deregistration, and handle to object lookup will be provided.
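Registration of these four procedures might look roughly like the call below; the names (including the procedure types CreateProc, ActionsProc, InformationProc, and DestroyProc) are placeholders that simply mirror the atom-and-ROPE conventions of the DoAction calls listed later:
ComputeObjectServices.RegisterType [typeName: ROPE, create: CreateProc, actions: ActionsProc, information: InformationProc, destroy: DestroyProc]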
There are four operations supported by the system for types. They are restricted in what they can perform, but an RPC interface can always be constructed to overcome any difficulties. First, an object may be created. To perform actions on the handle, an atom and a ROPE are passed, and a ROPE is returned. To do actions on two to five handles, they are passed together with an atom and a ROPE, and a ROPE is returned. The second through fifth handles do not need to be of the same "type" as the first handle. For both of these calls, when the call arrives on the server the (first) handle will be mapped to the object before the call is made. Finally, objects may be explicitly destroyed. Reference counting is not done. Dangling references are allowed; they will cause an ERROR to occur.
Once a type is no longer needed, it may be destroyed by use of the handle to the type. If the type was created with the option to destroy it on process exit, then when the creating process exits, the type will be destroyed unless it has already been destroyed. I may implement remote reference counting and server-wide garbage collection, but don't count on it.
Examples of these calls are:
ComputeObjectServices.StartService [commandName: ROPE, destroyOnExit: BOOL ← TRUE, machineID: ROPE ← NIL, loadFileName: ROPE ← NIL] RETURNS [ComputeObjectServices.TypeHandle]
ComputeObjectServices.CreateObject [type: ComputeObjectServices.TypeHandle] RETURNS [ComputeObjectServices.ObjectHandle]
ComputeObjectServices.DoAction1 [handle1: ComputeObjectServices.ObjectHandle, code: ATOM, param: ROPE] RETURNS [result: ROPE]
ComputeObjectServices.DoAction2 [handle1, handle2: ComputeObjectServices.ObjectHandle ← NIL, code: ATOM, param: ROPE] RETURNS [result: ROPE]
...
ComputeObjectServices.DoAction5 [handle1, handle2, handle3, handle4, handle5: ComputeObjectServices.ObjectHandle ← NIL, code: ATOM, param: ROPE] RETURNS [result: ROPE]
ComputeObjectServices.DestroyObject [handle: ComputeObjectServices.ObjectHandle]
ComputeObjectServices.StopService [service: ComputeObjectServices.TypeHandle] RETURNS [returnCode]
Process Checkpoint
Long running processes are encouraged to checkpoint their results in a stable manner. Examples of a stable checkpoint are files written and committed to Alpine, or files written to the local disk of the controller of the computation (either the user's workstation, or for detached computation, the initial machine in the computation) that are renamed (an atomic action) to the checkpoint file name.
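A minimal sketch of the second style of checkpoint follows; WriteFile and RenameFile are placeholders for the actual file operations, not the real FS interface:
WriteCheckpoint: PROC [state: ROPE] = {
  WriteFile [name: "Checkpoint.new$", contents: state];    -- write the entire new checkpoint to a scratch file
  RenameFile [from: "Checkpoint.new$", to: "Checkpoint"];  -- the rename is atomic, so a crash leaves either the old checkpoint or the complete new one
  };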
Unfortunately, the ability to checkpoint and restart is an application dependent function (at least in current Cedar). No real assistance in this area will be given by the system.
Process Communication
The simple case is where an existing command is to be executed remotely. The command is started, and given an input ROPE and a pair of streams. The workstation cooperates in the execution of the command. All workstation file system (FS) operations are performed remotely as viewed from the server: they are done on the client's machine. Thus, the output files from the command are written onto the workstation's local file system. The file name bindings (attachments) are those of the workstation. Global files opened for read that are readable from "world" (or maybe "CSL^.pa") will be cached on the server and opened there. When personal workstations act as servers, this caching may be prevented or restricted to reduce the impact on the file cache, and its LRU performance, on these machines.
For detached commands, the specification of what to do consists of the command to execute; a "df" file; a server and directory to write results to; and a string (ROPE) of arguments (together with the user name and password). The "df" file is just a df file as used within Cedar. It specifies the binding from local file names to global file names (files on file servers) to use for the duration of the command. For example, it might encode the fact that local file name "BasicTime.bcd" really refers to file "[Indigo]<Cedar5.2>BasicTime>BasicTime.bcd" with a certain create date. An open for read, say by the compiler, would really open some copy of the file cached from the server. An empty directory with a unique name will be created on the server, and the name bindings are made into this directory. The specified command is run, and upon completion any files found to have been created in the local directory are copied to the specified server and directory. The copy will be done with the user's userid/password. Only files in the new local directory, in "///Temp/", or in their sub-directories, may be written. After the copy, the new local directory will be destroyed, as well as all files created in ///Temp/ by this command. On server startup, old files in ///Temp/ will be deleted if possible.
For requests that are started in a detached mode, the requester may not be available to get the completion notice. Mail will be sent to inform the user that the request has completed. Results will be on file servers.
Normal communication is to pass in a ROPE with the parameters; establish and maintain byte streams between child and parent; and return a result ROPE. Normal bulk communication is via the file system: either the results are written directly back to the workstation or copied out to a file server at the end of a command.
However, if a command elects to perform its processing in parallel, or if object model computation is being done, the system will provide some assistance. First, a request for parallel execution of a command will create a child on a (possibly) remote machine. A handle is returned to the parent to be used with this child. From then on, the parent can use this handle to perform various functions with the child. Optionally, the system will restart the command if the server on which the child is running crashes. On the parent, the functions on the handle are as follows (a sketch of possible signatures follows the list):
Request status: return a ROPE specifying the progress of the child, or a completion code
Abort process: kill the remote process and clean up the handle
Join process: accept completion and results from the child
Wait: on a list of handles, wait until one of them has a completion code different from running
Find machine ID and unique process ID for child: this can be used to import interfaces from the child, and to discriminate RPC calls from children by using the unique process ID. Note that if automatic restart is used, the machine ID and unique process ID for the child may change asynchronously.
Get ID: for each child created, there is a unique ID. These ID's are intended for use by the parent and child for synchronization and discrimination of RPC calls. Both the child's and parent's ID may be obtained.
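These parent-side functions might be packaged as calls of roughly the following shapes; all of the names (including ChildHandle and CompletionCode) are placeholders rather than a settled interface:
ComputeProcessServices.RequestStatus [child: ComputeProcessServices.ChildHandle] RETURNS [status: ROPE, completion: ComputeProcessServices.CompletionCode]
ComputeProcessServices.AbortProcess [child: ComputeProcessServices.ChildHandle]
ComputeProcessServices.JoinProcess [child: ComputeProcessServices.ChildHandle] RETURNS [result: ROPE]
ComputeProcessServices.Wait [children: LIST OF ComputeProcessServices.ChildHandle] RETURNS [done: ComputeProcessServices.ChildHandle]
ComputeProcessServices.FindChild [child: ComputeProcessServices.ChildHandle] RETURNS [machineID: ROPE, processID: ROPE]
ComputeProcessServices.GetIDs [child: ComputeProcessServices.ChildHandle] RETURNS [childID, parentID: ROPE]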
From the child, other functions can be performed (possible signatures are again sketched after the list). They are as follows:
Set status: set a ROPE into the status field. The status is specified by the child, but is maintained by the system. Note that the status can revert in the case of restarted commands on crashed servers.
Find machine ID and unique process ID for parent: this can be used to import interfaces from the parent, and to discriminate RPC calls from parents of children running on this machine by using the unique process ID of the parent.
Get ID: same as above
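Correspondingly, the child-side functions might be sketched as follows; again, these names are placeholders, not a committed interface:
ComputeProcessServices.SetStatus [status: ROPE]
ComputeProcessServices.FindParent [] RETURNS [machineID: ROPE, processID: ROPE]
ComputeProcessServices.GetMyIDs [] RETURNS [childID, parentID: ROPE]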
Object model handles can also be used. There are two functions that are supported. The first is to "Find machine ID" for the import of the interface (RPC terminology again) from the type handle or any object handle. Second, handle lookup, on the machine that instantiated the type, will be performed to map from handles to objects.
The Controller
As stated above, the controller acquires performance statistics from the compute servers and builds a database of commands, performance statistics, resource limits, and machines. The primary and secondary controller machines periodically poll Grapevine to see if they should acquire the controller function. If so, they register themselves with Grapevine, and read the commands and performance database from the file server.
All requests for new computation are then passed through the controller. The controller consults its database of commands and machines, and selects a machine that has sufficient capacity for the command. This machine's ID is returned to the requester, and the database is updated to reflect the anticipated increased load on the machine. If all servers are too busy, the request is queued (by doing stable writes to an Alpine server) and waits. When a server that can perform the command becomes available, its machine ID is returned to the requester.
The controller maintains a database, on Alpine, as to the current state of the overall service. Queues of requests, statistics, restrictions, and active servers are maintained. Active detached commands are also recorded so that they can be restarted in case of failure.
The controller and all the servers must respond to a shutdown request. This is to recover the cluster in case of an error that causes the cluster to execute improperly. This can occur when an anti-social computation has gained control of the cluster, or a damaged computation is running wild.
Full Compute Servers
These servers are machines that are dedicated to being compute servers. Their checkpoints are tailored to restrict the use of GFI's and VM during booting. These servers are capable of being coordinators for detached computations.
One interesting function to include on these servers is auto-rollback. When the server has been idle for some time, and significant resources have been used since the last rollback, then the server will initiate rollback.
Public Dorados may be used as full compute servers, but the checkpoint will not be re-made. This means that the functions run on the public machine may have to be restricted. Auto-rollback will be done by special entries in the user profile.
Personal Workstations as Compute Servers
Personal workstations are a good source of cycles, but the protocol for using them must be worked out to minimize the impact on the workstation owner, provide reasonable protection, and still get good use of the available compute power. To this end, workstation owners will be able to impose restrictions on the use of their machines. This is still an unclear area, but limitations could include the following (a hypothetical set of user profile entries is sketched after the list):
processor: by use of priority and other scheduling policies, the processor impact can be decreased.
memory: by sampling, the physical memory usage of the server processes can be determined. Also, by watching the page replacement rate, the amount of thrashing on the workstation can be estimated. Commands can be temporarily suspended or killed to reduce thrashing.
GFI: this is a Cedar resource that is needed to run new software. A minimum number of GFI's could be reserved for the workstation owner. The number of GFI's needed to run a command could be estimated from the number it took last time.
processes: some part of the process pool may be given to the server. If it is exhausted, some computation may be killed.
local disk: a restricted amount of disk space and a restricted amount of file flushing from the disk cache may be specified. It is not practical to maintain the LRU queue for the disk cache.
restartable: only commands that are easily restartable and run fairly quickly will be allowed. This is so that the user will feel free to rollback the workstation at any time. The system will discover that this has occurred and restart the command somewhere else without a great loss in progress.
commands: it will be possible to restrict the commands run on a server. Also, the number of concurrent commands, and the number of concurrent commands of the same type, can be restricted. The latter is intended for the compiler, where only one compilation process may be running on a machine at a time.
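Purely to illustrate the flavor of these restrictions, a workstation owner's user profile might contain entries something like the following; the entry names and values are invented for this example and are not a proposed format:
Summoner.Priority: background
Summoner.MaxVMPages: 2000
Summoner.ReservedGFIs: 50
Summoner.MaxProcesses: 20
Summoner.MaxDiskPages: 1000
Summoner.Commands: Compile Bind
Summoner.MaxConcurrentCommands: 2
Summoner.MaxConcurrentOfOneCommand: 1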
Personal workstations will be able to enter and leave the cluster via commands registered with the CommandTool. A rollback will always cause the personal workstation to leave the cluster.
Other Issues
Errors may be raised by the computation. The compute server should catch them, obtain some information about their cause, and send mail to the maintainer of the command. The "Event" action areas generated by the errors may be left on full servers for debugging purposes, but should be suppressed on personal workstations.
Informative requests directed to the controller should be supported. This can be used to allow a program to dynamically decide how many children to spawn.
Facilities could be provided for unreliable, or even reliable, broadcast provided that the cluster all shared an Ethernet (or maybe a few Ethernets so that we can mix DLions and Dorados).
It seems that all issues in computer science come down to naming. As proposed here, the name space for commands will be flat.
A reasonable effort will be made to ensure that the program run is the most recent version of that program. Hence, if a new version of the compiler is released, then new servers entering the cluster will always use the new compiler. Old versions will persist until all machines that were in the cluster at the time of the new version's creation have been removed from the cluster.
Two different ways will be provided to register computations. One is via the CommandTool: any commands registered in a particular directory will be available. These will all be considered versionless. The second way to register is with the server software itself (a possible call is sketched below). This must be used by the type instantiation client software for Object Model Computation. It also provides a version number. The system will keep track of the most recent version number and ensure that new computations always use servers that have this version number. Existing computations will use servers that export the most recent version.
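The second style of registration might amount to a call of roughly the following form; the interface name and parameters are placeholders (CommandProc is the hypothetical procedure type sketched earlier):
ComputeServerRegistry.Register [commandName: ROPE, version: INT, proc: ComputeCommandServices.CommandProc]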
Progress - an opinion
I have successfully run a re-bound Cedar compiler "remotely". All file system calls (FS) were done by binding the compiler with a piece of code that EXPORTed FS, but was not the standard FS. The FS variant bound in with the compiler running on the server cooperated with another program running on the client to do FS operations via RPC. This was all done on my workstation, but byte streams and remote file operations were all performed. It was only a demonstration of capabilities and is not robust enough for general use.
Based on this initial work, I think that all the above is quite possible and can be done in the near future. I have no machines right now for servers, but I do not expect this to be a problem.