IDL DEVELOPMENT PROJECT: FINAL REPORT
XEROX Palo Alto Research Center
Inter-Office Memorandum 15 October 1976
To: Distribution
From: R. Kaplan, B. Sheil
Subject: IDL Development Project: Final Report
File: [IFS]<IDL>FinalReport.bravo
This memo reports on the project to implement the IDL data analysis system in INTERLISP on MAXC. It outlines the objectives of the project, describes their realization, and summarizes the capabilities of the resulting system.
Project Objectives
The IDL development project had its origins in PARC’s current and continuing need for a data analysis facility to support empirical research. Several PARC staff members gather data requiring non-trivial statistical data analysis, ranging from data on human subjects (the AIP group, Ron Kaplan) to data on computer system performance. Further, any research program of "user experiments" - the detailed evaluation of the performance of product prototypes, and the systematic study of the relevant principles of human engineering and system design - would require a reliable, powerful data analysis facility. There has been no data analysis software available at PARC to meet these needs.
We considered obtaining a data analysis capability by purchasing computing from outside vendors offering data analysis software. This strategy has been used by the AIP group when the complexity of their analyses exceeds what can conveniently be programmed in an ad hoc manner. However, in addition to the questions of continuing cost and data security, the inconvenience of using such external vendors is a major problem. We decided to avoid these difficulties by importing an externally available package and installing it at PARC.
We evaluated the relative benefits and costs (both financial and manpower) of several candidate systems, including SPSS and Multivariance as well as the Interactive Data-analysis Language (IDL) developed at Harvard by Beau Sheil and Eliot Smith. It appeared that all systems would require considerable modification to bring them up in our unique computing environment, and maintenance costs would probably be equivalent. We chose the IDL system because, although its start-up costs would probably be higher than for other systems, it offered the following significant advantages:
1. It would provide a more powerful, flexible, and extensible set of statistical facilities than the other systems we considered. This is very important because of the non-standard nature of much of PARC’s data analysis.
2. It could more easily take advantage of the interactive and graphical nature of our computing environment.
3. It would provide a framework in which substantively interesting research could be done, in such areas as the conceptual structure of data analysis, techniques for converting experimental software into production systems, and user interfaces for applications programs.
The last point deserves some elaboration. Most applications software (including that built at PARC) uses an interface design based on selection - i.e. a design in which the user is presented with a fixed list of operations that the system can perform (e.g. a menu) and is required to select the appropriate operation for his current task. In contrast, the IDL design is based on the hypothesis that a user interface should consist of a factoring of the specific tasks of the application into operators that reflect their underlying concepts, together with a method for combining them. The rationale for this is that the user, by virtue of his competence in the domain, is at least implicitly aware of the underlying commonality of his tasks. He will therefore be able to exploit this commonality both to improve his understanding of the structure of his domain and to create new operations that are more suited to his needs than any predefined set could be. Experience with IDL should be helpful in comparing and evaluating these different user-interface philosophies.
Having selected IDL as the system to import, we then considered how best to implement it. We chose INTERLISP on MAXC because of its power and flexibility, and because it supplied many facilities that we would otherwise have had to build ourselves, including the interactive executive already familiar to many of our intended users. The flexibility to experiment with different system configurations was particularly important to us, since we wanted to test a specific philosophy of user interface design and therefore needed to be able to adjust the system in minor ways to determine whether its acceptability could be attributed to that philosophy rather than to incidental details of its design or construction. Finally, the existence of Warren Teitelman’s INTERLISP display package meant we could construct a system in which graphical augmentations could be explored subsequent to and independent of the initial development effort.
Performance Targets
The goal of the development effort was to provide a system with the same capabilities as the Harvard design - a transcription from the PPL code written at Harvard into INTERLISP, with minor changes as necessary to adapt to the INTERLISP environment. While major enhancements to the system, including revision of the language, the reworking of certain design and implementation decisions, and the addition of new capabilities (e.g. primitives for multivariate analyses) have been considered, major revisions to the Harvard design were not the focus of the development effort. Such enhancements (including more elaborate graphics) would be a topic of the IDL project were it extended from a development effort to an area of substantive research.
Regarding specific performance measures, it was expected that, for a standard mix of computational tasks (loading a data set and performing assorted regression and variance analyses), the INTERLISP system should be between 10 and 100 times faster than the PPL implementation. This would allow a researcher to perform these computations in interactive mode (i.e. 2-3 minutes real time) on a moderately loaded MAXC (load average below 4), and thus provide a useful data analysis facility.
Furthermore, the system was expected to be compact enough to leave space for loading Warren Teitelman’s display facilities. This would allow IDL commands to be constructed by pointing at the screen, and permit scrolling through large data-objects printed by the standard IDL print routines. However, while these facilities were to be demonstrable, there was no guarantee that they would be efficient enough (either in time or space) to be included in the standard version of the system.
Project History
On February 15, Dr. Jan Derksen was hired as a temporary programmer, for a period of not more than eight months, to transcribe the core of the system from the PPL into INTERLISP. During the following month, the major implementation decisions necessitated by the change from PPL to INTERLISP were made, the INTERLISP environment for IDL programming was established, and implementation of the basic data manipulation primitives was begun. Save for the two weeks immediately preceding system delivery, this was the period of peak involvement for Kaplan and Sheil. Much of our effort at this time went into the design and construction of the DECLTRAN package for adding lexically-scoped type declarations to IDL programs. This package, an unforeseen outcome of the project, has proven useful to other INTERLISP programmers and is described more fully below.
At the end of four months, the implementation of the primitive system functions (i.e. array storage and access) was essentially complete. This occupied slightly more than 50% of the project because the primitive system functions of IDL implement an array mechanics of about the same order of complexity as that of APL. When these were complete, we gathered some preliminary data on system speed for a "typical" computation - the printing of an array. We found a speedup by a factor of approximately 5 to 10 over the PPL implementation, not as great as was eventually hoped, but we felt that there were indications that greater speed was possible. On that basis, we decided to continue the project, and reported thus at our mid-project review meeting.
Over the following two months, the bulk of the data-analysis routines were transcribed. At the end of this period, Derksen left the project (after six months, rather than the eight allocated) and the remaining work on the system, mainly debugging, checkout, and final engineering, was done by the principals. Major performance improvements were achieved at this time through compile-time optimizations based on type-declaration information. Also, the ULAMTRAN package was defined to provide a consistent user interface by mechanically generating user-interface coercions and error recovery code from DECLTRAN-style declarations.
At this point, the basic system, and with it the development project, is complete. While the documentation has not been fully revised to reflect all differences from the original PPL implementation, there is a summary document and work on the full manual is proceeding. Some small amount of work remains to be done, on a continuing basis, to patch bugs as they are reported, to smooth the interface further, and to help users define and manage their data analysis problems.
During the past month, we have presented the project at the Computing Forum at PARC (13 October 77) and made the first version of the system available. In the short time since then, we have been gratified to receive several inquiries from potential users of whose needs we were not aware when the project was originally proposed. This has reinforced our belief that IDL will meet a real need - one which might have remained latent, although at a high cost in individual inconvenience and unexploited research opportunities, had this facility not been developed.
Evaluation
Performance. The project essentially achieved all its objectives. The system was delivered on time and within the performance bounds specified. It can be, and has been, loaded into a Display LISP. On the benchmark that was specified at the beginning of the project, it runs 18.5 times faster than the PPL implementation (20 cpu secs vs. 390), a figure that is conservative because of the amount of printing in the benchmark. For a compute-bound comparison (the covariation matrix of a medium-sized, 100 by 10, data array), the LISP system was over 70 times faster (11 cpu secs vs. 795). The LISP system should compare even better as we install further planned optimizations gradually over the next year. In particular, the array mechanics has been set up so as to enable a transition to delayed evaluation, a strategy wherein an aggregate expression is evaluated only on demand.
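For readers unfamiliar with delayed evaluation, the following is a minimal sketch of the general idea in Python. It is not IDL or INTERLISP code and does not describe the planned implementation; the class name `Delayed`, its methods, and the `evaluations` counter are hypothetical and exist only for illustration. An aggregate expression is represented by a rule for computing each element, and no element is computed until it is requested.

```python
class Delayed:
    """A one-dimensional aggregate whose elements are computed only on demand.

    Hypothetical illustration of delayed evaluation; not part of IDL.
    """

    def __init__(self, length, element_fn):
        self.length = length
        self._element_fn = element_fn  # maps an index to a value, called on demand
        self.evaluations = 0           # number of element requests served so far

    @classmethod
    def from_list(cls, values):
        values = list(values)
        return cls(len(values), lambda i: values[i])

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        self.evaluations += 1
        return self._element_fn(i)

    def __add__(self, other):
        if len(other) != self.length:
            raise ValueError("aggregates must have the same length")
        # Building the sum records the rule only; nothing is added yet.
        return Delayed(self.length, lambda i: self[i] + other[i])

    def scale(self, k):
        # Likewise, scaling records the rule and defers the arithmetic.
        return Delayed(self.length, lambda i: k * self[i])

    def force(self):
        """Evaluate every element and return an ordinary list."""
        return [self[i] for i in range(self.length)]
```

With `a = Delayed.from_list([1, 2, 3, 4])` and `b = Delayed.from_list([10, 20, 30, 40])`, constructing `(a + b).scale(2)` touches no element of `a` or `b`. Requesting a single element of the result touches exactly one element of each operand, which is the saving that delayed evaluation is meant to provide when only part of an aggregate is needed.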
Improvements over the PPL system. Although enhancing the PPL design was not a goal of the development effort, this system is in many ways an improvement over the specification. Most of its functions are more robust, more accurate, or handle a wider class of inputs than the PPL version. One such major gain has been in the treatment of the classification structure of experimental designs. The somewhat ad hoc approach of the PPL system has been replaced by a very general description of experimental designs as a distribution of values into factor space. This permits very concise specification, even of complex analysis of variance and covariance designs, including those involving repeated measures, and arbitrary crossing and nesting relationships. The forthcoming revision of the User’s Manual will provide both detailed specifications of how to apply these operators and expository material on data analysis designed to introduce them to researchers at PARC without formal statistical training.
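The notion of distributing values into factor space can be pictured with a small sketch in Python. This is only an analogy under stated assumptions: the function names `distribute` and `cell_means`, the example factors, and the data are all hypothetical, and the actual IDL operators are specified in the User's Manual rather than here. Each observation is placed in the cell named by its combination of factor levels, after which per-cell summaries follow directly.

```python
from collections import defaultdict


def distribute(values, factors):
    """Place each value in the cell of factor space named by its factor levels.

    `values` is a sequence of observations; `factors` is a sequence of
    level vectors, one per factor, each the same length as `values`.
    (Hypothetical helper for illustration only.)
    """
    for levels in factors:
        if len(levels) != len(values):
            raise ValueError("each factor needs one level per observation")
    cells = defaultdict(list)
    for i, value in enumerate(values):
        cell = tuple(levels[i] for levels in factors)
        cells[cell].append(value)
    return dict(cells)


def cell_means(cells):
    """Summarize each cell of the design by the mean of its observations."""
    return {cell: sum(vs) / len(vs) for cell, vs in cells.items()}
```

For example, four observations crossed by two hypothetical factors, `drug = ["a", "a", "b", "b"]` and `dose = ["lo", "hi", "lo", "lo"]`, fall into three occupied cells, and `cell_means(distribute(values, [drug, dose]))` gives one mean per occupied cell. Adding a factor amounts to adding one more level vector, which is why a design described this way stays concise as it grows.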
Other unforeseen benefits. An unexpected benefit of the IDL implementation has been the development of programming tools that have application in other contexts. A major aspect of the environment preparation was the development of the DECLTRAN package for adding lexically-scoped type declarations to INTERLISP programs. The original motivation for this package was to control the notorious time and space inefficiencies of INTERLISP’s arithmetic operations. We felt it important to do this in such a way that the programmer need not be aware of what those optimizations might eventually be, lest that knowledge obscure the structure of his programs. Consequently, we developed a mechanism which allows the datatypes of a program’s variables to be declared at the time they are bound, in order to permit the mechanical generation of efficient code for expressions involving variables known to be arithmetic. However, as DECLTRAN also checks that the declarations are satisfied during the execution of both interpreted and compiled code, it proved so useful as a program development tool that it was used throughout the system to provide information for run-time checking and other possible compile-time optimizations. Its use has led to more efficient, more debuggable, and more readable INTERLISP programs, and it is finding wider application in the PARC INTERLISP community, in such projects as KRL, the ORG LALR parser generator, the Dorado timing analyzer, and others.
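The run-time checking half of this idea can be illustrated by a rough modern analogy in Python. It does not show DECLTRAN's syntax or implementation, neither of which is given in this memo; the decorator name `checked` and the example function `mean` are hypothetical. Types are declared where a function's variables are bound (its parameter list), and each declaration is verified when the binding takes place.

```python
import functools
import inspect


def checked(fn):
    """Verify each argument against its declared type at binding time.

    Hypothetical analogy to declaration checking; handles only simple
    class annotations on ordinary parameters.
    """
    sig = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        bound.apply_defaults()
        for name, value in bound.arguments.items():
            declared = sig.parameters[name].annotation
            if declared is inspect.Parameter.empty:
                continue  # undeclared variables are left unchecked
            if not isinstance(value, declared):
                raise TypeError(
                    f"{fn.__name__}: {name} is declared {declared.__name__}, "
                    f"but was bound to a {type(value).__name__}"
                )
        return fn(*args, **kwargs)

    return wrapper


@checked
def mean(total: float, count: int):
    # The body is written without any reference to how the declarations
    # are used; the checking lives entirely in the wrapper.
    return total / count
```

Calling `mean(10.0, 4)` succeeds, while `mean(10.0, "4")` is rejected at the point of binding rather than failing somewhere inside the computation. The sketch covers checking only; the use of the same declarations to generate efficient arithmetic code has no counterpart here.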
The ULAMTRAN package was another tool to emerge from the project. In building the IDL user interface, we realized that the declaration information could be used to coerce user-supplied arguments of incorrect type to the correct type, and to provide error reporting and recovery techniques in a consistent way for non-coerceable objects. These coercions are generated mechanically from the argument declarations of functions defined as user entry points. We expect that ULAMTRAN, and the insights it embodies, will also find wider applications in other systems that present a functional interface to their users. Descriptive documents on DECLTRAN and ULAMTRAN will be released in the near future.
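As a rough illustration of generating coercions mechanically from argument declarations, consider the following Python sketch. It is an analogy only and says nothing about how ULAMTRAN itself is written; the names `user_entry`, `UserArgumentError`, and `scale` are hypothetical. An argument of the wrong type is converted to its declared type where possible, and every argument that cannot be converted is reported through one uniform error.

```python
import functools
import inspect


class UserArgumentError(Exception):
    """Uniform error for an argument that cannot be coerced (hypothetical)."""


def user_entry(fn):
    """Coerce arguments of a user entry point to their declared types."""
    sig = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in list(bound.arguments.items()):
            declared = sig.parameters[name].annotation
            if declared is inspect.Parameter.empty or isinstance(value, declared):
                continue  # nothing declared, or already of the right type
            try:
                # Assumes the declared type can be called as a converter.
                bound.arguments[name] = declared(value)
            except (TypeError, ValueError) as exc:
                raise UserArgumentError(
                    f"{fn.__name__}: cannot coerce {name}={value!r} "
                    f"to {declared.__name__}"
                ) from exc
        return fn(*bound.args, **bound.kwargs)

    return wrapper


@user_entry
def scale(values: list, factor: float):
    return [factor * v for v in values]
```

Here `scale((1, 2, 3), "2")` is accepted, since the tuple and the string are coerced to a list and a float before the body runs, whereas `scale([1, 2], "abc")` raises `UserArgumentError` with a message naming the offending argument. Because the wrapper is derived from the declarations alone, every entry point defined this way reports errors in the same form.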
If there is one respect in which the project can be criticized, it is that it consumed too much of the principals’ time. Whereas our involvement was supposed to be kept to a total of 13 man-weeks over the year, it is probable that twice this amount was actually invested, at the cost of the principals’ having less time available for their other projects. There are three reasons for the overrun: it was not possible to delegate design decisions to the temporary programmer in the early parts of the project; fundamental differences in the programming environment (e.g. the lack of locative expressions in LISP) necessitated much more redesign than was anticipated; and the declaration and coercion problems emerged as research "targets of opportunity" which appeared to justify additional involvement. Although the first two of these are to some extent errors in estimation, the extent of the overrun seems reasonable given the size of the project and the unexpected early departure of the temporary programmer. Because of their wider applicability, only a portion of the cost of the DECLTRAN/ULAMTRAN packages should be charged to the IDL project.
Future Plans
For the time being, the only essential activities are user-requested maintenance, completion of the documentation, and occasional user consulting. These are activities that will be spread out over a period of time, so they will not interfere markedly with our other commitments.
The open question concerns the research areas that are available now that an implementation of IDL exists. The use of displays has already been mentioned as a potentially important research topic. More centrally, as experience with the system accumulates, we will probably want to make changes to the concept structure, or to the form of the user interface. Such work is partially maintenance, and partially theoretical work in statistical computing. Finally, at some point in the future, it will be essential, if we are to understand the value of this form of application program design, to conduct an evaluation of the system - either by making it available to a naive user population, or by more rigorous studies of the way it is used at PARC. We are open to suggestions as to what plans should be made in this area.
Acknowledgments
IDL is not the sort of program that is usually implemented in INTERLISP, and we found ourselves exploring paths through the language that have rarely been trod before. We are very much indebted to Larry Masinter and Warren Teitelman for fixing a variety of bugs, correcting a number of non-features, and adding some special hooks for us to hang on. We also appreciate their advice and assistance in finding and fixing our own bugs.
Finally, we are grateful for Eliot Smith’s contributions to the original IDL design.
Distribution:
J. Elkind
J. Rulifson
W. Sutherland
R. Taylor
W. Teitelman
B. Wegbreit