[qv]<IDL>History>ProposalMemo.dm!1>IDL.Proposal

PARC has a current and continuing need for a data analysis facility to support empirical research. This memo reviews possible solutions and proposes the in-house development of a data analysis facility based on the IDL system. While of comparable cost to other alternatives, this approach is seen to provide a superior data analysis facility and a hospitable environment in which research on user interfaces might be done.

Currently, several PARC staff members are gathering data requiring computerized data analysis. The AIP group (Stu Card and Tom Moran) is actively involved in behavioral experimentation. Ron Kaplan (and possibly Jim Morris) of CSL are contemplating experiments. Other staff members have gathered data on computer system performance for which non-trivial data analysis is required. There is currently no data analysis software available at PARC to meet these needs. Furthermore, the initiation of a research program of "user experiments" - the detailed evaluation of the performance of product prototypes, and the systematic study of the relevant principles of human engineering and systems design - is being considered. This research program will not be possible without a reliable, powerful data analysis system being available.

There are two ways in which a data analysis facility could be provided - the purchase of computing from outside vendors offering data analysis software, and the implementation of an in-house facility. The former solution is currently used by the AIP group when the complexity of their analyses exceeds what can conveniently be programmed in an ad hoc manner. However, in addition to the questions of continuing cost and data security, the inconvenience of using such external vendors has been a major problem for them. Furthermore, much of their data analysis is sufficiently nonstandard that commercially available software is of limited assistance. For these reasons, the commercial vendor is an unattractive option.

The provision of an in-house facility can itself be done in two ways - by purchasing a commercially available program and modifying it for use on Maxc, or by developing a production version of some noncommercial design. There are three reasons why PARC should consider investing resources in making a production version of a new design. They are:

1. The installation of an existing program on Maxc is neither substantively interesting nor, given the size of the modifications that may be required, necessarily less expensive than the implementation of an independently motivated design. Over the long term, the costs are almost certainly comparable, given that the costs of both are probably dominated by their maintenance costs.

A design which provides the second and third benefits, and which we propose as PARC’s major investment in data analysis, is the Interactive Data analysis Language (IDL). IDL is an interactive data analysis system designed around the data analysis tasks commonly carried out by social scientists. The design work was done at Harvard by Beau Sheil (in collaboration with Eliot Smith) under the sponsorship of both the Psychology and the Computer Science departments. Its unique feature is that it is based on a very strong hypothesis about the construction of application programs: the tasks of a domain should be analyzed in terms of the domain’s underlying conceptual structure, and the set of basic operators resulting from this analysis, along with tools for combining them to form new task descriptions, should be the interface presented to the user. The rationale for this is that the user is at least implicitly aware of the underlying commonality of his tasks (by virtue of his competence in the domain) and will therefore be able to exploit this structure (a) to learn and use the conceptually based system more easily, because the small set of basic operators imposes a lower memory load than the large set of tasks, and (b) to create new tasks that are more suited to his needs than those that might be provided by a conventional task-oriented system.
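The hypothesis above can be illustrated with a minimal sketch. It is written in Python rather than in IDL, and the operator names (total, count, each) are hypothetical stand-ins chosen for illustration, not IDL's actual operators. It shows only the general idea: a few basic operators, plus ordinary composition, are enough to define familiar tasks such as a mean or a variance, and the user can define further tasks the same way.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]

# Hypothetical basic operators (illustrative only, not IDL's operator set).
def total(xs: Vector) -> float:
    """Sum the elements of a vector."""
    return sum(xs)

def count(xs: Vector) -> int:
    """Number of elements in a vector."""
    return len(xs)

def each(f: Callable[[float], float], xs: Vector) -> List[float]:
    """Apply a function to every element of a vector."""
    return [f(x) for x in xs]

# Tasks defined purely by combining the basic operators.
def mean(xs: Vector) -> float:
    return total(xs) / count(xs)

def variance(xs: Vector) -> float:
    """Sample variance (divides by n - 1)."""
    m = mean(xs)
    return total(each(lambda x: (x - m) ** 2, xs)) / (count(xs) - 1)

if __name__ == "__main__":
    scores = [2.0, 4.0, 4.0, 6.0]
    print(mean(scores))
    print(variance(scores))
```

A user who needs a task that is not supplied, such as a sum of squared deviations about some other reference value, would build it from the same three operators rather than wait for it to be added to a fixed menu of tasks.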

Development of the IDL system would, in addition to providing a very powerful and flexible data analysis system, allow the testing of this underlying theory of application programming. If valid, this would be of far more importance to PARC than the data analysis software used to test it, as the theory gives a strong discipline for the organization of application software of all kinds.

The current status of the IDL system is that an interpreter for IDL has been implemented in the extensible language PPL, and detailed documentation (including tutorial material) has been prepared. The interpreter supports nearly all of the paper design (excepting only small, well understood pieces of code such as the mathematical distribution functions) and has been tested extensively to determine the design’s completeness and consistency. During its development, the opportunity was taken to make several changes to the design based both on detected inconsistencies in the specifications and intuitive evaluations of ease of use. The documentation has been extensively reviewed by representative users and rewritten several times in the interests of clarity.

The primary problem with the current system is that, as PPL does not have a compiler, it is both far too slow and far too fragile to be used for production data analysis. Both Stu Card and Ron Kaplan of PARC, in addition to Harvard users, used the system during 1975 and found that, despite the appeal of the conceptual structure, the working environment was inadequate for anything other than experimental use. Although, as a consequence, the system is not currently being used, we remain convinced that the basic design is appealing and that a production version would be warmly received.

Given a decision to proceed with IDL development, the next question is that of the implementation environment. Due to the system’s size, its extensive use of floating point, and the requirement that it present an interpretive environment to the user, the Alto does not seem an appropriate host. On Maxc, we can see two implementation strategies. First, the frequently executed code from the PPL version could be hand-compiled in order to provide acceptable speed. With the addition of some utility code, this would provide a reasonable facility. Alternatively, the current algorithms could be implemented without change in INTERLISP and the INTERLISP compiler could be used to provide an efficient production system.

Of these two methods, we favor the second. While the first method is probably less expensive than conversion to INTERLISP, the INTERLISP environment provides much greater flexibility once complete. There are two specific reasons why this flexibility is important. First, it has often been suggested to us that the system would be more powerful if it were able to make use of graphics in various ways. While we have no clear intuitions now about how this should be done (and are not including such work in this proposal), we would like to leave the system in a state in which graphics facilities could be added, or experimented with, at reasonable cost. Second, as the implementation of the IDL system is at least partially motivated by a desire to test a specific philosophy of system design, it is important that we be able to adjust the system in minor ways so that inferences made on the basis of its acceptance do reflect that philosophy rather than small defects in design or construction. Furthermore, in order to answer specific questions about the causes of acceptance or rejection, it is important that the system be in a form that permits experimentation at a reasonable cost. Clearly, such adjustments would be much easier to make in the INTERLISP environment than in the PPL/machine code environment.

To summarize, we are proposing that a data analysis facility be provided by implementing the IDL system on Maxc in INTERLISP. This will provide not only an excellent data analysis facility, but an environment in which experiments on user interfaces can be carried out without prohibitive costs.

The basic implementation strategy is to hire a non-permanent programmer to do the basic transcription of the existing code. We would become involved in the process ourselves only when design work must be done to preserve existing functionality across the translation and to ensure that the code being written is constructed in such a way that the pain of later change will be minimal. We have a candidate programmer: one Dr. Jan Dericksen, who worked with Jeff Rulifson on the QA4 project at SRI and comes well recommended. As we therefore expect minimal training time, we have budgeted 7 months (28 weeks) of his time, distributed as follows: startup, familiarization with existing IDL code, PPL, and INTERLISP, and tool building (4 weeks); design work (2 weeks); transcription of existing code (18 weeks); construction of as yet unwritten utility code (4 weeks).

We expect to contribute our time as follows: Kaplan - design work (1 week); code supervision and ongoing discussion (1 day per week for 25 weeks = 5 weeks). Sheil - design work (2 weeks); code supervision and ongoing discussion (1 day per week for 25 weeks = 5 weeks). Thus, we expect to hold our combined commitment down to 13 man-weeks.
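The time budget in the last two paragraphs can be checked with a short calculation. The sketch below is in Python and rests on two assumptions not stated in the memo: a 5-day working week (used to convert 1 day per week for 25 weeks into 5 weeks) and a 4-week month (used to relate the programmer's weeks to the 7-month figure). The variable names are illustrative only.

```python
# Assumptions (not stated in the memo): 5 working days per week,
# and 4 weeks per month.
WORKDAYS_PER_WEEK = 5
WEEKS_PER_MONTH = 4

# Programmer's time, in weeks, as itemized in the memo.
programmer_weeks = {
    "startup, familiarization, tool building": 4,
    "design work": 2,
    "transcription of existing code": 18,
    "unwritten utility code": 4,
}
programmer_total_weeks = sum(programmer_weeks.values())
programmer_months = programmer_total_weeks / WEEKS_PER_MONTH

# Supervision: 1 day per week for 25 weeks, converted to working weeks.
supervision_weeks = 1 * 25 / WORKDAYS_PER_WEEK

# Authors' time: design work plus supervision.
author_weeks = {
    "Kaplan": 1 + supervision_weeks,
    "Sheil": 2 + supervision_weeks,
}
authors_total_weeks = sum(author_weeks.values())

print("Programmer:", programmer_total_weeks, "weeks =", programmer_months, "months")
for name, weeks in author_weeks.items():
    print(name + ":", weeks, "weeks")
print("Authors combined:", authors_total_weeks, "man-weeks")
```

Under these assumptions the itemized figures agree with the totals quoted: 28 weeks (7 months) for the programmer, and 6 plus 7 weeks, or 13 man-weeks, for the authors.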