IDL REFERENCE MANUAL
0. Introduction
The Interactive Data-analysis Language (IDL) is a programming language designed for the analysis of social science data. IDL is radically different from most previous computer systems for social science data analysis, as it is not based on the statistical "package" or subroutine concept, but on the concepts employed in the design of modern programming languages. Thus, IDL provides the user with powerful tools for the manipulation of the type of data most commonly used in social science research, rectangular numeric arrays, and the ability to combine these tools to yield a wide range of statistical analyses. The sophisticated computer user will find similarities between IDL and Iverson’s APL, at least in the facilities for handling arrays, but IDL goes well beyond APL in providing capabilities especially designed for practical data analysis, such as the handling of labels (including the automatic generation of new labels by system routines) and missing data, and the ability to analyze very large data sets.
IDL is unlike most statistical systems in that its effective use requires that the user master a fair amount of material that does not, at first sight, seem at all relevant. As the reader might question our asking him to wade through five chapters of manual before the first explicitly statistical operation appears, the next section attempts to motivate this with a discussion of the advantages of this approach over the conventional one.
0.1 The philosophy of IDL
The concept of a statistical package system (like DATATEXT or SPSS) dates back to the first half of this century. At that time, data analysts were tightly constrained by the computational demands of their analyses. Certain techniques, such as factor analysis, large scale regressions and analyses of variance, were extremely expensive to carry out by hand and thus were rarely used even if the data were suitable. The primary motivation of the designers of the package systems was to relieve users from the computational load of a small number of common analyses.
This preoccupation with a small set of computationally intractable analyses led the package designers to a "shopping list" approach to data analysis: they provided a list of independent subprograms that consume data, print the results, and exit. Unfortunately for the user, if the analysis that he wants is not on the list, or if it is present but not quite right in some detail, the program is of no use to him. Thus, package designers increase the length of the list and the options available for each item in order to reduce the chances that a user’s demands will not be met. This, in turn, makes the packages bigger and therefore more costly both to learn and to use. For the occasional user, the cost of switching packages is considerable, and although the consequent reluctance to do so is deplored by the sophisticated as "package dependence", it is a completely natural reaction under the circumstances.
Freed from many of the constraints of the pre-computer decades, users have become much more demanding in the sophistication and complexity of the analyses that they wish to apply to their data, so there is simply no prospect that the packages' range can be increased enough to satisfy the user community. Instead, each new addition merely compounds complexity without any real increase in generality.
The approach being followed by IDL is to give the user a powerful set of statistical "building blocks", which provide basic analytic capabilities, and a mechanism for combining these basic operations in new ways to extend those capabilities. These two components are complementary. The choice of the basic analytic routines is influenced by how useful they will be in defining further analyses. Thus, a routine which computes a multiple regression and prints the result is not a useful building block, as nothing else can be done with it. One that removes variance components from a covariation matrix and returns the new matrix as a result would be very useful, as many common statistics could be defined with it. It is this two-part approach, the interaction of the basic tools and the composition mechanism, which constitutes the essential difference between IDL and a conventional system.
The way one uses IDL reflects this difference. With a package system, a closed set of commands is available; with IDL one can construct any command that is needed, from the elementary to the complex and specialized. However, both the base and the composition mechanisms must be understood before the benefits of the language are reaped, and this is our reason for asking your perseverance through so much preliminary material. Although this may be tedious, console yourself with the thought that it is like learning to drive a car. If your object is to get somewhere now, it is quicker to take a bus (provided that the bus goes where you want to go). In the long run, however, you see more places and get there faster if you can drive: the adventuresome can strike off into new terrain, while the timid will find maps to certain commonly visited places in Chapter 8.
0.2 The relation between IDL and LISP
IDL actually consists of a set of programs embedded in a large programming system, Interlisp-10 (hereafter, "Lisp"). These programs (or functions, to use the Lisp terminology) are the basic building blocks mentioned above. Each function has a descriptive name, such as PLUS, ANOVA or MOMENTS, and can be applied to some data objects by specifying its name and the appropriate data in an expression which is given to the Lisp system to be evaluated. Thus, the expression
(PLUS 3 2)
would cause the function PLUS to be applied to the numbers 3 and 2 to produce the value 5. In general, the first item in such an expression denotes the function to be applied, and the remaining items denote the arguments. The arguments themselves may also be expressions. If A represents a set of numbers, then
(MOMENTS (RANK A))
will apply the RANK function to A to compute the ranks of the numbers in A, and then apply MOMENTS to those ranks to compute their count, mean, and variance.
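Readers who want to experiment with this style of composition outside IDL can mimic it in any language with first-class functions. The following sketch in Python is purely illustrative; the `rank` and `moments` definitions below are our own stand-ins (ignoring ties, and using the population variance), not IDL's routines:

```python
def rank(xs):
    # Rank each value: the smallest gets rank 1, the next rank 2, and so on.
    # (Ties are not handled specially in this sketch.)
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def moments(xs):
    # Count, mean, and variance (population form, for this sketch).
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return n, mean, var

a = [3.1, 0.5, 2.2, 7.8]
print(moments(rank(a)))   # the ranks of a are [3, 1, 2, 4]
```

As in the IDL expression, the inner function is applied first and its result is passed directly to the outer one; no intermediate variable is needed.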
This simple notion of composition is the basic method by which data analysis operations are constructed from the statistical building blocks provided by IDL. Lisp provides several other ways in which functions and expressions may be composed to carry out new tasks. For example, sometimes it is desirable for a complicated expression to be treated as a function in its own right, so that it can be applied as a unit to the result of other computations. Thus, the expression
(QUOTIENT X (LOG X))
will scale the value X by its logarithm. The result of taking, say, the square root of X may be similarly transformed with the expression
(QUOTIENT (SQRT X) (LOG (SQRT X)))
but this entails both conceptual and computational problems. The SQRT function has been distributed throughout the original scaling transformation, so it is no longer clear that two conceptually distinct operations are being composed. Further, the expression is computationally inefficient because it stipulates that the SQRT computation is to be evaluated twice, even though both evaluations will yield the same result.
These problems disappear if the logarithm scaling is made into a function, since this function can then simply be applied to a SQRT argument expression. Functions in Lisp are constructed by enclosing the expressions defining the computations they are to perform in an "expression" beginning with the key-word LAMBDA. The LAMBDA is followed by a list of names which the enclosed expressions may use to refer to the values to which the function is applied. Thus, the scaling operation can be represented as the function
(LAMBDA (Y) (QUOTIENT Y (LOG Y)))
which can appear instead of a function name in the first position of an expression whose second element specifies the argument to which it is to be applied. The argument expression will be evaluated, and its value will become the value of the variable Y given in the LAMBDA-expression. When the body of the LAMBDA is evaluated, that value will become the numerator of the QUOTIENT, and its LOG will be the denominator. The square root composition is then
((LAMBDA (Y) (QUOTIENT Y (LOG Y))) (SQRT X))
Notice that the scaling operation appears as a conceptual unit, and that the SQRT occurs only once. The scaling may be composed with other transformations by changing only the expression to which it is applied. For example,
((LAMBDA (Y) (QUOTIENT Y (LOG Y))) (SIN X))
applies logarithm scaling to the result of the trigonometric SIN of X.
If the computations encapsulated in a LAMBDA-expression are to be used repeatedly, it is convenient to associate a name with that function. The Lisp function PUTD (for put definition) makes the name-function association. Thus, after evaluating the expression
(PUTD (QUOTE LOGSCALE) (QUOTE (LAMBDA (Y) (QUOTIENT Y (LOG Y)))))
the computations described above may be specified by
(LOGSCALE (SQRT X)) and (LOGSCALE (SIN X))
The QUOTEs in the PUTD expression are used to indicate that PUTD is to operate on the actual expressions themselves, rather than the result of evaluating those expressions. In other words, QUOTE is used to suppress evaluation of an expression. Its effect here is that the name LOGSCALE, not the value of the Lisp variable LOGSCALE, is associated with the definition which is the LAMBDA-expression itself, not the result of evaluating the LAMBDA-expression.
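The define-then-apply pattern embodied by LAMBDA and PUTD exists in most modern languages. As a loose analogue only (this is not IDL, and Python has no QUOTE because names are bound directly rather than evaluated), the LOGSCALE definition and its two uses would read:

```python
import math

# Analogue of (PUTD (QUOTE LOGSCALE) (QUOTE (LAMBDA (Y) (QUOTIENT Y (LOG Y))))):
# bind a name to an anonymous function of one argument.
logscale = lambda y: y / math.log(y)

x = 9.0
print(logscale(math.sqrt(x)))  # analogue of (LOGSCALE (SQRT X))
print(logscale(math.sin(x)))   # analogue of (LOGSCALE (SIN X))
```

Note that in either notation the scaling remains a single conceptual unit: composing it with a different transformation changes only the argument expression, never the definition.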
The function-defining and function-naming facilities permit elaborate computations to be packaged together and invoked by a single user-chosen name. In this way, the user can augment the IDL building blocks to construct operations specially suited for the particular kinds of analyses he needs to perform.
These and many other Lisp facilities for creating and manipulating functions and applying them to arguments have been maintained in IDL, and most IDL users would profit from a knowledge of them. The basic Lisp concepts necessary to understand the rest of this manual and the minimal information necessary to start using IDL on a PDP-10/Tenex system can be found in Appendix A. Descriptions of the more sophisticated features of Interlisp are not contained in this manual, in the interests of keeping it concise. There is a separate manual for Interlisp (Teitelman, 1978), and any amount of time invested in reading it would be profitable, though it is perhaps a bit overwhelming for the casual user.
Programmers may wonder why Lisp by itself is an inadequate environment for data analysis. Others may wonder this about some other language, such as APL. The reason is that many facilities that must be provided for the data analyst are either not present or very difficult to implement in the confines of a conventional programming language. For example, it would be very difficult to redefine the arithmetic operations of APL to handle missing data.
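To make the missing-data point concrete, here is a sketch, in Python and purely for illustration (it is not how IDL is implemented), of an addition that propagates a missing-data marker through the result, the kind of behavior that cannot easily be retrofitted into a language's built-in operators:

```python
MISSING = None  # our stand-in marker for a missing observation

def plus(a, b):
    # Element-wise addition in which any missing operand
    # makes the corresponding result element missing.
    return [MISSING if x is MISSING or y is MISSING else x + y
            for x, y in zip(a, b)]

print(plus([1, 2, MISSING], [10, MISSING, 30]))
# only the first element has both operands present
```

In IDL this treatment is built into the basic array operations themselves, so the analyst never has to code it by hand.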
0.3 The rest of this manual
The rest of the manual divides into five parts: the basic structure of the system, functions for manipulating the common data objects, functions for basic analyses, a guide to composing common analyses from the basic ones, and technical details. Chapters 1 and 2 cover the basic system and its handling of data. Chapters 3 and 4 cover the basic tools for manipulating data in array form. Both of these sections are essential reading. Chapters 5 and 6 cover the basic routines for data compression and analysis; knowledge of the routines one plans to employ is essential but complete knowledge is not necessary. Chapter 8 outlines the use of IDL for data analysis, and provides extended discussions of certain common cases which may not be obvious from the rest of the manual. The remaining sections cover technical details: Chapter 7 discusses ways in which data can be input to IDL, and the Appendices give a commented protocol from an IDL session, detailed definitions of the operation of the system in a format designed for quick reference (as opposed to the expository descriptions in the chapters), and miscellaneous other information. Finally, there is a glossary of common technical terms used throughout the manual.
Should any aspect of the manual or the IDL system prove awkward or incomprehensible, please feel free to complain and suggest improvements. IDL is a new system, and should continue to evolve as we find out more about the needs of its users. Please participate.