IDL REFERENCE MANUAL
1. Data Representation
The basic element of data analysis is the data item, a single observation on a single subject. In IDL, a data item is a number that may be used to represent a measure, category membership, or any other attribute of the subject. IDL places no constraint or interpretation on the data items with which it deals. It is the responsibility of the data analyst to ensure that the operations performed on a data item are consistent with his interpretation of it.
A central property of data is its organization into aggregates. For example, a survey consists of a set of observations repeated for many different subjects. A central concern of the data analyst is the structure of these aggregates. Many data analysis tasks use data with "rectangular" structure, that is, data organized into vectors, tables, matrices or some similar structure. Even when the data is not strictly of this form (e.g., in a design in which one variable is nested within another), it is quite often conveniently represented by a slight variation on a rectangular structure. For this reason, IDL provides both a representation for such structures, the IDL array, and a large set of functions with which to manipulate them. Most of the later chapters of this manual are concerned with those functions. In this chapter, however, we will examine the structure itself, and its intrinsic properties.
1.1 Array structure
An IDL array is a collection of numeric data items with the following properties:
1. It has a fixed number of dimensions. For example, a vector has only one dimension, a matrix has two, a table of SEX by EDUCATION by VOTE has three, and so on. IDL arrays may have as high a dimensionality as the user desires.
2. It has a fixed number of levels on each dimension. This number can vary from dimension to dimension. The SEX by EDUCATION by VOTE table mentioned before might have two levels of SEX, four of EDUCATION, and three of VOTE, resulting in a total of 24 cells. The vector whose elements are the number of levels on each dimension of an array is referred to as the shape of the array. In the case of our table, we say that it has shape [2,4,3]. Note that the shape of an array is itself an array (a vector of length equal to the number of dimensions of the array). Its shape (i.e., the shape of the shape of the array) will be a vector with one element which is the dimensionality of the array.
3. The data items in it can be represented using one of two different representations for numbers (either integer or floating). For various technical reasons, all elements of an array must be of the same type, i.e., if one element of an array is floating, all must be floating. In addition, particular data items can be marked as "missing".
4. It can be labelled. This is a very important feature for the user, as the labels contain his view of what the array actually means. Labels are Lisp atoms. They may be attached to an array, maintained, removed, or changed by the user at will.
5. If it is a matrix, it can be stored in either of two formats. Some matrices are stored so that only the elements below the diagonal are present, and those above are assumed to be the same as their "mirror image" across the diagonal. Such arrays, e.g., correlation matrices, are called symmetric (as opposed to the usual storage format, which is full) and allow considerable savings in space. Although the user can control the storage format of a matrix, he will usually be content to let IDL choose it.
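The notion of shape, and of the shape of a shape, can be illustrated outside IDL. The following is a minimal sketch in Python, not IDL code: it assumes that an array is modelled as a rectangular nested list, and the function name shape is chosen here for illustration only.

```python
import math

def shape(a):
    """Return the shape of a rectangular nested-list array.

    Assumption: every sublist at a given depth has the same length,
    so inspecting the first element at each depth is sufficient.
    """
    dims = []
    while isinstance(a, list):
        dims.append(len(a))
        a = a[0] if a else None
    return dims

# A SEX by EDUCATION by VOTE table with 2, 4 and 3 levels, filled with zeros.
table = [[[0] * 3 for _ in range(4)] for _ in range(2)]

table_shape = shape(table)            # [2, 4, 3]
cell_count = math.prod(table_shape)   # 24 cells in all
shape_of_shape = shape(table_shape)   # [3], the dimensionality of the table
```

The shape of a plain number comes out as the empty list in this model, since the loop never runs.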
1.2 Terminology
At this point, we introduce some terminology that provides a concise way of stating the behavior of functions described later.
A scalar is a number of any form in any structure that is used to represent anything. For example, 34, 2, 7.3, and 1.12E12 are all scalars. The shape of a scalar is the empty vector.
A vector is an array with only one dimension. If it is necessary to specify its length, we will refer to it as a P-vector where P is an integer giving the length. When the elements of a small vector must be indicated in the text, we will enclose them in square brackets. Thus, the 2-vector with first element 4 and second element 5 will be represented as [4 5].
A matrix is an array with only two dimensions. If it is necessary to specify its size, we will refer to it as a [P,Q]-matrix, where P and Q are integers giving the number of rows and columns respectively.
Unless stated otherwise, an array may be assumed to have no constraints on its shape. Where constraints apply, they are noted by specifying that an object is an N-array (indicating that it has N dimensions) or a [P,Q,...]-array, where [P,Q,...] is the shape of the array.
A particular element of an N-array is referenced by a subscript, a sequence of N integers indicating its level on each of the N dimensions of the array.
There are occasions when we must enumerate for consideration all the elements of an array, one at a time. Most commonly, the elements will be enumerated in row-major order, which means that the last subscript varies fastest. That is, the elements of a [P,Q,...X,Y]-array will be considered in the order [1,1,...1,1], ... [1,1,...1,Y], [1,1,...2,1], ... [P,Q,...X,Y].
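Row-major enumeration can be made concrete with a short sketch. This is Python rather than IDL, and the function name row_major_subscripts is hypothetical. It generates the 1-based subscripts of an array of a given shape with the last subscript varying fastest.

```python
from itertools import product

def row_major_subscripts(shape):
    """Return the 1-based subscripts of an array in row-major order.

    itertools.product advances its rightmost argument fastest, which
    is exactly the "last subscript varies fastest" rule.
    """
    return list(product(*(range(1, n + 1) for n in shape)))

subscripts = row_major_subscripts([2, 3])
# (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)
```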
1.3 Labels
An array may be labelled in a variety of ways. The first type of labelling an array may have is a title, a string which describes the array. A typical title might be: "Sociological Variables on 75 Subjects". Titles are completely arbitrary, although IDL will generate titles for newly created objects in a systematic way. Titles serve merely as comments on an array: IDL will print them, but otherwise pays no attention to them.
The title is associated with an array as a whole, but labels may also be associated with each of its dimensions. For each dimension, a whole block of label information may be present. This can include a dimension label for the dimension itself and a selector label (so called because it may be used to select sections out of the object) for each level of that dimension (e.g., row and column labels for a matrix).
Finally, labels may be associated with the values in the cells of an array. Value labels are used to indicate the meaning of numeric codes given to categorical variables. Thus, in a subjects by variables data matrix, one could attach value labels to a variable like SEX (e.g., the value 1 might be labelled MALE, and 2 FEMALE) to indicate the coding used. Each value label is associated with a numeric code, forming a pair, and the set of such pairs for a variable is referred to as its codebook. Value labels can only be attached to one dimension of an array (otherwise each cell would be in the scope of more than one set of value labels, making it impossible to decide which to use), and, even then, will only make sense for certain levels of that dimension. For example, one might have value labelled category variables such as SEX, but one would hardly use such labels for a continuous variable like AGE.
All labelling information is optional. If present, and if the user has requested label printout, the labels are printed whenever an array is printed, making it more easily comprehensible. For example, a simple data array, with two variables on three subjects, might print as:
A Random Matrix
             Variable
Subject      SEX       AGE
1            MALE      24
2            3         31
3            FEMALE    28
Here, "A Random Matrix" is the title of the array, "Variable" is the dimension label for dimension two (the columns), and "Subject" is the dimension label for dimension one (the rows). "SEX" is the selector label for the first column and "AGE" the selector label for the second column. (More precisely, "SEX" is the selector label for the first level on the second dimension.) The first dimension has no selector labels, so integers are printed. Value labels are attached to the Variable dimension, and the code for SEX described above has been used. The value 3 does not have an associated value label for the variable SEX, so the numeric value appears in the table. This is sometimes useful in detecting "wild scores".
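The way a codebook governs printing can be sketched as a simple lookup. The following Python fragment is an illustration only and does not reflect how IDL stores labels. The names codebooks, columns, data and display are assumptions made for the example. A value with no entry in the codebook falls through and is shown as a number.

```python
# Codebook for SEX only; AGE is continuous and has no value labels.
codebooks = {"SEX": {1: "MALE", 2: "FEMALE"}}
columns = ["SEX", "AGE"]
data = [[1, 24], [3, 31], [2, 28]]

def display(value, variable):
    """Return the value label if the codebook has one, else the number itself."""
    return str(codebooks.get(variable, {}).get(value, value))

rows = [[display(v, var) for v, var in zip(row, columns)] for row in data]
# [['MALE', '24'], ['3', '31'], ['FEMALE', '28']]
```

The unlabelled 3 in the second row is the kind of entry that stands out as a possible coding error.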
1.4 Missing data
With the introduction of aggregates of data comes the possibility of incomplete aggregates. These are troublesome as a "marker" or code for missing data must be stored within the aggregate, and must both be easily distinguished from real data yet act as if it were real data (so that one doesn’t have to be perennially checking for it). IDL uses the value NIL as its missing data code. In most ways, NIL behaves like any other data value and may be stored into and retrieved from any cell of an array. However, as they have a data analytic interpretation as "undefined", missing data codes are treated specially by many IDL functions. For example, most arithmetic functions are defined so that, if given NIL as an argument, they return NIL (rather than producing an error). Thus, one may use elements from an array in arithmetic calculations without first having to check whether any of them are missing. However, if it is desired to treat missing cases specially, they can easily be detected, since the missing data code, unlike any genuine arithmetic value, compares equal (via the Lisp function EQP) to NIL.
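The propagation rule for missing data can be sketched as follows. This is a Python analogy, not IDL: None stands in for NIL, and the wrapper name nil_safe is hypothetical. A wrapped function returns the missing code whenever any argument is missing, so callers need not test for it first.

```python
import operator

def nil_safe(fn):
    """Wrap fn so that a missing argument (None) yields a missing result."""
    def wrapped(*args):
        if any(a is None for a in args):
            return None
        return fn(*args)
    return wrapped

plus = nil_safe(operator.add)

ages = [24, None, 28]                  # the second subject's age is missing
next_year = [plus(a, 1) for a in ages]

# Missing cases can still be singled out explicitly when that is wanted.
missing_at = [i for i, a in enumerate(ages) if a is None]
```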
1.5 Large arrays [not yet implemented]
Many statistical analyses are based on very large data sets, often ones that are far too large to fit into main memory. IDL allows the user to store arbitrarily large arrays in files and to access these arrays as though they were in main memory. In this way, arrays much larger than the available main memory can be manipulated. The ability to store and access file resident arrays will mainly be useful for arrays of raw data, as they are typically much larger than other arrays. Usually, the user will "compress" large data arrays down into smaller arrays that are suitable for further analysis. Nearly all commonly used statistical methods are based on such compressions and IDL offers a comprehensive set of compression functions which are documented in Chapter 5.
1.6 Function extension
IDL arrays may have an arbitrary number of dimensions, yet many operations used in data analysis are defined either for scalars or for arrays of a particular dimensionality. As examples, the square root function expects a scalar argument and returns its square root (a scalar), while the matrix-inversion function computes a matrix which is the inverse of its matrix argument. Such expectations will be violated by an argument with too few or too many dimensions. Every function decides for itself what it will do with arguments of too few dimensions. Thus, it makes no sense to invert a vector or scalar, so INVERT simply announces an error when it receives such an object. On the other hand, it is reasonable for matrix-multiplication to treat vectors as row or column matrices.
Arguments with too many dimensions are handled uniformly by a general function extension mechanism. On each call to an extended function, the extension mechanism inspects each argument and determines whether it is of greater dimensionality than the function is programmed to process. If so, the overly large object is broken down into smaller slices each of the correct dimensionality, and the function is applied to each of these in turn. The results of these separate applications are collected by the extension mechanism and amalgamated into a single array.
The scalar-to-scalar function SQRT is thus applied elementwise if it receives an array argument, so that the "square root" of the vector [9 16 25 36] is the vector of square roots [3. 4. 5. 6.]. If INVERT is given a [2,3,3]-array, then it will invert the two embedded [3,3]-matrices separately. For example, if the input array represented two within-cell covariation matrices of a one-way classification, the result would be a [2,3,3]-array containing the two within-cell matrix inverses.
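The slicing behaviour just described can be sketched in a few lines. The code below is a Python model under stated assumptions, not the IDL implementation: arrays are rectangular nested lists, slices are taken along the first dimension, and the names rank and extend are hypothetical. A matrix transpose stands in for INVERT to keep the example short.

```python
import math

def rank(a):
    """Number of dimensions of a rectangular nested-list array."""
    n = 0
    while isinstance(a, list):
        n += 1
        a = a[0] if a else None
    return n

def extend(fn, expected_rank):
    """Return a version of fn that accepts arguments of higher dimensionality.

    An argument with too many dimensions is broken into slices along its
    first dimension, and the results are collected into one nested list.
    """
    def extended(a):
        if rank(a) > expected_rank:
            return [extended(piece) for piece in a]
        return fn(a)
    return extended

# A scalar-to-scalar function applied to a vector works elementwise.
sqrt = extend(math.sqrt, 0)
roots = sqrt([9, 16, 25, 36])

# A matrix-to-matrix function applied to a [2,2,2]-array handles each
# embedded [2,2]-matrix separately.
transpose = extend(lambda m: [list(col) for col in zip(*m)], 2)
stacked = transpose([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
```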
A fuller description of the extension mechanism is provided in section 2.3. Ways in which the user may control the break-down of arguments are discussed, as are techniques for extending (i.e., embedding the extension mechanism in) functions defined by the user. We note here that, powerful as this mechanism is, it has one systematic limitation: the shape of the values of the extended function for each argument slice must be the same. This is necessary so that the partial results can be formed into an array, which must be rectangular. Therefore, if the value shape for one argument slice differs from the shape for other slices, the extension mechanism will cause an error and stop the computation.
Extension would fail, for example, if a function expects a scalar and returns a vector whose length is determined by that scalar (e.g., DEAL). Giving such a function the vector [2 4 7] as input would require the construction of a matrix whose first row was two elements long, whose second row was four elements long, and whose third row was seven elements long. Such an object is not rectangular and cannot be an array, so the extension will not be permitted.
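The rectangularity check can be illustrated with a small sketch. It is written in Python and rests on assumptions: count_up is a hypothetical stand-in for a function whose result length depends on its scalar argument (it is not DEAL), and amalgamate is a hypothetical name for the step that combines the results from each slice.

```python
def shape(a):
    """Shape of a rectangular nested-list array (first element path only)."""
    dims = []
    while isinstance(a, list):
        dims.append(len(a))
        a = a[0] if a else None
    return dims

def amalgamate(results):
    """Combine per-slice results, refusing any set that is not rectangular."""
    shapes = [shape(r) for r in results]
    if any(s != shapes[0] for s in shapes):
        raise ValueError(f"slice results have differing shapes: {shapes}")
    return results

def count_up(n):
    """Hypothetical scalar-to-vector function: returns [1, 2, ..., n]."""
    return list(range(1, n + 1))

# Equal result lengths combine into a [2,3]-matrix.
combined = amalgamate([count_up(n) for n in [3, 3]])

# Lengths 2, 4 and 7 cannot form a matrix, so the combination is rejected.
try:
    amalgamate([count_up(n) for n in [2, 4, 7]])
    rejected = False
except ValueError:
    rejected = True
```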
All built-in IDL functions are extended unless it is stated otherwise in their description. The brief descriptions of each function found in Appendix D contain the expected dimensionality of each argument for each function.
1.7 Conversion from other representations
Many functions are described in this manual as requiring array arguments with such-and-such a property. If they were all to insist on their requirements being strictly met, IDL would be quite awkward to use. For example, a scalar is formally quite different from a 1-vector which is different from a [1,1]-array. However, all three contain exactly one element, and IDL will accept any of them in situations requiring a single data item.
Similarly, a Lisp list is formally quite distinct from an IDL array. List structures are, however, very easy to type in as the arguments to functions, and Lisp provides many tools for constructing and manipulating them. For convenience, every IDL function will attempt to convert a list structure argument into an IDL array, so that arrays and appropriate lists may be freely intermixed. The basic conversion rule is: a list of k N-dimensional IDL arrays all with the same shape will be converted to an (N+1)-dimensional array with k levels on its first dimension. The values in the first array in the list will be in the first level of the larger object, the values in the second array will be in the second level, and so on.
Thus if A is the 4-vector [1 3 5 7] and B is [2 4 6 8], the value of the expression
(LIST A B)
will be treated by functions expecting an array argument as the [2,4]-matrix
1 3 5 7
2 4 6 8
(The Lisp function LIST constructs a list whose elements are the values of its argument expressions.)
Since scalars behave as 0-dimensional arrays, and since the rule applies recursively, the same effect may be achieved by
(LIST '(1 3 5 7) B) or
(LIST '(1 3 5 7) '(2 4 6 8)) or, more compactly,
'((1 3 5 7) (2 4 6 8))
The last line makes use of the fact that if all the arguments to LIST are quoted, then all the internal quotes can be eliminated and the LIST itself can be replaced by QUOTE (or ', the CLISP equivalent).
The conversion will result in an error if the listed objects cannot themselves be converted to arrays of the appropriate dimensionality and shape.
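The recursive conversion rule can be sketched in Python. This is an analogy, not IDL code: it assumes that both lists and arrays are modelled as nested Python lists, so the conversion amounts to a recursive check that the k items share one shape. The names shape and to_array are hypothetical.

```python
def shape(a):
    """Shape of a rectangular nested-list array; a scalar has shape []."""
    dims = []
    while isinstance(a, list):
        dims.append(len(a))
        a = a[0] if a else None
    return dims

def to_array(x):
    """Convert a list of k same-shaped N-arrays into an (N+1)-array.

    Scalars count as 0-dimensional arrays, and the rule is applied
    recursively to each item before the shapes are compared.
    """
    if not isinstance(x, list):
        return x
    items = [to_array(e) for e in x]
    shapes = [shape(i) for i in items]
    if any(s != shapes[0] for s in shapes):
        raise ValueError(f"items cannot form an array, shapes are {shapes}")
    return items

A = [1, 3, 5, 7]
B = [2, 4, 6, 8]
matrix = to_array([A, B])
matrix_shape = shape(matrix)     # two levels on the first dimension

# Items of unequal shape cannot be converted.
try:
    to_array([[1, 3, 5, 7], [2, 4]])
    conversion_failed = False
except ValueError:
    conversion_failed = True
```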