8. Data Analysis with IDL
This chapter presents some simple ways to do standard analyses, some tricks that might not be apparent from the rest of the manual, and some clues to the effective use of IDL for advanced statistical work. Even a quick reading of this chapter will give the potential IDL user some ideas as to what the system can do. When the time comes that a specific analysis is needed, the appropriate section in this chapter can serve as a guide, adding a more practical flavor to the reference material in other chapters.
8.1 Ante data
Data analysis systems are not ordinarily used before the data are gathered, but there are two ways in which IDL can be of use with no real data in hand at all. The first is in the construction of experimental designs; the second in the generation of artificial data.
8.1.1 Experimental design
Experimental design often requires randomly assigning subjects into groups so that certain constraints are satisfied. With few subjects this can easily be done by hand, but with many subjects or complicated designs it is useful to have mechanical help. The basic IDL tool is DEAL, which returns a random permutation of the integers from one to its integer argument, in a vector of that length. This can be related to the design in two ways. The first indicates which subjects are in each condition. For example, assume a 2 by 2 factorial design (four conditions), with 13 subjects per condition. The problem is to assign 52 subjects (each represented by a subject number from 1 to 52) into one of four conditions. This can be done with
(RESHAPE (DEAL 52) ’(2 2 13) )
which will distribute each of the randomly ordered subject numbers into a cell in a first variable by second variable by subject within condition array. This array can be printed out to provide the condition lists, or it can be saved and further manipulated. One obvious manipulation would be to sort the subject numbers within the conditions, to make it easier to find a given subject.
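For readers who want to check the logic outside IDL, here is a rough NumPy sketch of the same assignment scheme (illustrative only; not part of IDL):

import numpy as np

rng = np.random.default_rng()
subjects = rng.permutation(np.arange(1, 53))   # the 52 subject numbers in random order
cells = subjects.reshape(2, 2, 13)             # first variable by second variable by subject within condition
cells.sort(axis=-1)                            # optional: sort subject numbers within each condition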
The second way of using DEAL produces a list in subject number order, giving for each subject the condition that he is in. Here we want to keep the subject numbers fixed and deal around the conditions, rather than vice versa. We generate a vector whose length is equal to the number of subjects and whose elements are the condition numbers, each condition number appearing as often as there are subjects to be assigned to that condition. The vector is randomly permuted by selecting with a DEAL vector:
(RESHAPE ’(1 2 3 4) 52) @ (DEAL 52)
This relies on the cyclic repetition of RESHAPE to produce the desired condition vector.
Note that these two procedures do not in general give the same subject assignments (and most certainly do not if different DEAL vectors are used). It takes a little more trouble to generate both lists for the same design, and that problem is left for the interested reader to explore.
Some experimental situations require that treatments be counter-balanced according to a random Latin square. The SHIFT and DEAL functions combine to produce this kind of experimental plan:
(SHIFT (GENVEC 1 6) (GENVEC 1 6)) @ (LIST (DEAL 6) (DEAL 6))
Because of the extension for its second argument, the SHIFT produces a square each row of which contains the numbers 1 through 6 "rotated" one position from the previous row. This is a 6 by 6 Latin square whose rows and columns are then randomized by the DEAL selections.
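A rough NumPy sketch of the same construction (illustrative only): each row of the base square is the sequence 1 through 6 rotated one position from the previous row, and the DEAL selections correspond to independent random permutations of the rows and columns.

import numpy as np

rng = np.random.default_rng()
base = (np.arange(6)[:, None] + np.arange(6)) % 6 + 1      # row i holds 1..6 rotated i positions
square = base[rng.permutation(6)][:, rng.permutation(6)]   # randomize row order, then column order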
8.1.2 Artificial data
The random number generators RAND and RANDN are the key to artificial data generation. The function RAND (from Lisp) generates a uniformly-distributed random number in the range given by its arguments. RANDN generates a random number from the normal distribution with a specified mean and standard deviation. Both functions allow the user to control the numbers produced (and so to generate the same sequence twice, if necessary) by setting the random number generators with the Lisp function RANDSET. Generating data with distributions other than these is non-trivial and expert advice should be sought.
The result of a random generator is a number and can be used anywhere that a number can be. One must be careful to make sure that the generator is called an appropriate number of times. Thus
(RESHAPE (RAND 1 5) 10)
does not generate a vector of ten random numbers between one and five, but rather a vector full of ten copies of the same random number between one and five, just like
(RESHAPE 6+1 10)
generates a 10-vector of sevens. To achieve the former result, the random generator must be called once for each new number desired. For example,
(EAPPLY* ’(LAMBDA NIL (RAND 1 5)) ’(SCALAR) (GENVEC 1 10))
uses EAPPLY* to evaluate (RAND 1 5) for each element of the vector 1 through 10. This produces an image of that 10-vector each element of which is a different number drawn randomly from the uniform distribution between one and five. Note that the elements of the GENVEC were not actually used in the computation at all! Their role was simply to force the extension mechanism to "pulse" RAND ten times. This technique is often useful to cause a computation to be repeated without using loops, especially when the results are to be saved in an array.
We can use a variant of this technique to obtain, from the function RANDN, a vector of ten random numbers from a specified normal distribution:
(RANDN (RESHAPE M 10) SD)
Here, M and SD are scalar variables bound to the desired mean and standard deviation. The RESHAPE produces a 10-vector each element of which is M, and the extension causes RANDN to be called once for each of these ten elements.
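In a vectorized system the same effect is obtained by asking the generator for many draws at once; a rough NumPy sketch (illustrative only, with an invented mean and standard deviation):

import numpy as np

rng = np.random.default_rng()
u = rng.uniform(1, 5, size=10)                 # ten independent draws, uniform on [1, 5)
M, SD = 100.0, 15.0                            # hypothetical mean and standard deviation
n = rng.normal(loc=M, scale=SD, size=10)       # ten independent draws from N(M, SD)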
A common application of random numbers is random sampling, or making any decision on a stochastic basis. Suppose you want to sample from the elements of a vector with a sampling probability of one fifth. This could be done by testing the expression (RAND 1 5)=1 for each element of the vector, which will be true one fifth of the time in the limit. However, in a finite number of trials it will not be true exactly one fifth of the time but will vary around that. To extract exactly one fifth, randomly chosen, use as a selector one fifth of a DEAL vector of length N where N is the length of the vector being sampled (a DEAL vector of length N/5 will not do, as that selects the first fifth of the source vector, but in a random order). This method samples without replacement. To sample with replacement, select with a vector of length N/5 each element of which is set by a separate call to (RAND 1 N).
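The same sampling schemes can be sketched in NumPy (illustrative only; the source vector here is invented):

import numpy as np

rng = np.random.default_rng()
x = np.arange(100, 200)                              # hypothetical source vector, length N = 100
n = len(x)
exact_fifth = x[rng.permutation(n)[: n // 5]]        # exactly N/5 elements, sampled without replacement
with_repl = x[rng.integers(0, n, size=n // 5)]       # N/5 elements, sampled with replacement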
Artificial data generated in these ways may be used directly as input to the compression and analysis routines (e.g., to test the robustness of various analyses), or may be used to increase the "noise" in existing data by adding some appropriately scaled random number. There are various other uses of random generators (e.g., simulations) that are beyond the scope of this chapter.
8.2 Data
Once the data have been collected, they can be modified and reorganized for subsequent analysis using IDL’s array manipulation tools.
8.2.1 Data entry
There are two ways to input data to IDL. The simplest is just to type it in at the point in the session when it is needed. However, this is unwieldy for large amounts of data. In this case, the data may be prepared as a text file, using a text editor, and read into IDL using the functions described in Chapter 7.
8.2.2 Recoding
Usually the form in which data is most easily collected is not the most suitable for subsequent analysis. Sometimes individual data values must be adjusted to compensate for peculiarities of the collection procedure or to eliminate observations that are known to be invalid for idiosyncratic reasons. The simplest way of "purifying" the data matrix is to use the ASSIGN operator to change specific values indicated by a selector. For example,
X@’(101 SEX) ← 2
changes to 2 the value of the variable SEX for subject 101 in the matrix X.
Often, however, the same change must be applied to many different subjects, as when the raw observations must be transformed or combined in some systematic way. Such systematic recoding may be accomplished by means of the array and arithmetic operations using function extension. For example, the column of values representing the variable INCOME in a data array can be converted to a column matrix containing the logarithms of INCOME values via the expression
(LOG X@’((INCOME)))
The resulting matrix may be used independently of X in various analyses. If the transformed values are required for many subsequent analyses, it may be convenient to combine them with the old values in X. The new values can be assigned into X to replace a column of previous values by
X@’((INCOME)) ← (LOG X@’((INCOME)))
or a new matrix can be constructed by adjoining the new values to X with
(ADJOIN X (LOG X@’((INCOME))))
In either case, some change in labelling is also usually desirable.
As another example, the variable INCOME can be converted to standard scores by first computing its mean and variance and then performing the appropriate arithmetic operations:
M ← (MOMENTS X@’((INCOME)))
(X@’((INCOME)) - M@’(Mean)) / (SQRT M@’(Variance))
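The equivalent standardization in NumPy, for comparison (illustrative only; the data are invented):

import numpy as np

X = np.random.default_rng(0).uniform(10, 100, size=(20, 3))   # hypothetical subjects by variables data
income = X[:, 2]                                              # suppose the third column is INCOME
z = (income - income.mean()) / income.std(ddof=1)             # standard scores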
The function extension mechanism may be invoked explicitly for more elaborate data transformations. For example, a common technique in the analysis of questionnaire data is to combine answers to individual questions into scale scores, either by summation or averaging. The following expression will produce a subjects by variables matrix whose first column contains the sum of the values in each row of X and whose second column contains the average of those values:
([ELAMBDA ((R VECTOR))
(ADJOIN (RPLUS R) (MOMENTS R)@’(Mean))]
X)
The ELAMBDA will evaluate the ADJOIN form with R bound in turn to each row of X, yielding a different 2-vector each time. The output matrix, formed by combining these together, may be used independently, may replace some 2-column section of X, or may be ADJOINed to X, as described above.
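A rough NumPy sketch of the same row-wise compression (illustrative only; the responses are invented):

import numpy as np

X = np.random.default_rng(1).integers(1, 6, size=(10, 4)).astype(float)   # hypothetical item responses
scores = np.column_stack([X.sum(axis=1), X.mean(axis=1)])                 # per-subject sum and average
X2 = np.hstack([X, scores])                                               # adjoin the new columns to X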
8.2.3 Weighting
The compression routines MOMENTS, COVAR, and PAIRN have an optional argument providing for differential weighting of cases. This argument takes the form of a vector that specifies the weight to be given to each unit of analysis. Weighting vectors may be used either to correct bias in a sampling procedure or to allow the entry of data in frequency compressed form.
The first case arises when a single subject in a subjects by variables data matrix is used to represent several observations rather than just one. For example, if the sampling in a survey results in the relative under- or over-representation of various groups in the sample, each observation in the sample might be weighted by the reciprocal of its group’s sampling fraction, thus allowing unbiased estimates of the population parameters to be obtained. Such a weighting vector is commonly an additional variable selected from the same subjects by variables matrix that is being compressed. However, it might be in a completely separate array or be computed from other information.
The other use of weighting vectors is when there are many observations each of which can be one of a small set of measured values. In this case, it may be more efficient simply to count the number of observations having each value rather than to represent each observation as a row of a subjects by variables data matrix. The observed vector of frequencies can then be used as a weighting vector to compute summary statistics on the measure variables.
8.2.4 Subsetting
Analyses may be confined to particular subsets of a data array by recoding, weighting, selection, or some combination of these.
The recoding technique relies on the fact that almost all compression and analysis routines ignore missing observations, so that transforming a data value to NIL effectively removes it from consideration. As for recoding in general, specific values may be recoded to NIL by assignment, or a systematic transformation may be done to eliminate invalid or wild observations. Thus, if MI is the mean of the variable INCOME and SDI is its standard deviation, all observations two or more standard deviations away from the mean can be omitted using the expression
([ELAMBDA ((IVAL SCALAR))
(if (GREATERP (ABS IVAL-MI) 2*SDI)
then NIL
else IVAL)]
X@’((INCOME)))
The if-then-else expression returns NIL if IVAL is outside the specified interval and IVAL otherwise. Alternatively, the cut-off points can be placed in a translation table:
(TRANSLATE X@’((INCOME)) (LIST (LIST NIL MI-2*SDI NIL)
(LIST MI+2*SDI NIL NIL)))
This translates all extreme scores to NIL leaving other scores unchanged.
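The same recoding sketched in NumPy, with NaN playing the role of NIL (illustrative only; the values are invented):

import numpy as np

income = np.random.default_rng(2).normal(50.0, 10.0, size=30)    # hypothetical INCOME values
mi, sdi = income.mean(), income.std(ddof=1)
clean = np.where(np.abs(income - mi) > 2 * sdi, np.nan, income)  # extreme scores become missing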
Subsetting can be achieved through weighting vectors because the compression functions ignore observations whose weights are not greater than zero. Thus, subjects can be excluded from analyses by giving them weights of zero or NIL. Several such vectors may be defined to indicate membership in subgroups which are of continuing interest. By specifying the group vector as the weighting vector for a given compression, only data from the members of that group will be included. This method of subsetting is only applicable to analyses that begin with the weight-accepting compression functions.
The most direct and powerful method of confining analyses to particular subsets of the data is to use selection. If the subjects to be included (or deleted) are known by number or label, then a selection on the subject dimension can be used to produce the desired subset. Thus,
A@(LIST (ADJOIN (GENVEC 1 50) (GENVEC 52 99)) ’(SEX AGE EDUC) )
yields a sub-matrix of A containing the specified three variables for subjects 1 through 99 omitting number 51. This technique is the most general of all, since it does not depend on special treatment of missing data or weighting vectors.
If several analyses are to be performed on the data for a particular group, it may be convenient to save the subject selector. Thus,
GRP1 ← (ADJOIN (GENVEC 1 50) (GENVEC 52 99))
A@(LIST GRP1 ’(SEX AGE EDUC) )
produces the same result as the example above, but GRP1 is available to select the same set of subjects again.
Since subject selectors can be arbitrary vectors, they can be constructed using IDL’s array functions, such as the ADJOIN and GENVEC above. The function SEEK is especially useful for locating all subjects whose data satisfies a given condition:
(SEEK 1 ([ELAMBDA ((VARS VECTOR))
(if (AND (LESSP VARS@’(AGE) 30)
(GREATERP VARS@’(INCOME) 5000))
then 1
else 0)]
X) )
yields a selection vector containing the subject numbers for all subjects under 30 earning more than $5000. The ELAMBDA considers the vector of variables for each successive subject in X, and returns a vector containing 1 for every subject meeting the specified conditions. The SEEK returns a vector of the subject numbers corresponding to those 1’s.
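For comparison, the same selection in NumPy uses a boolean test over the columns (illustrative only; the data are invented):

import numpy as np

age = np.array([25, 42, 28, 31, 29])                     # hypothetical AGE column
income = np.array([6000, 7000, 4500, 9000, 5200])        # hypothetical INCOME column
grp = np.nonzero((age < 30) & (income > 5000))[0] + 1    # 1-based subject numbers meeting both conditions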
If subjects have been organized into a cross-classification structure (e.g., by GROUP), subsetting can be accomplished by selecting single levels of the classifying dimensions instead of multiple levels on the subject dimension. Thus, if B is a two way classification (by SEX and VOTE), then
(COVAR B@’(MALE DEMOCRAT ALL (INCOME AGE EDUC)))
will produce the three-variable covariation matrix for only those subjects in the MALE-DEMOCRAT cell of the cross-classification. If COVAR were applied to B without selection, the extension mechanism would cause the compression to be done separately for each of the groups, producing an array of within-cell covariations.
8.3 Correlational statistics
The basis for all continuous multivariate statistics in IDL is the covariation matrix. This is a symmetric matrix each row and column of which corresponds to a variable (column) of some data array, and whose elements contain the covariation between the two variables that are used to select it. Such matrices are produced from data arrays by the COVAR function. This function takes an array as argument, and computes the covariation matrix between the variables which form the columns of the data array. For example, if A is a matrix with four columns, labelled AGE, STATUS, INCOME and VOTEPCT,
C ← (COVAR A)
computes the covariation matrix for these variables and stores it in C. This produces a matrix such as
Covariations of SURVEY DATA

                                 Variable
  Variable        AGE     STATUS     INCOME    VOTEPCT   Constant
  AGE          17.868
  STATUS        1.733      0.241
  INCOME        3.905      0.642      4.336
  VOTEPCT     -10.027     -1.862     -8.450     21.388
  Constant     23.051      0.651     12.596     52.208      0.020
The Constant row holds the means of the variables selected by the other columns, and 1/N (where N is the number of subjects that the matrix is based on) in its diagonal (the right corner element). Each of the other cells contains the sum of the products of the deviations around the means for the variable pair specified by the selectors. The number of subjects on which each entry is based can be affected both by weighting and by missing data. However, all the available data are used to calculate each entry, and thus the N’s for different entries will, in general, not be the same. The N appearing in the Constant diagonal is the minimum cell N, and all other entries are scaled so as to appear as if based on this N. The function PAIRN is available to produce the actual pairwise N’s.
8.3.1 Covariance and correlation
A covariance matrix can be produced from a covariation matrix by eliminating the Constant row and column and dividing throughout by N-1. This is simply done, for a covariation matrix C, with the sequence
S ← (GENVEC 1 (SHAPE C)@’(1)-1)
C ← C@(LIST S S) / (1/C@’(Constant Constant) - 1)
which saves a selector for all but the last row or column in S, selects out the section of the matrix corresponding to variables other than Constant, and divides it by N-1, obtained from the Constant diagonal cell of the matrix.
A covariance matrix may be converted to a correlation matrix simply by dividing each element by the square root of the product of its diagonals (i.e. scaling the covariances in terms of the variances). Since a covariance matrix differs from a covariation matrix by only a factor of 1/(N-1), this procedure would also work for covariation matrices if the Constant row and column are removed. However, it is easier to use the function NORM, which normalizes a matrix by doing exactly this. Thus, we could obtain a correlation matrix from C with
(NORM C)
NORM will disregard rows with negative diagonal entries (as their square root is not defined), so this implicitly eliminates the last row of the matrix.
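The normalization that NORM performs can be sketched in NumPy as division by the square roots of the diagonal products (illustrative only; the data here are invented):

import numpy as np

data = np.random.default_rng(3).normal(size=(100, 4))   # hypothetical subjects by variables data
C = np.cov(data, rowvar=False)                          # 4 by 4 covariance matrix
d = np.sqrt(np.diag(C))
R = C / np.outer(d, d)                                  # correlation matrix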
Rank-order correlations (Spearman’s r) result from applying the same correlation procedure to variables that have been transformed by RANK to within-variable ranks.
Partial correlation coefficients are obtained from a covariation matrix using regression techniques and will be considered below.
8.3.2 Regression
The SWEEP function is the basis of all regression related statistics. SWEEP takes three arguments: the covariation matrix, an (optional) set of variables to be swept out, and an (optional) set of variables to be swept back in. Intuitively, sweeping a set of variables out of the covariation matrix corresponds to redistributing the variance associated with the swept variables so that it is removed from the unswept variables and divided up among the swept ones. The swept out variables are thus the independent variables of a regression against the unswept out (dependent) ones. Sweeping a set of variables back into the matrix distributes their variance back into the whole matrix. This provides a very flexible basis for all forms of regression analysis, including partial and multiple correlation.
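The sweep operator itself is easy to state; a rough NumPy sketch of one common version is given below (illustrative only, and not necessarily identical to IDL's SWEEP in every convention). Sweeping the independent variables out of a cross-products matrix leaves their negated inverse in the swept block, the regression coefficients in the swept-by-unswept cells, and the residual sums of squares on the unswept diagonal, which is the layout described in the next section.

import numpy as np

def sweep(A, pivots):
    # Sweep the symmetric matrix A on the listed pivot indices (a sketch of one standard convention).
    S = np.array(A, dtype=float)
    for k in pivots:
        d = S[k, k]
        row = S[k, :].copy()
        S -= np.outer(row, row) / d        # adjust every element
        S[k, :] = row / d                  # swept row,
        S[:, k] = row / d                  # swept column,
        S[k, k] = -1.0 / d                 # and a negative diagonal to mark the swept variable
    return S

# Hypothetical demonstration: regress y on x1 and x2 via the deviation cross-products matrix.
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=50)
D = np.column_stack([X, y])
D = D - D.mean(axis=0)
S = sweep(D.T @ D, [0, 1])
print(S[2, :2])        # regression coefficients of y on x1, x2
print(S[2, 2])         # residual sum of squares for y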
8.3.2.1 Unstandardized regressions
Unstandardized regressions can be produced by the SWEEP function directly from a covariation matrix. For example, using the covariation matrix C from the example above, we can simultaneously produce multiple regressions for each of INCOME and VOTEPCT (the dependent variables) on AGE and STATUS (the independent variables) with
S ← (SWEEP C ’(AGE STATUS))
which would produce
Sweep of Crossproducts of SURVEY DATA, Swept out: AGE Constant STATUS

                                 Variable
  Variable        AGE     STATUS     INCOME    VOTEPCT   Constant
  AGE          -0.185
  STATUS        1.330    -13.714
  INCOME       -0.132      3.610      2.532
  VOTEPCT       0.622    -12.199     -3.047      4.911
  Constant      3.398    -21.733     13.280     45.813    -64.197
Notice that the sign of the diagonal elements indicates which variables have been swept out: the diagonal for a swept out (independent) variable is negative. Thus if SDIAG is the diagonal formed by
(TRANSPOSE S ’(1 1))
the expression
IV ← (SEEK ’MINUSP SDIAG)
gives a selector for the independent variables in the regression and
DV ← (SEEK ’PLUSP SDIAG)
gives a selector for the dependent variables.
There are three types of information in a swept matrix: relations between the swept and the unswept variables, relations between the swept (independent) variables, and relations between the unswept (dependent) variables.
First, for each dependent variable, the regression coefficients for the independent variables in its regression equation are found in the intersection of its row and the columns corresponding to the independent variables (or vice versa, as the matrix is symmetric). For example, the 3-vectors of coefficients for the dependent variables INCOME and VOTEPCT can be selected by
S@(LIST ’INCOME IV)
S@(LIST ’VOTEPCT IV)
and their regression equations can be read off as:
INCOME is estimated as -0.132*AGE + 3.610*STATUS + 13.280
VOTEPCT is estimated as 0.622*AGE - 12.199*STATUS + 45.813
Note that each regression equation has three terms, not just the two that were swept with the SWEEP command. This is because the original COVAR matrix was created with the Constant swept already (i.e., the entries are cross products of deviations from the variable means, not cross products of the raw scores). This is done purely for reasons of numerical stability, and the Constant term may be removed from a regression equation (by sweeping it in) in cases when a constant term is not wanted in the equation.
The great flexibility of this approach lies in the fact that other regressions may be computed from the output of the SWEEP operation. As it is itself a covariation matrix, albeit one with the variance redistributed in certain ways, it too can be swept to produce further regressions. For example, if we were dissatisfied with the INCOME regression shown above, and decided that the variable VOTEPCT should be added as an independent variable, we could simply compute
(SWEEP S ’VOTEPCT)
As this does not require the resweeping of AGE and STATUS, it is convenient to add or remove independent variables (by sweeping them out or in). The order of sweeping does not affect the result of the analysis.
The diagonal element for a dependent variable is the amount of variance left unaccounted for by the independent variables. Thus the residual sum of squares for INCOME is
SDIAG@’(INCOME)
This is more easily interpretable if it is scaled by its value before the sweep:
CDIAG ← (TRANSPOSE C ’(1 1))
SDIAG@’(INCOME) / CDIAG@’(INCOME)
The last line gives the proportion of variance unaccounted for by the last sweep operation. That is just 1-R² where R is the multiple correlation between the predicted values from the regression equation and the observed values.
The next interesting set of questions is the relationship between the independent variables. If these are closely related to each other, then the unique contribution of each one to the overall prediction will be small. The diagonal element for an independent variable is the negative reciprocal of its unique variation in a different regression: that of this variable on all the other independent variables. This is a measure of "multicollinearity," or the interdependence of the independent variables in a regression. Like the residual sum of squares, it is more clearly interpretable if it is scaled by the variable’s variance before the SWEEP took place, that is, by the corresponding element of CDIAG. This can be done simultaneously for all independent variables by
-1/(CDIAG@(LIST IV) * SDIAG@(LIST IV))
which produces a vector of 1-R² values for the multicollinearity regression. The 1-R² figure gives the proportion of variance predicted by this variable which is not predicted by the other variables in the regression. If 1-R² for a predictor is 1.0, then that predictor is totally unrelated to all others and thus contributes maximally to explaining variation in the dependent variable. On the other hand, a 1-R² of 0.2 or less indicates that this independent variable has little effect of its own on any dependent variable; it is just a combination of other independent variables in the regression.
8.3.2.2 Standardized regression
If the SWEEP operations are applied to a NORMalized matrix rather than an unNORMalized one, the result is a standardized regression. The layout of the matrix is the same, but some of the interpretations are easier. The same entries in the matrix are the regression coefficients, but now they are standardized coefficients, also called beta or path coefficients (when they are used to label lines of influence between variables in a path model). As the initial variance of each variable has been set to 1.0 by the NORMalization, the diagonal entries do not need to be scaled by the pre-SWEEP entries, as was previously the case. Now, the diagonal element for a dependent variable is the proportion of variance left unexplained by the regression, and the diagonal of an independent variable is now the negative reciprocal of the proportion of variance unique to that variable in the regression.
8.3.2.3 Regression residuals
The analysis of a regression model is not complete with the computation of the regression coefficients. Examination of the residuals from the regression often helps one to detect significant departures from the linear, additive model assumed by regression, and then to improve the model’s fit to the data. Residuals can be computed by using the regression coefficients to calculate the actual predicted values of the regression. Consider our last example, where we derived a regression equation for INCOME. First, we compute a vector of predicted values for the dependent variable:
P ← (MPROD A@’((AGE STATUS)) S@’((AGE STATUS) INCOME)) + S@’(Constant INCOME)
This expression extracts the matrix of raw scores corresponding to the independent variables AGE and STATUS, multiplies them by the corresponding coefficients, adds them up, and finally adds the Constant coefficient. (Recall that A was the data matrix from which the covariation matrix was originally produced.) The residuals are then produced by subtracting the predicted scores from the real ones, i.e.,
R ← A@’(INCOME) - P
In cases where either the residuals or the predicted scores have continuing interest, they can be added to the data array as a new variable.
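A rough NumPy sketch of the same computation (illustrative only; the data and coefficients are invented, and the fit is done with a least-squares call rather than SWEEP):

import numpy as np

rng = np.random.default_rng(6)
A = np.column_stack([rng.normal(40, 10, 50), rng.normal(3, 1, 50)])   # hypothetical AGE and STATUS columns
income = 0.1 * A[:, 0] + 3.5 * A[:, 1] + rng.normal(0, 1, 50)         # hypothetical INCOME values
Xc = np.column_stack([A, np.ones(len(A))])                            # predictors plus a constant column
coef, *_ = np.linalg.lstsq(Xc, income, rcond=None)                    # regression coefficients
p = Xc @ coef                                                         # predicted values
r = income - p                                                        # residuals, e.g. for plotting against p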
The vector of residuals can be plotted against the predictor from the regression (here P), or against any other variable with the PLOT function. For example,
(PLOT R P)
plots the residuals generated above against the vector of values predicted by the regression. Plots of residuals against the regression predictor or independent variables often show departures from the form of relationship assumed by the regression model; they can also show the appropriate transformations or additions to the model to correct the problem. For instance, if a residual plot shows an upward- or downward-facing "U" shape, one or more independent variables may have a quadratic relationship with the dependent variable; the addition of a quadratic term in those variables to the equation may improve the model’s fit. Anscombe and Tukey (1963) or Kruskal (1968) contain extensive discussions of the issues involved in plots and transformations of variables.
8.3.2.4 Partial correlations
Information on the relations between the dependent variables, in the form of partial correlations, is also available from the output of SWEEP. Partial correlations are simply ordinary correlations between variables that have had the effect of some common variation removed. Thus, NORMing a covariation matrix from which some set of variables has been swept out gives the partial correlations among the unswept variables, controlling for those swept. This can provide partial correlations controlling for more than one variable.
In a similar fashion, one can define a partial multiple correlation. Assume that there are two sets of variables, X and Y, and one is interested in the multiple correlation of X with some dependent variable Z after controlling for Y. The first step is to remove the variance associated with the set Y by sweeping those variables out of the matrix. This leaves a certain amount of variance common to X and the dependent variable Z, which can be measured by SWEEPing out X and looking at the proportional change in the variance of Z. This is an exact parallel to the sum of squares in ordinary regression. There the post-SWEEP variance of Z is compared with the total variance of Z; here it is compared with its variance after the removal of the variance associated with the set Y.
8.4 Classification based statistics
Classification based statistics are computed from data grouped according to the values of other observations on the same individual. Classifications in IDL are represented in the dimensional structure of an array, usually in its leading dimensions. Thus, the basic data structure for comparing the heights of men and women is a [2,n]-matrix of which SEX is the leading dimension and each row of which contains the heights of a set of either men or women. Summary statistics can then be computed for the two levels of the first dimension. Multi-way classifications are represented by more than one classifying dimension. The usual mechanism for constructing such classified data arrays is GROUP.
GROUP takes a matrix of classifiers or attributes and an array of data values, and arranges the data into classification cells as follows. Each column of the attributes matrix produces a new leading dimension in the output array. Thus, a single column (one variable) classifier induces a one-way classification, a two column classifier groups into a two-way table, and so on. The output is formed by slicing the data array along one dimension (usually the first) and grouping together all slices whose corresponding attribute rows are the same. The data values appear in the cell in classifier space "addressed" by that row. The trailing dimensions of the output are the original dimensions of the values array.
Computation of summary statistics for the cells of a classification is facilitated by the fact that the dimensions of classification are kept in the output of GROUP, so generic functions (such as MOMENTS and COUNTS) will automatically be applied within those cells.
If an attribute variable has a large number of different values relative to the number of observations, e.g., if it is a continuous measure, then it might be necessary first to recode it into intervals using TRANSLATE or ROUND. For example,
(ROUND A@’((TIME)) 20)
could be used to scale a column of response times into 20-second wide intervals.
8.4.1 Frequency statistics
Frequency or contingency tables are usually produced by GROUPing a weighting vector (often a vector of 1’s) by an attributes matrix and then obtaining the sum of the weights within groups by COUNTS, which ignores any NILs in the grouped array.
For example, if a matrix A contained variables labelled SEX and VOTE, then
TAB ← (COUNTS (GROUP A@’((SEX VOTE)) 1) )
could be used to request the two-way cross-tabulation
Counts of Group of 1 by Selected Variables from SURVEY DATA

                        VOTE
  SEX         DEMOCRAT   REPUBLIC   OTHER
  FEMALE           317        181      27
  MALE             351        256      34
It is often informative just to look at a raw frequency table, to detect such things as underpopulated categories of some variable (they could be collapsed with other categories) and to notice, in the case when the input array was value labelled, the number of subjects omitted because of "wild scores" (i.e., values which were not assigned a value-label).
Tables can be scaled in different ways to make the trends they show more clearly visible. For instance, a two-way table could be displayed as a set of row proportions or as a set of column proportions. These are obtained in IDL by a combination of PLUS-reduction and KEEP:
(KEEP TAB ’VOTE) / (RPLUS (KEEP TAB ’VOTE))
yields the column proportions for TAB. The denominator is a vector with one element for each column in TAB giving the sum across rows for that column:
                VOTE
  DEMOCRAT   REPUBLIC   OTHER
       668        437      61
The KEEP in the numerator expression indicates that the columns are to be withheld in the QUOTIENT extension. This means that each element of TAB will be divided by the sum corresponding to its column. The row proportions result simply by specifying that rows are to be kept. In either case, multiplying by 100 converts the proportions to percentages:
100 * (KEEP TAB ’SEX) / (RPLUS (KEEP TAB ’SEX))
Expressions of the same general form will produce proportions or percentages for margins in higher-dimensional arrays: the keeps indicating the desired marginals must appear in both the numerator and denominator to achieve proper alignment for the extension. Thus, for a 4-array A
(RPLUS (KEEP A 1 3 4))
gives the 1, 3, 4-plane of the marginal totals formed by summing across the second dimension, and
100 * (KEEP A 1 3 4) / (RPLUS (KEEP A 1 3 4))
is the corresponding percentaged array.
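In NumPy the same alignment that KEEP provides is obtained with the keepdims option when summing; a rough sketch using the table from the example (illustrative only):

import numpy as np

TAB = np.array([[317, 181, 27],
                [351, 256, 34]], dtype=float)            # the SEX by VOTE table shown above
col_pct = 100 * TAB / TAB.sum(axis=0, keepdims=True)     # column percentages (summing over SEX)
row_pct = 100 * TAB / TAB.sum(axis=1, keepdims=True)     # row percentages (summing over VOTE)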
χ² is the most widely-used test of significance for contingency tables. It allows one to specify an array of "expected values" against which the values in the observed array are tested for a significant departure. By varying the expected values, many possible null hypotheses can be tested, not just the usual one of no relationship between the variables (Goodman, 1972). The χ² statistic is a summation of terms called "χ deviates", one from each cell of the table. A χ deviate is the squared difference of the observed and expected frequencies scaled by the expected frequency. Often, examination of the table of χ deviates, or even the raw deviates (observed minus expected for each cell), is helpful. One can detect patterns of positive and negative deviation in the various regions of the table, or can note which cells contribute large amounts to χ² by showing large χ deviates.
The task of computing a χ² for a table thus involves two steps: generating the expected values and calculating the χ deviates. As an example, consider the hypothesis of equal counts in all cells of some two dimensional array OBS. The expected value table can be obtained by
EXP ← (RESHAPE (RPLUS OBS)/(RTIMES (SHAPE OBS)) (SHAPE OBS) )
(RTIMES (SHAPE OBS)) is the number of cells in OBS so (RPLUS OBS)/(RTIMES (SHAPE OBS)) will be the count per cell if the total count of OBS is evenly distributed over its cells. Reshaping this number to the shape of OBS gives us the array that we would have under the hypothesis of equal distribution. Alternatively,
(MOMENTS OBS)@’(Mean)
could be used to compute the average cell count. With the observed array OBS and the expected array EXP, we can compute x2. The most straightforward method is simply to write
(OBS-EXP)↑2 / EXP
which subtracts the two arrays, squares the result, and divides it by the array of expected values, producing an array of x deviates. A correction for continuity is usually applied if (REDUCE EXP ’MIN) is less than 5. The appropriate formula is then
((ABS OBS-EXP)-1)↑2 / EXP
The total χ² may be found by PLUS-reducing this deviate array, and its probability may be obtained from the χ² distribution.
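A rough NumPy sketch of the equal-counts test (illustrative only; the observed table is invented, and the continuity correction is omitted):

import numpy as np

OBS = np.array([[12.0, 7.0], [9.0, 4.0]])          # hypothetical observed counts
EXP = np.full(OBS.shape, OBS.sum() / OBS.size)     # total count spread evenly over the cells
dev = (OBS - EXP) ** 2 / EXP                       # chi-square deviates
chi2 = dev.sum()                                   # total chi-square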
The usual expected value array for bivariate independence in a two dimensional table, which takes into account the observed marginals of the variables but no association between them, is computed using matrix product. If OBS is a matrix, then
R ← (RPLUS (KEEP OBS 1))
C ← (RPLUS (KEEP OBS 2))
EXP ← (MPROD R C) / (RPLUS OBS)
will compute the cell-wise expected values. First, the row and column totals are calculated. Their matrix product is then divided by the grand total. χ² can then be computed using the techniques described above.
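Sketched in NumPy, with the outer product of the marginal totals standing in for MPROD (illustrative only, using the table from the earlier example):

import numpy as np

OBS = np.array([[317, 181, 27],
                [351, 256, 34]], dtype=float)
R = OBS.sum(axis=1)                        # row totals
C = OBS.sum(axis=0)                        # column totals
EXP = np.outer(R, C) / OBS.sum()           # expected counts under independence
chi2 = ((OBS - EXP) ** 2 / EXP).sum()      # total chi-square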
8.4.2 Measure statistics
The previous section considered comparisons between groups when the only information available was the frequency of occurrence of events or subjects with certain attributes. We now turn to analyses based on measures of various other properties. The first thing to examine about a set of measure values is the frequency with which each value appears in the set. The frequency distribution can be produced by GROUPing the number 1 using a vector of measured values as the classifier, and counting the result with COUNTS. If the values are in a multi-dimensional array instead of a vector, the dimensional structure must first be eliminated, by RESHAPEing the array down to a vector. Thus,
(COUNTS (GROUP (RESHAPE X) 1))
will produce a vector of frequency counts for the various values in X. A weighting vector of the appropriate length can be used instead of 1 to take sampling biases into account. Applying the function HIST to the vector produced by COUNTS will generate a graphical display of the distribution.
After examining the frequency distribution, it is common to analyze the differences in measured values between groups defined by some classification design. The appropriate test statistic and approach for this problem depend on the nature of the classification, the level of measurement of the measure involved, and assumptions about its underlying distribution.
8.4.2.1 Non-parametric methods
The non-parametric methods impose the weakest requirements on the measure in that it need not be interval-scaled and its distribution need not be normal. If the criterion variable is only ordinal scaled, then one of the rank-order statistics should be used. The function RANK computes the set of ranks on which many statistics are based. It returns an image of its arbitrary array argument with each cell containing the rank of the corresponding observation within the set of values found in the array.
The Kruskal-Wallis one-way analysis of variance by ranks is the non-parametric equivalent to the ordinary one-way analysis of variance. It is appropriate when there is a single measure for subjects in a set of independent groups. It requires only ordered data and makes no assumptions about the distribution of the underlying variable. The Kruskal-Wallis H statistic is obtained by assigning to each subject the rank of his score within the set of scores for all subjects pooled together. The MOMENTS of the ranked scores are computed and submitted to ANOVA. The test statistic is not the F that ANOVA produces, but a ratio of quantities easily obtained from the ANOVA. This ratio may be referred to the χ² distribution.
Assume that A is a vector (or [n,1]-matrix) of grouping attributes for a set of scores in the vector S. Then the ranks over all scores grouped by A are obtained by
R ← (GROUP A (RANK S))
Alternatively, if G is a matrix of already grouped scores, the ranks within the pooled groups are
R ← (KEEP (RANK (LEAVE G ALL)) 1)
In either case, the first dimension of R is kept, so that the desired ANOVA table is simply
AN ← (ANOVA (MOMENTS R))
The Kruskal-Wallis H statistic is defined by
H = SSgroups/MStotal
where MStotal is obtained by pooling the group and error lines in the table. Thus
H ← AN@’(2 SumSq) / (RPLUS AN@’((2 3) SumSq))/(RPLUS AN@’((2 3) df))
When the number of subjects in each group is large, this statistic is distributed as χ² with K-1 degrees of freedom, for K the number of groups. The degrees of freedom is either
(SHAPE R)@’(1)-1 or AN@’(2 df)
In the special case where there are only two groups, the Kruskal-Wallis test is equivalent to the better known Mann-Whitney U test. The normal approximation for the U test is simply (SQRT H).
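The same statistic can be written in closed form; a rough Python sketch (illustrative only, with no correction for ties) that agrees with the SSgroups/MStotal ratio described above when all scores are distinct:

import numpy as np

def kruskal_h(groups):
    # groups is a list of 1-D arrays of scores, one per independent group.
    pooled = np.concatenate(groups)
    ranks = pooled.argsort().argsort() + 1.0          # ranks 1..N over the pooled scores
    n = len(pooled)
    h, start = 0.0, 0
    for g in groups:
        r = ranks[start:start + len(g)]
        h += r.sum() ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * h - 3.0 * (n + 1)   # H, referred to chi-square with K-1 df

# e.g. kruskal_h([np.array([1.2, 3.4, 2.2]), np.array([4.5, 5.1, 3.9])])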
Other non-parametric tests are appropriate when there are two or more observations on the same individual. To determine whether two paired observations differ, the sign-test and Wilcoxon signed-ranks test are used. The sign-test requires only ordinal measurement. Subjects are classified into groups depending on the sign of the difference of the two observations, and the numbers of subjects with positive and negative signs are compared. In effect, a one-way contingency table is constructed, using the sign of the difference as a classifier. If V1 and V2 are the two observation vectors and S is the translation table
0     0     NIL
NIL   0     -1
0     NIL   1
then
(COUNTS (GROUP (TRANSLATE V1-V2 S) 1))
produces a vector of counts showing the number of positive and negative difference signs. This may be referred to the binomial distribution to determine its probability level. In this example, the matrix S translates tied scores to NIL, negative differences to -1, and positive differences to 1.
The Wilcoxon signed-ranks test assumes ordered metric measurement (i.e., assumes that it makes sense to rank a set of differences between scores), which is stronger than ordinal-level measurement. Subjects are classified by the sign of the difference just as for the sign-test. The classified value is not a subject’s weight, however, but the rank of the absolute value of the difference between the two observations for that subject:
G ← (GROUP (TRANSLATE V1-V2 S) (RANK (ABS V1-V2)))
The test statistic is either the sum of the positive ranks or the sum of the negative ranks, whichever is less. The expression
(REDUCE (COUNTS G) ’MIN)
provides the appropriate number. If the number of subjects (the length of V1) is less than 25, the probability of this statistic may be determined from published tables (e.g. Siegel, 1956). Otherwise, a normal approximation may be used.
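A rough NumPy sketch of the sign classification and the Wilcoxon statistic (illustrative only; the paired observations are invented):

import numpy as np

v1 = np.array([10.0, 12.5, 9.0, 14.0, 11.0, 8.5])   # hypothetical first observations
v2 = np.array([11.0, 12.5, 7.5, 15.5, 10.0, 9.5])   # hypothetical second observations
d = v1 - v2
d = d[d != 0]                                        # drop ties, as the NIL translation does
signs = np.sign(d)                                   # sign-test counts: (signs == 1).sum(), (signs == -1).sum()
ranks = np.abs(d).argsort().argsort() + 1.0          # ranks of the absolute differences
w = min(ranks[d > 0].sum(), ranks[d < 0].sum())      # Wilcoxon signed-ranks statistic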
8.4.2.2 Parametric methods (ANOVA)
The basis for analysis of group differences on an interval scaled measure is a MOMENTS table containing the zero through second moments of the observations within each group. The function ANOVA performs an analysis of variance on such a table, providing both F and t tests.
Consider an experiment in which subjects are asked to decide whether visual patterns are the same. The patterns are presented under two lighting conditions in four different colors, and subjects’ response times are recorded. A single observation for each of 40 subjects is entered in a subjects by variables matrix A, with variables labelled LIGHT, COLOR, and TIME. The following expression computes the table of moments needed for determining the effect of the LIGHT and COLOR conditions on TIME scores:
M ← (MOMENTS (GROUP A@’((LIGHT COLOR)) A@’(TIME)))
This causes the TIME values to be cross-classified by LIGHT and COLOR. The MOMENTS are then computed within each level of the classification (MOMENTS only operates over the unkept dimensions), and the result is saved as M. M is a [2,4,3]-array containing the within-cell N, mean, and variance for each cell in the classification:
Moments of Group of Variable TIME from Pattern Experiment by
Selected Variables from Pattern Experiment keeping LIGHT and COLOR

LIGHT = LOW
                         Moment
  COLOR           N       Mean   Variance
  RED         5.000      6.052       .380
  BLUE        5.000      5.802       .397
  GREEN       5.000      5.976       .373
  ORANGE      5.000      6.520       .391

LIGHT = HIGH
                         Moment
  COLOR           N       Mean   Variance
  RED         5.000      6.454       .379
  BLUE        5.000      6.200       .382
  GREEN       5.000      6.398       .393
  ORANGE      5.000      6.656       .386
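For comparison, a rough pandas sketch that builds the same kind of per-cell table of N, mean, and variance (illustrative only; the data are invented):

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"LIGHT": np.repeat(["LOW", "HIGH"], 20),
                   "COLOR": np.tile(np.repeat(["RED", "BLUE", "GREEN", "ORANGE"], 5), 2),
                   "TIME":  rng.normal(6.2, 0.6, size=40)})
M = df.groupby(["LIGHT", "COLOR"])["TIME"].agg(["count", "mean", "var"])   # per-cell N, mean, variance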
The expression
(ANOVA M)
applies ANOVA to test the significance of the differences between the means in this table. This yields the two-way analysis of variance
Anova of Moments of Group of Variable TIME from Pattern Experiment by
Selected Variables from Pattern Experiment keeping LIGHT and COLOR

                                 Column
  Source         SumSq        df         MS          F        p
  Gnd-mean    1516.469     1.000   1516.469   3936.210     .000
  LIGHT          2.911     1.000      2.911      7.555     .010
  COLOR          1.800     3.000       .600      1.557     .219
  L*C             .139     3.000       .046       .120     .948
  Error         12.328    32.000       .385        NIL      NIL
In this simple design, the error term is the appropriate denominator for testing all the effects, and the main effect for LIGHT is the only one significant beyond the .05 level. Other denominators would be used for more complicated designs. Thus, if the two LIGHT conditions are considered as representatives of a much larger set of illumination levels and if the results of this experiment are to be generalized to that larger set, then LIGHT should be specified as a random factor in the design:
(ANOVA M ’(LIGHT))
The only change between this and the preceding analysis is that the mean-square for the LIGHT by COLOR interaction becomes the correct denominator for testing the COLOR main effect, and the probability level for that effect drops from .219 to .03.
GROUP, MOMENTS, and ANOVA can be composed to analyze a wide range of experimental designs. For hierarchical nesting of one factor inside another, the GROUPing must be done in two stages to produce what looks like a moments table for a cross-classification. Suppose, for example, that there is an additional factor in the pattern discrimination experiment: the red and blue stimuli are bigger than the green and orange ones. The factor COLOR is thus nested within SIZE. The moments table must have an additional SIZE dimension with two levels (large and small). The effective number of levels on the COLOR dimension is reduced from four to two, because there are only two colors within each size.
The first step is to GROUP the subjects by the non-superordinate factors in the hierarchy, LIGHT and COLOR, just as above. Then, however, the levels on the COLOR dimension are grouped by an attributes vector placing each of the colors in one of the SIZE levels:
G ← (GROUP ’(1 1 2 2)
(GROUP A@’((LIGHT COLOR)) A@’(TIME))
’COLOR)
The attribute list classifies the first two colors into level one on the new size dimension and the second two colors into level two. The result is then submitted to MOMENTS to obtain the desired [2,2,2,3] moments table. Alternatively, MOMENTS can be applied to the first grouping, giving the array M as above. The second GROUP is then performed on the moments instead of the raw data:
G ← (GROUP ’(1 1 2 2) M ’COLOR)
Either way, the result is given to ANOVA, with a third argument specifying that COLOR is nested within the size dimension:
(ANOVA G NIL ’((COLOR 1)))
The size dimension is indicated by number here, since we have not taken the trouble to add the SIZE label to G. The nesting specification suppresses the COLOR by SIZE interaction effects that would otherwise be included as sources of variation. They are pooled into the appropriate COLOR effects instead.
Repeated-measures designs, i.e., those in which observations for more than one condition are recorded from a single subject, require no special mechanism in IDL. The nesting and crossing relationships between the within-subject factors are handled simply by applying GROUP to the variable dimension of the subjects by variables matrix. The between-subject factors are treated in a similar way, the only difference being that the subject dimension itself is considered to be a factor in the design nested within all the between-subject factors. The arguments to ANOVA must specify this implicit nesting relationship and should further indicate that subject is a random factor.
Suppose COLOR and SIZE are within-subject, rather than between-subject, factors, with COLOR still nested within SIZE. That means that four different response times are recorded for each subject, one for each of the four presentation colors, so the data matrix has five variables (including the LIGHT attribute column). The within-subject classification is accomplished by grouping the variables into color categories and then the colors into sizes:
C ← (GROUP ’(1 2 3 4) A@’((2 3 4 5)) ’VARIABLE)
SC ← (GROUP ’(1 1 2 2) C 1)
The first line explodes the four response time variables into a new dimension with four levels, yielding the [4,n,1]-array C (where n is the number of subjects). The second line groups the levels on the first (COLOR) dimension into the appropriate size classes. Again, dimension numbers are used instead of explicitly labelling the results at each step. A third GROUP takes care of the between-subject factor:
LSC ← (GROUP A@’((LIGHT)) SC ’SUBJECT)
The desired moments table may be constructed by
M ← (MOMENTS (KEEP LSC ’SUBJECT))
where the KEEP is needed to retain SUBJECT as a factor in the result. Finally, the analysis of variance is computed by
(ANOVA M ’(SUBJECT) ’((SUBJECT LIGHT) (3 2)))
The second and third dimensions of M correspond to size and color respectively, so the list (3 2) correctly specifies the nesting relationship.
These examples illustrate how complex experimental designs may be analyzed. The same functions provide the much simpler comparisons for which a t test is normally utilized. The t test for independent samples is equivalent to a one-way analysis of variance. A [2,3]-matrix of moments for the two groups is constructed and given to ANOVA. The F statistic for the group effect is the square of the desired t value, and the p column of the table gives its two-tailed probability. If the data for the two groups are in separate vectors V1 and V2, moments can be computed separately and combined together by LIST:
(ANOVA (LIST (MOMENTS V1) (MOMENTS V2)))
The matched-pairs t test is equivalent to a repeated-measures analysis of variance. As above, the two within-subject variables are exploded into a separate dimension, and subjects are included as a random factor in the design. The F statistic for the two variables is the square of the desired t value.
The array operators may be used to examine effects in an analysis of variance design more directly, by successive subtraction of common additive factors from the array of means. Initially, the grand mean is subtracted out, leaving an overall mean of zero. Each cell then contains the effect for that cell of the design. If M is a moments array for some design, this can be accomplished by
E ← M@’(Mean)                  to compute the array of means,
G ← (MOMENTS E)@’(Mean)        to compute the grand mean, and
E ← E - G                      to compute the array of cell effects.
Here, G is computed as the unweighted average of the cell means. The grand mean using the cell N’s as weights may be computed with POOL by
G ← (POOL M)@’(Mean)
This elimination procedure can be carried out for higher order effects, and in this way, successive effects can be removed from the array. Assume the three factors in the design are labelled A, B, C, and that the AB interaction and the B and C main effects are significant. The AB effects matrix and the vectors of A, B, and C effects might be constructed. These can then be subtracted from E in turn, and the resultant patterns examined for clues as to the direction and locations of effects. When the results look "random", i.e., all numbers about the same size and no consistent patterns of + or - signs, then it is time to stop. The first steps of such a process might be
A ← (RPLUS (KEEP E ’A))
B ← (RPLUS (KEEP E ’B))
C ← (RPLUS (KEEP E ’C))
AB ← (RPLUS (KEEP E ’A ’B))
(PPA AB - B)
X ← (KEEP E ’A ’B) - AB
(PPA (KEEP X ’C) - C)
which calculates the A, B, C and AB effects, prints the AB effect with the significant B effect subtracted out, and then prints the array with both the AB and C effects removed. If this last array shows interesting residuals, the ABC interaction could be examined, even if it is known not to be significant.
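A rough NumPy sketch of this successive removal of effects (illustrative only; it uses marginal means of the effect array, and the array of cell means is invented):

import numpy as np

means = np.random.default_rng(8).normal(10.0, 1.0, size=(2, 3, 4))   # hypothetical A by B by C cell means
E = means - means.mean()                                             # cell effects around the grand mean
A = E.mean(axis=(1, 2), keepdims=True)                               # A main effects
B = E.mean(axis=(0, 2), keepdims=True)                               # B main effects
C = E.mean(axis=(0, 1), keepdims=True)                               # C main effects
AB = E.mean(axis=2, keepdims=True)                                   # A*B marginal effects
print(AB - B)                                                        # AB table with the B effect removed
print(E - AB - C)                                                    # residuals with the AB and C effects removed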
The EMS function provides the coefficients for the variance components included in the expected value of the mean squares for every row of an analysis of variance. It may be useful for theoretical explorations, for constructing quasi-F values when no legitimate test exists for an effect, and for determining how to remove effects from a model by pooling. An example, for a simple 2 x 4 crossed design with the first factor fixed and the other random, is
(EMS ’(2 4) ’(1))
This produces
                       Coeff
  Source       Factor1   Factor2   1*2
  Factor1            4         0     0
  Factor2            0         2     1
  1*2                0         0     1
The proper denominator for testing the effect of a particular source of variation i is determined as follows: If row i is all zero except for the diagonal element and if the design has a within-cell error term, the error is the appropriate denominator. Otherwise, if a row below the ith row is identical to the ith row in all respects except that its ith element is zero, that effect is the correct denominator for testing row i. According to these rules, the error term is the correct test for Factor1 and the 1*2 interaction, while the 1*2 interaction is the proper denominator for the Factor2 test.
If an appropriate row does not exist in the EMS table, it may be possible to add or subtract two or more rows to form an estimate of the denominator mean-square. Principles for pooling in this way and for correcting the degrees of freedom of the resulting quasi-F statistic are found in Winer (1971).