IDL REFERENCE MANUAL
6. Analysis Functions
Analysis functions are those that perform statistical analyses of some sort, usually on a compressed object. This chapter gives a relatively formal description of their behavior; their use in data analysis is illustrated in Chapter 8.
6.1 ANOVA
ANOVA computes an analysis of variance table given a MOMENTS array, a specification of its random factors, and a design showing its crossing and nesting relations.
anova[mtable;random;nesting] where mtable is an array of moments, random is an (optional) specification of the factors to be considered random, and nesting is an (optional) specification of the nesting relationships between the factors of the analysis, returns a matrix which gives the summary table for the analysis of variance.
The array of moments is of the form produced by MOMENTS, with a last dimension containing the N, mean and variance, and with leading dimension(s) forming the classification structure (or factors) for the analysis. If there is only one observation per cell, the moments dimension may have a single level containing just that observation: the N is defaulted to one, and there will be no within-cell error term.
The random argument is a list of factors, specified by dimension name or number, which will be assumed random in the construction of the F-tests. All other factors are assumed to be fixed. For example, in a typical repeated-measures design the subject factor is considered random, so a specification of (SUBJECTS) would be appropriate.
The nesting argument is a list of lists, each one corresponding to a single nested factor (dimension of the moments array). The first element of each list specifies the nested factor; subsequent elements indicate the factors inside which it is nested. As a factor may be denoted by either its dimension name or its number, a nesting specification of
((WARD CITY STATE) (CITY STATE))
indicates that the factor WARD is nested inside the factors CITY and STATE, and that CITY itself is nested within STATE.
In the table of moments, nested factors appear as if they were crossed with their nesting factors. That is, without the nesting specification, the first level on the WARD dimension would be taken to represent the same ward for each of the levels on the STATE dimension, and the ANOVA output would include WARD by CITY interaction effects. The nesting specification eliminates those sources of variation, causing them to be pooled into the appropriate WARD effects.
ANOVA returns a matrix laid out and labelled like a conventional analysis of variance summary table. The dimension labels of the moments table are used to generate appropriate labels for the rows (sources of variation) of the anova table. Statistics for the grand mean are given as the first row, while the within-cell error, if any, is found in the last row. The columns are labelled as SumSq, df, MS, F, and p. The F column is NIL for effects that cannot be tested directly; the EMS function provides expected-mean-square coefficients and may help in computing quasi-F values.
If the within-cell frequencies are not equal, ANOVA computes an unweighted-means approximation.
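The arithmetic behind a summary table can be illustrated for the simplest case. The following Python sketch is not IDL code: the name one_way_anova, the tuple layout, and the row labels are illustrative assumptions. It builds the rows of a one-way, fixed-effects table from per-cell (N, mean, variance) triples, assuming equal cell frequencies and a variance computed with an N-1 denominator. The grand-mean row and the p column are omitted.

```python
def one_way_anova(moments):
    """Rows (source, SumSq, df, MS, F) for a one-way design.

    moments: list of (n, mean, variance) triples, one per cell.
    Assumes equal n in every cell and an n-1 variance denominator.
    """
    ns = [n for n, _, _ in moments]
    if len(moments) < 2 or len(set(ns)) != 1:
        raise ValueError("sketch needs >= 2 cells with equal frequencies")
    n, k = ns[0], len(moments)
    means = [mean for _, mean, _ in moments]
    grand = sum(means) / k

    # Between-cell variation: n times the squared deviations of cell means.
    ss_factor = n * sum((mean - grand) ** 2 for mean in means)
    df_factor = k - 1
    ms_factor = ss_factor / df_factor

    # With one observation per cell there is no within-cell error term,
    # so the effect cannot be tested and F is left as None.
    df_error = k * (n - 1)
    if df_error == 0:
        return [("Factor", ss_factor, df_factor, ms_factor, None)]

    # Within-cell variation: pooled from the cell variances.
    ss_error = sum((n - 1) * var for _, _, var in moments)
    ms_error = ss_error / df_error
    f = ms_factor / ms_error if ms_error > 0 else None
    return [("Factor", ss_factor, df_factor, ms_factor, f),
            ("Error", ss_error, df_error, ms_error, None)]


if __name__ == "__main__":
    for row in one_way_anova([(3, 2.0, 1.0), (3, 4.0, 1.0), (3, 6.0, 1.0)]):
        print(row)
```

Random factors, nesting, and the unweighted-means approximation for unequal frequencies are deliberately outside the scope of this sketch.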
6.2 EMS (Expected Mean Squares)
EMS computes an array of coefficients that define the expected values of mean squares in an anova design with arbitrary crossing and nesting relations and arbitrary combinations of fixed or random factors. From these coefficients, the structurally appropriate F and quasi-F ratios for testing various null hypotheses may be determined.
ems[nlevels;random;nesting] where nlevels specifies the number of levels for each factor, random is an ANOVA specification of the random factors, and nesting is an ANOVA nesting specification, returns the matrix of effect coefficients for the described design.
If nlevels is a vector, it is interpreted as being the "shape" of the factor portion (all but the Moment dimension) of a moments array. The entry for each factor is the number of levels it has. In a nested design, the entry for a nested factor is the number of levels it has within each level of the factors inside which it is nested, not its total number of levels; thus, if a design has wards nested within cities, and there are 20 wards and 4 cities, then there must be 5 wards in each city, and nlevels must be [5 4] if nesting is ((WARD CITY)).
If nlevels has more than one dimension, it is assumed to be a moments table, and EMS computes the number of levels directly from its shape.
EMS interprets random and nesting in exactly the same way as ANOVA.
EMS returns an [n,n]-matrix, where n is the number of lines in the corresponding ANOVA table, excluding the line for the grand mean and the line for within-cell error, if one exists in the design. The i,j entry in the EMS table is the coefficient for the jth source of variation in the expected-value for the mean square of the i+1th row of the ANOVA table. Consider as an example a simple 2 X 4 crossed design with factor 1 random and 2 fixed. The expression
(EMS '(2 4) '(1))
produces the table
                 Coeff
Source     Factor1  Factor2  1*2
Factor1       4        0      0
Factor2       0        2      1
1*2           0        0      1
This indicates that the expected mean square for Factor2 is $2\sigma^2_{2} + \sigma^2_{1*2}$. In a design with a within-cell error term (i.e., more than one observation per cell), every expected mean square includes the term $+\sigma^2_{\mathrm{error}}$; this term does not appear in the EMS table. Thus, if a line in the EMS table has a nonzero entry on the diagonal and all other entries zero, its appropriate F-test denominator is the within-cell error line.
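The coefficients in the example can be reproduced with a textbook rule for fully crossed designs. The following Python sketch is not IDL's algorithm, and the name ems_crossed is hypothetical. It assumes the restricted mixed-model rule: the variance component of source S enters the expected mean square of effect E when S contains E and every additional factor in S is random, with a coefficient equal to the product of the levels of the factors not in S. Nesting is not handled, and factors are numbered from 1 as in the example.

```python
from itertools import combinations
from math import prod


def ems_crossed(nlevels, random=()):
    """EMS coefficients for a fully crossed design (no nesting).

    nlevels: number of levels of each factor, in factor order.
    random:  1-based numbers of the factors treated as random.
    Returns (sources, table); table[i][j] is the coefficient of
    source j in the expected mean square of source i.
    """
    k = len(nlevels)
    random = set(random)
    # Sources of variation: main effects first, then interactions.
    sources = [frozenset(c)
               for size in range(1, k + 1)
               for c in combinations(range(1, k + 1), size)]
    table = []
    for effect in sources:
        row = []
        for source in sources:
            # Source contributes if it contains the effect and all of
            # its extra factors are random.
            if effect <= source and (source - effect) <= random:
                row.append(prod(nlevels[f - 1]
                                for f in range(1, k + 1)
                                if f not in source))
            else:
                row.append(0)
        table.append(row)
    return sources, table


if __name__ == "__main__":
    sources, table = ems_crossed([2, 4], random=[1])
    for source, row in zip(sources, table):
        print("*".join(str(f) for f in sorted(source)), row)
```

For the 2 X 4 design with factor 1 random, the sketch yields the rows 4 0 0, 0 2 1, and 0 0 1, matching the table above.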
6.3 HIST (Histogram)
hist[v;file] prints onto the file file (primary output file if not given) a histogram display of the vector v, returning v. If file is not open for output, it is opened, the histogram is printed, and file is closed. The axis of the histogram will run down the page, and the plot unit (the number of cases represented by a single character) will be scaled with respect to the page width to allow maximum discrimination. Both positive and negative counts are allowed, and the axis of the histogram will be moved to accommodate the range encountered in the vector. To speed the printing of very large histograms, sequences of more than HISTRPTLINES (initially 5) lines with identical counts are suppressed and replaced by a sequence of three vertically aligned dots.
6.4 PLOT
plot[y;x;file] prints a plot of the vector y against the vector x (1 through length(y), if not given) to the file file, returning NIL. If file is not open for output, it is opened, the plot is printed, and file is closed. If x is given, the result will be an x-y (scatter) plot of elements of y against the corresponding elements of x. Otherwise, it will be a series graph. The axes of the plot will appear at the left and bottom sides of the plot, which will be scaled in terms of the page size to give maximum legibility. The vertical size of the plot will be chosen so that both axes are approximately the same length (rather than the same number of print positions). The scaling from length to print positions is controlled by the global variable PLOT.AXIS.RATIO (notionally the ratio of the lengths of vertical and horizontal print positions), initially .6.
6.5 NORM
norm[m] normalizes a matrix m by dividing each entry by the square root of the product of the diagonal elements in its row and column. It is commonly used to norm an inner product representation of a vector space, e.g., to turn covariation matrices into correlation matrices.
For non-square arrays, NORM selects out the top left hand square corner. From this, or from the whole array for square m, those row/columns which have negative diagonal entries are discarded. This is useful when applying NORM to matrices which have had a transformation applied to them which leaves some diagonals negative (e.g., SWEEP). Then, each element of the result is divided by the square root of the product of its diagonals, and the normed matrix is returned.
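The steps just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the IDL implementation: the matrix is taken to be a list of row lists, and zero diagonal entries (which would cause a division by zero) are not handled.

```python
from math import sqrt


def norm(m):
    """Norm a matrix by its diagonal, as described for NORM.

    m: list of rows (lists of numbers); need not be square.
    """
    # For non-square input, work on the top left square corner.
    size = min(len(m), len(m[0]))
    # Discard row/columns whose diagonal entry is negative.
    keep = [i for i in range(size) if m[i][i] >= 0]
    # Divide each entry by the square root of the product of the
    # diagonal elements in its row and column.
    return [[m[i][j] / sqrt(m[i][i] * m[j][j]) for j in keep]
            for i in keep]


if __name__ == "__main__":
    covariation = [[4.0, 2.0],
                   [2.0, 9.0]]
    for row in norm(covariation):
        print(row)
```

Applied to the covariation matrix in the example, the off-diagonal entries become 2 / sqrt(4 * 9), the correlation between the two variables, and the diagonal entries become 1.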
6.6 SWEEP
SWEEP performs successive orthogonalization of an inner product matrix (such as those produced by COVAR) using the SWP and RSW operators of Dempster (1969). SWEEP may be used to obtain most regression based statistics.
sweep[m;outvars;invars] where m is a matrix and outvars and invars are (optional) selectors for the second dimension of m, returns a matrix which is the result of having swept the columns of m selected by outvars out and the columns selected by invars in (in that order). (In the notation used by Dempster (1969), sweep[m;outvars;invars] may be described as RSW(SWP(m,outvars),invars).) Either or both outvars and invars may be NIL, in which case the corresponding operation is not carried out. Sweeping will be carried out in the sequence specified in the outvars and invars arguments unless m is non-symmetric, in which case total pivoting is used to maximize numerical stability.
For a precise definition and discussion of the properties of the SWEEP algorithm, the reader is referred to Dempster (1969). The common statistical interpretations of SWEEP operations are discussed in Chapter 8 below, along with computational methods of extracting information from the results. Here we simply describe the interpretation of the result of sweeping a set of variables out of some covariation matrix. The i,j off-diagonal element of the result of such a sweep contains
for swept i and unswept j, the regression coefficient of variable i as a predictor for variable j in the regression equation including all the swept out variables as predictors.
for unswept i and j, the partial covariation between variables i and j controlling for all the swept out variables. The partial correlation is obtainable by scaling this quantity with respect to its current diagonals using NORM.
The ith diagonal element contains
for swept i, the negative reciprocal of the residual sum of squares of this variable regressed against the other swept variables. This quantity is closely related to the multicollinearity of this variable with the other swept variables.
for unswept i, the residual sum of squares (unaccounted for variation) in variable i. Dividing this quantity by its value before the sweep gives the proportion of variation left unexplained, so the $R^2$ for i is one minus that ratio.
Sweeping a variable in to a matrix merely undoes the effect of previously having swept it out. Sweeping out all variables from a matrix gives the negative of its inverse.
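The properties listed above can be checked numerically with a small Python sketch. It is not the IDL implementation: the functions swp, rsw, and sweep below are illustrative, use zero-based variable indices, assume the sign convention under which sweeping out every variable gives the negative inverse, and perform no pivoting.

```python
def _pivot(m, k, sign):
    """Pivot m on diagonal element k; sign=+1 sweeps out, -1 sweeps in."""
    n, d = len(m), m[k][k]
    # Every element first gets the rank-one adjustment.
    out = [[m[i][j] - m[i][k] * m[k][j] / d for j in range(n)]
           for i in range(n)]
    # Pivot row and column are scaled by the pivot (negated when
    # sweeping in), and the pivot becomes its negative reciprocal.
    for i in range(n):
        out[i][k] = sign * m[i][k] / d
        out[k][i] = sign * m[k][i] / d
    out[k][k] = -1.0 / d
    return out


def swp(m, k):
    return _pivot(m, k, 1.0)


def rsw(m, k):
    return _pivot(m, k, -1.0)


def sweep(m, outvars=(), invars=()):
    """Sweep outvars out of m, then invars in, in the order given."""
    for k in outvars:
        m = swp(m, k)
    for k in invars:
        m = rsw(m, k)
    return m


if __name__ == "__main__":
    cov = [[4.0, 2.0],
           [2.0, 9.0]]
    s = sweep(cov, outvars=[0])
    print("regression coefficient of 0 predicting 1:", s[0][1])
    print("residual sum of squares for 1:", s[1][1])
    print("R^2 for 1:", 1.0 - s[1][1] / cov[1][1])
    print("negative inverse:", sweep(cov, outvars=[0, 1]))
    print("restored:", sweep(s, invars=[0]))
```

In this example, sweeping out variable 0 leaves the regression coefficient 2/4 in the off-diagonal position and the residual sum of squares 9 - 2*2/4 on the diagonal for variable 1.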