IDL REFERENCE MANUAL
5. Compression Functions
The distinction between compression and analysis functions is basic to IDL. Compression functions transform a data array into a (typically smaller) array for further analysis. The compressed object is intended to be an information-rich representation of the data that can serve as the base for many subsequent analyses. Data arrays can be compressed in different ways depending on the types of analysis desired. Each of the compression routines described here preserves different aspects of the larger array.
Several of these functions take an optional argument described as a "weighting" vector. This is used when the data is a systematically biased sample of the target population. This bias can often be corrected by counting each subject as representing more or less than one observation. For each subject in the array to be compressed, the corresponding element of a weighting vector indicates the number (not necessarily integral) of observations that the subject is to represent. Thus, a weighting vector must have as many elements as there are subjects. The default weighting for a subject, if no weighting vector is supplied, is 1.0. Subjects with a non-positive or NIL weight are simply omitted.
5.1 COUNTS
counts[a] returns the sum of the values of the array a. It is similar to RPLUS except that it skips over NIL (rather than returning NIL if NIL is encountered, as RPLUS does).
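The difference in NIL handling can be illustrated outside IDL. The following is a minimal Python sketch, not IDL code: `None` stands in for NIL, and the functions `rplus` and `counts` are illustrative stand-ins that only mimic the behavior described above for a flat list. Returning 0 when every entry is missing is an assumption of the sketch; the text does not say what COUNTS does in that case.

```python
def rplus(values):
    """Plain sum that propagates missing data: any None makes the result None."""
    total = 0
    for v in values:
        if v is None:
            return None
        total += v
    return total


def counts(values):
    """Sum that skips missing entries (None) instead of propagating them.

    Assumption of this sketch: an all-missing input yields 0.
    """
    return sum(v for v in values if v is not None)


data = [2, None, 3, 5]
print(rplus(data))   # the missing entry makes the whole sum missing
print(counts(data))  # the missing entry is skipped
```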
Note that, since the output of GROUP is kept on the dimensions of grouping, COUNTS will operate within each cell of a GROUP classification to produce a contingency table. Thus, for a matrix A,
(COUNTS (GROUP A@'((VOTE SEX)) 1))
produces a cross tabulation of VOTE against SEX.
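For readers more familiar with general-purpose languages, the same cross tabulation can be sketched in Python. This is only an analogy under stated assumptions: the toy data, the function name `crosstab`, and the alphabetical ordering of levels are all hypothetical, and the sketch counts each subject once, which corresponds to grouping the value 1.

```python
from collections import Counter

# Hypothetical toy data: one (vote, sex) pair per subject.
rows = [("yes", "F"), ("no", "M"), ("yes", "M"), ("yes", "F"), ("no", "F")]


def crosstab(pairs):
    """Count the subjects falling in each (row level, column level) cell."""
    cell_counts = Counter(pairs)
    # Levels are sorted here only to give the table a predictable layout.
    row_levels = sorted({r for r, _ in pairs})
    col_levels = sorted({c for _, c in pairs})
    table = [[cell_counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table


vote_levels, sex_levels, table = crosstab(rows)
for level, counts_row in zip(vote_levels, table):
    print(level, counts_row)
```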
5.2 COVAR
COVAR computes a covariation matrix, a symmetric matrix of the mean-centered cross products of the columns of a matrix. A covariation matrix can easily be converted to traditional correlation and (variance-) covariance matrices. However, it contains more information than either of these and is thus more useful for subsequent analyses.
covar[a;wt] where a is a matrix, and wt is an (optional) weighting vector, produces a symmetric [c+1,c+1]-matrix whose first c rows and columns correspond to the c variables (columns) of a. The last row and column correspond to a Constant "variable" whose value is taken to be 1.0 for all rows of a. In the first c rows and columns, the ij off-diagonal elements contain the sum of cross-product deviations from the means of the ith and jth variables, while the diagonal element gives the sum of squared deviations from the corresponding variable mean. In the Constant row, the ith off-diagonal element is the mean of the ith variable, while the diagonal element is -1/n, for n the number of observations on which the matrix entries are based.
The result corresponds to a matrix of "raw" cross-products from which the Constant "variable" has been swept out (see section 6.6). For reasons of numerical stability, COVAR performs this sweep while the result is being constructed rather than afterwards.
If there are any missing values, each cross product is based on all rows for which data exist (i.e., observations are deleted pairwise). The entries of the final result are then adjusted so that they appear to be based on the smallest n underlying any individual entry, and that n is used to compute the Constant diagonal cell. The function PAIRN (section 5.5) is available to compute the actual pairwise n.
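The layout of the covariation matrix may be easier to follow from a small worked sketch. The Python below is not IDL's implementation: it handles complete data only (no pairwise deletion or adjustment), computes the means first instead of sweeping, and the helper names `covariance` and `correlation` are hypothetical. The conversions shown are the conventional ones (divide by n - 1 for covariance, scale by the diagonal for correlation), assumed rather than taken from the text.

```python
import math


def covar(rows, wt=None):
    """Covariation matrix for complete data; missing values are NOT handled."""
    c = len(rows[0])
    if wt is None:
        wt = [1.0] * len(rows)
    # Subjects with a missing or non-positive weight are omitted.
    kept = [(r, w) for r, w in zip(rows, wt) if w is not None and w > 0]
    n = sum(w for _, w in kept)
    means = [sum(w * r[j] for r, w in kept) / n for j in range(c)]
    result = [[0.0] * (c + 1) for _ in range(c + 1)]
    for i in range(c):
        for j in range(c):
            # Sum of (weighted) cross-product deviations from the means.
            result[i][j] = sum(
                w * (r[i] - means[i]) * (r[j] - means[j]) for r, w in kept
            )
        # Constant row and column hold the variable means.
        result[i][c] = result[c][i] = means[i]
    result[c][c] = -1.0 / n  # Constant diagonal cell
    return result


def covariance(cv, i, j):
    """Conventional covariance: cross-product deviations over (n - 1)."""
    n = -1.0 / cv[-1][-1]
    return cv[i][j] / (n - 1.0)


def correlation(cv, i, j):
    """Conventional correlation from the covariation entries."""
    return cv[i][j] / math.sqrt(cv[i][i] * cv[j][j])


cv = covar([[1, 2], [2, 4], [3, 9]])
print(cv)
print(covariance(cv, 0, 1), correlation(cv, 0, 1))
```

Note that n is recoverable from the Constant diagonal cell, which is why the converted matrices need no extra arguments.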
5.3 GROUP
GROUP produces cross-classifications of data values as defined by the values of other observations on the same individuals. For example, to determine whether the heights of young and old men and women differ, their measured heights are grouped by their age and sex. Summary statistics can then be computed separately for the groups and compared. Cross-classifications in IDL are represented in the dimensional structure of an array, usually in an array’s leading dimensions. Consequently, the extension mechanism will cause certain analysis functions to apply within the cells of the classification without further specification.
group[attribs;values;dim] for an [s,m]-matrix attribs and an n-array values with extent s on dimension dim (optional, default is one), constructs a classification array of m+n dimensions. Each row of attribs is considered as a subscript describing a cell in an m-way classification, and GROUP places the slice of values corresponding to that row into that cell of the classification. The slice corresponding to the ith row of attribs is the plane formed by selecting from values the ith level of dimension dim. The m leading dimensions of the result are induced by the columns of attribs; the n trailing dimensions are the original dimensions of values.
If values is a number, it will be reshaped to a vector of length equal to the number of rows of attribs before the grouping is performed. If values is NIL, it is defaulted to one. This convention simplifies the construction of contingency tables: COUNTS of such a GROUP computes the number of cases in each cell of the classification.
If attribs is an s-vector, it is treated as a [s,1]-matrix.
The levels induced in the classification by each attribute (column of attribs) are determined as follows: If the column has a codebook, then only those values in the codebook form levels in the classification. Any other values are treated as "wild scores" and are omitted. Thus, if A is a matrix with columns SEX and HAIRCOLOR (coded blond, brown, black), then (GROUP A@'((SEX HAIRCOLOR)) ... ) will produce a two-by-three classification with one cell for each possible combination of SEX and HAIRCOLOR.
If the column does not have a codebook, then each value actually present is used. This requires more computation than the value-labelled case, as a preliminary pass over the attributes is made to determine how many distinct values there are.
The resulting array has the classification dimensions kept, so that generic functions (MOMENTS, COUNTS, RPLUS, etc.) will automatically apply within the levels of the grouping unless these keeps are explicitly overridden.
Frequently, the attributes and values are both variables in the same data matrix. In such cases, the full power of the IDL selection mechanism, including label selectors, can be used to construct the appropriate GROUP arguments. Thus, for a matrix M, a three-way cross classification of the variables AGE and INCOME by the attribute variables SEX, EDUC, and VOTE will result from
(GROUP M@'((SEX EDUC VOTE)) M@'((AGE INCOME)) )
Since the shape of the result depends on the attribute data, GROUP is not extended.
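The rules for levels, wild scores, and default values can be summarized in a short Python sketch. It is an approximation under stated assumptions, not IDL's GROUP: the classification is represented as a dictionary keyed by cell rather than as an array with kept dimensions, the `codebooks` argument is a hypothetical stand-in for IDL's codebooks, and levels without a codebook are simply sorted.

```python
from itertools import product


def group(attribs, values=None, codebooks=None):
    """Place each subject's value into the cell named by its attribute row."""
    m = len(attribs[0])
    codebooks = codebooks or [None] * m
    if values is None:
        values = 1  # NIL values default to one
    if isinstance(values, (int, float)):
        values = [values] * len(attribs)  # a number is reshaped to a vector
    levels = []
    for j in range(m):
        if codebooks[j] is not None:
            levels.append(list(codebooks[j]))  # only coded values form levels
        else:
            # Preliminary pass: every value actually present becomes a level.
            levels.append(sorted({row[j] for row in attribs}))
    cells = {cell: [] for cell in product(*levels)}
    for row, value in zip(attribs, values):
        key = tuple(row)
        if key in cells:  # rows containing a wild score have no cell
            cells[key].append(value)
    return cells


# Hypothetical data: "red" is not in the HAIRCOLOR codebook, so it is wild.
attribs = [("M", "blond"), ("F", "black"), ("M", "red"), ("F", "black")]
heights = [70, 64, 68, 66]
books = [["M", "F"], ["blond", "brown", "black"]]

cells = group(attribs, heights, books)
tallies = {cell: sum(v) for cell, v in group(attribs, None, books).items()}
print(cells)
print(tallies)
```

With a codebook on both columns the result always has two by three cells, including empty ones, which is the fixed-shape property the codebook case provides.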
5.4 MOMENTS
MOMENTS computes the first m moments of an array.
moments[a;wt;m] where a is an array, wt is an (optional) weighting vector, and m is an (optional) integer giving the number of moments to be taken (default two), returns a vector containing the zeroth through m-th moments of the values in a, defined as follows:
$M_0$ = the number of non-missing observations, i.e., $\sum \mathit{wt}$,
$M_1$ = the mean,
$M_2$ = the sample variance = $\sum \mathit{wt}\,(a - M_1)^2 / (M_0 - 1)$, and
$M_i$ = $\sum \mathit{wt}\,(a - M_1)^i / M_0$ for $i \ge 3$.
The third moment $M_3$ is related to the skewness, and the fourth $M_4$ to the kurtosis by
Skew = $M_3 / M_2^{3/2}$
Kurtosis = $M_4 / M_2^{2} - 3$
In other words, the skew and kurtosis are essentially the third and fourth moments, scaled by the variance. Higher moments are available from this routine but are not commonly used. Note that, while the variance is unbiased, subsequent entries of the result are biased estimates of the population parameters.
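As a worked illustration of these definitions, the following Python sketch computes the moment vector for a flat list. It assumes the reading of the formulas in which the variance divides by M0 - 1 and the third and higher moments divide by M0, and it assumes that the mean is the weighted mean. `None` stands in for NIL, and the helpers `skew` and `kurtosis` are hypothetical names. Empty input and the single-observation case are not handled.

```python
def moments(a, wt=None, m=2):
    """Return [M0, M1, ..., Mm] for the non-missing values of a."""
    if wt is None:
        wt = [1.0] * len(a)
    # Drop missing values and subjects with missing or non-positive weight.
    kept = [
        (x, w) for x, w in zip(a, wt)
        if x is not None and w is not None and w > 0
    ]
    m0 = sum(w for _, w in kept)
    m1 = sum(w * x for x, w in kept) / m0  # assumed: weighted mean
    result = [m0, m1]
    for i in range(2, m + 1):
        s = sum(w * (x - m1) ** i for x, w in kept)
        # The variance uses M0 - 1; higher moments use M0.
        result.append(s / (m0 - 1) if i == 2 else s / m0)
    return result[: m + 1]


def skew(mv):
    return mv[3] / mv[2] ** 1.5


def kurtosis(mv):
    return mv[4] / mv[2] ** 2 - 3


mv = moments([2, 4, 4, None, 4, 5, 5, 7, 9], m=4)
print(mv)
print(skew(mv), kurtosis(mv))
```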
MOMENTS collapses an entire array down to a vector of moments, ignoring the dimensional structure. The dimensions often represent a cross-classification for which within-cell moments are desired. If some of the dimensions are kept, they will be withheld from MOMENTS, and within-cell results will be computed for the kept dimensions collapsing across all the unkept ones. This meshes nicely with GROUP, which keeps the leading classification dimensions:
(MOMENTS (GROUP M@'((SEX EDUC VOTE)) M@'(INCOME) ))
yields a table of moments appropriate for a 3-way analysis of variance of the variable INCOME in data matrix M.
5.5 PAIRN
pairn[a;wt] where a is a matrix and wt is an (optional) weighting vector, returns the pairwise n matrix for the columns of a, that is, the sum of the weights for rows having data on both of a pair of columns. If a has c columns, the result will be a symmetric [c,c]-matrix, each cell of which gives the n that the corresponding entry of (COVAR a wt) would be based on.
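The definition translates directly into a double loop over columns. The Python below is a minimal sketch rather than IDL's routine: `None` stands in for NIL, the matrix is a list of rows, and subjects with a missing or non-positive weight are skipped, following the weighting convention stated at the start of this chapter.

```python
def pairn(rows, wt=None):
    """Sum of weights of the rows having data on both columns i and j."""
    c = len(rows[0])
    if wt is None:
        wt = [1.0] * len(rows)
    result = [[0.0] * c for _ in range(c)]
    for row, w in zip(rows, wt):
        if w is None or w <= 0:
            continue  # omitted subject
        for i in range(c):
            for j in range(c):
                if row[i] is not None and row[j] is not None:
                    result[i][j] += w
    return result


n = pairn([[1, 2, None], [3, None, 5], [4, 6, 7]])
for line in n:
    print(line)
```

The diagonal cell [i,i] gives the number of observations present on column i alone.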
5.6 POOL
POOL could also be thought of as an analysis function, in that it operates on a compression of the original data (a MOMENTS array). However, it acts to further compress the data, so it is treated here. POOL collapses a MOMENTS array into an array with fewer classifying factors (thus its name, as it corresponds to the "pooling" operation of the analysis of variance).
pool[mtable] where mtable is an n-array whose last dimension is interpreted as the Moment dimension of a MOMENTS table, returns a vector formed by collapsing all the observations represented by mtable into a single classification.
By itself, this is not very interesting. Its chief use is in combination with KEEPing to select the dimensions that are to be preserved. Thus, (POOL (KEEP M I J K)) collapses the MOMENTS table M, preserving the classifications which make up its I, J, and K dimensions. The result is exactly the same as MOMENTS would have produced from the original data with only these dimensions kept.
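The claim that pooling reproduces what MOMENTS would have computed from the original data can be checked with a small Python sketch. It covers only the first three entries (M0, M1, M2) of each cell's moment vector and uses the standard identity for combining within-cell and between-cell sums of squares; it is not IDL's POOL. The flat list of cells, and the convention that a cell with a single observation carries a variance of 0.0, are assumptions of the sketch.

```python
def pool(cells):
    """Combine per-cell [M0, M1, M2] vectors into one pooled [M0, M1, M2]."""
    cells = [c for c in cells if c[0] > 0]  # empty cells contribute nothing
    n = sum(c[0] for c in cells)
    mean = sum(c[0] * c[1] for c in cells) / n
    # Within-cell sum of squares plus the between-cell contribution.
    ss = sum((c[0] - 1) * c[2] + c[0] * (c[1] - mean) ** 2 for c in cells)
    return [n, mean, ss / (n - 1)]


# Cell A holds the observations 1, 2, 3; cell B holds 4, 6.
cell_a = [3, 2.0, 1.0]
cell_b = [2, 5.0, 2.0]
pooled = pool([cell_a, cell_b])
print(pooled)  # agrees with the moments of 1, 2, 3, 4, 6 taken directly
```

Pooling over some factors while preserving others amounts to applying the same combination separately within each level of the preserved dimensions.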