IDL REFERENCE MANUAL

5. Compression Functions

The distinction between compression and analysis functions is basic to IDL. Compression functions are used to transform a data array into a (typically smaller) array for further analysis. The compressed object is intended to be an information-rich representation of the data which can serve as the base for many subsequent analyses. Data arrays can be compressed in different ways depending on the types of analysis desired. The different compression routines described here each preserve different aspects of the larger array.

Several of these functions take an optional argument described as a "weighting" vector. This is used when the data is a systematically biased sample of the target population. This bias can often be corrected by counting each subject as representing more or less than one observation. For each subject in the array to be compressed, the corresponding element of a weighting vector indicates the number (not necessarily integral) of observations that the subject is to represent. Thus, a weighting vector must have as many elements as there are subjects. The default weighting for a subject, if no weighting vector is supplied, is 1.0. Subjects with a non-positive or NIL weight are simply omitted.

5.1 COUNTS

counts[a] returns the sum of the values of the array a.
It is similar to RPLUS except that it skips over NIL (rather than returning NIL if NIL is encountered, as RPLUS does).

Note that, since the output of GROUP is kept on the dimensions of grouping, COUNTS will operate within each cell of a GROUP classification to produce a contingency table. Thus, for a matrix A,

    (COUNTS (GROUP A@'((VOTE SEX)) 1))

produces a cross tabulation of VOTE against SEX.

5.2 COVAR

COVAR computes a covariation matrix, a symmetric matrix of the mean-centered cross products of the columns of a matrix. A covariation matrix can easily be converted to traditional correlation and (variance-) covariance matrices. However, it contains more information than either of these and is thus more useful for subsequent analyses.

covar[a;wt] where a is a matrix, and wt is an (optional) weighting vector, produces a symmetric [c+1,c+1]-matrix whose first c rows and columns correspond to the c variables (columns) of a. The last row and column correspond to a Constant "variable" whose value is taken to be 1.0 for all rows of a. In the first c rows and columns, the ij off-diagonal elements contain the sum of cross-product deviations from the means of the ith and jth variables, while the diagonal element gives the sum of squared deviations from the corresponding variable mean. In the Constant row, the ith off-diagonal element is the mean of the ith variable, while the diagonal element is -1/n, for n the number of observations on which the matrix entries are based.

The result corresponds to a matrix of "raw" cross-products from which the Constant "variable" has been swept out (see section 6.6).
For reasons of numerical stability, COVAR performs this sweep as the result is constructed.

If there are any missing values, then each cross product will be based on all rows for which data exist (i.e., observations are deleted pairwise). The figures in the final result are adjusted to appear as if they were all based on the smallest number of observations that any individual entry is based on, and that n is used to compute the Constant diagonal cell. The function PAIRN is available to compute the actual pairwise n.

5.3 GROUP

GROUP produces cross-classifications of data values as defined by the values of other observations on the same individuals. For example, to determine whether the heights of young and old men and women differ, their measured heights are grouped by their age and sex. Summary statistics can then be computed separately for the groups and compared. Cross-classifications in IDL are represented in the dimensional structure of an array, usually in an array's leading dimensions. Consequently, the extension mechanism will cause certain analysis functions to apply within the cells of the classification without further specification.

group[attribs;values;dim] for an [s,m]-matrix attribs and an n-array values with extent s on dimension dim (optional, default is one), constructs a classification array of m+n dimensions. Each row of attribs is considered as a subscript describing a cell in an m-way classification, and GROUP places the slice of values corresponding to that row into that cell of the classification. The slice corresponding to the ith row of attribs is the plane formed by selecting from values the ith level of dimension dim.
The m leading dimensions of the result are induced by the columns of attribs; the n trailing dimensions are the original dimensions of values.

If values is a number, it will be reshaped to a vector of length equal to the number of rows of attribs before the grouping is performed. If values is NIL, it is defaulted to one. This convention simplifies the construction of contingency tables: COUNTS of such a GROUP computes the number of cases in each cell of the classification.

If attribs is an s-vector, it is treated as an [s,1]-matrix.

The levels induced in the classification by each attribute (column of attribs) are determined as follows: If the column has a codebook, then only those values in the codebook form levels in the classification. Any other values are treated as "wild scores" and are omitted. Thus, if A is a matrix with columns SEX and HAIRCOLOR (coded blond, brown, black), then

    (GROUP A@'((SEX HAIRCOLOR)) ... )

will produce a two by three classification with one cell for each possible combination of SEX and HAIRCOLOR.

If the column does not have a codebook, then each value actually present is used. This requires more computation than the value-labelled case, as a preliminary pass over the attributes is made to determine how many distinct values there are.

The resulting array has the classification dimensions kept, so that generic functions (MOMENTS, COUNTS, RPLUS, etc.) will automatically apply within the levels of the grouping unless these keeps are explicitly overridden.

Frequently, the attributes and values are both variables in the same data matrix.
In such cases, the full power of the IDL selection mechanism, including label selectors, can be used to construct the appropriate GROUP arguments. Thus, for a matrix M, a three-way cross classification of the variables AGE and INCOME by the attribute variables SEX, EDUC, and VOTE will result from

    (GROUP M@'((SEX EDUC VOTE)) M@'((AGE INCOME)) )

Since the shape of the result depends on the attribute data, GROUP is not extended.

5.4 MOMENTS

MOMENTS computes the first m moments of an array.

moments[a;wt;m] where a is an array, wt is an (optional) weighting vector, and m is an optional (defaulted to two) integer giving the number of moments to be taken, returns a vector containing the zero through m-th moments of the values in a, defined as follows:

    $M_0 = \sum wt$, the number of non-missing observations
    $M_1 =$ the mean
    $M_2 = \sum wt\,(a - M_1)^2 / (M_0 - 1)$, the sample variance, and
    $M_i = \sum wt\,(a - M_1)^i / M_0$ for $i \geq 3$.

The third moment $M_3$ is related to the skewness, and the fourth $M_4$ to the kurtosis by

    $\mathrm{Skew} = M_3 / M_2^{3/2}$
    $\mathrm{Kurtosis} = M_4 / M_2^{2} - 3$

In other words, the skew and kurtosis are essentially the third and fourth moments, scaled by the variance. Higher moments are available from this routine but are not commonly used. Note that, while the variance is unbiased, subsequent entries of the result are biased estimates of the population parameters.

MOMENTS collapses an entire array down to a vector of moments, ignoring the dimensional structure.
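The moment definitions above can be illustrated with a minimal Python sketch. This is not IDL code and not the IDL implementation: the function names `moments` and `skew_and_kurtosis` are hypothetical, `None` stands in for NIL, the input is assumed to be a flat list, and the total weight is assumed to exceed one so that the variance is defined.

```python
def moments(values, weights=None, m=2):
    """Return [M0, M1, ..., Mm] for a flat list of values (illustrative only)."""
    if weights is None:
        weights = [1.0] * len(values)
    # Skip missing values (None plays the role of NIL) and
    # subjects whose weight is missing or non-positive.
    pairs = [(a, w) for a, w in zip(values, weights)
             if a is not None and w is not None and w > 0]
    m0 = sum(w for _, w in pairs)                # M0: sum of the weights
    if m0 <= 1:
        raise ValueError("total weight must exceed 1 in this sketch")
    m1 = sum(w * a for a, w in pairs) / m0       # M1: weighted mean
    result = [m0, m1]
    for i in range(2, m + 1):
        s = sum(w * (a - m1) ** i for a, w in pairs)
        # M2 divides by M0 - 1 (sample variance); higher moments divide by M0.
        result.append(s / (m0 - 1) if i == 2 else s / m0)
    return result[: m + 1]


def skew_and_kurtosis(mom):
    """Scale the third and fourth moments by the variance, as defined above."""
    m2, m3, m4 = mom[2], mom[3], mom[4]
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


data = [2.0, 4.0, None, 4.0, 6.0]   # one missing observation
mom = moments(data, m=4)
skew, kurt = skew_and_kurtosis(mom)
```

For the symmetric sample used here the third moment, and hence the skew, is zero.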
The dimensions often represent a cross-classification for which within-cell moments are desired. If some of the dimensions are kept, they will be withheld from MOMENTS, and within-cell results will be computed for the kept dimensions, collapsing across all the unkept ones. This meshes nicely with GROUP, which keeps the leading classification dimensions:

    (MOMENTS (GROUP M@'((SEX EDUC VOTE)) M@'(INCOME) ))

yields a table of moments appropriate for a 3-way analysis of variance of the variable INCOME in data matrix M.

5.5 PAIRN

pairn[a;wt] for a a matrix and wt an (optional) weighting vector, returns the pairwise n matrix (sum of the weights for rows having data on both of a pair of columns) for the columns of the matrix. If a has c columns, the result will be a symmetric [c,c]-matrix each cell of which gives the n that the corresponding entry of (COVAR a wt) would be based on.

5.6 POOL

POOL could also be thought of as an analysis function, in that it operates on a compression of the original data (a MOMENTS array). However, it acts to further compress the data, so it is treated here. POOL collapses a MOMENTS array into an array with fewer classifying factors (thus its name, as it corresponds to the "pooling" operation of the analysis of variance).

pool[mtable] where mtable is an n-array whose last dimension is interpreted as the Moment dimension of a MOMENTS table, returns a vector formed by collapsing all the observations represented by mtable into a single classification.

By itself, this is not very interesting. Its chief use is in combination with KEEPing to select the dimensions that are to be preserved. Thus, (POOL (KEEP M I J K)) collapses the MOMENTS table M, preserving the classifications which make up its I, J, and K dimensions.
The result is exactly the same as MOMENTS would have produced from the original data with only these dimensions kept.
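The equivalence between pooling a moments table and recomputing moments from the original data can be checked with a small Python sketch. It is an illustration under stated assumptions, not IDL's algorithm: the helper names `cell_moments` and `pool` are hypothetical, only the default moments (M0, M1, M2) are handled, a moments table is simplified to a list of per-cell vectors, and every cell is assumed to have a total weight greater than one.

```python
def cell_moments(values):
    """Return [M0, M1, M2] for one cell, with unit weights (illustrative only)."""
    data = [a for a in values if a is not None]   # None plays the role of NIL
    m0 = float(len(data))
    m1 = sum(data) / m0
    m2 = sum((a - m1) ** 2 for a in data) / (m0 - 1)
    return [m0, m1, m2]


def pool(cells):
    """Collapse a list of per-cell [M0, M1, M2] vectors into a single vector."""
    n = sum(c[0] for c in cells)
    mean = sum(c[0] * c[1] for c in cells) / n
    # Total sum of squares = within-cell part, recovered from each variance,
    # plus the between-cell part from each cell mean's distance to the pooled mean.
    ss = sum((c[0] - 1) * c[2] + c[0] * (c[1] - mean) ** 2 for c in cells)
    return [n, mean, ss / (n - 1)]


men = [170.0, 182.0, 176.0]
women = [160.0, None, 168.0, 164.0]

pooled = pool([cell_moments(men), cell_moments(women)])
direct = cell_moments(men + women)
```

In this sketch `pooled` and `direct` agree up to floating-point rounding, which is the property the text describes: the pooled table carries enough information to reproduce the moments of the combined observations without returning to the raw data.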