IDL REFERENCE MANUAL
Appendix F
Annotated Protocol
This protocol gives an annotated transcript of a data analysis session using IDL. The transcript and its annotations are intended to illustrate some basic techniques of data analysis and the use of IDL operations to accomplish them.
The annotations are set off from the transcript in a small font. The system may be assumed to have typed everything else, except lines beginning with a ← and the first line. These are the user’s commands to the system, given in response to the ←, which is the Interlisp prompt character. The use of the male pronoun in referring to the user in the annotations is neither conventional nor discriminatory. This particular user is a man.
The user begins at the TENEX executive (which prompts with @). He intends to analyze ratings of wines he collected at a recent wine tasting. Having placed these data in the file TASTING.DATA with an ordinary text editor, he uses the TENEX SEE command to verify that the file is in a suitable format:
@SEE TASTING.DATA

(TITLES "The Definitive Wine Tasting" Person Wine)
(LABELS Canyon Heights L’Effete Pallide)
(Ron -2 4 0 4)
(Jeff 2 -1 -4 3)
(Susan 5 4 5 5)
(Henri -10 -9 9 10)
(Kathy 5 -2 3 6)
(Joanne 5 4 -4 3)
(Bob -6 5 6 -3)
(Beau 0 4 2 4)
(Fred -1 1 2 5)
(Janet 4 -2 4 -5)
The file is in the format expected by the IDL function IDLMATRIX: The TITLES list contains the title string for the data matrix and the labels (Person and Wine) for its two dimensions, the LABELS list defines the levels on the Wine dimension, and the remaining lines give the level labels for the Person dimension and the data values. With the data in order, he starts IDL to begin the session. IDL prints out a message identifying the version of the system, greets him, and leaves him at the Interlisp executive (the ← prompt). He immediately types an expression that sets the variable TD to the array described in the file.
@IDL

INTERACTIVE DATA-ANALYSIS LANGUAGE 18-DEC-78 ...

Good afternoon.
←(TD ← (IDLMATRIX (READFILE ’TASTING.DATA]
[Array 1: Person=10 Wine=4]
The Lisp value of this instruction is the array itself, which appears as the square-bracketed expression. An IDL array is assigned a serial number as it is created, and this number serves as a unique identifier for it. The printed representation (or "print-name") of an array includes its serial number and an indication of its shape. For this array, it indicates that TD has two dimensions labelled Person and Wine, and that Person has 10 levels and Wine has 4. The values of the array are not displayed, so the user requests that the whole array be SHOWn to him.
←SHOW

The Definitive Wine Tasting

Wine
Person Canyon Heights L’Effete Pallide
Ron -2 4 0 4
Jeff 2 -1 -4 3
Susan 5 4 5 5
Henri -10 -9 9 10
Kathy 5 -2 3 6
Joanne 5 4 -4 3
Bob -6 5 6 -3
Beau 0 4 2 4
Fred -1 1 2 5
Janet 4 -2 4 -5

[Array 1: Person=10 Wine=4]
The array TD is printed using SHOW, a command which prints the value of the previous command. TD is a simple data matrix, which purports to report on the results of a wine tasting at which 10 people (named Ron, Jeff, ... , Janet) tasted four wines (named Canyon, ... , Pallide) and rated them on a scale from -10 to 10. Each row of the matrix contains, for one person, the score that they gave to each of the four wines.
The user’s first question is "How well did people like these wines?". This is answered by looking at the mean (average) of all the ratings. The MOMENTS operator computes the zeroth moment (the count of observations) and the first two moments (the mean and the variance) of all the values in TD, and the user SHOWs the result.
←(MOMENTS TD)
[Array 2: Moment=3]
←SHOW

Moments of The Definitive Wine
Tasting
Moment
N Mean Variance
40.000 1.625 20.189

[Array 2: Moment=3]
The mean is 1.625, somewhat positive, but the variance is large, suggesting that there were differences, either by people, by wines, or by both.
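For readers who want to check the arithmetic, the same three moments can be reproduced outside IDL. The following is a minimal Python sketch (not IDL, and not part of the original session) that computes the count, mean, and unbiased sample variance of the 40 ratings; it reproduces the 1.625 and 20.189 in the table above.

```python
# Python sketch of what MOMENTS computes over all of TD:
# the zeroth moment (count), the mean, and the sample variance.

scores = [
    -2, 4, 0, 4,    2, -1, -4, 3,   5, 4, 5, 5,    -10, -9, 9, 10,
    5, -2, 3, 6,    5, 4, -4, 3,    -6, 5, 6, -3,  0, 4, 2, 4,
    -1, 1, 2, 5,    4, -2, 4, -5,
]

def moments(xs):
    """Return (n, mean, sample variance) for a flat list of scores."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # unbiased (n-1) form
    return n, mean, var

n, mean, var = moments(scores)
```

The printed 20.189 tells us the variance is the unbiased (n-1) form; dividing by n instead would give a slightly smaller value.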
In order to explore these, the user decides to look first at Wines. He recomputes the MOMENTS, KEEPing out Wines. The result will have the MOMENTS done separately for each wine (i.e., down the columns of the original matrix), giving a matrix where each wine is represented by three moments, rather than ten people. The user instructs the system to print (PPA for Pretty Print Array) the value immediately.
←(PPA (MOMENTS (KEEP TD ’Wine]
Moments of The Definitive Wine
Tasting keeping Wine

Moment
Wine N Mean Variance
Canyon 10.000 .200 26.178
Heights 10.000 .800 19.289
L’Effete 10.000 2.300 17.122
Pallide 10.000 3.200 18.622

[Array 6: Wine=4 Moment=3]
The MOMENTS for the Wines reveal that the French wines were slightly better liked. But is this difference significant, or is it just the slight difference you might expect by chance? The user passes the value of the MOMENTS to the ANalysis Of VAriance to find out. He is able to specify that ANOVA is to work on the output of MOMENTS by using the name IT, which is always bound to the last value computed.
←(ANOVA IT]
[Array 7: Source=3 Column=5]
←SHOW

Anova of Moments of The Definitive Wine Tasting
keeping Wine

Column
Source SumSq df MS F p
Gnd-mean 105.625 1.000 105.625 5.202 .029
Wine 56.475 3.000 18.825 .927 .438
Error 730.900 36.000 20.303 NIL NIL

[Array 7: Source=3 Column=5]
ANOVA is a technique that contrasts the size of differences between groups (here between the various Wines) with the amount of variation in the scores. Such a contrast provides much useful information to the experienced data analyst. Our user notes that the probability that the Grand Mean (the average of the averages) is zero is .029, very unlikely. He was right in his inference that the ratings were significantly positive. However, the differences between the wines are not significantly different from zero (and, therefore, they are not significantly different from each other) as the probability of them being just random perturbations is .438, which is quite high. [Most data analysts consider probability levels greater than 0.05 (1 in 20) too high to permit inference of an effect other than random variation.]
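The sums of squares in the ANOVA table follow directly from the per-wine means. The sketch below (Python, not IDL; an illustration, not the system's implementation) reconstructs the grand-mean, Wine, and Error sums of squares and the F ratio for Wine from the raw data. The p column is omitted, since it requires tail areas of the F distribution.

```python
# Python sketch of the one-way sums of squares behind the ANOVA table:
# grand-mean SS, between-wine SS, and residual (error) SS.

wines = {
    "Canyon":   [-2, 2, 5, -10, 5, 5, -6, 0, -1, 4],
    "Heights":  [4, -1, 4, -9, -2, 4, 5, 4, 1, -2],
    "L'Effete": [0, -4, 5, 9, 3, -4, 6, 2, 2, 4],
    "Pallide":  [4, 3, 5, 10, 6, 3, -3, 4, 5, -5],
}

all_scores = [x for col in wines.values() for x in col]
n = len(all_scores)
grand = sum(all_scores) / n

ss_grand = n * grand ** 2                                   # 105.625
ss_wine = sum(len(col) * (sum(col) / len(col) - grand) ** 2
              for col in wines.values())                    # 56.475
ss_total = sum((x - grand) ** 2 for x in all_scores)        # corrected total
ss_error = ss_total - ss_wine                               # 730.900

df_wine, df_error = len(wines) - 1, n - len(wines)          # 3 and 36
f_wine = (ss_wine / df_wine) / (ss_error / df_error)        # about .927
```

Note that the grand-mean SS, 105.625, is just n times the squared grand mean, and that Wine and Error partition the remaining (corrected) total of 787.375.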
Reflecting on these results, the user realizes that this particular analysis is inappropriate for the data at hand. There are multiple observations (or "repeated measures") for each Person, and this means that some of the "Error" variation might be due to systematic differences between people (i.e., some people might simply like wine more than others). A more elaborate form of the analysis of variance is required so that the inter-wine variation may be tested independently of the inter-person variation. In the correct analysis, both Person and Wine must be treated as experimental factors so that their separate effects and their interaction may be examined. This is accomplished by letting the input to ANOVA be a MOMENTS table constructed with both dimensions of TD kept:
←(ANOVA (MOMENTS (KEEP TD ALL)) ’Person)
[Array 10: Source=4 Column=5]
←SHOW

Anova of Moments of The Definitive Wine Tasting
keeping Person and Wine and with Person considered
random

Column
Source SumSq df MS F p
Gnd-mean 105.625 1.000 105.625 11.300 .008
Person 84.125 9.000 9.347 NIL NIL
Wine 56.475 3.000 18.825 .786 .512
P*W 646.775 27.000 23.955 NIL NIL

[Array 10: Source=4 Column=5]
The second argument (Person) to ANOVA indicates that Person is to be considered a "random" factor. That is, the user considers the ten people who participated in the wine tasting as a small random sample drawn from the set of people in general. He wants to make inferences from the ratings of the particular sample of ten to the ratings that would likely be obtained from the larger population. The analysis of variance procedure selects an appropriate test for each effect of such designs, if one exists. In this case, there is no test for the Person variation, so the probability value is NIL.
The results of the corrected analysis are essentially the same as the first: The Grand Mean is slightly more significant than it was before, while the Wine effect is slightly less significant.
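The corrected partition can also be verified by hand. In this Python sketch (again an outside illustration, not IDL's own procedure), the corrected total splits into Person, Wine, and the Person-by-Wine interaction, and the interaction mean square serves as the error term for testing Wine, as the table above shows.

```python
# Python sketch of the repeated-measures partition: SS(total) =
# SS(Person) + SS(Wine) + SS(P*W), with P*W as the error term for Wine.

rows = [[-2, 4, 0, 4], [2, -1, -4, 3], [5, 4, 5, 5], [-10, -9, 9, 10],
        [5, -2, 3, 6], [5, 4, -4, 3], [-6, 5, 6, -3], [0, 4, 2, 4],
        [-1, 1, 2, 5], [4, -2, 4, -5]]

n_p, n_w = len(rows), len(rows[0])
grand = sum(map(sum, rows)) / (n_p * n_w)

ss_person = n_w * sum((sum(r) / n_w - grand) ** 2 for r in rows)   # 84.125
cols = list(zip(*rows))
ss_wine = n_p * sum((sum(c) / n_p - grand) ** 2 for c in cols)     # 56.475
ss_total = sum((x - grand) ** 2 for r in rows for x in r)
ss_pw = ss_total - ss_person - ss_wine                             # 646.775

f_wine = (ss_wine / (n_w - 1)) / (ss_pw / ((n_p - 1) * (n_w - 1)))  # about .786
```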
Despite the lack of statistical significance, the direction of the wine differences bothers the user. His impression of the evening was that the wines were thought of as being very similar. He did not expect to see a trend, even if slight, towards the French. Following these thoughts, he decides to explore differences between the people. He computes another set of MOMENTS, this time KEEPing out just Person.
←(MOMENTS (KEEP TD ’Person))
[Array 11: Person=10 Moment=3]
←SHOW

Moments of The Definitive Wine
Tasting keeping Person

Moment
Person N Mean Variance
Ron 4.000 1.500 9.000
Jeff 4.000 0.000 10.000
Susan 4.000 4.750 .250
Henri 4.000 .000 120.667
Kathy 4.000 3.000 12.667
Joanne 4.000 2.000 16.667
Bob 4.000 .500 35.000
Beau 4.000 2.500 3.667
Fred 4.000 1.750 6.250
Janet 4.000 .250 20.250

[Array 11: Person=10 Moment=3]
The data does show strong differences between the various Persons. Susan was the most positive; Henri and Jeff the most negative, although, interestingly enough, no one actually averaged below the midpoint of the scale. More interesting than the means, however, are the individual variances. Susan, in addition to being positive, had almost no variance (that is, she liked all the wines about the same, as a glance back at the original data shows). Henri, on the other hand, had very different reactions (much more so than anyone else) to the different wines.
The extreme difference between the amount of variance in Henri’s ratings and the amount everywhere else suggests that one or more of Henri’s ratings may be unreliable (maybe he was given a dirty glass, or something). A very useful technique for detecting such unreliable scores is to look for outliers, or values that are clearly more extreme than all the others. This type of screening can be used to detect values that are the result of influences other than the one the investigator had in mind: subjects falling asleep during reaction-time experiments, for example.
In this case, a simple but effective method is to plot a frequency distribution of the scores. GROUP is used to classify a default vector of all ones according to the values found in TD. The values are RESHAPEd (to a vector) initially; otherwise the GROUP would cross-classify by the columns (i.e., find all the rows with score x on wine 1, y on wine 2, etc.). Here, the user is interested in putting all instances of the same score together, no matter which column of the array they came from. Then, COUNTS is used to count the number of observations with each value
←(COUNTS (GROUP (RESHAPE TD]
[Array 18: Value=21]
←SHOW

Counts of Group of 1 by Values of The Definitive
Wine
Tasting keeping Value
Value
-10= -9= -8= -7= -6= -5=
1 1 0 0 1 1

Value
-4= -3= -2= -1= 0= 1=
2 1 3 2 2 1

Value
2= 3= 4= 5= 6= 7=
3 3 8 7 2 0

Value
8= 9= 10=
0 1 1

[Array 18: Value=21]
and HIST displays the result as a HISTogram (or bar graph).
←(HIST IT]

Histogram of Counts of Group of 1 by Values
of The
Definitive Wine Tasting keeping Value

Value
-10= |*
-9= |*
-6= |*
-5= |*
-4= |**
-3= |*
-2= |***
-1= |**
0= |**
1= |*
2= |***
3= |***
4= |********
5= |*******
6= |**
9= |*
10= |*

Each * indicates 1. The | is at 0.
[Array 18: Value=21]
The HIST shows the form of the distribution clearly. The most common reaction was quite positive (a score of 4). However, there were some negative observations that, in the absence of any equally extreme positive observations, pulled the average down. [The asymmetry around the mean could also have been detected from the higher moments, in particular, from the third moment or skew.] This asymmetry immediately explains the discrepancy between the user’s recollections and the average: his recollections reflected the common reaction, rather than the average.
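The GROUP-then-COUNTS step amounts to a frequency tally over the flattened matrix. A Python sketch (an outside illustration, not IDL) using the standard-library Counter reproduces the distribution that HIST drew: eight 4s, seven 5s, and lone observations at the extremes.

```python
# Python sketch of the GROUP/COUNTS step: flatten the matrix and tally
# how often each score occurs, as the histogram displays.

from collections import Counter

rows = [[-2, 4, 0, 4], [2, -1, -4, 3], [5, 4, 5, 5], [-10, -9, 9, 10],
        [5, -2, 3, 6], [5, 4, -4, 3], [-6, 5, 6, -3], [0, 4, 2, 4],
        [-1, 1, 2, 5], [4, -2, 4, -5]]

counts = Counter(x for row in rows for x in row)  # score -> frequency
```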
The HIST also shows four outliers, scores that lie outside the range of all the others. These four scores are, furthermore, at the limits of the scale, so they are doubly suspicious. Going back to the data, the user finds that all four belong to the same individual (Henri), which explains his high variance, and persuades the user to delete him from the data matrix. Whatever was determining Henri’s scores, it clearly had little to do with what was determining everyone else’s (and the direction of the differences even raises a dark suspicion in the user’s mind).
To delete Henri, the user simply creates a new data matrix formed by selecting all the people (rows) except Henri, and all the variables. This data matrix, NTD, will form the basis for the subsequent analyses.
←(NTD ← TD@’((1 2 3 5 6 7 8 9 10) ALL]
[Array 19: Person=9 Wine=4]
←SHOW

Selected Persons from The Definitive Wine Tasting

Wine
Person Canyon Heights L’Effete Pallide
Ron -2 4 0 4
Jeff 2 -1 -4 3
Susan 5 4 5 5
Kathy 5 -2 3 6
Joanne 5 4 -4 3
Bob -6 5 6 -3
Beau 0 4 2 4
Fred -1 1 2 5
Janet 4 -2 4 -5

[Array 19: Person=9 Wine=4]
. . . no Henri, redoing the MOMENTS by Wine . . .
←(MOMENTS (KEEP NTD 2]
[Array 35: Wine=4 Moment=3]
←SHOW

Moments of Selection of The
Definitive Wine Tasting

Moment
Wine N Mean Variance
Canyon 9.000 1.333 15.000
Heights 9.000 1.889 8.361
L’Effete 9.000 1.556 13.028
Pallide 9.000 2.444 14.528

[Array 35: Wine=4 Moment=3]
. . . and this time, the results show no differences for country of origin. Another way to reduce the impact of outliers is to recode the data. Here, the problem is the non-comparability of the scores from one person to another. A recode which solves this problem is to use the ranks of the values within people, rather than the values themselves. This provides a scaling of the scores to common units. The RANK operator, applied within the (kept) Person dimension (the rows) produces the ranks.
←(RANK (KEEP TD ’Person]
[Array 49: Person=10 Wine=4]
←SHOW

Rank of The Definitive Wine Tasting keeping
Person

Wine
Person Canyon Heights L’Effete Pallide
Ron 1.000 3.500 2.000 3.500
Jeff 3.000 2.000 1.000 4.000
Susan 3.000 1.000 3.000 3.000
Henri 1.000 2.000 3.000 4.000
Kathy 3.000 1.000 2.000 4.000
Joanne 4.000 3.000 1.000 2.000
Bob 1.000 3.000 4.000 2.000
Beau 1.000 3.500 2.000 3.500
Fred 1.000 2.000 3.000 4.000
Janet 3.500 2.000 3.500 1.000

[Array 49: Person=10 Wine=4]
In this case, they are floating-point numbers, as fractional ranks had to be introduced to represent ties.
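The within-row ranking with midranks for ties can be sketched as follows (Python, not IDL; the function name is just illustrative). Each value's rank is the average of the rank positions its tied group occupies, which is why Ron's two 4s each receive 3.5.

```python
# Python sketch of RANK within a row: tied values get the average
# (midrank) of the rank positions they jointly occupy.

def ranks(row):
    """Fractional (midrank) ranking of one person's ratings."""
    out = []
    for x in row:
        below = sum(1 for y in row if y < x)
        ties = sum(1 for y in row if y == x)
        # average of positions below+1 .. below+ties
        out.append(below + (ties + 1) / 2)
    return out
```

For example, Ron's row (-2, 4, 0, 4) ranks to (1.0, 3.5, 2.0, 3.5), matching the table above.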
←(MOMENTS (KEEP IT ’Wine]
[Array 53: Wine=4 Moment=3]
←SHOW

Moments of Rank of The Definitive
Wine Tasting keeping Person
keeping Wine

Moment
Wine N Mean Variance
Canyon 10.000 2.150 1.558
Heights 10.000 2.300 .844
L’Effete 10.000 2.450 1.025
Pallide 10.000 3.100 1.156

[Array 53: Wine=4 Moment=3]
The MOMENTS of the RANKs shows similar results to those obtained by eliminating the outliers.
Another way of approaching this type of data is to look at the relationship between the ratings produced by various people. Do they agree with each other, and who agrees with whom? The global level of agreement is given by the variance, and could be tested by an ANOVA of the MOMENTS KEEPing Person. The second question, however, requires a different form of analysis, an analysis of the covariations among the ratings. The covariation operator, COVAR, keeps the columns of its (matrix) argument as the objects between which the covariations are recorded, and computes those covariations across the rows. Usually, this preserves the observations and suppresses the subjects of a subjects by observations matrix. Here, the user is interested in the covariation between Persons, so he TRANSPOSEs the data array as he applies COVAR. He also asks for the covariations to be scaled (NORMalized) by their variances, so that each covariation appears in the same units. These scaled covariations, which range from 1.0 to -1.0, are the correlation coefficients, very useful measure-independent indices of covariation.
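Each entry of the NORMed matrix is an ordinary Pearson correlation between two people's rating vectors. The Python sketch below (an outside check, not IDL) computes two of the entries discussed later: Ron versus Beau and Ron versus Janet.

```python
# Python sketch of the normalized covariation (Pearson correlation)
# between two rows of the transposed data.

from math import sqrt

def corr(xs, ys):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

ron, beau, janet = [-2, 4, 0, 4], [0, 4, 2, 4], [4, -2, 4, -5]
```

These reproduce the .986 and -.926 in the Ron column of the matrix below.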
←(PPA (NORM (COVAR (TRANSPOSE TD]

Norm of Covariations of Transpose of The Definitive Wine Tasting

Person
Person Ron Jeff Susan Henri Kathy Joanne
Ron 1.000
Jeff .141 1.000
Susan -.556 .211 1.000
Henri .243 -.163 .546 1.000
Kathy -.375 .533 .937 .469 1.000
Joanne .163 .800 -.327 -.684 -.023 1.000
Bob .319 -.891 -.507 .200 -.728 -.649
Beau .986 0.000 -.522 .349 -.391 -.000
Fred .689 .169 .200 .838 .300 -.261
Janet -.926 -.492 .333 -.243 .062 -.381


Person
Person Bob Beau Fred Janet
Bob 1.000
Beau .441 1.000
Fred .101 .731 1.000
Janet .056 -.870 -.733 1.000

[Array 56: Person=10 Person=10]
The correlation matrix, printed in two planes so as to fit on the page, shows the correlations between people across wines. As the covariation between a and b is the same as that between b and a, the matrix is symmetric.
The user’s interest in this data is very practical. If he (Ron) has to miss the next wine tasting, whose opinion should he seek to best predict what his reactions would have been had he been there? Scanning the correlations in the column (or the row) under his name, he finds that Beau is an excellent choice, as he and Beau agree very closely. Joanne or Jeff, on the other hand, are very poor sources of information. Strangely enough, if Beau also misses the next tasting, the next best source of information is not Fred, with a correlation of .689, but Janet! The fact that Janet and Ron disagree almost completely makes her a very useful source of information: all Ron has to do is take her preferences and reverse them, and he will have an excellent predictor of the choice he would have made!
Having finished with his immediate investigation of the tasting, the user decides to load in some information he has about the people that were there, to see if any of their characteristics (other than nationality!) predict their preferences. This file, unlike the last one, is defined to set the variable PATTRIB to the data array it contains. Thus, he merely has to LOAD it and PATTRIB will be set automatically.
←LOAD(TCLASS.DATA]
FILE CREATED 12-FEB-78 20:42:52
TCLASSCOMS
<IDL>TCLASS.DATA;2
←(PPA PATTRIB)

People attributes

Variable
Person Sex Experienc Age
Ron Male Expert 31
Jeff Male Some 38
Susan Female None 31
Henri Male Some 23
Kathy Female Some 26
Joanne Female None 32
Bob Male Expert 42
Beau Male Some 29
Fred Male None 27
Janet Female Expert 33

[Array 57: Person=10 Variable=3]
The matrix of attributes has a row for each person with values for each of three attributes: the person’s Sex, previous wine-drinking Experience, and Age. Age is like the wine-ratings the user has been dealing with in that the magnitude of the numerical scores is meaningful. This is not true for Sex, where the variable indicates simply that a person falls into either the Male or Female category. IDL permits labels to be associated with the numerical values of such categorization variables, and those labels are printed instead of the underlying numerical scores. The mapping of values into labels is defined in a codebook associated with a variable; codebooks can be examined by means of the CODE selector:
←(PATTRIB@(CODE ’Sex))
((1 Male) (2 Female))
A codebook is a list of pairs that maps values into labels and vice versa. A different codebook can be associated with each variable (or more generally, with each level on a single dimension, called the value-labelled dimension). Besides allowing arrays to be displayed in a more intuitive format, value-labels and codebooks are used in certain kinds of operations, the most notable of which is the GROUP operation. GROUP was used above to group the wine ratings so that COUNTS could produce a frequency distribution. More generally, GROUP will impose a multi-dimensional structure defined by a categorization matrix on a second array. For example, the following statement groups the wine-tasting data into categories defined by the Sex and Experience of the people:
←(PCTD ← (GROUP PATTRIB@’((1 2)) TD]
[Array 62: Sex=2 Experience=3 Person=3 Wine=4; kept Sex Experience]
The attributes for GROUP are the first two columns of the PATTRIB matrix (the selector for the first dimension is defaulted to ALL). GROUP considers each row of PATTRIB to be the address in a Sex by Experience space into which the corresponding row of TD is to be placed (both arguments must have the same number of rows). Since Ron has Sex = Male and Experience = Expert, his data is placed in the Male-Expert cell of the classification space. The array resulting from the GROUP has 4 dimensions, 2 resulting from the attributes, and the original 2 from TD. The variable (column) labels of the attributes become the labels for the first 2 dimensions. The number of levels and the level labels for those dimensions are determined from the variables’ codebooks. The print-name of the resulting array indicates not only the shape, but also the fact that the dimensions resulting from the attributes are implicitly kept. This means that subsequent operations will apply within the levels of those dimensions.
The grouped array has 72 (= 2 x 3 x 3 x 4) cells, whereas there were only 40 observations in the data. The reason is that the GROUP allocates space for the largest number of rows found in any of the conditions; the other conditions, into which fewer rows fall, will be filled with missing data values (NIL). The user issues a SHOW command to display the array, this time giving an argument that specifies the event on Interlisp’s history list whose value he wishes to see:
←SHOW GROUP

Group of The Definitive Wine Tasting by
Selected Variables from People attributes

Kept: Sex Experience

Sex = Male
Experience = None
Wine
Person Canyon Heights L’Effete Pallide
1 -1 1 2 5
2 NIL NIL NIL NIL
3 NIL NIL NIL NIL


Sex = Male
Experience = Some
Wine
Person Canyon Heights L’Effete Pallide
1 2 -1 -4 3
2 -10 -9 9 10
3 0 4 2 4


Sex = Male
Experience = Expert
Wine
Person Canyon Heights L’Effete Pallide
1 -2 4 0 4
2 -6 5 6 -3
3 NIL NIL NIL NIL


Sex = Female
Experience = None
Wine
Person Canyon Heights L’Effete Pallide
1 5 4 5 5
2 5 4 -4 3
3 NIL NIL NIL NIL


Sex = Female
Experience = Some
Wine
Person Canyon Heights L’Effete Pallide
1 5 -2 3 6
2 NIL NIL NIL NIL
3 NIL NIL NIL NIL


Sex = Female
Experience = Expert
Wine
Person Canyon Heights L’Effete Pallide
1 4 -2 4 -5
2 NIL NIL NIL NIL
3 NIL NIL NIL NIL

[Array 62: Sex=2 Experience=3 Person=3 Wine=4; kept Sex Experience]
Up to now, the user has worked with only vectors or matrices, which print as simple 1- or 2-dimensional panels on the page. Higher-dimensional arrays print as a sequence of 2-dimensional panels. Each panel is prefaced with a description of the levels on the leading dimensions that the panel represents. The panel itself has the labels for the last two dimensions. For the classified array, there are six panels, corresponding to the 2 x 3 grouping he imposed on the TD matrix.
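The classification GROUP performs can be sketched in Python (an outside illustration, not IDL's mechanism): each person's (Sex, Experience) pair addresses a cell, the person's ratings row is appended to that cell, and every cell is padded to the depth of the fullest cell with missing-data rows, here represented by None.

```python
# Python sketch of GROUP: route each ratings row to the cell addressed by
# its (Sex, Experience) attributes, then pad short cells with missing data.

people = [("Ron", "Male", "Expert"), ("Jeff", "Male", "Some"),
          ("Susan", "Female", "None"), ("Henri", "Male", "Some"),
          ("Kathy", "Female", "Some"), ("Joanne", "Female", "None"),
          ("Bob", "Male", "Expert"), ("Beau", "Male", "Some"),
          ("Fred", "Male", "None"), ("Janet", "Female", "Expert")]
ratings = {"Ron": [-2, 4, 0, 4], "Jeff": [2, -1, -4, 3],
           "Susan": [5, 4, 5, 5], "Henri": [-10, -9, 9, 10],
           "Kathy": [5, -2, 3, 6], "Joanne": [5, 4, -4, 3],
           "Bob": [-6, 5, 6, -3], "Beau": [0, 4, 2, 4],
           "Fred": [-1, 1, 2, 5], "Janet": [4, -2, 4, -5]}

cells = {}
for name, sex, exp in people:
    cells.setdefault((sex, exp), []).append(ratings[name])

# pad every Sex-by-Experience cell to the largest cell's depth
depth = max(len(rows) for rows in cells.values())
for sex in ("Male", "Female"):
    for exp in ("None", "Some", "Expert"):
        rows = cells.setdefault((sex, exp), [])
        rows.extend([[None] * 4 for _ in range(depth - len(rows))])
```

The Male/Some cell holds three rows (Jeff, Henri, Beau), which fixes the padded depth at 3 and accounts for the 72 cells of the grouped array.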
Having grouped his tasting data, he can now ask how wine ratings varied as a function of the Sex and Experience of the taster. Perhaps Females liked certain wines but not others, or maybe some wines were more attractive to palates with more Experience. The user explores this by taking the MOMENTS of the grouped data, KEEPing the Sex and Experience dimensions. Earlier he used the function KEEP to mark which dimensions of an array he wanted to be preserved. That explicit marking is not necessary here, since the output of GROUP is automatically marked so that the classification dimensions will be kept, on the theory that the user wouldn’t want his carefully constructed group structure to be completely ignored by the very next operation. This being the case, he obtains a Sex by Experience MOMENTS table by requesting:
←(MOMENTS PCTD]
[Array 65: Sex=2 Experience=3 Moment=3]
←SHOW

Moments of Group of The Definitive
Wine Tasting by Selected Variables
from People attributes keeping Sex
and Experience

Sex = Male
Moment
Experience
N Mean Variance
None 4.000 1.750 6.250
Some 12.000 .833 38.152
Expert 8.000 1.000 19.143


Sex = Female
Moment
Experience
N Mean Variance
None 8.000 3.375 9.411
Some 4.000 3.000 12.667
Expert 4.000 .250 20.250

[Array 65: Sex=2 Experience=3 Moment=3]
There is no clear pattern in the Means, other than that there is quite a bit of variation across the different classifiers. The user can get a better assessment of the situation by calling ANOVA on this table to produce a 2-way analysis of variance:
←(PPA (ANOVA IT]
Anova of Moments of Group of The Definitive Wine
Tasting by
Selected Variables from People attributes
keeping Sex and Experience

Harmonic mean of cell N’s: 5.538

Column
Source SumSq df MS F p
Gnd-mean 96.194 1.000 96.194 4.437 .043
Sex 8.540 1.000 8.540 .394 .534
Experien 21.561 2.000 10.780 .497 .613
S*E 13.330 2.000 6.665 .307 .737
Error 737.042 34.000 21.678 NIL NIL

[Array 66: Source=5 Column=5]
The relatively high p values confirm the intuition that there are no strong effects in this data. If there had been a significant main effect for Experience, say, the user might have then computed the MOMENTS for the Experience grouping alone to see where the differences were. To do this, he would need to eliminate one of the two implicitly kept dimensions of PCTD. The function LEAVE, described in Section 2.3, is provided for just this purpose.
GROUP, in conjunction with MOMENTS and ANOVA, is a solid foundation for examining the group differences on a measure or set of measures. The user is also interested in the relationship between a person’s age and wine-ratings. He wonders, for example, whether older people give consistently higher ratings than younger people. He could answer this question by placing the people in a small number of age-groups, say, Young (below 25), Middle (between 25 and 34), and Old (above 35), and then applying the techniques he used above. But combining people of different ages into a single group throws away some of the information present in the data, so the user tries a more powerful analysis. He decides to find out how well wine-ratings can be predicted as a linear function of Age, using a linear regression technique. Initially, he cares more about ratings in general than about ratings on particular wines, so he collapses across the Wine dimension to obtain an average rating for each person. There are many ways of doing this. Here, the user takes the MOMENTS across Person, and selects the mean out of the result.
←(AVRATING ← (MOMENTS (KEEP TD ’Person))@’(Mean]
[Array 71: Person=10]
←SHOW

Mean of The Definitive Wine Tasting keeping Person
Person
Ron Jeff Susan Henri Kathy Joanne
1.500 0.000 4.750 .000 3.000 2.000

Person
Bob Beau Fred Janet
.500 2.500 1.750 .250

[Array 71: Person=10]
The function COVAR, which is the starting point for regressions in IDL, produces a covariation matrix for a single matrix of data. Since the user wants to include Age and the average rating in a single regression, he must first combine them into a single matrix. He does this with the function ADJOIN, which in its simplest usage joins a set of vectors together end to end. If higher-dimensional objects are given as arguments, IDL’s function extension mechanism is invoked to slice the larger objects down to the vectors that ADJOIN knows how to deal with.
In the present situation, the user wants to think of PATTRIB as a collection of variable-vectors, one for each person. In other words, he wants the Person dimension kept in the result of the joining. Since Person is the first dimension, the default rule for function extension will extend along Variable, just what the user wants, without further specification. AVRATING is a little more complicated. It is a vector already, so the default rule would not break it down any further. Thus, (ADJOIN PATTRIB AVRATING) would simply add the average-rating vector as a series of 10 new variables at the end of each row in PATTRIB. This is clearly not what the user intends. The desired effect is to pair the first element of AVRATING with the first row of PATTRIB, the second element with the second row, and so on. In other words, the user wants to think of AVRATING as a sequence of 10 scalars, not a vector of length 10. He can get the desired effect by KEEPing the first (and only) dimension of AVRATING:
←(PVARS ← (ADJOIN PATTRIB (KEEP AVRATING 1]
[Array 86: Person=10 Variable=4]
←SHOW

Adjoin of People attributes and
Mean of The
Definitive Wine Tasting keeping Person
keeping Person

Variable
Person Sex Experienc Age 4
Ron Male Expert 31.000 1.500
Jeff Male Some 38.000 0.000
Susan Female None 31.000 4.750
Henri Male Some 23.000 .000
Kathy Female Some 26.000 3.000
Joanne Female None 32.000 2.000
Bob Male Expert 42.000 .500
Beau Male Some 29.000 2.500
Fred Male None 27.000 1.750
Janet Female Expert 33.000 .250

[Array 86: Person=10 Variable=4]
The average ratings appear as a fourth variable in the new matrix, bound to PVARS. The title and labels for PVARS are not quite appropriate, however, and the user decides to change them. He first adds a variable name for the fourth column, and then he adds a more descriptive title, using IDL’s LABEL and TITLE selectors.
←(PVARS@(LABEL ’Variable 4) ← ’Avrating]
Avrating
←(PVARS@(TITLE) ← "Attributes + Average wine rating"]
"Attributes + Average wine rating"
←(PPA PVARS)

Attributes + Average wine rating

Variable
Person Sex Experienc Age Avrating
Ron Male Expert 31.000 1.500
Jeff Male Some 38.000 0.000
Susan Female None 31.000 4.750
Henri Male Some 23.000 .000
Kathy Female Some 26.000 3.000
Joanne Female None 32.000 2.000
Bob Male Expert 42.000 .500
Beau Male Some 29.000 2.500
Fred Male None 27.000 1.750
Janet Female Expert 33.000 .250

[Array 86: Person=10 Variable=4]
Having gone to the trouble of constructing this matrix, the user decides to save it on a file so that it can be loaded in for future analyses. He does this with Interlisp’s ordinary file-making function, MAKEFILE. First he specifies that the file is to contain the matrix bound to PVARS, by setting the variable NEWVARCOMS to an IDLARRAYS file command. With that done, he makes the file named NEWVAR.
←SETQQ(NEWVARCOMS ((IDLARRAYS PVARS]
((IDLARRAYS PVARS))
←MAKEFILE(NEWVAR]
<KAPLAN>NEWVAR.;1
He is now ready to examine the covariation structure of PVARS. He applies COVAR to a selection from PVARS, in order to exclude the variable Sex, for which this type of analysis is inappropriate. He includes the categorized variable Experience, because there is some notion of ordering in its groupings.
←(C ← (COVAR PVARS@’((Experience Age Avrating]
[Array 88: Variable=4 Variable=4]
←SHOW

Covariations of
Selected Variables from
Attributes + Average wine rating

Variable
Variable Experienc Age Avrating Constant
Experien 6.000
Age 16.000 283.600
Avrating -6.250 -22.250 21.031
Constant 2.000 31.200 1.625 -.100

[Array 88: Variable=4 Variable=4]
The covariation matrix entries are expressed in raw score terms. It is easier to gauge the size of the relations when they are scaled with respect to the units of measurement and the amount of variance. This is done, as before, by the NORM operator.
←(PPA (NORM IT]
Norm of Covariations of
Selected
Variables from
Attributes + Average
wine rating

Variable
Variable Experienc Age Avrating
Experien 1.000
Age .388 1.000
Avrating -.556 -.288 1.000

[Array 89: Variable=3 Variable=3]
This correlation matrix shows some interesting patterns, but before exploring them, the user remembers that he has requested the sequence NORM of COVAR to produce correlation matrices before, and realizes that he is likely to request it again. To make it more convenient, he packages them together into a single operation, with Interlisp’s standard function-defining function, DEFINEQ. He associates the name CORR with the new operation, and tests it by replicating the matrix he computed above.
←DEFINEQ((CORR (M)(NORM (COVAR M]
(CORR)

←(PPA (CORR PVARS@’((Experience Age Avrating]
Norm of Covariations of
Selected
Variables from
Attributes + Average
wine rating

Variable
Variable Experienc Age Avrating
Experien 1.000
Age .388 1.000
Avrating -.556 -.288 1.000

[Array 92: Variable=3 Variable=3]
The correlation matrix shows a fairly high negative correlation (-.556) between Experience and Avrating, suggesting that these particular wines did not appeal to sophisticated drinkers. On the other hand, Experience is correlated positively with Age, which seems reasonable, and Age is also negatively correlated with Avrating. Perhaps the high Experience/Avrating correlation is an artifact of the relationship that both those variables have to Age. To determine whether this is the case, the user wants to remove (or SWEEP out) the variation associated with Age from the matrix.
←(SWEEP C ’(Age]
[Array 93: Variable=4 Variable=4]
←SHOW

Sweep of Crossproducts of
Selected Variables
from Attributes + Average wine rating, Swept
out: Age Constant

Variable
Variable Experienc Age Avrating Constant
Experien 5.097
Age .056 -.004
Avrating -4.995 -.078 19.286
Constant .240 .110 4.073 -3.532

[Array 93: Variable=4 Variable=4]
The swept matrix gives the covariation space with the variance on Age removed. Many useful pieces of information can be read off this matrix. For example, NORMing it will give the partial correlations between the (two) variables whose variance remains.
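IDL's SWEEP appears to be the standard sweep operator of regression computation. The conventions below (pivot diagonal becomes -1/d, pivot row and column are divided by d, everything else is reduced) are inferred from the printed matrices, not taken from a definition in the text. A Python sketch reproduces Array 93; the Avrating diagonal is entered unrounded (21.03125, the exact corrected sum of squares from the data) so the results round the same way as the display:

```python
def sweep(a, k):
    """Sweep a symmetric matrix on pivot row/column k.

    Sign conventions inferred from the transcript's printed matrices.
    """
    n, d = len(a), a[k][k]
    b = [row[:] for row in a]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                b[i][j] = -1.0 / d                      # pivot diagonal
            elif i == k or j == k:
                b[i][j] = a[i][j] / d                   # pivot row/column
            else:
                b[i][j] = a[i][j] - a[i][k] * a[k][j] / d
    return b

# Array 88 (order: Experience, Age, Avrating, Constant), with the
# Avrating diagonal unrounded.
C = [[ 6.0,    16.0,    -6.25,     2.0],
     [16.0,   283.6,   -22.25,    31.2],
     [-6.25,  -22.25,   21.03125,  1.625],
     [ 2.0,    31.2,     1.625,   -0.1]]

S1 = sweep(C, 1)              # sweep out Age (index 1)
print(round(S1[0][0], 3))     # 5.097   Experience diagonal
print(round(S1[2][0], 3))     # -4.995  Avrating/Experience
print(round(S1[2][2], 3))     # 19.286  Avrating diagonal
print(round(S1[1][1], 3))     # -0.004  Age diagonal, now -1/283.6
```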
←(PPA (NORM IT]
Norm of Sweep of Crossproducts
of
Selected Variables from
Attributes + Average wine rating,
Swept out: Age Constant

Variable
Variable Experienc Avrating
Experien 1.000
Avrating -.504 1.000

[Array 94: Variable=2 Variable=2]
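The partial correlation can be checked by hand from the printed entries of Array 93: NORMing the swept matrix divides the remaining covariation by the square roots of the remaining diagonals. A sketch in Python rather than IDL:

```python
import math

# Entries of Array 93 (Age already swept out).
s_ee = 5.097    # Experience diagonal
s_ea = -4.995   # Experience/Avrating covariation
s_aa = 19.286   # Avrating diagonal

# Partial correlation of Experience and Avrating with Age held constant.
partial_r = s_ea / math.sqrt(s_ee * s_aa)
print(round(partial_r, 3))  # -0.504
```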
The correlation between Experience and Avrating is reduced (the shared variance drops from about 31% to 25%), but a substantial relationship still exists. Therefore, the user decides that he will get a better prediction for Avrating if he also removes Experience. He retrieves the output of the previous SWEEP using the Interlisp VALUEOF function, and SWEEPs out Experience. This has the same effect as if both variables had been swept out together.
←(S ← (SWEEP (VALUEOF SWEEP) ’Experience]
[Array 95: Variable=4 Variable=4]
←SHOW

Sweep of Crossproducts of
Selected Variables
from
Attributes + Average wine rating, Swept
out: Age Constant Experience

Variable
Variable Experienc Age Avrating Constant
Experien -.196
Age .011 -.004
Avrating -.980 -.023 14.391
Constant .047 .107 4.308 -3.544

[Array 95: Variable=4 Variable=4]
The new matrix describes the variable space with both Age and Experience removed. As the user is interested in the regression (linear prediction) of the average rating using these two variables, he needs to extract the regression coefficients for Avrating on Age and Experience. There are three of these, coefficients for Age, Experience, and the Constant term, and they are found in the Avrating row (or column) of the matrix, where it meets the corresponding column (or row). For example, the regression coefficient for Age is -.023. The coefficients could be read out of the matrix and typed in to form a linear equation, but it is generally preferable to extract the coefficients directly from the matrix, as they are represented there with far greater precision than appears in the display.
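Continuing the earlier sketch (Python rather than IDL, with the same inferred sweep conventions), sweeping Age and then Experience gives the same matrix as sweeping in the other order, and the Avrating row of the result holds exactly the regression coefficients described here:

```python
def sweep(a, k):
    """Sweep a symmetric matrix on pivot row/column k (sign conventions
    inferred from the transcript's printed matrices)."""
    n, d = len(a), a[k][k]
    b = [row[:] for row in a]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                b[i][j] = -1.0 / d
            elif i == k or j == k:
                b[i][j] = a[i][j] / d
            else:
                b[i][j] = a[i][j] - a[i][k] * a[k][j] / d
    return b

# Array 88 (order: Experience, Age, Avrating, Constant), with the
# Avrating diagonal unrounded.
C = [[ 6.0,    16.0,    -6.25,     2.0],
     [16.0,   283.6,   -22.25,    31.2],
     [-6.25,  -22.25,   21.03125,  1.625],
     [ 2.0,    31.2,     1.625,   -0.1]]

S = sweep(sweep(C, 1), 0)     # Age, then Experience
T = sweep(sweep(C, 0), 1)     # Experience, then Age
diff = max(abs(S[i][j] - T[i][j]) for i in range(4) for j in range(4))
print(diff < 1e-9)            # True: the sweep order does not matter

# The Avrating row now holds the regression coefficients.
print(round(S[2][0], 3))      # -0.980  coefficient for Experience
print(round(S[2][1], 3))      # -0.023  coefficient for Age
print(round(S[2][3], 3))      # 4.308   Constant term
```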
The extraction and the computation of the predicted values could be done by selecting the coefficients (e.g. S@’(Avrating Age) for Age) and using them with selections on PVARS in a linear equation involving the operators * and +. This works because the arithmetic operators in IDL have been extended so that they apply to corresponding elements of the various selections to produce a vector result. However, the user realizes that the desired result is equivalent to a matrix product that can be computed in a single step, by means of the MPROD operator. The arguments to MPROD are the matrix of independent variables, formed by joining the Constant 1 to the end of each row in PVARS, and the vector of coefficients selected from the Avrating column of the swept matrix S:
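The elementwise extension of the arithmetic operators is easy to picture in Python terms (the coefficient and predictor values below are hypothetical, for illustration only, not values from the session):

```python
# Hypothetical predictor vector and coefficients, for illustration only.
xs = [1.0, 2.0, 3.0]
b, b0 = -0.5, 2.0

# Applying * and + across corresponding elements of a selection yields a
# vector result, which is what IDL's extended arithmetic operators do.
pred = [b * x + b0 for x in xs]
print(pred)  # [1.5, 1.0, 0.5]
```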
←[PV ← (MPROD (ADJOIN PVARS@’((Age Experience)) 1)
S@’((Age Experience Constant)(Avrating]
[Array 159: Person=10 Variable=1]
←SHOW

Matrix product of Adjoin of
Selected Variables
from Attributes + Average wine rating and 1
and Selection of Sweep of Crossproducts
of
Selected Variables from Attributes + Average
wine rating, Swept out: Age Constant Experience

Variable
Person Avrating
Ron .650
Jeff 1.467
Susan 2.610
Henri 1.815
Kathy 1.746
Joanne 2.586
Bob .395
Beau 1.676
Fred 2.702
Janet .603

[Array 159: Person=10 Variable=1]
This regression accounts for ~30% of the variance on Avrating (1 minus the ratio of the Avrating diagonals after and before the SWEEPs). The rest of the variance may be either random (i.e. accounted for by unmeasured variables) or a reflection of a non-linear relationship between these variables. To find out which, the user decides to examine the residuals (the differences between the actual and predicted values) to see if there is a pattern to them. They can be specified simply by typing in the definition of "residual" and, since eyeballing numbers is of limited utility in detecting trends, the user immediately plots them against something that will reveal the trend he seeks. In this case, as he is worried about possible non-linear relationships between his variables, he plots the residuals against the predicted values. If there are, say, quadratic effects, the plot of residuals will show a quadratic form.
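Both figures used here can be checked from the printed arrays. A Python sketch: the per-person averages are computed from TASTING.DATA, and the predicted values are copied from Array 159.

```python
# Avrating diagonals before (Array 88) and after (Array 95) the SWEEPs.
before, after = 21.031, 14.391
r_squared = 1 - after / before
print(round(r_squared, 3))   # 0.316, the "~30%" quoted in the text

# Residuals: actual per-person averages minus the predicted values.
actual = [1.5, 0.0, 4.75, 0.0, 3.0, 2.0, 0.5, 2.5, 1.75, 0.25]
predicted = [0.650, 1.467, 2.610, 1.815, 1.746, 2.586,
             0.395, 1.676, 2.702, 0.603]
residuals = [a - p for a, p in zip(actual, predicted)]

# Their sum of squares matches the swept Avrating diagonal
# (up to the display's rounding of the predicted values).
ss_resid = sum(r * r for r in residuals)
print(round(ss_resid, 2))    # 14.39
```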
←(PLOT AVRATING-PV PV)

Plot of Variable Avrating from Difference of Mean of The
Definitive Wine Tasting keeping Person and Matrix
product of Adjoin of Selected Variables from Attributes
+ Average wine rating and 1 and Selection from Sweep of
Crossproducts of Selected Variables from Attributes +
Average wine rating, Swept out: Age Constant Experience
and Variable Avrating from Matrix product of Adjoin of
Selected Variables from Attributes + Average wine rating
and 1 and Selection from Sweep of Crossproducts of
Selected Variables from Attributes + Average wine rating
Swept out: Age Constant Experience


2.2 | *
|
|
|
|
| *
|
| * *
|
|
.2 | *
|
|
| *
| *
|
| *
|
| *
|
-1.8 | *
:-----------------------:----------------------:
.35 1.55 2.7

X-axis plotted in increments of .05
Y-axis plotted in increments of .2
NIL

In this case, the residuals show no pronounced trends with respect to the predicted values (although the discrete Experience classification is clearly visible in the clustering), so the user decides that the remaining variance is due to other variables which he has not observed. He therefore terminates his analysis session to go gather data on them.