Richard B. Darlington
Cornell University
ALLSETS is a procedure that can scan all these regressions with surprising speed, computing R2 for each. I argue elsewhere that ALLSETS is useful in identifying collinear sets of variables in a regression.
ALLSETS uses a procedure called SWEEP which can be applied to a single variable in a correlation matrix -- that is, to a single row and matching column in the matrix. By SWEEPing any set of variables in the matrix, one can easily compute the R2 for predicting any unswept variable from the swept variables. For instance, if a matrix had 5 variables and you swept variables 1, 3, and 5, then the second diagonal entry in the swept matrix would be 1 - R2, where R2 is the squared multiple correlation predicting the second variable from variables 1, 3, and 5. The fourth diagonal entry would have a similar meaning for variable 4.
Each sweep changes every entry in the matrix. Sweeping is reversible -- that is, sweeping a variable twice returns the matrix to its original state. Thus to say that a variable "is" currently swept is to say that it has been swept an odd number of times, and to say it "is not" currently swept is to say it has been swept an even number of times.
ALLSETS works with the correlation matrix of P predictor variables plus Y, with Y placed last. By performing 2P - 1 sweeps in that matrix, ALLSETS computes all 2P - 1 nonnull values of R2 for predicting Y. (We don't need to calclulate R2 for the "null" regression with no predictors; we know it's 0.) Then performing one more sweep returns the correlation matrix to its original state, except for any accumulated rounding error.
Because of that last point, the last sweep is recommended, since it allows the user to compare the final matrix with the original one, to see how much rounding error has accumulated. The less has accumulated, the more confidence we can have in the R2 values computed along the way. Rounding error is discussed in more detail below.
Let d(Y) denote the last diagonal entry in the matrix -- the diagonal entry for Y. Then after any number of sweeps, R2 for the set of swept variables is R2 = 1 - d(Y).
The necessary sequence of sweeps is easily described. For P = 4 it is
1 2 1 3 1 2 1 4 1 2 1 3 1 2 1 4
That is, sweep variable 1, then 2, then 1 again, etc.
To construct the list of sweeps for any value of P, proceed as follows:
Variable Variables last currently swept swept 1 1 2 1, 2 1 2 3 2, 3 1 1, 2, 3 2 1, 3 1 3 3 -
As already mentioned, one way to check on the accumulation of rounding error is to compare the matrix after 2P sweeps with the original matrix, computing for instance the root mean square change in the entries. Another way is perhaps more satisfying. This way does take more computer time, though that is rarely a major problem, since the whole procedure is very fast. Starting with the final matrix (the matrix after all 2P sweeps), go through the whole process again, performing 2P more sweeps, and compare the values of R2 computed the second time around to the values computed the first time around. The difference between these two sets of values should presumably exceed the amount of rounding error in the first set of values, because more sweeps occurred between those two values than were used to calculate most of the original values.