Many scientists think of random assignment and statistical control (the use of covariates in linear models) as alternative methods of control. It is well known that random assignment has certain advantages over statistical control; see chapter 4 of my book Regression and Linear Models (hereafter abbreviated RLM). However, there are at least four reasons for using statistical control along with random assignment if the latter is planned. This note outlines those reasons. The first three (control of nonrandom attrition, assessment of indirect effects, and increased power and precision in estimating effect sizes) are well understood by many people. However, I believe the fourth has not been described before.
For instance, consider an experiment in which a randomly assigned half of all subjects are told that a mental test indicates they should be especially good at solving problems of a certain type, and are then found to persist longer in trying to solve such problems, which are in fact impossible. Of course, subjects are told the truth at the end of the experiment. Was their persistence produced (a) by self-confidence, (b) by increased liking of the experimenter who had complimented them, (c) by some other intervening mechanism, or (d) by some combination of these? These possibilities can be distinguished by a regression predicting persistence from the independent variable of treatment condition, plus measures of self-confidence and liking of the experimenter, the proposed mechanisms in this example. Under choice A we expect only self-confidence to be significant, under B only liking, under C only treatment condition, while D covers all possibilities of two or three significant effects. The use of regression in such cases clarifies not so much the presence of the effect as its nature--that is, the intervening variables that mediate the effect. These are actually measures of indirect effects, which are discussed more fully in RLM Section 7.2.
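The logic of such a mediation regression can be sketched in a few lines. The data below are entirely fabricated for illustration (they are not from any real experiment); they are built so that self-confidence fully carries the treatment effect, corresponding to choice A above.

```python
import numpy as np

# Hypothetical data for six subjects (all numbers fabricated for illustration).
# Persistence is driven entirely by self-confidence, which in turn is raised
# by the treatment -- i.e., confidence fully mediates the effect (choice A).
treatment  = np.array([0, 0, 0, 1, 1, 1], dtype=float)  # dummy-coded condition
confidence = np.array([1, 2, 3, 4, 5, 6], dtype=float)  # raised by 3 in treatment group
liking     = np.array([2, 1, 3, 2, 1, 3], dtype=float)  # unrelated to persistence here
persist    = 2.0 * confidence                           # persistence = 2 x confidence

# Regress persistence on treatment condition plus the two proposed mediators.
X = np.column_stack([np.ones(6), treatment, confidence, liking])
beta, *_ = np.linalg.lstsq(X, persist, rcond=None)

# With confidence as the sole mechanism, the treatment dummy and liking carry
# no weight once confidence is in the model.
print(beta)   # roughly [0, 0, 2, 0]
```

As the text predicts for choice A, only the confidence coefficient is nonzero: the treatment's effect on persistence is entirely indirect, routed through confidence.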
The diagonal lines in the figure represent the model fitted by regressing posttest onto pretest and a dummy treatment variable. The upper line shows the predicted posttest scores of the treatment group, while the lower line shows the predicted scores of the control group. Assuming random sampling from a population, you can tell without any real calculation that the differences between treatment and control groups cannot be explained by chance, because every one of the treatment-group cases is closer to the treatment-group line than to the control-group line, while every one of the control-group cases is closer to the control-group line. The inferential formulas of RLM Chapter 5 agree with this intuitive conclusion; when they are used to test the hypothesis of no treatment effect, that hypothesis is rejected at the .0000045 level of significance (t = 8.33, df = 11).
We can also estimate the size of the treatment effect. The vertical distance between the two diagonal lines is 3.0; thus 3.0 is the estimated effect of the treatment on posttest scores, as explained in RLM Section 3.2.3. Again you can see without calculation that the lines in the figure must closely approximate the population lines, because randomly discarding any one case from the sample would hardly change the placement of either line. Therefore the vertical distance of 3.0 between the lines must be an accurate estimate of the treatment effect. Again the formulas in RLM Chapter 5 agree with our intuition; they show a standard error of only .360 for the estimated treatment effect.
But when we ignore regression formulas and use a simple two-sample t test to test the significance of the difference between the two groups, we ignore the information about each case's horizontal placement in Figure 4.1, using only its vertical placement. If this vertical information were all we had, our intuitive noncomputational test would leave us far less certain about the size or even the existence of the treatment effect, since the treatment-group posttest scores range from 6 to 15, and the control-group scores overlap them substantially, ranging from 2 to 12. The two-sample t test has this same limitation. Because treatment and control groups had exactly the same means on pretest, the estimated difference between groups is the same whether or not we control for pretest. The previously estimated effect size of 3.0 was simply the difference between the two sample means, which forms the numerator of the t test. But because the t test ignores pretest scores, that estimate's standard error is 1.79--nearly five times the value of .360 mentioned above. Thus we find a nonsignificant t of 3/1.79 = 1.68, df = 12, p = .12.
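The contrast between the two standard errors can be reproduced in miniature. The data below are fabricated, not the actual Figure 4.1 values, but they are built on the same pattern: identical pretest scores in both groups and a constant treatment effect of 3, so the point estimate is the same under both analyses while the standard errors differ sharply.

```python
import numpy as np

# Fabricated pretest-posttest data (not the actual Figure 4.1 values):
# both groups share the same pretest scores, and the treatment adds a
# constant 3 points to posttest, plus a small fixed error term.
pre_each = np.array([2, 4, 6, 8, 10, 12, 14], dtype=float)
wiggle   = np.array([0.2, -0.1, 0.3, -0.2, 0.1, -0.3, 0.0])

pre   = np.concatenate([pre_each, pre_each])
treat = np.concatenate([np.zeros(7), np.ones(7)])
post  = pre + 3.0 * treat + np.concatenate([wiggle, wiggle])

# Regression controlling for pretest: posttest ~ intercept + treatment + pretest.
X = np.column_stack([np.ones(14), treat, pre])
beta, *_ = np.linalg.lstsq(X, post, rcond=None)
resid  = post - X @ beta
s2     = resid @ resid / (14 - 3)                      # residual variance
se_reg = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])    # SE of treatment coefficient

# Two-sample comparison ignoring pretest: same effect estimate, bigger SE.
diff = post[7:].mean() - post[:7].mean()
sp2  = (post[:7].var(ddof=1) + post[7:].var(ddof=1)) / 2   # pooled variance
se_t = np.sqrt(sp2 * (1 / 7 + 1 / 7))

print(beta[1], diff)    # both estimates equal 3.0
print(se_reg, se_t)     # the covariate-adjusted SE is far smaller
```

As in the text's example, both analyses recover the same effect estimate of 3.0, but the t test's standard error is many times larger because it discards the pretest information.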
In this example the treatment effect was not significant at even the .05 level without statistical control, but much the same point can apply even if it is. If you had to choose between spending thousands of dollars on a treatment that had been demonstrated effective at just the .05 level, or the same money on another treatment whose effectiveness had been demonstrated at the .001 level, which would you choose? Presumably the latter; after all, 50 times as many ineffective treatments pass tests at the .05 level as at the .001 level. Thus investigators should attempt to show the most significant results they validly can.
I do not mean to imply that one always gains power by indiscriminately adding covariates to a model with random assignment. The more strongly a covariate affects the dependent variable Y, the more power is gained from controlling it. But if a covariate has absolutely no effect on Y, one actually loses a little power by adding it to the model. The power lost is the same as that lost by randomly discarding one case from the sample, so the loss is usually small. But even this small loss suggests that one should not indiscriminately add dozens of extra covariates to the model just because they happen to be in the data set. Elsewhere I describe and justify a method for selecting a specific set of covariates. In the method's simplest form, you predict the dependent variable from a broad set of relevant covariates, then drop from the model the covariate with the lowest absolute t. As described there, continue dropping covariates one at a time, recomputing the regression after each deletion, until all remaining covariates have absolute t's of 1.42 or higher. Add the independent variable to the regression only after completing this process. Otherwise the covariates correlating highest with the treatment variable--the very ones it is most important to keep--will tend to be deleted because of their redundancy with the treatment variable.
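The pruning procedure just described can be sketched as follows. The function name, threshold default, and toy data are hypothetical illustrations, not a reference implementation; t statistics come from the usual OLS formulas. Note that, per the rule above, the treatment variable is deliberately left out of the model during pruning.

```python
import numpy as np

def backward_prune(y, covariates, threshold=1.42):
    """Repeatedly refit the regression, dropping the covariate with the
    smallest absolute t, until every remaining covariate has |t| >= threshold.
    `covariates` maps names to 1-D arrays; the intercept is never dropped.
    The treatment variable is added to the model only AFTER pruning."""
    names = list(covariates)
    while names:
        X = np.column_stack([np.ones(len(y))] + [covariates[n] for n in names])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        s2 = resid @ resid / (len(y) - X.shape[1])           # residual variance
        se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))   # coefficient SEs
        t = np.abs(beta[1:] / se[1:])                        # skip the intercept
        if t.min() >= threshold:
            break
        names.pop(int(t.argmin()))                           # delete weakest covariate
    return names

# Toy data: y depends on c1 alone; c2 is pure noise and should be pruned.
c1 = np.arange(1.0, 9.0)
c2 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
y  = 2.0 * c1 + np.array([0.1, -0.1, 0.1, -0.1, -0.1, 0.1, -0.1, 0.1])

print(backward_prune(y, {"c1": c1, "c2": c2}))   # ['c1']
```

Here c2's t statistic is essentially zero, so it is deleted on the first pass, while c1 easily survives the 1.42 threshold.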
Once we agree that in this extreme example there is some doubt about the treatment's effectiveness, we must ask how extreme an example must be to raise similar doubts. Perhaps we should be concerned about all significant differences between treatment groups on covariates, despite the familiar argument (given in RLM Section 4.1.2) against this position. But we can avoid the whole problem by using linear models along with random assignment. The problem arises because we presume that the covariates correlate with the dependent variable in the population, so that if by chance we draw a sample in which the covariates also correlate with the treatment variable, we must presume the sample correlation between the treatment and the dependent variable is at least partly spurious. But as described in RLM Chapter 2, in a linear model with an independent variable X and several covariates, X's sample regression slope can be thought of as the simple regression slope predicting the dependent variable from the portion of X independent of the covariates. This portion of X is exactly uncorrelated with all covariates in the sample studied, not merely in some hypothetical population. That eliminates the problem: X might conceivably correlate highly with the covariates just by chance in the sample studied, even though random assignment assures that this correlation is zero in the population, but a linear model in effect always uses just the portion of X that is independent of the covariates in the sample studied. Even in our extreme example, regression would estimate the treatment effect to be zero, which is the estimate supported by intuition.
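The residualization argument can be verified directly. The sample below is hypothetical: assignment is random, but by bad luck the treatment group drew somewhat higher covariate values, so the treatment dummy and the covariate correlate in the sample. Residualizing the treatment dummy on the covariate removes that chance correlation exactly, in this sample, not merely in expectation.

```python
import numpy as np

# Hypothetical sample: random assignment, but by chance the treatment group
# drew somewhat higher covariate values.
x = np.array([0, 0, 0, 1, 1, 1], dtype=float)   # treatment dummy
z = np.array([1, 2, 3, 2, 3, 4], dtype=float)   # covariate

# In this sample, x and z correlate by chance.
print(np.corrcoef(x, z)[0, 1])                  # clearly nonzero

# Residualize x on the covariate (plus an intercept): this is the portion
# of x independent of z in THIS sample, not merely in the population.
C = np.column_stack([np.ones(6), z])
x_resid = x - C @ np.linalg.lstsq(C, x, rcond=None)[0]

# The residual is uncorrelated with the covariate to machine precision.
print(np.corrcoef(x_resid, z)[0, 1])
```

This is the sense in which a linear model uses "the portion of X independent of the covariates": the chance sample correlation with the covariate is removed exactly, so it cannot contaminate the estimated treatment effect.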