Cornell University
Copyright © Richard B. Darlington. All rights reserved.
A second measurement issue stems from the fact that most investigators present each infant with several views of each scene. How should the several looking times be combined? Should they simply be averaged, or is a weighted average better? If so, should the first view receive the most or the least weight? And how unequal should the weights be?
Third, assuming that both compression and averaging will be used, should the compression be done before or after the averaging? In other words, is it best to average and to then compress the average scores, or is it better to compress each individual looking time and then average the compressed scores? Or is some mixture best? That is, might it be best to apply some compression, then average the compressed scores, then further compress the average scores?
I argue in this work that it is in fact possible to choose rationally among this broad array of choices. My answers to the three questions above are:
The central idea behind the present analysis is that the more successfully an analytic method C removes random error from a set of data, the higher r(CACB) will be. That's because the C-scores are influenced both by random error and by the individual infant's nonrandom propensity to look long at stimuli. The more successfully random error is removed from the data, the more important the influence of nonrandom infant propensities will be, and the higher r(CACB) will be.
Of course some of the 44 groups of infants observed more interesting stimuli than others, and the groups differed in age. To control these differences, I combined the 44 groups by first computing variables CA and CB, then adjusting these variables to means of zero within each of the 44 groups, then computing r(CACB) on the adjusted C-scores in the entire set of 504 infants. This essentially controls for group (here denoted G), so the correlation computed this way will be denoted r(CACB .G).
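The group adjustment described above can be sketched in a few lines of Python. This is only an illustration, not the original analysis code: the function names and the tiny data set are invented for the example, and the sketch simply centers each variable within its group before computing an ordinary Pearson correlation.

```python
from math import sqrt

def center_within_groups(scores, groups):
    """Subtract each group's mean from the scores of its members."""
    totals, counts = {}, {}
    for s, g in zip(scores, groups):
        totals[g] = totals.get(g, 0.0) + s
        counts[g] = counts.get(g, 0) + 1
    return [s - totals[g] / counts[g] for s, g in zip(scores, groups)]

def pearson_r(x, y):
    """Ordinary Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def group_adjusted_r(ca, cb, groups):
    """Correlation of CA and CB after adjusting both to mean zero within each group."""
    return pearson_r(center_within_groups(ca, groups),
                     center_within_groups(cb, groups))

# Made-up scores for two groups of three infants each (hypothetical data).
ca = [1, 2, 3, 11, 12, 13]
cb = [2, 1, 3, 12, 11, 13]
groups = [0, 0, 0, 1, 1, 1]

raw_r = pearson_r(ca, cb)                     # inflated by the group difference
adjusted_r = group_adjusted_r(ca, cb, groups)  # group difference removed
```

In this toy example the raw correlation is dominated by the difference between the two group means, while the adjusted correlation reflects only how infants compare with others in their own group.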
It actually doesn't matter whether one uses averages or sums in the "averaging" part of computing CA and CB. For instance, if one summed three scores using weights of .5, .3, and .2, that sum could be called an average because the three weights sum to 1. But those weights would yield the same value of r(CACB .G) as weights of 1, .6, and .4, because the latter weights are in the same ratio as the former. Thus for simplicity I used a weighted sum in which the first weight b1 was fixed at 1 and the latter two weights b2 and b3 were allowed to vary.
Various degrees of compression can be achieved by "raising" scores to various powers W, where W is forced to fall between 0 and 1. For instance, taking the square root of a score is equivalent to "raising" that score to the .5 power, so that is also equivalent to setting W = .5. Similarly, setting W = .25 is equivalent to taking the fourth root of a score. If W is set equal to 1 there is no compression and the original scores are used. The lower W is set, the more compression is achieved.
W cannot be lowered all the way to 0, since any positive number raised to the 0 power is 1, so all scores would become 1. However, lowering W to near zero turns out to achieve essentially the same degree of compression as replacing the original scores by their logarithms. For instance, on a log scale the difference between 50 and 51 is 10.9% of the difference between 5 and 6, and on a scale formed by setting W = .01 the comparable percentage is 11.1%--nearly the same. By setting W even closer to 0, one can simulate with any desired degree of precision the compression characteristics of a log scale. Thus by choosing W somewhere between 0 and 1 one can vary the degree of compression from none at all to the same compression as is achieved in a log scale.
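The percentages quoted above are easy to check numerically. The following short Python sketch (the function names are chosen here for illustration) compares the step from 50 to 51 with the step from 5 to 6 on several scales.

```python
from math import log

def relative_difference(transform, high=50.0, low=5.0, step=1.0):
    """Size of the step high -> high+step relative to low -> low+step
    after both have been passed through the given transformation."""
    return ((transform(high + step) - transform(high))
            / (transform(low + step) - transform(low)))

def power(w):
    """Return the transformation that raises a score to the power w."""
    return lambda x: x ** w

log_ratio = relative_difference(log)             # log scale
near_zero = relative_difference(power(0.01))     # W = .01, close to a log scale
fourth_root = relative_difference(power(0.25))   # W = .25
raw = relative_difference(power(1.0))            # W = 1, no compression

for name, value in [("log", log_ratio), ("W=.01", near_zero),
                    ("W=.25", fourth_root), ("W=1", raw)]:
    print(f"{name:6s} {100 * value:5.1f}%")
```

The printed percentages fall steadily as W is lowered from 1 toward 0, with the W = .01 scale landing very close to the log scale.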
To allow the possibility of compressing a certain amount before averaging, and compressing more after averaging, one can replace the single power W by two powers G and H. One can first "raise" each individual looking time to some power G, then compute a weighted sum across the three views of the same scene, then "raise" that sum to some other power H. For instance, consider the transformation in which G = .5, H = .4, and the three weights for the weighted sum are respectively 1, .8, and .6. Suppose infant Susan has looking times of 25, 30, and 10 seconds for stimulus A. Then for Susan,
CA = (1*25^.5 + .8*30^.5 + .6*10^.5)^.4 = 2.636.
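As a check on this arithmetic, the two-stage transformation can be written as a small Python function. The name composite and its argument names are invented for this sketch; the formula itself is the one just described (raise each looking time to the power G, take the weighted sum, raise the sum to the power H).

```python
def composite(looks, g, h, weights):
    """Raise each looking time to the power g, form the weighted sum,
    then raise that sum to the power h."""
    weighted_sum = sum(b * t ** g for b, t in zip(weights, looks))
    return weighted_sum ** h

# Susan's looking times for stimulus A, with G = .5, H = .4, weights 1, .8, .6.
susan_ca = composite([25, 30, 10], g=0.5, h=0.4, weights=[1, 0.8, 0.6])
print(round(susan_ca, 3))

# Setting G = 1 and all weights to 1 with H = .25 gives the fourth root
# of the plain sum of the three looking times.
fourth_root_of_sum = composite([25, 30, 10], g=1, h=0.25, weights=[1, 1, 1])
```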
Thus the problem is to find the values of G, H, b2 and b3 that maximize r(CACB .G). Once the problem is stated this way, one can solve it with a mathematical method called steepest ascent. This is one of an entire family of methods that mathematicians and engineers call optimization methods. These are extremely general methods for finding the values of parameters that maximize or minimize any reasonable mathematical expression. In principle, multiple regression and principal component analysis could both be done with optimization programs, since the goal is to find weights that will maximize the multiple correlation or factor variance.
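To make the idea concrete, here is a minimal Python sketch of steepest ascent applied to this kind of problem. It is not the program used for the analysis reported here. Everything in it is an assumption made for illustration: the looking times are synthetic (lognormal noise around a made-up infant propensity), there is only one group so no group adjustment is needed, the gradient is estimated by finite differences, and the step length is simply halved until the correlation improves. G and H are kept between .05 and 1, and the weights are kept nonnegative.

```python
import random
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def composite(looks, g, h, b2, b3):
    t1, t2, t3 = looks
    return (t1 ** g + b2 * t2 ** g + b3 * t3 ** g) ** h  # first weight fixed at 1

def make_objective(data_a, data_b):
    def objective(params):
        g, h, b2, b3 = params
        ca = [composite(t, g, h, b2, b3) for t in data_a]
        cb = [composite(t, g, h, b2, b3) for t in data_b]
        return pearson_r(ca, cb)
    return objective

def clip(params):
    """Keep G and H in [.05, 1] and the two free weights nonnegative."""
    g, h, b2, b3 = params
    return [min(max(g, 0.05), 1.0), min(max(h, 0.05), 1.0),
            max(b2, 0.0), max(b3, 0.0)]

def numerical_gradient(f, params, eps=1e-4):
    grad = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def steepest_ascent(f, start, step=0.5, iters=100):
    params = clip(start)
    best = f(params)
    for _ in range(iters):
        grad = numerical_gradient(f, params)
        size = step
        while size > 1e-8:                      # shrink the step until r improves
            trial = clip([p + size * d for p, d in zip(params, grad)])
            value = f(trial)
            if value > best:
                break
            size /= 2
        else:
            break                               # no uphill step found: stop
        params, best = trial, value
    return params, best

# Synthetic data: 200 hypothetical infants, three views of each of two stimuli.
rng = random.Random(0)
propensity = [rng.gauss(2.0, 0.5) for _ in range(200)]
data_a = [[rng.lognormvariate(p, 0.6) for _ in range(3)] for p in propensity]
data_b = [[rng.lognormvariate(p, 0.6) for _ in range(3)] for p in propensity]

objective = make_objective(data_a, data_b)
start = [1.0, 1.0, 1.0, 1.0]                    # G, H, b2, b3: raw sum, no compression
start_r = objective(start)
best_params, best_r = steepest_ascent(objective, start)
print("start r:", round(start_r, 3), "final r:", round(best_r, 3))
```

Because each accepted step must raise the correlation, the final value can never be lower than the starting value. The parameter values found on synthetic data like these say nothing about real looking times; the sketch only shows the mechanics of the search.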
Because three of the four parameters (G, b2, and b3) turned out to be 1, the recommended approach can be stated far more simply. As already mentioned, the recommendation is to use the fourth root of the sum of the individual looking times.
I was also surprised that the first looking time was given the smallest weight of the three, though the difference was insignificant. Local lore had been that it might be best to weight the first looking time more heavily, on the ground that infants began to lose interest by the second and third views of a scene. But it may be that that's precisely why one should give the later views equal weight; we want to see whether infants are sufficiently interested to keep looking during later views.
A third feature of the results fits my prior intuitions more closely. Intuition suggests that some compression of the highest scores is needed. But it also suggests that a log transformation goes too far, for instance treating the difference between 5 and 6 seconds as if it were as large and important as the difference between 50 and 60 seconds. The use of fourth roots matches these intuitions, since it falls between the original scale and the log scale in the degree to which it compresses high scores relative to low scores. For instance, as already mentioned, it makes the difference between 50 and 51 seconds 19% as large as the difference between 5 and 6 seconds, while the comparable percentage for a log scale is 11%.
These latter approaches are designed to minimize sampling error in the final significance tests or confidence bands computed by a typical investigator, while the optimization approach I used is designed to minimize measurement error. One goal of transforming scores is to increase the validity of statistical methods that assume normal distributions, thereby limiting sampling error. But the optimization method I used was not at all designed to assure that distributions would be approximately normal. Rather that method emphasizes the reduction of measurement error as measured by the ability of transformed scores to predict the transformed looking times for other stimuli. By minimizing measurement error one hopes to maximize statistical power, while an approach emphasizing normality or symmetry would emphasize maximizing the validity of statistical methods that assume normal distributions.
To try to transform looking times to normal distributions ignores the question of whether the transformation that best eliminates measurement error does in fact yield a normal distribution. As engineers well know, there are statistical methods that assume highly skewed exponential distributions, or even more highly skewed Weibull distributions. One should not ignore the possibility that the best transformation (which measures the underlying trait with least error) does not yield a distribution that is normal or even symmetric.
However, we would have the best of both worlds if we found that by minimizing measurement error we also produced a scale that is roughly normal, so that we could count on the well-known robustness of parametric statistical methods to protect us from the moderate remaining levels of nonnormality. That does seem to be the case with the composite C recommended here. C's skew value was 1.19, compared to 3.92 for the original 3024 looking times. To present a measure of skew that is more intuitively understandable, I defined the ratio
(90th percentile point - median)/(median - 10th percentile point)
This measure would of course be 1.00 for a symmetric distribution. This ratio is 1.37 for C and 6.10 for raw looking times. Since (6.10 - 1)/(1.37 - 1) = 13.8, by this measure of skew, the skew of C is only about one-fourteenth that of raw looking times. The skew in C is thus noticeable but apparently not high enough to seriously interfere with the validity of standard statistical methods.
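For readers who want to compute this percentile-based skew ratio on their own data, a minimal Python sketch follows. The function name is invented here, and the choice of the "inclusive" percentile method in the standard statistics module is an assumption; other percentile conventions will give slightly different values in small samples.

```python
from statistics import median, quantiles

def percentile_skew_ratio(values):
    """(90th percentile - median) / (median - 10th percentile)."""
    deciles = quantiles(values, n=10, method="inclusive")  # 9 cut points
    p10, p90 = deciles[0], deciles[-1]
    med = median(values)
    return (p90 - med) / (med - p10)

# A symmetric set of made-up scores gives a ratio of 1.
symmetric = list(range(1, 102))
symmetric_ratio = percentile_skew_ratio(symmetric)

# A right-skewed set of made-up scores gives a ratio above 1.
skewed = [x ** 2 for x in range(1, 102)]
skewed_ratio = percentile_skew_ratio(skewed)

# The comparison reported in the text: excess over 1 for raw times vs. C.
excess_ratio = (6.10 - 1) / (1.37 - 1)
print(round(excess_ratio, 1))
```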