In time series applications, correlations among variables are often above .99. But even these high correlations are no problem for modern algorithms. In my own tests of SYSTAT, it calculated regression slopes accurate to 5 significant digits even though 10 independent variables in a regression all correlated with each other at the level of .99999999999 (that's eleven 9's), and calculated slopes accurate to 7 significant digits when the correlations were .999999999 (nine 9's).
As is well known, collinearity does increase the standard errors of individual regression weights. It thus seems obvious that collinearity should damage regression's predictive power, since after all errors in regression weights are what cause the actual predictive power of a sample regression to fall below the predictive power of a hypothetical regression using population weights.
However, the previous paragraph ignores the fact that errors in individual weights can cancel each other out. Consider for instance a simple case in which two regressors X1 and X2 correlate highly positively. It can be shown that if you draw a sample in which the regression weight b(X1) happens to overestimate its true value, then in that same sample b(X2) will very likely underestimate its true value. But since X1 and X2 are highly correlated, these two errors will tend to cancel each other out. In the extreme case in which X1 and X2 correlated .999 with equal standard deviations, an overestimation of b(X1) by 5 would be almost completely corrected (for purposes of predicting the criterion variable Y) by an underestimation of b(X2) by 5. Therefore collinearity does lead to an increase in errors in individual regression weights, but it also leads to an increase in the degree to which those errors cancel each other out. The net effect is that Y is estimated just as accurately, even in new samples, when collinearity is high as when it is low. The same is true for any number of predictor variables. As we shall see, this fact is central to regression's ability to handle time-series problems, where collinearity is often extreme.
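The cancellation argument can be checked with a small simulation. The Python sketch below is only an illustration under assumed settings (two standardized regressors, population weights of 1 and 1, unit error variance, 50 training cases); the function name `simulate` and all of its parameters are hypothetical. It compares the sampling spread of b(X1) and the out-of-sample prediction error when the two regressors correlate 0 versus .999.

```python
import numpy as np

def simulate(rho, n_train=50, n_test=2000, reps=300, seed=0):
    """Return (SD of b(X1) across samples, mean out-of-sample RMSE)."""
    rng = np.random.default_rng(seed)
    true_b = np.array([1.0, 1.0])  # assumed population weights

    def draw(n):
        # Build two unit-variance regressors that correlate rho.
        z = rng.standard_normal((n, 2))
        x1 = z[:, 0]
        x2 = rho * z[:, 0] + np.sqrt(1 - rho**2) * z[:, 1]
        X = np.column_stack([x1, x2])
        y = X @ true_b + rng.standard_normal(n)
        return X, y

    b1_hats, rmses = [], []
    for _ in range(reps):
        X, y = draw(n_train)
        A = np.column_stack([np.ones(n_train), X])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        X_new, y_new = draw(n_test)          # fresh sample for prediction
        pred = coef[0] + X_new @ coef[1:]
        b1_hats.append(coef[1])
        rmses.append(np.sqrt(np.mean((y_new - pred) ** 2)))
    return np.std(b1_hats), np.mean(rmses)

sd_low, rmse_low = simulate(rho=0.0)
sd_high, rmse_high = simulate(rho=0.999)
print("SD of b(X1):   rho=0:", sd_low, "  rho=.999:", sd_high)
print("new-sample RMSE: rho=0:", rmse_low, "  rho=.999:", rmse_high)
```

Under these assumptions the spread of b(X1) should be many times larger in the collinear case while the prediction error in new samples is essentially unchanged. Because both runs reuse the same seed, they share the same underlying random draws, which makes the comparison especially direct.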
In time series applications, Box and Jenkins (1976) emphasize the "central role" (p. 17) of parsimony, arguing especially that the accuracy of forecasts will be degraded by the inclusion of unnecessary terms in a forecasting model. That assumption pervades most uses of ARIMA.
To illustrate my belief that the importance of parsimony in time series analysis has been overstated, I repeated the following experiment 10,000 times. I constructed a time series of 100 points, using a model with just one parameter. The first entry in the series was a standard normal deviate, and each successive entry was an independent standard normal deviate plus .2 times the preceding entry. In ARIMA terminology this is an AR1 series with phi = .2. Since each entry was constructed using just the one preceding entry plus a random deviate, a first-order autoregression is in fact the proper model to use in fitting this data. I fitted this model to the first 99 of the 100 points, used that model to forecast the 100th point, and recorded the forecasting error. I also fitted to the same series a fifth-order autoregression, which in ARIMA would be considered extravagantly unparsimonious.
Across the 10,000 repetitions of this experiment, the root mean squared forecasting error was only 2.2% higher for the second model than for the first. I got exactly the same result when I set phi = .5. This experiment was actually biased against unparsimonious models because of the unrealistic "step function" I used in defining the true contributions of the lagged variables. That is, if I had let the lagged variables B1, B2, B3,... taper off gradually in their true contributions, the advantage of emphasizing parsimony would be even smaller.
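An experiment of this kind can be sketched in a few lines of Python. This is not the original code: it assumes ordinary least squares with an intercept for both autoregressions, uses fewer repetitions to keep the run short, and the names `ar_forecast` and `rmse_ratio` are hypothetical.

```python
import numpy as np

def ar_forecast(series, order):
    """Fit an autoregression (with intercept) by OLS and forecast the next value."""
    n = len(series)
    # Column k holds the value k steps before each target observation.
    lags = np.column_stack([series[order - k: n - k] for k in range(1, order + 1)])
    A = np.column_stack([np.ones(n - order), lags])
    coef, *_ = np.linalg.lstsq(A, series[order:], rcond=None)
    latest = series[::-1][:order]          # most recent value first
    return coef[0] + latest @ coef[1:]

def rmse_ratio(phi=0.2, reps=2000, n=100, seed=0):
    """RMSE of fifth-order forecasts divided by RMSE of first-order forecasts."""
    rng = np.random.default_rng(seed)
    err1, err5 = [], []
    for _ in range(reps):
        e = rng.standard_normal(n)
        x = np.empty(n)
        x[0] = e[0]
        for t in range(1, n):
            x[t] = phi * x[t - 1] + e[t]   # AR1 series with the given phi
        err1.append(x[-1] - ar_forecast(x[:-1], 1))
        err5.append(x[-1] - ar_forecast(x[:-1], 5))
    rmse = lambda errs: np.sqrt(np.mean(np.square(errs)))
    return rmse(err5) / rmse(err1)

ratio = rmse_ratio()
print("RMSE(AR5) / RMSE(AR1) =", ratio)
```

With only 2,000 repetitions the ratio is noisier than in a 10,000-repetition run, but it should land within a few percent of 1.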
The value of 2.2% I observed here is almost exactly the value of 2.1% predicted by a formula used in ordinary regression, which states that under multivariate normality the root mean squared error in forecasting new observations is proportional to 1/sqrt(N-P-1), where N is the sample size (99 in this example) and P is the number of parameters in the model (1 or 5). For more details see Darlington (1990, p. 164) or especially Darlington (1968, pp. 173-174). [In my notation, P is what Darlington (1990) called P+1 and what Darlington (1968) called n+1.]
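The 2.1% figure follows directly from the expression 1/sqrt(N-P-1). The short Python check below simply evaluates the ratio of that expression for P = 5 and P = 1; the function name `predicted_rmse_ratio` is illustrative.

```python
import math

def predicted_rmse_ratio(n, p_small, p_large):
    """Ratio of 1/sqrt(N-P-1) for the larger model to that for the smaller model."""
    return math.sqrt((n - p_small - 1) / (n - p_large - 1))

ratio = predicted_rmse_ratio(n=99, p_small=1, p_large=5)   # sqrt(97/93)
print(f"predicted increase in RMSE: {100 * (ratio - 1):.1f}%")   # 2.1%
```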
Further analysis using the expression 1/sqrt(N-P-1) shows that in forecasting contexts, a reasonable rule of thumb is that it pays to include an extra term in a regression if the t for that term exceeds sqrt(2) or about 1.41 in absolute value. Interestingly, 1.41 is about halfway between the value of about .8 which is the expected value of |t| under the null hypothesis, and the t of about 1.96 required for significance at the 5% level.
Differencing a series means replacing each entry in the series by its difference from the preceding entry. In turn replacing these differences by the differences between them yields the second-order differences of the original series. After differencing the ARIMA analyst typically applies autoregression (or a variant named MA to be explained later) to the differenced series, and uses the results of that analysis to forecast the next difference. If first-order differences were used, the forecast of the next difference is then added to the last entry in the original series to forecast the next observation in the series. If second-order differences were used, the forecast of the next second-order difference is added to the last first-order difference to forecast the next first-order difference, and that forecast is then added to the last observation in the original series to forecast the next entry in the series.
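The bookkeeping of differencing and then correcting for it can be made concrete with a short Python sketch. The function name `integrate_forecast` is hypothetical, and the forecast of the next difference is taken as given rather than produced by any particular autoregression.

```python
import numpy as np

def integrate_forecast(series, next_diff_forecast, d):
    """Convert a forecast of the next d-th order difference into a forecast
    of the next observation in the original series."""
    levels = [np.asarray(series, dtype=float)]
    for _ in range(d):
        levels.append(np.diff(levels[-1]))   # differences of order 1, 2, ..., d
    forecast = next_diff_forecast
    # Work back up: add the last entry of each lower-order series in turn.
    for k in range(d - 1, -1, -1):
        forecast = levels[k][-1] + forecast
    return forecast

squares = [1, 4, 9, 16, 25]        # second-order differences are all 2
next_value = integrate_forecast(squares, next_diff_forecast=2.0, d=2)
print(next_value)                  # 25 + (9 + 2) = 36.0
```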
According to its advocates, there are two reasons to difference a series. First, ARIMA computer programs use iterative methods to estimate autoregressive slopes, and those iterative methods may fail to converge to an answer unless you first difference the series. Second, differencing may produce a model which fits the data better as measured by mean squared error (MSE). Both these situations tend to arise in series exhibiting long-term trends. One or two differencing steps nearly always removes the trends, and the subsequent "integration" (the just-described process of correcting for the differencing) in effect puts the trends back into the forecasts.
However, differencing has a major potential disadvantage. If each series entry contains some purely random fluctuation not shared with either adjoining entry, then first-order differences contain more of that random fluctuation, and second-order differences contain still more. This can degrade the accuracy of the forecasts.
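The size of this effect is easy to quantify for the simplest case. If the random fluctuations are independent with variance s^2, a first-order difference e(t) - e(t-1) has variance 2*s^2, and a second-order difference e(t) - 2*e(t-1) + e(t-2) has variance (1 + 4 + 1)*s^2 = 6*s^2. The Python sketch below checks this numerically, assuming pure standard normal noise.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.standard_normal(200_000)      # purely random fluctuation, variance 1

var0 = noise.var()
var1 = np.diff(noise, n=1).var()          # first-order differences
var2 = np.diff(noise, n=2).var()          # second-order differences

print("variance ratios relative to the raw noise:", var1 / var0, var2 / var0)
```

The two ratios should come out close to 2 and 6.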
I prefer two alternatives to differencing. One works better for long-term trends while the other works better for short-term trends. This section introduces just the long-trend method; the short-term method is introduced in a later section. In the long-term method, polynomial terms are combined with autoregressive terms. Polynomial terms are simply powers of TIME. For instance, a model with three polynomial terms (including the linear term) and two autoregressive terms would include the variables TIME, TIME*TIME, TIME*TIME*TIME, B1, and B2. Actually segment terms, spline terms (Darlington, 1990, 301-305), or other similar methods may be used in place of polynomial terms, but for simplicity I limit the current discussion to polynomial terms.
To compare these two methods--differencing and polynomial terms--I repeated the following experiment 10,000 times. In a sample of 100 time-values running from 1 to 100, I defined
x = time^2/10 + standard normal deviate
I then used the differencing and polynomial methods to forecast the next point in the series. Since time = 101 for this observation, the optimum forecast is 101^2/10 = 1020.1.
Because of the quadratic trend built into this data, two differencings would be required to remove it. Therefore I differenced twice, then used AR1 to forecast the next observation. The 10,000 forecasts made by this approach had a standard deviation of 1.5515 and a mean .3113 below the optimum mean of 1020.1. Because this mean is the mean of 10,000 observations, its standard error is only .0155. Thus the forecasts were surprisingly biased. The bias and standard deviation together produced a root mean square deviation of sqrt(.3113^2 + 1.5515^2) = 1.5823 from the theoretically optimum forecast of 1020.1. An AR2 model (with two autoregressive terms) did somewhat better: its bias was worse (.4840 compared to .3113), but its standard deviation was lower (1.2528 compared to 1.5515), producing a somewhat lower root mean squared error (1.3430 compared to 1.5823).
For the polynomial approach I fitted a regression model using a constant, time, time^2, and the first-order lag term B1. Actually I knew B1 would be useless in this data set, but I left it in because the point of the example was to illustrate the possibility of combining lag terms with polynomial terms. This time the forecasts showed essentially no bias; the mean of the 10,000 forecasts was only .0018 below the correct value of 1020.1 (compared to .4840 for the better-fitting differencing model), and the standard deviation of the forecasts was .3275 (compared to 1.2528). The root mean squared deviation from the optimum forecast of 1020.1 was sqrt(.0018^2 + .3275^2) = .3275, compared to 1.3430 before: a ratio of 4.1 to 1. To estimate the actual forecast error (root mean square deviations from the next observation) rather than deviations from optimum forecasts, we add 1 to the expression within each radical. For differencing we have sqrt(.4840^2 + 1.2528^2 + 1) = 1.6744, while for polynomial terms we have sqrt(.0018^2 + .3275^2 + 1) = 1.0523, still a ratio of 1.6 to 1. These data are artificial, but later I report large disparities between the two methods in their ability to fit real data.
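A comparison along these lines can be sketched in Python. This is an illustrative reconstruction rather than the original program: it assumes ordinary least squares for both fits, an intercept in the AR1 model for the twice-differenced series, and 500 repetitions instead of 10,000; the names `ols_forecast` and `one_trial` are hypothetical.

```python
import numpy as np

def ols_forecast(A, y, a_next):
    """Fit y on the columns of A by least squares; predict for the row a_next."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return a_next @ coef

def one_trial(rng, n=100):
    t = np.arange(1, n + 1, dtype=float)
    x = t**2 / 10 + rng.standard_normal(n)

    # Differencing: difference twice, fit AR1 to the second-order
    # differences, then correct for the differencing.
    d1 = np.diff(x)
    d2 = np.diff(d1)
    A = np.column_stack([np.ones(len(d2) - 1), d2[:-1]])
    d2_next = ols_forecast(A, d2[1:], np.array([1.0, d2[-1]]))
    f_diff = x[-1] + d1[-1] + d2_next

    # Polynomial: regress on a constant, time, time^2, and the lag term B1.
    A = np.column_stack([np.ones(n - 1), t[1:], t[1:] ** 2, x[:-1]])
    a_next = np.array([1.0, n + 1, (n + 1) ** 2, x[-1]])
    f_poly = ols_forecast(A, x[1:], a_next)
    return f_diff, f_poly

rng = np.random.default_rng(0)
forecasts = np.array([one_trial(rng) for _ in range(500)])
optimum = 101**2 / 10
rmse_diff, rmse_poly = np.sqrt(np.mean((forecasts - optimum) ** 2, axis=0))
print("RMS deviation from optimum: differencing", rmse_diff, " polynomial", rmse_poly)
```

The exact figures depend on the seed and on details of the fitting procedure, so they need not reproduce the values reported above, but the polynomial forecasts should deviate from the optimum by a small fraction of what the differencing forecasts do.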
To understand better what differencing is doing and why it performs so poorly, it helps to reexamine the equation for a forecast with double differencing:
forecast = final observation + final first-order difference
+ autoregressive component of forecast
The nonautoregressive component of this equation is always important. That is especially true when the autoregressive component is quite small, as it is in this example but also in many real data sets. Graphically, the nonautoregressive component of this equation consists of fitting a straight line connecting the final two observations of the original series, and extending that line one unit to the right. This fact should show clearly why these forecasts contain both so much bias and so much variance. The variance results from the fact that connecting two adjacent points with a straight line makes the slope of that line so highly dependent on random fluctuations of those two points. The bias comes because, although I differenced twice specifically because that is needed to remove trend from a parabola, in accordance with normal ARIMA practice I corrected for the differencing not with a parabola but with a straight line. It can be shown that for any number k of differencing steps, the "integration" (the process of correcting for the differencing) consists of fitting a polynomial of order k-1 to the final k points of the series, and extending that polynomial one unit to the right. Both the bias and the variance of this process should be apparent.
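The claim about integration and polynomial extension can be verified numerically. The Python sketch below compares the nonautoregressive part of a forecast after k differencing steps (the final observation plus the final differences of orders 1 through k-1) with the one-step extension of a polynomial of order k-1 fitted exactly to the final k points. The function names and the test series are illustrative assumptions.

```python
import numpy as np

def nonautoregressive_part(series, k):
    """Final observation plus final differences of orders 1 through k-1."""
    x = np.asarray(series, dtype=float)
    return sum(np.diff(x, n=j)[-1] for j in range(k))   # n=0 returns x itself

def polynomial_extension(series, k):
    """Fit a polynomial of order k-1 to the final k points; extend it one unit."""
    x = np.asarray(series, dtype=float)[-k:]
    t = np.arange(k)
    coef = np.polyfit(t, x, deg=k - 1)
    return np.polyval(coef, k)

rng = np.random.default_rng(0)
series = rng.standard_normal(20).cumsum()      # any series will do
checks = {k: (nonautoregressive_part(series, k), polynomial_extension(series, k))
          for k in (2, 3, 4)}
for k, (a, b) in checks.items():
    print(k, a, b)
```

For each k the two numbers should agree up to rounding error. With k = 2 this is the straight line through the final two observations.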
Differencing does work well when apparent trends are produced by a random walk or an approximation to a random walk. But a regression including both polynomial and autoregressive terms does about equally well; in a random walk the polynomial terms would contribute little and would probably be dropped from the model, leaving an autoregressive model which yields essentially the same forecasts you get by differencing.
Given that differencing works well with random walks but not with genuine nonlinear long-term trends, it's worth noting that in the classic Box-Jenkins (1976) work that advocates differencing, nearly all the real-data examples concern either stock prices or chemical reactions. In both these types of example it is especially likely that any apparent long-term trends will actually be random walks. As I have mentioned, if long-term trends were the norm in stock prices, then anyone could make money by betting that apparent trends will continue, and that strategy simply doesn't work. And a chemical mixture in the midst of a reaction clearly has no "memory" other than its current state, so that each change will be entirely a function of that current state plus random change--in other words, a random walk. Box and Jenkins' primary other example involved total monthly sales of international airline tickets from 1949 through 1960; aside from annual cycles the log of these sales exhibited essentially linear growth over those years, thus obscuring the need in many applications for terms to measure long-term curvilinear trends. Actually even in this series a quadratic long-term trend is just significant at the .05 level, even with autoregressive and seasonal terms in the model, but was not discovered by Box and Jenkins.
Even assuming we must find an alternative to differencing, why use polynomials? Isn't it well known that polynomials are notoriously unstable when extended beyond the range of the data used in fitting them? There are two answers to this. First, we're making a very modest extension beyond the observed range. Second, it mostly doesn't matter much exactly what method you use to fit curvilinear trends, because as my later real-data examples illustrate, you're usually not dealing with extreme curvilinearity that hits you in the eye. The curvilinearity may be almost invisible even though highly significant. Polynomials are merely the simplest method of fitting such data.
I prefer to think of homeostatic rather than "stationary" time series. In a homeostatic series a trend line exerts this same "magnetic" or homeostatic effect, but the trend line may be a moderately complex curve instead of the flat straight line demanded by stationarity. The trends may be strong or weak, and separately they may be simple as in a straight line or complex as in a higher-order polynomial. Similarly the homeostatic effects may be strong or weak, and separately they may be simple or complex. The trend effects are captured by polynomial terms while the homeostatic effects are captured by autoregressive terms. For either type of effect a strong effect will make the corresponding terms contribute substantially to the model, and the complexity of the effect is measured by the number of separate terms of that type (polynomial or autoregressive) needed to fit the data. If the homeostatic effect is zero as in a random walk, then no polynomial terms will contribute to a model's fit.
When one thinks about real-life problems of the sort mentioned in the opening section, this mixture of trends and autoregressive effects seems to make the most sense. Consider sales of a typical product. Sales may rise over several months, then gradually trend down when a competing product is introduced, then gradually trend up when the original product is improved. These trends may be well measured by a polynomial. In addition, there will be much shorter-term effects of the sort well measured by lag terms; people may shop less for 2 or 3 days during a snowstorm or heat wave or may buy more of a product for a few days after it appears in a popular television show. Thus yesterday's sales will on the average help forecast today's sales, over and above the average for this week or month. I can easily think of similar examples across most or all the areas of application in my opening section.
xhat_i = -SUM(theta^j * x_(i-j))
where xhat_i denotes the estimate of x_i, and the summation is across j from 1 to i-1, so that the subscripted values of x run from x_(i-1) back to x_1. The program requires that 0 > theta > -1, so that the coefficient of x_(i-j) both alternates in sign and declines in absolute value as j increases. The minus sign on the entire expression makes the series start with a positive coefficient for x_(i-1).
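The pattern of coefficients is easiest to see by computing them. The Python sketch below reads the coefficient on the observation j steps back as minus theta raised to the power j, which is an inference from the description of alternating signs and declining absolute values; the function names `ma_weights` and `ma_forecast` are hypothetical.

```python
import numpy as np

def ma_weights(theta, n_terms):
    """Coefficients on the observations 1, 2, ..., n_terms steps back."""
    if not -1 < theta < 0:
        raise ValueError("theta must satisfy 0 > theta > -1")
    j = np.arange(1, n_terms + 1)
    return -(theta ** j)

def ma_forecast(x, theta):
    """Estimate the next value from all earlier observations in x."""
    x = np.asarray(x, dtype=float)
    # Reverse x so that the first weight multiplies the most recent observation.
    return ma_weights(theta, len(x)) @ x[::-1]

weights = ma_weights(theta=-0.5, n_terms=4)
print(weights)                            # 0.5, -0.25, 0.125, -0.0625
estimate = ma_forecast([1.0, 1.0, 1.0], theta=-0.5)
print(estimate)                           # 0.5 - 0.25 + 0.125 = 0.375
```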
I have no serious qualms about moving averages as simple smoothers, but I am unenthusiastic about MA forecasting models for several reasons. First, MA terms in a forecasting model tend to be highly collinear or redundant with autoregressive (AR) terms, so one gains little or nothing by using both. Second, MA models cannot be fitted with an ordinary regression program, and we have seen that there are major advantages to being able to use these programs. Third and most important, MA forecasting models rarely make scientific sense. The next paragraph expands this point.
Despite an extensive search, I have not found in the literature a convincing argument that one should actually expect to find MA processes in real data. Rather, the assertion is often made that MA processes apply when observation x_i is primarily a function of the random shock at time i-1 rather than the value we would have expected at time i-1 in the absence of that random shock. This is just the opposite of more standard scientific models in which any observation x_i is thought of as the sum of a "true score" and a "random error". The true score is by definition unknowable, but we ordinarily assume that if we could measure it exactly, it would fit better into scientific laws than the observed score, regardless of whether it was used as an independent or dependent variable in those laws. But MA models are based on the odd assumption that even though the true score would fit laws best when the variable in question is the dependent variable, nevertheless the random part of X is the part that will fit best when X is used as an independent variable in some law. As I say, I can neither find nor imagine a situation in which that assumption is plausible. It's easy to think of examples that fit the model of an unrepeated random fluctuation: a hurricane produces a temporary spurt in construction, a fright produces a temporary increase in heart rate, and so on. But when could later observations be predicted better from the unpredictable or random component of these measures than from the predictable or nonrandom component of the same measure?
Why then have MA processes been so prominent in time-series analysis for so long? MA processes do work well in forecasting series whose signs alternate, with a positive value following a negative value which follows a positive one, etc. Such a series would usually be rare in nature, since it would be observed only if the period of alternation happens to coincide with the period of observation, and in most natural processes that would be an amazing coincidence. However, differencing produces such alternation; when a random fluctuation makes x_i high, then x_i - x_(i-1) tends to be positive while x_(i+1) - x_i tends to be negative. In a long series of random numbers approximately 2/3 of adjacent differences are opposite in sign. Therefore I suggest that MA processes have appeared so useful mainly because time-series analysts have mostly worked with differenced data, and that MA forecasting be discarded along with differencing.
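The 2/3 figure can be checked by simulation. The Python sketch below assumes a long series of independent standard normal numbers and counts how often two adjacent first-order differences have opposite signs.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)      # a long series of random numbers
d = np.diff(x)                        # first-order differences

# Proportion of adjacent pairs of differences with opposite signs.
opposite = np.mean(np.sign(d[1:]) != np.sign(d[:-1]))
print("proportion of adjacent differences opposite in sign:", opposite)
```

The proportion should come out close to 2/3.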