A time series may trend upward or downward, as many economic series do, or may fluctuate around a steady mean, as human body temperature does. A series may contain a single cycle, like the daily cycle of body temperature, or may contain several superimposed cycles. For instance, outdoor temperature usually exhibits both daily and annual cycles, while traffic density usually exhibits daily, weekly, and annual cycles.
We shall concentrate on three major goals of time-series analysis. First, you may want to forecast future values of a time series, using either previous values of just that one series, or values from other series as well. Second, you may want to assess the impact of a single event, such as the effect of a new law on the frequency of drunken driving or the effect of a bridge toll on traffic across neighboring bridges. Third, you may study causal patterns, by which we mean the effects of variables rather than events on a series. This requires two or more time series. For instance, if changes in unemployment consistently precede changes in crime, that might imply that unemployment is one of the causes of crime.
Because of this nonindependence, the true patterns underlying time-series data can be extremely difficult to see by visual inspection. Anyone who has looked at a typical newspaper chart of stock-market averages sees trends that seem to go on for weeks or months. But statisticians who have studied the subject agree that such trends occur with essentially the same frequency one would expect by chance, and there is virtually no correlation between one day's stock-market movement and the next day's movement. If there were such a correlation, anybody could make money in the stock market simply by betting that today's trend will continue tomorrow, and it's simply not that easy. In fact, cumulating nearly any series of random numbers will yield a pattern that looks nonrandom.
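The last point can be checked with a small simulation. The sketch below is illustrative only: it assumes normally distributed random "daily movements" with a fixed seed, and the helper name `lag1_correlation` is our own. It cumulates the movements into a level series and compares the correlation between successive movements with the correlation between successive levels.

```python
import random

def lag1_correlation(values):
    """Pearson correlation between each value and the one before it."""
    current, previous = values[1:], values[:-1]
    n = len(current)
    mean_c = sum(current) / n
    mean_p = sum(previous) / n
    cov = sum((c - mean_c) * (p - mean_p) for c, p in zip(current, previous))
    var_c = sum((c - mean_c) ** 2 for c in current)
    var_p = sum((p - mean_p) ** 2 for p in previous)
    return cov / (var_c * var_p) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
steps = [rng.gauss(0, 1) for _ in range(2000)]  # independent random movements

# Cumulate the movements to get the level series (a simple random walk).
levels = []
total = 0.0
for s in steps:
    total += s
    levels.append(total)

print("lag-1 correlation of the movements:", round(lag1_correlation(steps), 3))
print("lag-1 correlation of the levels:   ", round(lag1_correlation(levels), 3))
```

The movements should show a correlation near zero, while the cumulated levels should show a correlation near 1, which is why a chart of the levels appears to contain long trends.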
The same general point arises in nearly all forecasting work. If you have records of monthly sales in a department store for the last 10 years, and are asked to project those sales into the future, those statistics will not reflect the fact that as you work, a new discount store is opening a few blocks away, or the city has just changed the street in front of your store to a one-way street, making it harder for customers to reach your store.
A second problem that arises in time-series forecasting is that you rarely know the true shape of the distribution with which you are working. Workers in other areas often assume normal distributions while knowing that assumption is not fully accurate. However, such a simplification may produce more serious errors in time-series work than in other areas. In much statistical work the problem of non-normal distributions is greatly ameliorated by the fact that you are really concerned with sample means, and the Central Limit Theorem asserts that means are often approximately normally distributed even when the underlying scores are not. However, in time-series forecasting you are often concerned with just one time period--the period for which you want to forecast. Thus the Central Limit Theorem has no chance to operate, and the assumption of normality may lead to seriously wrong conclusions. Even if your forecast is just one of a series of forecasts which you update after each new time period, the forecasts are made one at a time, so that a single seriously wrong forecast may bankrupt your company or lead to your dismissal, and nobody will ever learn that your next 50 forecasts would have been within the range predicted by a normal distribution. Some stock market speculators, who had previously been quite successful, were bankrupted or driven into retirement by the stock market plunge of 1987. That's very different from the situation in which a company hires, all at once, 50 workers identified by a competence test. If one out of the 50 is a spectacular failure, the company (and you the forecaster) will survive because at the very same time the other 49 were turning out well.
One problem with such research is that because the observations within each series are not independent of each other, the probability of finding a high correlation between the two series may be higher than is suggested by standard formulas. Later we describe a solution to this problem.
A second problem is that it is rarely reasonable to assume that the time sequence of the causal patterns matches the time periods in the study. Thus if increased unemployment typically produced an increase in crime exactly six months later but not five months later, then it would be fairly easy to discover that relationship by correlating monthly changes in unemployment with monthly changes in crime six months later. However, it is much more plausible to assume that increased unemployment in January produces a slight rise in crime during February, a further slight rise during March, and so on for several months. Such effects can be much more difficult to detect, though later we do suggest a solution to this problem.
A third problem in analyzing causal patterns is the familiar problem that correlation does not imply causation. As in ordinary regression problems, it helps to be able to control statistically for covariates. Later we describe one way to do this in time-series problems.
The same general logic applies in time-series work. For any of our three major uses of time-series analysis, you predict or forecast each value in the series as accurately as possible from previous values--either in the same series or other series. Then you may draw causal inferences from the fact that some particular factor is or is not of predictive value. As in a previous example, if you could somehow predict the crime rate from previous levels of unemployment, that would suggest that unemployment may be one of the causes of crime.
Thus the topic of forecasting is central to all our major uses of time series. Of course forecasting is often of great interest in its own right, and indeed many discussions of time series don't even mention other applications. Therefore we start with that topic.
In autoregression X usually denotes the dependent variable, which is plotted as a function of TIME. Consider possible ways of forecasting x_i from the previous points. One way would be to simply use each entry as the estimate of the following entry. That would actually give the best possible forecasts in a simple random walk. A second way would be to average, say, the last 5 entries before each entry x_i and use that average as an estimate of x_i. A third way would be to draw a straight line through the last two entries before x_i, and extend that line one unit to the right to estimate x_i. You could imagine trying out these three methods, seeing which one works best in forecasting entries in the series from the ones preceding them, and then applying the winning method to the last few entries in the series to forecast the next entry.
In all three of these methods, the forecast is a linear function of the observations preceding x_i. That point is obvious for the first method, in which the forecast is simply x_{i-1}. A mean is also a linear function of the observations, so the second method also uses a linear function. In using the third method (passing a straight line through points x_{i-2} and x_{i-1}), you are essentially assuming that the difference between x_i and x_{i-1} will equal the difference between x_{i-1} and x_{i-2}. Thus the forecast of x_i is x_{i-1} + (x_{i-1} - x_{i-2}), or 2x_{i-1} - x_{i-2}, which is also a linear function of the observations.
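The three methods, and the idea of trying each one out on the series itself, might be sketched as follows. The data values and the function names are hypothetical, chosen only for illustration; each forecast uses only entries that precede position i.

```python
def forecast_last(series, i):
    """Method 1: use the previous entry, x[i-1]."""
    return series[i - 1]

def forecast_mean5(series, i):
    """Method 2: average the five entries before x[i]."""
    return sum(series[i - 5:i]) / 5

def forecast_line(series, i):
    """Method 3: extend the line through x[i-2] and x[i-1]: 2*x[i-1] - x[i-2]."""
    return 2 * series[i - 1] - series[i - 2]

def mean_squared_error(method, series, start=5):
    """Average squared one-step-ahead error over entries start..end."""
    errors = [series[i] - method(series, i) for i in range(start, len(series))]
    return sum(e * e for e in errors) / len(errors)

# Hypothetical data, for illustration only.
series = [12, 14, 13, 15, 18, 17, 19, 22, 21, 23, 26, 25]

methods = {
    "last value": forecast_last,
    "mean of 5": forecast_mean5,
    "line through 2": forecast_line,
}
scores = {name: mean_squared_error(m, series) for name, m in methods.items()}
best = min(scores, key=scores.get)

# Apply the winning method to the end of the series to forecast the next entry.
next_forecast = methods[best](series, len(series))
print(scores)
print("best method:", best, "| forecast of next entry:", next_forecast)
```

All three methods start forecasting at the sixth entry so that they are compared on the same set of points.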
Autoregression provides a way of examining an extremely broad class of linear functions like these, and selecting the one that works best from this class. In effect you merely have to say which preceding observations you wish to include in the linear function, and autoregression will examine every possible linear function of those observations and select the one that works best in the current sample. If you use only the most recent consecutive observations, then the number of observations used in making each forecast is the order of the autoregression.
Our three examples might suggest that linear functions include only a limited range of predictive techniques, but linear autoregressive functions include far more techniques than is obvious. For instance, it can be shown that assigning weights of +3, -3, and +1 to x_{i-1}, x_{i-2}, and x_{i-3} respectively is equivalent to fitting a parabola to those three points and extending the parabola one unit to the right to make a forecast. To illustrate, suppose three points are respectively 1, 4, and 9, the squares of the first three integers. Then the forecast of the next point is 3*9 - 3*4 + 1*1 = 27 - 12 + 1 = 16 = 4^2. Notice we're not talking about fitting just one parabola to an entire series, but rather a separate parabola for each point we're trying to forecast. Some of these parabolas might curve upward while others curve down, and some might be steepest on the right while others are steepest on the left. All they would have in common is that they are all parabolas.
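The equivalence between the weights and the parabola can be checked numerically. In the sketch below (function names are our own), one function applies the weights directly and the other fits a parabola exactly through three points placed at times 0, 1, and 2, then evaluates it at time 3.

```python
def parabola_forecast_by_weights(x1, x2, x3):
    """Forecast the next point from three consecutive points (oldest first)
    using weights +1, -3, +3 on the oldest, middle, and newest point."""
    return 3 * x3 - 3 * x2 + 1 * x1

def parabola_forecast_by_fit(x1, x2, x3):
    """Fit y = a*t^2 + b*t + c exactly through (0, x1), (1, x2), (2, x3)
    and evaluate the fitted parabola at t = 3."""
    c = x1                          # y(0) = c
    a = (x3 - 2 * x2 + x1) / 2      # half the second difference
    b = x2 - x1 - a                 # from y(1) = a + b + c
    return a * 9 + b * 3 + c

print(parabola_forecast_by_weights(1, 4, 9))  # 16
print(parabola_forecast_by_fit(1, 4, 9))      # 16.0
```

Substituting the expressions for a, b, and c into 9a + 3b + c gives 3*x3 - 3*x2 + x1, so the two functions agree for any three points, not just for this example.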
Going even beyond parabolas, consider finding the best-fitting polynomial of order k for a set of n consecutive points, where n >= k+1, and making a forecast by extending that polynomial one unit to the right. Or suppose you do this while weighting the earlier points less than the later points because they're farther from the point to be forecast. It can be shown that all these techniques are linear techniques, and are thus within the class of techniques from which autoregression selects the best one. Therefore autoregression can be an extremely powerful tool for finding a way to predict each point in the series from earlier points.
| X | B1 | B2 |
|---|----|----|
| 3 | . | . |
| 7 | 3 | . |
| 4 | 7 | 3 |
| 5 | 4 | 7 |
| 2 | 5 | 4 |
| 8 | 2 | 5 |
In a program like SYSTAT, the command LET B1 = LAG(X) will produce the second column from the first, and the command LET B2 = LAG(B1) will then produce the third. In some programs if you want to lag more than one row in a single command, you can add the order of the lag as a second argument. Thus for instance column B2 might be produced directly from column X with the command LET B2 = LAG(X, 2). Once the lagged variables are constructed the regression may be run in the usual way. Thus the name autoregression--the variable is predicted from itself. If all terms in the model are consecutive starting with B1, then the number of lagged terms in the regression is called the order of the autoregression.
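For readers without SYSTAT, the same construction can be sketched in plain Python. This is a minimal illustration, not a substitute for a statistics package: the names `lag`, `solve`, and `autoregress` are our own, missing values are represented by `None`, rows with missing lagged values are dropped, and the regression is fitted by solving the normal equations directly.

```python
def lag(column, k=1):
    """Shift a column down k rows, filling the top with None (missing)."""
    return [None] * k + column[:len(column) - k]

def solve(a, b):
    """Solve the linear system a * coef = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef

def autoregress(x, order):
    """Least-squares regression of X on a constant and its first `order` lags.
    Returns [constant, coefficient of B1, coefficient of B2, ...]."""
    lags = [lag(x, k) for k in range(1, order + 1)]
    rows = [[1.0] + [col[i] for col in lags] for i in range(order, len(x))]
    y = x[order:]
    p = order + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

x = [3, 7, 4, 5, 2, 8]
b1 = lag(x)      # [None, 3, 7, 4, 5, 2]
b2 = lag(b1)     # [None, None, 3, 7, 4, 5], the same as lag(x, 2)
print(b1, b2)
print(autoregress(x, order=1))
```

With only six observations the fitted coefficients mean little; the example is intended only to show how the lagged columns line up with X and how they enter the regression.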
The first-order autocorrelation of X is the correlation between X and B1, the second-order autocorrelation is the correlation between X and B2, and so on. (This is not exactly true, though it's usually very close; see other works for exact formulas.) The terminology of autoregression can be used to define a random walk precisely. In a simple random walk, the regression coefficient of B1 is 1 and all other coefficients (and the additive constant) are 0. Thus the best forecast of any entry in the series is simply the previous entry.
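The difference between the two definitions can be seen in a short sketch. The first function below computes the ordinary correlation between X and its lagged copy, as described above. The second uses one common textbook estimator, which takes the mean and sum of squares from the whole series; whether this is the exact formula a given program uses is an assumption, and both function names are our own.

```python
def lagged_correlation(x, k):
    """Pearson correlation between X and X lagged k rows,
    using only the rows where both values exist."""
    current, lagged = x[k:], x[:len(x) - k]
    n = len(current)
    mean_c = sum(current) / n
    mean_l = sum(lagged) / n
    cov = sum((c - mean_c) * (l - mean_l) for c, l in zip(current, lagged))
    ss_c = sum((c - mean_c) ** 2 for c in current)
    ss_l = sum((l - mean_l) ** 2 for l in lagged)
    return cov / (ss_c * ss_l) ** 0.5

def sample_autocorrelation(x, k):
    """A common estimator: lagged cross-products around the overall mean,
    divided by the overall sum of squares."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

x = [3, 7, 4, 5, 2, 8]
print(lagged_correlation(x, 1))
print(sample_autocorrelation(x, 1))
```

In a series this short the two values differ noticeably; in long series they are usually very close.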
For readers familiar with ARIMA who want to see how they can use a regression program to get exactly the same results (regression slopes, mean squared error, and standard errors of slopes) they get from ARIMA, we offer this paragraph. Modify ordinary regression practices in two ways. First, run the regression without a constant term. Second, at the tops of the columns of lagged variables, replace the missing values with zeros. Your results will then be identical to ARIMA results to many decimal places. However, we don't generally recommend either of these modifications to ordinary regression practices, especially for the more complex models we introduce later, for reasons that should ultimately become clear.
For maximum confidence in this forecast, you want to see whether the past success of the forecasting formula was uniform across time. To do that, plot the forecasting errors against time. Look for sections in which the mean error was not zero, and look for sections in which the variance of the errors was larger than at other times. If you don't trust your eyeball to do this, you can use quadratic or higher-order polynomial regression to separately predict e and e^2 from TIME. Significant curvilinearity in predicting e suggests that the forecasting model can be improved, perhaps by adding one or more terms for trend; see later sections of this chapter. Significant curvilinearity in predicting e^2 cannot be removed by changing the model; it suggests that the model works better at some times than others, and thus suggests caution in interpreting the confidence limits described next.
If your data pass the tests just described, and if your sample was quite large, you can put confidence limits on your forecast without assuming normality. Since you are thinking of the next case in the series as just another case in your sample, it follows that the probability is 95% that its forecasting error won't be among the largest 5% of forecasting errors in the sample. Therefore you can use the actual distribution of forecast residuals to find what size error would make the new error fall in that top 5%, and use that to put 95% confidence limits on your forecast of the next series entry.
If you prefer you can do this without even assuming symmetry of residuals; you can use the actual forecast errors to determine what error would fall in the top 5% of positive errors, and determine separately what error would fall in the top 5% of negative errors. For instance, if your forecast was 40, and the sixth-largest of 120 positive errors was +16 and the fifth-largest (in absolute value) of 100 negative errors was -13, then your 95% confidence limits would be 40 + 16 = 56 and 40 - 13 = 27.
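The asymmetric procedure might be sketched as follows. The function name is our own, errors are taken to be actual minus forecast, and the rule for which ranked error marks the cutoff (5% of the count, rounded) is an assumption chosen to match the worked example; the error values themselves are made up so that the sixth-largest positive error is +16 and the fifth most negative is -13.

```python
def empirical_limits(forecast, errors, tail=0.05):
    """Confidence limits from past forecast errors (actual minus forecast),
    treating positive and negative errors separately."""
    positive = sorted((e for e in errors if e > 0), reverse=True)  # largest first
    negative = sorted(e for e in errors if e < 0)                  # most negative first
    if not positive or not negative:
        raise ValueError("need both positive and negative errors")
    k_pos = max(1, round(tail * len(positive)))  # e.g. 6 of 120
    k_neg = max(1, round(tail * len(negative)))  # e.g. 5 of 100
    upper = forecast + positive[k_pos - 1]
    lower = forecast + negative[k_neg - 1]
    return lower, upper

# Made-up errors: 120 positive and 100 negative, for illustration only.
positive_errors = [30, 28, 25, 20, 18, 16] + [(i % 15) + 0.5 for i in range(114)]
negative_errors = [-25, -20, -17, -15, -13] + [-(i % 12) - 0.5 for i in range(95)]

print(empirical_limits(40, positive_errors + negative_errors))  # (27, 56)
```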
If the sample was only moderately large, then you should consider the fact that forecasting errors will tend to be smaller for cases used to develop the forecasting formula than for other cases. If the forecast errors are approximately normal, it helps to assume normality at this point. You can use ordinary tests for normality to check this assumption. The MSE (mean squared error), reported by either a regression program or a time-series autoregression program, equals the sum of squared errors divided not by the sample size N, but by (N - number of parameters used in fitting the model). This helps adjust for the downward bias in individual errors. It's not ordinarily done, but a further adjustment would consist of finding 95% confidence limits by multiplying the square root of MSE not by 1.96, but by the appropriate value of t, which is a little larger. The argument is that you really want to find the ratio between a normally distributed error, and an independently distributed value of sqrt(MSE), and that ratio is distributed as t. For instance, suppose your forecast is 50, MSE = 36, N = 63, and the model contains 3 parameters. Then df = 63 - 3 = 60, and a t table shows that a t of 2.00 should be used for 95% confidence limits. Thus the confidence limits are 50 plus and minus 2 sqrt(36), or 38 and 62.
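The arithmetic of this example can be written out as a short sketch. The function names are our own, and the critical value of t is passed in by hand (2.00, read from a t table at 60 degrees of freedom) because the Python standard library has no t-distribution quantile function.

```python
import math

def mean_squared_error(errors, n_params):
    """Sum of squared errors divided by (N - number of fitted parameters)."""
    return sum(e * e for e in errors) / (len(errors) - n_params)

def t_based_limits(forecast, mse, t_crit):
    """Forecast plus and minus t_crit * sqrt(MSE)."""
    half_width = t_crit * math.sqrt(mse)
    return forecast - half_width, forecast + half_width

n, n_params = 63, 3
df = n - n_params                       # 60 degrees of freedom
lower, upper = t_based_limits(50, 36, t_crit=2.00)
print(df, lower, upper)                 # 60 38.0 62.0
```

Using 1.96 instead of 2.00 would give slightly narrower limits; the difference grows as the degrees of freedom shrink.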
To construct a forecasting formula that predicts two or more time periods ahead, simply omit the shortest lags from the regression model. For instance, a model predicting X from B3, B4, and B5 forecasts three time periods ahead of the most recent observation used in the forecast. Of course such forecasts are often much less accurate than shorter-term forecasts. As in one-step-ahead forecasts, confidence limits on the forecasts can be found either from normal-curve formulas or from the empirical distribution of forecast residuals.
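The alignment of the lagged columns for such a model can be sketched as follows. The function names and the data are hypothetical; the sketch only builds the rows that would be passed to a regression program and shows which observations would be used to forecast three periods past the end of the series.

```python
def lagged_rows(x, lags):
    """Return (target, predictors) pairs, where the predictors are x lagged by
    each entry of `lags`. Rows lacking any lagged value are dropped."""
    longest = max(lags)
    return [(x[i], [x[i - k] for k in lags]) for i in range(longest, len(x))]

def predictors_for_horizon(x, lags):
    """Predictor values for a forecast min(lags) periods past the end of x."""
    target_index = len(x) - 1 + min(lags)
    return [x[target_index - k] for k in lags]

x = list(range(10, 20))   # hypothetical series: 10, 11, ..., 19
lags = (3, 4, 5)          # B3, B4, B5: the shortest lags are omitted

rows = lagged_rows(x, lags)
print(rows[0])                          # (15, [12, 11, 10])
print(predictors_for_horizon(x, lags))  # [19, 18, 17]
```

Because the shortest lag is 3, the most recent observation entering any forecast is three periods older than the entry being forecast.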
To study immediate jumps in a series, apply autoregression to the portion of the series preceding the event, and use the methods already described to forecast a series entry after the event. If the event's effect is hypothesized to occur primarily in a single time period, use a one-step-ahead forecast. If the effect is hypothesized to occur over several time periods, a test forecasting several time periods ahead may have greater power. Of course you will typically know in advance whether the series jumped in one step or in several, but at least theoretically you should choose your test (or get colleagues to make the choice) without that knowledge.
As was mentioned earlier, when you use this test you should think as carefully as possible about other events occurring at or about the same time, which might have produced the jump; the test may not distinguish between the effects of two events which occur just a few time periods apart.