Regression Methods in Statistical Process Control

Richard B. Darlington

Abstract

The first few paragraphs of this work describe 5 major advantages that result from the use of multiple regression, simultaneous linear equations, and regression-based time-series analysis in statistical process control (quality control).

* * *

Statistical process control (SPC) is the use of statistical methods to improve the quality or uniformity of the output of a process--usually a manufacturing process. Many SPC methods can apply to other processes as well, such as the processing of checks in a bank.

The Nature of this Document

I make no attempt to describe here the most basic SPC methods, which are discussed in several excellent books such as Ryan (1989) and Mitra (1993). Rather I describe here three more advanced methods which require an understanding of regression. The last of the three also requires understanding of a regression approach to time-series analysis, as in my piece A regression approach to time-series analysis. I have tried to write the time-series material in this document so it can be read without having previously read that other document, but some readers may want to read that document for a deeper understanding of the final sections of this document.

For simplicity I write throughout this document as if all adjustments to processes involved resetting dials on a machine, although of course in reality there are many other ways to adjust a process.

The methods here have the following advantages over more basic methods:

1. Simpler methods often determine merely whether a process needs adjustment, while I describe how a statistically untrained worker can use a simple computer program for simultaneous equations to determine quickly the specific nature and extent of any needed adjustments. I also show how an ordinary regression program can be used for this purpose, so that a separate program for simultaneous equations need not be purchased.

2. I consider the case in which a machine has several dials that may be reset, but each setting affects several characteristics of the finished pieces. Thus it may be very difficult to determine the optimum combination of settings, since changing any dial affects several output characteristics at once. I show how a statistically untrained worker may be able to quickly use the aforementioned computer program to simultaneously calculate a whole new set of dial settings estimated to produce optimum output under the current conditions (temperature, humidity, nature of raw material, etc.).

3. Though many books on SPC describe designs for carefully controlled experiments in which dial settings are varied systematically, the current approach makes no such requirement. While collecting data to enter into the aforementioned simultaneous equations, settings may be varied unsystematically as part of ad hoc attempts to achieve high-quality output as quickly as possible. All that the current approach requires is that you keep track of all the data associated with each piece of output, and that each of these input characteristics vary enough (and independently enough of other input characteristics) to allow you to assess its effects. Later I comment more on this latter point.

4. I assume that the aforementioned analyses will be based on a substantial body of data--at least 100 pieces of output and ideally much more. However, I do not assume that the target specifications of all the pieces in that data set are identical, and I also allow workers using the aforementioned program to change the target specifications frequently. Thus these methods will often work well even if manufacturing runs between specification changes are very short.

5. Like most methods built on hypothesis testing, many SPC methods take an either-or approach to process adjustments--much like the common practice of estimating the true difference between two means to be 0 if one calculates p = .06, but estimating it to equal the observed difference if one calculates p = .04. In contrast, the present approach does not treat current settings as a null hypothesis to be retained until definitively rejected, but rather treats settings as something to be recalculated and adjusted as often as is convenient.

Some Basic Assumptions and Notation

All three methods require that whenever the process changes in fundamental ways, someone with statistical knowledge run a set of regressions predicting the characteristics of finished pieces from the dial settings and uncontrolled conditions (temperature, humidity, etc.) that existed during their manufacture. A major assumption I require, that is not required by many other SPC methods, is linearity in these regressions. I explain this point more fully later, including the use of transformations to achieve the required linearity.

I also require that a computer program for solving simultaneous linear equations be available for quick calculations on the factory floor, though I do not require that the workers making these quick calculations be trained in statistics. As mentioned above, I explain later how an ordinary regression program can be used for this purpose if necessary. In what follows, the original set of regressions will be termed the fundamental analysis, while the later solution of simultaneous equations will be termed a quick analysis.

In addition, I share the following assumptions with nearly all SPC methods. These paragraphs also introduce my basic notation.

1. I assume that each piece, after its manufacture, can be measured on one or more dimensions Y1, Y2, etc., such as length, hole diameter, hardness, etc. I do not assume that every single piece is actually measured on these dimensions, but I do assume that pieces are regularly selected from the output stream for measurement. Typically every kth piece of output is selected for evaluation, though I shall later suggest other possibilities.

2. I assume there is some optimum score Ti (T for "target") on each dimension Yi--e.g. an optimum length or hole size. More basic approaches to SPC sometimes simply classify each piece as "satisfactory" or "unsatisfactory", but consistent with the work of Taguchi and others, I shall treat the output characteristics as continuous dimensions, and assume that the goal is to maximize uniformity of output by minimizing the mean squared deviation of scores on each dimension Yi from their optimum level Ti.

3. I also assume that besides the settings controlled directly by workers, there are several other variables, called covariates, which affect the output and which can be measured but are not easily controlled. Covariates may include temperature of the machine or some crucial part of the machine, atmospheric humidity, and perhaps certain characteristics of the raw material. That is, the nature of the raw material is controllable in theory, but in practice material with given characteristics has been delivered at a particular time, and substitute material is not readily available. Other covariates may measure the elapsed time or number of items produced since a machine was last cleaned or lubricated. If dirt accumulates and lubricant evaporates partly with the mere passage of time and partly as a result of use, then you might consider four covariates of this type: time and number of uses since last cleaning, and time and number of uses since last lubrication. Of course I cannot begin to suggest all the other covariates that might be relevant in a given instance.

There is a widespread exaggeration of the statistical dangers of using too many covariates as variables in regressions; see another article. In a nutshell, the most important stability characteristics of a regression are determined not by the ratio between the number of cases in the sample and the number of variables in the regression, as is widely believed, but by the difference between those two quantities. Therefore with large samples (which are usually available in SPC), adequate sampling stability can usually be achieved even with many covariates. Thus I prefer erring on the liberal side in determining the number of covariates.

More on Settings Versus Covariates

I have described settings as controllable and covariates as uncontrollable, but in fact the controllability of the input parameters may vary along a continuum. Machine temperature may be lowered by slowing the machine, but only at the cost of lost production. The nature of raw material may be uncontrollable in the short run, but may be controlled over time by negotiating with the supplier. On the other hand, dials on the manufacturing machine can usually be reset much more easily.

I assume here that you have identified the p most easily controlled input parameters, where p is the number of output characteristics measured, and that those p input parameters will collectively affect all of the output characteristics so that optimum output can in principle be achieved by manipulating just those p input parameters. I will define settings as the p most easily controlled input parameters, so that the number of "settings" equals the number of output characteristics, and define covariates as the less easily controlled input parameters. Let q denote the number of covariates.

I describe three methods, in order of increasing complexity: the basic regression method, regression with simple corrections, and regression with time-series corrections.

The Basic Regression Method

An example of the regression method

Suppose that pieces being manufactured are baked in an oven, and workers can control both the oven temperature and the time each piece remains in the oven. Temperature and bake time each affect both the size and hardness of the manufactured piece, and there is an ideal size and an ideal hardness.

Further suppose that just before entering the oven, each piece is washed thoroughly with ordinary water. The temperature of that water affects the temperature of the piece at the time it enters the oven, and therefore affects the optimum oven temperature and bake time. The temperature of the water can be measured accurately, but it is uneconomical to try to control the water temperature. Thus the water temperature varies substantially from winter to summer, and even varies somewhat from day to day.

In this example bake time and oven temperature are settings while water temperature is a covariate. For simplicity that is the only covariate in this example, though other examples could have several covariates. The problem is to determine the optimum bake time and oven temperature for a given moment given the water temperature at that moment.

To apply the regression method to this problem, we would run a set of regressions predicting the hardness H and size S of pieces from the bake time B, oven temperature T, and water temperature W that obtained at the time of their manufacture. These regressions might be based on the characteristics of every hundredth piece manufactured, and might use pieces manufactured over several weeks or months so that water temperature varied substantially across the pieces.

Let the lower-case letters a through h denote the coefficients and additive constants found in these two regressions; specifically, the regressions are

H = aB + bT + cW + d

S = eB + fT + gW + h
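
As a concrete sketch of how the fundamental analysis might be run, the following Python fragment fits these two regressions by ordinary least squares. The data are entirely made up for illustration, and the variable names H, S, B, T, W simply mirror the notation above; this is a sketch of the idea, not a prescription for any particular package.

import numpy as np

rng = np.random.default_rng(0)
n = 200                                   # pieces sampled from the output stream
B = rng.uniform(20, 40, n)                # bake time
T = rng.uniform(300, 400, n)              # oven temperature
W = rng.uniform(5, 25, n)                 # water temperature (the covariate)
# Simulated output characteristics, for illustration only
H = 0.8*B + 0.05*T + 0.3*W + 2 + rng.normal(0, 0.5, n)   # hardness
S = 0.2*B - 0.01*T + 0.1*W + 9 + rng.normal(0, 0.5, n)   # size

X = np.column_stack([B, T, W, np.ones(n)])         # predictors plus constant term
a, b, c, d = np.linalg.lstsq(X, H, rcond=None)[0]  # H = aB + bT + cW + d
e, f, g, h = np.linalg.lstsq(X, S, rcond=None)[0]  # S = eB + fT + gW + h
print(a, b, c, d)
print(e, f, g, h)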

Let H' and S' denote the desired values on hardness and size; they are of course known. Water temperature W is also a "given" for our analysis, since it cannot be controlled. Putting all the known fixed values on the left sides of the equations, we have

H' - cW - d = aB + bT

S' - gW - h = eB + fT

Since the quantities on the left sides of the equations are known, we will denote them by m and n respectively, so the equations become

m = aB + bT

n = eB + fT

All the lower-case letters in these equations are known, and we want to find the corresponding values of B and T. Ordinary simultaneous-equation methods yield the formulas

B = (mf - nb)/(af - eb)

T = (na - me)/(af - eb)

These equations allow one to combine the values found by the regression with the desired size and hardness of the pieces to find the optimum bake time B and oven temperature T.

In a real-world setting, the regressions used to find the 8 values a through h (what I am calling the "fundamental analysis") might have to be done only once, until some feature of the manufacturing process changes. But then the equations

m = H' - cW - d

n = S' - gW - h

B = (mf - nb)/(af - eb)

T = (na - me)/(af - eb)

would have to be recalculated on the factory floor every time there was a change in water temperature W or desired size and hardness S' and H'. Thus it is desirable to make the latter calculation as simple and automatic as possible. Methods for doing this are described later. This part of the analysis is the "quick analysis".
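
Because the quick analysis must be repeated so often, it helps to see how little computation it actually involves. The following Python sketch packages the four equations above into a single function; the coefficient values and current conditions passed to it are hypothetical numbers chosen only for illustration.

def quick_settings(a, b, c, d, e, f, g, h, H_target, S_target, W_now):
    # Move the known quantities to the left-hand sides
    m = H_target - c * W_now - d
    n = S_target - g * W_now - h
    det = a * f - e * b                 # must be nonzero (no singularity)
    B = (m * f - n * b) / det           # optimum bake time
    T = (n * a - m * e) / det           # optimum oven temperature
    return B, T

# Hypothetical coefficients, targets, and current water temperature:
print(quick_settings(0.8, 0.05, 0.3, 2, 0.2, -0.01, 0.1, 9,
                     H_target=45, S_target=12, W_now=18))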

Using matrix notation unrelated to the previous notation, this procedure can be generalized to any number of settings and covariates as follows. Let

t = the p x 1 vector of current target values for the output characteristics
s = the p x 1 vector of settings
c = the q x 1 vector of current covariate values
Bs = the p x p matrix of regression slopes relating the output characteristics to the settings
Bc = the p x q matrix of regression slopes relating the output characteristics to the covariates
a = the p x 1 vector of additive constants from the regressions

Then the quick analysis starts with the matrix equation

t = Bs s + Bc c + a

Solving for s gives

s = Bs^(-1) (t - Bc c - a)

The values in Bs change only with new fundamental regression analyses, so the matrix inversion need be done only occasionally; the only values that change often are in t and c.
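
Here is a minimal Python/numpy sketch of this general quick analysis, with p = 2 settings, q = 1 covariate, and hypothetical coefficient values throughout. It uses np.linalg.solve rather than an explicit matrix inverse, which is equivalent here and numerically preferable.

import numpy as np

Bs = np.array([[0.8, 0.05],     # p x p slopes for the settings
               [0.2, -0.01]])
Bc = np.array([[0.3],           # p x q slopes for the covariates
               [0.1]])
a = np.array([2.0, 9.0])        # additive constants
t = np.array([45.0, 12.0])      # current target values
c = np.array([18.0])            # current covariate values

s = np.linalg.solve(Bs, t - Bc @ c - a)   # optimum settings
print(s)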

Using a regression program for the "quick" analyses

Consider a no-constant regression in which the number of cases exactly equals the number of predictor variables. If there is no singularity, the regression will always fit the data perfectly. If y denotes the column vector of dependent-variable scores, X denotes the matrix of predictor scores, and b denotes the vector of regression coefficients, then the equation y = Xb will hold exactly, so b = X^(-1)y. Thus for this case, a regression program simply solves a set of simultaneous linear equations. Therefore, to use a regression program to solve the aforementioned set of linear equations, first make up a data set with p rows (one row for each output characteristic) and p+q+2 "variables": columns S1, S2,...,Sp containing the regression slopes for the settings, columns C1, C2,...,Cq containing the regression slopes for the covariates, a column t containing the current target values, and a column a containing the additive constants from the regressions. Then enter a command that for 3 covariates might look like

LET d = t - C1*94 - C2*63 - C3*7 - a

where 94, 63, and 7 are the current values for covariates C1, C2, and C3 respectively. In other words, each covariate's column of coefficients is multiplied by that covariate's current value. The variable d computed in this way can serve as y in the equation b = X^(-1)y. Then use a no-constant regression to predict d from S1, S2,...,Sp. The resulting regression slopes are the desired settings.
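
For readers who want to verify this claim numerically, the following short Python check, with arbitrary made-up numbers, shows that a no-constant least-squares fit with as many cases as predictors reproduces the exact solution of the simultaneous equations.

import numpy as np

X = np.array([[0.8, 0.05],      # square, nonsingular matrix of predictor scores
              [0.2, -0.01]])
y = np.array([3.0, 1.5])        # the column d computed by the LET command

b_regression = np.linalg.lstsq(X, y, rcond=None)[0]    # no-constant regression
b_exact = np.linalg.solve(X, y)                        # direct equation solution
print(np.allclose(b_regression, b_exact))              # True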

Additional properties of the regression method

This method requires that the original regressions be linear. You are not allowed to achieve that linearity by adding to the regression quadratic or higher-order polynomial terms of the setting variables. However, you may use such terms for the covariates. And even for the setting variables you may achieve linearity by replacing a variable X by, say, log(X) or sqrt(X) or X^2. If all settings are positive, you can try various powers of a setting variable (X^.9, X^.8, etc.) to maximize linearity. If you do this to achieve linearity in the original regression, then the quick analysis must use the transformed variable(s), and you must add a step to the quick analysis which reverses the transformation. For instance, suppose you had to raise the first setting variable to the power .4 in order to achieve linearity. Then the worker who reads a value of 3.6 from the quick analysis could use a cheap pocket calculator to calculate a setting of 24.6, because 1/.4 = 2.5 and 3.6^2.5 = 24.6.

As you use this system in manufacturing, you may if you wish save all relevant data (input settings, covariate values, and output characteristics) for the newly manufactured items. When enough data has accumulated, you can repeat the fundamental analysis, in order to get more accurate results from the larger data base. In fact, as you realize belatedly that some of the settings used earlier were far from optimum, you may choose to discard that data from the analysis. That is a second way to achieve adequate linearity, because linearity is more likely to be a good approximation over a short range than over a broad range. Notice I am not suggesting that you discard from the data set the pieces which missed the targets by the largest amounts; rather discard the cases in which the settings were (in retrospect) worst. In other words, discard the pieces for which Ŷ (the value predicted by the regression), not Y, was farthest from the target values.

As mentioned, the quick analysis may be done whenever covariates change noticeably. One covariate that you know is changing constantly is time--time since last lubrication, time since last cleaning, or simply time since the machine was turned on. The regression method allows you to estimate the amount by which the optimum settings change with time. Thus you might in principle choose to change the settings every few minutes as a function of time, without even bothering to measure other covariates so frequently. Since you are allowed to use polynomial functions of the covariates and time is a covariate, you might derive a polynomial rule which suggests changing a dial by 7 units every 10 minutes just after lubricating a machine, but then gradually increasing those changes to 10 then 12 then 15 units every 10 minutes before lubricating the machine again.

In the fundamental regression analysis the TOLERANCE values are more useful than in many applications. Much research outside SPC involves carefully designed experiments in which no attention need be paid to TOLERANCE values because they all achieve their maximum values of 1.0. Much other research involves independent variables such as rainfall or subject's gender which are totally out of the control of the analyst, so that the analyst might notice the TOLERANCE values but cannot easily control them. SPC represents an intermediate case in which you are trying to achieve maximum quality as quickly as possible without doing carefully controlled experiments, but you can at any time sacrifice the quality of the next few pieces by increasing the range of one of the settings in order to get a better idea of the effects of that particular setting. In this process the TOLERANCE values can help tell you how much you could improve the estimate of the effect of a particular setting by increasing its range. The standard errors of regression slopes will tell you how accurate your current estimates are.
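
If your package does not report TOLERANCE values, they are easy to compute directly: the tolerance of a predictor is 1 minus the R-squared obtained when that predictor is regressed on all the other predictors. Here is a small Python sketch of that computation on made-up data; the collinearity induced in the third column is purely illustrative.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))        # columns = setting and covariate values
X[:, 2] += 0.8 * X[:, 0]             # induce some collinearity for illustration

def tolerance(X, j):
    # Regress column j on the other columns (plus a constant);
    # the tolerance is the residual variance divided by the total variance.
    others = np.column_stack([np.delete(X, j, axis=1), np.ones(len(X))])
    fitted = others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
    resid = X[:, j] - fitted
    return resid.var() / X[:, j].var()      # equals 1 - R^2

print([round(tolerance(X, j), 3) for j in range(X.shape[1])])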

Lagged Variables in Regression

I have so far ignored the problem of determining the optimum time to measure covariates. If a piece sits in a chemical bath for an hour, and the bath's temperature fluctuates but is measured continuously, which of the many temperature measurements should be entered into the analysis as a covariate? You may want to consider using several of these values as separate covariates. As mentioned earlier, the reasons for my relaxed attitude toward using many covariates are discussed in another article.

This means that the very same covariate measurement may appear two or more times in the data set used for the fundamental analysis, in different rows and columns. This is routinely done in time-series analysis, and is no problem because in regression there is no requirement that predictor variables be mutually independent. To illustrate the process, suppose each piece sits in a chemical bath for 5 minutes. Each minute you remove one piece and replace it with a new piece, so that 5 pieces share the bath at any given moment. You take the bath's temperature just before removing each piece from it. Then the bath's temperature during piece j's immersion is measured by the temperatures taken just before the removal of pieces j, j-1, j-2, j-3, and j-4. You don't know which of these temperatures will be most predictive of piece j's output characteristics, so you'd like to study all of them, treating them as five different covariates TEMP1, TEMP2, TEMP3, TEMP4, and TEMP5, with TEMP1 being the first temperature taken after piece j's immersion and TEMP5 the temperature taken just before piece j's removal. Suppose you have already recorded TEMP5 in a data set, with each piece's TEMP5 measurement appearing on the same row of the data set as that piece's output characteristics. Then in some statistical packages such as SYSTAT, you can easily create columns TEMP4, TEMP3, TEMP2, and TEMP1 with commands like

LET TEMP4 = LAG(TEMP5)
LET TEMP3 = LAG(TEMP4)
LET TEMP2 = LAG(TEMP3)
LET TEMP1 = LAG(TEMP2)

Each of these commands copies the entries of one column into another column, shifted down one row. Thus after executing these four commands, the entry that had been in row j-4 of column TEMP5 will appear as well in row j-3 of column TEMP4, row j-2 of column TEMP3, row j-1 of column TEMP2, and row j of column TEMP1. Therefore the five temperature measurements taken during piece j's immersion will appear on row j of the data set along with piece j's output characteristics.

If you don't want to use so many columns, in some packages such as SYSTAT 6 you can LAG a column by two or more rows at a time. For instance, suppose in the last example you wanted to use just TEMP1 and TEMP5, the first and last temperature measurements made during the immersion of piece j. You could construct TEMP1 from TEMP5 with the command

LET TEMP1 = LAG(TEMP5, 4)
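
If you happen to be working in Python rather than SYSTAT, the pandas shift method plays the same role as LAG. A minimal sketch with made-up temperature readings:

import pandas as pd

df = pd.DataFrame({"TEMP5": [70.1, 70.4, 70.2, 69.9, 70.6, 70.8, 70.3]})
df["TEMP4"] = df["TEMP5"].shift(1)   # like LET TEMP4 = LAG(TEMP5)
df["TEMP1"] = df["TEMP5"].shift(4)   # like LET TEMP1 = LAG(TEMP5, 4)
print(df)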

Using Recent Output to Adjust Settings

Many covariates cannot be measured or can be measured only imperfectly. Perhaps you can measure machine temperature only on the surface of the machine, while the important temperature is deep inside the machine. Or time since last lubrication may be only an imperfect measure of the actual amount of lubrication now on the crucial surfaces deep inside the machine. In the methods of this section, you allow for such problems by repeatedly using the output characteristics of the last few available pieces to correct the settings calculated by the regression method. I consider two correction methods: simple and time-series.

Regression With Simple Corrections

The method of this section assumes that the unmeasured covariates change slowly relative to the rate of production. In this method a worker measures the characteristics of the last few available pieces, and computes the amount by which their average on each characteristic deviates from the target value of that characteristic. A computer data set contains the same columns S1, S2,...,Sp described for the basic regression method. Add to that data set a column titled V, which contains the average deviations just described. Thus a value in V would be negative if the observed average score on a characteristic were above its target value, and positive otherwise. Then solve for d the set of simultaneous equations

V = Sd

where V is the vector of values just described, S is the p x p matrix of setting coefficients S1,...,Sp (one row per output characteristic), and d is the vector of amounts by which the settings should be changed for optimum future output. If you use a regression program to solve the set of simultaneous equations (see above), then predict V from S1, S2,...,Sp in a no-constant regression, and the regression slopes are the values of d.
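
The calculation itself amounts to a single solve of p equations in p unknowns. A minimal Python sketch with p = 2 and hypothetical numbers:

import numpy as np

S = np.array([[0.8, 0.05],    # p x p matrix of setting coefficients
              [0.2, -0.01]])
V = np.array([-1.5, 0.4])     # average deviations (target minus observed)

d = np.linalg.solve(S, V)     # or a no-constant regression of V on S1,...,Sp
print(d)                      # amounts by which to change the settings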

Regression With Time-Series Corrections

To see a possible limitation of regression with simple corrections, suppose that the last 5 pieces of output have exceeded a target value by 10, 8, 6, 4, and 2 units respectively, in chronological order. Inspecting this series, it's obvious that the best estimate for the next piece is an error of 0, so no dials should be adjusted before producing that piece. But if we simply took the mean error of the last 5 pieces, as in the last method, we would find a mean error of +6, and conclude erroneously that dials should be adjusted. The present method takes into account the time-series nature of processes, and thereby avoids errors like this.

For a more complex example of the kind of problem time-series corrections can handle, consider again the previous example in which both the hardness and size of manufactured pieces were affected by settings on oven temperature and bake time and a covariate of water temperature. Suppose the last 5 errors in hardness are -2, 0, 2, 4, and 6, so the next error is forecast to be 8, while the last 5 errors in size are 11, 8, 5, 2, -1, so the next error is forecast to be -4. Both hardness and size are affected by both oven temperature and bake time, so the problem is to adjust settings on those variables to correct for the forecast errors.

As you might guess, once the errors are forecast, the method is essentially the same as the previous method (regression with simple corrections), with the only difference being that forecast errors instead of average past errors are entered into column V. Thus the major difference between the methods is that we will now use time-series methods to forecast errors. Of course it will usually be more difficult to forecast these errors than it was in the simple examples of the previous two paragraphs.

The central feature of any problem which makes time-series corrections desirable is that fluctuations over even short time intervals may be nonrandom. If unmeasured covariates change slowly enough so that such fluctuations are largely random, then time-series corrections will not improve over simple corrections. I do not assume that all readers are familiar with time-series regression, so I go into some detail illustrating the commands used.

Fitting a Set of Time-Series Regressions

These regressions are fitted in a data set I'll call ERRORS. ERRORS begins with p columns E1, E2, etc. containing the deviations from a set of p target values for a series of output pieces. For instance, in the last example above, ERRORS would be

     E1   E2
     -2   11
      0    8
      2    5
      4    2
      6   -1
It doesn't matter whether entries in ERRORS are (target - observed) or (observed - target); the output I'll describe is the same either way.

After these first p columns, p other columns B1, B2, etc. are created using the LAG command. This can be done in a variety of ways depending on which predictors of errors you decide to use. I shall first consider the simple case in which the error in each characteristic of each piece is to be predicted from the error in the same characteristic of the immediately preceding piece. If p = 3, the commands adding the necessary columns to ERRORS might be

LET B1 = LAG(E1)
LET B2 = LAG(E2)
LET B3 = LAG(E3)

Then fit three separate regressions, the first predicting E1 from B1, the second predicting E2 from B2, and the third predicting E3 from B3. These should be no-constant regressions, so that errors of 0 in past output would be predictive of errors of 0 for the next output. It might seem that in recommending no-constant regression I have overlooked the fact that some settings may need continuous adjustment as a function of time (for instance constantly decreasing grinding time as a grinder with no temperature gauge gradually grinds faster as it heats up during the morning), but I assume that any such adjustments were already being made in the manufacturing process which yielded data set ERRORS.
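
As a minimal sketch of one such no-constant regression, the following Python fragment fits the slope from a made-up error series and then applies it to the most recent error to forecast the next one. The numbers are purely illustrative.

import numpy as np

E1 = np.array([-2.0, 0.0, 2.0, 4.0, 6.0, 5.0, 7.0])   # made-up error series
B1 = E1[:-1].reshape(-1, 1)        # lagged errors, like LET B1 = LAG(E1)
y = E1[1:]                         # current errors

slope = np.linalg.lstsq(B1, y, rcond=None)[0][0]   # no-constant regression
forecast_next = slope * E1[-1]                     # forecast error for the next piece
print(slope, forecast_next)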

As in an earlier example, lags of various lengths might be introduced, and all might be used in the regression. For instance, one might compute two lagged variables for each output characteristic with commands like

LET B11 = LAG(E1)
LET B12 = LAG(B11)
LET B21 = LAG(E2)
LET B22 = LAG(B21)
LET B31 = LAG(E3)
LET B32 = LAG(B31)

then predict E1 from B11 and B12, predict E2 from B21 and B22, and predict E3 from B31 and B32.

I now consider three possible modifications of this basic procedure.

First, the regressions should be designed to exclude predictors which will not be available in time during the ordinary manufacturing process. For instance, if a piece is manufactured every minute but it takes 3 minutes to measure the characteristics of a completed piece, then don't use the characteristics of the previous two pieces as predictors in the time-series regressions, because in practice that data will not be available in time. Thus instead of the LAG commands given above, you might use commands like

LET B1 = LAG(E1, 3)
LET B2 = LAG(E2, 3)
LET B3 = LAG(E3, 3)

where the second entry within the parentheses tells the program to lag by 3 rows instead of 1.

I will use the term "available" to handle the complication discussed in the previous paragraph. That is, the last few "available" pieces are the last few pieces whose characteristics would be known in time, during the ordinary manufacturing process, to allow them to be used to correct settings for the next piece. Actually I have been using this term throughout this section, even before explaining its full meaning. The availability problem is one reason you should typically use an ordinary regression program for this purpose; a time-series autoregression program typically does not allow you to omit the most recent observations from the set of predictors.

A second modification of the basic commands allows you to use as predictors an average of the output characteristics of several previous pieces. The usefulness of this modification depends partly on the cost and ease of measuring output: the cheaper and easier the measurement, the more measurements you may want to average to form the predictors used in the time-series regressions.

In a third possible modification to the basic command set, you include in each time-series regression the previous errors in other Y-variables as well as previous errors in the same variable Yi. To see why this might be useful, suppose two machines, A and B, sit side by side, and each piece typically enters machine A 30 minutes before entering B. Suppose output characteristic Y1 is most affected by machine A, while Y2 is most affected by B. Suppose the same general environmental conditions--dust, vibration, etc.-- generally affect the two machines similarly. Each time the area is hit by vibration from a passing truck, it affects Y2 starting with pieces entering machine B at the time of the vibration and affects Y1 starting with pieces entering machine A at the same time. But the former pieces actually entered machine A 30 minutes earlier, and thus will appear in data set ERRORS before the other pieces. Therefore errors in Y2 may be predictive of Y1 errors in later output.

You needn't knock yourself out thinking of scenarios like this; rather you can simply use regression to see whether errors in one output characteristic are in fact predictive of errors in other output characteristics for later pieces. Thus I suggest that the time-series regressions should typically predict the error of each output characteristic Yi from all p errors of the previously manufactured pieces. This is the other major reason a time-series autoregression program typically cannot be used to fit these regressions; such programs usually do not allow a series to be predicted from entries in series other than the one being predicted.

After the forecasts are made, enter them into column V and carry on from there as in regression with simple corrections.
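
Putting the pieces together, the following Python sketch, with made-up numbers throughout, forecasts the next errors in both characteristics from the lagged errors in both characteristics, then solves for the setting corrections as in regression with simple corrections.

import numpy as np

E = np.array([[-2.0, 11.0],    # columns = errors E1, E2 (observed minus target)
              [ 0.0,  8.0],    # rows = successive available pieces
              [ 2.0,  5.0],
              [ 4.0,  2.0],
              [ 6.0, -1.0]])

X = E[:-1, :]                                        # lagged errors of both characteristics
coef = np.linalg.lstsq(X, E[1:, :], rcond=None)[0]   # two no-constant regressions at once
forecast = E[-1, :] @ coef                           # forecast errors for the next piece

S = np.array([[0.8, 0.05],     # setting coefficients, as before
              [0.2, -0.01]])
V = -forecast                  # column V uses target minus observed, hence the sign flip
d = np.linalg.solve(S, V)      # corrections to the settings
print(forecast, d)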

One point that should be made clear is that the corrections apply to the output of an algorithm, not to a fixed set of settings, and that the algorithm itself already includes time-series corrections. For instance, it may be that certain covariate scores (including perhaps TIME) will change for every single piece manufactured, so that settings change for every piece. Suppose that for each piece you first calculate a set of settings by the regression method, then calculate time-series corrections to those settings, adjust the settings accordingly, then produce the piece and measure its errors E1, E2, etc. These errors are the values entered into the time-series analysis to estimate the settings for the next piece, even though the settings used were actually adjusted twice since the last piece--once to allow for changes in covariate scores, and once to incorporate the time-series corrections.

REFERENCES

Mitra, Amitava (1993) Fundamentals of quality control and improvement. New York: Macmillan

Ryan, Thomas P. (1989) Statistical methods for quality improvement. New York: Wiley