Making Sense of Regression Analysis

Process Excellence Network

Regression analysis is used to investigate and model the relationship between a response variable (Y) and one or more predictors (Xs). In a beverage filling process, for example, we might model the relationship between fill volume (Y) and filler nozzle setting (X1), filler table rotation speed (X2), spring tension (X3) and so on. In a metal rolling process, we might model the relationship between process time (Y) and the difference between exit gauge and entry gauge (X1), width (X2), length (X3), weight (X4) and so on.

To do any type of analysis, the data set needs to be clean. If there are data entry errors or outliers in the data set, the regression analysis (or any other data analysis) may give incorrect results and hence lead to incorrect conclusions (and maybe get you fired!).

The first thing to do with any data set, especially a large one, is to graph it (Y and Xs). Graphs (for example histograms with a normal curve, box plots or time series charts) together with some basic statistics (mean, median, standard deviation, minimum, maximum and inter-quartile range) give a good understanding of the data set and of the process.
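
For readers working outside Minitab, the basic statistics listed above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `basic_summary` and the sample values are hypothetical, not taken from the article's data set.

```python
import statistics

def basic_summary(values):
    """Return the basic statistics named in the text for one variable."""
    # quantiles(n=4) returns the three quartile cut points (Q1, Q2, Q3).
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
        "iqr": q3 - q1,  # inter-quartile range
    }

# Hypothetical process times, for illustration only (not the article's data).
process_time = [4.1, 4.8, 5.0, 5.2, 5.3, 5.6, 5.9, 6.1, 6.4, 12.7]
summary = basic_summary(process_time)
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```

A mean that sits well above the median, as it does for these made-up values, is the kind of signal that should prompt a closer look at the graphs.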

Minitab’s graphical summary gives us this information (Stat->Basic Statistics->Graphical Summary). Below is an example:

After eliminating the outliers and data entry errors in the data, we can proceed with the regression analysis. Please note that we are eliminating the outliers only for the purpose of doing regression analysis. The reasons and causes of the outliers and data entry errors need to be investigated and fixed if possible or at least minimized.
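
The article does not say which rule was used to identify outliers. One common convention, and the one behind the whiskers of a standard box plot, is to flag points more than 1.5 inter-quartile ranges outside the quartiles. The sketch below assumes that rule; `flag_outliers` and the data are hypothetical.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (box-plot convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical process times, for illustration only (not the article's data).
process_time = [4.1, 4.8, 5.0, 5.2, 5.3, 5.6, 5.9, 6.1, 6.4, 12.7]
flagged = flag_outliers(process_time)
print("Points to investigate before deciding whether to exclude them:", flagged)
```

A flagged point is only a candidate: whether it is a data entry error or a genuine process event still has to be investigated.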
 
We will use the example of process time for a metal rolling process to explain the insights of the regression analysis. In a metal rolling process, a coil with a certain entry gauge (thickness) enters the rolling mill and gets reduced in thickness (which is the exit gauge) as it gets rolled between metal rollers.
Let us say that we are trying to find out which factors affect the process time of a metal rolling process. The candidate factors are the difference between exit gauge and entry gauge, width, length and exit weight.
Before we conduct a regression analysis, we can check whether there are any statistically significant correlations between the process time (Y) and the factors (Xs).
Here’s the output from Minitab:
 
Pearson correlation of PROCESS_TIME and (Exit Gauge - Entry Gauge)mm = 0.448
P-Value = 0.000
 
Pearson correlation of PROCESS_TIME and LENGTH = 0.655
P-Value = 0.000
 
Pearson correlation of PROCESS_TIME and WIDTH = 0.087
P-Value = 0.000
 
Pearson correlation of PROCESS_TIME and EXIT_WEIGHT = 0.025
P-Value = 0.036
 
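The p-values show that all four factors are significantly correlated with process time at the 0.05 level. The correlations with width (0.087) and exit weight (0.025) are positive but very weak, however. With a data set this large (6,856 observations, as the total degrees of freedom in the analysis of variance table below show), even a very weak correlation can be statistically significant.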
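
To see why a correlation as small as 0.025 can still be significant, the sketch below applies the usual t statistic for a Pearson correlation, t = r·sqrt((n − 2)/(1 − r²)). Two assumptions are made here: n = 6856 is inferred from the total degrees of freedom (6855) in the ANOVA table, and the p-value uses a normal approximation to the t distribution, which is reasonable only because n is so large. Because the reported correlations are rounded, the results will be close to, not identical with, Minitab's p-values.

```python
import math

def correlation_t_and_p(r, n):
    """t statistic for H0: correlation = 0, with a large-sample two-sided p-value."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    # Normal approximation to the t distribution (assumes large n).
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p

n = 6856  # assumed: total degrees of freedom (6855) + 1
results = {}
for name, r in [("EXIT_WEIGHT", 0.025), ("WIDTH", 0.087)]:
    results[name] = correlation_t_and_p(r, n)
    t, p = results[name]
    print(f"{name}: r = {r}, t = {t:.2f}, approximate p = {p:.3g}")
```

The t statistic grows with the square root of the sample size, so statistical significance here says little about practical importance.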
Now, let’s conduct the regression analysis with the above factors using Minitab.
This is the output we get from Minitab:
The regression equation is
PROCESS_TIME = - 1.10 - 0.677 (Exit Gauge - Entry Gauge)mm + 0.00339 WIDTH
+ 0.00265 LENGTH - 0.000268 EXIT_WEIGHT

Predictor                             Coef     SE Coef       T      P    VIF
Constant                           -1.1039      0.2257   -4.89  0.000
(Exit Gauge - Entry Gauge)mm      -0.67736     0.04395  -15.41  0.000  2.904
WIDTH                            0.0033852   0.0002703   12.53  0.000  4.165
LENGTH                          0.00265431  0.00004694   56.55  0.000  2.864
EXIT_WEIGHT                    -0.00026772  0.00004233   -6.32  0.000  4.220

S = 1.52231 R-Sq = 46.6% R-Sq(adj) = 46.6%
 
PRESS = 15911.8 R-Sq(pred) = 46.49%
 
Analysis of Variance
 
Source              DF       SS      MS        F      P
Regression           4  13858.2  3464.6  1495.01  0.000
Residual Error    6851  15876.6     2.3
  Lack of Fit     6843  15865.7     2.3     1.70  0.211
  Pure Error         8     10.9     1.4
Total             6855  29734.9
 
Durbin-Watson statistic = 1.88873
 

Lack of fit test
Possible curvature in variable (Exit Gauge - Entry Gauge) mm (P-Value = 0.000)
Possible interaction in variable (Exit Gauge - Entry Gauge) mm (P-Value = 0.000)

Possible curvature in variable WIDTH (P-Value = 0.000)
Possible interaction in variable WIDTH (P-Value = 0.026)

Possible curvature in variable LENGTH (P-Value = 0.000)
Possible interaction in variable LENGTH (P-Value = 0.000)

Possible interaction in variable EXIT_WEIGHT (P-Value = 0.001)
 
Possible lack of fit at outer X-values (P-Value = 0.000)
Overall lack of fit test is significant at P = 0.000
 
 
 
Let us try to interpret the output now.
Since the p-value is less than 0.05 for every factor, all of the factors are statistically significant.
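
As a rough illustration of how the fitted equation is used, the sketch below plugs one set of predictor values into the coefficients reported in the Minitab output above. The input values are hypothetical (the article does not state units for width, length or exit weight), and `predict_process_time` is an illustrative name, not a Minitab function.

```python
# Coefficients copied from the four-factor Minitab output.
COEFFICIENTS = {
    "constant": -1.1039,
    "gauge_diff_mm": -0.67736,   # (Exit Gauge - Entry Gauge) in mm
    "width": 0.0033852,
    "length": 0.00265431,
    "exit_weight": -0.00026772,
}

def predict_process_time(gauge_diff_mm, width, length, exit_weight):
    """Evaluate the fitted regression equation for one coil."""
    c = COEFFICIENTS
    return (c["constant"]
            + c["gauge_diff_mm"] * gauge_diff_mm
            + c["width"] * width
            + c["length"] * length
            + c["exit_weight"] * exit_weight)

# Hypothetical coil: the gauge difference is negative because rolling
# reduces the thickness (exit gauge is smaller than entry gauge).
predicted = predict_process_time(gauge_diff_mm=-2.0, width=1200,
                                 length=1500, exit_weight=10000)
print(f"Predicted process time: {predicted:.2f}")
```

Because the coefficient on the gauge difference is negative and the difference itself is negative, a larger thickness reduction raises the predicted process time.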

VIF stands for variance inflation factor. VIF is an indicator of multi-collinearity among the predictors (Xs). Moderate multi-collinearity may not be problematic. However, severe multi-collinearity is problematic because it can increase the variance of the regression coefficients, making them unstable and difficult to interpret. VIF measures how much the variance of an estimated regression coefficient increases if our predictors are correlated.

Use the following guidelines to interpret the VIF:

 

VIF                     Predictors are
VIF = 1                 Not correlated
VIF between 1 and 5     Moderately correlated
VIF above 5, up to 10   Highly correlated
VIF > 10                Multi-collinearity is unduly influencing your regression results

 

If the VIF is greater than 10, we should eliminate the unimportant predictors from our model. Highly correlated predictors can be removed because they supply redundant information, so removing them does not drastically reduce the R². In our case, the VIF is between 1 and 5 for all the factors, which indicates moderate correlation among the predictors. The VIFs for width (4.165) and exit weight (4.220) are higher than those for the difference between exit and entry gauge (2.904) and length (2.864).
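
The VIF of a predictor is conventionally defined as 1/(1 − R²ⱼ), where R²ⱼ is the R² obtained by regressing that predictor on all the other predictors. The sketch below simply inverts that formula for the VIFs reported above, to show how much of each predictor's variation is shared with the others. The function names are illustrative.

```python
def vif_from_r_squared(r_squared_j):
    """VIF for a predictor whose regression on the other predictors has R^2 = r_squared_j."""
    return 1.0 / (1.0 - r_squared_j)

def r_squared_from_vif(vif):
    """Invert the VIF formula: share of a predictor's variance explained by the others."""
    return 1.0 - 1.0 / vif

# VIF values copied from the four-factor Minitab output.
reported_vif = {
    "(Exit Gauge - Entry Gauge)mm": 2.904,
    "WIDTH": 4.165,
    "LENGTH": 2.864,
    "EXIT_WEIGHT": 4.220,
}
shared = {name: r_squared_from_vif(v) for name, v in reported_vif.items()}
for name, r2 in shared.items():
    print(f"{name}: about {r2:.0%} of its variation is explained by the other predictors")
```

Read this way, a VIF of 10 corresponds to 90% of a predictor's variation being explained by the other predictors.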
Standard error of the regression (S): S is measured in the units of the response variable and represents the standard distance that data values fall from the regression line, that is, the standard deviation of the residuals. For a given study, the better the regression equation predicts the response, the lower the value of S. In our case S = 1.52, in the units of process time, which is low.
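
S can be recovered from the analysis of variance table as the square root of the residual mean square. The short check below uses the residual sum of squares and degrees of freedom from the four-factor output. Because the printed sum of squares is rounded, the result should match the reported S = 1.52231 only to about four decimal places.

```python
import math

# Values copied from the "Residual Error" row of the four-factor ANOVA table.
ss_error = 15876.6
df_error = 6851

mse = ss_error / df_error   # residual mean square (printed as 2.3)
s = math.sqrt(mse)          # standard error of the regression
print(f"MSE = {mse:.4f}, S = {s:.4f}")
```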

The R², adjusted R² and predicted R² are all approximately 47%.

  • The R², as we know, indicates how much of the variation in Y is explained by the Xs. So the factors account for only about 47% of the variation in Y.
  • The adjusted R² can be used to compare the explanatory power of models with different numbers of predictors. It increases only if a new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model less than expected by chance.
  • PRESS stands for Predicted Sum of Squares. It is used to assess our model's predictive ability. In general, the smaller the PRESS value, the better the model's predictive ability. PRESS is used to calculate the predicted R², which is generally more intuitive to interpret. Together, these statistics can help prevent over-fitting because they are calculated using observations not included in model estimation.
  • The predicted R² indicates how well the model predicts responses for new observations. It can be more useful than the adjusted R² for comparing models because it is calculated using observations not included in model estimation. Over-fitting refers to models that appear to explain the relationship between the predictor and response variables for the data set used for model calculation but fail to provide valid predictions for new observations. Larger values of predicted R² suggest models of greater predictive ability.
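
The three R² statistics can be reproduced from the numbers in the Minitab output using their standard definitions: R² = SS(regression)/SS(total), adjusted R² = 1 − MS(error)/(SS(total)/DF(total)), and predicted R² = 1 − PRESS/SS(total). The sketch below applies those formulas to the four-factor output and should agree with the reported 46.6%, 46.6% and 46.49% up to rounding.

```python
# Values copied from the four-factor Minitab output.
ss_regression = 13858.2
ss_error = 15876.6
ss_total = 29734.9
press = 15911.8
df_error = 6851
df_total = 6855

r_sq = ss_regression / ss_total
r_sq_adj = 1 - (ss_error / df_error) / (ss_total / df_total)
r_sq_pred = 1 - press / ss_total

print(f"R-Sq       = {r_sq:.1%}")
print(f"R-Sq(adj)  = {r_sq_adj:.1%}")
print(f"R-Sq(pred) = {r_sq_pred:.2%}")
```

PRESS (15911.8) is only slightly larger than the residual sum of squares (15876.6), which is why the predicted R² is so close to the ordinary R² and there is no sign of over-fitting.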
The p-value for regression is less than 0.05 indicating that the regression is statistically significant.
Relative to the size of the coefficient, the standard error is smaller for the difference between exit and entry gauge (0.044) and for length (0.000047) than for the other two factors (width and exit weight), so those two coefficients are estimated more precisely.
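
The T column in the coefficient table is each coefficient divided by its standard error, and the ratio of standard error to coefficient gives the relative precision discussed above. The sketch below recomputes both from the printed values. Small differences from Minitab's T values (for example for WIDTH) are expected because the printed figures are rounded.

```python
# (coefficient, standard error) pairs copied from the four-factor output.
coefficients = {
    "Constant": (-1.1039, 0.2257),
    "(Exit Gauge - Entry Gauge)mm": (-0.67736, 0.04395),
    "WIDTH": (0.0033852, 0.0002703),
    "LENGTH": (0.00265431, 0.00004694),
    "EXIT_WEIGHT": (-0.00026772, 0.00004233),
}

t_values = {}
relative_se = {}
for name, (coef, se) in coefficients.items():
    t_values[name] = coef / se          # t statistic for H0: coefficient = 0
    relative_se[name] = se / abs(coef)  # standard error as a share of the coefficient
    print(f"{name}: T = {t_values[name]:.2f}, SE/|Coef| = {relative_se[name]:.1%}")
```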
Graphical output:
In Minitab, we can choose a four-in-one plot to evaluate the residuals visually.
Histogram of residuals: Long tails in the plot may indicate skewness in the data. If one or two bars are far from the others, those points may be outliers. In our example, there seem to be no outliers or skewness in the data.
 
Normal probability plot of residuals: The points in this plot should generally form a straight line if the residuals are normally distributed. If the points on the plot depart from a straight line, the normality assumption may be invalid. In our example, the p-value for the Anderson-Darling statistic indicates that the residuals do not follow a normal distribution.
Residuals versus fits: This plot should show a random pattern of residuals on both sides of 0. If a point lies far from the majority of points, it may be an outlier. Also, there should not be any recognizable patterns in the residual plot. The following may indicate error that is not random:
  • a series of increasing or decreasing points
  • a predominance of positive residuals, or a predominance of negative residuals
  • patterns, such as increasing residuals with increasing fits
In our example, this plot indicates that there are some outliers.

Residuals versus order: This is a plot of all residuals in the order that the data was collected and can be used to find non-random error, especially of time-related effects. A positive correlation is indicated by a clustering of residuals with the same sign. A negative correlation is indicated by rapid changes in the signs of consecutive residuals. In our example, it is not indicating any such pattern.

Now let’s see what happens when we remove the two factors width and exit weight.
 
The regression equation is
PROCESS_TIME = 1.71 - 0.666 (Exit Gauge - Entry Gauge) mm + 0.00263 LENGTH

Predictor                             Coef     SE Coef       T      P    VIF
Constant                            1.7112      0.1276   13.42  0.000
(Exit Gauge - Entry Gauge)mm      -0.66569     0.04438  -15.00  0.000  2.859
LENGTH                          0.00263472  0.00004772   55.21  0.000  2.859

S = 1.54894 R-Sq = 44.7% R-Sq(adj) = 44.7%
 
PRESS = 16466.8 R-Sq(pred) = 44.62%

Analysis of Variance
 
Source              DF       SS      MS        F      P
Regression           2  13293.0  6646.5  2770.27  0.000
Residual Error    6853  16441.9     2.4
  Lack of Fit     5694  14858.5     2.6     1.91  0.000
  Pure Error      1159   1583.4     1.4
Total             6855  29734.9
 
Durbin-Watson statistic = 1.82144
 
Lack of fit test
Possible curvature in variable (Exit Gauge - Entry Gauge)mm (P-Value = 0.000)
Possible interaction in variable (Exit Gauge - Entry Gauge)mm (P-Value = 0.000)

Possible curvature in variable LENGTH (P-Value = 0.000)
Possible interaction in variable LENGTH (P-Value = 0.000)
 
Possible lack of fit at outer X-values (P-Value = 0.000)
Overall lack of fit test is significant at P = 0.000

• The R², adjusted R² and predicted R² are all approximately 45%, only about two percentage points lower than with four factors. This means that the two removed factors (exit weight and width) contributed little to explaining the process time, and the regression model with the remaining two factors is nearly as good as the original four-factor model. Thus we have reduced our model to two significant factors.
• Lack of fit has become significant (the p-value is now less than 0.05).
Lack-of-fit tests: The lack-of-fit tests assess the fit of our model. If the p-value is less than our selected α-level, evidence exists that our model does not accurately fit the data. We may need to add terms or transform our data to model the data more accurately. Minitab calculates two types of lack-of-fit tests:
Pure error lack-of-fit test: We can use this test if our data contain replicates (multiple observations with identical x-values) and we are reducing our model. Replicates represent "pure error" because only random variation can cause differences between the observed response values. If we are reducing our model and the resulting p-value for lack of fit is less than our selected α-level, then we should retain the term we removed from the model.
In our example, the p-value for the pure error lack-of-fit test became significant (0.000, below the 0.05 α-level, compared with 0.211 for the four-factor model) when we removed the two factors (exit weight and width).
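
The F statistic for the pure error lack-of-fit test is the lack-of-fit mean square divided by the pure error mean square. The sketch below recomputes it from the "Lack of Fit" and "Pure Error" rows of both analysis of variance tables. It reproduces the F values only; the p-values would additionally need an F distribution, which is left to the statistics package.

```python
def lack_of_fit_f(ss_lack_of_fit, df_lack_of_fit, ss_pure_error, df_pure_error):
    """F = MS(lack of fit) / MS(pure error)."""
    ms_lack_of_fit = ss_lack_of_fit / df_lack_of_fit
    ms_pure_error = ss_pure_error / df_pure_error
    return ms_lack_of_fit / ms_pure_error

# Rows copied from the two ANOVA tables in the Minitab output.
f_four_factor = lack_of_fit_f(15865.7, 6843, 10.9, 8)
f_two_factor = lack_of_fit_f(14858.5, 5694, 1583.4, 1159)

print(f"Four-factor model: F = {f_four_factor:.2f}")
print(f"Two-factor model:  F = {f_two_factor:.2f}")
```

Note how the pure error degrees of freedom rise from 8 to 1159 when two predictors are dropped: with fewer x-variables, many more rows count as replicates, so the test has far more power to detect lack of fit.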
Data subsetting lack-of-fit test: We can use this test if our data do not contain replicates and we want to determine whether we are accurately modeling the curvature. This method identifies curvature in the data and interactions among predictors that may affect the model fit. Whenever the data subsetting p-value is less than the α-level, Minitab displays a message such as "Possible curvature in variable X (P-Value = 0.006)." This is evidence that the curvature is not adequately modeled. After examining the raw data in a scatterplot, we might try including a higher-order term to model the curvature.

In our example, the regression model is not modeling the curvature properly.
Durbin-Watson statistic: This statistic tests for the presence of autocorrelation in residuals. Autocorrelation means that adjacent observations are correlated. If they are correlated, then least squares regression underestimates the standard error of the coefficients; our predictors may appear to be significant when they may not be.
For example, the residuals from a regression on daily currency price data may depend on the preceding observation because one day's currency price affects the next day's price.
The Durbin-Watson statistic is conditioned on the order of the observations (rows). Minitab assumes that the observations are in a meaningful order, such as time order. The Durbin-Watson statistic determines whether or not the correlation between adjacent error terms is zero. It ranges in value from 0 to 4. A value near 2 indicates no autocorrelation; a value toward 0 indicates positive autocorrelation; a value toward 4 indicates negative autocorrelation.
In our case, the Durbin-Watson statistic is close to 2 for both models (1.89 and 1.82), so there is no sign of autocorrelation in the residuals.
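
The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below implements that definition and applies it to two short, made-up residual sequences to show how the statistic moves toward 0 or 4. The residual values are hypothetical, not the article's.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Hypothetical residuals in time order, for illustration only.
clustered_signs = [1, 1, 1, -1, -1, -1]   # runs of the same sign
alternating_signs = [1, -1, 1, -1]        # sign flips at every step

dw_clustered = durbin_watson(clustered_signs)
dw_alternating = durbin_watson(alternating_signs)
print(f"Clustered signs:   DW = {dw_clustered:.2f} (toward 0, positive autocorrelation)")
print(f"Alternating signs: DW = {dw_alternating:.2f} (toward 4, negative autocorrelation)")
```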
Conclusion:
Regression analysis is a very good tool for modeling the relationship between predictors (Xs) and the response (Y). While building a regression model, we have to watch for high multi-collinearity among predictors, autocorrelation among residuals, violations of the regression assumptions and unmodeled curvature in the predictors, because any of these could lead to incorrect conclusions.
If we determine that our model does not fit well, we should (i) check whether our data are entered correctly, especially observations identified as unusual, and (ii) try to determine the cause of the problem. We may also want to see how sensitive our model is to the issue. For example, if we have an outlier, we can run the regression without that observation and see how the results differ.
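
The sensitivity check described above can be sketched with a one-predictor least squares fit run twice, once with and once without a suspect observation. Everything in this example is hypothetical: the data are made up, and `fit_line` is a plain textbook least squares routine, not Minitab's procedure.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data; the last observation is the suspected outlier.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 8.0, 9.9, 30.0]

intercept_all, slope_all = fit_line(xs, ys)
intercept_trimmed, slope_trimmed = fit_line(xs[:-1], ys[:-1])

print(f"With the outlier:    slope = {slope_all:.2f}")
print(f"Without the outlier: slope = {slope_trimmed:.2f}")
```

If the coefficients change substantially, as they do for these made-up values, the conclusions depend heavily on that one observation and its cause deserves investigation before the model is used.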