In this case study, we will fit a multiple linear regression model. This example will be simple, and we will also be able to visualize the estimated surface (most of the time, it is not possible neither important). To do this, we will fit the model, check the assumptions, interpret it, visualize the estimated surface, and predict (and plot) new values based on this fitted model.
For this problem, we will analyse data collected in an observational study in a semiconductor manufacturing plant. Data were retrieved from the Applied Statistics and Probability for Engineers book. You can download the .csv
file here. In this plant, the finished semiconductor is wire-bonded to a frame. The variables reported are pull strength (a measure of the amount of force required to break the bond), the wire length, and the height of the die. Thus, the data set contains the following variables:
pull_strength
(quantitative)wire_length
(quantitative)height
(quantitative)In order to explore the data set and perform initial analyses, we have to read it (with R
) first. Provided that the .csv
file is saved within the datasets/
folder, one can read the file in the following way.
col.names <- c('pull_strength', 'wire_length', 'height')
data <- read.csv(file = 'datasets/wire_bond.csv', header = FALSE, sep = ',', col.names = col.names)
head(data, 5)
## pull_strength wire_length height
## 1 9.95 2 50
## 2 24.45 8 110
## 3 31.75 11 120
## 4 35.00 10 550
## 5 25.02 8 295
Now, let’s see a summary of our data.
summary(data)
## pull_strength wire_length height
## Min. : 9.60 Min. : 1.00 Min. : 50.0
## 1st Qu.:17.08 1st Qu.: 4.00 1st Qu.:200.0
## Median :24.45 Median : 8.00 Median :360.0
## Mean :29.03 Mean : 8.24 Mean :331.8
## 3rd Qu.:37.00 3rd Qu.:11.00 3rd Qu.:500.0
## Max. :69.00 Max. :20.00 Max. :600.0
In this case, there are no missing values—which is great. However, since we want to use most information from this data set, it might not be easy to visualize how one attribute (in our case pull_strength
) can be written as a function of more than two variables at the same time. Fortunately, we can still plot data in 3D (using the scatterplot3d()
function from the scatterplot3d
package).
library('scatterplot3d')
scatterplot3d(data$wire_length, data$height, data$pull_strength, color = 2, pch = 19, angle = 70)
Also, it is useful to see how variables are correlated. To do this, we can plot (at least) two types of graphs; in this case, we will use the scatterplotMatrix()
function from the car
package, and the corrplot.mixed()
function from the corrplot
package to plot the correlation (from cor()
) matrix.
library('car')
scatterplotMatrix(data)
library('corrplot')
## corrplot 0.92 loaded
corrplot.mixed(cor(data))
From the second plot, we can see that pull_strength
is highly correlated with wire_length
, and this relation is detailed in the first plot. These graphs will be even more useful once we have more variables (as we will see in our next problem).
Our very first task will be fitting a model with all variables so that we can try to explain how pull_strength
relates with the other variables. We can do this in the following way.
model <- lm(formula = pull_strength ~ ., data = data)
summary(model)
##
## Call:
## lm(formula = pull_strength ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.865 -1.542 -0.362 1.196 5.841
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.263791 1.060066 2.136 0.044099 *
## wire_length 2.744270 0.093524 29.343 < 2e-16 ***
## height 0.012528 0.002798 4.477 0.000188 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.288 on 22 degrees of freedom
## Multiple R-squared: 0.9811, Adjusted R-squared: 0.9794
## F-statistic: 572.2 on 2 and 22 DF, p-value: < 2.2e-16
As we can see from the summary, for a significance level of \(5\%\), all coefficients are different than zero. Thus, we should keep them (for our next problem, we will try to simplify the model further).
Once we have the model, a next step would be performing a residual analysis. To do this, we can plot the diagnostic graphs and run the appropriate tests.
plot(model, which = c(1, 2))
The plot seems okay, but we still have to do the tests. As a remark, and as extracted from this page, “the red line is a LOWESS fit to your residuals vs fitted plot. Basically, it’s smoothing over the points to look for certain kinds of patterns in the residuals. For example, if you fit a linear regression on data that looked like \(y = x^2\), you’d see a noticeable bowed shape”. Regarding the tests, we have the following
ncvTest(model)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.04524206, Df = 1, p = 0.83156
shapiro.test(resid(model))
##
## Shapiro-Wilk normality test
##
## data: resid(model)
## W = 0.95827, p-value = 0.381
Also, from the test results, we fail to reject the equal variance and normality assumptions, meaning that we have a good model for our data.
In particular, the model is given by
\[\begin{align*} \texttt{pull_strength}_i &= 2.264 + 2.744\texttt{wire_length}_i + 0.013\texttt{height}_i \end{align*}\]
And we can plot the fitted model in the following way
plot3d <- scatterplot3d(data$wire_length, data$height, data$pull_strength, color = 2, pch = 19, angle = 70)
plot3d$plane3d(model)
Now that we have a “good” fitted model, we can predict, the value of pull_strength
for new values of wire_length
and height
. For instance, we can predict the value of pull_strength
, such that wire_length
is equal to 7.5 and 12.5 and height
is equal to 150 and 450, respectively. Notice that your new points must lie within the range for the observed data, otherwise your model may not be appropriate for predicting extrapolated points.
newdata <- data.frame(wire_length = c(7.5, 12.5), height = c(150, 450))
(pred1 <- predict(object = model, newdata = newdata, interval = 'confidence'))
## fit lwr upr
## 1 24.72499 23.33957 26.11040
## 2 42.20468 40.92987 43.47948
(pred2 <- predict(object = model, newdata = newdata, interval = 'prediction'))
## fit lwr upr
## 1 24.72499 19.78175 29.66822
## 2 42.20468 37.29130 47.11805
newdata <- cbind(newdata, pull_strength = pred1[, 1])
data <- rbind(data, newdata)
plot3d <- scatterplot3d(data$wire_length, data$height, data$pull_strength, color = c(rep(2, nrow(data) - 2), 3, 3), pch = 19, angle = 70)
plot3d$plane3d(model)
legend('topleft', c('Observed data', 'New data'), col = c(2, 3), pch = 19)
From the above predicted values, notice that this time, we added the confidence interval and the prediction interval. As explained Applied Statistics and Probability for Engineers book, “the prediction interval is always wider than the confidence interval. The confidence interval expresses the error in estimating the mean of a distribution, and the prediction interval expresses the error in predicting a future observation from the distribution at the point \(\mathbf{x}_0\). This must include the error in estimating the mean at that point as well as the inherent variability in the random variable pull_strength
at the same value \(\mathbf{x} = \mathbf{x}_0\)”.