In this case study, we will fit a simple linear regression model. Later, we will also include one categorical variable in the model and check for interaction. The objective is to transform the data (if necessary), fit the model, analyse it, and write the fitted equations for the two studied groups. To do this, we will fit the model, check the assumptions, re-fit it (if necessary), interpret it, visualize the estimated curves, and predict new values based on the fitted model for the different classes of the categorical variable.

For this problem, we will analyse data about the mileage-per-gallon performance of various cars. The data set was retrieved from this page (with changes). You can download the `.csv` file here.

This data set contains the following variables:

- `mpg` (continuous)
- `cylinders` (multi-valued discrete)
- `displacement` (continuous)
- `horsepower` (continuous)
- `weight` (continuous)
- `acceleration` (continuous)
- `model year` (multi-valued discrete)
- `origin` (multi-valued discrete)

In order to explore the data set and perform initial analyses, we first have to read it into `R`. Provided that the `.csv` file is saved within the `datasets/` folder, one can read the file in the following way.

```
col.names <- c('mpg', 'cylinders', 'displacement', 'hp', 'weight', 'acceleration', 'year', 'origin')
car <- read.csv(file = 'datasets/car.csv', header = FALSE, sep = ',', col.names = col.names)
head(car, 5)
```

```
## mpg cylinders displacement hp weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 16 8 304 150 3433 12.0 70 1
## 3 17 8 302 140 3449 10.5 70 1
## 4 NA 8 350 165 4142 11.5 70 1
## 5 NA 8 351 153 4034 11.0 70 1
```

Now, let’s see a summary of our data.

`summary(car)`

```
## mpg cylinders displacement hp weight
## Min. :12.00 Min. :3.000 Min. : 68.0 Min. : 48.0 Min. :1613
## 1st Qu.:18.00 1st Qu.:4.000 1st Qu.:107.0 1st Qu.: 77.5 1st Qu.:2237
## Median :23.00 Median :4.000 Median :148.5 Median : 92.0 Median :2804
## Mean :23.74 Mean :5.372 Mean :187.9 Mean :100.9 Mean :2936
## 3rd Qu.:29.80 3rd Qu.:6.000 3rd Qu.:250.0 3rd Qu.:113.5 3rd Qu.:3456
## Max. :39.00 Max. :8.000 Max. :400.0 Max. :190.0 Max. :4746
## NA's :6 NA's :2
## acceleration year origin
## Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:14.00 1st Qu.:73.00 1st Qu.:1.000
## Median :15.50 Median :76.00 Median :1.000
## Mean :15.58 Mean :76.36 Mean :1.558
## 3rd Qu.:17.00 3rd Qu.:79.75 3rd Qu.:2.000
## Max. :23.50 Max. :82.00 Max. :3.000
##
```

As one can see from the above table, some multi-valued discrete attributes are being interpreted as integer values; also, we have `NA`'s for the `mpg` and `horsepower` (`hp`) attributes. To verify (and change) the variable types, we can do the following.

```
car$cylinders <- as.factor(car$cylinders)
car$year <- as.factor(car$year)
car$origin <- as.factor(car$origin)
```

Also, as there are too many classes for `year`, and as a way to make our analyses simpler, let's categorize the cars into `old` and `new`, such that cars from before `77` will be labeled as `1` and the remaining cars will be labeled as `2`.

```
car$year <- as.factor(sapply(X = car$year, FUN = function (item) { ifelse(item %in% 70:76, 1, 2) }))
summary(car)
```

```
## mpg cylinders displacement hp weight
## Min. :12.00 3: 3 Min. : 68.0 Min. : 48.0 Min. :1613
## 1st Qu.:18.00 4:134 1st Qu.:107.0 1st Qu.: 77.5 1st Qu.:2237
## Median :23.00 5: 1 Median :148.5 Median : 92.0 Median :2804
## Mean :23.74 6: 62 Mean :187.9 Mean :100.9 Mean :2936
## 3rd Qu.:29.80 8: 58 3rd Qu.:250.0 3rd Qu.:113.5 3rd Qu.:3456
## Max. :39.00 Max. :400.0 Max. :190.0 Max. :4746
## NA's :6 NA's :2
## acceleration year origin
## Min. : 8.00 1:132 1:167
## 1st Qu.:14.00 2:126 2: 38
## Median :15.50 3: 53
## Mean :15.58
## 3rd Qu.:17.00
## Max. :23.50
##
```

Now, let's deal with the missing values. Different approaches could be taken here, and they highly depend on your problem (and your knowledge about the problem). For this particular example, suppose that we want to describe the `mpg` data as a function of `hp` and `year`. Since we do not know much about this data, a simple option is to exclude the instances with missing values for `hp`. Let's do this.

```
car2 <- car[!is.na(car$hp), c('mpg', 'hp', 'year')]
summary(car2)
```

```
## mpg hp year
## Min. :12.00 Min. : 48.0 1:131
## 1st Qu.:18.00 1st Qu.: 77.5 2:125
## Median :23.00 Median : 92.0
## Mean :23.75 Mean :100.9
## 3rd Qu.:29.80 3rd Qu.:113.5
## Max. :39.00 Max. :190.0
## NA's :6
```

Given this smaller data set, our goal might be to predict the missing values for `mpg`. However, to do this, we have to have a data set with no `NA`'s. Let's name it `car3`.

`car3 <- car2[!is.na(car2$mpg), ]`

As a last exploration step, let’s plot our data set.

```
plot(mpg ~ hp, pch = 19, col = (as.numeric(year) + 1), data = car3)
legend('topright', c('old', 'new'), col = unique(as.numeric(car3$year) + 1), pch = 19)
```

From the previous plot, although we suspect that a linear model might not be appropriate for this data set as it is, let’s fit it and analyse the results.

In particular, we will fit the following model

\[ \texttt{mpg}_i = \beta_0 + \texttt{hp}_i \cdot \beta_1 + \epsilon_i; \text{ such that } \epsilon_i \overset{\text{i.i.d.}}{\sim} \text{Normal}(0, \sigma^2_{\epsilon}) \]

```
model <- lm(formula = mpg ~ hp, data = car3)
summary(model)
```

```
##
## Call:
## lm(formula = mpg ~ hp, data = car3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3908 -3.3377 -0.0706 2.7713 11.7533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.543005 0.941244 45.20 <2e-16 ***
## hp -0.188003 0.009001 -20.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.362 on 248 degrees of freedom
## Multiple R-squared: 0.6376, Adjusted R-squared: 0.6361
## F-statistic: 436.2 on 1 and 248 DF, p-value: < 2.2e-16
```

From the above summary, we have strong evidence that both \(\beta_0\) and \(\beta_1\) are different from 0. The residuals do not seem symmetric, though. Also, \(\text{R}^2 =\) 0.6376. Now, let's plot the fitted model.

```
plot(mpg ~ hp, pch = 19, col = (as.numeric(year) + 1), data = car3)
abline(model)
legend('topright', c('old', 'new'), col = unique(as.numeric(car3$year) + 1), pch = 19)
```

However, notice that the relation between `hp` and `mpg` does not seem to be linear, and `year` might also provide information when describing the response variable. Thus, before drawing any conclusions from the fitted model, let's do an analysis of residuals. We will focus on the "Residuals vs Fitted" and "Normal Q-Q" plots.

`plot(model, which = c(1, 2))`

From the "Residuals vs Fitted" plot, we may see that a linear relationship does not correctly describe how `mpg` is written as a function of `hp`, since we can see a pattern in the residuals (as opposed to a "well spread and random" cloud of points around \(y = 0\)). Also, from the "Normal Q-Q" plot, the residuals seem to be normally distributed (we will test this).

To confirm these visual analyses, let's conduct proper tests. To check the assumption of equal variance, since we have a quantitative regressor, we can use the Score Test, available in the `car` package through the `ncvTest()` function. Also, to check the normality of the residuals, we will use the Shapiro-Wilk test (`shapiro.test()`).

```
library('car')
ncvTest(model)
```

```
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 6.945645, Df = 1, p = 0.0084024
```

`shapiro.test(resid(model))`

```
##
## Shapiro-Wilk normality test
##
## data: resid(model)
## W = 0.98926, p-value = 0.06041
```

As the p-value is very small for the first test, we have strong evidence against equal variance. On the other hand, we fail to reject the hypothesis of normally distributed residuals (with a significance level of \(5\%\)). Thus, as at least one assumption for this model does not hold, the results might not be reliable.

As a way to overcome this issue, we will transform the data according to the following rule

\[ w(\lambda) = \begin{cases} (y^{\lambda} - 1)/\lambda &, \text{ if } \lambda \neq 0 \\ \log(y) &, \text{ if } \lambda = 0. \end{cases} \]

This can be achieved by using the `boxCox()` function from the `car` package. Based on it, we will retrieve the value of `lambda` and apply the above transformation.

`bc <- boxCox(model)`

`(lambda <- bc$x[which.max(bc$y)])`

`## [1] -0.5454545`

Now, let's create a function to transform our data based on the value of \(\lambda\) and the above rule. We will name it `transfData()`. Also, we will create a function to transform our data back to the original scale. We will name it `transfData_back()`. This second function, if \(\lambda \neq 0\), is given by \(y(\lambda) = (w\lambda + 1)^{1/\lambda}\).

```
transfData <- function (data, lambda) { ((data ^ lambda - 1) / lambda) }
transfData_back <- function (data, lambda) { ((data * lambda + 1) ^ (1 / lambda)) }
```
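As a quick sanity check, we can verify that `transfData_back()` undoes `transfData()` (here with an illustrative \(\lambda = -0.5\)):

```r
# Box-Cox transformation and its inverse (same definitions as above)
transfData <- function (data, lambda) { ((data ^ lambda - 1) / lambda) }
transfData_back <- function (data, lambda) { ((data * lambda + 1) ^ (1 / lambda)) }

# Round trip: back-transforming the transformed values recovers the input
y <- c(12, 18, 23, 39)
w <- transfData(y, lambda = -0.5)
all.equal(transfData_back(w, lambda = -0.5), y)  # TRUE
```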

Therefore, we can easily transform our data, given \(\lambda =\) -0.55, in the following way

```
car3$mpg_transf <- sapply(X = car3$mpg, FUN = transfData, lambda = lambda)
head(car3, 5)
```

```
## mpg hp year mpg_transf
## 1 18 130 1 1.454413
## 2 16 150 1 1.429271
## 3 17 140 1 1.442414
## 8 14 160 1 1.398742
## 9 24 95 1 1.509442
```

```
plot(mpg_transf ~ hp, pch = 19, col = (as.numeric(year) + 1), data = car3)
legend('topright', c('old', 'new'), col = unique(as.numeric(car3$year) + 1), pch = 19)
```

Finally, we can fit again a linear model for the transformed data.

```
model2 <- lm(formula = mpg_transf ~ hp, data = car3)
summary(model2)
```

```
##
## Call:
## lm(formula = mpg_transf ~ hp, data = car3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.079235 -0.021397 0.001967 0.020464 0.074192
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.656e+00 6.607e-03 250.69 <2e-16 ***
## hp -1.622e-03 6.318e-05 -25.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03062 on 248 degrees of freedom
## Multiple R-squared: 0.7267, Adjusted R-squared: 0.7256
## F-statistic: 659.3 on 1 and 248 DF, p-value: < 2.2e-16
```

```
plot(mpg_transf ~ hp, pch = 19, col = (as.numeric(year) + 1), data = car3)
abline(model2)
legend('topright', c('old', 'new'), col = unique(as.numeric(car3$year) + 1), pch = 19)
```

Also, we can analyse the diagnostic plots, as before, as well as conduct the appropriate tests.

`plot(model2, which = c(1, 2))`

`ncvTest(model2)`

```
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1.087962, Df = 1, p = 0.29692
```

`shapiro.test(resid(model2))`

```
##
## Shapiro-Wilk normality test
##
## data: resid(model2)
## W = 0.98934, p-value = 0.06252
```

As we would expect, the results look much better now. However, we can still use information about `year`, which seems to play a role in explaining the response variable. That being said, let's fit this new model.

Notice that we will consider a model with interaction (*an interaction occurs when an independent variable has a different effect on the outcome depending on the values of another independent variable*). For an extensive discussion of this topic, one can refer to this link.
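As a side note on `R`'s formula syntax, `hp * year` is shorthand for `hp + year + hp:year`, i.e., both main effects plus the interaction term. A toy illustration with simulated data (the variable names `x`, `g`, and `y` here are made up for the example):

```r
# Toy data: one numeric regressor and one two-level factor
set.seed(1)
toy <- data.frame(x = rnorm(20), g = factor(rep(1:2, each = 10)))
toy$y <- 1 + 2 * toy$x + 0.5 * (toy$g == '2') + rnorm(20, sd = 0.1)

# 'x * g' expands to 'x + g + x:g', so both calls fit the same model
m1 <- lm(y ~ x * g, data = toy)
m2 <- lm(y ~ x + g + x:g, data = toy)
all.equal(coef(m1), coef(m2))  # TRUE
```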

```
model3 <- lm(formula = mpg_transf ~ hp * year, data = car3)
summary(model3)
```

```
##
## Call:
## lm(formula = mpg_transf ~ hp * year, data = car3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.061671 -0.017084 0.001801 0.018128 0.065467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.620e+00 7.767e-03 208.563 < 2e-16 ***
## hp -1.435e-03 6.886e-05 -20.836 < 2e-16 ***
## year2 4.570e-02 1.159e-02 3.945 0.000104 ***
## hp:year2 -1.118e-04 1.133e-04 -0.986 0.325015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02561 on 246 degrees of freedom
## Multiple R-squared: 0.8103, Adjusted R-squared: 0.808
## F-statistic: 350.3 on 3 and 246 DF, p-value: < 2.2e-16
```

From the above table, we can see that the interaction (`hp:year2`) is not significant; therefore, we can fit a simpler model.

```
model4 <- lm(formula = mpg_transf ~ hp + year, data = car3)
coeffi <- model4$coefficients
summary(model4)
```

```
##
## Call:
## lm(formula = mpg_transf ~ hp + year, data = car3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.061994 -0.017143 0.001694 0.017929 0.065681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.624e+00 6.323e-03 256.91 <2e-16 ***
## hp -1.476e-03 5.469e-05 -26.99 <2e-16 ***
## year2 3.477e-02 3.353e-03 10.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02561 on 247 degrees of freedom
## Multiple R-squared: 0.8096, Adjusted R-squared: 0.808
## F-statistic: 525 on 2 and 247 DF, p-value: < 2.2e-16
```

```
plot(mpg_transf ~ hp, pch = 19, col = (as.numeric(year) + 1), data = car3)
abline(coeffi[[1]], coeffi[[2]], col = 2)
abline(coeffi[[1]] + coeffi[[3]], coeffi[[2]], col = 3)
legend('topright', c('old', 'new'), col = unique(as.numeric(car3$year) + 1), pch = 19)
```

Again, we can analyse the diagnostic plots and conduct the appropriate tests.

`plot(model4, which = c(1, 2))`

`ncvTest(model4)`

```
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2.838162, Df = 1, p = 0.092049
```

`shapiro.test(resid(model4))`

```
##
## Shapiro-Wilk normality test
##
## data: resid(model4)
## W = 0.99006, p-value = 0.08516
```

Thus, for a significance level of \(5\%\), we fail to reject the hypotheses of equal variance and normality, meaning that this might be an appropriate model for our data. However, recall that we are modelling a transformed data set. We can get a model for our original data as follows. For the transformation \(f\) (with inverse \(f^{-1}\)), we have

\[\begin{align*} \texttt{mpg}_i &= f^{-1}(1.624 - 0.0015 \cdot \texttt{hp}_i)&, \text{ if } \texttt{year} = 1 \\ \texttt{mpg}_i &= f^{-1}((1.624 + 0.035) - 0.0015 \cdot \texttt{hp}_i)&, \text{ if } \texttt{year} = 2 \end{align*}\]
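For instance, a hand check using the rounded coefficients from the `model4` summary: for an old car (`year = 1`) with \(\texttt{hp} = 100\), back-transforming with \(\lambda \approx -0.545\) gives the prediction on the original scale.

```r
# Inverse Box-Cox transformation (as defined earlier)
transfData_back <- function (data, lambda) { ((data * lambda + 1) ^ (1 / lambda)) }
lambda <- -0.5454545  # value retrieved from boxCox()

# Fitted value on the transformed scale for hp = 100, year = 1
# (coefficients taken from the summary of model4)
w_hat <- 1.624 + (-0.001476) * 100

# Back to the original mpg scale: roughly 20 mpg
transfData_back(w_hat, lambda = lambda)
```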

And we can plot it in the following way

```
plot(mpg ~ hp, pch = 19, col = (as.numeric(year) + 1), data = car3)
curve(transfData_back(coeffi[[1]] + coeffi[[2]] * x, lambda = lambda), from = 0, to = 250, add = TRUE, col = 2)
curve(transfData_back((coeffi[[1]] + coeffi[[3]]) + coeffi[[2]] * x, lambda = lambda), from = 0, to = 250, add = TRUE, col = 3)
legend('topright', c('old', 'new'), col = unique(as.numeric(car3$year) + 1), pch = 19)
```

Now that we have a "good" fitted model, we can predict, as suggested before, the values of `mpg` for which we had `NA`'s. We can do this in the following way.

```
pos_unk <- which(is.na(car2$mpg))
unknown <- car2[is.na(car2$mpg), ]
(predicted_values <- sapply(X = predict(object = model4, newdata = data.frame(hp = unknown$hp, year = unknown$year)), FUN = transfData_back, lambda = lambda))
```

```
## 1 2 3 4 5 6
## 13.00153 13.98905 12.25847 12.25847 31.38660 22.37227
```

```
car2[is.na(car2$mpg), 'mpg'] <- predicted_values
pch <- rep(19, nrow(car2)); pch[pos_unk] <- 9
plot(mpg ~ hp, pch = pch, col = (as.numeric(year) + 1), data = car2)
curve(transfData_back(coeffi[[1]] + coeffi[[2]] * x, lambda = lambda), from = 0, to = 250, add = TRUE, col = 2)
curve(transfData_back((coeffi[[1]] + coeffi[[3]]) + coeffi[[2]] * x, lambda = lambda), from = 0, to = 250, add = TRUE, col = 3)
legend('topright', c('old', 'new'), col = unique(as.numeric(car3$year) + 1), pch = 19)
```