The solution can be found here.

Introduction

For this tutorial session, we will analyze three (linear regression) problems from top to bottom.

Problem 01

For this problem, we will analyse data about the fuel efficiency, in miles per gallon (mpg), of various cars. The data set was retrieved from this page (with changes). You can download the .csv file here.

col.names <- c('mpg', 'cylinders', 'displacement', 'hp', 'weight', 'acceleration', 'year', 'origin')
car <- read.csv(file = 'others/car.csv', header = FALSE, sep = ',', col.names = col.names)
head(car, 5)
##   mpg cylinders displacement  hp weight acceleration year origin
## 1  18         8          307 130   3504         12.0   70      1
## 2  16         8          304 150   3433         12.0   70      1
## 3  17         8          302 140   3449         10.5   70      1
## 4  NA         8          350 165   4142         11.5   70      1
## 5  NA         8          351 153   4034         11.0   70      1

Explore the data set, fit an appropriate linear model, check the model assumptions, and plot the results. At the end, make predictions for unknown values.
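One possible way to organise this workflow is sketched below. It assumes that the rows with a missing mpg (such as rows 4 and 5 in the preview) are the unknown values to predict, and that all columns are numeric, as the preview suggests. The helper name fit_and_predict_mpg is hypothetical, and the covariates weight and hp are used purely for illustration; the actual choice should come from your own exploration.

```r
# Hypothetical helper: fit a linear model on the rows where mpg is known
# and predict mpg for the rows where it is missing (NA).
fit_and_predict_mpg <- function(car, formula = mpg ~ weight + hp) {
  known   <- car[!is.na(car$mpg), ]
  unknown <- car[is.na(car$mpg), ]
  model   <- lm(formula, data = known)
  # Prediction intervals account for both parameter and noise uncertainty
  pred <- predict(model, newdata = unknown, interval = 'prediction')
  list(model = model, predictions = pred)
}

# Usage with the tutorial file (skipped if the file is not available)
if (file.exists('others/car.csv')) {
  col.names <- c('mpg', 'cylinders', 'displacement', 'hp', 'weight',
                 'acceleration', 'year', 'origin')
  car <- read.csv(file = 'others/car.csv', header = FALSE, sep = ',',
                  col.names = col.names)

  print(summary(car))                      # explore: ranges and NA counts
  pairs(car[, c('mpg', 'weight', 'hp')])   # explore: pairwise scatter plots

  result <- fit_and_predict_mpg(car)
  print(summary(result$model))             # coefficients and R-squared

  par(mfrow = c(2, 2))
  plot(result$model)                       # standard residual diagnostics
  par(mfrow = c(1, 1))

  print(result$predictions)                # fit, lwr, upr for unknown mpg
}
```

If the residual plots show curvature or non-constant variance, a transformation of the response (for example, modelling log(mpg)) may be worth trying before making predictions.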

Problem 02

For this problem, we will analyse data collected in an observational study in a semiconductor manufacturing plant. The data were retrieved from the book Applied Statistics and Probability for Engineers. You can download the .csv file here. In this plant, the finished semiconductor is wire-bonded to a frame. The variables reported are the pull strength (a measure of the amount of force required to break the bond), the wire length, and the height of the die.

col.names <- c('pull_strength', 'wire_length', 'height')
wire <- read.csv(file = 'others/wire_bond.csv', header = FALSE, sep = ',', col.names = col.names)
head(wire, 5)
##   pull_strength wire_length height
## 1          9.95           2     50
## 2         24.45           8    110
## 3         31.75          11    120
## 4         35.00          10    550
## 5         25.02           8    295

Explore the data set, fit an appropriate linear model for the data, check the model assumptions, and plot the fitted plane. At the end, make predictions for unknown values.
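With two covariates, the fitted model is a plane in three dimensions. The sketch below shows one way to evaluate that plane on a grid and draw it with base graphics. The helper name fitted_plane is hypothetical, the model pull_strength ~ wire_length + height is only a natural starting point, and the new observation (wire_length = 8, height = 275) is an illustrative value, not one prescribed by the problem.

```r
# Hypothetical helper: evaluate a fitted two-covariate model on a regular
# grid so that the resulting surface can be drawn with persp().
fitted_plane <- function(model, wire, n = 25) {
  x <- seq(min(wire$wire_length), max(wire$wire_length), length.out = n)
  y <- seq(min(wire$height), max(wire$height), length.out = n)
  grid <- expand.grid(wire_length = x, height = y)  # x varies fastest
  # Column-wise fill gives z[i, j] = prediction at (x[i], y[j])
  z <- matrix(predict(model, newdata = grid), nrow = n, ncol = n)
  list(x = x, y = y, z = z)
}

# Usage with the tutorial file (skipped if the file is not available)
if (file.exists('others/wire_bond.csv')) {
  col.names <- c('pull_strength', 'wire_length', 'height')
  wire <- read.csv(file = 'others/wire_bond.csv', header = FALSE, sep = ',',
                   col.names = col.names)

  model <- lm(pull_strength ~ wire_length + height, data = wire)
  print(summary(model))

  # Assumption checks: residual plots and a normality test on the residuals
  par(mfrow = c(2, 2))
  plot(model)
  par(mfrow = c(1, 1))
  print(shapiro.test(residuals(model)))

  # Fitted plane
  plane <- fitted_plane(model, wire)
  persp(plane$x, plane$y, plane$z, theta = 30, phi = 20,
        xlab = 'wire_length', ylab = 'height', zlab = 'pull_strength')

  # Prediction for an illustrative new observation
  new_obs <- data.frame(wire_length = 8, height = 275)
  print(predict(model, newdata = new_obs, interval = 'prediction'))
}
```

The grid only spans the observed range of each covariate, so the drawn surface does not extrapolate beyond the data.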

Problem 03

For this problem, we will analyse a data set with 7 variables (1 response variable + 6 covariates). Although their meaning is not stated, we will see how important feature selection is when performing multiple regression analysis. You can download the .csv file here.

col.names <- c('var1', 'var2', 'var3', 'var4', 'var5', 'var6', 'response')
data <- read.csv(file = 'others/data.csv', header = FALSE, sep = ',', col.names = col.names)
head(data, 5)
##       var1      var2     var3        var4     var5     var6 response
## 1 68.10730  95.83754 49.66851 0.015061421 2.090953 64.83720 218.5916
## 2 78.18420  97.69040 54.51643 0.042649961 4.320810 74.54103 245.8415
## 3 54.24527 105.20130 49.59829 0.005194938 4.948731 78.74680 264.0839
## 4 54.56271  97.41171 47.21550 0.021132252 5.127075 74.95861 251.3954
## 5 56.75478  95.57443 44.05604 0.027485738 1.801114 63.39468 214.6450

Explore the data set, fit an appropriate linear model for the data, reduced using a feature selection procedure of your choice, check the model assumptions, and plot the results. At the end, make predictions for unknown values.
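As one example of a feature selection procedure, the sketch below runs backward elimination by AIC with step() from the stats package, starting from the model that contains all six covariates. This is only one of several reasonable choices (forward selection or best-subset search are alternatives). The helper name select_features is hypothetical, and the sketch assumes that the unknown values are rows whose response is NA, which may not match the actual file.

```r
# Hypothetical helper: fit the full model on complete rows, then drop
# covariates one at a time for as long as doing so lowers the AIC.
select_features <- function(df, response = 'response') {
  known <- df[complete.cases(df), ]
  full  <- lm(as.formula(paste(response, '~ .')), data = known)
  step(full, direction = 'backward', trace = 0)
}

# Usage with the tutorial file (skipped if the file is not available)
if (file.exists('others/data.csv')) {
  col.names <- c('var1', 'var2', 'var3', 'var4', 'var5', 'var6', 'response')
  data <- read.csv(file = 'others/data.csv', header = FALSE, sep = ',',
                   col.names = col.names)

  print(round(cor(data, use = 'complete.obs'), 2))  # explore: correlations

  reduced <- select_features(data)
  print(summary(reduced))                  # covariates that were kept

  par(mfrow = c(2, 2))
  plot(reduced)                            # residual diagnostics
  par(mfrow = c(1, 1))

  # Predict rows with a missing response, if there are any
  unknown <- data[is.na(data$response), ]
  if (nrow(unknown) > 0) {
    print(predict(reduced, newdata = unknown, interval = 'prediction'))
  }
}
```

Comparing summary(reduced) with the summary of the full model shows how much explanatory power, if any, is lost by dropping the discarded covariates.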