MATH3091: Computer Lab 08

Analysis of accident.csv data

Load the data in accident.csv into a variable accident.

accident <- read.csv("../datasets/accident.csv") # Change the path, if needed
head(accident)
  number volume        road      time
1      4   1399        Mill   morning
2     20   2276        Mill    midday
3      4   1417        Mill afternoon
4     11   2206 Trumpington   morning
5      9   3276 Trumpington    midday
6      4   1999 Trumpington afternoon

The data in accident.csv concerns the number of road accidents (number) and the volume of traffic (volume), on each of two roads in Cambridge (road), at various times of day (time, taking values morning, midday or afternoon).

We would like to answer questions like:

  1. Is Mill Road more dangerous than Trumpington Road?
  2. How does time of day affect the rate of road accidents?

Let \(Y_{ij}\) denote the number of accidents and \(\mu_{ij}\) denote the expected number of accidents for the \(i\)th road at the \(j\)th time. We assume

\[ Y_{ij} \sim \text{Poisson}(\mu_{ij}), \quad i=1, 2 \text{ for road}, \quad j=1, 2, 3 \text{ for time}. \]

We might reasonably expect the number of accidents to depend on the traffic volume, \(v_{ij}\). Because the volumes are very large, it is better to work with log volume rather than the volume itself.
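To see the difference in scale, it can help to compare the raw volumes with their logarithms; a quick optional check (not part of the lab itself) is:

summary(accident$volume)
summary(log(accident$volume))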

We use a Poisson GLM with the canonical log link:

\[ \log (\mu_{ij}) = \mu + \alpha_i + \beta_j + \gamma \log v_{ij}. \] This is equivalent to assuming:

\[ \mu_{ij} = \text{constant} \times \text{$i$-th road effect} \times \text{$j$-th time effect} \times \text{volume}^{\gamma}. \]

By default, R will set \(\alpha_1=0\) and \(\beta_1=0\). Thus \(\alpha_2=0\) if the roads are equally risky, \(\beta_2\) represents the difference between time 2 and time 1, and \(\beta_3\) represents the difference between time 3 and time 1.
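If you want to see which levels R has chosen as the baselines, one optional check (not part of the lab itself) is to inspect the design matrix R builds for this model; with the default alphabetical ordering, road Mill and time afternoon are the baselines.

# Design matrix for the model below; the baseline level of each factor
# has no dummy column of its own.
model.matrix(~ road + time + log(volume), data = accident)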

We can fit this model in R as follows.

# Fit a Poisson log-linear model to the data:
acc_glm <- glm(number ~ road + time + log(volume), family = poisson, data = accident)
summary(acc_glm)

Call:
glm(formula = number ~ road + time + log(volume), family = poisson, 
    data = accident)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)  
(Intercept)     -110.15784   49.89214  -2.208   0.0272 *
roadTrumpington   -6.12327    2.67167  -2.292   0.0219 *
timemidday        -6.12346    3.32020  -1.844   0.0651 .
timemorning       -0.04857    0.56769  -0.086   0.9318  
log(volume)       15.41720    6.88756   2.238   0.0252 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 20.8177  on 5  degrees of freedom
Residual deviance:  1.8823  on 1  degrees of freedom
AIC: 34.826

Number of Fisher Scoring iterations: 4
  • What do you conclude about how the accident rate depends on the road, the time of day and the volume of traffic?

  • Does the scaled deviance indicate a problem with the fit of the model? (One way to check this is sketched after these questions.)

  • Conduct hypothesis tests to check whether you could drop any terms from the model.
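One way to check the overall fit, as asked in the second question above, is to compare the residual deviance with its asymptotic \(\chi^2\) distribution on the residual degrees of freedom. A small sketch (bearing in mind that with so few cells the approximation is crude):

# Residual deviance and degrees of freedom of the fitted model:
deviance(acc_glm)
df.residual(acc_glm)
# Approximate goodness-of-fit p-value:
pchisq(deviance(acc_glm), df = df.residual(acc_glm), lower.tail = FALSE)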

# Consider model without `volume`.
acc_glm_rt <- glm(number ~ road + time, family = poisson, data = accident)
anova(acc_glm_rt, acc_glm, test="LRT")
Analysis of Deviance Table

Model 1: number ~ road + time
Model 2: number ~ road + time + log(volume)
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1         2     7.3677                       
2         1     1.8823  1   5.4854  0.01918 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# We prefer the model with `volume`.

# Consider model without `time`.
acc_glm_rv <- glm(number ~ road + log(volume), family = poisson, data = accident)
anova(acc_glm_rv, acc_glm, test="LRT")
Analysis of Deviance Table

Model 1: number ~ road + log(volume)
Model 2: number ~ road + time + log(volume)
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1         3     7.5833                       
2         1     1.8823  2    5.701  0.05782 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# It is borderline whether or not to include `time`.
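
# Since this LRT is borderline, AIC gives another informal comparison
# (an optional extra, not part of the original lab):
AIC(acc_glm_rv, acc_glm)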

# Consider model without `road`.
acc_glm_tv <- glm(number ~ time + log(volume), family = poisson, data = accident)
anova(acc_glm_tv, acc_glm, test="LRT")
Analysis of Deviance Table

Model 1: number ~ time + log(volume)
Model 2: number ~ road + time + log(volume)
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1         2     7.5909                       
2         1     1.8823  1   5.7087  0.01688 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# We prefer the model with `road`.
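
# The three single-term deletions above can also be run in one call with
# drop1(); its LRT results should reproduce the pairwise comparisons:
drop1(acc_glm, test = "LRT")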

Analysis of hodgkins.csv data

The data in hodgkins.csv concerns 538 patients with Hodgkin’s disease, who have been cross-classified according to two factors: type, the histological type of their disease (4 levels), and rtreat, their response to treatment (3 levels).

Load the data in hodgkins.csv into a variable hodgkins.

# Analysis of `hodgkins.csv` data
hodgkins <- read.csv("../datasets/hodgkins.csv") # Change the path, if needed
head(hodgkins)
  count type   rtreat
1    74   Lp positive
2    18   Lp  partial
3    12   Lp     none
4    68   Ns positive
5    16   Ns  partial
6    12   Ns     none

We could display this data in a contingency table.

# Display the data in contingency table form:
xtabs(count ~ type + rtreat, data = hodgkins)
    rtreat
type none partial positive
  Ld   44      10       18
  Lp   12      18       74
  Mc   58      54      154
  Ns   12      16       68

Cross-classification data are extremely common, and can be modelled very effectively using generalised linear models. The observations of the response variable are taken to be the counts (in this case, the 12 cell counts of patients), and a generalised linear model is used to determine how the expected counts depend on any explanatory variables (in this case, the factors type and rtreat). The dataset as presented in the original hodgkins data frame is already in the correct format to fit such a generalised linear model.

Counts are non-negative integers, so one approach is to treat them as observations of Poisson random variables. The canonical link function is then the \(\log\) function, and Poisson generalised linear models with the log link are called log-linear models.

A possible log-linear model for this data set is \[ Y_i\sim\text{Poisson}(\mu_i),\quad \log\mu_i=\alpha+\beta_T(t_i)+\beta_R(r_i), \qquad i=1,\ldots,12, \] where \(Y_i\) is the \(i\)-th count and \(t_i\) and \(r_i\) are the corresponding levels of type and rtreat.

We can fit this model in R.

# Fit a log-linear model with main effects only.
hod_main <- glm(count ~ type + rtreat, family = poisson, data = hodgkins)

This model only contains the “main effects” type and rtreat. Alternatively, we could fit a model with an interaction between these two main effects, which is the saturated model.

# Fit a log-linear model with interaction between main effects.
hod_int <- glm(count ~ type * rtreat, family = poisson, data = hodgkins)

We can find the estimated cell counts under each of the models.

# Find estimated cell counts under the two models.
fitted(hod_main)
        1         2         3         4         5         6         7         8 
 60.69888  18.94424  24.35688  56.02974  17.48699  22.48327 155.24907  48.45353 
        9        10        11        12 
 62.29740  42.02230  13.11524  16.86245 
fitted(hod_int)
  1   2   3   4   5   6   7   8   9  10  11  12 
 74  18  12  68  16  12 154  54  58  18  10  44 
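To set the observed counts alongside the fitted counts from the two models, one option (a small optional sketch) is:

# Observed counts with fitted values from the main-effects and saturated models:
cbind(hodgkins,
      fitted_main = round(fitted(hod_main), 2),
      fitted_int  = fitted(hod_int))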

What do you notice about the estimated cell counts under the saturated model hod_int, compared with the observed counts?

Which model do you prefer?

# Compare the two models
anova(hod_main, hod_int, test = "LRT")
Analysis of Deviance Table

Model 1: count ~ type + rtreat
Model 2: count ~ type * rtreat
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
1         6     68.295                         
2         0      0.000  6   68.295 9.14e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

By comparing these two models, we are determining whether type and rtreat are independent, or whether there is significant evidence of association. If we need the interaction term, then this provides evidence that the cross-classifying factors are not independent of one another.
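As a cross-check (not part of the lab itself), the classical Pearson chi-squared test of independence applied to the contingency table addresses the same question and should point to the same conclusion:

# Pearson chi-squared test of independence between type and rtreat:
chisq.test(xtabs(count ~ type + rtreat, data = hodgkins))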