MATH3091: Computer Lab 09

Exploring the data

Load the data in lymphoma.csv into a variable lymphoma.

lymphoma <- read.csv("../datasets/lymphoma.csv") # Change the path, if needed
head(lymphoma)
     Cell    Sex Remis Count
1 Nodular   Male    No     1
2 Nodular   Male   Yes     4
3 Nodular Female    No     2
4 Nodular Female   Yes     6
5 Diffuse   Male    No    12
6 Diffuse   Male   Yes     1

The data represents 30 lymphoma patients classified by sex (Sex), cell type of lymphoma (Cell) and response to treatment (Remis). The aim here is to study the complex dependence structures between the three classifying factors.

Use xtabs to find a three-way (\(2\times 2\times 2\)) contingency table for this data

# make a three-way contingency table for the data
xtabs(Count ~ Cell + Sex + Remis, data = lymphoma)
, , Remis = No

         Sex
Cell      Female Male
  Diffuse      3   12
  Nodular      2    1

, , Remis = Yes

         Sex
Cell      Female Male
  Diffuse      1    1
  Nodular      6    4

Here, the saturated model is the three factor interaction model Cell * Remis * Sex. We can fit a saturated log-linear model as follows

# Fit a saturated model:
ly_sat <- glm(Count ~ Cell * Remis * Sex, data = lymphoma, family = poisson)

Use the fitted method to find the fitted values for this saturated model, and the summary method to find the scaled deviance.

# Fitted values
# The fitted values are the same as the observed counts, as the saturated model fits perfectly.
fitted(ly_sat)
 1  2  3  4  5  6  7  8 
 1  4  2  6 12  1  3  1 
data.frame(fitted = fitted(ly_sat), observed = lymphoma$Count)
  fitted observed
1      1        1
2      4        4
3      2        2
4      6        6
5     12       12
6      1        1
7      3        3
8      1        1
# We can see the scaled deviance in the output from `summary()`.
# The scaled deviance (residual deviance) is 0.
summary(ly_sat)

Call:
glm(formula = Count ~ Cell * Remis * Sex, family = poisson, data = lymphoma)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)  
(Intercept)                    1.0986     0.5774   1.903   0.0571 .
CellNodular                   -0.4055     0.9129  -0.444   0.6569  
RemisYes                      -1.0986     1.1547  -0.951   0.3414  
SexMale                        1.3863     0.6455   2.148   0.0317 *
CellNodular:RemisYes           2.1972     1.4142   1.554   0.1203  
CellNodular:SexMale           -2.0794     1.3844  -1.502   0.1331  
RemisYes:SexMale              -1.3863     1.5546  -0.892   0.3725  
CellNodular:RemisYes:SexMale   1.6740     2.0817   0.804   0.4213  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance:  2.2288e+01  on 7  degrees of freedom
Residual deviance: -2.2205e-16  on 0  degrees of freedom
AIC: 38.865

Number of Fisher Scoring iterations: 3

Now, we drop the three factor interaction term.

# Drop the three factor interaction term.
ly_glm1 <- update(ly_sat, . ~ . - Cell:Remis:Sex)
summary(ly_glm1)

Call:
glm(formula = Count ~ Cell + Remis + Sex + Cell:Remis + Cell:Sex + 
    Remis:Sex, family = poisson, data = lymphoma)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)   
(Intercept)            1.2197     0.5253   2.322  0.02024 * 
CellNodular           -0.7412     0.8539  -0.868  0.38540   
RemisYes              -1.7080     1.0501  -1.627  0.10384   
SexMale                1.2324     0.5909   2.086  0.03700 * 
CellNodular:RemisYes   3.0836     1.0403   2.964  0.00303 **
CellNodular:SexMale   -1.3843     1.0316  -1.342  0.17960   
RemisYes:SexMale      -0.4175     1.0372  -0.403  0.68726   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 22.28814  on 7  degrees of freedom
Residual deviance:  0.65074  on 1  degrees of freedom
AIC: 37.516

Number of Fisher Scoring iterations: 4

Note the colon instead of the * in Cell:Remis:Sex. What would happen if you used *? Which formula would you use to fit the same model directly (without using update)?

# To fit the same model directly, we could do use
ly_glm1_direct <- glm(Count ~ Cell + Remis + Sex + Cell:Remis + Cell:Sex + Remis:Sex, data = lymphoma, family = poisson)

Consider dropping each of the two-way interaction terms. For instance, to fit the model without the Remis:Sex interaction

# Model without the `Remis:Sex` interaction.
ly_glm2 <- update(ly_glm1, . ~ . - Remis:Sex)

We can compare the models ly_glm1 and ly_glm2 by using a log-likelihood ratio test.

# Log-likelihood ratio tests to drop terms.
anova(ly_glm2, ly_glm1, test = "LRT")
Analysis of Deviance Table

Model 1: Count ~ Cell + Remis + Sex + Cell:Remis + Cell:Sex
Model 2: Count ~ Cell + Remis + Sex + Cell:Remis + Cell:Sex + Remis:Sex
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1         2    0.80948                     
2         1    0.65074  1  0.15873   0.6903

Which model do you prefer? Can you drop any other interaction terms? Can you drop any main effects? Remember that you should only consider dropping a main effect from the model if you have already removed all interaction terms involving that main effect.

# Do not reject null hypothesis of simpler model, prefer `ly_glm2`.

# Now consider dropping `Cell:Remis`.
ly_glm3 <- update(ly_glm2, . ~ . - Cell:Remis)
anova(ly_glm3, ly_glm2, test = "LRT")
# The p-value is small, so we reject the simpler model, and still prefer `ly_glm2`.

# Now consider dropping `Cell:Sex`.
ly_glm4 <- update(ly_glm2, . ~ . - Cell:Sex)
anova(ly_glm4, ly_glm2, test = "LRT")
# The p-value is small, so we reject the simpler model, and still prefer `ly_glm2`.

# We must then keep all the main effects in the model, because we include each of `Cell`, `Remis` and `Sex` in at least one interaction term.

Investigating the dependence structure

Absence of the interaction term Remis:Sex from ly_glm2 does not imply the independence of remission and sex. It merely implies that remission is independent of sex conditional on cell type, that is \[ \mathbb{P}(R,S|C)=\mathbb{P}(R|C)\mathbb{P}(S|C)\,. \]

Another way of expressing this is \[ \mathbb{P}(R|S,C)=\mathbb{P}(R|C), \] that is, the probability of each level of \(R\) given a particular combination of \(S\) and \(C\), does not depend on which level \(S\) takes. Equivalently, we can write \(P(S|R,C)=P(S|C)\). This can be observed by calculating the estimated odds in favour of \(R=\text{yes}\) over \(R=\text{no}\) for the lymphoma dataset.

We now illustrate the above theory. We first find the 8 fitted probabilities which are simply the fitted counts divided by 30 (which is the total number of patients classified).

# Find the expected cell counts under the model `ly_glm2`
fitted_count <- fitted(ly_glm2)
fitted_count
         1          2          3          4          5          6          7 
 1.1538462  3.8461538  1.8461538  6.1538462 11.4705882  1.5294118  3.5294118 
         8 
 0.4705882 
# Find the expected cell probabilities under the model `ly_glm2`
fitted_prob <- fitted_count/(sum(fitted_count))
fitted_prob
         1          2          3          4          5          6          7 
0.03846154 0.12820513 0.06153846 0.20512821 0.38235294 0.05098039 0.11764706 
         8 
0.01568627 

We then form the odds ratios by dividing the probabilities, e.g. \(\frac{0.1282}{0.0385} = 3.33\).

Check that you can replicate Table from the notes. The odds depend only on a patient’s Cell type, and not on their Sex. This means that remission and sex are conditionally independent given cell type. Are they marginally independent?