Statistical Modelling II (MATH3091)

Part II, Chapter 6: Models for Categorical Data

08/05/2025

Preface

 

The aim of this chapter is to cover the theory and application of generalised linear models (GLMs).

These slides are based on material written by previous lecturers of this course, including Sujit Sahu, Dave Woods, and Chao Zheng.

Schedule

Week Lab Session 1 (Thursday) Session 2 (Friday) Session 3 (Friday) Problem sheet
07 No lab 5.1 Exponential family 5.2 Components of a GLM 5.3 Examples of GLMs Sheet 4
08 Lab 5 5.4 MLE 5.5 Confidence intervals PC: sheet 4
09 Lab 6 5.6 Comparing GLMs 5.7 Deviance 5.8 Models with unknown scale Sheet 5
10 Lab 7 6.1 Contingency tables 6.2 Log-linear models PC: sheet 5
11 Lab 8 6.3 Multinomial sampling 6.4 Interpretation for two-way tables 6.5 Interpretation for multiway tables Sheet 6
12 Lab 9 Revision Revision PC: sheet 6

Chapter 6: Models for Categorical Data
Lecture 6.1: Contingency tables

Introduction

A particularly important application of generalised linear models is the analysis of categorical data.

Here, the data are observations of one or more categorical variables for each of a number of units (often individuals). Each unit is therefore cross-classified by the categorical variables.

For example, the job dataset gives the job satisfaction and income band of 901 individuals from the 1984 General Social Survey.

Income ($) Very Dissat. A Little Dissat. Moderately Sat. Very Sat.
<6000 20 24 80 82
6000–15000 22 38 104 125
15000–25000 13 28 81 113
>25000 7 18 54 92

Contingency tables

A cross-classification table like this is called a contingency table. This is a two-way table, as there are two classifying variables.

It might also be described as a \(4 \times 4\) contingency table (as each of the classifying variables takes one of four possible levels).

Each position in a contingency table is called a cell and the number of individuals in a particular cell is the cell count.

Partial classifications derived from the table are called margins. For a two-way table these are often displayed in the margins of the table.

Our aim in Chapter 6 is to model the cell counts of a contingency table and say something about the relationship between the cross-classifying variables. E.g., is there a relationship between income and job satisfaction?

A contingency table of the job dataset, with margins

As before,

Income ($) Very Dissat. A Little Dissat. Moderately Sat. Very Sat. Sum
<6000 20 24 80 82 206
6000–15000 22 38 104 125 289
15000–25000 13 28 81 113 235
>25000 7 18 54 92 171
Sum 62 108 319 412 901

We have added one-way margins, which represent the classification of items by a single variable; income group and job satisfaction, respectively.

A three-way contingency table

The lymphoma dataset gives information about 30 patients, classified by cell type of lymphoma, sex, and response to treatment. This is an example of a three-way contingency table. It is a \(2\times 2\times 2\) or \(2^3\) table.


Cell Type Sex Remission: No Remission: Yes
Diffuse Female 3 1
Diffuse Male 12 1
Nodular Female 2 6
Nodular Male 1 4

Higher-order margins

For multiway tables, higher-order margins may be calculated. For example, for lymphoma, the two-way Cell Type/Sex margin is


Cell Type Female Male
Diffuse 4 13
Nodular 8 5

Modelling contingency table data

We can model contingency table data using generalised linear models.

To do this, we assume that the cell counts are observations of independent Poisson random variables.

This is intuitively sensible as the cell counts are non-negative integers (the sample space for the Poisson distribution).

Therefore, if the table has \(n\) cells, which we label \(1, \cdots, n\), then the observed cell counts \(y_1, \cdots, y_n\) are assumed to be observations of independent Poisson random variables \(Y_1, \cdots, Y_n\).

We denote the means of these Poisson random variables by \(\mu_1, \cdots , \mu_n\).

Log-linear models

The canonical link function for the Poisson distribution is the log function, and we assume this link function throughout. A generalised linear model for Poisson data using the log link function is called a log-linear model.

The explanatory variables in a log-linear model for contingency table data are the cross-classifying variables.

As these variables are categorical, they are factors. As usual with factors, we can include interactions in the model as well as just main effects.

Such a model will describe how the expected count in each cell depends on the classifying variables, and any interactions between them. Interpretation of these models will be discussed later on.

job dataset in long format

The original data structure of the job dataset is

Income Satisfaction Count
<6000 Very Dissatisfied 20
<6000 A Little Dissatisfied 24
<6000 Moderately Satisfied 80
<6000 Very Satisfied 82
6000–15000 Very Dissatisfied 22
6000–15000 A Little Dissatisfied 38
6000–15000 Moderately Satisfied 104
6000–15000 Very Satisfied 125
15000–25000 Very Dissatisfied 13
15000–25000 A Little Dissatisfied 28
15000–25000 Moderately Satisfied 81
15000–25000 Very Satisfied 113
>25000 Very Dissatisfied 7
>25000 A Little Dissatisfied 18
>25000 Moderately Satisfied 54
>25000 Very Satisfied 92

Log-linear models for job

The original format of the job dataset contains the same data as the contingency table, but arranged differently, in what is sometimes called long format.

A log-linear model is just a Poisson GLM, where the response variable is Count, and Income and Satisfaction are explanatory variables.


job <- read_csv("data/job.csv", show_col_types = FALSE)

loglin_job_1 <- glm(Count ~ Income + Satisfaction, family = poisson, data = job)
loglin_job_2 <- glm(Count ~ Income * Satisfaction, family = poisson, data = job)

Comparing two models (vevox.app, 160-892-474)

loglin_job_1 <- glm(Count ~ Income + Satisfaction, family = poisson, data = job)
loglin_job_2 <- glm(Count ~ Income * Satisfaction, family = poisson, data = job)

Recall Income and Satisfaction are both factors with 4 levels.

What is \(q\), the number of parameters in loglin_job_1?

  • 5
  • 6
  • 7

What is \(p\), the number of parameters in loglin_job_2?

  • 15
  • 16
  • 17

Log-linear models for job (loglin_job_1)

summary(loglin_job_1)

Call:
glm(formula = Count ~ Income + Satisfaction, family = poisson, 
    data = job)

Coefficients:
                                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)                        3.2065     0.1140   28.12   <2e-16 ***
Income>25000                      -0.1862     0.1035   -1.80   0.0719 .  
Income15000-25000                  0.1317     0.0954    1.38   0.1676    
Income6000-15000                   0.3386     0.0912    3.71   0.0002 ***
SatisfactionModerately Satisfied   1.0831     0.1113    9.73   <2e-16 ***
SatisfactionVery Dissatisfied     -0.5550     0.1593   -3.48   0.0005 ***
SatisfactionVery Satisfied         1.3389     0.1081   12.39   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 445.763  on 15  degrees of freedom
Residual deviance:  12.037  on  9  degrees of freedom
AIC: 115.1

Number of Fisher Scoring iterations: 4

Log-linear models for job (loglin_job_2)

summary(loglin_job_2)

Call:
glm(formula = Count ~ Income * Satisfaction, family = poisson, 
    data = job)

Coefficients:
                                                   Estimate Std. Error z value
(Intercept)                                          3.1781     0.2041   15.57
Income>25000                                        -0.2877     0.3118   -0.92
Income15000-25000                                    0.1542     0.2782    0.55
Income6000-15000                                     0.4595     0.2607    1.76
SatisfactionModerately Satisfied                     1.2040     0.2327    5.17
SatisfactionVery Dissatisfied                       -0.1823     0.3028   -0.60
SatisfactionVery Satisfied                           1.2287     0.2321    5.29
Income>25000:SatisfactionModerately Satisfied       -0.1054     0.3581   -0.29
Income15000-25000:SatisfactionModerately Satisfied  -0.1417     0.3197   -0.44
Income6000-15000:SatisfactionModerately Satisfied   -0.1972     0.3002   -0.66
Income>25000:SatisfactionVery Dissatisfied          -0.7621     0.5386   -1.42
Income15000-25000:SatisfactionVery Dissatisfied     -0.5849     0.4520   -1.29
Income6000-15000:SatisfactionVery Dissatisfied      -0.3642     0.4043   -0.90
Income>25000:SatisfactionVery Satisfied              0.4028     0.3468    1.16
Income15000-25000:SatisfactionVery Satisfied         0.1665     0.3137    0.53
Income6000-15000:SatisfactionVery Satisfied         -0.0379     0.2969   -0.13
                                                   Pr(>|z|)    
(Intercept)                                         < 2e-16 ***
Income>25000                                          0.356    
Income15000-25000                                     0.579    
Income6000-15000                                      0.078 .  
SatisfactionModerately Satisfied                    2.3e-07 ***
SatisfactionVery Dissatisfied                         0.547    
SatisfactionVery Satisfied                          1.2e-07 ***
Income>25000:SatisfactionModerately Satisfied         0.769    
Income15000-25000:SatisfactionModerately Satisfied    0.658    
Income6000-15000:SatisfactionModerately Satisfied     0.511    
Income>25000:SatisfactionVery Dissatisfied            0.157    
Income15000-25000:SatisfactionVery Dissatisfied       0.196    
Income6000-15000:SatisfactionVery Dissatisfied        0.368    
Income>25000:SatisfactionVery Satisfied               0.246    
Income15000-25000:SatisfactionVery Satisfied          0.596    
Income6000-15000:SatisfactionVery Satisfied           0.898    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance:  4.4576e+02  on 15  degrees of freedom
Residual deviance: -1.1990e-14  on  0  degrees of freedom
AIC: 121

Number of Fisher Scoring iterations: 3

Comparing two models

Now we conduct a log likelihood ratio test to compare the two models.

anova(loglin_job_1, loglin_job_2)
Analysis of Deviance Table

Model 1: Count ~ Income + Satisfaction
Model 2: Count ~ Income * Satisfaction
  Resid. Df Resid. Dev Df Deviance
1         9         12            
2         0          0  9       12

Here, \(L_{01} = 12.037\), given (rounded) in the Deviance column.

Comparing two models (vevox.app, 160-892-474)

Under \(H_0\), \(L_{01} \sim \chi^2_9\).

We should reject \(H_0\) if \(L_{01} > k\). Which of the following commands could be used to find \(k\), for a test at the \(5\%\) level?

  • dchisq(0.05, df = 9)
  • dchisq(0.95, df = 9)
  • pchisq(0.05, df = 9)
  • pchisq(0.95, df = 9)
  • qchisq(0.05, df = 9)
  • qchisq(0.95, df = 9)

Comparing two models

qchisq(0.95, df = 9)
[1] 16.92

Since \(L_{01} = 12.037 < 16.9\), we do not reject \(H_0\), and we prefer the simpler model, loglin_job_1.

Later, we’ll see that we can interpret loglin_job_1 as meaning that there is no dependence of job satisfaction on income.
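The same conclusion can be reached via a p-value rather than a critical value; a minimal sketch using the deviance change \(L_{01} = 12.037\) reported above:

```r
# P-value for the likelihood ratio test: P(chi^2_9 > L_01).
L01 <- 12.037
p_value <- pchisq(L01, df = 9, lower.tail = FALSE)
p_value  # about 0.21, well above 0.05, so we do not reject H_0
```

Equivalently, `anova(loglin_job_1, loglin_job_2, test = "Chisq")` would report this p-value alongside the deviance table.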

Recap: lymphoma data contingency table

Cell Type Sex. Remission: No Remission: Yes
Diffuse Female 3 1
Diffuse Male 12 1
Nodular Female 2 6
Nodular Male 1 4

Recap: lymphoma data in long format

Cell Sex Remis Count
Nodular Male No 1
Nodular Male Yes 4
Nodular Female No 2
Nodular Female Yes 6
Diffuse Male No 12
Diffuse Male Yes 1
Diffuse Female No 3
Diffuse Female Yes 1

A log-linear model for lymphoma

A log-linear model for the contingency table is just a Poisson GLM for this data, where in this case the response variable is Count, and Sex, Remis and Cell are explanatory variables.

The aim is to understand how the probability of remission varies by type of lymphoma and sex.

You will consider log-linear models for this in LAB 09. For example,

lymphoma <- read_csv("data/lymphoma.csv", show_col_types = FALSE)
loglin_lymphoma <- glm(Count ~ Cell * Remis * Sex, data = lymphoma, family = poisson)

Note

In some cases, it might seem more natural to consider a binomial model for the number of patients in remission, given the total number in each group. We will see later that inference under this binomial model is the same as under the Poisson log-linear model.

Summary

  • We have introduced the idea of a contingency table: a cross-classification of counts according to categorical variables.
  • A log-linear model is a Poisson GLM with the canonical link function (the log link). We can use log-linear models to model contingency table data.
  • Sometimes, the Poisson distribution does not seem to be a natural model for the data. We will consider multinomial models (an extension of the binomial to more than two categories), and show that we make the same conclusions about the relationships between variables with either model.

Chapter 6: Models for Categorical Data
Lecture 6.2 Log-linear models

Recap

  • A contingency table is a cross-classification of counts according to categorical variables.
  • A contingency table is just a particular view of a dataset: we can rewrite the data in “long” format, which looks like the data frame we are familiar with.
  • We can model the cell counts in a contingency table using log-linear models, which are just Poisson GLMs with the (canonical) log link function.

 

Let’s return to an example to demonstrate these concepts.

A contingency table of the job dataset, with margins

As before,

Income ($) Very Dissat. A Little Dissat. Moderately Sat. Very Sat. Sum
<6000 20 24 80 82 206
6000–15000 22 38 104 125 289
15000–25000 13 28 81 113 235
>25000 7 18 54 92 171
Sum 62 108 319 412 901


The one-way margins represent the classification of items by a single variable; income group and job satisfaction, respectively.

Log-linear models

If the table has \(n\) cells, which we label \(1, \cdots ,n\), then the observed cell counts \(y_1, \cdots ,y_n\) are assumed to be observations of independent Poisson random variables \(Y_1, \cdots ,Y_n\).

We denote the means of these Poisson random variables by \(\mu_1, \cdots, \mu_n\), and model \(\log(\mu_i) = \eta_i\), where \(\boldsymbol{\eta}\) is the linear predictor.

The explanatory variables in a log-linear model for contingency table data are the cross-classifying variables. As these variables are categorical, they are factors. As usual with factors, we can include interactions in the model as well as just main effects.

Such a model will describe how the expected count in each cell depends on the classifying variables, and any interactions between them. Interpretation of these models will be discussed later on.

job dataset in long format

The original data structure of the job dataset is

Income Satisfaction Count
<6000 Very Dissatisfied 20
<6000 A Little Dissatisfied 24
<6000 Moderately Satisfied 80
<6000 Very Satisfied 82
6000–15000 Very Dissatisfied 22
6000–15000 A Little Dissatisfied 38
6000–15000 Moderately Satisfied 104
6000–15000 Very Satisfied 125
15000–25000 Very Dissatisfied 13
15000–25000 A Little Dissatisfied 28
15000–25000 Moderately Satisfied 81
15000–25000 Very Satisfied 113
>25000 Very Dissatisfied 7
>25000 A Little Dissatisfied 18
>25000 Moderately Satisfied 54
>25000 Very Satisfied 92

Log-linear models for job

The original format of the job dataset contains the same data as the contingency table, but arranged differently, in what is sometimes called long format.

A log-linear model is just a Poisson GLM, where the response variable is Count, and Income and Satisfaction are explanatory variables.

job <- read_csv("data/job.csv", show_col_types = FALSE)

job$Income <- factor(job$Income, levels = c("<6000", "6000-15000", "15000-25000", ">25000"))
job$Satisfaction <- factor(job$Satisfaction, levels = c("Very Dissatisfied", "A Little Dissatisfied", "Moderately Satisfied", "Very Satisfied"))

# Fit the models
loglin_job_1 <- glm(Count ~ Income + Satisfaction, family = poisson, data = job)
loglin_job_2 <- glm(Count ~ Income * Satisfaction, family = poisson, data = job)

Log-linear models for job (loglin_job_1)

loglin_job_1

Call:  glm(formula = Count ~ Income + Satisfaction, family = poisson, 
    data = job)

Coefficients:
                      (Intercept)                   Income6000-15000  
                            2.652                              0.339  
                Income15000-25000                       Income>25000  
                            0.132                             -0.186  
SatisfactionA Little Dissatisfied   SatisfactionModerately Satisfied  
                            0.555                              1.638  
       SatisfactionVery Satisfied  
                            1.894  

Degrees of Freedom: 15 Total (i.e. Null);  9 Residual
Null Deviance:      446 
Residual Deviance: 12   AIC: 115
summary(loglin_job_1)

Call:
glm(formula = Count ~ Income + Satisfaction, family = poisson, 
    data = job)

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)                         2.6515     0.1410   18.81   <2e-16 ***
Income6000-15000                    0.3386     0.0912    3.71   0.0002 ***
Income15000-25000                   0.1317     0.0954    1.38   0.1676    
Income>25000                       -0.1862     0.1035   -1.80   0.0719 .  
SatisfactionA Little Dissatisfied   0.5550     0.1593    3.48   0.0005 ***
SatisfactionModerately Satisfied    1.6381     0.1388   11.80   <2e-16 ***
SatisfactionVery Satisfied          1.8939     0.1362   13.90   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 445.763  on 15  degrees of freedom
Residual deviance:  12.037  on  9  degrees of freedom
AIC: 115.1

Number of Fisher Scoring iterations: 4

Expected cell counts (vevox.app, 160-892-474)

coef(loglin_job_1)
                      (Intercept)                  Income6000-15000 
                           2.6515                            0.3386 
                Income15000-25000                      Income>25000 
                           0.1317                           -0.1862 
SatisfactionA Little Dissatisfied  SatisfactionModerately Satisfied 
                           0.5550                            1.6381 
       SatisfactionVery Satisfied 
                           1.8939 

The fitted values \(\hat \mu_i = \exp(\mathbf{x}_i^{\top} \hat{\boldsymbol{\beta}})\) give us the expected counts in each cell of the contingency table.


Which of the following would give the expected count of people with income \(<6000\) who were A Little Dissatisfied under model loglin_job_1?

  • \(\exp(0.5550) = 1.70\)
  • \(\exp(2.6515) = 14.1\)
  • \(\exp(3.2065) = 24.7\)
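The calculation behind these options can be reproduced directly from the printed coefficient estimates; a minimal sketch (hard-coding the estimates shown in the `coef` output above):

```r
# Expected cell count = exp(linear predictor) for that cell.
# For income <6000 (the baseline) and Satisfaction "A Little Dissatisfied",
# the linear predictor is the intercept plus one Satisfaction coefficient:
beta_intercept    <- 2.6515  # (Intercept): baseline cell (<6000, Very Dissatisfied)
beta_satisfaction <- 0.5550  # SatisfactionA Little Dissatisfied
exp(beta_intercept + beta_satisfaction)  # exp(3.2065), about 24.7
```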

Observed and expected cell counts in contingency table format

job_tab <- xtabs(Count ~ Income + Satisfaction, data = job)
# Puts Arbitrary Margins on Multidimensional Tables or Arrays
addmargins(job_tab)
             Satisfaction
Income        Very Dissatisfied A Little Dissatisfied Moderately Satisfied
  <6000                      20                    24                   80
  6000-15000                 22                    38                  104
  15000-25000                13                    28                   81
  >25000                      7                    18                   54
  Sum                        62                   108                  319
             Satisfaction
Income        Very Satisfied Sum
  <6000                   82 206
  6000-15000             125 289
  15000-25000            113 235
  >25000                  92 171
  Sum                    412 901
fit1_tab <- xtabs(fitted(loglin_job_1) ~ Income + Satisfaction, data = job)
# Puts Arbitrary Margins on Multidimensional Tables or Arrays
addmargins(fit1_tab)
             Satisfaction
Income        Very Dissatisfied A Little Dissatisfied Moderately Satisfied
  <6000                   14.18                 24.69                72.93
  6000-15000              19.89                 34.64               102.32
  15000-25000             16.17                 28.17                83.20
  >25000                  11.77                 20.50                60.54
  Sum                     62.00                108.00               319.00
             Satisfaction
Income        Very Satisfied    Sum
  <6000                94.20 206.00
  6000-15000          132.15 289.00
  15000-25000         107.46 235.00
  >25000               78.19 171.00
  Sum                 412.00 901.00

Log-linear models for job (loglin_job_2)

summary(loglin_job_2)

Call:
glm(formula = Count ~ Income * Satisfaction, family = poisson, 
    data = job)

Coefficients:
                                                    Estimate Std. Error z value
(Intercept)                                           2.9957     0.2236   13.40
Income6000-15000                                      0.0953     0.3090    0.31
Income15000-25000                                    -0.4308     0.3563   -1.21
Income>25000                                         -1.0498     0.4392   -2.39
SatisfactionA Little Dissatisfied                     0.1823     0.3028    0.60
SatisfactionModerately Satisfied                      1.3863     0.2500    5.55
SatisfactionVery Satisfied                            1.4110     0.2494    5.66
Income6000-15000:SatisfactionA Little Dissatisfied    0.3642     0.4043    0.90
Income15000-25000:SatisfactionA Little Dissatisfied   0.5849     0.4520    1.29
Income>25000:SatisfactionA Little Dissatisfied        0.7621     0.5386    1.42
Income6000-15000:SatisfactionModerately Satisfied     0.1671     0.3429    0.49
Income15000-25000:SatisfactionModerately Satisfied    0.4432     0.3896    1.14
Income>25000:SatisfactionModerately Satisfied         0.6568     0.4732    1.39
Income6000-15000:SatisfactionVery Satisfied           0.3263     0.3401    0.96
Income15000-25000:SatisfactionVery Satisfied          0.7515     0.3847    1.95
Income>25000:SatisfactionVery Satisfied               1.1649     0.4647    2.51
                                                    Pr(>|z|)    
(Intercept)                                          < 2e-16 ***
Income6000-15000                                       0.758    
Income15000-25000                                      0.227    
Income>25000                                           0.017 *  
SatisfactionA Little Dissatisfied                      0.547    
SatisfactionModerately Satisfied                     2.9e-08 ***
SatisfactionVery Satisfied                           1.5e-08 ***
Income6000-15000:SatisfactionA Little Dissatisfied     0.368    
Income15000-25000:SatisfactionA Little Dissatisfied    0.196    
Income>25000:SatisfactionA Little Dissatisfied         0.157    
Income6000-15000:SatisfactionModerately Satisfied      0.626    
Income15000-25000:SatisfactionModerately Satisfied     0.255    
Income>25000:SatisfactionModerately Satisfied          0.165    
Income6000-15000:SatisfactionVery Satisfied            0.337    
Income15000-25000:SatisfactionVery Satisfied           0.051 .  
Income>25000:SatisfactionVery Satisfied                0.012 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4.4576e+02  on 15  degrees of freedom
Residual deviance: 2.2649e-14  on  0  degrees of freedom
AIC: 121

Number of Fisher Scoring iterations: 3

Expected cell counts (vevox.app, 160-892-474)

loglin_job_1 <- glm(Count ~ Income + Satisfaction,  family = poisson, data = job)
loglin_job_2 <- glm(Count ~ Income * Satisfaction,  family = poisson, data = job)

There are \(24\) people with income \(<6000\) who were A Little Dissatisfied. The expected count in this cell under model loglin_job_1 is \(24.7\). What is the expected count in this cell under loglin_job_2?

  • less than \(23.9\).
  • \(24\).
  • between \(24.1\) and \(24.6\).
  • \(24.7\).
  • more than \(24.8\).
  • It is not possible to say.

Discussion

Notice that loglin_job_2 is the saturated model (\(16\) cells and \(16\) parameters), so it fits the data exactly.

This happens whenever the model includes (I) all main effects and (II) all interactions between the classifying variables, giving one parameter for each cell.
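A small self-contained illustration (the \(2 \times 2\) counts here are made up for illustration): with one parameter per cell, a saturated Poisson log-linear model reproduces the observed counts exactly.

```r
# Hypothetical 2x2 table in long format:
df <- data.frame(A = factor(c(0, 0, 1, 1)),
                 B = factor(c(0, 1, 0, 1)),
                 count = c(3, 2, 4, 1))
sat <- glm(count ~ A * B, family = poisson, data = df)  # 4 cells, 4 parameters
round(unname(fitted(sat)), 6)  # matches the observed counts: 3 2 4 1
deviance(sat)                  # residual deviance is (numerically) zero
```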

Observed and expected cell counts in contingency table format

job_tab <- xtabs(Count ~ Income + Satisfaction, data = job)
# Puts Arbitrary Margins on Multidimensional Tables or Arrays
addmargins(job_tab)
             Satisfaction
Income        Very Dissatisfied A Little Dissatisfied Moderately Satisfied
  <6000                      20                    24                   80
  6000-15000                 22                    38                  104
  15000-25000                13                    28                   81
  >25000                      7                    18                   54
  Sum                        62                   108                  319
             Satisfaction
Income        Very Satisfied Sum
  <6000                   82 206
  6000-15000             125 289
  15000-25000            113 235
  >25000                  92 171
  Sum                    412 901
fit2_tab <- xtabs(fitted(loglin_job_2) ~ Income + Satisfaction, data = job)
# Puts Arbitrary Margins on Multidimensional Tables or Arrays
addmargins(fit2_tab)
             Satisfaction
Income        Very Dissatisfied A Little Dissatisfied Moderately Satisfied
  <6000                      20                    24                   80
  6000-15000                 22                    38                  104
  15000-25000                13                    28                   81
  >25000                      7                    18                   54
  Sum                        62                   108                  319
             Satisfaction
Income        Very Satisfied Sum
  <6000                   82 206
  6000-15000             125 289
  15000-25000            113 235
  >25000                  92 171
  Sum                    412 901

Multinomial sampling

Although the assumption of Poisson distributed observations is convenient for the purposes of modelling, it might not be a realistic assumption, because of the way in which the data have been collected.

Frequently, when contingency table data are obtained, the total number of observations (the grand total, the sum of all the cell counts) is fixed in advance.

In this case, no individual cell count can exceed the pre-specified fixed total, so the assumption of Poisson sampling is invalid as the sample space is bounded. Furthermore, with a fixed total, the observations can no longer be observations of independent random variables.
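A quick simulation sketch (with made-up, equal cell probabilities) shows both features of sampling with a fixed grand total:

```r
# With the grand total fixed at N = 30, cell counts are multinomial, not
# independent Poisson: every simulated table sums to exactly 30, and the
# cell counts are negatively correlated.
set.seed(1)  # for reproducibility
draws <- rmultinom(1000, size = 30, prob = rep(1 / 8, 8))  # 1000 tables of 8 cells
all(colSums(draws) == 30)    # TRUE: the total is always 30
cor(draws[1, ], draws[2, ])  # negative: a large count in one cell forces others down
```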

Recap: lymphoma data contingency table

The lymphoma dataset gives information about 30 patients, classified by cell type of lymphoma, sex, and response to treatment. This is an example of a three-way contingency table. It is a \(2\times 2\times 2\) or \(2^3\) table.


Cell Type Sex Remission: No Remission: Yes
Diffuse Female 3 1
Diffuse Male 12 1
Nodular Female 2 6
Nodular Male 1 4

Recap: lymphoma data contingency table

For the lymphoma contingency table, the appropriate model to use depends on the process by which these data were collected:

  • If the data were collected over a fixed period of time, and in that time there happened to be 30 patients, then the Poisson assumption is perfectly valid.
  • If it had been decided at the outset to collect data on 30 patients, the grand total is fixed at 30, and the Poisson assumption is not valid.

When the grand total is fixed, a more appropriate distribution for the cell counts is the multinomial distribution.

The multinomial distribution

The multinomial distribution is the distribution of cell counts arising when a pre-specified total of \(N\) items are each independently assigned to one of \(n\) cells, where the probability of being classified into cell \(i\) is \(p_i\), \({i=1, \cdots, n}\), so \(\sum_{i=1}^n p_i=1\).

The probability function for the multinomial distribution is \[\begin{align*} f_{\mathbf{Y}}(\mathbf{y};{\mathbf{p}}) &= P(Y_1=y_1,\cdots ,Y_n=y_n) \cr &= \begin{cases} N!\prod_{i=1}^n \frac{p_i^{y_i}}{y_i!} & \text{if $\sum_{i=1}^n y_i=N$} \\ 0 & \text{otherwise.} \end{cases} \end{align*}\]

The binomial is the special case of the multinomial with two cells (\(n=2\)).
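In R, this probability function is available as `dmultinom`; a small sketch with made-up values of \(N\), \(n\), and \(\mathbf{p}\), checking it against the formula above:

```r
# Hypothetical example: N = 10 items over n = 3 cells with p = (0.5, 0.3, 0.2).
y <- c(5, 3, 2)
p <- c(0.5, 0.3, 0.2)
dmultinom(y, prob = p)
# Agrees with the formula N! * prod(p_i^y_i / y_i!):
factorial(10) * prod(p^y / factorial(y))
```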

Summary

  • We have revised the idea of a contingency table: a cross-classification of counts according to categorical variables.
  • A log-linear model is a Poisson GLM with the canonical link function (the log link). We can use log-linear models to model contingency table data, and find expected cell counts.
  • If the total count is fixed, a multinomial model is more natural than a Poisson one.
  • Next time, we will show that we make the same conclusions about the relationships between variables by using a Poisson log-linear model instead of a multinomial one (provided that we include an intercept in the Poisson model).

Chapter 6: Models for Categorical Data
Lecture 6.3 Multinomial sampling

Recap

  • A log-linear model is a Poisson GLM with the canonical link function (the log link). We can use log-linear models to model contingency table data, and find expected cell counts.
  • If the total count is fixed, a multinomial model is more natural than a Poisson one.
  • How should we do inference for a multinomial model?

Recap: the multinomial distribution

The multinomial distribution is the distribution of cell counts arising when a pre-specified total of \(N\) items are each independently assigned to one of \(n\) cells, where the probability of being classified into cell \(i\) is \(p_i\), \({i=1, \cdots, n}\), so \(\sum_{i=1}^n p_i=1\).

The probability function for the multinomial distribution is \[\begin{align*} f_{\mathbf{Y}}(\mathbf{y};{\mathbf{p}}) &= P(Y_1=y_1,\cdots ,Y_n=y_n) \cr &= \begin{cases} N!\prod_{i=1}^n \frac{p_i^{y_i}}{y_i!} & \text{if $\sum_{i=1}^n y_i=N$} \\ 0 & \text{otherwise.} \end{cases} \end{align*}\]

The binomial is the special case of the multinomial with two cells (\(n=2\)).

A multinomial log-linear model

When the data have been obtained by multinomial sampling, we suppose \(\mathbf{Y} \sim \text{Multinomial}(N, \mathbf{p})\), where \(\log\mu_i=\log(N \cdot p_i)\), \({i=1, \cdots, n}\), is modelled as a linear function of explanatory variables, so that \(\log \mu_i = \mathbf{x}_i^{\top} \boldsymbol{\beta}\).

Since \(\sum_{i=1}^n p_i = 1\), we have \(\sum_{i=1}^n \mu_i=N\), the grand total, which is fixed in advance.

Equivalence of Poisson log-linear model

Suppose we model instead \(Y_i \sim \text{Poisson}(\mu_i)\), where \(\log \mu_i = \mathbf{x}_i^{\top} \boldsymbol{\beta}\). Suppose that we include an intercept term, e.g. \(x_{i1} = 1\) for \(i = 1, \cdots,n\), so \(\log \mu_i = \beta_1 + \sum_{j=2}^p x_{ij} \beta_j\).

Important

The maximum likelihood estimates for multinomial log-linear parameters are identical to those for Poisson log-linear parameters (with intercept). Furthermore, the maximised log-likelihoods for both Poisson and multinomial models take the form \(\sum_{i=1}^ny_i\log\hat{\mu}_i\), where \(\log \hat{\mu}_i = \mathbf{x}_i^{\top} \boldsymbol{\hat{\beta}}\).

We will see a sketch of the proof of this result (this is also in the lecture notes).

Implications of the equivalence result

Recall that the maximised log-likelihoods for both Poisson and multinomial models take the form \(\sum_{i=1}^ny_i\log\hat{\mu}_i\) as functions of the log-linear parameter estimates.

So any inferences based on maximised log-likelihoods (such as likelihood ratio tests) will be the same.

Therefore, we can analyse contingency table data using Poisson log-linear models, even when the data have been obtained by multinomial sampling. All that is required is to ensure that the Poisson model contains an intercept term.

Example in R (Part 1)

# Create dataset (long format)
library(tibble)
df <- tibble(A = c(0, 0, 1, 1),
             B = c(0, 1, 0, 1),
             count = c(3, 2, 4, 1))

df
# A tibble: 4 × 3
      A     B count
  <dbl> <dbl> <dbl>
1     0     0     3
2     0     1     2
3     1     0     4
4     1     1     1
# Contingency table
ct <- xtabs(count ~ A + B, data = df)
ct <- addmargins(ct)

ct
     B
A      0  1 Sum
  0    3  2   5
  1    4  1   5
  Sum  7  3  10

Example in R (Part 2)

# Fit models
mod1 <-  glm(formula = count ~ 0 + A + B, family = "poisson", data = df)
mod2 <-  glm(formula = count ~ 1 + A + B, family = "poisson", data = df)

ct1 <- xtabs(fitted(mod1) ~ A + B, data = df)
ct2 <- xtabs(fitted(mod2) ~ A + B, data = df)

ct1 <- addmargins(ct1)
ct2 <- addmargins(ct2)

# Fitted values without intercept
ct1
     B
A          0      1    Sum
  0   1.0000 0.7913 1.7913
  1   2.7913 2.2087 5.0000
  Sum 3.7913 3.0000 6.7913
# Fitted values with intercept
ct2
     B
A        0    1  Sum
  0    3.5  1.5  5.0
  1    3.5  1.5  5.0
  Sum  7.0  3.0 10.0
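As a sanity check (repeating the setup above so the snippet is self-contained), the fitted means from the model with an intercept sum to the fixed grand total of 10, while those from the model without an intercept need not:

```r
df <- data.frame(A = c(0, 0, 1, 1), B = c(0, 1, 0, 1),
                 count = c(3, 2, 4, 1))
mod1 <- glm(count ~ 0 + A + B, family = poisson, data = df)
mod2 <- glm(count ~ 1 + A + B, family = poisson, data = df)

sum(fitted(mod2))   # 10, the grand total
sum(fitted(mod1))   # about 6.79: not constrained to the grand total
```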

Sampling with fixed margins

Sometimes margins other than just the grand total may be pre-specified.

For example, consider the lymphoma contingency table.

Cell Type Sex Remission: No Remission: Yes
Diffuse Female 3 1
Diffuse Male 12 1
Nodular Female 2 6
Nodular Male 1 4

It may have been decided at the outset to collect data on \(18\) male patients and \(12\) female patients. Alternatively, perhaps the distribution of both the Sex and Cell Type of the patients was fixed in advance.

Product multinomial sampling

In cases where a margin is fixed by design, the data consist of a number of fixed-total subgroups, defined by the fixed margin. Neither Poisson nor multinomial sampling assumptions are valid.

The appropriate distribution combines a separate, independent multinomial for each subgroup:

  • If the Sex margin is fixed, then we should model the data with two independent multinomials, one for males with \(N=18\) and one for females with \(N=12\). Each of these multinomials has four cells, representing the cross-classification of the relevant patients by Cell Type and Remission.
  • If the Cell Type/Sex margin has been fixed, then the appropriate distribution is four independent two-cell multinomials (binomials) representing the remission classification for each of the four fixed-total patient subgroups.

The product multinomial distribution

When the data are modelled using independent multinomials, the joint probability function for the cell counts \(Y_1, \cdots, Y_n\) is the product of multinomial probability functions, one for each fixed-total subgroup. We call this distribution a product multinomial.

For example, if the Sex margin is fixed for lymphoma, then the product multinomial distribution has the form \[ f_{\mathbf{Y}}(\mathbf{y};{\mathbf{p}})= \begin{cases} N_m!\prod_{i=1}^4 {{p_{mi}^{y_{mi}}}\over{y_{mi}!}} N_f!\prod_{i=1}^4 {{p_{fi}^{y_{fi}}}\over{y_{fi}!}} & \text{if $\sum_{i=1}^4 y_{mi}=N_m$ and $\sum_{i=1}^4 y_{fi}=N_f$} \\ 0 & \text{otherwise,} \end{cases} \] where

  • \(N_m\) and \(N_f\) are the two fixed marginal totals (18 and 12, respectively).
  • \(y_{m1}, \cdots, y_{m4}\) are the cell counts for males and \(y_{f1}, \cdots, y_{f4}\) are the corresponding cell counts for females.
  • \(\sum_{i=1}^4 p_{mi}=\sum_{i=1}^4 p_{fi}=1\).
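For illustration, this probability function can be evaluated as a product of two independent multinomial probabilities, here at the observed lymphoma counts with each subgroup's cell probabilities set to the observed proportions (an illustrative choice, not part of any fitted model):

```r
y_m <- c(12, 1, 1, 4)   # male cell counts, N_m = 18
y_f <- c(3, 1, 2, 6)    # female cell counts, N_f = 12

# Product-multinomial probability: one multinomial factor per subgroup
dmultinom(y_m, prob = y_m / 18) * dmultinom(y_f, prob = y_f / 12)
```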

Analysis using Poisson log-linear models

Using similar reasoning to the multinomial sampling case, we can analyse contingency table data using Poisson log-linear models, even when the data have been obtained by product multinomial sampling.

However, we must ensure that the Poisson model contains a term corresponding to the fixed margin (and all marginal terms). Then, the estimated means for the specified margin are equal to the corresponding fixed totals.

Therefore, when analysing product multinomial data using a Poisson log-linear model, certain terms must be present in any model we fit. If they are removed, the inferences would no longer be valid.

Example: lymphoma data

For the lymphoma dataset, for inferences obtained using a Poisson model to be valid when the Sex margin is fixed in advance, the Poisson model must contain the Sex main effect (and the intercept).

mod_sex <- glm(Count ~ 1 + Sex, data = lymphoma, family = poisson)

# Female: 12, Male: 18
ct_sex <- xtabs(fitted(mod_sex) ~ Cell + Remis + Sex , data = lymphoma)
addmargins(ct_sex)
, , Sex = Female

         Remis
Cell        No  Yes  Sum
  Diffuse  3.0  3.0  6.0
  Nodular  3.0  3.0  6.0
  Sum      6.0  6.0 12.0

, , Sex = Male

         Remis
Cell        No  Yes  Sum
  Diffuse  4.5  4.5  9.0
  Nodular  4.5  4.5  9.0
  Sum      9.0  9.0 18.0

, , Sex = Sum

         Remis
Cell        No  Yes  Sum
  Diffuse  7.5  7.5 15.0
  Nodular  7.5  7.5 15.0
  Sum     15.0 15.0 30.0

Example: lymphoma data

For inferences obtained using a Poisson model to be valid when the Cell Type/Sex margin is fixed in advance, the Poisson model must contain the Cell Type/Sex interaction, and all marginal terms (the Cell Type main effect, the Sex main effect, and the intercept).

mod_cell_sex <- glm(Count ~ 1 + Cell + Sex + Cell:Sex, data = lymphoma, family = poisson)

# Female and Diffuse : 4, Female and Nodular: 8, Male and Diffuse: 13, Male and Nodular: 5
ct_cell_sex <- xtabs(fitted(mod_cell_sex) ~ Remis + Cell + Sex, data = lymphoma)
addmargins(ct_cell_sex)
, , Sex = Female

     Cell
Remis Diffuse Nodular  Sum
  No      2.0     4.0  6.0
  Yes     2.0     4.0  6.0
  Sum     4.0     8.0 12.0

, , Sex = Male

     Cell
Remis Diffuse Nodular  Sum
  No      6.5     2.5  9.0
  Yes     6.5     2.5  9.0
  Sum    13.0     5.0 18.0

, , Sex = Sum

     Cell
Remis Diffuse Nodular  Sum
  No      8.5     6.5 15.0
  Yes     8.5     6.5 15.0
  Sum    17.0    13.0 30.0

Interpreting log-linear models

Log-linear models for contingency tables enable us to determine important properties concerning the joint distribution of the classifying variables. The form of our preferred log-linear model for a table has implications for how the variables are associated.

Each combination of the classifying variables occurs exactly once in a contingency table. Therefore, the model with the highest order interaction (between all the variables) and all marginal terms (all other interactions) is the saturated model.

The implication of this model is that every combination of levels of the variables has its own mean (probability) and that there are no relationships between these means (no structure). The variables may be arbitrarily dependent.
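A small sketch with a toy \(2 \times 2\) table (hypothetical counts) confirms that a saturated log-linear model reproduces the observed counts exactly:

```r
toy <- data.frame(A = factor(c(0, 0, 1, 1)),
                  B = factor(c(0, 1, 0, 1)),
                  count = c(3, 2, 4, 1))

# Saturated model: one parameter per cell, so fitted values = observed counts
sat <- glm(count ~ A * B, family = poisson, data = toy)
all.equal(unname(fitted(sat)), toy$count, tolerance = 1e-6)   # TRUE
```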

Example: return to the job dataset

In the job dataset, we might assume that the number of people in each income group is fixed. We are interested in the probability of having each level of satisfaction at each income group, and seeing whether there is any evidence in this data that job satisfaction depends on income.


Income ($) Very Dissat. A Little Dissat. Moderately Sat. Very Sat. Sum
<6000 20 24 80 82 206
6000–15000 22 38 104 125 289
15000–25000 13 28 81 113 235
>25000 7 18 54 92 171
Sum 62 108 319 412 901

Fitting log-linear models

This is an example of product multinomial sampling, with the Income margin fixed: provided that we include an intercept and Income in the model, we can use Poisson log-linear models to analyse the contingency table (as we did before).


job <- read_csv("data/job.csv", show_col_types = FALSE)

job$Income <- factor(job$Income, levels = c("<6000", "6000-15000", "15000-25000", ">25000"))
job$Satisfaction <- factor(job$Satisfaction, levels = c("Very Dissatisfied", "A Little Dissatisfied", "Moderately Satisfied", "Very Satisfied"))

loglin_job_1 <- glm(Count ~ Income + Satisfaction, family = poisson, data = job)
loglin_job_2 <- glm(Count ~ Income * Satisfaction, family = poisson, data = job)

Comparing models

As before, we can compare these models with a log-likelihood ratio test

anova(loglin_job_1, loglin_job_2, test = "LRT")
Analysis of Deviance Table

Model 1: Count ~ Income + Satisfaction
Model 2: Count ~ Income * Satisfaction
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1         9         12                     
2         0          0  9       12     0.21

On the basis of this, we prefer loglin_job_1, which does not have an interaction between Income and Satisfaction.
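For reference, the p-value in the anova table is an upper-tail chi-squared probability; using the (rounded) deviance change of 12 on 9 degrees of freedom:

```r
# Upper tail of a chi-squared distribution with 9 degrees of freedom
pchisq(12, df = 9, lower.tail = FALSE)   # approximately 0.21
```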

Expected cell counts for loglin_job_1

The fitted values \(\hat{\mu}_i = \exp(\mathbf{x}_i^{\top} \hat{\boldsymbol{\beta}})\) give us the expected counts in each cell of the contingency table.


fit1_tab <- xtabs(fitted(loglin_job_1) ~ Income + Satisfaction, data = job)
# Puts Arbitrary Margins on Multidimensional Tables or Arrays
addmargins(fit1_tab)
             Satisfaction
Income        Very Dissatisfied A Little Dissatisfied Moderately Satisfied
  <6000                   14.18                 24.69                72.93
  6000-15000              19.89                 34.64               102.32
  15000-25000             16.17                 28.17                83.20
  >25000                  11.77                 20.50                60.54
  Sum                     62.00                108.00               319.00
             Satisfaction
Income        Very Satisfied    Sum
  <6000                94.20 206.00
  6000-15000          132.15 289.00
  15000-25000         107.46 235.00
  >25000               78.19 171.00
  Sum                 412.00 901.00

Fitted conditional prob. (vevox.app, 160-892-474)

Under loglin_job_1, the expected cell counts for Income \(< 6000\) are

fit1_tab[1,]
    Very Dissatisfied A Little Dissatisfied  Moderately Satisfied 
                14.18                 24.69                 72.93 
       Very Satisfied 
                94.20 

Under this model, what is the estimate of the probability \[\mathbb{P}(\text{Satisfaction = Very Satisfied} | \text{Income < 6000})?\]

  • \(94.2 / 100 = 0.94\)
  • \(94.2 / (14.2 + 24.6 + 72.9) = 0.84\)
  • \(94.2 / (14.2 + 24.6 + 72.9 + 94.2) = 0.46\)
  • \(94.2 / 901 = 0.10\)

Fitted conditional prob. (vevox.app, 160-892-474)

Under loglin_job_1, the expected cell counts for Income \(6000\text{-}15000\) are

fit1_tab[2,]
    Very Dissatisfied A Little Dissatisfied  Moderately Satisfied 
                19.89                 34.64                102.32 
       Very Satisfied 
               132.15 

Under this model, what is the estimate of the probability \[\mathbb{P}(\text{Satisfaction = Very Satisfied} | \text{Income = 6000-15000})?\]

  • \(132.2 / (19.9 + 34.6 + 102.3) = 0.84\)
  • \(132.2 / (19.9 + 34.6 + 102.3 + 132.2) = 0.46\)
  • \(132.2 / 901 = 0.15\)

Fitted conditional prob. for loglin_job_1

The fitted conditional probabilities for each satisfaction level are

fit1_tab[1,] / sum(fit1_tab[1,])
    Very Dissatisfied A Little Dissatisfied  Moderately Satisfied 
              0.06881               0.11987               0.35405 
       Very Satisfied 
              0.45727 

for Income \(< 6000\), and

fit1_tab[2,] / sum(fit1_tab[2,])
    Very Dissatisfied A Little Dissatisfied  Moderately Satisfied 
              0.06881               0.11987               0.35405 
       Very Satisfied 
              0.45727 

for Income \(6000\text{-}15000\).

In fact, we get the same fitted conditional probabilities for each income. Under this model, Income and Satisfaction are independent!
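This is no accident: under the main-effects Poisson model, the fitted count in cell \((j, k)\) is (row total \(\times\) column total)\(/N\), so every row shares the same conditional distribution. A self-contained sketch using the margins of the job table:

```r
row_tot <- c(206, 289, 235, 171)   # Income margin
col_tot <- c(62, 108, 319, 412)    # Satisfaction margin
N <- 901

# Fitted counts under independence, then row-wise conditional probabilities
fit <- outer(row_tot, col_tot) / N
round(prop.table(fit, margin = 1), 5)   # every row: 0.06881 0.11987 0.35405 0.45727
```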

Summary

  • We have considered multinomial and product multinomial models, appropriate when the grand total or some margin is fixed. We can still analyse such contingency tables using Poisson log-linear models, provided we include certain terms (corresponding to the fixed margin) in the model.
  • We have seen an example of a log-linear model for a two-way table, with no interaction between classifying variables. For this model, we found that the classifying variables are independent of one another.

Chapter 6: Models for Categorical Data
Lecture 6.4 Interpretation for two-way tables

Recap

Last time we considered the implications of a log-linear model with no interaction term for the job dataset.

job <- read_csv("data/job.csv", show_col_types = FALSE)

job$Income <- factor(job$Income, levels = c("<6000", "6000-15000", "15000-25000", ">25000"))
job$Satisfaction <- factor(job$Satisfaction, levels = c("Very Dissatisfied", "A Little Dissatisfied", "Moderately Satisfied", "Very Satisfied"))

loglin_job_1 <- glm(Count ~ Income + Satisfaction, family = poisson, data = job)

We found that under this model, the fitted conditional probabilities for having each level of job satisfaction were the same for each income group.

For this model for a two-way table, with no interaction (no Income:Satisfaction term), we find that the cross-classifying factors (Income and Satisfaction) are independent. Now, we will show that this is true in general, and then go on to look at multi-way tables.

Interpretation for two-way tables

We consider a two-way \(r \times c\) table where the two classifying variables \(R\) and \(C\) have \(r\) and \(c\) levels, respectively.

The saturated model \(R*C\) implies that the two variables are associated. If we remove the \(R*C\) interaction, we have the model \(R+C\), \[ \log\mu_i=\alpha+\beta_R(r_i)+\beta_C(c_i),\qquad i=1, \cdots, n, \] where \(n=rc\) is the total number of cells in the table.

Cell probabilities in main-effects model

Because of the equivalence of Poisson and multinomial sampling, we can think of each cell mean \(\mu_i\) as equal to \(N \cdot p_i\) where \(N\) is the total number of observations in the table, and \(p_i\) a cell probability.

As each combination of levels of \(R\) and \(C\) is represented in exactly one cell, it is also convenient to replace the cell label \(i\) by the pair of labels \(j\) and \(k\) representing the corresponding levels of \(R\) and \(C\) respectively. Hence \[ \log \left(p_{jk}\right) = \alpha + \beta_R(j) + \beta_C(k) - \log (N), \] for \(j = 1, \cdots, r\) and \(k = 1, \cdots, c\).

Joint row and column probabilities

We have \[ \log \left(p_{jk}\right)=\alpha+\beta_R(j)+\beta_C(k)-\log(N), \] and \[ \mathbb{P}(R=j,C=k)=\exp[\alpha+\beta_R(j)+\beta_C(k)-\log(N)], \] for \(j=1, \cdots, r\) and \(k=1, \cdots, c\).

So,

\[\begin{align*} 1&=\sum_{j=1}^r\sum_{k=1}^c\exp[\alpha+\beta_R(j)+\beta_C(k)-\log(N)]\cr &={1\over N}\exp[\alpha]\sum_{j=1}^r\exp[\beta_R(j)]\sum_{k=1}^c\exp[\beta_C(k)]. \end{align*}\]

Marginal row and column probabilities

Since \[ \mathbb{P}(R=j,C=k)=\exp[\alpha+\beta_R(j)+\beta_C(k)-\log (N)], \]

we have

\[\begin{align*} \mathbb{P}(R=j)&=\sum_{k=1}^c\exp[\alpha+\beta_R(j)+\beta_C(k)-\log(N)]\cr &={1\over N}\exp[\alpha]\exp[\beta_R(j)]\sum_{k=1}^c\exp[\beta_C(k)], \quad j=1,\cdots,r, \end{align*}\]

and

\[\begin{align*} \mathbb{P}(C=k)&=\sum_{j=1}^r\exp[\alpha+\beta_R(j)+\beta_C(k)-\log(N)]\cr &={1\over N}\exp[\alpha]\exp[\beta_C(k)]\sum_{j=1}^r\exp[\beta_R(j)], \quad k=1,\cdots,c. \end{align*}\]

Independence of cross-classifying variables

Therefore, \[\begin{align*} \mathbb{P}(R=j)\mathbb{P}(C=k) &= {1\over N}\exp[\alpha]\exp[\beta_R(j)]\sum_{k=1}^c\exp[\beta_C(k)] \times \\ &\phantom{=|\;}{1\over N}\exp[\alpha]\exp[\beta_C(k)]\sum_{j=1}^r\exp[\beta_R(j)] \\ &={1\over N}\exp[\alpha]\exp[\beta_C(k)]\exp[\beta_R(j)] \times \\ &\phantom{=|\;}{1\over N}\exp[\alpha]\sum_{j=1}^r\exp[\beta_R(j)] \sum_{k=1}^c\exp[\beta_C(k)] \\ &={1\over N}\exp[\alpha]\exp[\beta_C(k)]\exp[\beta_R(j)]\times 1 \\ &=\mathbb{P}(R=j,C=k) \end{align*}\] for \(j = 1, \cdots, r\) and \(k = 1, \cdots, c\).

Independence of cross-classifying variables

Absence of the interaction \(R*C\) in a log-linear model implies that \(R\) and \(C\) are independent variables.

Absence of main effects is generally less interesting, and main effects are typically not removed from a log-linear model for a contingency table.

Interpretation for multi-way tables

In multi-way tables, absence of a two-factor interaction does not necessarily mean that the two variables are independent.

For example, consider the lymphoma dataset, with 3 binary classifying variables Sex (\(S\)), Cell type (\(C\)) and Remission (\(R\)).

After comparing the fit of several possible models, we find that a reasonable log-linear model for these data is \(R * C + C * S\).

Hence the interaction between Remission and Sex is absent.

lymphoma <- read_csv("data/lymphoma.csv", show_col_types = FALSE)
mod <- glm(Count ~ Remis * Cell + Cell * Sex, data = lymphoma, family = poisson)

Expected cell counts (vevox.app, 160-892-474)

coef(mod)
         (Intercept)             RemisYes          CellNodular 
               1.261               -2.015               -0.648 
             SexMale RemisYes:CellNodular  CellNodular:SexMale 
               1.179                3.219               -1.649 

Which of the following would give the expected count of female patients with nodular cell type who were in remission under model mod?

  • \(\exp(1.261) = 3.53\)
  • \(\exp(1.261 -2.015 -0.648) = 0.25\)
  • \(\exp(1.261 -2.015 -0.648 + 3.219) = 6.15\)
  • \(\exp(1.261 - 2.015 - 0.648 + 1.179 + 3.219 - 1.649) = 3.85\)

Expected cell counts

xtabs(fitted(mod) ~ Cell + Sex + Remis, data = lymphoma)
, , Remis = No

         Sex
Cell       Female    Male
  Diffuse  3.5294 11.4706
  Nodular  1.8462  1.1538

, , Remis = Yes

         Sex
Cell       Female    Male
  Diffuse  0.4706  1.5294
  Nodular  6.1538  3.8462

Fitted cell probabilities

N <- sum(lymphoma$Count)
p <- fitted(mod) / N

xtabs(p ~ Cell + Sex + Remis, data = lymphoma)
, , Remis = No

         Sex
Cell       Female    Male
  Diffuse 0.11765 0.38235
  Nodular 0.06154 0.03846

, , Remis = Yes

         Sex
Cell       Female    Male
  Diffuse 0.01569 0.05098
  Nodular 0.20513 0.12821

Marginal cell probabilities

The estimated probabilities for the two-way Sex/Remission margin are

# Convert fitted counts to probabilities before taking margins
p <- fitted(mod) / sum(lymphoma$Count)
lymphoma_p_table <- xtabs(p ~ Cell + Sex + Remis, data = lymphoma)
lymphoma_p_margins <- margin.table(lymphoma_p_table, margin = c(2, 3))
lymphoma_p_margins <- addmargins(lymphoma_p_margins)


Sex Remission: No Remission: Yes Sum
Female 0.1792 0.2208 0.4
Male 0.4208 0.1792 0.6
Sum 0.6 0.4 1.0

The estimated conditional probability of remission is \(0.2208/ 0.4 = 0.55\) for female patients, and \(0.1792 / 0.6 = 0.30\) for male patients.

So, remission (\(R\)) and sex (\(S\)) are not independent.

Conditional independence

What the model \(R*C+C*S\) implies is that \(R\) is independent of \(S\) conditional on \(C\), that is \[ \mathbb{P}(R,S|C)=\mathbb{P}(R|C)\mathbb{P}(S|C). \]

Another way of expressing this is \[ \mathbb{P}(R|S,C)=\mathbb{P}(R|C), \] that is, the probability of each level of \(R\), given a particular combination of \(S\) and \(C\), does not depend on which level \(S\) takes.

Equivalently, we can write \(\mathbb{P}(S|R,C)=\mathbb{P}(S|C)\).

Conditional probs. and conditional independence

This can be observed by calculating the estimated conditional probabilities of remission for the lymphoma dataset.

Cell Type Sex Remission: No Remission: Yes \(\hat{\mathbb{P}}(R \mid S, C)\)
Diffuse Female 0.1176 0.0157 0.12
Diffuse Male 0.3824 0.0510 0.12
Nodular Female 0.0615 0.2051 0.77
Nodular Male 0.0385 0.1282 0.77

The probability of remission given Cell type and Sex depends only on a patient’s Cell type, and not on their Sex.

So, \(\mathbb{P}(R|S, C) = \mathbb{P}(R|C)\), and \(R\) is independent of \(S\), given \(C\).

Although \(R\) and \(S\) are conditionally independent given \(C\), they are not marginally independent.
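We can verify this numerically from the fitted cell probabilities shown earlier (values transcribed from the fitted table):

```r
# Fitted probabilities, indexed as [Cell, Sex, Remis]
p <- array(c(0.11765, 0.06154, 0.38235, 0.03846,    # Remis = No
             0.01569, 0.20513, 0.05098, 0.12821),   # Remis = Yes
           dim = c(2, 2, 2),
           dimnames = list(Cell = c("Diffuse", "Nodular"),
                           Sex = c("Female", "Male"),
                           Remis = c("No", "Yes")))

# P(Remis = Yes | Sex, Cell): constant across Sex within each Cell type
round(p[, , "Yes"] / (p[, , "No"] + p[, , "Yes"]), 2)
```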

Summary

  • In two-way tables, if there is no interaction between the two cross-classifying variables, then those two variables are independent.
  • In three-way tables, we have seen an example with no interaction between a pair of variables, in which those variables were conditionally independent given the remaining variable.
  • Next, we will see a general result on conditional independence in multiway tables.

Chapter 6: Models for Categorical Data
Lecture 6.5 Interpretation for multiway tables

Recap

  • In two-way tables, if there is no interaction between the two cross-classifying variables, then those two variables are independent.
  • In three-way tables, we have seen an example with no interaction between a pair of variables (\(R\) and \(S\)), in which those variables were conditionally independent given the remaining variable (\(C\)).

Let’s look at a general result on conditional independence in multiway tables.

General result on conditional independence

In general, suppose we have an \(r\)-way contingency table with classifying variables \(X_1, \cdots ,X_r\).

A log-linear model which does not contain the \(X_1 * X_2\) interaction implies that \(X_1\) and \(X_2\) are conditionally independent given \(X_3, \cdots, X_r\), that is \[ \mathbb{P}(X_1,X_2|X_3,\cdots ,X_r)=\mathbb{P}(X_1|X_3,\cdots ,X_r)\mathbb{P}(X_2|X_3,\cdots ,X_r). \]


The proof is similar to the two-way case; therefore, we will not present it here.

Implications for three-way tables

Suppose the factors for a three-way table are \(X_1\), \(X_2\) and \(X_3\). We can list all models and their implications for the conditional independence structure:

  1. Model 1 containing the terms \(X_1, X_2, X_3\). All factors are mutually independent.
  2. Model 2 containing the terms \(X_1*X_2, X_3\). The factor \(X_3\) is jointly independent of \(X_1\) and \(X_2\).
  3. Model 3 containing the terms \(X_1*X_2, X_2*X_3\). The factors \(X_1\) and \(X_3\) are conditionally independent given \(X_2\).
  4. Model 4 containing the terms \(X_1*X_2, X_2*X_3, X_1*X_3\). There is no conditional independence structure. This is the model without the highest order interaction term.
  5. Model 5 containing \(X_1*X_2*X_3\). This is the saturated model. No more simplification of dependence structure is possible.
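As a sketch, we can fit representatives of these model types to the lymphoma counts (transcribed from the table earlier in the chapter) and compare their deviances; the model names `m1`, `m3`, `m5` below are illustrative:

```r
lymph <- expand.grid(Remis = c("No", "Yes"),
                     Sex = c("Female", "Male"),
                     Cell = c("Diffuse", "Nodular"))
lymph$Count <- c(3, 1, 12, 1, 2, 6, 1, 4)

m1 <- glm(Count ~ Remis + Sex + Cell, family = poisson, data = lymph)         # mutual independence
m3 <- glm(Count ~ Remis * Cell + Cell * Sex, family = poisson, data = lymph)  # Remis indep. of Sex given Cell
m5 <- glm(Count ~ Remis * Sex * Cell, family = poisson, data = lymph)         # saturated: deviance 0
round(c(m1 = deviance(m1), m3 = deviance(m3), m5 = deviance(m5)), 3)
```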

Interpreting four-way table (vevox.app, 160-892-474)

Suppose that we have a four-way table, cross-classified by variables \(A\), \(B\), \(C\), and \(D\).

Suppose we fit the model

mod <- glm(Count ~ A + A*B + B*C + C*D, family = poisson)

Under the model mod, which of the following statements are true? Select all that apply.

  • \(A\) is independent of \(B\).
  • \(A\) is conditionally independent of \(B\) given \(C\) and \(D\).
  • \(A\) is conditionally independent of \(D\) given \(B\) and \(C\).
  • \(B\) is conditionally independent of \(C\) given \(A\) and \(D\).
  • \(B\) is conditionally independent of \(D\) given \(A\) and \(C\).

Simpson’s paradox

In 1972-74, a survey of women was carried out in an area of Newcastle. A follow-up survey was carried out 20 years later.

A summary of the responses is

Smoker Dead Alive \(\hat{\mathbb{P}}(\text{Dead} \mid \text{Smoker})\)
Yes 139 443 0.24
No 230 502 0.31

Looking at this table, it appears that the non-smokers had a greater probability of dying: can that be right?

Simpson’s paradox explained

There is an important extra variable to be considered, related to both smoking habit and mortality: age (at the time of the initial survey).

When we consider this variable, we get

Age Smoker Dead Alive \(\hat{\mathbb{P}}(\text{Dead} \mid \text{Age}, \text{Smoker})\)
18–34 Yes 5 174 0.03
18–34 No 6 213 0.03
35–44 Yes 14 95 0.13
35–44 No 7 114 0.06
45–54 Yes 27 103 0.21
45–54 No 12 66 0.15
55–64 Yes 51 64 0.44
55–64 No 40 81 0.33
65–74 Yes 29 7 0.81
65–74 No 101 28 0.78
75– Yes 13 0 1
75– No 64 0 1
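The reversal can be checked directly from the counts in the two tables above (transcribed by age band):

```r
age <- c("18-34", "35-44", "45-54", "55-64", "65-74", "75-")
smoker_dead  <- c(5, 14, 27, 51, 29, 13)
smoker_alive <- c(174, 95, 103, 64, 7, 0)
non_dead     <- c(6, 7, 12, 40, 101, 64)
non_alive    <- c(213, 114, 66, 81, 28, 0)

# Marginal rates: non-smokers appear worse off (0.24 vs 0.31)...
sum(smoker_dead) / sum(smoker_dead + smoker_alive)
sum(non_dead) / sum(non_dead + non_alive)

# ...but within every age band the smokers' rate is at least as high
rates <- rbind(smoker = smoker_dead / (smoker_dead + smoker_alive),
               non_smoker = non_dead / (non_dead + non_alive))
colnames(rates) <- age
round(rates, 2)
```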

Simpson’s paradox explained

Conditional on every age at outset, it is now the smokers who have a higher probability of dying.

The marginal association is reversed in the table conditional on age, because mortality and smoking are associated with age. There are proportionally many fewer smokers in the older age-groups (where the probability of death is greater).

When making inferences about associations between variables, it is important that all other variables which are relevant are considered. Marginal inferences may lead to misleading conclusions.

Summary

  • We have seen how to interpret log-linear models in terms of (conditional) independence between variables.
  • We have seen through Simpson’s paradox how marginal associations can mask or reverse true conditional relationships.

Mission Accomplished

This completes the lecture material!