Part II, Chapter 6: Models for Categorical Data
08/05/2025
The aim of this chapter is to cover the theory and application of generalised linear models (GLMs).
These slides are based on material written by previous lecturers of this course, including Sujit Sahu, Dave Woods, and Chao Zheng.
Week | Lab | Session 1 (Thursday) | Session 2 (Friday) | Session 3 (Friday) | Problem sheet |
---|---|---|---|---|---|
07 | No lab | 5.1 Exponential family | 5.2 Components of a GLM | 5.3 Examples of GLMs | Sheet 4 |
08 | Lab 5 | 5.4 MLE | 5.5 Confidence intervals | PC: sheet 4 | |
09 | Lab 6 | 5.6 Comparing GLMs | 5.7 Deviance | 5.8 Models with unknown scale | Sheet 5 |
10 | Lab 7 | 6.1 Contingency tables | 6.2 Log-linear models | PC: sheet 5 | |
11 | Lab 8 | 6.3 Multinomial sampling | 6.4 Interpretation for two-way tables | 6.5 Interpretation for multiway tables | Sheet 6 |
12 | Lab 9 | Revision | Revision | PC: sheet 6 | |
A particularly important application of generalised linear models is the analysis of categorical data.
Here, the data are observations of one or more categorical variables for each of a number of units (often individuals). Each unit is therefore cross-classified by the categorical variables.
For example, the `job` dataset gives the job satisfaction and income band of 901 individuals from the 1984 General Social Survey.
Income ($) | Very Dissat. | A Little Dissat. | Moderately Sat. | Very Sat. |
---|---|---|---|---|
<6000 | 20 | 24 | 80 | 82 |
6000–15000 | 22 | 38 | 104 | 125 |
15000–25000 | 13 | 28 | 81 | 113 |
>25000 | 7 | 18 | 54 | 92 |
A cross-classification table like this is called a contingency table. This is a two-way table, as there are two classifying variables.
It might also be described as a \(4 \times 4\) contingency table (as each of the classifying variables takes one of four possible levels).
Each position in a contingency table is called a cell and the number of individuals in a particular cell is the cell count.
Partial classifications derived from the table are called margins. For a two-way table these are often displayed in the margins of the table.
Our aim in Chapter 6 is to model the cell counts of a contingency table and say something about the relationship between the cross-classifying variables. E.g., is there a relationship between income and job satisfaction?
The `job` dataset, with margins

As before,
Income ($) | Very Dissat. | A Little Dissat. | Moderately Sat. | Very Sat. | Sum |
---|---|---|---|---|---|
<6000 | 20 | 24 | 80 | 82 | 206 |
6000–15000 | 22 | 38 | 104 | 125 | 289 |
15000–25000 | 13 | 28 | 81 | 113 | 235 |
>25000 | 7 | 18 | 54 | 92 | 171 |
Sum | 62 | 108 | 319 | 412 | 901 |
We have added one-way margins, which represent the classification of items by a single variable: income group and job satisfaction, respectively.
The `lymphoma` dataset gives information about 30 patients, classified by cell type of lymphoma, sex, and response to treatment. This is an example of a three-way contingency table. It is a \(2\times 2\times 2\) or \(2^3\) table.
Cell Type | Sex | Remission: No | Remission: Yes |
---|---|---|---|
Diffuse | Female | 3 | 1 |
Diffuse | Male | 12 | 1 |
Nodular | Female | 2 | 6 |
Nodular | Male | 1 | 4 |
For multiway tables, higher-order margins may be calculated. For example, for `lymphoma`, the two-way `Cell Type`/`Sex` margin is
Cell Type | Female | Male |
---|---|---|
Diffuse | 4 | 13 |
Nodular | 8 | 5 |
We can model contingency table data using generalised linear models.
To do this, we assume that the cell counts are observations of independent Poisson random variables.
This is intuitively sensible as the cell counts are non-negative integers (the sample space for the Poisson distribution).
Therefore, if the table has \(n\) cells, which we label \(1, \cdots, n\), then the observed cell counts \(y_1, \cdots, y_n\) are assumed to be observations of independent Poisson random variables \(Y_1, \cdots, Y_n\).
We denote the means of these Poisson random variables by \(\mu_1, \cdots , \mu_n\).
The canonical link function for the Poisson distribution is the log function, and we assume this link function throughout. A generalised linear model for Poisson data using the log link function is called a log-linear model.
The explanatory variables in a log-linear model for contingency table data are the cross-classifying variables.
As these variables are categorical, they are factors. As usual with factors, we can include interactions in the model as well as just main effects.
Such a model will describe how the expected count in each cell depends on the classifying variables, and any interactions between them. Interpretation of these models will be discussed later on.
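In R, a log-linear model of this kind is fitted with `glm` using `family = poisson`. Here is a minimal sketch with hypothetical factors `A` and `B` and a toy count column (these names and numbers are illustrative, not from the course data):

```r
# Toy two-way table in long format: hypothetical factors A and B
df <- data.frame(
  A = factor(rep(c("a1", "a2"), each = 2)),
  B = factor(rep(c("b1", "b2"), times = 2)),
  count = c(10, 20, 30, 40)
)

# Main effects and interaction (here a saturated model: 4 cells, 4 parameters)
fit <- glm(count ~ A * B, family = poisson, data = df)
coef(fit)
```

Because this toy model is saturated, the fitted values reproduce the observed counts exactly; dropping the `A:B` interaction would impose structure on the cell means.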
The `job` dataset in long format

The original data structure of the `job` dataset is
Income | Satisfaction | Count |
---|---|---|
<6000 | Very Dissatisfied | 20 |
<6000 | A Little Dissatisfied | 24 |
<6000 | Moderately Satisfied | 80 |
<6000 | Very Satisfied | 82 |
6000–15000 | Very Dissatisfied | 22 |
6000–15000 | A Little Dissatisfied | 38 |
6000–15000 | Moderately Satisfied | 104 |
6000–15000 | Very Satisfied | 125 |
15000–25000 | Very Dissatisfied | 13 |
15000–25000 | A Little Dissatisfied | 28 |
15000–25000 | Moderately Satisfied | 81 |
15000–25000 | Very Satisfied | 113 |
>25000 | Very Dissatisfied | 7 |
>25000 | A Little Dissatisfied | 18 |
>25000 | Moderately Satisfied | 54 |
>25000 | Very Satisfied | 92 |
The original format of the `job` dataset contains the same data as the contingency table, but in a different format, sometimes called long format.
A log-linear model is just a Poisson GLM, where the response variable is `Count`, and `Income` and `Satisfaction` are explanatory variables.
Recall that `Income` and `Satisfaction` are both factors with 4 levels.
What is \(q\), the number of parameters in `loglin_job_1`?
What is \(p\), the number of parameters in `loglin_job_2`?
Log-linear model for `job` (`loglin_job_1`)
Call:
glm(formula = Count ~ Income + Satisfaction, family = poisson,
data = job)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.2065 0.1140 28.12 <2e-16 ***
Income>25000 -0.1862 0.1035 -1.80 0.0719 .
Income15000-25000 0.1317 0.0954 1.38 0.1676
Income6000-15000 0.3386 0.0912 3.71 0.0002 ***
SatisfactionModerately Satisfied 1.0831 0.1113 9.73 <2e-16 ***
SatisfactionVery Dissatisfied -0.5550 0.1593 -3.48 0.0005 ***
SatisfactionVery Satisfied 1.3389 0.1081 12.39 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 445.763 on 15 degrees of freedom
Residual deviance: 12.037 on 9 degrees of freedom
AIC: 115.1
Number of Fisher Scoring iterations: 4
Log-linear model for `job` (`loglin_job_2`)
Call:
glm(formula = Count ~ Income * Satisfaction, family = poisson,
data = job)
Coefficients:
Estimate Std. Error z value
(Intercept) 3.1781 0.2041 15.57
Income>25000 -0.2877 0.3118 -0.92
Income15000-25000 0.1542 0.2782 0.55
Income6000-15000 0.4595 0.2607 1.76
SatisfactionModerately Satisfied 1.2040 0.2327 5.17
SatisfactionVery Dissatisfied -0.1823 0.3028 -0.60
SatisfactionVery Satisfied 1.2287 0.2321 5.29
Income>25000:SatisfactionModerately Satisfied -0.1054 0.3581 -0.29
Income15000-25000:SatisfactionModerately Satisfied -0.1417 0.3197 -0.44
Income6000-15000:SatisfactionModerately Satisfied -0.1972 0.3002 -0.66
Income>25000:SatisfactionVery Dissatisfied -0.7621 0.5386 -1.42
Income15000-25000:SatisfactionVery Dissatisfied -0.5849 0.4520 -1.29
Income6000-15000:SatisfactionVery Dissatisfied -0.3642 0.4043 -0.90
Income>25000:SatisfactionVery Satisfied 0.4028 0.3468 1.16
Income15000-25000:SatisfactionVery Satisfied 0.1665 0.3137 0.53
Income6000-15000:SatisfactionVery Satisfied -0.0379 0.2969 -0.13
Pr(>|z|)
(Intercept) < 2e-16 ***
Income>25000 0.356
Income15000-25000 0.579
Income6000-15000 0.078 .
SatisfactionModerately Satisfied 2.3e-07 ***
SatisfactionVery Dissatisfied 0.547
SatisfactionVery Satisfied 1.2e-07 ***
Income>25000:SatisfactionModerately Satisfied 0.769
Income15000-25000:SatisfactionModerately Satisfied 0.658
Income6000-15000:SatisfactionModerately Satisfied 0.511
Income>25000:SatisfactionVery Dissatisfied 0.157
Income15000-25000:SatisfactionVery Dissatisfied 0.196
Income6000-15000:SatisfactionVery Dissatisfied 0.368
Income>25000:SatisfactionVery Satisfied 0.246
Income15000-25000:SatisfactionVery Satisfied 0.596
Income6000-15000:SatisfactionVery Satisfied 0.898
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4.4576e+02 on 15 degrees of freedom
Residual deviance: -1.1990e-14 on 0 degrees of freedom
AIC: 121
Number of Fisher Scoring iterations: 3
Now we conduct a log-likelihood ratio test to compare the two models:
Analysis of Deviance Table
Model 1: Count ~ Income + Satisfaction
Model 2: Count ~ Income * Satisfaction
Resid. Df Resid. Dev Df Deviance
1 9 12
2 0 0 9 12
Here, \(L_{01}\) is \(12.037\), given in the deviance column.
Under \(H_0\), \(L_{01} \sim \chi^2_9\).
We should reject \(H_0\) if \(L_{01} > k\). Which of the following commands could be used to find \(k\), for a test at the \(5\%\) level?
dchisq(0.05, df = 9)
dchisq(0.95, df = 9)
pchisq(0.05, df = 9)
pchisq(0.95, df = 9)
qchisq(0.05, df = 9)
qchisq(0.95, df = 9)
Since \(L_{01} = 12.037 < 16.9\), we do not reject \(H_0\), and we prefer the simpler model, `loglin_job_1`.
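To confirm the arithmetic, the critical value (and the equivalent p-value) can be computed directly with base R's chi-squared functions:

```r
# Critical value k for a test at the 5% level: the 0.95 quantile of the
# chi-squared distribution with 9 degrees of freedom
qchisq(0.95, df = 9)    # approximately 16.92

# Equivalently, the p-value of the observed test statistic
pchisq(12.037, df = 9, lower.tail = FALSE)    # approximately 0.21
```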
Later, we’ll see that we can interpret `loglin_job_1` as meaning that there is no dependence of job satisfaction on income.
The `lymphoma` data contingency table

Cell Type | Sex | Remission: No | Remission: Yes |
---|---|---|---|
Diffuse | Female | 3 | 1 |
Diffuse | Male | 12 | 1 |
Nodular | Female | 2 | 6 |
Nodular | Male | 1 | 4 |
The `lymphoma` data in long format

Cell | Sex | Remis | Count |
---|---|---|---|
Nodular | Male | No | 1 |
Nodular | Male | Yes | 4 |
Nodular | Female | No | 2 |
Nodular | Female | Yes | 6 |
Diffuse | Male | No | 12 |
Diffuse | Male | Yes | 1 |
Diffuse | Female | No | 3 |
Diffuse | Female | Yes | 1 |
Modelling the `lymphoma` data

A log-linear model for the contingency table is just a Poisson GLM for these data, where in this case the response variable is `Count`, and `Sex`, `Remis`, and `Cell` are explanatory variables.
The aim is to understand how the probability of remission varies by type of lymphoma and sex.
You will consider log-linear models for this in Lab 09.
Note
In some cases, it might seem more natural to consider a binomial model for the number of patients in remission, given the total number in each group. We will see later on that the inference assuming this binomial model would be the same as assuming the Poisson log-linear model.
Let’s return to an example to demonstrate these concepts.
The `job` dataset, with margins

As before,
Income ($) | Very Dissat. | A Little Dissat. | Moderately Sat. | Very Sat. | Sum |
---|---|---|---|---|---|
<6000 | 20 | 24 | 80 | 82 | 206 |
6000–15000 | 22 | 38 | 104 | 125 | 289 |
15000–25000 | 13 | 28 | 81 | 113 | 235 |
>25000 | 7 | 18 | 54 | 92 | 171 |
Sum | 62 | 108 | 319 | 412 | 901 |
The one-way margins represent the classification of items by a single variable: income group and job satisfaction, respectively.
If the table has \(n\) cells, which we label \(1, \cdots ,n\), then the observed cell counts \(y_1, \cdots ,y_n\) are assumed to be observations of independent Poisson random variables \(Y_1, \cdots ,Y_n\).
We denote the means of these Poisson random variables by \(\mu_1, \cdots, \mu_n\), and model \(\log(\mu_i) = \eta_i\), where \(\boldsymbol{\eta}\) is the linear predictor.
The explanatory variables in a log-linear model for contingency table data are the cross-classifying variables. As these variables are categorical, they are factors. As usual with factors, we can include interactions in the model as well as just main effects.
Such a model will describe how the expected count in each cell depends on the classifying variables, and any interactions between them. Interpretation of these models will be discussed later on.
The `job` dataset in long format

The original data structure of the `job` dataset is
Income | Satisfaction | Count |
---|---|---|
<6000 | Very Dissatisfied | 20 |
<6000 | A Little Dissatisfied | 24 |
<6000 | Moderately Satisfied | 80 |
<6000 | Very Satisfied | 82 |
6000–15000 | Very Dissatisfied | 22 |
6000–15000 | A Little Dissatisfied | 38 |
6000–15000 | Moderately Satisfied | 104 |
6000–15000 | Very Satisfied | 125 |
15000–25000 | Very Dissatisfied | 13 |
15000–25000 | A Little Dissatisfied | 28 |
15000–25000 | Moderately Satisfied | 81 |
15000–25000 | Very Satisfied | 113 |
>25000 | Very Dissatisfied | 7 |
>25000 | A Little Dissatisfied | 18 |
>25000 | Moderately Satisfied | 54 |
>25000 | Very Satisfied | 92 |
The original format of the `job` dataset contains the same data as the contingency table, but in a different format, sometimes called long format.
A log-linear model is just a Poisson GLM, where the response variable is `Count`, and `Income` and `Satisfaction` are explanatory variables.
job <- read_csv("data/job.csv", show_col_types = FALSE)
job$Income <- factor(job$Income, levels = c("<6000", "6000-15000", "15000-25000", ">25000"))
job$Satisfaction <- factor(job$Satisfaction, levels = c("Very Dissatisfied", "A Little Dissatisfied", "Moderately Satisfied", "Very Satisfied"))
# Fit the models
loglin_job_1 <- glm(Count ~ Income + Satisfaction, family = poisson, data = job)
loglin_job_2 <- glm(Count ~ Income * Satisfaction, family = poisson, data = job)
Log-linear model for `job` (`loglin_job_1`)
Call: glm(formula = Count ~ Income + Satisfaction, family = poisson,
data = job)
Coefficients:
(Intercept) Income6000-15000
2.652 0.339
Income15000-25000 Income>25000
0.132 -0.186
SatisfactionA Little Dissatisfied SatisfactionModerately Satisfied
0.555 1.638
SatisfactionVery Satisfied
1.894
Degrees of Freedom: 15 Total (i.e. Null); 9 Residual
Null Deviance: 446
Residual Deviance: 12 AIC: 115
Call:
glm(formula = Count ~ Income + Satisfaction, family = poisson,
data = job)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.6515 0.1410 18.81 <2e-16 ***
Income6000-15000 0.3386 0.0912 3.71 0.0002 ***
Income15000-25000 0.1317 0.0954 1.38 0.1676
Income>25000 -0.1862 0.1035 -1.80 0.0719 .
SatisfactionA Little Dissatisfied 0.5550 0.1593 3.48 0.0005 ***
SatisfactionModerately Satisfied 1.6381 0.1388 11.80 <2e-16 ***
SatisfactionVery Satisfied 1.8939 0.1362 13.90 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 445.763 on 15 degrees of freedom
Residual deviance: 12.037 on 9 degrees of freedom
AIC: 115.1
Number of Fisher Scoring iterations: 4
(Intercept) Income6000-15000
2.6515 0.3386
Income15000-25000 Income>25000
0.1317 -0.1862
SatisfactionA Little Dissatisfied SatisfactionModerately Satisfied
0.5550 1.6381
SatisfactionVery Satisfied
1.8939
The fitted values \(\hat \mu_i = \exp(\mathbf{x}_i^{\top} \hat{\boldsymbol{\beta}})\) give us the expected counts in each cell of the contingency table.
Which of the following would give the expected count of people with income \(<6000\) who were `A Little Dissatisfied` under model `loglin_job_1`?
job_tab <- xtabs(Count ~ Income + Satisfaction, data = job)
# Puts Arbitrary Margins on Multidimensional Tables or Arrays
addmargins(job_tab)
Satisfaction
Income Very Dissatisfied A Little Dissatisfied Moderately Satisfied
<6000 20 24 80
6000-15000 22 38 104
15000-25000 13 28 81
>25000 7 18 54
Sum 62 108 319
Satisfaction
Income Very Satisfied Sum
<6000 82 206
6000-15000 125 289
15000-25000 113 235
>25000 92 171
Sum 412 901
fit1_tab <- xtabs(fitted(loglin_job_1) ~ Income + Satisfaction, data = job)
# Puts Arbitrary Margins on Multidimensional Tables or Arrays
addmargins(fit1_tab)
Satisfaction
Income Very Dissatisfied A Little Dissatisfied Moderately Satisfied
<6000 14.18 24.69 72.93
6000-15000 19.89 34.64 102.32
15000-25000 16.17 28.17 83.20
>25000 11.77 20.50 60.54
Sum 62.00 108.00 319.00
Satisfaction
Income Very Satisfied Sum
<6000 94.20 206.00
6000-15000 132.15 289.00
15000-25000 107.46 235.00
>25000 78.19 171.00
Sum 412.00 901.00
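As a sanity check on the fitted table: `<6000` and `Very Dissatisfied` are the baseline levels of the two factors, so the fitted count for the (`<6000`, `A Little Dissatisfied`) cell is the exponential of the intercept plus the single `Satisfaction` main effect, using the coefficient estimates printed above:

```r
# Fitted mean for the (<6000, A Little Dissatisfied) cell under loglin_job_1:
# baseline income level, so only the intercept and one Satisfaction effect enter
exp(2.6515 + 0.5550)    # approximately 24.69, matching the fitted table
```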
Log-linear model for `job` (`loglin_job_2`)
Call:
glm(formula = Count ~ Income * Satisfaction, family = poisson,
data = job)
Coefficients:
Estimate Std. Error z value
(Intercept) 2.9957 0.2236 13.40
Income6000-15000 0.0953 0.3090 0.31
Income15000-25000 -0.4308 0.3563 -1.21
Income>25000 -1.0498 0.4392 -2.39
SatisfactionA Little Dissatisfied 0.1823 0.3028 0.60
SatisfactionModerately Satisfied 1.3863 0.2500 5.55
SatisfactionVery Satisfied 1.4110 0.2494 5.66
Income6000-15000:SatisfactionA Little Dissatisfied 0.3642 0.4043 0.90
Income15000-25000:SatisfactionA Little Dissatisfied 0.5849 0.4520 1.29
Income>25000:SatisfactionA Little Dissatisfied 0.7621 0.5386 1.42
Income6000-15000:SatisfactionModerately Satisfied 0.1671 0.3429 0.49
Income15000-25000:SatisfactionModerately Satisfied 0.4432 0.3896 1.14
Income>25000:SatisfactionModerately Satisfied 0.6568 0.4732 1.39
Income6000-15000:SatisfactionVery Satisfied 0.3263 0.3401 0.96
Income15000-25000:SatisfactionVery Satisfied 0.7515 0.3847 1.95
Income>25000:SatisfactionVery Satisfied 1.1649 0.4647 2.51
Pr(>|z|)
(Intercept) < 2e-16 ***
Income6000-15000 0.758
Income15000-25000 0.227
Income>25000 0.017 *
SatisfactionA Little Dissatisfied 0.547
SatisfactionModerately Satisfied 2.9e-08 ***
SatisfactionVery Satisfied 1.5e-08 ***
Income6000-15000:SatisfactionA Little Dissatisfied 0.368
Income15000-25000:SatisfactionA Little Dissatisfied 0.196
Income>25000:SatisfactionA Little Dissatisfied 0.157
Income6000-15000:SatisfactionModerately Satisfied 0.626
Income15000-25000:SatisfactionModerately Satisfied 0.255
Income>25000:SatisfactionModerately Satisfied 0.165
Income6000-15000:SatisfactionVery Satisfied 0.337
Income15000-25000:SatisfactionVery Satisfied 0.051 .
Income>25000:SatisfactionVery Satisfied 0.012 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4.4576e+02 on 15 degrees of freedom
Residual deviance: 2.2649e-14 on 0 degrees of freedom
AIC: 121
Number of Fisher Scoring iterations: 3
There are \(24\) people with income \(<6000\) who were `A Little Dissatisfied`. The expected count in this cell under model `loglin_job_1` is \(24.7\). What is the expected count in this cell under `loglin_job_2`?
Discussion
Notice that `loglin_job_2` is the saturated model (\(16\) cells and \(16\) parameters); therefore, it fits the data exactly.
This happens when the model includes (i) all main effects and (ii) all interactions needed to estimate each cell mean uniquely.
job_tab <- xtabs(Count ~ Income + Satisfaction, data = job)
# Puts Arbitrary Margins on Multidimensional Tables or Arrays
addmargins(job_tab)
Satisfaction
Income Very Dissatisfied A Little Dissatisfied Moderately Satisfied
<6000 20 24 80
6000-15000 22 38 104
15000-25000 13 28 81
>25000 7 18 54
Sum 62 108 319
Satisfaction
Income Very Satisfied Sum
<6000 82 206
6000-15000 125 289
15000-25000 113 235
>25000 92 171
Sum 412 901
fit2_tab <- xtabs(fitted(loglin_job_2) ~ Income + Satisfaction, data = job)
# Puts Arbitrary Margins on Multidimensional Tables or Arrays
addmargins(fit2_tab)
Satisfaction
Income Very Dissatisfied A Little Dissatisfied Moderately Satisfied
<6000 20 24 80
6000-15000 22 38 104
15000-25000 13 28 81
>25000 7 18 54
Sum 62 108 319
Satisfaction
Income Very Satisfied Sum
<6000 82 206
6000-15000 125 289
15000-25000 113 235
>25000 92 171
Sum 412 901
Although the assumption of Poisson distributed observations is convenient for the purposes of modelling, it might not be a realistic assumption, because of the way in which the data have been collected.
Frequently, when contingency table data are obtained, the total number of observations (the grand total, the sum of all the cell counts) is fixed in advance.
In this case, no individual cell count can exceed the pre-specified fixed total, so the assumption of Poisson sampling is invalid as the sample space is bounded. Furthermore, with a fixed total, the observations can no longer be observations of independent random variables.
The `lymphoma` data contingency table

The `lymphoma` dataset gives information about 30 patients, classified by cell type of lymphoma, sex, and response to treatment. This is an example of a three-way contingency table. It is a \(2\times 2\times 2\) or \(2^3\) table.
Cell Type | Sex | Remission: No | Remission: Yes |
---|---|---|---|
Diffuse | Female | 3 | 1 |
Diffuse | Male | 12 | 1 |
Nodular | Female | 2 | 6 |
Nodular | Male | 1 | 4 |
The `lymphoma` data contingency table

For the `lymphoma` contingency table, the appropriate model to use depends on the process by which these data were collected.
When the grand total is fixed, a more appropriate distribution for the cell counts is the multinomial distribution.
The multinomial distribution is the distribution of cell counts arising when a pre-specified total of \(N\) items are each independently assigned to one of \(n\) cells, where the probability of being classified into cell \(i\) is \(p_i\), \({i=1, \cdots, n}\), so \(\sum_{i=1}^n p_i=1\).
The probability function for the multinomial distribution is \[\begin{align*} f_{\mathbf{Y}}(\mathbf{y};{\mathbf{p}}) &= P(Y_1=y_1,\cdots ,Y_n=y_n) \cr &= \begin{cases} N!\prod_{i=1}^n \frac{p_i^{y_i}}{y_i!} & \text{if $\sum_{i=1}^n y_i=N$} \\ 0 & \text{otherwise.} \end{cases} \end{align*}\]
The binomial is the special case of the multinomial with two cells (\(n=2\)).
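This special case is easy to verify numerically in R with `dmultinom` and `dbinom` (an illustrative check with arbitrary counts and probabilities):

```r
# A two-cell multinomial is a binomial: P(Y1 = 3, Y2 = 7) with N = 10
p_binom <- dbinom(3, size = 10, prob = 0.4)
p_multi <- dmultinom(c(3, 7), size = 10, prob = c(0.4, 0.6))
all.equal(p_binom, p_multi)    # TRUE
```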
When the data have been obtained by multinomial sampling, we suppose \(\mathbf{Y} \sim \text{Multinomial}(N, \mathbf{p})\), where \(\log\mu_i=\log(N \cdot p_i)\), \({i=1, \cdots, n}\), is modelled as a linear function of explanatory variables, so that \(\log \mu_i = \mathbf{x}_i^{\top} \boldsymbol{\beta}\).
Since \(\sum_{i=1}^n p_i = 1\), we have \(\sum_{i=1}^n \mu_i=N\), the grand total, which is fixed in advance.
Suppose we model instead \(Y_i \sim \text{Poisson}(\mu_i)\), where \(\log \mu_i = \mathbf{x}_i^{\top} \boldsymbol{\beta}\). Suppose that we include an intercept term, e.g. \(x_{i1} = 1\) for \(i = 1, \cdots,n\), so \(\log \mu_i = \beta_1 + \sum_{j=2}^p x_{ij} \beta_j\).
Important
The maximum likelihood estimates for multinomial log-linear parameters are identical to those for Poisson log-linear parameters (with intercept). Furthermore, the maximised log-likelihoods for both Poisson and multinomial models take the form \(\sum_{i=1}^ny_i\log\hat{\mu}_i\), where \(\log \hat{\mu}_i = \mathbf{x}_i^{\top} \boldsymbol{\hat{\beta}}\).
We will see a sketch of the proof of this result (this is also in the lecture notes).
Recall that the maximised log-likelihoods for both Poisson and multinomial models take the form \(\sum_{i=1}^ny_i\log\hat{\mu}_i\) as functions of the log-linear parameter estimates.
So any inferences based on maximised log-likelihoods (such as likelihood ratio tests) will be the same.
Therefore, we can analyse contingency table data using Poisson log-linear models, even when the data has been obtained by multinomial sampling. All that is required is that we ensure that the Poisson model contains an intercept term.
Demonstration in R

# Fit models
(Part 2)# Fit models
mod1 <- glm(formula = count ~ 0 + A + B, family = "poisson", data = df)
mod2 <- glm(formula = count ~ 1 + A + B, family = "poisson", data = df)
ct1 <- xtabs(fitted(mod1) ~ A + B, data = df)
ct2 <- xtabs(fitted(mod2) ~ A + B, data = df)
ct1 <- addmargins(ct1)
ct2 <- addmargins(ct2)
# Fitted values without intercept
ct1
B
A 0 1 Sum
0 1.0000 0.7913 1.7913
1 2.7913 2.2087 5.0000
Sum 3.7913 3.0000 6.7913
# Fitted values with intercept
ct2
B
A 0 1 Sum
0 3.5 1.5 5.0
1 3.5 1.5 5.0
Sum 7.0 3.0 10.0
Notice that only the model with an intercept (`mod2`) reproduces the observed grand total of \(10\); without an intercept (`mod1`), the fitted cell means need not sum to the grand total.
Sometimes margins other than just the grand total may be pre-specified.
For example, consider the `lymphoma` contingency table.
Cell Type | Sex | Remission: No | Remission: Yes |
---|---|---|---|
Diffuse | Female | 3 | 1 |
Diffuse | Male | 12 | 1 |
Nodular | Female | 2 | 6 |
Nodular | Male | 1 | 4 |
It may have been decided at the outset to collect data on \(18\) male patients and \(12\) female patients. Alternatively, perhaps the distribution of both the `Sex` and `Cell Type` of the patients was fixed in advance.
In cases where a margin is fixed by design, the data consist of a number of fixed total subgroups, defined by the fixed margin. Neither Poisson nor multinomial sampling assumptions are valid.
The appropriate distribution combines a separate, independent multinomial for each subgroup.

If the `Sex` margin is fixed, then we should model the data with two independent multinomials, one for males with \(N=18\) and one for females with \(N=12\). Each of these multinomials has four cells, representing the cross-classification of the relevant patients by `Cell Type` and `Remission`.

If the `Cell Type`/`Sex` margin has been fixed, then the appropriate distribution is four independent two-cell multinomials (binomials), representing the remission classification for each of the four fixed-total patient subgroups.

When the data are modelled using independent multinomials, the joint probability function for the cell counts \(Y_1, \cdots, Y_n\) is the product of multinomial probability functions, one for each fixed-total subgroup. We call this distribution a product multinomial.
For example, if the `Sex` margin is fixed for `lymphoma`, then the product multinomial distribution has the form \[
f_{\mathbf{Y}}(\mathbf{y};{\mathbf{p}})=
\begin{cases}
N_m!\prod_{i=1}^4 \frac{p_{mi}^{y_{mi}}}{y_{mi}!}\,
N_f!\prod_{i=1}^4 \frac{p_{fi}^{y_{fi}}}{y_{fi}!} &
\text{if $\sum_{i=1}^4 y_{mi}=N_m$ and
$\sum_{i=1}^4 y_{fi}=N_f$} \\
0 & \text{otherwise,}
\end{cases}
\] where \(N_m\) and \(N_f\) are the fixed totals of male and female patients, \(y_{mi}\) and \(y_{fi}\) are the cell counts, and \(p_{mi}\) and \(p_{fi}\) are the corresponding cell probabilities.
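For illustration, this product multinomial probability can be evaluated for the observed `lymphoma` counts as the product of two `dmultinom` terms. Here the cell probabilities are set to the observed within-sex proportions (the saturated-model estimates), purely as a sketch:

```r
# Observed counts, ordered (Diffuse/No, Diffuse/Yes, Nodular/No, Nodular/Yes)
y_m <- c(12, 1, 1, 4)    # males, N_m = 18
y_f <- c(3, 1, 2, 6)     # females, N_f = 12

# Cell probabilities: observed proportions within each sex
p_m <- y_m / sum(y_m)
p_f <- y_f / sum(y_f)

# Product multinomial probability of the observed table
dmultinom(y_m, size = 18, prob = p_m) * dmultinom(y_f, size = 12, prob = p_f)
```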
Using similar reasoning to the multinomial sampling case, we can analyse contingency table data using Poisson log-linear models, even when the data has been obtained by product multinomial sampling.
However, we must ensure that the Poisson model contains a term corresponding to the fixed margin (and all marginal terms). Then, the estimated means for the specified margin are equal to the corresponding fixed totals.
Therefore, when analysing product multinomial data using a Poisson log-linear model, certain terms must be present in any model we fit. If they are removed, the inferences would no longer be valid.
The `lymphoma` data

For the `lymphoma` dataset, for inferences obtained using a Poisson model to be valid when the `Sex` margin is fixed in advance, the Poisson model must contain the `Sex` main effect (and the intercept).
mod_sex <- glm(Count ~ 1 + Sex, data = lymphoma, family = poisson)
# Female: 12, Male: 18
ct_sex <- xtabs(fitted(mod_sex) ~ Cell + Remis + Sex , data = lymphoma)
addmargins(ct_sex)
, , Sex = Female
Remis
Cell No Yes Sum
Diffuse 3.0 3.0 6.0
Nodular 3.0 3.0 6.0
Sum 6.0 6.0 12.0
, , Sex = Male
Remis
Cell No Yes Sum
Diffuse 4.5 4.5 9.0
Nodular 4.5 4.5 9.0
Sum 9.0 9.0 18.0
, , Sex = Sum
Remis
Cell No Yes Sum
Diffuse 7.5 7.5 15.0
Nodular 7.5 7.5 15.0
Sum 15.0 15.0 30.0
The `lymphoma` data

For inferences obtained using a Poisson model to be valid when the `Cell Type`/`Sex` margin is fixed in advance, the Poisson model must contain the `Cell Type`/`Sex` interaction and all marginal terms (the `Cell Type` main effect, the `Sex` main effect, and the intercept).
mod_cell_sex <- glm(Count ~ 1 + Cell + Sex + Cell:Sex, data = lymphoma, family = poisson)
# Female and Diffuse : 4, Female and Nodular: 8, Male and Diffuse: 13, Male and Nodular: 5
ct_cell_sex <- xtabs(fitted(mod_cell_sex) ~ Remis + Cell + Sex, data = lymphoma)
addmargins(ct_cell_sex)
, , Sex = Female
Cell
Remis Diffuse Nodular Sum
No 2.0 4.0 6.0
Yes 2.0 4.0 6.0
Sum 4.0 8.0 12.0
, , Sex = Male
Cell
Remis Diffuse Nodular Sum
No 6.5 2.5 9.0
Yes 6.5 2.5 9.0
Sum 13.0 5.0 18.0
, , Sex = Sum
Cell
Remis Diffuse Nodular Sum
No 8.5 6.5 15.0
Yes 8.5 6.5 15.0
Sum 17.0 13.0 30.0
Log-linear models for contingency tables enable us to determine important properties concerning the joint distribution of the classifying variables. The form of our preferred log-linear model for a table will have implications for how the variables are associated.
Each combination of the classifying variables occurs exactly once in a contingency table. Therefore, the model with the highest order interaction (between all the variables) and all marginal terms (all other interactions) is the saturated model.
The implication of this model is that every combination of levels of the variables has its own mean (probability) and that there are no relationships between these means (no structure). The variables are highly dependent.
The `job` dataset

In the `job` dataset, we might assume that the number of people in each income group is fixed. We are interested in the probability of having each level of satisfaction in each income group, and in seeing whether there is any evidence in these data that job satisfaction depends on income.
Income ($) | Very Dissat. | A Little Dissat. | Moderately Sat. | Very Sat. | Sum |
---|---|---|---|---|---|
<6000 | 20 | 24 | 80 | 82 | 206 |
6000–15000 | 22 | 38 | 104 | 125 | 289 |
15000–25000 | 13 | 28 | 81 | 113 | 235 |
>25000 | 7 | 18 | 54 | 92 | 171 |
Sum | 62 | 108 | 319 | 412 | 901 |
This is an example of product multinomial sampling, with the `Income` margin fixed: provided that we include an intercept and `Income` in the model, we can use Poisson log-linear models to analyse the contingency table (as we did before).
job <- read_csv("data/job.csv", show_col_types = FALSE)
job$Income <- factor(job$Income, levels = c("<6000", "6000-15000", "15000-25000", ">25000"))
job$Satisfaction <- factor(job$Satisfaction, levels = c("Very Dissatisfied", "A Little Dissatisfied", "Moderately Satisfied", "Very Satisfied"))
loglin_job_1 <- glm(Count ~ Income + Satisfaction, family = poisson, data = job)
loglin_job_2 <- glm(Count ~ Income * Satisfaction, family = poisson, data = job)
As before, we can compare these models with a log-likelihood ratio test:
Analysis of Deviance Table
Model 1: Count ~ Income + Satisfaction
Model 2: Count ~ Income * Satisfaction
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 9 12
2 0 0 9 12 0.21
On the basis of this, we prefer `loglin_job_1`, which does not have an interaction between `Income` and `Satisfaction`.
Fitted values for `loglin_job_1`

The fitted values \(\hat{\mu}_i = \exp(\mathbf{x}_i^{\top} \hat{\boldsymbol{\beta}})\) give us the expected counts in each cell of the contingency table.
fit1_tab <- xtabs(fitted(loglin_job_1) ~ Income + Satisfaction, data = job)
# Puts Arbitrary Margins on Multidimensional Tables or Arrays
addmargins(fit1_tab)
Satisfaction
Income Very Dissatisfied A Little Dissatisfied Moderately Satisfied
<6000 14.18 24.69 72.93
6000-15000 19.89 34.64 102.32
15000-25000 16.17 28.17 83.20
>25000 11.77 20.50 60.54
Sum 62.00 108.00 319.00
Satisfaction
Income Very Satisfied Sum
<6000 94.20 206.00
6000-15000 132.15 289.00
15000-25000 107.46 235.00
>25000 78.19 171.00
Sum 412.00 901.00
Under `loglin_job_1`, the expected cell counts for `Income` \(< 6000\) are
Very Dissatisfied A Little Dissatisfied Moderately Satisfied
14.18 24.69 72.93
Very Satisfied
94.20
Under this model, what is the estimate of the probability \[\mathbb{P}(\text{Satisfaction = Very Satisfied} | \text{Income < 6000})?\]
Under `loglin_job_1`, the expected cell counts for `Income` \(6000\text{-}15000\) are
Very Dissatisfied A Little Dissatisfied Moderately Satisfied
19.89 34.64 102.32
Very Satisfied
132.15
Under this model, what is the estimate of the probability \[\mathbb{P}(\text{Satisfaction = Very Satisfied} | \text{Income = 6000-15000})?\]
loglin_job_1
The fitted conditional probabilities for each satisfaction level are
Very Dissatisfied A Little Dissatisfied Moderately Satisfied
0.06881 0.11987 0.35405
Very Satisfied
0.45727
for `Income` \(< 6000\), and
Very Dissatisfied A Little Dissatisfied Moderately Satisfied
0.06881 0.11987 0.35405
Very Satisfied
0.45727
for `Income` \(6000\text{-}15000\).
In fact, we get the same fitted conditional probabilities for each income. Under this model, `Income` and `Satisfaction` are independent!
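We can see why directly: under the additive model, the fitted conditional probability of each satisfaction level is just the corresponding marginal column proportion, whatever the income band. A quick Python check (independent of the R session), using the fitted row totals and Very Satisfied counts from the table above:

```python
# Fitted row totals (Income margins) and fitted "Very Satisfied" counts,
# taken from the addmargins table of fitted values above
row_totals   = [206.00, 289.00, 235.00, 171.00]
very_sat_fit = [94.20, 132.15, 107.46, 78.19]

# Conditional probability of "Very Satisfied" within each income band
cond_probs = [vs / n for vs, n in zip(very_sat_fit, row_totals)]
print([round(p, 4) for p in cond_probs])  # [0.4573, 0.4573, 0.4573, 0.4573]

# Each equals the marginal column proportion 412/901
print(round(412 / 901, 4))  # 0.4573
```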
Last time we considered the implications of a log-linear model with no interaction term for the job dataset.
job <- read_csv("data/job.csv", show_col_types = FALSE)
job$Income <- factor(job$Income, levels = c("<6000", "6000-15000", "15000-25000", ">25000"))
job$Satisfaction <- factor(job$Satisfaction, levels = c("Very Dissatisfied", "A Little Dissatisfied", "Moderately Satisfied", "Very Satisfied"))
loglin_job_1 <- glm(Count ~ Income + Satisfaction, family = poisson, data = job)
We found that under this model, the fitted conditional probabilities for having each level of job satisfaction were the same for each income group.
For this model for a two-way table, with no interaction (no `Income:Satisfaction` term), we find that the cross-classifying factors (`Income` and `Satisfaction`) are independent. Now, we will show that this is true in general, and then go on to look at multi-way tables.
We consider a two-way \(r \times c\) table where the two classifying variables \(R\) and \(C\) have \(r\) and \(c\) levels, respectively.
The saturated model \(R*C\) implies that the two variables are associated. If we remove the RC interaction, we have the model \(R+C\), \[ \log\mu_i=\alpha+\beta_R(r_i)+\beta_C(c_i),\qquad {i=1, \cdots, n} \] where \(n=rc\) is the total number of cells in the table.
Because of the equivalence of Poisson and multinomial sampling, we can think of each cell mean \(\mu_i\) as equal to \(N \cdot p_i\) where \(N\) is the total number of observations in the table, and \(p_i\) a cell probability.
As each combination of levels of \(R\) and \(C\) is represented in exactly one cell, it is also convenient to replace the cell label \(i\) by the pair of labels \(j\) and \(k\) representing the corresponding levels of \(R\) and \(C\) respectively. Hence \[ \log \left(p_{jk}\right) = \alpha + \beta_R(j) + \beta_C(k) - \log (N), \] for \(j = 1, \cdots, r\) and \(k = 1, \cdots, c\).
We have \[ \log \left(p_{jk}\right)=\alpha+\beta_R(j)+\beta_C(k)-\log(N), \] and \[ \mathbb{P}(R=j,C=k)=\exp[\alpha+\beta_R(j)+\beta_C(k)-\log(N)], \] for \(j=1, \cdots, r\) and \(k=1, \cdots, c\).
So,
\[\begin{align*} 1&=\sum_{j=1}^r\sum_{k=1}^c\exp[\alpha+\beta_R(j)+\beta_C(k)-\log(N)]\cr &={1\over N}\exp[\alpha]\sum_{j=1}^r\exp[\beta_R(j)]\sum_{k=1}^c\exp[\beta_C(k)]. \end{align*}\]
Since \[ \mathbb{P}(R=j,C=k)=\exp[\alpha+\beta_R(j)+\beta_C(k)-\log (N)], \]
we have
\[\begin{align*} \mathbb{P}(R=j)&=\sum_{k=1}^c\exp[\alpha+\beta_R(j)+\beta_C(k)-\log(N)]\cr &={1\over N}\exp[\alpha]\exp[\beta_R(j)]\sum_{k=1}^c\exp[\beta_C(k)], \quad j=1,\cdots,r, \end{align*}\]
and
\[\begin{align*} \mathbb{P}(C=k)&=\sum_{j=1}^r\exp[\alpha+\beta_R(j)+\beta_C(k)-\log(N)]\cr &={1\over N}\exp[\alpha]\exp[\beta_C(k)]\sum_{j=1}^r\exp[\beta_R(j)], \quad k=1,\cdots,c. \end{align*}\]
Therefore, \[\begin{align*} \mathbb{P}(R=j)\mathbb{P}(C=k) &= {1\over N}\exp[\alpha]\exp[\beta_R(j)]\sum_{k=1}^c\exp[\beta_C(k)] \times \\ &\phantom{=|\;}{1\over N}\exp[\alpha]\exp[\beta_C(k)]\sum_{j=1}^r\exp[\beta_R(j)] \\ &={1\over N}\exp[\alpha]\exp[\beta_C(k)]\exp[\beta_R(j)] \times \\ &\phantom{=|\;}{1\over N}\exp[\alpha]\sum_{j=1}^r\exp[\beta_R(j)] \sum_{k=1}^c\exp[\beta_C(k)] \\ &={1\over N}\exp[\alpha]\exp[\beta_C(k)]\exp[\beta_R(j)]\times 1 \\ &=\mathbb{P}(R=j,C=k) \end{align*}\] for \(j = 1, \cdots, r\) and \(k = 1, \cdots, c\).
Absence of the interaction \(R*C\) in a log-linear model implies that \(R\) and \(C\) are independent variables.
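This result also explains the fitted values seen earlier: under the no-interaction model, the fitted count in cell \((j, k)\) is (row total \(\times\) column total)\(/N\). A Python sketch (an independent check of the R output), using the observed margins of the job table:

```python
# Observed margins of the 4x4 job table (from the data shown earlier)
row_totals = [206, 289, 235, 171]   # Income bands
col_totals = [62, 108, 319, 412]    # Satisfaction levels
N = sum(row_totals)                 # 901 individuals in total

# Independence (no-interaction) fitted counts: mu_jk = row_j * col_k / N
fitted = [[r * c / N for c in col_totals] for r in row_totals]

# e.g. Income < 6000, Very Satisfied
print(round(fitted[0][3], 2))  # 94.2, matching the 94.20 reported by R
```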
Absence of main effects is generally less interesting, and main effects are typically not removed from a log-linear model for a contingency table.
In multi-way tables, absence of a two-factor interaction does not necessarily mean that the two variables are independent.
For example, consider the `lymphoma` dataset, with three binary classifying variables `Sex` (\(S\)), `Cell type` (\(C\)) and `Remission` (\(R\)).
After comparing the fit of several possible models, we find that a reasonable log-linear model for these data is \(R * C + C * S\).
Hence the interaction between `Remission` and `Sex` is absent.
(Intercept) RemisYes CellNodular
1.261 -2.015 -0.648
SexMale RemisYes:CellNodular CellNodular:SexMale
1.179 3.219 -1.649
Which of the following would give the expected count of female patients with nodular cell type who were in remission under model `mod`?
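One way to evaluate such an expression is to add up the coefficients whose dummy variables equal one for that cell, then exponentiate. For a female patient the `SexMale` terms drop out. A Python sketch using the coefficients printed above:

```python
import math

# Coefficients of mod, as printed above
coef = {
    "(Intercept)": 1.261, "RemisYes": -2.015, "CellNodular": -0.648,
    "SexMale": 1.179, "RemisYes:CellNodular": 3.219,
    "CellNodular:SexMale": -1.649,
}

# Female, nodular, in remission: SexMale = 0, so Sex terms vanish
eta = (coef["(Intercept)"] + coef["RemisYes"] + coef["CellNodular"]
       + coef["RemisYes:CellNodular"])
print(round(math.exp(eta), 2))  # expected count, about 6.15
```

Dividing this fitted count by the table total gives the corresponding fitted cell probability.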
The estimated probabilities for the two-way `Sex`/`Remission` margin are
Sex | Remission: No | Remission: Yes | Sum |
---|---|---|---|
Female | 0.1792 | 0.2208 | 0.4 |
Male | 0.4208 | 0.1792 | 0.6 |
Sum | 0.6 | 0.4 | 1.0 |
The estimated conditional probability of remission is \(0.2208/ 0.4 = 0.55\) for female patients, and \(0.1792 / 0.6 = 0.30\) for male patients.
So, remission (\(R\)) and sex (\(S\)) are not independent.
What the model \(R*C+C*S\) implies is that \(R\) is independent of \(S\) conditional on \(C\), that is \[ \mathbb{P}(R,S|C)=\mathbb{P}(R|C)\mathbb{P}(S|C). \]
Another way of expressing this is \[ \mathbb{P}(R|S,C)=\mathbb{P}(R|C), \] that is, the probability of each level of \(R\) given a particular combination of \(S\) and \(C\), does not depend on which level \(S\) takes.
Equivalently, we can write \(\mathbb{P}(S|R,C)=\mathbb{P}(S|C)\).
This can be observed by calculating the estimated conditional probabilities of remission for the `lymphoma` dataset.
Cell Type | Sex | Remission: No | Remission: Yes | \(\hat{\mathbb{P}}(R \mid S, C)\) |
---|---|---|---|---|
Diffuse | Female | 0.1176 | 0.0157 | 0.12 |
Diffuse | Male | 0.3824 | 0.0510 | 0.12 |
Nodular | Female | 0.0615 | 0.2051 | 0.77 |
Nodular | Male | 0.0385 | 0.1282 | 0.77 |
The probability of remission given `Cell type` and `Sex` depends only on a patient's `Cell type`, and not on their `Sex`.
So, \(\mathbb{P}(R|S, C) = \mathbb{P}(R|C)\), and \(R\) is independent of \(S\), given \(C\).
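The final column can be reproduced directly from the cell probabilities, since \(\hat{\mathbb{P}}(R \mid S, C)\) is the remission probability divided by the \((S, C)\) total. A Python check using the values in the table:

```python
# Estimated cell probabilities from the table above:
# (cell type, sex) -> (P(no remission), P(remission))
cells = {
    ("Diffuse", "Female"): (0.1176, 0.0157),
    ("Diffuse", "Male"):   (0.3824, 0.0510),
    ("Nodular", "Female"): (0.0615, 0.2051),
    ("Nodular", "Male"):   (0.0385, 0.1282),
}

# P(R = yes | S, C) = P(S, C, R = yes) / P(S, C):
# the same value for both sexes within each cell type
for (cell_type, sex), (no, yes) in cells.items():
    print(cell_type, sex, round(yes / (no + yes), 2))
```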
Although \(R\) and \(S\) are conditionally independent given \(C\), they are not marginally independent.
Let’s look at a general result on conditional independence in multiway tables.
In general, suppose we have an \(r\)-way contingency table with classifying variables \(X_1, \cdots ,X_r\).
A log linear model which does not contain the \(X_1 * X_2\) interaction implies that \(X_1\) and \(X_2\) are conditionally independent given \(X_3, \cdots, X_r\), that is \[ \mathbb{P}(X_1,X_2|X_3,\cdots ,X_r)=\mathbb{P}(X_1|X_3,\cdots ,X_r)\mathbb{P}(X_2|X_3,\cdots ,X_r). \]
The proof is similar to the two-way case; therefore, we will not present it here.
Suppose the factors for a three-way table are \(X_1\), \(X_2\) and \(X_3\). We can list all models and the implications for the conditional independence structure.
Suppose that we have a four-way table, cross-classified by variables \(A\), \(B\), \(C\), and \(D\).
Suppose we fit the model
Under the model `mod`, which of the following statements are true? Select all that apply.
In 1972-74, a survey of women was carried out in an area of Newcastle. A follow-up survey was carried out 20 years later.
A summary of the responses is
Smoker | Dead | Alive | \(\hat{\mathbb{P}}(\text{Dead} \mid \text{Smoker})\) |
---|---|---|---|
Yes | 139 | 443 | 0.24 |
No | 230 | 502 | 0.31 |
Looking at this table, it appears that the non-smokers had a greater probability of dying: can that be right?
There is an important extra variable to be considered, related to both smoking habit and mortality: age (at the time of the initial survey).
When we consider this variable, we get
Age | Smoker | Dead | Alive | \(\hat{\mathbb{P}}(\text{Dead} \mid \text{Age}, \text{Smoker})\) |
---|---|---|---|---|
18–34 | Yes | 5 | 174 | 0.03 |
18–34 | No | 6 | 213 | 0.03 |
35–44 | Yes | 14 | 95 | 0.13 |
35–44 | No | 7 | 114 | 0.06 |
45–54 | Yes | 27 | 103 | 0.21 |
45–54 | No | 12 | 66 | 0.15 |
55–64 | Yes | 51 | 64 | 0.44 |
55–64 | No | 40 | 81 | 0.33 |
65–74 | Yes | 29 | 7 | 0.81 |
65–74 | No | 101 | 28 | 0.78 |
75– | Yes | 13 | 0 | 1 |
75– | No | 64 | 0 | 1 |
Conditional on every age at outset, it is now the smokers who have a higher probability of dying.
The marginal association is reversed in the table conditional on age, because mortality and smoking are associated with age. There are proportionally many fewer smokers in the older age-groups (where the probability of death is greater).
When making inferences about associations between variables, it is important that all other variables which are relevant are considered. Marginal inferences may lead to misleading conclusions.
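This reversal (an instance of Simpson's paradox) can be confirmed numerically from the counts in the age-stratified table above. A Python sketch:

```python
# (age band, smoker?) -> (dead, alive), from the table above
data = {
    ("18-34", True): (5, 174),  ("18-34", False): (6, 213),
    ("35-44", True): (14, 95),  ("35-44", False): (7, 114),
    ("45-54", True): (27, 103), ("45-54", False): (12, 66),
    ("55-64", True): (51, 64),  ("55-64", False): (40, 81),
    ("65-74", True): (29, 7),   ("65-74", False): (101, 28),
    ("75-",   True): (13, 0),   ("75-",   False): (64, 0),
}

def death_rate(dead, alive):
    return dead / (dead + alive)

# Marginal (collapsed over age): non-smokers appear worse off
for smoker in (True, False):
    dead = sum(d for (_, s), (d, a) in data.items() if s == smoker)
    alive = sum(a for (_, s), (d, a) in data.items() if s == smoker)
    print("smoker" if smoker else "non-smoker",
          round(death_rate(dead, alive), 2))  # 0.24, then 0.31

# Conditional on age: smokers never have the lower death rate
for age in sorted({age for age, _ in data}):
    assert death_rate(*data[(age, True)]) >= death_rate(*data[(age, False)])
```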
Mission Accomplished
This completes the lecture material!