# Data wrangling and table summaries of case-control studies

A case-control study compares patients who have a disease or outcome of interest (cases) with patients who do not (controls). It looks back retrospectively to compare how frequently exposure to a risk factor is present in each group, in order to assess the relationship between the risk factor and the disease.

Case-control studies are observational because no intervention is attempted and no attempt is made to alter the course of the disease. The goal is to retrospectively determine the exposure to the risk factor of interest in each of the two groups of individuals: cases and controls. These studies are designed to estimate odds ratios, that is, the odds of exposure among cases relative to the odds of exposure among controls.

Case-control studies are also known as “retrospective studies” and “case-referent studies.”

The classic textbook by Breslow and Day on the analysis of cancer research data includes a table from a study of risk factors for oesophageal cancer (N. E. Breslow and N. E. Day, ch. 4).

We will use dplyr to wrangle these data. In this post, we will recreate this table the tidyverse way.

First, we load the tidyverse meta-package, which contains dplyr for data wrangling, among other packages.

library(tidyverse)

The dataset from the book can be read directly from this URL:

df <- read_csv("http://bit.ly/data_esoph", col_names = FALSE)
## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   X2 = col_integer(),
##   X3 = col_integer(),
##   X4 = col_integer()
## )

Let’s see the first six rows of the dataset:

head(df)
## # A tibble: 6 x 4
##      X1    X2    X3    X4
##   <int> <int> <int> <int>
## 1     1     2     1     0
## 2     1     1     1     0
## 3     1     2     4     0
## 4     1     2     2     0
## 5     1     2     1     0
## 6     1     2     1     0

The data come without column names. The variables are:

| Column | Variable | Codes |
|---|---|---|
| 1 | Age group (years) | 1 = 25-34, 2 = 35-44, 3 = 45-54, 4 = 55-64, 5 = 65-74, 6 = 75+ |
| 2 | Alcohol (g/day) | 1 = 0-39, 2 = 40-79, 3 = 80-119, 4 = 120+ |
| 3 | Tobacco (g/day) | 1 = 0-9, 2 = 10-19, 3 = 20-29, 4 = 30+ |
| 4 | Case or control | 1 = case, 0 = control |

Now we will add the column names:

colnames(df) <- c("age", "alc", "tob", "cc")

Check:

head(df)
## # A tibble: 6 x 4
##     age   alc   tob    cc
##   <int> <int> <int> <int>
## 1     1     2     1     0
## 2     1     1     1     0
## 3     1     2     4     0
## 4     1     2     2     0
## 5     1     2     1     0
## 6     1     2     1     0
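As an alternative to renaming afterwards, readr can assign the names while reading. The following is a minimal sketch, assuming readr 2.0 or later (needed for literal data wrapped in `I()`). The three inline rows are copied from the `head(df)` output above and only stand in for the real file, and the `col_types = "iiii"` shorthand is our own choice, made to match the integer columns reported when the file was parsed.

```r
library(readr)

# Three rows copied from the head(df) output above; they stand in for the real file.
csv_text <- "1,2,1,0\n1,1,1,0\n1,2,4,0\n"

# col_names accepts a character vector, so the columns are named at read time.
# col_types = "iiii" declares four integer columns (an assumption that matches
# the column specification printed earlier).
df_named <- read_csv(
  I(csv_text),
  col_names = c("age", "alc", "tob", "cc"),
  col_types = "iiii"
)

names(df_named)
```

With the real data, the same `col_names` vector could be passed in the original `read_csv()` call in place of `col_names = FALSE`.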

Since we know the codes, we will recode the whole dataset using the dplyr functions mutate and case_when.

Age groups: 1 = 25-34, 2 = 35-44, 3 = 45-54, 4 = 55-64, 5 = 65-74, 6 = 75+

# Recode the numeric age codes into labelled age groups
df <- df %>%
  mutate(
    age_grp = case_when(
      age == 1 ~ "25-34",
      age == 2 ~ "35-44",
      age == 3 ~ "45-54",
      age == 4 ~ "55-64",
      age == 5 ~ "65-74",
      TRUE ~ "75+"  # catch-all: every remaining code (6 in this dataset)
    )
  )
head(df)
## # A tibble: 6 x 5
##     age   alc   tob    cc age_grp
##   <int> <int> <int> <int> <chr>
## 1     1     2     1     0 25-34
## 2     1     1     1     0 25-34
## 3     1     2     4     0 25-34
## 4     1     2     2     0 25-34
## 5     1     2     1     0 25-34
## 6     1     2     1     0 25-34

OK!
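One caveat of a `TRUE ~ "75+"` catch-all is that any unexpected code (a typo such as 7, or a missing value) would also be labelled "75+". A more defensive sketch, run on a small made-up vector of codes rather than on the real data, spells out every code so that anything unmatched becomes NA and can be spotted:

```r
library(dplyr)

# Hypothetical codes for illustration: 7 and NA are not valid age codes.
toy <- tibble(age = c(1L, 6L, 7L, NA))

toy <- toy %>%
  mutate(
    age_grp = case_when(
      age == 1 ~ "25-34",
      age == 2 ~ "35-44",
      age == 3 ~ "45-54",
      age == 4 ~ "55-64",
      age == 5 ~ "65-74",
      age == 6 ~ "75+"
      # no TRUE branch: unmatched codes are left as NA
    )
  )

# Rows whose code could not be recoded
filter(toy, is.na(age_grp))
```

Whether this is worth the extra line depends on how much the coding of the input file can be trusted.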

The same for the rest of the variables. Alcohol:

# Recode the alcohol codes (g/day)
df <- df %>%
  mutate(
    alc_grp = case_when(
      alc == 1 ~ "0-39",
      alc == 2 ~ "40-79",
      alc == 3 ~ "80-119",
      TRUE ~ "120+"
    )
  )

Tobacco:

# Recode the tobacco codes (g/day)
df <- df %>%
  mutate(
    tob_grp = case_when(
      tob == 1 ~ "0- 9",
      tob == 2 ~ "10-19",
      tob == 3 ~ "20-29",
      TRUE ~ "30+"
    )
  )

Case or control:

# Recode the case/control indicator
df <- df %>%
  mutate(
    cc_grp = case_when(
      cc == 0 ~ "control",
      TRUE ~ "case"
    )
  )

Now, drop the original numeric columns:

df <- df %>%
  select(age_grp:cc_grp)

And now we have to set the order of the levels for the ordinal variables age, alcohol and tobacco. The result must be assigned back to df, and each level must match its label exactly (including the "+"); any value without a matching level becomes NA.

df <- df %>%
  mutate(
    age_grp = factor(age_grp, levels = c("25-34", "35-44", "45-54",
                                         "55-64", "65-74", "75+")),
    alc_grp = factor(alc_grp, levels = c("0-39", "40-79", "80-119", "120+")),
    tob_grp = factor(tob_grp, levels = c("0- 9", "10-19", "20-29", "30+"))
  )

Now we have the data ready for the analysis!
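As an aside, base R's `factor()` can do the recoding and the ordering in a single step through its `levels` and `labels` arguments. This is a minimal sketch on a few made-up codes, offered as an alternative and not as the approach used in this post:

```r
library(dplyr)

# Hypothetical codes for illustration, in the same coding as the dataset.
toy <- tibble(alc = c(4L, 1L, 3L, 2L, 4L))

toy <- toy %>%
  mutate(
    # levels = the numeric codes, labels = the text shown for each code;
    # the order of the levels is the order of the codes, so "120+" sorts last.
    alc_grp = factor(alc, levels = 1:4,
                     labels = c("0-39", "40-79", "80-119", "120+"))
  )

levels(toy$alc_grp)
count(toy, alc_grp)  # rows follow the factor order, not the alphabet
```

A side effect of this variant is that a code outside `levels` becomes NA instead of falling into a catch-all category.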

Let’s make Table 1. We have two options. First, the traditional table:

table(df$age_grp, df$cc_grp)
##
##         case control
##   25-34    1     115
##   35-44    9     190
##   45-54   46     167
##   55-64   76     166
##   65-74   55     106
##   75+     13      31

Here we can add the margins with addmargins:

addmargins(table(df$age_grp, df$cc_grp))
##
##         case control Sum
##   25-34    1     115 116
##   35-44    9     190 199
##   45-54   46     167 213
##   55-64   76     166 242
##   65-74   55     106 161
##   75+     13      31  44
##   Sum    200     775 975

or make a table of percentages with prop.table:

options(digits = 2) # print about two significant digits
prop.table(table(df$age_grp, df$cc_grp))*100
##
##          case control
##   25-34  0.10   11.79
##   35-44  0.92   19.49
##   45-54  4.72   17.13
##   55-64  7.79   17.03
##   65-74  5.64   10.87
##   75+    1.33    3.18

By default, prop.table divides each cell by the grand total (975). Since the two groups differ in size, this table is not very useful. Instead, we can compute the percentages within each column, that is, separately for cases and for controls:

prop.table(table(df$age_grp, df$cc_grp), 2)*100 # note the added 2: it means % by column
##
##         case control
##   25-34  0.5    14.8
##   35-44  4.5    24.5
##   45-54 23.0    21.5
##   55-64 38.0    21.4
##   65-74 27.5    13.7
##   75+    6.5     4.0

This is better.
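To make the role of the second argument of `prop.table()` explicit, here is a small self-contained sketch. The counts are copied from the age table above into a matrix (the object name `tab` is ours), and the three calls show what each choice of margin sums to:

```r
# Counts copied from the age-by-group table above.
tab <- matrix(
  c(1, 9, 46, 76, 55, 13,         # cases
    115, 190, 167, 166, 106, 31), # controls
  ncol = 2,
  dimnames = list(
    age = c("25-34", "35-44", "45-54", "55-64", "65-74", "75+"),
    group = c("case", "control")
  )
)

sum(prop.table(tab))          # no margin: all cells together sum to 1
rowSums(prop.table(tab, 1))   # margin = 1: each row sums to 1
colSums(prop.table(tab, 2))   # margin = 2: each column sums to 1
```

For comparing the age distribution of cases with that of controls, the column version is the relevant one, because each group then adds up to 100% on its own.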

The same table of counts, the dplyr way:

df %>%
  group_by(age_grp, cc_grp) %>%
  summarise(n = n()) %>%
  spread(cc_grp, n)
## # A tibble: 6 x 3
## # Groups:   age_grp 
##   age_grp  case control
##   <chr>   <int>   <int>
## 1 25-34       1     115
## 2 35-44       9     190
## 3 45-54      46     167
## 4 55-64      76     166
## 5 65-74      55     106
## 6 75+        13      31
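In current versions of tidyr, `spread()` is superseded by `pivot_wider()`. The sketch below shows the same reshaping step with the newer function, using a small long-format table of counts copied from the output above (two age groups only) instead of the full dataset; the object names `counts` and `wide` are ours.

```r
library(dplyr)
library(tidyr)

# Counts copied from the table above, in long format (two age groups only).
counts <- tibble(
  age_grp = c("25-34", "25-34", "35-44", "35-44"),
  cc_grp  = c("case", "control", "case", "control"),
  n       = c(1L, 115L, 9L, 190L)
)

# One row per age group, one column per value of cc_grp.
wide <- counts %>%
  pivot_wider(names_from = cc_grp, values_from = n)

wide
```

On the full data, the long table could be produced with `count(df, age_grp, cc_grp)`, which is a shortcut for `group_by()` plus `summarise(n = n())` and returns an ungrouped result.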

or as a table of percentages of the grand total:

df %>%
  count(age_grp, cc_grp) %>%
  mutate(prop = prop.table(n)*100) %>%
  select(-n) %>%
  spread(cc_grp, prop)
## # A tibble: 6 x 3
##   age_grp  case control
##   <chr>   <dbl>   <dbl>
## 1 25-34   0.103   11.8
## 2 35-44   0.923   19.5
## 3 45-54   4.72    17.1
## 4 55-64   7.79    17.0
## 5 65-74   5.64    10.9
## 6 75+     1.33     3.18

The dplyr version of the table with proportions by column is:

options(digits = 3)
df %>%
  count(age_grp, cc_grp) %>%
  group_by(cc_grp) %>%
  mutate(prop = n / sum(n)) %>%
  select(-n) %>%
  spread(cc_grp, prop, fill = 0)
## # A tibble: 6 x 3
##   age_grp    case control
##   <chr>     <dbl>   <dbl>
## 1 25-34   0.00500  0.148
## 2 35-44   0.0450   0.245
## 3 45-54   0.230    0.215
## 4 55-64   0.380    0.214
## 5 65-74   0.275    0.137
## 6 75+     0.0650   0.0400

Also, there is a package called janitor, full of nice functions. One of them allows us to make such a table with a simple syntax:

# Cross-tabulate age_grp by cc_grp with janitor, showing proportions by column.
# Change percent to 'row' for proportions by row.
df %>%
  janitor::crosstab(age_grp, cc_grp, percent = 'col')
##   age_grp  case control
## 1   25-34 0.005   0.148
## 2   35-44 0.045   0.245
## 3   45-54 0.230   0.215
## 4   55-64 0.380   0.214
## 5   65-74 0.275   0.137
## 6     75+ 0.065   0.040
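In more recent janitor releases, `crosstab()` appears to have been replaced by `tabyl()` combined with the `adorn_*()` helpers, so the call above may fail on an up-to-date installation. The following sketch shows what we believe is the equivalent, run on a handful of made-up rows instead of the real data; it assumes a janitor version that provides `tabyl()` and `adorn_percentages()`.

```r
library(dplyr)
library(janitor)

# Small stand-in with one row per person (made up for illustration).
toy <- tibble(
  age_grp = c("25-34", "25-34", "35-44", "35-44", "35-44", "35-44"),
  cc_grp  = c("control", "control", "case", "control", "control", "case")
)

# tabyl() cross-tabulates; adorn_percentages("col") divides by column totals.
tab <- toy %>%
  tabyl(age_grp, cc_grp) %>%
  adorn_percentages("col")

tab
```

Passing `"row"` instead of `"col"` gives proportions by row, as with the `percent` argument of the older function.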

For alcohol:

df %>%
  group_by(alc_grp, cc_grp) %>%
  summarise(n = n()) %>%
  spread(cc_grp, n)
## # A tibble: 4 x 3
## # Groups:   alc_grp 
##   alc_grp  case control
##   <chr>   <int>   <int>
## 1 0-39       29     386
## 2 120+       45      22
## 3 40-79      75     280
## 4 80-119     51      87

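Note that alc_grp is printed as a character column (<chr>) in the output above, so its rows are sorted alphabetically and "120+" comes before "40-79". When alc_grp is a factor with its levels in the natural order, the rows come out as 0-39, 40-79, 80-119, 120+.

And for tobacco:

df %>%
  group_by(tob_grp, cc_grp) %>%
  summarise(n = n()) %>%
  spread(cc_grp, n)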
## # A tibble: 4 x 3
## # Groups:   tob_grp 
##   tob_grp  case control
##   <chr>   <int>   <int>
## 1 0- 9       78     447
## 2 10-19      58     178
## 3 20-29      33      99
## 4 30+        31      51
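With the counts in hand, we can illustrate the quantity that case-control studies are designed to estimate. The sketch below computes a crude (unadjusted) odds ratio from the alcohol table above, after collapsing it into two exposure levels. The cut-off at 80 g/day is a choice made for this illustration, and no adjustment for age or tobacco is attempted.

```r
# Counts copied from the alcohol table above.
cases    <- c("0-39" = 29, "40-79" = 75, "80-119" = 51, "120+" = 45)
controls <- c("0-39" = 386, "40-79" = 280, "80-119" = 87, "120+" = 22)

high <- c("80-119", "120+")  # exposed: 80 g/day or more (illustrative cut-off)
low  <- c("0-39", "40-79")   # unexposed: less than 80 g/day

# Odds of exposure among cases and among controls
odds_cases    <- sum(cases[high]) / sum(cases[low])        # 96 / 104
odds_controls <- sum(controls[high]) / sum(controls[low])  # 109 / 666

odds_ratio <- odds_cases / odds_controls
round(odds_ratio, 2)  # 5.64
```

In words, the odds of drinking 80 g/day or more are between five and six times higher among cases than among controls in these data, before accounting for any other variable.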

So, in this post we have recreated the table from the case-control study of (o)esophageal cancer in Ille-et-Vilaine, France, in the Breslow and Day textbook.