Exploratory Data Analysis with Tables

Introduction

Tables allow you to explore and summarize data efficiently. While graphs are more intuitive for discovering relationships and trends, tables have the advantage of providing detailed information and allowing descriptive statistics and data summaries to be delivered.

Usually scientific articles in medicine begin with a table that shows the characteristics of the sample of patients. In this post, we will use the janitor and table1 packages to summarize data and make an example of table 1 using the NHANES database.

Packages used

Before using each of these packages you must install them using the command

install.packages(“package”)

and then call using the command

library(package)

Also after installed, each package be invoked specifically for command via the format

package::command()

for example, after install the package janitor, you can use function clean_names via

janitor::clean_names(dataset)

I suggest using the pacman package to handle other packages, and can be installed with

install.packages(“pacman”)

Dataset

The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. More information at https://wwwn.cdc.gov/nchs/nhanes/tutorials/default.aspx

# install.packages("NHANES") 
library(NHANES)
data(NHANES)

Tables with r-base

One variable table

Univariate tables are useful for detecting the presence of errors in coding and for obtaining a general summary of the data.

The general syntaxis to indicate a column or variable in R is dataset$variable. For example, if we want to select the column Gender from the NHANES dataframe, the command is

table(NHANES$Gender)
## 
## female   male 
##   5020   4980

To obtain the total we use the command addmargins()

addmargins(table(NHANES$Gender))
## 
## female   male    Sum 
##   5020   4980  10000

Now a table by marital status

table(NHANES$MaritalStatus)
## 
##     Divorced  LivePartner      Married NeverMarried    Separated 
##          707          560         3945         1380          183 
##      Widowed 
##          456

To get the percentages you use the command prop.table and add *100 at the end to make it more readable

prop.table(table(NHANES$MaritalStatus)) * 100 
## 
##     Divorced  LivePartner      Married NeverMarried    Separated 
##     9.777348     7.744434    54.556769    19.084497     2.530770 
##      Widowed 
##     6.306182

The equivalent in tidyverse format is:

NHANES %>% 
 group_by(Gender) %>% 
  summarise(n = n())
## # A tibble: 2 x 2
##   Gender     n
##   <fct>  <int>
## 1 female  5020
## 2 male    4980

Two variables table

Tables become more useful even when used to cross tabulate between two nominal variables.

For example, to make a summary table indicating race by gender, the command is: The proportion table is

prop.table(table(NHANES$Race1, NHANES$Gender))
##           
##            female   male
##   Black    0.0614 0.0583
##   Hispanic 0.0320 0.0290
##   Mexican  0.0452 0.0563
##   White    0.3221 0.3151
##   Other    0.0413 0.0393

Now in percentage

prop.table(table(NHANES$Race1, NHANES$Gender)) * 100
##           
##            female  male
##   Black      6.14  5.83
##   Hispanic   3.20  2.90
##   Mexican    4.52  5.63
##   White     32.21 31.51
##   Other      4.13  3.93

and with the percentage by row

prop.table(table(NHANES$Race1, NHANES$Gender), 1) * 100
##           
##              female     male
##   Black    51.29490 48.70510
##   Hispanic 52.45902 47.54098
##   Mexican  44.53202 55.46798
##   White    50.54928 49.45072
##   Other    51.24069 48.75931

We can choose the number of digits with options(digits = n)

options(digits = 3)
prop.table(table(NHANES$Race1, NHANES$Gender), 1) * 100
##           
##            female male
##   Black      51.3 48.7
##   Hispanic   52.5 47.5
##   Mexican    44.5 55.5
##   White      50.5 49.5
##   Other      51.2 48.8
addmargins(prop.table(table(NHANES$Race1, NHANES$Gender), 1) * 100)
##           
##            female  male   Sum
##   Black      51.3  48.7 100.0
##   Hispanic   52.5  47.5 100.0
##   Mexican    44.5  55.5 100.0
##   White      50.5  49.5 100.0
##   Other      51.2  48.8 100.0
##   Sum       250.1 249.9 500.0
addmargins(prop.table(table(NHANES$Race1, NHANES$Gender), 2) * 100)
##           
##            female   male    Sum
##   Black     12.23  11.71  23.94
##   Hispanic   6.37   5.82  12.20
##   Mexican    9.00  11.31  20.31
##   White     64.16  63.27 127.44
##   Other      8.23   7.89  16.12
##   Sum      100.00 100.00 200.00

Tables in tidyverse syle

To create this table

table(NHANES$Race1, NHANES$Gender)
##           
##            female male
##   Black       614  583
##   Hispanic    320  290
##   Mexican     452  563
##   White      3221 3151
##   Other       413  393

In tidyverse format is:

NHANES %>% 
  group_by(Gender, Race1) %>% 
  summarise(n = n()) %>% 
  spread(Gender, n)
## # A tibble: 5 x 3
##   Race1    female  male
##   <fct>     <int> <int>
## 1 Black       614   583
## 2 Hispanic    320   290
## 3 Mexican     452   563
## 4 White      3221  3151
## 5 Other       413   393

Proportion table in tidyverse style

This table:

prop.table(table(NHANES$Race1, NHANES$Gender), 2) *100
##           
##            female  male
##   Black     12.23 11.71
##   Hispanic   6.37  5.82
##   Mexican    9.00 11.31
##   White     64.16 63.27
##   Other      8.23  7.89

And to get the proportions in tidyverse style, is:

NHANES %>%
  group_by(Gender, Race1) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n) * 100) %>% 
  select(-n) %>% 
  spread(Gender, freq)
## # A tibble: 5 x 3
##   Race1    female  male
##   <fct>     <dbl> <dbl>
## 1 Black     12.2  11.7 
## 2 Hispanic   6.37  5.82
## 3 Mexican    9.00 11.3 
## 4 White     64.2  63.3 
## 5 Other      8.23  7.89

Tables with janitor::tabyl

The janitor package, designed for data cleaning, contains the tabyl command which is a table 2.0, with some of the following utilities:

NHANES %>% 
  janitor::tabyl(Race1) 
##     Race1    n percent
##     Black 1197  0.1197
##  Hispanic  610  0.0610
##   Mexican 1015  0.1015
##     White 6372  0.6372
##     Other  806  0.0806

Or a two-variables table:

NHANES %>% 
  janitor::tabyl(Race1, Gender) %>% 
  janitor::adorn_totals(where = "row")
##     Race1 female male
##     Black    614  583
##  Hispanic    320  290
##   Mexican    452  563
##     White   3221 3151
##     Other    413  393
##     Total   5020 4980
NHANES %>% 
  janitor::tabyl(Race1, Gender) %>% 
  janitor::adorn_totals(where = "col")
##     Race1 female male Total
##     Black    614  583  1197
##  Hispanic    320  290   610
##   Mexican    452  563  1015
##     White   3221 3151  6372
##     Other    413  393   806

You can add the margins with janitor::adorn_totals(where = c(“row”,“col”))

NHANES %>% 
  janitor::tabyl(Race1, Gender) %>% 
  janitor::adorn_totals(where = c("row","col"))
##     Race1 female male Total
##     Black    614  583  1197
##  Hispanic    320  290   610
##   Mexican    452  563  1015
##     White   3221 3151  6372
##     Other    413  393   806
##     Total   5020 4980 10000

To limit the number of decimals, we do it in the following way:

NHANES %>% 
  janitor::tabyl(Race1, Gender) %>% 
  janitor::adorn_percentages(denominator = "col") %>% 
  janitor::adorn_pct_formatting(digits = 0) 
##     Race1 female male
##     Black    12%  12%
##  Hispanic     6%   6%
##   Mexican     9%  11%
##     White    64%  63%
##     Other     8%   8%
NHANES %>% 
  janitor::tabyl(Race1, Gender) %>% 
  janitor::adorn_percentages(denominator = "row") %>% 
  janitor::adorn_pct_formatting(digits = 0) 
##     Race1 female male
##     Black    51%  49%
##  Hispanic    52%  48%
##   Mexican    45%  55%
##     White    51%  49%
##     Other    51%  49%

One of the most useful functions of tabyl is that it allows you to make a table by combining totals and percentages, either by row or column.

NHANES %>% 
  janitor::tabyl(Race1, Gender) %>% 
  janitor::adorn_totals(where = c("row","col")) %>% 
  janitor::adorn_percentages(denominator = "col") %>% 
  janitor::adorn_pct_formatting(digits = 0) %>% 
  janitor::adorn_ns(position = "front")
##     Race1      female        male        Total
##     Black  614  (12%)  583  (12%)  1197  (12%)
##  Hispanic  320   (6%)  290   (6%)   610   (6%)
##   Mexican  452   (9%)  563  (11%)  1015  (10%)
##     White 3221  (64%) 3151  (63%)  6372  (64%)
##     Other  413   (8%)  393   (8%)   806   (8%)
##     Total 5020 (100%) 4980 (100%) 10000 (100%)

Now withouth the % sign

NHANES %>% 
  janitor::tabyl(Race1, Gender) %>% 
  janitor::adorn_totals(where = c("row","col")) %>% 
  janitor::adorn_percentages(denominator = "row") %>% 
  janitor::adorn_pct_formatting(digits = 1, rounding = "half to even", affix_sign = FALSE) %>% 
  janitor::adorn_ns(position = "front") 
##     Race1      female        male         Total
##     Black  614 (51.3)  583 (48.7)  1197 (100.0)
##  Hispanic  320 (52.5)  290 (47.5)   610 (100.0)
##   Mexican  452 (44.5)  563 (55.5)  1015 (100.0)
##     White 3221 (50.5) 3151 (49.5)  6372 (100.0)
##     Other  413 (51.2)  393 (48.8)   806 (100.0)
##     Total 5020 (50.2) 4980 (49.8) 10000 (100.0)

Tables in (old) SPSS style

The expss package provides tabulation functions with support for ‘SPSS’-style labels, multiple / nested banners, weights, multiple-response variables and significance testing.

pacman::p_load(expss)

Now, we can create a table

NHANES::NHANES %>% 
  expss::tab_cells(AgeDecade, Race1, Education) %>% 
  expss::tab_cols(Gender) %>% 
  expss::tab_stat_cpct() %>% 
  expss::tab_last_sig_means(subtable_marks = "both") %>% 
  expss::tab_pivot() %>% 
  expss::set_caption("Table with summary statistics and significance marks.") %>% 
  htmlTable()
Table with summary statistics and significance marks.
 Gender 
 female     male 
 A     B 
 AgeDecade 
    0-9  13.5     15.2  
    10-19  14.2     14.3  
    20-29  14.1     13.9  
    30-39  14.0     13.7  
    40-49  14.1     14.8  
    50-59  12.9     14.1  
    60-69  9.9 > B   9.1 < A
    70+  7.2     4.9  
   #Total cases  4827.0     4840.0  
 Race1 
   Black  12.2     11.7  
   Hispanic  6.4     5.8  
   Mexican  9.0     11.3  
   White  64.2 > B   63.3 < A
   Other  8.2     7.9  
   #Total cases  5020.0     4980.0  
 Education 
   8th Grade  5.7     6.8  
   9 - 11th Grade  10.9     13.7  
   High School  20.9     21.1  
   Some College  32.6 > B   30.2 < A
   College Grad  29.9     28.2  
   #Total cases  3677.0     3544.0  

Table 1

Finally, to make the first table of a scientific report, the Table1 package allows summarizing several variables grouped by factors, as follows

NHANES %>% 
  table1::table1(~Age + Poverty + Race1 | Gender, data = .)
female
(n=5020)
male
(n=4980)
Overall
(n=10000)
Age
Mean (SD) 37.6 (22.7) 35.8 (22.0) 36.7 (22.4)
Median [Min, Max] 37.0 [0.00, 80.0] 36.0 [0.00, 80.0] 36.0 [0.00, 80.0]
Poverty
Mean (SD) 2.76 (1.68) 2.84 (1.68) 2.80 (1.68)
Median [Min, Max] 2.63 [0.00, 5.00] 2.75 [0.00, 5.00] 2.70 [0.00, 5.00]
Missing 380 (7.6%) 346 (6.9%) 726 (7.3%)
Race1
Black 614 (12.2%) 583 (11.7%) 1197 (12.0%)
Hispanic 320 (6.4%) 290 (5.8%) 610 (6.1%)
Mexican 452 (9.0%) 563 (11.3%) 1015 (10.2%)
White 3221 (64.2%) 3151 (63.3%) 6372 (63.7%)
Other 413 (8.2%) 393 (7.9%) 806 (8.1%)

and with some shading

table1::table1( ~ Age + Poverty + Race1| Gender, data = NHANES, 
                topclass="Rtable1-zebra" )
female
(n=5020)
male
(n=4980)
Overall
(n=10000)
Age
Mean (SD) 37.6 (22.7) 35.8 (22.0) 36.7 (22.4)
Median [Min, Max] 37.0 [0.00, 80.0] 36.0 [0.00, 80.0] 36.0 [0.00, 80.0]
Poverty
Mean (SD) 2.76 (1.68) 2.84 (1.68) 2.80 (1.68)
Median [Min, Max] 2.63 [0.00, 5.00] 2.75 [0.00, 5.00] 2.70 [0.00, 5.00]
Missing 380 (7.6%) 346 (6.9%) 726 (7.3%)
Race1
Black 614 (12.2%) 583 (11.7%) 1197 (12.0%)
Hispanic 320 (6.4%) 290 (5.8%) 610 (6.1%)
Mexican 452 (9.0%) 563 (11.3%) 1015 (10.2%)
White 3221 (64.2%) 3151 (63.3%) 6372 (63.7%)
Other 413 (8.2%) 393 (7.9%) 806 (8.1%)

The J Clin Epidemiology published an excellent guide that helps to design table 1, available at Hayes-Larson, E., Kezios, K.L., Mooney, S.J., Lovasi, G., 2019. Who is in this study, anyway? Guidelines for a useful Table 1. J. Clin. Epidemiol. 114, 125–132.

comments powered by Disqus