An Epidemiology Companion in R Markdown

Foreword

Epidemiology is a field of study that crosses between the study of medicine and public policy, and attempts to track and assess measures of human health in society through rigorous data keeping systems and mathematical analysis using statistical models. Given that statistics and data management are at the heart of epidemiology, the R programming language and RStudio IDE provide an ideal platform for the study of this subject. This guide compiles many frequently used formulas, visualizations and theorems in this field of study into the R programming language and presents these concepts in a textbook-style guide. In this guide, each chapter contains a different general subject matter, and contains a variety of formulas that can be called as a function. These functions may be called in a series, or nested into other formulas, to generate useful calculations and visualizations.

One aspect of Epidemiology is to use Routine Health Information Systems (RHIS) which record vital statistics, which are agglomerated from civil registration systems, to analyze health outcomes in a population.

Areas of Study

We will explore some real-world implementations of these data sources, and applications of the key concepts in epidemiology concepts, such as:

  • Accessing publicly available health Data
  • Transforming raw data into usable formats
  • Understanding the data
  • Properly documenting and citing your data sources
  • Visualizing the data in R using GGPlot2
  • How to re-use code and apply to different health metrics

Why R?

R and R Markdown provides a powerful platform for general statistical programming. Visualization capabilities in R are enhanced through the “Grammar of Graphics” as implemented through the GGPlot2 and GGForce packages. Data manipulation functions like select, mutate, and summarize are added by the dplyr package, enabling more convenient and readable R language constructs, using the pipe %>% operator. This operator can help your write intuitive syntax in R that resembles common structured query languages like SQL.

This document will not provide basic instruction in R. It is assumed that the reader has some familiarity with the R language, syntax patterns, installation, and configuration. If this is your first introduction to R, please check out the appendix to learn more about the basics of this language and installing the RStudio programming environment. The Rmd files (which this document was written in) may be downloaded from github, enabling the reader to change any portion of this document and customize the word problems to cover different applied subjects.

Intended Audience

This guide is intended for medical professionals, health care workers, researchers, teachers, students, or anyone else who needs to generate, visualize, apply, or solve epidemiology problems.

Load packages

This Rmarkdown file will use several packages as defined below. This code will load the packages only of they are not already installed. These packages will be loaded into the local scope via the library functions.

if(!require(tidyr)) install.packages("tidyr", repos = "http://cran.us.r-project.org")
if(!require(pracma)) install.packages("pracma", repos = "http://cran.us.r-project.org")
if(!require(dplyr)) install.packages("dplyr", repos = "http://cran.us.r-project.org")
if(!require(ggplot2)) install.packages("ggplot2", repos = "http://cran.us.r-project.org")
if(!require(ggforce)) install.packages("ggforce", repos = "http://cran.us.r-project.org")
if(!require(gtools)) install.packages("gtools", repos = "http://cran.us.r-project.org")
if(!require(matlib)) install.packages("matlib", repos = "http://cran.us.r-project.org")
if(!require(kableExtra)) install.packages("kableExtra", repos = "http://cran.us.r-project.org")
if(!require(RcppAlgos)) install.packages("RcppAlgos", repos = "http://cran.us.r-project.org")
if(!require(latex2exp)) install.packages("latex2exp", repos = "http://cran.us.r-project.org")
if(!require(survey)) install.packages("survey", repos = "http://cran.us.r-project.org")

library(tidyr)
library(pracma)
library(ggplot2)
library(ggforce)
library(gtools)
library(matlib)
library(kableExtra)
library(RcppAlgos)
library(latex2exp)
library(dplyr)
library(survey)

options(scipen=999)

Using RHIS Systems to Access Publicly Available Health Data

RHIS Sources may aggregate data from a variety of inputs including surveys, administrative and medical records, insurance claims, vital records, health surveillance, disease registries, and peer-reviewed literature.

For more in this subject, check out:

National Library of Medicine

In this RMarkdown document we will look into some available health data sources, and how they can be used to create health statistics.

We will explore how to use R to wrangle data from RHIS sources by using tidyr and dplyr, and how to visualize the data in R using the GGPlot package.

Survey Data: NHANES

In this section we will explore how to use R to download data from the NHANES “National Health and Nutrition Examination Survey”

NHANES Main Page on the CDC

From the NHANES main page, click Survey Data and Documentation to see available data sources.

NHANES Data Sources

Demographics – One file that includes demographic variables as well as survey weights and other survey design variables
Examination – Individual files on Audiometry, Blood Pressure, Body Measures, Muscular Strength, Oral Health, Vision Exam, etc.
Laboratory – Individual files on Urine Collection, Hepatitis A virus, HIV, Heavy Metals, Plasma Glucose, Total Cholesterol, Triglycerides, etc.
Questionnaire – Individual files on Alcohol Use, Balance, Blood Pressure, Diabetes, Drug Use, Social Support, Vision, Weight History, etc.
Dietary – Individual files for the Dietary Interview, Supplement Use, etc.

We will choose the 2017-2018 data since the newer data sources are not universally available yet, and the data can be greatly complicated by the outbreak of COVID. SO for simplicity we will use this pre-pandemic data.

NHANES Data Sources

The challenge in working with this data will be in figuring out how to combine multiple data sources, such as demographic and questionnaire data into one data frame in R; assigning meaningful values to the codes and data points that are returned; removing rows that are inapplicable to an analysis; and effectively communicating the meaning of the data to the reader.

NHANES Tutorial

In the online tutorials section of the NHANES web site, there are example code blocks provided in R to help you get started. These pull data from the 2013-2015 data surveys, downloading data in the SAS transport XPT format.

NHANES Tutorial

#'
options(survey.lonely.psu='adjust')

# Display Version Information
cat("R package versions:\n")
## R package versions:
for (p in c("base", "survey","dplyr")) { 
  cat(p, ": ", as.character(packageVersion(p)), "\n")
}
## base :  4.1.2 
## survey :  4.1.1 
## dplyr :  1.0.10
#' # Data preparation
# Download & Read SAS Transport Files
# Demographic (DEMO)
download.file("https://wwwn.cdc.gov/nchs/nhanes/2013-2014/DEMO_H.XPT", tf <- tempfile(), mode="wb")
DEMO_H <- foreign::read.xport(tf)[,c("SEQN","RIAGENDR","RIDAGEYR","SDMVSTRA","SDMVPSU","WTMEC2YR")]
download.file("https://wwwn.cdc.gov/nchs/nhanes/2015-2016/DEMO_I.XPT", tf <- tempfile(), mode="wb")
DEMO_I <- foreign::read.xport(tf)[,c("SEQN","RIAGENDR","RIDAGEYR","SDMVSTRA","SDMVPSU","WTMEC2YR")]

# Mental Health - Depression Screener (DPQ) 
download.file("http://wwwn.cdc.gov/nchs/nhanes/2013-2014/DPQ_H.XPT", tf <- tempfile(), mode="wb")
DPQ_H <- foreign::read.xport(tf)
download.file("http://wwwn.cdc.gov/nchs/nhanes/2015-2016/DPQ_I.XPT", tf <- tempfile(), mode="wb")
DPQ_I <- foreign::read.xport(tf)

# Append Files
DEMO <- bind_rows(DEMO_H, DEMO_I)
DPQ <- bind_rows(DPQ_H, DPQ_I)

# Merge DEMO and DPQ files and create derived variables

One <- left_join(DEMO, DPQ, by="SEQN") %>%
  # Set 7=Refused and 9=Don't Know To Missing for variables DPQ010 thru DPQ090 ##
  mutate_at(vars(DPQ010:DPQ090), ~ifelse(. >=7, NA, .)) %>%
  mutate(. , 
         # create indicator for overall summary
         one = 1,
         # Create depression score as sum of variables DPQ010 -- DPQ090
         Depression.Score = rowSums(select(. , DPQ010:DPQ090)),
         # Create depression indicator as binary 0/100 variable. (is missing if Depression.Score is missing)
         Depression= ifelse(Depression.Score >=10, 100, 0), 
         # Create factor variables
         Gender = factor(RIAGENDR, labels=c("Men", "Women")),
         Age.Group = cut(RIDAGEYR, breaks=c(-Inf,19,39,59,Inf),labels=c("Under 20", "20-39","40-59","60 and over")),
         # Generate 4-year MEC weight (Divide weight by 2 because we are appending 2 survey cycles)                                                             
         # Note: using the MEC Exam Weights (WTMEC2YR), per the analytic notes on the               
         #       Mental Health - Depression Screener (DPQ_H) documentation                          
         WTMEC4YR = WTMEC2YR/2 ,
         # Define indicator for analysis population of interest: adults aged 20 and over with a valid depression score
         inAnalysis= (RIDAGEYR >= 20 & !is.na(Depression.Score))
         ) %>% select(., -dplyr::starts_with("DPQ"))
# drop DPQ variables

#' ## Define survey design 
# Define survey design for overall dataset
NHANES_all <- svydesign(data=One, id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC4YR, nest=TRUE)
                    
# Create a survey design object for the subset of interest: adults aged 20 and over with a valid depression score 
# Subsetting the original survey design object ensures we keep the design information about the number of clusters and strata
NHANES <- subset(NHANES_all, inAnalysis)

#' ## Analysis
# Define a function to call svymean and unweighted count
getSummary <- function(varformula, byformula, design){
  # Get mean, stderr, and unweighted sample size
  c <- svyby(varformula, byformula, design, unwtd.count ) 
  p <- svyby(varformula, byformula, design, svymean ) 
  outSum <- left_join(select(c,-se), p) 
  outSum
}

#' ### Calculate prevalence of depression overall, by gender, by age group, and by age and gender
#' Adults
getSummary(~Depression, ~one, NHANES)
#' By sex
getSummary(~Depression, ~Gender, NHANES)
#' By age
getSummary(~Depression, ~Age.Group, NHANES)
#' By sex and age
getSummary(~Depression, ~Gender + Age.Group, NHANES)
#' ### Compare Prevalence Between Men And Women
svyttest(Depression~Gender, NHANES)$p.value %>% as.numeric 
## [1] 0.0000001706236
svyttest(Depression~Gender, subset(NHANES, Age.Group=="20-39"))$p.value %>% as.numeric
## [1] 0.0001131167
svyttest(Depression~Gender, subset(NHANES, Age.Group=="40-59"))$p.value %>% as.numeric
## [1] 0.0005705859
svyttest(Depression~Gender, subset(NHANES, Age.Group=="60 and over"))$p.value %>% as.numeric
## [1] 0.003983706
#' ### Pairwise t-testing by age groups for total, men, and women
#' Differences by age group, among all adults
# Full output from svyttest command
svyttest(Depression~Age.Group, subset(NHANES, Age.Group=="20-39" | Age.Group=="40-59"))
## 
##  Design-based t-test
## 
## data:  Depression ~ Age.Group
## t = 0.79398, df = 29, p-value = 0.4337
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
##  -1.079836  2.450262
## sample estimates:
## difference in mean 
##          0.6852129
# Displaying p-values only
svyttest(Depression~Age.Group, subset(NHANES, Age.Group=="20-39" | Age.Group=="40-59"))$p.value %>% as.numeric
## [1] 0.433655
svyttest(Depression~Age.Group, subset(NHANES, Age.Group=="20-39" | Age.Group=="60 and over"))$p.value %>% as.numeric
## [1] 0.8245736
svyttest(Depression~Age.Group, subset(NHANES, Age.Group=="40-59" | Age.Group=="60 and over"))$p.value %>% as.numeric
## [1] 0.6001652
#' Differences by age group, among men
svyttest(Depression~Age.Group, subset(NHANES, Gender=="Men" & (Age.Group=="20-39" | Age.Group=="40-59")))$p.value %>% as.numeric
## [1] 0.7927599
svyttest(Depression~Age.Group, subset(NHANES, Gender=="Men" & (Age.Group=="20-39" | Age.Group=="60 and over")))$p.value %>% as.numeric
## [1] 0.5938032
svyttest(Depression~Age.Group, subset(NHANES, Gender=="Men" & (Age.Group=="40-59" | Age.Group=="60 and over")))$p.value %>% as.numeric
## [1] 0.4339905
#' Differences by age group, among women
svyttest(Depression~Age.Group, subset(NHANES, Gender=="Women" & (Age.Group=="20-39" | Age.Group=="40-59")))$p.value %>% as.numeric
## [1] 0.3508035
svyttest(Depression~Age.Group, subset(NHANES, Gender=="Women" & (Age.Group=="20-39" | Age.Group=="60 and over")))$p.value %>% as.numeric
## [1] 0.7530381
svyttest(Depression~Age.Group, subset(NHANES, Gender=="Women" & (Age.Group=="40-59" | Age.Group=="60 and over")))$p.value %>% as.numeric
## [1] 0.2201892

Visualizing Data for Descriptive Epidemiology

Now that we have retrieved the NHANES data and summarized the results, we can visualize the data using GGplot. The field of descriptive epidemiology aims to interpret the data and find compelling narratives. This data on reported incidence of depression shows a very stark contrast between women and men in the rates at which they experience depression. We will use a descriptive headline to call attention to the stark contrast in the data being presented. Its a good idea to consider person, place and time in the presentation of the data, so we have been sure to mention these are sampled from the U.S. population (the place), comparing gender and age (the persons), and specifying the source and time range (the time) in the caption below the chart. The

sum_tbl <- getSummary(~Depression, ~Gender + Age.Group, NHANES)

sum_tbl %>% ggplot(aes(y=Depression,x=Age.Group, fill=Gender)) + geom_col(position='Dodge') + labs(title='US Women are twice as likely to experience depression',
                                                                                                   subtitle='This pattern applies to all ages, but most prevalent in middle age.',
                                                                                                   caption='NHANES Continuous Survey & Demographic Data 2013-2016 (1)',
                                                                                                   x='Age Group',
                                                                                                   y='Depression Score'
                                                                                                   )

Understanding the NHANES Depression data

The transformations in the data above represent raw data collected from a questionnaire, asking 20,146 US Adults age 18+ to self assess their symptoms of Depression. It seems fairly clear that the visualization implies a greater incidence of depression in women than men. But the exact nature of the data, and how we arrived a this conclusion is not particularly clear without additional analysis. It is important when we display a visualization to note what the X and Y axis are, and to make sure the units are clear.

In this case, the X axis is fairly obvious: “Age Groups”. But the Y axis is not. What is a Depression Score? What do those numbers mean? We need to be sure that we are accurately communicating what each chart axis means. Is a 6 supposed to mean 6% of adults experience depression, or a severity of 6? 6 out of what? We need to quantify the depression score.

Let’s take a deep dive into the the data routine to try to interpret the meaning of the Y axis.

Creating a Binary Depression Score

In the procedure above, on the mutate step, we are creating what is being called a “Depression Score”. This score is a binary measure (either 0 or 100) that is assigned based on the sum of columns DPQ010:DPQ090.

So now we must look at the informational page for the Mental Health Depression Screener to understand how that number is calculated. This is becoming more complicated tthat you might have guess looking at the chart!

That number is not really an exact measure of how many people experience ANY depression - it is a measure of how many people experience several symptoms of depression mildly or a few symptoms of depression acutely. Anyone knowledgeable about human nature might say that we all experience some depression or sadness from time to time; its not as simple as asking someone “are you depressed” as a yes or no question!

Understanding the questions

NHANES - Data Documentation, Codebook, and Frequencies: Mental Health - Depression Screener

Thus, the DPQ stats 010 through 090 aim to uncover how severe the specific symptoms of depression are. Since are fairly open ended questions, the results are somewhat “open to interpretation” by the survey respondent.

  • DPQ010 - Have little interest in doing things
  • DPQ020 - Feeling down, depressed, or hopeless
  • DPQ030 - Trouble sleeping or sleeping too much
  • DPQ040 - Feeling tired or having little energy
  • DPQ050 - Poor appetite or overeating
  • DPQ060 - Feeling bad about yourself
  • DPQ070 - Trouble concentrating on things
  • DPQ080 - Moving or speaking slowly or too fast
  • DPQ090 - Thought you would be better off dead

Understanding the answers

For each of the above, the response could have been 0, 1, 2, 3, 4, 7, 9, or .

Code Description

  • 0 - Not at all
  • 1 - Several days
  • 2 - More than half the days
  • 3 - Nearly every day
  • 7 - Refused
  • 9 - Don’t know
  • . - Missing

Now we can see this is a little more complicated than a yes or no question asking if you are depressed.

NHANES has very thoroughly explored the issue of the symptoms of depression, and tried to quantify each on based on frequency, while also giving respondents an “exit ramp” so they don’t have to answer. Presumably this would ensure that only persons who are comfortable self-assessing their depression would answer in these questions.

So, now understanding what each of the questions are and what the responses could be, we can see why in the code sample, answers greater than or equal to 7 (Refused, Don’t Know or Missing) are being removed from analysis - they are not instructive as to the question of how individuals rank in terms of their experience of depression. Not answering cannot be interpreted as meaning that you are or you aren’t depressed.

Adding it up and Classifying the Respondent

Having added up the scores, there is a somewhat arbitrary mutation being applied - it basically says that if the respondent answered a TOTAL of 10 or greater to all 9 questions we will consider them as a “100” meaning essentially “they are feeling depressed”. Answering less than 10 means they are a 0. Not being an expert in clinical psychology, I am not certain that this is a “standard” way of assessing this property. But it seems fairly reasonable.

We are saying that you would have to either answer as a few possible examples: Four of the nine questions as “Nearly every day” (3 * 4 = 12) or you would have to answer Eight of the questions with 1 “Several Days” and One with a 2 “More than half the days”)” (4 * 2) + (2 * 1) = 10)

Considering the reporting methodology and confounders.

So this assignment of a binary score is assessing people who have either experienced a few symptoms very severely or many symptoms some of the time. This seems fairly reasonable, but it does lose some of the specificity of the analysis. One might say that someone who answered DPQ090 “Thought you would be better off dead” with 3 “Nearly every day” should always be classified as depressed, even of none of the other symptoms apply. Using the strict criteria that only a 10 or greater = DEPRESSED (the binary measure), fails to capture the severity of this persons condition.

So it may be important when doing a simple analysis to either explore some of the specific questions in greater detail, or at least to note the methodology of the analysis very carefully and instruct the viewer of teh study as to exactly how the score is being calculated. Its also worth noting to the end viewer whether the score is intended to be interpreted as a “absolute” measure or more of a “relative measure” - for example, comparing women to men. If the measure is meant to be relative, we are not so much worried about the exact number of people, because we are simply interested in answering the question of which demographic is “more likely to be affected” by this condition.

Applying weights to extrapolate from the sample set to the general US Population

In the case of health data, interpreting the results oif a sample can be a complex matter. In this survey, the data is not just provided as raw responses to surveys, but rather, the outcomes of these surveys, and their applicability to the general population must be applid through rigorous curation, statistical analysis and extrapolation of the data. The responses given should be cleaned up to ensure completeness; this includes adjusting for non-response, correcting for incorrect/out of range values, and applying a weight to each response to attempt to ensure that the sum of all responses accurately represents the complete US population. According to the sampling estimation methodology this process involves a three stepo consideration:

“The first step involves the computation of stage base weights to compensate for unequal probabilities of selection. The second step adjusts for nonresponse to reduce the potential bias. In the third step, the sample weights are poststratified to the reference population. Poststratification is used to compensate for possible inadequacies in the eligible population and to reduce variances in the estimation procedure.”

Based on the above we can now understand why the code sample also applies a weight modifier. This helps us ensure that even if the sample under represents women or people over 60 for example, the final analysis will have considered that incongruent representation, and attempted to use accepted methods to correct for that.

Modifying the provided code to dig deeper on depression

The code sample provided by NHANES above provides an excellent starting point for further analysis. IN my experience as a professional programmer for over 25 years - using pre-supplied code and replacing variables carefully, can often give you the greatest overall productivity and impact of your time.

So we will now re-use the cod above that mutates the data, but this time, instead of looking at a “summary score” we will look at individual answers to the depression questions already noted above.

Since we already have th data in the environment, its not necessary to re-run the download of the data file. All we need to do is to recreate th “one” table with the combined demographic and survey data. BUt this time we will

  • break out multiple measures instead of summing them up into a binary score.

  • multiply the measures by 33.3 so that an answer of:

  • 0 - Not at all = Score of 0

  • 1 - Several days = Score of 33.3

  • 2 - More than half the days = Score of 66.6

  • 3 - Nearly every day = Score of 99.9

multiplier <- 33.3

all_measures <- left_join(DEMO, DPQ, by="SEQN") %>%
  # Set 7=Refused and 9=Don't Know To Missing for variables DPQ010 thru DPQ090 ##
  mutate_at(vars(DPQ010:DPQ090), ~ifelse(. >=7, NA, .)) %>%
  mutate(. , 
         # create indicator for overall summary
         one = 1,
         # Create depression score as sum of variables DPQ010 -- DPQ090
         Little.Interest = DPQ010*multiplier,
         Feeling.Down = DPQ020*multiplier,
         Trouble.Sleeping = DPQ030*multiplier,
         Feeling.Tired = DPQ040*multiplier,
         Poor.Apetite = DPQ050*multiplier,
         Feel.Bad.Self = DPQ060*multiplier,
         Trouble.Concentrating = DPQ070*multiplier,
         Speaking.Slow.Fast = DPQ080*multiplier,
         Better.Off.Dead = DPQ090*multiplier,
         Depression.Score = rowSums(select(. , DPQ010:DPQ090)),
         # Create depression indicator as binary 0/100 variable. (is missing if Depression.Score is missing)
         Depression= ifelse(Depression.Score >=10, 100, 0), 
         # Create factor variables
         Gender = factor(RIAGENDR, labels=c("Men", "Women")),
         Age.Group = cut(RIDAGEYR, breaks=c(-Inf,19,39,59,Inf),
                         labels=c("Under 20", "20-39","40-59","60 and over")),
         # Generate 4-year MEC weight (Divide weight by 2 because we are appending 2 survey cycles)                                                             
         # Note: using the MEC Exam Weights (WTMEC2YR), per the analytic notes on the               
         # Mental Health - Depression Screener (DPQ_H) documentation                          
         WTMEC4YR = WTMEC2YR/2 ,
         # Define indicator for analysis population of interest: adults aged 20 and over with a valid depression score
         inAnalysis= (RIDAGEYR >= 20 & !is.na(Depression.Score))
         ) %>% 
  # drop DPQ variables
  select(., -dplyr::starts_with("DPQ"))

#' ## Define survey design 
# Define survey design for overall dataset
NHANES_all <- svydesign(data=all_measures, id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC4YR, nest=TRUE)
                    
# Create a survey design object for the subset of interest: adults aged 20 and over with a valid depression score 
# Subsetting the original survey design object ensures we keep the design information about the number of clusters and strata
NHANES <- subset(NHANES_all, inAnalysis)

NHANES
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (60) clusters.
## subset(NHANES_all, inAnalysis)
#' By sex and age
getSummary(~Little.Interest, ~RIAGENDR, NHANES)
sum_tbl <- getSummary(~Little.Interest+Feeling.Down+Trouble.Sleeping+Feeling.Tired+Poor.Apetite+Feel.Bad.Self+Trouble.Concentrating+Speaking.Slow.Fast+Better.Off.Dead, ~Age.Group + Gender, NHANES)

sum_tbl %>% ggplot(aes(y=Trouble.Sleeping,x=Age.Group, fill=Gender)) + 
  geom_col(position='dodge') + 
  labs(title='US Women experience more severe sleep difficulties',
       subtitle='This pattern increases with age in women.',
       caption='NHANES Continuous Survey & Demographic Data 2013-2016 (1)')

sum_tbl <- getSummary(~Little.Interest, ~Gender + Age.Group, NHANES)

sum_tbl %>% ggplot(aes(y=Little.Interest,x=Age.Group, fill=Gender)) + 
  geom_col(position='Dodge') + 
  labs(title='US Women experience more severe lack of interest in doing things',
       subtitle='This pattern increases with age in women.',
       caption='NHANES Continuous Survey & Demographic Data 2013-2016 (1)')

sum_tbl <- getSummary(~Better.Off.Dead, ~Gender + Age.Group, NHANES)

sum_tbl %>% ggplot(aes(y=Better.Off.Dead,x=Age.Group, fill=Gender)) + geom_col(position='Dodge') + labs(title='Older US Men more likely to feel they are better off dead',
                                                                                                   subtitle='Older women are less likely than men to report extreme feelings of life doubt.',
                                                                                                   caption='NHANES Continuous Survey & Demographic Data 2013-2016 (1)')

complete_tbl <- getSummary(~Little.Interest+Feeling.Down+Trouble.Sleeping+Feeling.Tired+Poor.Apetite+Feel.Bad.Self+Trouble.Concentrating+Speaking.Slow.Fast+Better.Off.Dead, ~Age.Group + Gender, NHANES)
tidy_data <- complete_tbl %>% select(Gender,
                                     Age.Group, 
                                     Little.Interest,
                                     Feeling.Down,
                                     Trouble.Sleeping,
                                     Feeling.Tired,
                                     Poor.Apetite,
                                     Feel.Bad.Self,
                                     Trouble.Concentrating,
                                     Speaking.Slow.Fast,
                                     Better.Off.Dead) %>% pivot_longer(cols=c(Little.Interest,Feeling.Down,Trouble.Sleeping,Feeling.Tired,Poor.Apetite,Feel.Bad.Self,Trouble.Concentrating,Speaking.Slow.Fast,Better.Off.Dead))
tidy_data
tidy_data %>% ggplot(aes(x=value,y=name, fill=Gender)) + 
  geom_col(position='Dodge') + 
  labs(title='US Women report higher incidence of most signs of depression',
       subtitle='Sleep issues are most common',
       caption='NHANES Continuous Survey & Demographic Data 2013-2016 (1)')

tidy_data %>% 
  ggplot(aes(x=value,y=name, fill=Gender)) + 
  geom_col(position='Dodge') + 
  facet_wrap(~Age.Group)+
  labs(title='US Women more report higher incidence of most signs of depression',
       subtitle='But Men are more likely to feel they would be better off dead',
       caption='NHANES Continuous Survey & Demographic Data 2013-2016 (1)')

Digging in to the data on a more granular level, it is clear that tiredness and trouble sleeping are the two symptoms with the greatest overall incidence. This is perhaps unsurprising as the two are likely associated with each other - lack of sleep and tiredness go hand in hand. Its also clear that (thankfully) people feeling they would be “better off dead” has the lowest incidence. One interesting note here is that this rather morbid indicator is the only one that was found to be more prevalent in men than in women.

Applying the code to a new data set

First we will navigate back to the available surveys and choose the same date range as the previous analysis. https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2013

From here we will examine the questionnaire data group, since that is most similar to the Mental Health Depression Screener data example provided. https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Questionnaire&CycleBeginYear=2013

For this analysis we will instead examine reported alcohol use. A profile of this data may be found here: https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/ALQ_H.htm

In the codebook section, you can find the variable name(s) you will want to extract.

For this analysis we will look at the questionaire for alcohol use instead of eth mental health screener. The demographics data is already loaded, so we will re-use that from above.

A few modifications to the data allow re-use of the same code.

  • Change the data files to read ALQ_H and ALQ_I from 2013-2014 and 2015-2016 respectively
  • Examine variable ALQ130 “Number of drinks per day”
  • Remove the ifelse(Depression.Score >=10, 100, 0), line to use the raw numbers data instead of a binary measure.
# Mental Health - Depression Screener (ALQ) 
download.file("http://wwwn.cdc.gov/nchs/nhanes/2013-2014/ALQ_H.XPT", tf <- tempfile(), mode="wb")
ALQ_H <- foreign::read.xport(tf)
download.file("http://wwwn.cdc.gov/nchs/nhanes/2015-2016/ALQ_I.XPT", tf <- tempfile(), mode="wb")
ALQ_I <- foreign::read.xport(tf)

# Append Files
DEMO <- bind_rows(DEMO_H, DEMO_I)
ALQ <- bind_rows(ALQ_H, ALQ_I)

# Merge DEMO and ALQ files and create derived variables

One <- left_join(DEMO, ALQ, by="SEQN") %>%
  # Set 7=Refused and 9=Don't Know To Missing for variables ALQ010 thru ALQ090 ##
  #mutate_at(vars(ALQ130), ~ifelse(. >=7, NA, .)) %>%
  mutate(. , 
         # create indicator for overall summary
         one = 1,
         # Create depression score as sum of variables ALQ010 -- ALQ090
         Alcohol.PerDay = rowSums(select(. , ALQ130)),
         #Alcohol.EverHeavy = rowSums(select(. , ALQ151)),
         # Create depression indicator as binary 0/100 variable. (is missing if Depression.Score is missing)
         DrinksPerDay=  Alcohol.PerDay, #
         #HeavyDrinker = ifelse(Alcohol.EverHeavy  == 1, 1, 0), 
         # Create factor variables
         Gender = factor(RIAGENDR, labels=c("Men", "Women")),
         Age.Group = cut(RIDAGEYR, breaks=c(-Inf,19,29,39,49,59,Inf),labels=c("Under 20", "20-29","30-39","40-49","50-59","60 and over")),
         # Generate 4-year MEC weight (Divide weight by 2 because we are appending 2 survey cycles)                                                             
         # Note: using the MEC Exam Weights (WTMEC2YR), per the analytic notes on the               
         #       Mental Health - Depression Screener (ALQ_H) documentation                          
         WTMEC4YR = WTMEC2YR/2 ,
         # Define indicator for analysis population of interest: adults aged 20 and over with a valid depression score
         inAnalysis= (!is.na(Alcohol.PerDay)) #RIDAGEYR >= 21 & 
         ) %>% 
  # drop ALQ variables
  select(., -dplyr::starts_with("ALQ"))
One
#' ## Define survey design 
# Define survey design for overall dataset
NHANES_all <- svydesign(data=One, id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC4YR, nest=TRUE)
                    
# Create a survey design object for the subset of interest: adults aged 20 and over with a valid depression score 
# Subsetting the original survey design object ensures we keep the design information about the number of clusters and strata
NHANES <- subset(NHANES_all, inAnalysis)

#' ### Calculate prevalence of depression overall, by gender, by age group, and by age and gender
#' Adults
getSummary(~DrinksPerDay, ~one, NHANES)
#' By sex
getSummary(~DrinksPerDay, ~Gender, NHANES)
#' By age
getSummary(~DrinksPerDay, ~Age.Group, NHANES)
#' By sex and age
getSummary(~DrinksPerDay, ~Gender + Age.Group, NHANES)
sum_tbl <- getSummary(~DrinksPerDay, ~Gender + Age.Group, NHANES)

sum_tbl %>% ggplot(aes(y=DrinksPerDay,x=Age.Group, fill=Gender)) + 
  geom_col(position='dodge') + 
  labs(title='Men Drink about 20% more than women daily',
       subtitle='Consumption declines with age.',
       caption='NHANES Continuous Survey & Demographic Data 2013-2015',
       x='Age Group', 
       y='Alcoholic beverages consumed per day (2)')

sum_tbl <- getSummary(~DrinksPerDay, ~Gender + Age.Group, NHANES)

sum_tbl %>% ggplot(aes(y=DrinksPerDay,
                       x=Age.Group, 
                       color=Gender, 
                       ymin=DrinksPerDay-se, 
                       ymax=DrinksPerDay+se),position='dodge') + 
  geom_point() + 
  geom_errorbar() + 
  labs(title='Men Drink about 20% more than women daily',
       subtitle='Consumption declines with age.',
       caption='NHANES Continuous Survey & Demographic Data 2013-2015',
       x='Age Group', 
       y='Alcoholic beverages consumed per day (2)')

One <- left_join(DEMO, ALQ, by="SEQN") %>%
  # Set 7=Refused and 9=Don't Know To Missing for variables ALQ010 thru ALQ090 ##
  #mutate_at(vars(ALQ130), ~ifelse(. >=7, NA, .)) %>%
  mutate(. , 
         # create indicator for overall summary
         one = 1,
         # Create depression score as sum of variables ALQ010 -- ALQ090
         #Alcohol.PerDay = rowSums(select(. , ALQ130)),
         Alcohol.EverHeavy = rowSums(select(. , ALQ151)),
         # Create depression indicator as binary 0/100 variable. (is missing if Depression.Score is missing)
         #DrinksPerDay=  Alcohol.PerDay, #
         HeavyDrinker = ifelse(Alcohol.EverHeavy  == 1, 1, 0), 
         # Create factor variables
         Gender = factor(RIAGENDR, labels=c("Men", "Women")),
         Age.Group = cut(RIDAGEYR, breaks=c(-Inf,19,29,39,49,59,Inf),labels=c("Under 20", "20-29","30-39","40-49","50-59","60 and over")),
         # Generate 4-year MEC weight (Divide weight by 2 because we are appending 2 survey cycles)                                                             
         # Note: using the MEC Exam Weights (WTMEC2YR), per the analytic notes on the               
         #       Mental Health - Depression Screener (ALQ_H) documentation                          
         WTMEC4YR = WTMEC2YR/2 ,
         # Define indicator for analysis population of interest: adults aged 20 and over with a valid depression score
         inAnalysis= (!is.na(Alcohol.EverHeavy)) #RIDAGEYR >= 21 & 
         ) %>% 
  # drop ALQ variables
  select(., -dplyr::starts_with("ALQ"))
One
#' ## Define survey design 
# Define survey design for overall dataset
NHANES_all <- svydesign(data=One, id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC4YR, nest=TRUE)
                    
# Create a survey design object for the subset of interest: adults aged 20 and over with a valid depression score 
# Subsetting the original survey design object ensures we keep the design information about the number of clusters and strata
NHANES <- subset(NHANES_all, inAnalysis)

sum_tbl <- getSummary(~HeavyDrinker, ~Gender + Age.Group, NHANES)

sum_tbl %>% ggplot(aes(y=HeavyDrinker,x=Age.Group, fill=Gender)) + geom_col(position='Dodge') + 
  labs(title='Older men increasingly drink in excess',
       subtitle='Women less so.',
       caption='NHANES Continuous Survey & Demographic Data 2013-2015',
       x='Age Group', 
       y='% who drink 4-5 drinks daily (2)')

Annotated Bibliography

In this section, we will provide resource for teh reader to find the actual data sources of our analysis. By numbering the references, we can provide an exact link to teh information vignette on the source site for each chart

Analytics Guidelines https://wwwn.cdc.gov/nchs/nhanes/analyticguidelines.aspx

Estimation Procedures https://www.cdc.gov/nchs/data/series/sr_02/sr02_177.pdf

  1. NHANES Depression Statistics 2013-2016 https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DPQ_H.htm

  2. NHANES Alcohol Use Statistics 2013-2016 https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/ALQ_H.htm


This concludes “An Epidemiology Companion in R Markdown” - Thanks for reading, and please check back for more installments:

In “A Probability Companion in R Markdown”, we will explore more fully the subjects of probability, statistics, and additional applications of expected value, including subjects of linear modeling and inference. We will look at how to use R to generate histograms, ogives, box plots, and other statistical charts.

In “A Linear Algebra Companion in R Markdown” we will more thoroughly explore the subject of vector spaces, and how these to visualize vectors geometrically using the principles of Linear Algebra. This guide will explore more about the inner product, number spaces, and the concepts of magnitude and orthogonality.

Appendix: Installing R

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. To get the latest version of R, please visit:

https://www.r-project.org/

Packages used in this document are available from the Comprehensive R Archive Network (CRAN):

http://cran.us.r-project.org

For more info on R packages visit:

https://cran.r-project.org/web/packages/

Notes on Publishing RMarkdown

This document was generated by RMarkdown using R code and LaTeX, and rendered to HTML.

Additional formatting has been applied to results variables using the kable and kableExtra packages. These code blocks are not shown in the output, for readability.