Introduction

So far, we have learned followings:

  1. Study Design
  2. Quantifying Extent of Disease
  3. Summarizing Data collected in the sample
  4. The Role of Probability

Probably, it is a good time to have a look a real world data, BRFSS data, set and how these knowledges are applied.

BRFSS Data

2018 BRFSS data can be downloadable from Here.

Go to the link for more detailed information. For a simple download, click here. Download

Codebook, downloadable here. Download

Unzip it, and put LLCP2018.XPT in data directory.

In [1]:
pacman::p_load(RCurl, foreign, downloader, survey, srvyr, ggplot2, dplyr, tidyverse)
In [2]:
brfss18 <- read.xport('data/LLCP2018.XPT')
In [3]:
sprintf("BRFSS data have %d recores, and %d number of variables", nrow(brfss18), ncol(brfss18))
'BRFSS data have 437436 recores, and 275 number of variables'
In [4]:
brfss18[1:5,]
X_STATEFMONTHIDATEIMONTHIDAYIYEARDISPCODESEQNOX_PSUCTELENM1...X_MAM5022X_RFPAP34X_RFPSA22X_RFBLDS3X_COL10YRX_HFOB3YRX_FS5YRX_FOBTFSX_CRCRECX_AIDTST3
1 1 01052018 01 05 2018 1100 20180000012.018e+09 1 ... NA NA NA NA NA NA NA NA NA 2
1 1 01122018 01 12 2018 1100 20180000022.018e+09 1 ... NA 1 NA NA NA NA NA NA NA 2
1 1 01082018 01 08 2018 1100 20180000032.018e+09 1 ... NA NA NA NA NA NA NA NA NA 2
1 1 01032018 01 03 2018 1100 20180000042.018e+09 1 ... NA NA 2 2 2 2 2 2 2 2
1 1 01122018 01 12 2018 1100 20180000052.018e+09 1 ... NA NA NA NA NA NA NA NA NA 2

Research Question 1

Since the data is huge, let's narrow down our research question.

Prevalance of overweight/obesity and General health level (fair or poor) for each state in US

Codebook info for Selected Variables:

Education: Figs/educag

Income: Figs/income

General Health: Figs/gh

BMI: Figs/BMI

Sate ID info:

In [5]:
state.id <- read.csv("https://raw.githubusercontent.com/arilamstein/complexsurvey/master/brfss_state.csv", stringsAsFactors=F)
In [6]:
brfss18[1:5, c('X_PSU', 'X_STSTR','X_LLCPWT', 'X_STATE', 'X_EDUCAG', 'INCOME2', 'GENHLTH', 'X_BMI5CAT', 'SEX1')]
X_PSUX_STSTRX_LLCPWTX_STATEX_EDUCAGINCOME2GENHLTHX_BMI5CATSEX1
2.018e+0911011 173.18511 4 6 2 2 2
2.018e+0911012 1540.50431 4 4 3 4 2
2.018e+0911011 607.09811 2 3 5 3 2
2.018e+0911011 822.67221 2 3 1 3 1
2.018e+0911012 2229.72841 4 99 2 NA 2
In [7]:
sub_brfss = brfss18 %>% select( c('X_STATE', 'X_EDUCAG', 'INCOME2', 'GENHLTH', 'X_BMI5CAT', 'SEX1') )
#sub_brfss = brfss18 %>% select( c('X_STATE', 'X_EDUCAG', 'INCOME2', 'GENHLTH', 'X_BMI5CAT', 'SEX') )
In [20]:
brfss.design <- brfss18 %>%
  as_survey_design(ids=X_PSU,
                   weight=X_LLCPWT,
                   nest=T,
                   strata=X_STSTR,
                   variables= c(X_BMI5CAT, X_EDUCAG, INCOME2, GENHLTH, SEX1, X_STATE))
In [33]:
brfss.design <- brfss.design %>%
  mutate(
    X_EDUCAG2   = car::recode(X_EDUCAG, "1='<High School'; 2='Graduated High School'; 3='Attended College or Technical School'; 4='Graduated College or Technical School'; else=NA"),
    X_EDUCAG2   = factor(X_EDUCAG2, levels = c('<High School', 'Graduated High School', 'Attended College or Technical School', 'Graduated College or Technical School')),
    X_EDUCAG2   = fct_explicit_na(X_EDUCAG2),
    X_BMI5CAT2  = car::recode(X_BMI5CAT, "c(3,4)=1; NA=NA;  else=0"),
    X_BMI5CAT2  = as.factor(X_BMI5CAT2),
    X_BMI5CAT2a = car::recode(X_BMI5CAT2, "1='Obese/overweight'; NA=NA; else='Not Obese/overweight'"),
    X_BMI5CAT2a = factor(X_BMI5CAT2a, levels=c('Obese/overweight', 'Not Obese/overweight'), ordered=TRUE),
    X_BMI5CAT2 = fct_explicit_na(X_BMI5CAT2),
    INCOME2a   = car::recode(INCOME2, "1='<10,000'; 2='10,000~15,000'; 3='15,000~20,000'; 4='20,000~25,000'; 5='25,000~35,000'; 6='35,000~50,000'; 7='50,000~75,000'; 8='>75,000';  NA=NA; else=NA"),
    INCOME2a   = factor(INCOME2a, levels = c('<10,000', '10,000~15,000', '15,000~20,000', '20,000~25,000', '25,000~35,000', '35,000~50,000', '50,000~75,000', '>75,000')),
    INCOME2a = fct_explicit_na(INCOME2a),
    GENHLTH2  = car::recode(GENHLTH, "c(4,5)=1; c(1,2,3) = 0; NA=NA; else = NA"),
    GENHLTH2  = as.factor(GENHLTH2),
    GENHLTH2a = car::recode(GENHLTH2, "1='Fair/Poor'; NA=NA; else='G/VG/Excelent'"),
    GENHLTH2a = factor(GENHLTH2a, levels=c('Fair/Poor', 'G/VG/Excelent'), ordered=TRUE),
    GENHLTH2 = fct_explicit_na(GENHLTH2),
    SEX1   = car::recode(SEX1, "1='MALE'; 2='FEMALE'; else=NA"),
    SEX1   = factor(SEX1, levels = c('MALE', 'FEMALE') )  ,
    SEX1 = fct_explicit_na(SEX1)
  )

Take long time (3:52~4:07)

In [11]:
options(survey.lonely.psu = "certainty")

state.bmi <- brfss.design %>%
  group_by(X_STATE, X_BMI5CAT2) %>%
  summarize(prevalence = survey_mean(na.rm=T, vartype = c("ci")),
            N = survey_total(na.rm=T)) %>%
  mutate(X_STATE = as.character(X_STATE))
Warning message:
"Factor `X_BMI5CAT2` contains implicit NA, consider using `forcats::fct_explicit_na`"
In [12]:
state.bmi[1:10,]
X_STATEX_BMI5CAT2prevalenceprevalence_lowprevalence_uppNN_se
1 0 0.30327440.28689300.31965591071098.433241.715
1 1 0.69672560.68034410.71310702460681.337593.253
1 NA NA NA NA NA NA
2 0 0.35786890.33071850.3850192 187890.5 7980.116
2 1 0.64213110.61498080.6692815 337135.6 8709.593
2 NA NA NA NA NA NA
4 0 0.35292430.33485930.37098931764359.650518.087
4 1 0.64707570.62901070.66514073234898.261068.633
4 NA NA NA NA NA NA
5 0 0.29535270.27554880.3151566 630629.624012.588
In [13]:
data(county.regions, package="choroplethrMaps")
county.regions[1:5,]
regioncounty.fips.charactercounty.namestate.namestate.fips.characterstate.abb
11001 01001 autaugaalabama01 AL
361003 01003 baldwinalabama01 AL
551005 01005 barbouralabama01 AL
151007 01007 bibb alabama01 AL
21009 01009 blount alabama01 AL
In [14]:
county.regions <- county.regions %>%
  select(state.fips.character, state.name) %>% unique
In [15]:
county.regions[1:5,]
state.fips.characterstate.name
101 alabama
8302 alaska
10104 arizona
12705 arkansas
19606 california
In [16]:
state.bmi <- state.bmi %>%
  mutate(X_STATE = str_pad(X_STATE, 2, side="left", pad="0")) %>%
  left_join(county.regions, by=c("X_STATE"="state.fips.character")) %>%
  mutate(state.name = ifelse(X_STATE=="72", "puerto rico",
                             ifelse(X_STATE=="66", "guam", state.name)))
In [17]:
state.bmi[1:10,]
X_STATEX_BMI5CAT2prevalenceprevalence_lowprevalence_uppNN_sestate.name
01 0 0.30327440.28689300.31965591071098.433241.715alabama
01 1 0.69672560.68034410.71310702460681.337593.253alabama
01 NA NA NA NA NA NAalabama
02 0 0.35786890.33071850.3850192 187890.5 7980.116alaska
02 1 0.64213110.61498080.6692815 337135.6 8709.593alaska
02 NA NA NA NA NA NAalaska
04 0 0.35292430.33485930.37098931764359.650518.087arizona
04 1 0.64707570.62901070.66514073234898.261068.633arizona
04 NA NA NA NA NA NAarizona
05 0 0.29535270.27554880.3151566 630629.624012.588arkansas
In [18]:
state.bmi %>% filter(X_BMI5CAT2==1) %>% 
    ggplot(aes(y=reorder(state.name, prevalence), x=prevalence, xmin=prevalence_low, xmax=prevalence_upp)) +
    geom_point() +
    geom_errorbarh(height=0) +
    labs(y="State", x=NULL,
       title="BMI Overweight and Obese by State",
       caption="2014 CDC BRFSS") +
    theme_bw() +
    theme(axis.text.y=element_text(size=rel(0.8)))
In [23]:
state.bmi2 <- state.bmi %>%
  filter(X_BMI5CAT2==1) %>%
  select(region=state.name, value=prevalence)
choroplethr::state_choropleth(state.bmi2, title="Computed body mass index categories (overweight or obese) choropleth map", num_colors=9)
Warning message in super$initialize(map.df, user.df):
"Your data.frame contains the following regions which are not mappable: guam, puerto rico"
In [ ]:

In [35]:
state.ht <- brfss.design %>%
  group_by(X_STATE, GENHLTH2) %>%
  summarize(prevalence = survey_mean(na.rm=T, vartype = c("ci")),
            N = survey_total(na.rm=T)) %>%
  mutate(X_STATE = as.character(X_STATE))
In [36]:
state.ht[1:5,]
X_STATEGENHLTH2prevalenceprevalence_lowprevalence_uppNN_se
1 0 0.7655323540.7519328980.7791318092916875.77 40775.014
1 1 0.2277733120.2143656310.241180994 867875.08 26890.580
1 (Missing) 0.0066943340.0037111650.009677503 25507.14 5814.350
2 0 0.8404530610.8205039710.860402150 467618.84 8662.001
2 1 0.1552184200.1354832580.174953582 86361.82 5748.416
In [37]:
state.ht <- state.ht %>%
  mutate(X_STATE = str_pad(X_STATE, 2, side="left", pad="0")) %>%
  left_join(county.regions, by=c("X_STATE"="state.fips.character")) %>%
  mutate(state.name = ifelse(X_STATE=="72", "puerto rico",
                             ifelse(X_STATE=="66", "guam", state.name)))
In [38]:
state.ht[1:5,]
X_STATEGENHLTH2prevalenceprevalence_lowprevalence_uppNN_sestate.name
01 0 0.7655323540.7519328980.7791318092916875.77 40775.014 alabama
01 1 0.2277733120.2143656310.241180994 867875.08 26890.580 alabama
01 (Missing) 0.0066943340.0037111650.009677503 25507.14 5814.350 alabama
02 0 0.8404530610.8205039710.860402150 467618.84 8662.001 alaska
02 1 0.1552184200.1354832580.174953582 86361.82 5748.416 alaska
In [39]:
state.ht %>% filter(GENHLTH2==1) %>% 
    ggplot(aes(y=reorder(state.name, prevalence), x=prevalence, xmin=prevalence_low, xmax=prevalence_upp)) +
    geom_point() +
    geom_errorbarh(height=0) +
    labs(y="State", x=NULL,
       title="Health Condition Fair or Poor by State",
       caption="2014 CDC BRFSS") +
    theme_bw() +
    theme(axis.text.y=element_text(size=rel(0.8)))
In [40]:
state.ht2 <- state.ht %>%
  filter(GENHLTH2==1) %>%
  select(region=state.name, value=prevalence)
choroplethr::state_choropleth(state.ht2, title="General Health level categories (fair or poor) choropleth map", num_colors=9)
Warning message in super$initialize(map.df, user.df):
"Your data.frame contains the following regions which are not mappable: guam, puerto rico"

Research Question 2

So far, we used survey design, and observational weights, X_LLCPWT, are used for the analysis. However, many times we may not need such weight modifications. This time, let's assume each sample in the survey have the same importance. and look at research question :

How is the prevalance of Overweight/Obesity and General Health level (fair or poor) with respect to Educational level or annual income in Wisconsin State?

Take Wisconsin state(55) from data

In [41]:
brfss18 <- brfss18 %>% tbl_df()

brfss_wisc <- brfss18 %>% filter(X_STATE == 55)

Do some preprocessing:

In [42]:
brfss_wisc <- brfss_wisc %>%
  mutate(
    X_EDUCAG2   = car::recode(X_EDUCAG, "1='<High School'; 2='Graduated High School'; 3='Attended College or Technical School'; 4='Graduated College or Technical School'; else=NA"),
    X_EDUCAG2   = factor(X_EDUCAG2, levels = c('<High School', 'Graduated High School', 'Attended College or Technical School', 'Graduated College or Technical School')),
    X_EDUCAG2   = fct_explicit_na(X_EDUCAG2),
    X_BMI5CAT2  = car::recode(X_BMI5CAT, "c(3,4)=1; NA=NA;  else=0"),
    X_BMI5CAT2  = as.factor(X_BMI5CAT2),
    X_BMI5CAT2a = car::recode(X_BMI5CAT2, "1='Obese/overweight'; NA=NA; else='Not Obese/overweight'"),
    X_BMI5CAT2a = factor(X_BMI5CAT2a, levels=c('Obese/overweight', 'Not Obese/overweight'), ordered=TRUE),
    X_BMI5CAT2 = fct_explicit_na(X_BMI5CAT2),
    X_BMI5CAT2a = fct_explicit_na(X_BMI5CAT2a),
    INCOME2a   = car::recode(INCOME2, "1='<10,000'; 2='10,000~15,000'; 3='15,000~20,000'; 4='20,000~25,000'; 5='25,000~35,000'; 6='35,000~50,000'; 7='50,000~75,000'; 8='>75,000';  NA=NA; else=NA"),
    INCOME2a   = factor(INCOME2a, levels = c('<10,000', '10,000~15,000', '15,000~20,000', '20,000~25,000', '25,000~35,000', '35,000~50,000', '50,000~75,000', '>75,000')),
    INCOME2a = fct_explicit_na(INCOME2a),
    GENHLTH2  = car::recode(GENHLTH, "c(4,5)=1; c(1,2,3) = 0; NA=NA; else = NA"),
    GENHLTH2  = as.factor(GENHLTH2),
    GENHLTH2a = car::recode(GENHLTH2, "1='Fair/Poor'; NA=NA; else='G/VG/Excelent'"),
    GENHLTH2a = factor(GENHLTH2a, levels=c('Fair/Poor', 'G/VG/Excelent'), ordered=TRUE),
    GENHLTH2 = fct_explicit_na(GENHLTH2),
    GENHLTH2a = fct_explicit_na(GENHLTH2a),
    SEX1   = car::recode(SEX1, "1='MALE'; 2='FEMALE'; else=NA"),
    SEX1   = factor(SEX1, levels = c('MALE', 'FEMALE') )  ,
    SEX1 = fct_explicit_na(SEX1)
  )

Prevalence of Overweight/Obesity, General Health with respect to educational level

In [44]:
tb_wisc <- brfss_wisc  %>%
    select(X_EDUCAG2, X_BMI5CAT2a, INCOME2a, GENHLTH2a, SEX1, X_BMI5CAT) %>%
    group_by(X_EDUCAG2) %>% 
    summarize(prevalence_highBMI = sum(as.numeric(X_BMI5CAT2a) == 1) / sum(as.numeric(X_BMI5CAT2a) %in% 1:2),
              prevalence_GENHLTH2a = sum(GENHLTH2a == 'Fair/Poor') / sum(!(GENHLTH2a %in% '(Missing)')),
              prevalance_overweight = sum(X_BMI5CAT == 3, na.rm=T) / sum(X_BMI5CAT %in% 1:4, na.rm=T),
              prevalance_obesity = sum(X_BMI5CAT == 4, na.rm=T) / sum(X_BMI5CAT %in% 1:4, na.rm=T),
            N = n()) 

tb_wisc
X_EDUCAG2prevalence_highBMIprevalence_GENHLTH2aprevalance_overweightprevalance_obesityN
<High School 0.7098214 0.3654618 0.3392857 0.3705357 249
Graduated High School 0.7341772 0.2155340 0.3684951 0.3656821 1547
Attended College or Technical School 0.7334361 0.1704225 0.3559322 0.3775039 1424
Graduated College or Technical School0.6483180 0.1000569 0.3681957 0.2801223 1759
(Missing) 0.5000000 0.1764706 0.3333333 0.1666667 17

Prevalence of General Health level(fair/poor) w.r.t Education level

In [49]:
tb_wisc %>%
ggplot( aes(y = prevalence_GENHLTH2a, x = X_EDUCAG2) ) +
    geom_bar(stat="identity") +
    geom_text(aes(label=round(prevalence_GENHLTH2a,2)), vjust=1.6, color="white", size=3.5)+
    xlab('Education Level') +
    ylab('Prevalence of Genral Health(fair/poor)') +
    theme_bw() + 
#    theme_minimal() +
    theme(
        text = element_text(size = 17),
        axis.text.x = element_text(angle=35, hjust = 1)
    )
In [ ]:

Relationship between Income and Educational level

In [50]:
brfss_wisc  %>%
    select(INCOME2a, X_EDUCAG2) %>%
    group_by(INCOME2a, X_EDUCAG2) %>%
    tally() %>%
    group_by(INCOME2a) %>%
    mutate(edu_ratio = n / sum(n)) %>% ggplot() +
    geom_col(aes( x = INCOME2a, y = edu_ratio, fill = X_EDUCAG2),
             position = "stack") +
    xlab('Income') +
    ylab('Education level proportions') +
    theme_bw() + 
#    theme_minimal() +
    theme(
        text = element_text(size = 10),
        axis.text.x = element_text(angle=35, hjust = .9)
    )

Prevalence of General Health (Fair/Poor) w.r.t Annual Income

In [51]:
brfss_wisc  %>%
    select(X_EDUCAG2, X_BMI5CAT2a, INCOME2a, GENHLTH2a, SEX1, X_BMI5CAT) %>%
    group_by(INCOME2a) %>% 
    summarize(prevalence_highBMI = sum(as.numeric(X_BMI5CAT2a) == 1) / sum(as.numeric(X_BMI5CAT2a) %in% 1:2),
              prevalence_GENHLTH2a = sum(GENHLTH2a == 'Fair/Poor') / sum(!(GENHLTH2a %in% '(Missing)')),
              prevalance_overweight = sum(X_BMI5CAT == 3, na.rm=T) / sum(X_BMI5CAT %in% 1:4, na.rm=T),
              prevalance_obesity = sum(X_BMI5CAT == 4, na.rm=T) / sum(X_BMI5CAT %in% 1:4, na.rm=T),
            N = n()) %>%
    ggplot( aes(y = prevalence_GENHLTH2a, x = INCOME2a) ) +
    geom_bar(stat="identity") +
    geom_text(aes(label=round(prevalence_GENHLTH2a,2)), vjust=1.6, color="white", size=3.5)+
    xlab('INCOME') +
    ylab('Prevalence Gen. Health (Fair/Poor)') +
    theme_bw() + 
#    theme_minimal() +
    theme(
        text = element_text(size = 15),
        axis.text.x = element_text(angle=45, hjust = 1)
    )

Distribution of Annual Income for residents in WI

In [52]:
brfss_wisc  %>%
    select(INCOME2a) %>%
    group_by(INCOME2a) %>% 
    summarize(n = n()) %>%
    ggplot( aes(y = n, x = INCOME2a) ) +
    geom_bar(stat="identity") +
    geom_text(aes(label=round(n/sum(n),2)), vjust=1.6, color = "white") +
    geom_text(aes(label=n), vjust=-.3) +
    xlab('INCOME') +
    ylab('Frequency') +
    theme_bw() + 
#    theme_minimal() +
    theme(
        text = element_text(size = 15),
        axis.text.x = element_text(angle=45, hjust = 1)
    )
In [ ]:

In [ ]:

In [ ]:

In [ ]:

In [ ]:

In [ ]:

In [ ]: