analyze the behavioral risk factor surveillance system (brfss) with r and monetdb

the behavioral risk factor surveillance system (brfss) aggregates behavioral health data from 400,000 adults via telephone every year.  it's um *clears throat* the largest telephone survey in the world and it's gotta lotta uses (here's a list, neato).  state health departments perform the actual data collection (according to a nationally-standardized protocol and a core set of questions), then forward all responses to the centers for disease control and prevention (cdc) office of surveillance, epidemiology, and laboratory services (osels) where the nationwide, annual data set gets constructed.  independent administration by each state allows them to tack on their own questions that other states might not care about.  that way, florida could exempt itself from all the risky frostbite behavior questions.  in addition to providing the most comprehensive behavioral health data set in the united states, brfss also ekes out my worst-acronym-in-the-federal-government award, with onchit a close second.

annual brfss data sets have grown rapidly over the past quarter-century: the 1984 data set contained only 12,258 respondents from 15 states, all states were participating by 1994, and the 2011 file has surpassed half a million interviews.  if you're examining trends over time, do your homework and review the brfss technical documents for the years you're looking at (plus any years in between).  what might you find?  well for starters, the cdc added cellphones to the sample in its 2011 methodology.

unlike many u.s. government surveys, brfss is not conducted for each resident at a sampled household (phone number).  only one respondent per phone number gets interviewed.  did i miss anything?  well if your next question is frequently asked, you're in luck.

all brfss files are available in sas transport format so if you're sittin' pretty on 16 gb of ram, you could potentially read.xport a single year and create a taylor-series survey object using the survey package.  cool.  but hear me out:  the download and importation script builds an ultra-fast monet database (click here for speed tests, installation instructions) on your local hard drive.  after that, these scripts are shovel-ready.  consider importing all brfss files my way - let it run overnight - and during your actual analyses, code will run a lot faster.  the brfss generalizes to the u.s. adult (18+) (non-institutionalized) population, but if you don't have a phone, you're probably out of scope.  this new github repository contains three scripts:

download all microdata.R
  • initiate the monetdblite database
  • download, unzip, and import each year specified by the user
  • create and save the taylor-series linearization complex sample designs
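the last step above can be sketched roughly like this.  it's a minimal illustration, not the actual download script: it swaps in a tiny made-up data frame for the monetdb-backed table, and the column names `xpsu`, `xststr`, and `xllcpwt` are assumed stand-ins for the brfss cluster, stratum, and weight variables.  check the codebook for the year you're working with before trusting any of these names.

```r
library(survey)

set.seed(1)

# made-up records for illustration only, not real brfss responses
toy <- data.frame(
  xpsu    = 1:200,                                          # assumed cluster id column
  xststr  = rep(c(11011, 11012, 12011, 12012), each = 50),  # assumed stratum column
  xllcpwt = runif(200, 100, 1000),                          # assumed final weight column
  genhlth = sample(1:5, 200, replace = TRUE)                # hypothetical analysis variable
)

# taylor-series linearization design: clusters nested within strata, plus weights
brfss_design <- svydesign(
  id      = ~xpsu,
  strata  = ~xststr,
  weights = ~xllcpwt,
  data    = toy,
  nest    = TRUE
)

# the weights should sum to the population the toy sample stands in for
sum(weights(brfss_design))

# save the design so a later session can load() it instead of rebuilding it
design_file <- file.path(tempdir(), "toy brfss design.rda")
save(brfss_design, file = design_file)
```

saving the design object to an .rda file is what lets the analysis scripts pick up right where the import left off.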

2011 single-year - analysis examples.R
  • run the well-documented block of code to re-initiate the monetdb server
  • load the r data file (.rda) containing the taylor-series linearization design for the single-year 2011 file
  • perform the standard repertoire of analysis examples
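for a flavor of what a "standard repertoire" looks like with the survey package, here's a hedged sketch on another made-up data frame.  the variables `sex`, `age`, and `smoker` are hypothetical, as are the design column names, and the real script runs against the monetdb-backed 2011 design rather than anything in memory.

```r
library(survey)

set.seed(2)

# made-up records for illustration only, not real brfss responses
toy <- data.frame(
  xpsu    = 1:300,                          # assumed cluster id column
  xststr  = rep(1:3, each = 100),           # assumed stratum column
  xllcpwt = runif(300, 100, 1000),          # assumed final weight column
  sex     = factor(sample(c("female", "male"), 300, replace = TRUE)),
  age     = sample(18:90, 300, replace = TRUE),
  smoker  = rbinom(300, 1, 0.2)
)

toy_design <- svydesign(
  id      = ~xpsu,
  strata  = ~xststr,
  weights = ~xllcpwt,
  data    = toy,
  nest    = TRUE
)

# weighted mean of a continuous variable, with a linearized standard error
mean_age <- svymean(~age, toy_design)

# weighted share of smokers, broken out by sex
smoke_by_sex <- svyby(~smoker, ~sex, toy_design, svymean)

# weighted population counts in each category
pop_by_sex <- svytotal(~sex, toy_design)

mean_age
smoke_by_sex
pop_by_sex
```

the same three calls (`svymean`, `svyby`, `svytotal`) cover most of the descriptive statistics people pull from a survey like this one.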

replicate cdc weat - 2010.R
  • run the well-documented block of code to re-initiate the monetdb server
  • load the r data file (.rda) containing the taylor-series linearization design for the single-year 2010 file
  • replicate statistics from this table, pulled from the cdc's web-enabled analysis tool

click here to view these three scripts

for more detail about the behavioral risk factor surveillance system, visit:
  • the centers for disease control and prevention behavioral risk factor surveillance system homepage
  • the behavioral risk factor surveillance system wikipedia entry

notes:

if you're just scroungin' around for a few statistics, the cdc's web-enabled analysis tool (weat) might be all your heart desires.  in fact, on slides seven, eight, and nine of my online query tools video, i demonstrate how to use this table creator.  weat's more advanced than most web-based survey analysis tools: you can even run a regression.  but only seven (of eighteen) years can currently be queried online.


confidential to sas, spss, stata, sudaan users: when statistical languages are plotted on cartesian coordinates, what-you-paid-for vs. what-you-get are best represented as y = 1/x.  time to transition to r.  :D