archived asdfree: analyze the national vital statistics system (nvss) with r and monetdb

ever since the dawn of the internet, the centers for disease control and prevention (cdc) has maintained a big data archive called the national vital statistics system (nvss). the hardworking quants in hyattsville release two major annual microdata files: the first containing one record per birth in the united states, the second containing one record per death in the united states. if you want to calculate the probability that a thirty-six-year-old mother is going to have quadruplets or the number of sixty-three-year-olds who were struck dead by lightning, you have arrived.

but wait. there's more. in addition to the massive nationwide tables of every birth (natality) and death (mortality), the national center for health statistics (nchs) compiles annual microdata files with every fetal death over twenty weeks of gestation and not one but two infant mortality research files. in the previous paragraph, i think i mentioned that nchs publishes a big file with every birth and a separate big file with every death during the year. well what if you want to connect the birth records to death records? who-was-born-and-then-died, if you will. alright, so anyone who makes it past infancy (one year of life) cannot be linked; there's no way to merge the birth records in the 1970 microdata file with the forty-year-olds who died in the 2010 file. however, if you're only concerned with analyzing infant deaths, you've got two options:

1. a period-linked infant mortality data set: the numerator data table contains one record for every infant in the united states who died in a given year, regardless of whether that infant was born in that same year or in the year prior. the denominator data table contains one record for every infant born in the united states in the given year. so in the 2005 files, any infant who was born in late 2004 and then died before their first birthday but on or after january 1st of 2005 will be included in the numerator file but not in the denominator file. since it's mathematically screwy to have stuff in your numerator that's not also in your denominator, this period-linked method is imperfect. but the data collection and processing turnaround time are far faster than method #2 below, and the nationwide rates come close enough that it's used as the basis of almost every cdc infant mortality publication.
2. a cohort-linked infant mortality data set: the numerator data table contains one record for every infant in the united states who was born in a given year and then died before their first birthday, whether in that same year or in the following year. the denominator data table contains one record for every infant in the united states who was born in a given year. so this is a simpler population to talk about; it's just every infant born in the united states in a given year, plus a flag for the ones who died within 365.25 days of their dob.
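
to make the difference between the two numerators concrete, here's a toy sketch in base r. the six-row data.frame, its column names (birth_year and death_year), and the count of births are all made up for illustration; none of it reflects the actual nchs file layout.

```r
# toy data: one row per infant death, with hypothetical column names.
# nothing here matches the real nchs file layout.
deaths <- data.frame(
    birth_year = c( 2004 , 2004 , 2005 , 2005 , 2005 , 2005 ) ,
    death_year = c( 2004 , 2005 , 2005 , 2005 , 2006 , 2006 )
)

# made-up count of live births in 2005 (the denominator for both methods)
births_2005 <- 1000

# period-linked numerator: every infant death that occurred in 2005,
# whether the infant was born in 2004 or in 2005
period_numerator <- sum( deaths$death_year == 2005 )

# cohort-linked numerator: every infant death among babies born in 2005,
# whether the death occurred in 2005 or in 2006
cohort_numerator <- sum( deaths$birth_year == 2005 )

# infant mortality rates per 1,000 live births
period_rate <- 1000 * period_numerator / births_2005
cohort_rate <- 1000 * cohort_numerator / births_2005
```

the second row (born 2004, died 2005) lands in the period numerator even though that birth is not among the 2005 births. the last two rows (born 2005, died 2006) only show up in the cohort numerator, which is why the cohort file cannot be finished until the following year's deaths are in.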

as published, my scripts only go back to the year two thousand. it was quite a slog to blow the dust off of the ancient sas importation scripts over at the national bureau of economic research (nber) and get everything loaded into r's very own sascii package. gaps in the available nber syntax have to be filled in by manually combing through the file layouts stored deep inside the cdc's pdf user guides, so i erred on the side of not spending the rest of my life in fixed-width file-hell and put my foot down at the start of this century. if you know your way around an r console, extending my work back further in time requires more stamina than smarts. anyway, my work auto-loads everything into monetdb. so when you run a sql query, you get your answer faster than the speediest of gonzaleses. this new github repository contains three scripts:

download all microdata.R

create the batch (.bat) file needed to initiate the monet database in the future
download every zip file hosted by the cdc, unzip the contents to the working directory
loop through each dat file in the current working directory, importing each one into monetdb with read.SAScii.monetdb
create a well-documented block of code to re-initiate the monetdb server in the future
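
the loop-and-import step can be sketched with nothing but base r. this is only a rough stand-in: the file names, the three fields, and their widths are hypothetical, and the result stays in memory as data.frames instead of landing in monetdb. the real script gets its column positions from the sas layouts rather than hard-coding them like this.

```r
# build a scratch folder holding two tiny fixed-width files.
# the file names and contents are invented for this example.
td <- file.path( tempdir() , "nvss_toy" )
dir.create( td , showWarnings = FALSE )
writeLines( c( "2005136" , "2005229" ) , file.path( td , "natl2005.dat" ) )
writeLines( "2006131" , file.path( td , "natl2006.dat" ) )

# find every dat file in that folder
dat_files <- list.files( td , pattern = "\\.dat$" , full.names = TRUE )

# hypothetical layout: year (4 characters), sex (1), age of mother (2)
imported <-
    lapply(
        dat_files ,
        read.fwf ,
        widths = c( 4 , 1 , 2 ) ,
        col.names = c( "year" , "sex" , "mager" )
    )

# label each data.frame with the file it came from
names( imported ) <- basename( dat_files )
```

each element of `imported` is one data.frame per dat file. with files as large as the actual natality and mortality tables, reading everything into memory like this is exactly what the database import is meant to avoid.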

replicate cdc tabulations.R

initiate the same monetdb server instance, using the same well-documented block of code as above
replicate statistics found in the natality data tables provided by the cdc
replicate statistics found in the period-linked infant death data tables provided by the cdc
replicate statistics found in the fetal death data tables provided by the cdc
look at your watch, because that would not have been nearly as fast in any other statistical programming language

replicate age-adjusted death rate.R

connect to the mortality microdata stored in monetdb
initiate a data.frame of the census bureau's y2k (year 2000) age-stratified population counts
read.fwf in a y2.01k (year 2010) census bureau bridge file with one-record-per-year-of-birth-per-other-stuff
calculate the deaths-within-age-group from the startlingly hyperfast monetdb
merge a few tables, do a little algebra, match the published age-adjusted death rate
close up shop for the evening
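
the "merge a few tables, do a little algebra" step is easier to picture with numbers. below is a minimal sketch, assuming the usual direct-standardization recipe: compute a death rate within each age group, then weight those rates by a standard population's age distribution. the three age groups and every count are invented, and the data.frame and column names are hypothetical.

```r
# deaths by age group, as they might come back from a database query.
# all numbers are invented.
deaths <- data.frame(
    age_group = c( "0-39" , "40-64" , "65+" ) ,
    deaths = c( 100 , 500 , 2000 )
)

# population at risk in the same year, by age group (invented)
population <- data.frame(
    age_group = c( "0-39" , "40-64" , "65+" ) ,
    population = c( 100000 , 50000 , 25000 )
)

# standard population counts used as weights (invented)
standard <- data.frame(
    age_group = c( "0-39" , "40-64" , "65+" ) ,
    standard_pop = c( 600000 , 300000 , 100000 )
)

# merge the three tables on their shared age_group column
x <- merge( merge( deaths , population ) , standard )

# age-specific death rates
x$rate <- x$deaths / x$population

# share of the standard population in each age group
x$weight <- x$standard_pop / sum( x$standard_pop )

# age-adjusted death rate per 100,000: the weighted sum of age-specific rates
age_adjusted_rate <- 100000 * sum( x$rate * x$weight )

# crude death rate per 100,000, for comparison
crude_rate <- 100000 * sum( x$deaths ) / sum( x$population )
```

with these made-up inputs, the age-adjusted rate works out to 1,160 per 100,000 while the crude rate is about 1,486. the gap exists because the toy population skews older than the toy standard population, which is the distortion age adjustment is designed to remove.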

click here to view these three scripts

for more detail about the national vital statistics system (nvss) microdata, visit:

the page with the microdata
the official fact sheet detailing the purpose of it all

notes:

a few years ago, the de-identified data police must have won an internal dispute because the latest nvss files now lack geographic identifiers so you can't tally up anything by county of residence. if you're only interested in an analysis at the state- or highly-populated-county-level, take the cdc's wonder for a whirl. if you cannot tolerate even an ounce of data suppression in your results, beeline to the nearest research data center.

confidential to sas, spss, stata, and sudaan users: why are you huffing and puffing into that kazoo when you could be conducting the vienna philharmonic? time to transition to r. :D