for the past few decades, the bureau has compiled and released public use microdata samples (pums) from each big decennial census. these are simply one-, five-, and ten-percent samples of the entire united states population. for the earlier censuses, although a microdata file containing five percent of the american population sounds better, the one-percent files might be more valuable for your analysis because fewer fields have to be suppressed or top-coded for respondent privacy (in compliance with title 13 of the u.s. code).
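a quick illustration of what top-coding means, in r - the wage variable and the $150,000 threshold here are hypothetical, since the actual thresholds vary by year and by field:

# four hypothetical wage values, two of them identifiably huge
incwage <- c( 23000 , 87000 , 450000 , 1200000 )

# top-coding caps every value at the threshold so outliers cannot be re-identified
pmin( incwage , 150000 )
# [1]  23000  87000 150000 150000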
if you're not sure what kind of census data you want, read the missouri census data center's description of what's available. these public use microdata samples are most useful for very specific analyses, but not for tiny areas (your neighborhood) - the smallest geography identifiable in the microdata, the public use microdata area (puma), contains at least 100,000 people. it'd be wise to review the bureau's at a glance page to decide whether you can just use one of the summary files that they've already constructed - why re-invent the table?
my syntax below only loads the one- and five-percent files from 1990 and 2000, as well as the ten-percent file from 2010. earlier releases can be obtained from the national archives, the university of minnesota's magical ipums, or the missouri state library's informative missouri census data center. note that the 2010 pums only contains a handful of variables. this isn't as much of a loss as it might sound: the 2010 census dropped the long form - historically, one-sixth of american households were instructed to answer a lot of questions while the other five-sixths of us just had to answer a few. starting in 2010, everyone just answers a few, and the more detailed questions are now asked of roughly one percent of the united states population annually (rather than decennially) with the spanking new american community survey. read this for more detail. kinda awesome. this new github repository contains three scripts:
download and import.R
- figure out the data structures of the 1990, 2000, and 2010 pums for both household and person files
- download, unzip, and import each file for every year and size specified by the user into monetdblite (a minimal sketch of this import step follows just below)
- create and save a merged/person-level design object to make weighted analysis commands a breeze
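curious what that import step looks like? a minimal sketch using the DBI and MonetDBLite packages - the table name and the stand-in data.frame below are hypothetical, since the real script reads each fixed-width file with column positions parsed from the census bureau's data dictionaries:

library( DBI )

# connect to an embedded monetdblite database stored in a local folder
db <- dbConnect( MonetDBLite::MonetDBLite() , "./MonetDB" )

# a tiny stand-in for one of the downloaded household files
household <- data.frame( serialno = 1:3 , hweight = c( 95 , 102 , 99 ) )

# copy the data.frame into the database as a table ( hypothetical table name )
dbWriteTable( db , "h2000_1pct" , household )

# confirm the rows arrived
dbGetQuery( db , "SELECT COUNT(*) FROM h2000_1pct" )

# politely shut down the embedded server
dbDisconnect( db , shutdown = TRUE )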
2000 analysis examples.R
- re-initiate the monetdblite server
- load the r data file (.rda) containing the weighted design object for the one-percent and five-percent files
- perform the standard repertoire of analysis examples, this time using sqlsurvey functions - sorry, no standard errors (see the sketch below)
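for a taste of those analysis examples, a minimal sketch - the .rda file name, the design object name, and the age column are hypothetical stand-ins for whatever the import script actually saved:

library( sqlsurvey )

# load the saved design object ( hypothetical file and object names )
load( "pums_2000_1pct_design.rda" )

# if the design's database connection has gone stale , re-open it first , e.g.
# pums.design <- open( pums.design , driver = MonetDBLite::MonetDBLite() )

# weighted mean of a hypothetical age column - note the absence of a
# standard error , since the pums ships without replicate weights
svymean( ~ age , pums.design )

# weighted population count , assuming the import script added a column
# of all ones ( another hypothetical )
svytotal( ~ one , pums.design )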
replicate control counts table.R
- re-initiate the monetdblite server
- query the y2k household and merged tables to match the census bureau's published control counts (sketched below)
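that check boils down to simple sql aggregation - a minimal sketch with hypothetical table and column names:

library( DBI )

# re-connect to the embedded database built by the download-and-import script
db <- dbConnect( MonetDBLite::MonetDBLite() , "./MonetDB" )

# summed person weights in the merged table should reproduce the
# census bureau's published population control counts
dbGetQuery( db , "SELECT SUM( pweight ) FROM m2000_1pct" )

# and summed household weights should match the published housing unit counts
dbGetQuery( db , "SELECT SUM( hweight ) FROM h2000_1pct" )

dbDisconnect( db , shutdown = TRUE )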
click here to view these three scripts
for more detail about the united states decennial census public use microdata sample, visit:
- the us census bureau's 1990, 2000, and 2010 census homepages.
- the american factfinder homepage, for all your online table creation needs
- the national archives, with identifiable data releases up to 1940. grandpa's confidentiality be damned!
notes:
analyzing trends between historical decennial censuses (would that be censii?) and the american community survey is legit. not only legit. encouraged. instead of waiting ten years to analyze long-form respondents, now you and i have access to a new data set every year. if you like this new design, thank a re-engineer.
so how might one calculate standard errors and confidence intervals in the pums? there isn't a good solution. ipums (again, whom i love dearly) has waved its wand and created this impressive strata variable for each of the historical pums data sets. in a previous post, i advocated simply doubling the standard errors, then calculating any critically-important ones by hand with the official formula (1990 here, 2000 there, and here's 2010). starting with the 2005 american community survey, replicate weights have been added and the survey data world has been at (relative) peace.
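for reference, those official formulas for the standard error of an estimated total generally take the form: design factor times the square root of ( 1/f - 1 ) times the estimate times ( 1 - estimate/population ), where f is the sampling fraction. here's a hedged r translation - verify the constants and look up the proper design factor in your year's technical documentation before trusting a single digit:

# generalized standard error of an estimated total , following the form of the
# formulas in the pums technical documentation ( check your year's version )
pums.se.total <-
	function( y.hat , pop.total , sampling.fraction , design.factor = 1 ){
		# ( 1 / f - 1 ) works out to 99 for a one-percent file and 19 for a five-percent file
		design.factor * sqrt( ( 1 / sampling.fraction - 1 ) * y.hat * ( 1 - y.hat / pop.total ) )
	}

# example: a five-percent-file estimate of two million people out of a national
# base of 280 million , with a placeholder design factor of one
pums.se.total( 2000000 , 280000000 , 0.05 )
# [1] 6142.36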
confidential to sas, spss, stata, and sudaan users: fred flintstone thinks you are old-fashioned. time to transition to r. :D