for the past few decades, the bureau has compiled and released public use microdata samples (pums) from each big decennial census. these are simply one-, five-, and ten-percent samples of the entire united states population. for the earlier censuses, although a microdata file containing five percent of the american population sounds better, the one-percent files might be more valuable for your analysis because fewer fields have to be suppressed or top-coded for respondent privacy (in compliance with title 13 of the u.s. code).
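a quick illustration of what top-coding means, in r - the wage variable and the $150,000 threshold here are hypothetical, since the actual thresholds vary by year and by field:

# four hypothetical wage values, two of them identifiably huge
incwage <- c( 23000 , 87000 , 450000 , 1200000 )

# top-coding caps every value at the threshold so outliers cannot be re-identified
pmin( incwage , 150000 )
# [1]  23000  87000 150000 150000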
if you're not sure what kind of census data you want, read the missouri census data center's description of what's available. these public use microdata samples are most useful for very specific analyses, but not for tiny areas (your neighborhood) - the smallest geography identifiable in the microdata, the public use microdata area (puma), contains at least 100,000 people. it'd be wise to review the bureau's at a glance page to decide whether you can just use one of the summary files that they've already constructed - why re-invent the table?
my syntax below only loads the one- and five-percent files from 1990 and 2000, as well as the ten-percent file from 2010. earlier releases can be obtained from the national archives, the university of minnesota's magical ipums, or the missouri state library's informative missouri census data center. note that the 2010 pums only contains a handful of variables. this isn't as much of a loss as it might sound: the 2010 census dropped the long form - historically, one-sixth of american households were instructed to answer a lot of questions while the other five-sixths of us just had to answer a few. starting in 2010, everyone just answers a few, and the more detailed questions are now asked of roughly one percent of the united states population annually (rather than decennially) with the spanking new american community survey. read this for more detail. kinda awesome. this new github repository contains three scripts:
download and import.R
- figure out the data structures of the 1990, 2000, and 2010 pums for both household and person files
- download, unzip, and import each file for every year and size specified by the user into monetdblite (a minimal sketch of this import step follows just below)
- create and save a merged/person-level design object to make weighted analysis commands a breeze
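curious what that import step looks like? a minimal sketch using the DBI and MonetDBLite packages - the table name and the stand-in data.frame below are hypothetical, since the real script reads each fixed-width file with column positions parsed from the census bureau's data dictionaries:

library( DBI )

# connect to an embedded monetdblite database stored in a local folder
db <- dbConnect( MonetDBLite::MonetDBLite() , "./MonetDB" )

# a tiny stand-in for one of the downloaded household files
household <- data.frame( serialno = 1:3 , hweight = c( 95 , 102 , 99 ) )

# copy the data.frame into the database as a table ( hypothetical table name )
dbWriteTable( db , "h2000_1pct" , household )

# confirm the rows arrived
dbGetQuery( db , "SELECT COUNT(*) FROM h2000_1pct" )

# politely shut down the embedded server
dbDisconnect( db , shutdown = TRUE )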
2000 analysis examples.R
- re-initiate the monetdblite server
- load the r data file (.rda) containing the weighted design object for the one-percent and five-percent files
- perform the standard repertoire of analysis examples, this time using sqlsurvey functions - sorry, no standard errors (see the sketch below)
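for a taste of those analysis examples, a minimal sketch - the .rda file name, the design object name, and the age column are hypothetical stand-ins for whatever the import script actually saved:

library( sqlsurvey )

# load the saved design object ( hypothetical file and object names )
load( "pums_2000_1pct_design.rda" )

# if the design's database connection has gone stale , re-open it first , e.g.
# pums.design <- open( pums.design , driver = MonetDBLite::MonetDBLite() )

# weighted mean of a hypothetical age column - note the absence of a
# standard error , since the pums ships without replicate weights
svymean( ~ age , pums.design )

# weighted population count , assuming the import script added a column
# of all ones ( another hypothetical )
svytotal( ~ one , pums.design )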
replicate control counts table.R
- re-initiate the monetdblite server
- query the y2k household and merged tables to match the census bureau's published control counts (sketched below)
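that check boils down to simple sql aggregation - a minimal sketch with hypothetical table and column names:

library( DBI )

# re-connect to the embedded database built by the download-and-import script
db <- dbConnect( MonetDBLite::MonetDBLite() , "./MonetDB" )

# summed person weights in the merged table should reproduce the
# census bureau's published population control counts
dbGetQuery( db , "SELECT SUM( pweight ) FROM m2000_1pct" )

# and summed household weights should match the published housing unit counts
dbGetQuery( db , "SELECT SUM( hweight ) FROM h2000_1pct" )

dbDisconnect( db , shutdown = TRUE )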
click here to view these three scripts
for more detail about the united states decennial census public use microdata sample, visit:
- the us census bureau's 1990, 2000, and 2010 census homepages.
- the american factfinder homepage, for all your online table creation needs
- the national archives, with identifiable data releases up to 1940. grandpa's confidentiality be damned!
notes:
analyzing trends between historical decennial censuses (would that be censii?) and the american community survey is legit. not only legit. encouraged. instead of waiting ten years to analyze long-form respondents, now you and i have access to a new data set every year. if you like this new design, thank a re-engineer.
so how might one calculate standard errors and confidence intervals in the pums? there isn't a good solution. ipums (again, whom i love dearly) has waved its wand and created this impressive strata variable for each of the historical pums data sets. in a previous post, i advocated simply doubling the standard errors, then calculating any critically-important ones by hand with the official formula (1990 here, 2000 there, and here's 2010). starting with the 2005 american community survey, replicate weights have been added and the survey data world has been at (relative) peace.
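for reference, those official formulas for the standard error of an estimated total generally take the form: design factor times the square root of ( 1/f - 1 ) times the estimate times ( 1 - estimate/population ), where f is the sampling fraction. here's a hedged r translation - verify the constants and look up the proper design factor in your year's technical documentation before trusting a single digit:

# generalized standard error of an estimated total , following the form of the
# formulas in the pums technical documentation ( check your year's version )
pums.se.total <-
	function( y.hat , pop.total , sampling.fraction , design.factor = 1 ){
		# ( 1 / f - 1 ) works out to 99 for a one-percent file and 19 for a five-percent file
		design.factor * sqrt( ( 1 / sampling.fraction - 1 ) * y.hat * ( 1 - y.hat / pop.total ) )
	}

# example: a five-percent-file estimate of two million people out of a national
# base of 280 million , with a placeholder design factor of one
pums.se.total( 2000000 , 280000000 , 0.05 )
# [1] 6142.36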
confidential to sas, spss, stata, and sudaan users: fred flintstone thinks you are old-fashioned. time to transition to r. :D