the sipp 2008 panel data started from a sample of 105,663 individuals in 42,030 households. once the sample has been drawn, the census bureau interviews one-fourth of the sample each month, so every respondent gets surveyed once every four months, over the four or five years of the panel (durations vary). you absolutely must read and understand pdf pages 3, 4, and 5 of this document before starting any analysis (start at the header 'waves and rotation groups'). if you don't comprehend what's going on, contact them.
since sipp collects information from respondents about every month over the duration of the panel, you'll need to be hyper-aware of whether you want your results to be point-in-time, annualized, or specific to some other period. the analysis scripts below provide examples of each. at every four-month interview point, each respondent answers every core question for the previous four months. after that, wave-specific addenda (called topical modules) get asked, but generally only about a single prior month. to repeat: core wave files contain four records per person per wave, topical modules contain one. if you stacked every core wave, you would have one record per person per month for the duration of the panel. mmmassive. ~100,000 respondents x 12 months x ~4 years ≈ five million records. have an analysis plan before you start writing code so you extract exactly what you need, nothing more. better yet, modify something of mine. cool? this new github repository contains eight, you read me, eight scripts:
1996 panel - download and create database.R
2001 panel - download and create database.R
2004 panel - download and create database.R
2008 panel - download and create database.R
- since some variables are character strings in one file and integers in another, initiate an r function to harmonize variable class inconsistencies in the sas importation scripts
- properly handle the parentheses seen in a few of the sas importation scripts, because the SAScii package currently does not
- create a monetdblite database, then initiate a variant of the `read.SAScii` function that imports the ascii data directly into that sql database
- download each microdata file - weights, topical modules, everything - then read 'em into sql (a rough sketch of this import pattern follows this list)
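here's a minimal sketch of the general pattern those four scripts follow - parse the sas importation script, force inconsistent columns to a common class, then feed the fixed-width ascii file into monetdblite in chunks so nothing big ever sits in ram. the file names, the table name `w1`, and the hard-coded character columns are placeholder assumptions, not the repository's exact choices.

```r
library(SAScii)      # parse.SAScii() recovers column positions from a census sas importation script
library(DBI)
library(MonetDBLite)

# connect to (or create) a monetdblite database stored in a local folder
db <- dbConnect( MonetDBLite::MonetDBLite() , "./sipp_monetdblite" )

# parse the (hypothetical, locally-saved) sas importation script
stru <- parse.SAScii( "l08puw1.sas" )

# harmonize class inconsistencies across files: force id columns that are character
# in some waves to be read as character here too, so stacked tables line up
stru$char[ stru$varname %in% c( "ssuid" , "shhadid" ) ] <- TRUE

# gap fields have no variable name and a negative width (read.fwf skips negative-width columns)
keep <- !is.na( stru$varname )

# read the fixed-width ascii file fifty thousand lines at a time,
# appending each chunk to a sql table so the whole file never has to fit in ram
con <- file( "l08puw1.dat" , "r" )     # hypothetical decompressed microdata file
repeat {

	lines <- readLines( con , n = 50000 )
	if ( length( lines ) == 0 ) break

	tc <- textConnection( lines )
	chunk <-
		read.fwf(
			tc ,
			widths = stru$width ,
			col.names = tolower( stru$varname[ keep ] ) ,
			colClasses = ifelse( stru$char[ keep ] , "character" , "numeric" )
		)
	close( tc )

	if ( dbExistsTable( db , "w1" ) ){
		dbWriteTable( db , "w1" , chunk , append = TRUE )
	} else {
		dbWriteTable( db , "w1" , chunk )
	}
}
close( con )
```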
2008 panel - full year analysis examples.R
- define which waves and specific variables to pull into ram, based on the year chosen
- loop through each of the twelve months, constructing a single-year temporary table inside the database
- read that twelve-month file into working memory, then save it for faster loading later if you like
- read the main and replicate weights columns into working memory too, merge everything
- construct a few annualized and demographic columns using all twelve months' worth of information
- construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half, again save it for faster loading later, only if you're so inclined
- reproduce census-published statistics - not precisely, due to topcoding described here on pdf page 19 (a sketch of this full-year workflow follows this list)
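a rough sketch of that flow, assuming the monetdblite database built by the download script and assuming core wave tables named w4 through w7 happen to cover calendar year 2010. the variable names (tpearn, wpfinwgt, repwgt1-repwgt120), the replicate weight table name, and the calendar-year weight handling are simplified placeholders - treat this as the shape of the workflow rather than the script itself.

```r
library(DBI)
library(MonetDBLite)
library(survey)

db <- dbConnect( MonetDBLite::MonetDBLite() , "./sipp_monetdblite" )

# decide up front exactly which columns get pulled into ram
core_vars <- c( "ssuid" , "epppnum" , "rhcalyr" , "rhcalmn" , "tpearn" , "wpfinwgt" )

# stack all calendar-year 2010 records from the relevant core waves into one table,
# keeping the heavy lifting inside the database
pulls <-
	paste(
		"SELECT" , paste( core_vars , collapse = ", " ) ,
		"FROM" , paste0( "w" , 4:7 ) ,
		"WHERE rhcalyr = 2010"
	)
dbExecute( db , paste( "CREATE TABLE y2010 AS" , paste( pulls , collapse = " UNION ALL " ) , "WITH DATA" ) )

# read that twelve-month table into working memory..
y2010 <- dbGetQuery( db , "SELECT * FROM y2010" )

# ..along with the (hypothetically named) calendar-year replicate weights, then merge
rw <- dbGetQuery( db , "SELECT * FROM cy2010_repwgts" )
y2010 <- merge( y2010 , rw , by = c( "ssuid" , "epppnum" ) )

# construct an annualized column using all twelve months' worth of information:
# each person's total earnings over the year
y2010$annual_earn <- ave( y2010$tpearn , y2010$ssuid , y2010$epppnum , FUN = sum )

# collapse to one record per person before building the annual design
y2010 <- y2010[ !duplicated( y2010[ c( "ssuid" , "epppnum" ) ] ) , ]

# replicate-weighted complex sample design with a fay's adjustment factor of one-half
y2010_design <-
	svrepdesign(
		data = y2010 ,
		weights = ~ wpfinwgt ,
		repweights = "repwgt[1-9]" ,
		type = "Fay" ,
		rho = 0.5
	)

# save the design object for faster loading in a future session, if you like
saveRDS( y2010_design , "y2010_design.rds" )
```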
2008 panel - point-in-time analysis examples.R
- define which wave(s) and specific variables to pull into ram, based on the calendar month chosen
- read that interview-point (srefmon) or calendar-month (rhcalmn) based file into working memory
- read the topical module and replicate weights files into working memory too, merge it like you mean it
- construct a few new, exciting variables using both core and topical module questions
- construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half
- reproduce census-published statistics - not exactly, because the authors of this brief used the generalized variance formula (gvf) to calculate the margin of error (see pdf page 4 for more detail), while the friendly statisticians at census recommend using the replicate weights whenever possible. oh hayy, now it is. (a sketch of this point-in-time workflow follows this list)
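a compact sketch of the point-in-time flow, again with placeholder names: a wave 6 core table w6, its topical module tm6, its replicate weights rw6, and a hypothetical household net worth column thhtnw. it is the same fay-adjusted design construction as above, just built from a single month of records.

```r
library(DBI)
library(MonetDBLite)
library(survey)

db <- dbConnect( MonetDBLite::MonetDBLite() , "./sipp_monetdblite" )

# interview-point pull: one record per person, the reference month closest to the interview..
w6 <- dbGetQuery( db , "SELECT ssuid , epppnum , tpearn , wpfinwgt FROM w6 WHERE srefmon = 4" )

# ..or, alternatively, a calendar-month pull: one record per person for, say, march 2010
# w6 <- dbGetQuery( db , "SELECT ssuid , epppnum , tpearn , wpfinwgt FROM w6 WHERE rhcalmn = 3 AND rhcalyr = 2010" )

# merge on the topical module and the replicate weights (one record per person in each)
tm6 <- dbGetQuery( db , "SELECT ssuid , epppnum , thhtnw FROM tm6" )
rw6 <- dbGetQuery( db , "SELECT * FROM rw6" )
x <- merge( w6 , tm6 , by = c( "ssuid" , "epppnum" ) )
x <- merge( x , rw6 , by = c( "ssuid" , "epppnum" ) )

# a new variable combining core and topical module questions:
# net worth per dollar of monthly earnings (purely illustrative)
x$nw_per_earn <- ifelse( x$tpearn > 0 , x$thhtnw / x$tpearn , NA )

# replicate-weighted design with a fay's adjustment factor of one-half
pit_design <-
	svrepdesign(
		data = x ,
		weights = ~ wpfinwgt ,
		repweights = "repwgt[1-9]" ,
		type = "Fay" ,
		rho = 0.5
	)

# a point-in-time estimate with replicate-weight (not gvf) standard errors
svymean( ~ tpearn , pit_design , na.rm = TRUE )
```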
2008 panel - median value of household assets.R
- define which wave(s) and specific variables to pull into ram, based on the topical module chosen
- read the topical module and replicate weights files into working memory too, merge once again
- construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half
- reproduce census-published statistics - not exactly, due to topcoding (read more about topcoding by searching this and that user guide for, well, `topcoding`). huh. so topcoding affects asset statistics. (a short sketch of the median computation follows this list)
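picking up the `pit_design` object from the point-in-time sketch above (with the hypothetical net worth column thhtnw merged in from the assets topical module), the median itself is a one-liner:

```r
# weighted median household net worth plus its replicate-weight standard error;
# with real sipp data this lands near, but not exactly on, the census figure because of topcoding
svyquantile( ~ thhtnw , pit_design , quantiles = 0.5 , na.rm = TRUE )
```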
replicate census poverty statistics.R
- load a single wave of data
- limit the table to the variables needed for an example analysis
- construct the complex sample survey object
- print statistics and standard errors matching the target replication table (a sketch of this flow follows this list)
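the same machinery applies to the poverty replication. a hedged sketch, assuming a single core wave table w1, its replicate weights rw1, and household income and poverty-threshold columns named thtotinc and rhpov - all placeholders for whatever the actual script uses.

```r
library(DBI)
library(MonetDBLite)
library(survey)

db <- dbConnect( MonetDBLite::MonetDBLite() , "./sipp_monetdblite" )

# load a single wave, limited to only the variables the example analysis needs
w <- dbGetQuery( db , "SELECT ssuid , epppnum , thtotinc , rhpov , wpfinwgt FROM w1 WHERE srefmon = 4" )
rw <- dbGetQuery( db , "SELECT * FROM rw1" )
w <- merge( w , rw , by = c( "ssuid" , "epppnum" ) )

# flag people living in households with income below the poverty threshold
w$in_poverty <- as.numeric( w$thtotinc < w$rhpov )

# construct the complex sample survey object
pov_design <- svrepdesign( data = w , weights = ~ wpfinwgt , repweights = "repwgt[1-9]" , type = "Fay" , rho = 0.5 )

# the poverty rate and its standard error, for comparison against the target replication table
svymean( ~ in_poverty , pov_design )
```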
click here to view these eight scripts
for more detail about the survey of income and program participation (sipp), visit:
- the reengineering of the survey of income and program participation, snore.
- the survey of income and program participation wikipedia entry
notes:
the survey of income and program participation is happening now, red-hot. everything you need is available, albeit somewhat hidden. there's a short introduction, an official ftp site - with codebooks - census publications based on sipp - don't miss the table packages - aww cool even questionnaires. the core variable codebook might not win any beauty pageants, but it'd be a wise use of time to slowly scroll through the first fifty variables. interviews take place in the month after `srefmon == 4`, and the actual calendar month and year of each record can be determined with the `rhcalmn` + `rhcalyr` variables.
perhaps more than any of the other data sets on this website, working with sipp will get more comfortable as you increase your ram. so long as you manipulate these files with sql commands inside the monetdblite database that my automated-download scripts create, you'll process these data line-by-line and therefore be untethered from any computer hardware limits. but the moment a dbReadTable or dbGetQuery command pulls something into working memory, you'll begin gobbling up those precious four, eight, or sixteen gigabytes on your local computer. in practice, this simply requires that you define the columns you need at the start, then limit what gets read in to only those variables. you'll see it done in my scripts. if you don't copy that strategy - fair warning - you may hit allocation errors. maybe keep the performance tab of your windows task manager handy and take out the trash.
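to make that concrete, here's a toy contrast of the two approaches, assuming the monetdblite database and a core wave table named w1 from the download script:

```r
library(DBI)
library(MonetDBLite)

db <- dbConnect( MonetDBLite::MonetDBLite() , "./sipp_monetdblite" )

# work like this whenever you can: the aggregation happens inside the database,
# line-by-line, so ram barely notices
dbGetQuery( db , "SELECT rhcalyr , COUNT(*) AS n FROM w1 GROUP BY rhcalyr" )

# avoid this: every column of every record gets pulled into working memory at once
# x <- dbReadTable( db , "w1" )

# do this instead: name the columns you need and filter the rows before anything leaves the database
x <- dbGetQuery( db , "SELECT ssuid , epppnum , tpearn , wpfinwgt FROM w1 WHERE srefmon = 4" )

# when you're done with a big object, take out the trash
rm( x ) ; gc()
```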
confidential to sas, spss, stata, and sudaan users: watch this. time to transition to r. :D