the sipp 2008 panel data started from a sample of 105,663 individuals in 42,030 households. once the sample has been drawn, the census bureau interviews one-fourth of the sample each month, so every respondent gets surveyed once every four months, over the four or five years of the panel (durations vary). you absolutely must read and understand pdf pages 3, 4, and 5 of this document before starting any analysis (start at the header 'waves and rotation groups'). if you don't comprehend what's going on, contact them.
since sipp collects information from respondents about every month over the duration of the panel, you'll need to be hyper-aware of whether you want your results to be point-in-time, annualized, or specific to some other period. the analysis scripts below provide examples of each. at every four-month interview point, each respondent answers every core question for the previous four months. after that, wave-specific addenda (called topical modules) get asked, but generally only about a single prior month. to repeat: core wave files contain four records per person per wave, topical modules contain one. if you stacked every core wave, you would have one record per person per month for the duration of the panel. mmmassive. ~100,000 respondents x 12 months x ~4 years ≈ five million records. have an analysis plan before you start writing code so you extract exactly what you need, nothing more. better yet, modify something of mine. cool? this new github repository contains eight, you read me, eight scripts:
1996 panel - download and create database.R
2001 panel - download and create database.R
2004 panel - download and create database.R
2008 panel - download and create database.R
- since some variables are character strings in one file and integers in another, initiate an r function to harmonize variable class inconsistencies in the sas importation scripts
- properly handle the parentheses seen in a few of the sas importation scripts, because the SAScii package currently does not
- create a monetdblite database, then initiate a variant of the `read.SAScii` function that imports the ascii data directly into that sql database
- download each microdata file - weights, topical modules, everything - then read 'em into sql (a rough sketch of this import pattern follows this list)
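here's a minimal sketch of the general pattern those four scripts follow - parse the sas importation script, force inconsistent columns to a common class, then feed the fixed-width ascii file into monetdblite in chunks so nothing big ever sits in ram. the file names, the table name `w1`, and the hard-coded character columns are placeholder assumptions, not the repository's exact choices.

```r
library(SAScii)      # parse.SAScii() recovers column positions from a census sas importation script
library(DBI)
library(MonetDBLite)

# connect to (or create) a monetdblite database stored in a local folder
db <- dbConnect( MonetDBLite::MonetDBLite() , "./sipp_monetdblite" )

# parse the (hypothetical, locally-saved) sas importation script
stru <- parse.SAScii( "l08puw1.sas" )

# harmonize class inconsistencies across files: force id columns that are character
# in some waves to be read as character here too, so stacked tables line up
stru$char[ stru$varname %in% c( "ssuid" , "shhadid" ) ] <- TRUE

# gap fields have no variable name and a negative width (read.fwf skips negative-width columns)
keep <- !is.na( stru$varname )

# read the fixed-width ascii file fifty thousand lines at a time,
# appending each chunk to a sql table so the whole file never has to fit in ram
con <- file( "l08puw1.dat" , "r" )     # hypothetical decompressed microdata file
repeat {

	lines <- readLines( con , n = 50000 )
	if ( length( lines ) == 0 ) break

	tc <- textConnection( lines )
	chunk <-
		read.fwf(
			tc ,
			widths = stru$width ,
			col.names = tolower( stru$varname[ keep ] ) ,
			colClasses = ifelse( stru$char[ keep ] , "character" , "numeric" )
		)
	close( tc )

	if ( dbExistsTable( db , "w1" ) ){
		dbWriteTable( db , "w1" , chunk , append = TRUE )
	} else {
		dbWriteTable( db , "w1" , chunk )
	}
}
close( con )
```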
2008 panel - full year analysis examples.R
- define which waves and specific variables to pull into ram, based on the year chosen
- loop through each of the twelve months, constructing a single-year temporary table inside the database
- read that twelve-month file into working memory, then save it for faster loading later if you like
- read the main and replicate weights columns into working memory too, merge everything
- construct a few annualized and demographic columns using all twelve months' worth of information
- construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half, again save it for faster loading later, only if you're so inclined
- reproduce census-published statistics - not precisely, due to topcoding described here on pdf page 19 (a sketch of this full-year workflow follows this list)
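a rough sketch of that flow, assuming the monetdblite database built by the download script and assuming core wave tables named w4 through w7 happen to cover calendar year 2010. the variable names (tpearn, wpfinwgt, repwgt1-repwgt120), the replicate weight table name, and the calendar-year weight handling are simplified placeholders - treat this as the shape of the workflow rather than the script itself.

```r
library(DBI)
library(MonetDBLite)
library(survey)

db <- dbConnect( MonetDBLite::MonetDBLite() , "./sipp_monetdblite" )

# decide up front exactly which columns get pulled into ram
core_vars <- c( "ssuid" , "epppnum" , "rhcalyr" , "rhcalmn" , "tpearn" , "wpfinwgt" )

# stack all calendar-year 2010 records from the relevant core waves into one table,
# keeping the heavy lifting inside the database
pulls <-
	paste(
		"SELECT" , paste( core_vars , collapse = ", " ) ,
		"FROM" , paste0( "w" , 4:7 ) ,
		"WHERE rhcalyr = 2010"
	)
dbExecute( db , paste( "CREATE TABLE y2010 AS" , paste( pulls , collapse = " UNION ALL " ) , "WITH DATA" ) )

# read that twelve-month table into working memory..
y2010 <- dbGetQuery( db , "SELECT * FROM y2010" )

# ..along with the (hypothetically named) calendar-year replicate weights, then merge
rw <- dbGetQuery( db , "SELECT * FROM cy2010_repwgts" )
y2010 <- merge( y2010 , rw , by = c( "ssuid" , "epppnum" ) )

# construct an annualized column using all twelve months' worth of information:
# each person's total earnings over the year
y2010$annual_earn <- ave( y2010$tpearn , y2010$ssuid , y2010$epppnum , FUN = sum )

# collapse to one record per person before building the annual design
y2010 <- y2010[ !duplicated( y2010[ c( "ssuid" , "epppnum" ) ] ) , ]

# replicate-weighted complex sample design with a fay's adjustment factor of one-half
y2010_design <-
	svrepdesign(
		data = y2010 ,
		weights = ~ wpfinwgt ,
		repweights = "repwgt[1-9]" ,
		type = "Fay" ,
		rho = 0.5
	)

# save the design object for faster loading in a future session, if you like
saveRDS( y2010_design , "y2010_design.rds" )
```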
2008 panel - point-in-time analysis examples.R
- define which wave(s) and specific variables to pull into ram, based on the calendar month chosen
- read that interview-point (srefmon) or calendar-month (rhcalmn) based file into working memory
- read the topical module and replicate weights files into working memory too, merge it like you mean it
- construct a few new, exciting variables using both core and topical module questions
- construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half
- reproduce census-published statistics - not exactly, because the authors of this brief used the generalized variance formula (gvf) to calculate the margin of error (see pdf page 4 for more detail), while the friendly statisticians at census recommend using the replicate weights whenever possible. oh hayy, now it is. (a sketch of this point-in-time workflow follows this list)
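a compact sketch of the point-in-time flow, again with placeholder names: a wave 6 core table w6, its topical module tm6, its replicate weights rw6, and a hypothetical household net worth column thhtnw. it is the same fay-adjusted design construction as above, just built from a single month of records.

```r
library(DBI)
library(MonetDBLite)
library(survey)

db <- dbConnect( MonetDBLite::MonetDBLite() , "./sipp_monetdblite" )

# interview-point pull: one record per person, the reference month closest to the interview..
w6 <- dbGetQuery( db , "SELECT ssuid , epppnum , tpearn , wpfinwgt FROM w6 WHERE srefmon = 4" )

# ..or, alternatively, a calendar-month pull: one record per person for, say, march 2010
# w6 <- dbGetQuery( db , "SELECT ssuid , epppnum , tpearn , wpfinwgt FROM w6 WHERE rhcalmn = 3 AND rhcalyr = 2010" )

# merge on the topical module and the replicate weights (one record per person in each)
tm6 <- dbGetQuery( db , "SELECT ssuid , epppnum , thhtnw FROM tm6" )
rw6 <- dbGetQuery( db , "SELECT * FROM rw6" )
x <- merge( w6 , tm6 , by = c( "ssuid" , "epppnum" ) )
x <- merge( x , rw6 , by = c( "ssuid" , "epppnum" ) )

# a new variable combining core and topical module questions:
# net worth per dollar of monthly earnings (purely illustrative)
x$nw_per_earn <- ifelse( x$tpearn > 0 , x$thhtnw / x$tpearn , NA )

# replicate-weighted design with a fay's adjustment factor of one-half
pit_design <-
	svrepdesign(
		data = x ,
		weights = ~ wpfinwgt ,
		repweights = "repwgt[1-9]" ,
		type = "Fay" ,
		rho = 0.5
	)

# a point-in-time estimate with replicate-weight (not gvf) standard errors
svymean( ~ tpearn , pit_design , na.rm = TRUE )
```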
2008 panel - median value of household assets.R
- define which wave(s) and specific variables to pull into ram, based on the topical module chosen
- read the topical module and replicate weights files into working memory too, merge once again
- construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half
- reproduce census-published statistics - not exactly, due to topcoding (read more about topcoding by searching this and that user guide for, well, `topcoding`). huh. so topcoding affects asset statistics. (a short sketch of the median computation follows this list)
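picking up the `pit_design` object from the point-in-time sketch above (with the hypothetical net worth column thhtnw merged in from the assets topical module), the median itself is a one-liner:

```r
# weighted median household net worth plus its replicate-weight standard error;
# with real sipp data this lands near, but not exactly on, the census figure because of topcoding
svyquantile( ~ thhtnw , pit_design , quantiles = 0.5 , na.rm = TRUE )
```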
replicate census poverty statistics.R
- load a single wave of data
- limit the table to the variables needed for an example analysis
- construct the complex sample survey object
- print statistics and standard errors matching the target replication table (a sketch of this flow follows this list)
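the same machinery applies to the poverty replication. a hedged sketch, assuming a single core wave table w1, its replicate weights rw1, and household income and poverty-threshold columns named thtotinc and rhpov - all placeholders for whatever the actual script uses.

```r
library(DBI)
library(MonetDBLite)
library(survey)

db <- dbConnect( MonetDBLite::MonetDBLite() , "./sipp_monetdblite" )

# load a single wave, limited to only the variables the example analysis needs
w <- dbGetQuery( db , "SELECT ssuid , epppnum , thtotinc , rhpov , wpfinwgt FROM w1 WHERE srefmon = 4" )
rw <- dbGetQuery( db , "SELECT * FROM rw1" )
w <- merge( w , rw , by = c( "ssuid" , "epppnum" ) )

# flag people living in households with income below the poverty threshold
w$in_poverty <- as.numeric( w$thtotinc < w$rhpov )

# construct the complex sample survey object
pov_design <- svrepdesign( data = w , weights = ~ wpfinwgt , repweights = "repwgt[1-9]" , type = "Fay" , rho = 0.5 )

# the poverty rate and its standard error, for comparison against the target replication table
svymean( ~ in_poverty , pov_design )
```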
click here to view these eight scripts
for more detail about the survey of income and program participation (sipp), visit:
- the reengineering of the survey of income and program participation, snore.
- the survey of income and program participation wikipedia entry
notes:
the survey of income and program participation is happening now, red-hot. everything you need is available, albeit somewhat hidden. there's a short introduction, an official ftp site - with codebooks - census publications based on sipp - don't miss the table packages - aww cool even questionnaires. the core variable codebook might not win any beauty pageants, but it'd be a wise use of time to slowly scroll through the first fifty variables. interviews take place in the month after `srefmon == 4`, and the actual calendar month and year of each record can be determined with the `rhcalmn` + `rhcalyr` variables.
perhaps more than any of the other data sets on this website, working with sipp will get more comfortable as you increase your ram. so long as you manipulate these files with sql commands inside the monetdblite database that my automated-download scripts create, you'll process these data line-by-line and therefore be untethered from any computer hardware limits. but the moment a dbReadTable or dbGetQuery command pulls something into working memory, you'll begin gobbling up those precious four, eight, or sixteen gigabytes on your local computer. in practice, this simply requires that you define the columns you need at the start, then limit what gets read in to only those variables. you'll see it done in my scripts. if you don't copy that strategy - fair warning - you may hit allocation errors. maybe keep the performance tab of your windows task manager handy and take out the trash.
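to make that concrete, here's a toy contrast of the two approaches, assuming the monetdblite database and a core wave table named w1 from the download script:

```r
library(DBI)
library(MonetDBLite)

db <- dbConnect( MonetDBLite::MonetDBLite() , "./sipp_monetdblite" )

# work like this whenever you can: the aggregation happens inside the database,
# line-by-line, so ram barely notices
dbGetQuery( db , "SELECT rhcalyr , COUNT(*) AS n FROM w1 GROUP BY rhcalyr" )

# avoid this: every column of every record gets pulled into working memory at once
# x <- dbReadTable( db , "w1" )

# do this instead: name the columns you need and filter the rows before anything leaves the database
x <- dbGetQuery( db , "SELECT ssuid , epppnum , tpearn , wpfinwgt FROM w1 WHERE srefmon = 4" )

# when you're done with a big object, take out the trash
rm( x ) ; gc()
```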
confidential to sas, spss, stata, and sudaan users: watch this. time to transition to r. :D