although the census bureau employs the survey administrators and produces the main how-to documents (both faq and overviews), city government actually pays the bill and gets the glory: the preliminary 2011 report with all the fun facts and the older but more complete 2008 report.

the microdata include four exciting files: a person-level file for occupied units, a household-level file for occupied units, a household-level file for vacant units, and a household-level file for units that didn't yield an interview (solely for adjusting the vacant-unit statistics). most urban planning and policy wonks line up the occupied and vacant household-level files to calculate a vacancy rate (a quick sketch of that arithmetic appears below the script list), but depending on your mission, you might need some person-level action as well. by the way, the nyc.gov report is six months older than the latest 2011 microdata, so don't panic if your stats are off by a whisker.

this new github repository contains three scripts:
download all microdata.R
- download, import, and save each of the four data files into a single year-specific .rda file, for every year back to 2002 (the basic pattern is sketched just below)
- bumper sticker idea for nychvs data users: if you can read this, thank a furman center for the sas import scripts.
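for a feel of what this script does under the hood, here's a minimal sketch of the download-import-save pattern, with a made-up file location and made-up column widths - the real script loops over every year and pulls the actual layouts from the furman center's sas import scripts:

```r
# hypothetical location of one ascii data file - not a real url
one.file.url <- "http://www.example.gov/nychvs/occ_11.dat"

# download the raw ascii file to a temporary location
tf <- tempfile()
download.file( one.file.url , tf , mode = "wb" )

# import with placeholder column widths and names -
# the real widths come from the sas import scripts
occ <-
	read.fwf(
		tf ,
		widths = c( 2 , 5 , 8 ) ,
		col.names = c( "borough" , "hhweight" , "rent" )
	)

# save the imported table into a year-specific .rda file
save( occ , file = "nychvs 2011 occupied.rda" )
```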
2011 analysis examples.R
- load all available tables for a single year of data
- construct the complex sample survey object - only approximately, though, since the `segment` variable needed for exact error terms isn't released (see note below)
- run example analyses that calculate means, medians, quantiles, and totals - the point estimates come out exactly right; only the error terms need the adjustment described in the note below (a quick sketch follows this list)
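as a rough idea of what those example analyses look like, here's a minimal sketch using the survey package, assuming the occupied household-level table has already been loaded as a data.frame `occ` with a household weight column `hhweight` and placeholder analysis variables `rent` and `borough` (match these names to whatever the import script actually produces). since the `segment` variable isn't available, the design below uses the weights alone:

```r
library(survey)

# weights-only design - no clustering variable available (see note below)
nychvs.design <- svydesign( ids = ~ 1 , weights = ~ hhweight , data = occ )

# weighted mean of a numeric variable
svymean( ~ rent , nychvs.design , na.rm = TRUE )

# weighted median (or any other quantile)
svyquantile( ~ rent , nychvs.design , quantiles = 0.5 , na.rm = TRUE )

# weighted total of occupied units, broken out by a categorical variable
svytotal( ~ factor( borough ) , nychvs.design )
```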
replicate contract items 2008.R
- load all available tables for a single year of data
- construct the complex sample survey object, but it's fake - see note below.
- thoroughly explain a back-of-the-envelope calculation for standard errors, confidence intervals, and variances
- print statistics that exactly match - and confidence intervals more conservative than - the target replication table
click here to view these three scripts
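and since the vacancy-rate arithmetic mentioned up top is really just weighted division, here's a minimal sketch, assuming the occupied and vacant household-level tables have been loaded as data.frames `occ` and `vac`, each with a household weight column `hhweight` (placeholder names again). note this is a gross vacancy rate - the report's net rental vacancy rate restricts which occupied and vacant units get counted, so check the official definitions before quoting a number:

```r
# weighted count of occupied units
occupied.units <- sum( occ$hhweight )

# weighted count of vacant units
vacant.units <- sum( vac$hhweight )

# vacancy rate: vacant units as a share of all units
vacant.units / ( occupied.units + vacant.units )
```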
for more detail about the new york city housing and vacancy survey, visit:
notes:
hint for statistical illiterates: if the data point you're looking for isn't in the nyc.gov grand report, check the census bureau's copious online tables too.
as described in detail in the comments of the replication script, it's impossible to exactly match the census-published confidence intervals. here's one snippet of a longer conversation about why users cannot automate the computation of standard errors with the nychvs (discussed in footnote five). the `segment` variable (mentioned in the e-mail) does not get released due to confidentiality concerns, so your options are to either calculate the errors by hand with the infuriating generalized variance formula recommended in each year's source and accuracy statement (2008, 2011) or to use the back-of-the-envelope method i invented, which conservatively approximates the census-published confidence intervals.

when i learned that users couldn't automate the matching of census-published numbers, i tried to be a bootstrapping young lad and come up with some fancy standard error computation methodology. but it turns out that multiplying the un-adjusted errors by two gets as close to the right answer as anything else. if you're writing the final draft of a research product destined to get heavy exposure, you might have to calculate confidence intervals by hand or pay the census bureau for a custom run. but for those of us who can live with an occasional false negative in our lives, try it my way.
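here's what that back-of-the-envelope adjustment amounts to in code - a minimal sketch, assuming the weights-only `nychvs.design` object and the placeholder `rent` variable from the analysis sketch above:

```r
library(survey)

# point estimate and un-adjusted standard error from the weights-only design
est <- svymean( ~ rent , nychvs.design , na.rm = TRUE )
point.estimate <- coef( est )
unadjusted.se <- SE( est )

# multiply the un-adjusted error by two to conservatively approximate
# the census-published error terms
adjusted.se <- 2 * unadjusted.se

# conservative 95% confidence interval
c( point.estimate - 1.96 * adjusted.se , point.estimate + 1.96 * adjusted.se )
```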
confidential to sas, spss, stata, and sudaan users: i look at you the way new yorkers look at jersey. time to transition to r. :D