before you dig into the public use microdata sample (pums), see if you can find the numbers you're looking for on american factfinder and be done with it. here's why: the census bureau has strict rules about respondent confidentiality, so it cannot disclose anything that makes it easy to isolate specific individuals or corporations. larger firms would be a cinch to pick out of microdata: just by limiting your data set to `state` equals "maryland" and `industry` equals "health care and social assistance", then sorting by the number of employees, you'd find the johns hopkins medical institutions in a complete dataset pretty fast. so the pums tosses out a bunch of larger companies (mostly publicly owned) that the bureau calls "not classifiable." to understand the damage, compare the first two rows of this table: the pums contains records representing 97% of all firms, but since larger businesses have been disproportionately tossed, firms included in the pums employ only 48% of all americans in the labor force and represent only 40% of all payroll and 36% of all commercial revenue nationwide.

one more caveat: instead of one-record-per-firm, the pums has one-record-per-firm-per-state-per-industry, so think of it as something between establishment- and firm-level. and no, you won't be able to aggregate those establishments (storefronts) up to the firm-level. lucky for you, the weights do sum up to all classifiable, non-publicly-held firms. maybe think of this file as a survey of smallish business owners. no more bad news.
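to see why a complete file would be a confidentiality problem, here's a minimal sketch of the filter-then-sort trick described above. it's python on a handful of made-up rows, not census data: the firm labels, the employee counts, and the `largest_firms` helper are all hypothetical.

```python
# made-up rows: (firm label, state, industry, employees). none of this is census data.
records = [
    ("firm a", "maryland", "health care and social assistance", 12),
    ("firm b", "maryland", "health care and social assistance", 30000),
    ("firm c", "maryland", "retail trade", 45),
    ("firm d", "virginia", "health care and social assistance", 800),
    ("firm e", "maryland", "health care and social assistance", 150),
]


def largest_firms(rows, state, industry, top=3):
    """filter to one state and industry, then sort by employees, biggest first."""
    subset = [r for r in rows if r[1] == state and r[2] == industry]
    return sorted(subset, key=lambda r: r[3], reverse=True)[:top]


ranked = largest_firms(records, "maryland", "health care and social assistance")
for name, _, _, employees in ranked:
    print(name, employees)
```

one giant employer floats straight to the top of the list, which is exactly the kind of record that gets withheld from the pums.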
there's just one technical document for the sbo pums, so read pdf pages one through five. please. after that, they start describing variance calculations that i've gone out of my way to automate for you. they recommend this never-before-seen hybrid complex sample design that just uses basic weights to calculate means, medians and totals, then a weirdly detached multiple imputation procedure for the standard errors. hakuna matata, i've written custom functions so you can focus on your research instead of translating their ancient greek. this new github repository contains three scripts:
download and import.R
- download the big n simple public use microdata file
- import the csv file directly into a monetdblite database
- add a few obvious columns, following the tech doc weighting scheme
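the three steps above can be sketched roughly like this. big assumptions: python's built-in `sqlite3` stands in for monetdblite, a tiny inline string stands in for the downloaded csv, and the column names (`tabwgt`, `one`) and values are made up for illustration. the real script is r and follows the tech doc's weighting scheme, which this sketch does not attempt to reproduce.

```python
import csv
import io
import sqlite3

# stand-in for the downloaded csv; column names and values are made up
csv_text = """state,industry,employment,tabwgt
24,62,12,3.5
24,44,45,2.0
51,62,800,1.0
"""

con = sqlite3.connect(":memory:")  # sqlite here, not monetdblite
con.execute("CREATE TABLE y (state TEXT, industry TEXT, employment REAL, tabwgt REAL)")

# read the csv and load every row into the main table `y`
reader = csv.DictReader(io.StringIO(csv_text))
rows = [
    (r["state"], r["industry"], float(r["employment"]), float(r["tabwgt"]))
    for r in reader
]
con.executemany("INSERT INTO y VALUES (?, ?, ?, ?)", rows)

# add a constant column, handy for weighted counts of firms (hypothetical choice)
con.execute("ALTER TABLE y ADD COLUMN one INTEGER")
con.execute("UPDATE y SET one = 1")
con.commit()

n_rows, weighted_firms = con.execute(
    "SELECT COUNT(*), SUM(one * tabwgt) FROM y"
).fetchone()
print(n_rows, weighted_firms)
```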
2007 single-year - analysis examples.R
- connect to that monetdblite database you've previously initiated
- copy the main table `y` over to a table `x` and a bunch of mini-tables `x1` through `x10` that it's cool if you screw around with, since you can always delete and re-create them from your pristine `y` table later
- set up the hybrid complex sample survey design and class it as something special
- generate the usual rigamarole of examples using code that will look familiar from other multiply-imputed survey data analyses
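for a feel of what "hybrid" means here, a minimal sketch: the point estimate comes from all records with their basic weights, while the standard error comes from how much ten subsample estimates disagree. heavy caveats apply. this is python rather than the repository's r, the data are made up, the function names are hypothetical, and the variance line is the textbook random-groups formula, which may not match the adjustments in the census bureau's tech doc.

```python
import math


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def hybrid_mean(values, weights, groups, k=10):
    """point estimate from all records; standard error from k random groups."""
    estimate = weighted_mean(values, weights)
    group_means = []
    for g in range(1, k + 1):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        group_means.append(
            weighted_mean([values[i] for i in idx], [weights[i] for i in idx])
        )
    # textbook random-groups variance: spread of the group estimates
    # around the full-sample estimate (an assumption, not the tech doc's formula)
    variance = sum((m - estimate) ** 2 for m in group_means) / (k * (k - 1))
    return estimate, math.sqrt(variance)


# made-up data: 20 records, two per random group
values = [float(v) for v in range(1, 21)]
weights = [1.0 + (i % 3) for i in range(20)]
groups = [(i % 10) + 1 for i in range(20)]

estimate, se = hybrid_mean(values, weights, groups)
print(round(estimate, 4), round(se, 4))
```

the design choice to notice: the ten mini-tables never touch the point estimate, they only feed the standard error.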
recode and replicate.R
- connect to that monetdblite database you've previously initiated
- copy over that main table `y` to `x` so you can make your mistakes on `x`
- implement the same variable constructions seen in this census bureau-provided sas code to determine which firms are at least fifty percent minority-owned, with sql. sql, sql, everywhere
- splinter your recoded table `x` into ten miniature `x1` through `x10` tables, then construct those same two complex sample design objects
- run just one svyby statement inside a multiple imputation combine function, and immediately generate every little statistic and standard error found in this census bureau-provided tabulation
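the copy, recode, and splinter steps above boil down to a few sql statements. here's a rough sketch with python's built-in `sqlite3` standing in for monetdblite. every column name (`firm_id`, `rg`, `pct_minority`, `minority_owned`) and every row is invented for illustration, and the one-line fifty percent rule is a simplification, not the census bureau's sas logic.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # sqlite standing in for monetdblite
con.execute("CREATE TABLE y (firm_id INTEGER, rg INTEGER, pct_minority REAL)")

# made-up rows: firm id, random group 1-10, percent owned by minority owners
con.executemany(
    "INSERT INTO y VALUES (?, ?, ?)",
    [(i, (i % 10) + 1, (i * 7) % 101) for i in range(1, 41)],
)

# make your mistakes on a copy so the pristine `y` stays untouched
con.execute("CREATE TABLE x AS SELECT * FROM y")

# recode with sql: flag firms that are at least fifty percent minority-owned
con.execute("ALTER TABLE x ADD COLUMN minority_owned INTEGER")
con.execute(
    "UPDATE x SET minority_owned = CASE WHEN pct_minority >= 50 THEN 1 ELSE 0 END"
)

# splinter the recoded table into x1 through x10, one per random group
for g in range(1, 11):
    con.execute(f"CREATE TABLE x{g} AS SELECT * FROM x WHERE rg = {g}")

sizes = [con.execute(f"SELECT COUNT(*) FROM x{g}").fetchone()[0] for g in range(1, 11)]
n_minority = con.execute("SELECT SUM(minority_owned) FROM x").fetchone()[0]
print(sizes, n_minority)
```

recoding before splintering matters: every mini-table inherits the new column, so the ten subsample estimates and the full-table estimate are all built from the same definition.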
click here to view these three scripts
for more detail about the survey of business owners, visit:
- this survey's methodology website
- the economic census homepage, for the whole shebang of business stats from the bureau
notes:
in addition to the pums (2007-only) and the 2002 and 2007 american factfinder data, the census bureau provides a battery of tables and reports back as far as 1992 that might have the statistic you seek. if you need something not shown, you could always open up your wallet and buy a custom data table.
confidential to sas, spss, stata, and sudaan users: your statistical language is merely an illusion, albeit a very persistent one. time to transition to r. :D