the initial release of the 2008 bsapufs was accompanied by some major fanfare in the world of health policy, a big win for government transparency. unfortunately, the final files that cleared the confidentiality hurdles are heavily de-identified and obfuscated. prime examples:
- none of the files can be linked to any other file. not across years, not across expenditure categories
- costs are rounded to the nearest five or ten dollars at lower values, nearest thousand at higher values
- ages are categorized into five-year bands
soapbox:
cms released free public data sets that could only be analyzed with a software package costing thousands of dollars. so even though the actual data sets were free, researchers still needed deep pockets to buy sas. meanwhile, the unsquelched and therefore superior data sets are also available for many thousands of dollars. researchers with funding would (reasonably) just buy the better data. researchers without any financial resources - the target audience of free, public data - were left out in the cold. no wonder these bsapufs haven't been used much.
that ends now. using r, monetdb, and the personal computer you already own (mine cost $700 in 2009), researchers can, for the first time, seriously analyze these medicare public use files without spending another dime. woah. plus hey guess what all you researcher fat-cats with your federal grant streams and your proprietary software licenses: r + monetdb runs one heckuva lot faster than sas. woah^2. dump your sas license water wings and learn how to swim. the scripts below require monetdb. click here for speed tests. vroom.
since the bsapufs comprise 5% of the medicare population, ya generally need to multiply any counts or sums by twenty. although the individuals represented in these claims are randomly sampled, this data should not be treated like a complex survey sample, meaning that the creation of a survey object is unnecessary. most bsapufs generalize to either the total or fee-for-service medicare population, but each file is different so give the documentation a hard stare before that eureka moment.
this new github repository contains three scripts:
2008 - download all csv files.R
- loop through and download every zip file hosted by cms
- unzip the contents of each zipped file to the working directory
2008 - import all csv files into monetdb.R
- set up the monet database so it can be initiated again in the future
- loop through each csv file in the current working directory and import them into the monet database
the replication script
- initiate the same monetdb server instance, using the same well-documented block of code as above
- replicate nine sets of statistics found in data tables provided by cms
click here to view these three scripts
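here's a minimal sketch of what the download-and-unzip steps in the first script might look like. it is not the actual script from the repository: the function name `download_and_unzip` is made up for this example, and the zip file urls are something you'd have to look up on the cms website and supply yourself.

```r
# hypothetical helper (not from the repository):
# download each zip file, then unzip its contents into a directory
download_and_unzip <- function( urls , destdir = getwd() ) {

	for ( u in urls ) {

		# download to a temporary file on the local disk
		tf <- tempfile( fileext = ".zip" )

		# mode = "wb" keeps the zip file from getting corrupted on windows
		download.file( u , tf , mode = "wb" )

		# unzip the contents into the destination directory
		unzip( tf , exdir = destdir )

		# remove the temporary zip file
		file.remove( tf )
	}

	# return the names of the csv files now sitting in the destination directory
	list.files( destdir , pattern = "\\.csv$" , ignore.case = TRUE )
}

# usage, with a placeholder url that you would swap out for the real ones:
# download_and_unzip( "https://example.com/some_bsapuf_file.zip" )
```

returning the csv file names at the end is just a convenience of this sketch, since the import step needs to loop through exactly those files.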
for more detail about the basic stand alone medicare claims public use files (bsapufs), visit:
- the centers for medicare and medicaid services' bsapuf homepage
- a joint academyhealth webinar given by the organizations that partnered to create these files - cms, impaq, norc
notes:
the replication script has oodles of easily-modified syntax and should be viewed for analysis examples. just run sql queries - sas users, that's "proc sql;" for you. never used sql? start fresh with this tutorial. once you know the sql command you want to run on the data, you're almost done. for operations that make changes to the data tables, use dbSendQuery(). for operations that only read the data tables, use dbGetQuery().
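as a rough sketch of that read-versus-write split: the table name `inpatient08`, the column names `sex_cd` and `one`, and the wrapper function `run_sql` are all hypothetical, invented for this example. `db` is assumed to be an already-open DBI connection to the monet database (this snippet does not create one), and the DBI package is assumed to be installed. the read query also multiplies the count by twenty, since the files are a 5% sample.

```r
# hypothetical table and column names, for illustration only

# a query that only reads: count the sampled records in each category,
# then multiply by twenty to scale the 5% sample up to the full population
read_sql <- "select sex_cd , count(*) * 20 as n from inpatient08 group by sex_cd"

# a query that changes a table: add a new column
write_sql <- "alter table inpatient08 add column one int"

# hypothetical wrapper that picks the DBI function based on the type of operation
run_sql <- function( db , sql , changes_data = FALSE ) {

	if ( changes_data ) {

		# operations that change the data tables go through dbSendQuery()
		res <- DBI::dbSendQuery( db , sql )

		# clearing the result afterward is a tidy-up choice made by this sketch
		DBI::dbClearResult( res )

		invisible( TRUE )

	} else {

		# operations that only read come back as a data.frame from dbGetQuery()
		DBI::dbGetQuery( db , sql )
	}
}

# usage, once a connection `db` exists:
# run_sql( db , write_sql , changes_data = TRUE )
# run_sql( db , read_sql )
```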
don't ever use dbReadTable() on the outpatient, carrier, dme, or prescription drug event tables - they'll likely cause r to crash.
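a safer pattern, sketched below, is to let the database do the heavy lifting and only pull small results into r. the table name `carrier08` is hypothetical, and `db` is again assumed to be an open DBI connection that this snippet does not create.

```r
# hypothetical table name, for illustration only
big_table <- "carrier08"

# aggregate inside the database, so only a single row comes back to r
count_sql <- paste( "select count(*) as n from" , big_table )

# or peek at a handful of records instead of reading the whole table
peek_sql <- paste( "select * from" , big_table , "limit 10" )

# with an open connection `db` (assumed, not created here):
# DBI::dbGetQuery( db , count_sql )
# DBI::dbGetQuery( db , peek_sql )
```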
confidential to sas, spss, stata, and sudaan users: why are you using software that's twenty years shy of medicare eligibility itself? time to transition to r. :D