analyze the program for international student assessment (pisa) with r and monetdb

the authoritative source for evaluating educational achievement across nations, the program(me) for international student assessment ranks the math, science, and reading skills of 15-year-olds in more than sixty countries.  coordinated by the organisation for economic co-operation and development (oecd) and released every three years, this data set gives finland reason to gloat and anti-poverty advocates in the united states reason to fight.  participating countries must sample at least 5,000 teenagers, though some governments survey many more in order to provide education researchers with enough of a sample to perform within-country comparisons.  in the world of cross-border standardized testing, this is the big momma.

to understand what's possible with pisa, either visit the international products page or - if you only care about one country - start on the participating economies page and click through to the country-specific website (so here's america's).

instead of processing the pisa microdata line-by-line, the r language stoically attempts to read everything into memory at once.  to avoid the unpleasantness of a seized-up computer, i tweaked, pruned, and manicured the sqlsurvey code to work on multiply-imputed big survey data.  if you're already familiar with the syntax used for the survey package, be patient and read my examples carefully when something doesn't behave as you expect it to.  gimme some good news: sqlsurvey uses ultra-fast monetdb (click here for speed tests).  monetdb imports, writes, and recodes data slowly, but reads it hyper-fast.  a magnificent trade-off: data exploration typically requires you to think, send an analysis command, think some more, send another query, repeat.  importation scripts (especially the ones i've already written for you) can be left running overnight sans hand-holding.

pisa is a pita to analyze, because it's both multiply-imputed (like the survey of consumer finances) and big data (like the american community survey).  to help researchers deal with that complexity, the twentieth-century-dwelling statisticians at oecd wrote sas macros and spss functions as part of their analysis manual.  well guess what?  those languages are prohibitively expensive, so i've done gone and translated everything over to the r language, precisely reproducing their published results, then automating the download and importation into everybody's favorite monetdb.  say buh-bye to buying proprietary statistical software.  this new github repository contains four scripts:


download import and design.R
  • create the batch (.bat) file needed to initiate the monet database in the future
  • download, unzip, and import each file for every year and size specified by the user
  • split all `plausible value` variables into five, yeah, five tables to account for the uncertainty of imputed responses
  • create a well-documented block of code to re-initiate the monetdb server in the future
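
here's a minimal sketch of what that plausible-value split looks like, written in base r on a made-up data frame.  the column names follow pisa's naming pattern (`pv1math` through `pv5math`, plus the `w_fstuwt` student weight), but the toy data, the `student_id` column, and the `split_plausible_values` helper are assumptions for illustration only.  the actual script does this work inside monetdb, not in memory.

```r
# toy stand-in for a pisa student file: an id, a weight, and five plausible values
# (simulated numbers, not real pisa records)
set.seed(1)
students <- data.frame(
  student_id = 1:6,
  w_fstuwt   = runif(6, 50, 150),
  pv1math    = rnorm(6, 500, 100),
  pv2math    = rnorm(6, 500, 100),
  pv3math    = rnorm(6, 500, 100),
  pv4math    = rnorm(6, 500, 100),
  pv5math    = rnorm(6, 500, 100)
)

# hypothetical helper: return a list of data frames, one per plausible value,
# each keeping the non-pv columns and renaming pv#math to plain `math`
split_plausible_values <- function(df, n_pv = 5) {
  pv_cols    <- grep("^pv[0-9]+", names(df), value = TRUE)
  other_cols <- setdiff(names(df), pv_cols)
  lapply(seq_len(n_pv), function(i) {
    # columns belonging to the i-th plausible value only
    this_pv <- grep(paste0("^pv", i, "[a-z]"), pv_cols, value = TRUE)
    out <- df[, c(other_cols, this_pv), drop = FALSE]
    # strip the pv# prefix so every table has identical column names
    names(out) <- sub(paste0("^pv", i), "", names(out))
    out
  })
}

five_tables <- split_plausible_values(students)

# every table now has the same layout, so one analysis command runs on all five
sapply(five_tables, function(tbl) weighted.mean(tbl$math, tbl$w_fstuwt))
```

giving all five tables identical column names is the point: the same analysis command can be looped over each one, and the five answers get pooled afterward.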

analysis examples.R
  • run the well-documented block of code to re-initiate the monetdb server
  • load the r data file (.rda) containing all five, yup, five replicate weight designs for the 2012 file
  • detour and coerce a numeric variable to categorical, then match some compendium statistics in the ict file
  • perform the standard repertoire of analysis examples, using a jolly mix of sqlsurvey and custom functions
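
for a feel of what those custom functions have to do with five plausible values, here's a rough base r sketch of the usual multiple-imputation pooling (rubin's rules): average the five estimates, then add the average sampling variance to the inflated between-imputation variance.  the `combine_plausible_values` helper and the input numbers are made up for illustration.  this is not the code in the script and not an oecd result.

```r
# hypothetical helper: pool m point estimates and their sampling variances
combine_plausible_values <- function(estimates, sampling_variances) {
  m <- length(estimates)
  stopifnot(m > 1, length(sampling_variances) == m)

  final_estimate   <- mean(estimates)            # average of the m estimates
  within_variance  <- mean(sampling_variances)   # average sampling variance
  between_variance <- var(estimates)             # spread across imputations, divides by m - 1

  # total variance = within + (1 + 1/m) * between
  total_variance <- within_variance + (1 + 1 / m) * between_variance

  c(estimate = final_estimate, se = sqrt(total_variance))
}

# made-up inputs: five means (one per plausible value) and five sampling variances
pv_means <- c(500, 502, 498, 501, 499)
pv_vars  <- c(4, 4, 4, 4, 4)

pooled <- combine_plausible_values(pv_means, pv_vars)
pooled
```

skip the between-imputation piece and the standard error comes out too small, which is exactly the mistake you make by analyzing only the first plausible value.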

extract specific countries.R
  • run the well-documented block of code to re-initiate the monetdb server 
  • subset the 2009 student-interviews file to only the nation of brazil, read those records into working memory
  • save and then re-load a multiply-imputed brazil-only survey design object that no longer requires monetdb
  • match a brazilian statistic and standard error in the oecd's official technical documentation

replicate oecd publications.R
  • run the well-documented block of code to re-initiate the monetdb server
  • load the r data file (.rda) containing the five, yay, five designs for the 2009 file
  • match every type of statistic in the oecd's official technical documentation
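
matching a published standard error means getting the sampling variance right for each of those designs.  as i understand the oecd's method, pisa ships 80 replicate weights built with balanced repeated replication and a fay adjustment of 0.5, so the sketch below shows that arithmetic for a weighted mean in base r.  the `fay_brr_mean` helper, the simulated scores, and the random toy replicate weights are all assumptions for illustration.  real replicate weights come prebuilt in the pisa files, and the script itself relies on survey design objects rather than this hand-rolled function.

```r
# hypothetical helper: weighted mean plus its fay-adjusted brr standard error
fay_brr_mean <- function(y, full_weight, replicate_weights, fay_k = 0.5) {
  full_estimate <- weighted.mean(y, full_weight)

  # re-compute the statistic once per replicate weight column
  replicate_estimates <- apply(replicate_weights, 2, function(w) weighted.mean(y, w))

  g <- ncol(replicate_weights)

  # sum of squared deviations from the full-sample estimate, scaled by g * (1 - k)^2
  sampling_variance <- sum((replicate_estimates - full_estimate)^2) / (g * (1 - fay_k)^2)

  c(estimate = full_estimate, se = sqrt(sampling_variance))
}

# simulated scores and weights, not real pisa data
set.seed(2012)
n <- 200
y <- rnorm(n, 500, 100)
full_weight <- runif(n, 50, 150)

# toy replicates: each column scales every weight by 0.5 or 1.5 at random
# (not truly balanced, just enough to exercise the formula)
multipliers <- matrix(sample(c(0.5, 1.5), n * 80, replace = TRUE), nrow = n, ncol = 80)
replicate_weights <- full_weight * multipliers

result <- fay_brr_mean(y, full_weight, replicate_weights)
result
```

with 80 replicates and a fay factor of 0.5, the scaling term works out to one-twentieth of the summed squared deviations.  run this once per plausible value, and the five sampling variances feed into the multiple-imputation pooling step.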



click here to view these four scripts



for more detail about the program for international student assessment (pisa), visit:

if you're just looking for a couple of data points, you ought to give the australian council for educational research's interactive data selection tools a spin.  it's a menu-driven table creator, so it's easy to use but inflexible.

you wouldn't be analyzing the program for international student assessment right now without the work of not one but two dr. thomas lumleys (or, in latin, lumlii).  if you decide to hand-write a thank-you letter for all of their hard work using jefferson's polygraph, you won't even need to switch out the paper to fill in specific names.  just another example of the unparalleled efficiencies you'll find when working in the r language with monetdb.

confidential to sas, spss, stata, and sudaan users: you are kissing the wrong frogs.  time to transition to r.  :D