I’ve been mucking around in R recently trying to teach myself a few new tricks related to my job. In the course of Googling around, I found that Biostat Matt put together a slick script for converting the 2009 Area Resource File (ARF) from a SAS file into a format that R can consume.
The file is difficult to work with directly because of its size and format. Because the data are stored as text, numbers are stored inefficiently (after conversion and compression, the equivalent R data file is 10% of the original size). The usual saving grace of a text file is human readability. But although the file is ASCII, or rather extended ASCII (I found an accent in San Sebastiàn, PR), it's not human-readable: the 6256 fields aren't delimited and vary in width, so it's nearly impossible to see where one field ends and the next begins. The data are distributed with a SAS macro that reads them into a SAS dataset.
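To make the "no delimiters" problem concrete, here is a minimal C sketch of how a fixed-width record can be sliced once you know each field's start column and width. This is not Matt's script. The `struct field` layout and the helper names `copy_field` and `field_as_double` are my own, made up for illustration. The real column positions come from the SAS macro that ships with the data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout entry: a 1-based start column and a width, the way
   a SAS INPUT statement describes a field. Not real ARF positions. */
struct field {
    const char *name;
    size_t start;  /* 1-based column of the field's first character */
    size_t width;  /* number of characters in the field */
};

/* Copy one fixed-width field from record into out (NUL-terminated),
   trimming leading and trailing spaces. Returns 0 on success, or -1 if
   the field runs past the end of the record or out is too small. */
int copy_field(const char *record, const struct field *f,
               char *out, size_t outsize)
{
    assert(record != NULL && f != NULL && out != NULL);

    size_t len = strlen(record);
    if (f->start < 1 || f->start - 1 + f->width > len ||
        f->width + 1 > outsize)
        return -1;

    const char *begin = record + (f->start - 1);
    const char *end = begin + f->width;
    while (begin < end && *begin == ' ')
        begin++;
    while (end > begin && end[-1] == ' ')
        end--;

    size_t n = (size_t)(end - begin);
    memcpy(out, begin, n);
    out[n] = '\0';
    return 0;
}

/* Read a numeric field as a double. Returns 0 on success, or -1 if the
   field is missing, blank, or not entirely numeric. */
int field_as_double(const char *record, const struct field *f, double *value)
{
    char buf[64];
    char *endp;

    if (copy_field(record, f, buf, sizeof buf) != 0 || buf[0] == '\0')
        return -1;
    *value = strtod(buf, &endp);
    return *endp == '\0' ? 0 : -1;
}
```

With a table of `struct field` entries, reading a record is just a loop that calls `copy_field` or `field_as_double` once per entry. Storing the numbers as binary doubles rather than padded text is also where much of the size reduction mentioned above would come from.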
The ARF is a ridiculously comprehensive database of health care statistics for every county in the U.S. It is a huge dataset: the 2011-2012 version comprises 3231 observations of 6261 variables.
I updated Matt’s C script for the 2011-2012 ARF; the first part of my update is shown below. To use the script, you’ll have to download the full file from this gist. (I left out some important bits for the sake of readability on this page.)