Monday 2 February 2015

Reading SAS into R

The sas7bdat package has been around for a while. It allows some SAS datasets to be read into R directly. However, it didn't deal with compressed SAS datasets at all! Recently I discovered the sas7bdat.parso package, which is by the same author and uses the parso Java library to read SAS datasets.

I tried it today and it worked flawlessly! However, it requires Java 7, which may not be available on the many corporate PCs still running Windows XP. Having said that, my IT guys installed it for me with no issues on Windows XP.

Here are the simple steps required to use the sas7bdat.parso package:

  1. Make sure you have Java 7 or above installed on your computer (https://www.java.com/en/download/help/download_options.xml)
  2. Install the packages rJava, devtools, and sas7bdat.parso
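
Step 2 might look roughly like the following. This is a sketch: rJava and devtools are on CRAN, while sas7bdat.parso is assumed here to be installed from the BioStatMatt/sas7bdat.parso GitHub repository, the same one that appears in the comments on this post.

```r
# rJava and devtools are available from CRAN
install.packages(c("rJava", "devtools"))

# Assumption: sas7bdat.parso is installed from its GitHub repository
devtools::install_github("BioStatMatt/sas7bdat.parso")
```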

Once the package has been installed, you can read in SAS datasets (which all have the extension .sas7bdat) using the read.sas7bdat.parso function.
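
A minimal example of reading a dataset, assuming the package loaded without Java errors. The file name is a placeholder, not a real dataset.

```r
library(sas7bdat.parso)

# "mydata.sas7bdat" is a placeholder path; substitute your own file
dat <- read.sas7bdat.parso("mydata.sas7bdat")

# Quick look at what came back
str(dat)
```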

The code behind the function read.sas7bdat.parso is simple: it converts the SAS dataset to a CSV file and then reads that file into R using read.csv. There are some obvious ways to improve on this. I use the data.table package, so the simplest change I can think of is to replace read.csv with data.table's fread, which should read the data much faster and return a data.table instead of a data.frame.
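
The idea can be sketched as a small wrapper. This is not the package's actual source: read_sas_with and its to_csv argument are hypothetical names, and to_csv stands in for whatever function performs the SAS-to-CSV conversion. The reader defaults to data.table::fread, so the data.table package is assumed to be installed if you use the default.

```r
# Hypothetical helper, not part of sas7bdat.parso.
# `to_csv` is any function(sas_file, csv_file) that writes the dataset as CSV;
# `reader` is the function used to read that CSV back into R.
read_sas_with <- function(sas_file, to_csv, reader = data.table::fread, ...) {
  csv_file <- tempfile(fileext = ".csv")
  on.exit(unlink(csv_file), add = TRUE)  # remove the temporary CSV afterwards
  to_csv(sas_file, csv_file)             # conversion step
  reader(csv_file, ...)                  # with fread this returns a data.table
}
```

Swapping the reader is then just a matter of passing a different function, for example reader = read.csv to get the original behaviour back.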

As of the latest version of the sas7bdat.parso package, the function read.sas7bdat.parso has a READ_FUNC parameter. You can specify READ_FUNC = data.table::fread and it will return a data.table.
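
A call using that parameter might look like this. The file name is a placeholder, and the argument name is taken from the description above, so it is worth checking ?read.sas7bdat.parso for the exact spelling in your installed version.

```r
library(sas7bdat.parso)

# READ_FUNC as described above; "mydata.sas7bdat" is a placeholder path
dt <- read.sas7bdat.parso("mydata.sas7bdat", READ_FUNC = data.table::fread)

# Inspect the class of the result
class(dt)
```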

There are other opportunities to improve the package. Currently read.sas7bdat.parso converts the SAS dataset into a CSV file first. If this conversion step could be skipped so that the data is read in more directly, it would be faster still. The ability to read the data as a stream or connection, so that it can be processed in chunks, would be highly desirable too!
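
To illustrate what chunked processing could look like, here is a base R sketch that reads a CSV file a fixed number of rows at a time through an open connection. It is not something the package offers: process_csv_in_chunks is a hypothetical helper, and it works on a CSV (such as the intermediate file), not on the SAS dataset itself.

```r
# Hypothetical helper: process a CSV in chunks of `chunk_size` rows.
# `handle_chunk` is called once per chunk with a data.frame.
process_csv_in_chunks <- function(csv_file, chunk_size, handle_chunk) {
  con <- file(csv_file, open = "r")
  on.exit(close(con), add = TRUE)

  # The first read also consumes the header line
  chunk <- read.csv(con, nrows = chunk_size, stringsAsFactors = FALSE)
  col_names <- names(chunk)

  while (nrow(chunk) > 0) {
    handle_chunk(chunk)
    # Later reads continue from where the connection left off
    chunk <- tryCatch(
      read.csv(con, nrows = chunk_size, header = FALSE,
               col.names = col_names, stringsAsFactors = FALSE),
      error = function(e) NULL  # treat an error at end of input as "no more data"
    )
    if (is.null(chunk)) break
  }
  invisible(NULL)
}
```

Column types are guessed separately for each chunk here, so a real implementation would probably want to pass colClasses explicitly.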

4 comments:

  1. I successfully installed devtools and rJava, but I get the following error message when trying to install sas7bdat.parso:


    Error : .onLoad failed in loadNamespace() for 'rJava', details:
    call: fun(libname, pkgname)
    error: JAVA_HOME cannot be determined from the Registry
    Error : package 'rJava' could not be loaded
    Error: loading failed
    Execution halted
    *** arch - x64
    ERROR: loading failed for 'i386'
    * removing 'C:/Users/austin.lasseter/Documents/R/win-library/3.3/sas7bdat.parso'
    Error: Command failed (1)

    Replies
    1. Hi Austin,
      I had a similar problem. You have to check your JAVA_HOME path by running this command in R: Sys.getenv("JAVA_HOME") and compare the output with the actual location of your Java installation. Also, keep in mind which version of R (32-bit or 64-bit) you are using, because the Java installation needs to match it.
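
      The checks described in this reply can be sketched in a few lines of base R. Nothing here is specific to rJava; it only prints what the current R session sees.

      ```r
      # Where R thinks Java lives (an empty string means JAVA_HOME is not set)
      java_home <- Sys.getenv("JAVA_HOME")
      print(java_home)

      # Architecture of the running R session, to compare with the Java install
      print(R.version$arch)

      # Does the JAVA_HOME directory actually exist on disk?
      print(nzchar(java_home) && dir.exists(java_home))
      ```

      If the path turns out to be wrong, one option is to set it for the session with Sys.setenv(JAVA_HOME = ...) before loading rJava, using the real location of your Java installation.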

  2. This comment has been removed by the author.

  3. Hi, I am getting the following error when installing sas7bdat.parso:

    devtools::install_github("BioStatMatt/sas7bdat.parso",force = "TRUE")
    Downloading GitHub repo BioStatMatt/sas7bdat.parso@master
    from URL https://api.github.com/repos/BioStatMatt/sas7bdat.parso/zipball/master
    Installing sas7bdat.parso
    "C:/PROGRA~1/R/R-32~1.4RE/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL \
    "C:/Users/nveeramachaneni/AppData/Local/Temp/Rtmp82ZMKe/devtools18ac15eef7d/BioStatMatt-sas7bdat.parso-867f26a" \
    --library="C:/Users/nveeramachaneni/Documents/R/win-library/3.2" --install-tests

    Error in setwd(dir = new) : cannot change working directory
