# Michael Kane on Bigmemory

Handling big data sets has always been a concern for R users. Once a data set grows beyond about 50% of RAM, it is considered “massive” and can become practically impossible to work with on a standard machine. The bigmemory project, by Michael Kane and Jay Emerson, is one approach to dealing with this class of data set. Last Monday, December 13th, the New England R Users Group warmly welcomed Michael Kane to talk about bigmemory and R.

bigmemory is one of five packages in the bigmemory project, which is designed to extend R to better handle large data sets. The core data structures are written in C++ and allow R users to create matrix-like data objects (called big.matrix). These big.matrix objects support standard R matrix indexing, so they can be used in much of the code written for an ordinary matrix. The backing store for a big.matrix can be a memory-mapped file, allowing it to take on sizes much larger than available RAM.
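As a rough illustration of the file-backed idea, the sketch below creates a small big.matrix whose data live in a file on disk and then attaches a second handle to the same data. It assumes the bigmemory package is installed; the file names, dimensions, and column names are made up for the example and are not from Mike's talk.

```r
# Sketch only: requires the bigmemory package.
library(bigmemory)

# Hypothetical file names, made unique so the example can be rerun.
backing_dir <- tempdir()
base_name   <- basename(tempfile("demo_"))

# Create a file-backed big.matrix; the values are stored in a
# memory-mapped file on disk rather than in R's own heap.
x <- filebacked.big.matrix(
  nrow = 1000, ncol = 3, type = "double", init = 0,
  backingfile    = paste0(base_name, ".bin"),
  backingpath    = backing_dir,
  descriptorfile = paste0(base_name, ".desc"),
  dimnames       = list(NULL, c("a", "b", "c"))
)

# Matrix-style indexing reads from and writes to the backing file.
x[, "a"] <- seq_len(1000)
col_mean <- mean(x[, "a"])

# A descriptor lets another handle attach to the same underlying data.
y <- attach.big.matrix(describe(x))
last_value <- y[1000, "a"]
```

Only the rows and columns that are actually indexed are pulled into an ordinary R vector, which is what keeps the memory footprint small.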

Mike discussed the use of bigmemory on the well-known Airline on-time data set, which includes over 20 years of data on roughly 120 million commercial US airline flights. The data set is roughly 12 GB in size. Given the usual guidance that data read with read.table() should be no more than 10%-20% of RAM, it is nearly impossible to work with on a standard machine. However, bigmemory allows you to read in and analyze the data without problems.

Mike also showed how bigmemory can be used with the MapReduce (or split-apply-combine) method to greatly reduce the time required by many statistical calculations. For example, to determine whether older planes suffer greater delays, you need to know how old each of the 13,000 planes is. This calculation is estimated to require nearly 9 hours on a standard single-core system. Even when running in parallel on 4 cores, it can take nearly 2 hours. However, using bigmemory and the split-apply-combine method, the computation takes a little over one minute!
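To make the split-apply-combine pattern concrete, here is a minimal base R sketch on a tiny invented data frame. The column names (TailNum, Year) and the idea of using a plane's earliest recorded year as a proxy for its age are assumptions for illustration; the post does not say exactly how the age was computed in the talk.

```r
# Toy data, invented for illustration: one row per flight.
flights <- data.frame(
  TailNum = c("N1", "N2", "N1", "N3", "N2", "N3"),
  Year    = c(1995, 2001, 1990, 2005, 1999, 2007),
  stringsAsFactors = FALSE
)

# Split: group the row indices by plane.
rows_by_plane <- split(seq_len(nrow(flights)), flights$TailNum)

# Apply: find the earliest year in which each plane appears.
first_year <- sapply(rows_by_plane, function(idx) min(flights$Year[idx]))

# Combine: map the per-plane result back onto every flight
# to get the plane's (approximate) age at the time of the flight.
flights$Age <- flights$Year - first_year[as.character(flights$TailNum)]
```

Because each group is processed independently, the apply step is the part that could be handed to several cores.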

The bigmemory project recently won the 2010 John M. Chambers Statistical Software Award, which was presented to Mike at the 2010 Joint Statistical Meetings held in August.

Thanks for this post. I tried bigmemory once, and it is great for a large CSV file that contains only numbers. However, when my CSV has mixed column types (for instance, some columns are doubles and others are character), bigmemory fails to read it and reports an error. read.table() can read a CSV with mixed types, but it cannot handle a large one. Any suggestions for my case? Thank you Josh.

If you want to use bigmemory with mixed-type columns, you need to numerically encode the non-numeric columns yourself, which is pretty easy to do using R’s scan function. However, you may also consider using a database. Hadley Wickham has a great example of how to do this with the airline on-time data set at http://stat-computing.org/dataexpo/2009/.
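A minimal sketch of that kind of manual encoding, using only base R, might look like the following. The file contents and column names are invented for the example, and a truly large file would need to be processed in chunks rather than in a single scan() call.

```r
# Write a tiny mixed-type CSV to a temporary file (invented data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("carrier,delay", "AA,12.5", "UA,3.0", "AA,-1.0", "DL,7.25"),
           csv_path)

# Read each column with its own type, skipping the header line.
cols <- scan(csv_path,
             what  = list(carrier = character(), delay = numeric()),
             sep   = ",", skip = 1, quiet = TRUE)

# Encode the character column as integer codes, keeping the lookup table
# so the codes can be translated back to labels later.
carrier_levels <- sort(unique(cols$carrier))
carrier_code   <- match(cols$carrier, carrier_levels)

# The result is an all-numeric matrix, the shape a big.matrix expects.
numeric_data <- cbind(carrier = carrier_code, delay = cols$delay)
```

Keeping carrier_levels alongside the data is what makes the encoding reversible: carrier_levels[carrier_code] recovers the original labels.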

I think this is a very interesting approach, but it’s not clear why, in principle, you would need to load the 12 GB into memory (virtual or otherwise) at once.

Many operations, from sums to correlation matrices, can be efficiently calculated by sequentially reading through the data.
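As a rough illustration of the sequential approach, the base R sketch below computes a mean by reading a file a few lines at a time, so only one chunk is ever held in memory. The file, its single column, and the chunk size are made up for the example.

```r
# Write a small single-column file to stand in for a large one (invented data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("delay", as.character(1:10)), csv_path)

con <- file(csv_path, open = "r")
invisible(readLines(con, n = 1))  # skip the header line

total <- 0
n     <- 0
repeat {
  chunk <- readLines(con, n = 4)   # read at most 4 lines per pass
  if (length(chunk) == 0) break    # end of file
  values <- as.numeric(chunk)
  total  <- total + sum(values)    # update the running sum
  n      <- n + length(values)     # and the running count
}
close(con)

running_mean <- total / n
```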

One of the goals of the renjin project (http://code.google.com/p/renjin) is to build a new R interpreter that gives developers the flexibility to back R language objects with many different types of storage, including database connections or flat text files with a rolling buffer. We’re probably a good six months from a beta release, but one of the ultimate goals is to make R a better platform for “big data”.

While it is true that operations like sums and correlation matrices can be performed using other approaches like a rolling buffer, these calculations still require that data is loaded into memory. bigmemory makes the process of moving data from disk to memory transparent and does so in a way that is compatible with synchronous and asynchronous approaches to big-data calculations along with optimized linear algebra libraries. As a result, a big.matrix object is not only a mechanism for managing data, like a database or flat file, it is also a numeric object that can be used directly when performing calculations.

Hi, I’m currently working with the bigmemory library on Linux, and I’m trying to read data from a 17 GB CSV file on a PC with 4 GB of RAM. When I run:

After a few seconds I get:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :

could not allocate memory (2048 Mb) in C function ‘R_AllocStringBuffer’

Calls: read.big.matrix -> read.big.matrix -> scan