Back in 2012, I wrote a short post describing some of the issues I was having with sampling in R using the function written by Ananda. Ananda actually commented on that post and pointed me to an updated version of his code, which is greatly improved.
However, I still feel that there is room for improvement. A common example: during the model building process, you often want to split data into training, testing and validation data sets. This should be extremely easy and done in a standardized way.
Thus, I have set out to simplify this process. My goal is to write an R function for splitting data frames that:
1. Allows the user to specify any number of splits and the size of each split (the splits need not all be the same size).
2. Lets the user specify the names of the resulting data sets.
3. Provides default values for the most common use case.
4. Just “works”. Ideally this means that the function is intuitive and can be used without requiring the user to read any documentation. It also means that it should have minimal errors, produce useful error messages and protect against unintended usage.
5. Is written in a readable and well-commented manner. This should make it easier to debug and extend, even if performance is not 100% optimal.
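To make goal 4 a little more concrete, here is a minimal sketch of the kind of up-front argument checks I have in mind. The helper name check_split_args and the exact checks and messages are assumptions made for illustration; this is not the code from the package.

```r
# Hypothetical helper: the name, checks and messages are illustrative assumptions,
# not the actual helpRFunctions code.
check_split_args <- function(df, pcts, set.names) {
  if (!is.data.frame(df)) {
    stop("`df` must be a data frame, not an object of class '",
         class(df)[1], "'.", call. = FALSE)
  }
  if (!is.numeric(pcts) || any(is.na(pcts)) || any(pcts <= 0)) {
    stop("`pcts` must be a numeric vector of positive values.", call. = FALSE)
  }
  if (length(pcts) != length(set.names)) {
    stop("`pcts` has ", length(pcts), " value(s) but `set.names` has ",
         length(set.names), "; they must be the same length.", call. = FALSE)
  }
  if (anyDuplicated(set.names) > 0) {
    stop("`set.names` must not contain duplicates.", call. = FALSE)
  }
  invisible(TRUE)
}

# Mismatched lengths: capture the error message instead of halting the script.
msg <- tryCatch(
  check_split_args(data.frame(x = 1:10), pcts = c(0.7, 0.3), set.names = "training"),
  error = function(e) conditionMessage(e)
)
print(msg)
```

The point is that each message names the offending argument and says what was expected, so the user does not have to open the documentation to find out what went wrong.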
I have written this code as part of helpRFunctions, a package I am developing to make R programming as painless as possible.
The function takes just a few arguments:
1. df: The data frame to split.
2. pcts: Optional. The percentage of observations to put into each bucket.
3. set.names: Optional. What to name the resulting data sets. This must be the same length as the pcts vector.
4. seed: Optional. A seed to use for sampling. Defaults to NULL, in which case no seed is set and R's random number generator is used as usual.
The function then returns a list containing data frames named according to the set.names argument.
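To illustrate how these arguments could fit together, here is a rough sketch of one way such a function might work. It is not the implementation in helpRFunctions: the name split_df_sketch, the default values, and the choice to express pcts as proportions that sum to 1 are all assumptions made for this example.

```r
# Hypothetical sketch, not the helpRFunctions implementation.
# Assumptions: pcts are proportions summing to 1; defaults are an 80/20 split.
split_df_sketch <- function(df,
                            pcts = c(0.8, 0.2),                    # assumed default
                            set.names = c("training", "testing"),  # assumed default
                            seed = NULL) {
  stopifnot(is.data.frame(df),
            all(pcts > 0),
            length(pcts) == length(set.names),
            abs(sum(pcts) - 1) < 1e-8)

  # Only touch the random number generator when a seed is supplied.
  if (!is.null(seed)) set.seed(seed)

  n <- nrow(df)

  # Turn cumulative proportions into row-count boundaries. The last boundary
  # is forced to n so that every row lands in exactly one bucket.
  cuts <- round(cumsum(pcts) * n)
  cuts[length(cuts)] <- n
  sizes <- diff(c(0, cuts))

  # Build one bucket label per row, then shuffle the labels across rows.
  labels <- rep(seq_along(pcts), times = sizes)[sample.int(n)]

  out <- lapply(seq_along(pcts), function(i) df[labels == i, , drop = FALSE])
  names(out) <- set.names
  out
}

df <- data.frame(matrix(rnorm(110), nrow = 11))
parts <- split_df_sketch(df,
                         pcts = c(0.6, 0.2, 0.2),
                         set.names = c("training", "testing", "validation"),
                         seed = 42)
sapply(parts, nrow)  # number of rows in each named data set
```

Because bucket sizes are rounded, the realized split only approximates the requested proportions on small data frames, but the sizes always add up to the number of rows in df. Calling the sketch twice with the same seed gives the same split.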
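Here is a brief example of how to use the function.
install.packages('devtools') # Only needed if you don't have devtools installed.
library(helpRFunctions) # Assumes the helpRFunctions package is already installed.
df <- data.frame(matrix(rnorm(110), nrow = 11)) # 11 rows, 10 columns of random data.
t <- split.data(df) # Split using the default arguments.
training <- t$training
testing <- t$testing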
Let’s see how the function performs for a large-ish data frame.
df <- data.frame(matrix(rnorm(110*1000000), nrow = 11*1000000))
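If you want to take this kind of measurement yourself, one way is to wrap the split in system.time(). The sketch below is only a stand-in: it uses a much smaller data frame and a plain base-R 80/20 split rather than split.data, so its timings are not comparable to the figure quoted in this post.

```r
# Stand-in for split.data: a plain base-R 80/20 split (an assumption for
# illustration), run on a much smaller data frame than the one in the post.
n_rows <- 1e5
df_small <- data.frame(matrix(rnorm(n_rows * 10), nrow = n_rows))

timing <- system.time({
  # Draw 80% of the row indices at random for training; the rest go to testing.
  idx <- sample.int(nrow(df_small), size = floor(0.8 * nrow(df_small)))
  parts <- list(training = df_small[idx, , drop = FALSE],
                testing  = df_small[-idx, , drop = FALSE])
})

timing["elapsed"]  # wall-clock time in seconds
```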
Less than 3 seconds for split.data to split 11 million rows! Not too bad. The full code can be found here. As always, I welcome any feedback or pull requests.