R – Adam On Analytics

Back in 2012, I wrote a short post describing some of the issues I was having with sampling in R using the function written by Ananda. Ananda actually commented on my post, directing me to an updated version of his code, which is greatly improved.

However, I still feel that there is room for improvement. A common example: during the model-building process, you often want to split data into training, testing and validation sets. This should be extremely easy and done in a standardized way.

Thus, I have set out to simplify this process. My goal is to write a function in R to split data frames that:

1. Allows the user to specify any number of splits and the size of splits (the splits need not necessarily all be the same size).

2. Allows the user to specify the names of the resulting splits.

3. Provides default values for the most common use case.

4. Just “works”. Ideally this means that the function is intuitive and can be used without requiring the user to read any documentation. This also means that it should have minimal errors, produce useful error messages and protect against unintended usage.

5. Is written in a readable and well-commented manner. This should facilitate debugging and extending functionality, even if this means performance is not 100% optimal.

I have written this code as part of my package that is in development called helpRFunctions, which is designed to make R programming as painless as possible.

The function takes just a few arguments:

1. df : The data frame to split

2. pcts : Optional. The percentage of observations to put into each bucket.

3. set.names : Optional. What to name the resulting data sets. This must be the same length as the pcts vector.

4. seed : Optional. Define a seed to use for sampling. Defaults to NULL, which just uses the normal random number generator in R.

The function then returns a list containing data frames named according to the set.names argument.
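
To make the interface concrete, here is a minimal sketch of how a function with these arguments could be written. It is not the helpRFunctions implementation: the name split_data_sketch, the default values, and the treatment of pcts as proportions that sum to 1 are all assumptions made for illustration.

```r
# Hypothetical sketch, NOT the helpRFunctions implementation.
# Assumptions: pcts are proportions summing to 1; the defaults are invented.
split_data_sketch <- function(df,
                              pcts = c(0.7, 0.3),
                              set.names = c("training", "testing"),
                              seed = NULL) {
  # Guard against unintended usage with explicit error messages
  if (!is.data.frame(df)) stop("df must be a data frame")
  if (length(pcts) != length(set.names)) {
    stop("pcts and set.names must have the same length")
  }
  if (any(pcts < 0) || abs(sum(pcts) - 1) > 1e-8) {
    stop("pcts must be non-negative and sum to 1")
  }
  if (!is.null(seed)) set.seed(seed)

  n <- nrow(df)
  # Shuffle the row indices once, then cut them into consecutive buckets
  shuffled <- sample.int(n)
  # Bucket end points from cumulative proportions; the last one is forced to n
  ends <- round(cumsum(pcts) * n)
  ends[length(ends)] <- n
  starts <- c(0, head(ends, -1)) + 1

  out <- lapply(seq_along(pcts), function(i) {
    # An empty bucket is possible when a proportion is very small
    idx <- if (ends[i] >= starts[i]) shuffled[starts[i]:ends[i]] else integer(0)
    df[idx, , drop = FALSE]
  })
  names(out) <- set.names
  out
}

df <- data.frame(matrix(rnorm(110), nrow = 11))
parts <- split_data_sketch(df,
                           pcts = c(0.6, 0.2, 0.2),
                           set.names = c("training", "testing", "validation"),
                           seed = 42)
sapply(parts, nrow)  # number of rows that landed in each split
```

Shuffling the indices once and slicing them guarantees that every row ends up in exactly one split, which is harder to ensure if each split is sampled independently.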

Here is a brief example on how to use the function.

install.packages('devtools') # Only needed if you don't have this installed.
library(devtools)
install_github('adam-m-mcelhinney/helpRFunctions')
library(helpRFunctions)
df <- data.frame(matrix(rnorm(110), nrow = 11))
t <- split.data(df)
training <- t$training
testing <- t$testing

Let’s see how the function performs for a large-ish data frame.

df <- data.frame(matrix(rnorm(110*1000000), nrow = 11*1000000))
system.time(split.data(df))

Less than 3 seconds to split 11 million rows! Not too bad. The full code can be found here. As always, I welcome any feedback or pull requests.

So I was trying to obtain 1000 random samples from 30 different groups within approximately 30k rows of data. I came across this function:

http://news.mrdwab.com/2011/05/20/stratified-random-sampling-in-r-from-a-data-frame/

However, when I ran this function on my data, I received an error that R ran out of memory. Therefore, I had to create my own stratified sampling function that would work for large data sets with many groups.

After some trial and error, the key turned out to be sorting based on the desired groups and then computing counts for those groups. The procedure is extremely fast, taking only 0.18 seconds on a large data set. I welcome any feedback on how to improve!

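stratified_sampling <- function(df, id, size) {
  # df:   the data frame to sample from
  # id:   the column (name or index) that defines the groups
  # size: the number of rows to sample from each group

  # Order the data by group so each group occupies a contiguous block of rows
  df <- df[order(df[, id], decreasing = FALSE), ]

  # Get the unique groups and the row count of each group.
  # The leading 0 makes sum(group.counts[1:i]) the number of rows before group i.
  groups <- unique(df[, id])
  group.counts <- c(0, table(df[, id]))

  # One column of sampled row indices per group
  rows <- matrix(0, nrow = size, ncol = length(groups))

  # Generate the matrix of sample rows for each group
  for (i in 1:(length(group.counts) - 1)) {
    # Number of rows that precede group i in the sorted data
    offset <- sum(group.counts[1:i])
    # Sample positions 1..n within the group, so its first row can be drawn too
    samp <- sample.int(group.counts[i + 1], size, replace = FALSE)
    rows[, i] <- offset + samp
  }

  sample.rows <- as.vector(rows)
  df[sample.rows, ]
}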
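
To see why sorting and counting is enough, here is a small self-contained walk-through on made-up data. The toy data frame, the group sizes and the variable names are assumptions for illustration only. It uses cumsum() to get the number of rows before each group's block, which is the same quantity the loop above computes with sum().

```r
# Toy data: three groups of unequal size (invented for illustration)
set.seed(1)
toy <- data.frame(group = rep(c("a", "b", "c"), times = c(5, 8, 6)),
                  value = rnorm(19))
toy <- toy[sample.int(nrow(toy)), ]  # shuffle so the groups are interleaved

# After sorting, each group occupies one contiguous block of rows
sorted <- toy[order(toy$group), ]
counts <- as.vector(table(sorted$group))            # rows per group: 5, 8, 6
offsets <- c(0, cumsum(counts))[seq_along(counts)]  # rows before each block: 0, 5, 13

# Draw `size` positions inside each block and shift them by the block offset
size <- 3
picked <- unlist(lapply(seq_along(counts), function(i) {
  offsets[i] + sample.int(counts[i], size)
}))

result <- sorted[picked, ]
table(result$group)  # each group contributes exactly `size` rows
```

Because only a vector of counts and a small matrix of indices are ever built, memory use stays proportional to the number of groups times the sample size rather than to the size of the data.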

Update to Stratified Sampling in R

R Function for Stratified Sampling