
I was trying to obtain 1000 random samples from 30 different groups within approximately 30k rows of data. I came across this function:

http://news.mrdwab.com/2011/05/20/stratified-random-sampling-in-r-from-a-data-frame/

However, when I ran this function on my data, I received an error that R ran out of memory. Therefore, I had to create my own stratified sampling function that would work for large data sets with many groups.

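After some trial and error, I found that the key was to sort the data by the desired groups and then compute the row counts for those groups. Once the data is sorted, each group occupies a contiguous block of rows, so the running total of the counts gives the row where each group starts. The procedure is extremely fast, taking only 0.18 seconds on a large data set. I welcome any feedback on how to improve it!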

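    stratified_sampling <- function(df, id, size) {
      # df is the data to sample from
      # id is the column to use for the groups to sample
      # size is the count you want to sample from each group

      # Order the data based on the groups, so each group is a contiguous block
      df <- df[order(df[, id], decreasing = FALSE), ]

      # Get unique groups
      groups <- unique(df[, id])

      # Rows per group, with a leading 0 so that sum(group.counts[1:i]) is the
      # number of rows before group i; empty factor levels are dropped
      group.counts <- table(df[, id])
      group.counts <- c(0, group.counts[group.counts > 0])

      # matrix() keeps two dimensions even when there is only one group
      rows <- matrix(0, nrow = size, ncol = length(groups))

      # Generate matrix of sample rows for each group
      for (i in 1:(length(group.counts) - 1)) {
        # First row of group i in the sorted data
        start.row <- sum(group.counts[1:i]) + 1
        # Offsets 0..(n - 1), so every row of the group can be drawn
        samp <- sample(group.counts[i + 1], size, replace = FALSE) - 1
        rows[, i] <- start.row + samp
      }

      sample.rows <- as.vector(rows)
      df[sample.rows, ]
    }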
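For comparison, here is a minimal sketch of the same sort-then-offset idea written without an explicit loop. The function name `stratified_rows` and the synthetic `toy` data are my own illustrations, not part of any package, and the sketch assumes every group has at least `size` rows (otherwise `sample.int` stops with an error).

```r
# Sketch: stratified sampling via sort + cumulative offsets, without a loop.
# `stratified_rows` is an illustrative name, not a function from a package.
stratified_rows <- function(df, id, size) {
  # Sort so that every group occupies one contiguous block of rows
  df <- df[order(df[[id]]), , drop = FALSE]

  # Rows per group, in the same order as the sorted data; drop empty levels
  counts <- table(df[[id]])
  counts <- as.integer(counts[counts > 0])

  # Number of rows that precede each group in the sorted data
  offsets <- cumsum(c(0, head(counts, -1)))

  # For each group, draw `size` distinct positions in 1..count and shift them
  picks <- mapply(function(n, off) off + sample.int(n, size),
                  counts, offsets, SIMPLIFY = FALSE)

  df[unlist(picks, use.names = FALSE), , drop = FALSE]
}

# Usage on synthetic data: 30 groups, 30,000 rows, 10 rows sampled per group
set.seed(1)
toy <- data.frame(group = sample(sprintf("g%02d", 1:30), 30000, replace = TRUE),
                  value = rnorm(30000))
out <- stratified_rows(toy, "group", size = 10)
table(out$group)  # each group should appear exactly 10 times
```

Because only row indices are generated before the final subset, the memory cost stays close to one sorted copy of the data plus a small integer vector.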

DaNiu said:

Hi, I know this is an older post but I found it really useful! Is it possible to do weighted sampling with a similar method? For instance, if I wanted to subset on a few different variables, then select x% from those variables?
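One possible base-R approach to this, reading the question as "take x% of the rows within every combination of several variables", is sketched below. The helper name `sample_fraction`, its arguments, and the toy columns are assumptions made for illustration; this is not the method from the post or from the linked function.

```r
# Sketch: sample a fixed fraction of rows within every combination of several
# stratifying columns. `sample_fraction` is a hypothetical helper name.
sample_fraction <- function(df, vars, frac) {
  # One stratum per observed combination of the stratifying columns
  strata <- interaction(df[vars], drop = TRUE)

  # Row indices belonging to each stratum
  idx_by_stratum <- split(seq_len(nrow(df)), strata)

  # Take round(frac * n) rows from each stratum, without replacement
  picks <- lapply(idx_by_stratum, function(idx) {
    k <- round(frac * length(idx))
    idx[sample.int(length(idx), k)]
  })

  df[unlist(picks, use.names = FALSE), , drop = FALSE]
}

# Usage on synthetic data: 10% of the rows from each region/segment pair
set.seed(42)
toy2 <- data.frame(region = sample(c("north", "south"), 2000, replace = TRUE),
                   segment = sample(c("a", "b", "c"), 2000, replace = TRUE),
                   value = rnorm(2000))
out2 <- sample_fraction(toy2, c("region", "segment"), frac = 0.10)
table(out2$region, out2$segment)
```

Very small strata can round down to zero rows, so a minimum per-stratum count may be worth adding if every combination has to be represented.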

mrdwab said:

Hi Adam,

I know this is old news now, but I recently came across this post and wanted to let you know that there have been some major additions to the function since it was originally written. I’m not surprised you ran into memory issues with my old version!

The biggest change is that the newer version (found here: https://gist.github.com/mrdwab/933ffeaa7a1d718bd10a) makes use of “data.table”, so it is very efficient. There are also enhancements in how it handles subsetting, using multiple variables for stratification, and so on.

And, hopefully, you won’t still be running into memory problems!

~ Ananda

adamsanalytics said:

Ananda,

Thanks for the update. I’ll try to find the time to check out the new code and update the post.

Adam

adamsanalytics said:

Ananda,

I checked out your new code and it looks good. I wrote a new function to simplify the sampling process and extend the functionality for arbitrarily sized and named buckets.

Let me know if you have any feedback or want to collaborate.

https://adammcelhinney.com/2014/11/25/update-to-stratified-sampling-in-r/
