R Function for Stratified Sampling

I was trying to draw 1000 random samples from 30 different groups in a data set of approximately 30k rows. I came across this function:

http://news.mrdwab.com/2011/05/20/stratified-random-sampling-in-r-from-a-data-frame/

However, when I ran this function on my data, I received an error that R ran out of memory. Therefore, I had to create my own stratified sampling function that would work for large data sets with many groups.

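After some trial and error, the key turned out to be sorting the data by the grouping column and then computing the row count for each group. Once the data is sorted, each group occupies a contiguous block of rows, so the counts alone are enough to work out which row numbers belong to which group. The procedure is extremely fast, taking only 0.18 seconds on a large data set. I welcome any feedback on how to improve it!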

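stratified_sampling <- function(df, id, size) {
  # df is the data to sample from
  # id is the column to use for the groups to sample
  # size is the count you want to sample from each group

  # Order the data based on the groups so each group is a contiguous block
  df <- df[order(df[, id], decreasing = FALSE), ]

  # Count the rows in each group; the leading 0 makes the running sum
  # give the number of rows that come before each group
  group.counts <- c(0, table(df[, id]))
  n.groups <- length(group.counts) - 1

  # One column of sampled row numbers per group
  rows <- matrix(0, nrow = size, ncol = n.groups)

  # Generate matrix of sample rows for each group
  for (i in 1:n.groups) {
    # Number of rows that precede group i in the sorted data
    offset <- sum(group.counts[1:i])
    # Draw positions from 1..n within the group, so the first and last
    # rows of the group can both be chosen (errors if size > n)
    samp <- sample.int(group.counts[i + 1], size, replace = FALSE)

    rows[, i] <- offset + samp
  }

  sample.rows <- as.vector(rows)
  df[sample.rows, ]
}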
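To check that the sort-and-offset idea behaves as intended, here is a minimal, self-contained sketch on made-up data. The helper name `sample_by_offsets` and the toy data frame are hypothetical, introduced only for illustration; it is a compact variant of the same approach, not a replacement for the function above.

```r
# Hypothetical compact variant of the sort-and-offset idea (illustration only)
sample_by_offsets <- function(df, id, size) {
  # Sort so that each group's rows form one contiguous block
  df <- df[order(df[[id]]), , drop = FALSE]
  counts <- table(df[[id]])
  counts <- as.vector(counts[counts > 0])   # ignore unused factor levels
  # Sampling without replacement needs at least `size` rows per group
  stopifnot(all(counts >= size))
  # Number of rows that precede each group in the sorted data
  offsets <- cumsum(c(0, head(counts, -1)))
  # For each group, draw `size` positions in 1..n and shift by the offset
  picks <- unlist(Map(function(n, off) off + sample.int(n, size),
                      counts, offsets))
  df[picks, , drop = FALSE]
}

# Toy data: three groups of unequal size (values are made up)
set.seed(1)
toy <- data.frame(grp = rep(c("a", "b", "c"), times = c(50, 20, 30)),
                  val = rnorm(100))

out <- sample_by_offsets(toy, "grp", size = 5)
table(out$grp)   # every group should contribute exactly 5 rows
```

Tabulating the result by group is a quick sanity check worth running on real data too: each group should appear exactly `size` times, and no row should appear twice.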

8 thoughts on “R Function for Stratified Sampling”

DaNiu says:

September 13, 2013 at 9:00 pm

Hi, I know this is an older post but I found it really useful! Is it possible to do weighted sampling w/ a similar method? For instance, if I wanted to subset on a few different variables, then select x% from those variables?

1. Catarina says:
  
  August 21, 2018 at 4:06 am
  
  Hi,
  
  Did you find out how to do the stratified weighted sampling?
  
  Thank you,
  
  Catarina
  
mrdwab says:

September 29, 2014 at 10:43 am
Hi Adam,

I know this is old news now, but I recently came across this post and wanted to let you know that there have been some major additions to `stratified` since the time the function was originally written. I’m not surprised you ran into memory issues with my old version!

The biggest change is that the newer version (found here: https://gist.github.com/mrdwab/933ffeaa7a1d718bd10a) makes use of “data.table” so it is very efficient.

There are also enhancements in how it handles subsetting, using multiple variables for stratification, and so on.

And, hopefully, you won’t still be running into memory problems!

~ Ananda
1. adamsanalytics says:
  
  October 10, 2014 at 2:11 pm
  
  Ananda,
  
  Thanks for the update. I’ll try to find the time to check out the new code and update the post.
  
  Adam
  
  1. adamsanalytics says:
    
    November 25, 2014 at 10:57 pm
    
    Ananda,
    
    I checked out your new code and it looks good. I wrote a new function to simplify the sampling process and extend the functionality for arbitrarily sized and named buckets.
    
    Let me know if you have any feedback or want to collaborate.
    
    Update to Stratified Sampling in R