I was trying to draw 1,000 random samples from 30 different groups in a data set of roughly 30k rows, and I came across this function:
http://news.mrdwab.com/2011/05/20/stratified-random-sampling-in-r-from-a-data-frame/
However, when I ran this function on my data, R ran out of memory. So I wrote my own stratified sampling function that works on large data sets with many groups.
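After some trial and error, the key turned out to be sorting the data by the grouping column and then computing the row count for each group. Once the data is sorted, each group occupies a contiguous block of rows, so the counts give the block boundaries directly. The procedure is extremely fast, taking only 0.18 seconds on a large data set. I welcome any feedback on how to improve it!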
stratified_sampling <- function(df, id, size) {
  # df is the data to sample from
  # id is the column to use for the groups to sample
  # size is the count you want to sample from each group
  # Order the data based on the groups
  df <- df[order(df[, id], decreasing = FALSE), ]
  # Count the rows in each group. The leading 0 means that
  # sum(group.counts[1:i]) is the number of rows before group i.
  group.counts <- c(0, table(df[, id]))
  n.groups <- length(group.counts) - 1
  # matrix() keeps two dimensions even when there is only one group
  rows <- matrix(0, nrow = size, ncol = n.groups)
  # Generate Matrix of Sample Rows for Each Group
  for (i in seq_len(n.groups)) {
    start.row <- sum(group.counts[1:i]) + 1
    # Offsets run from 0 to (group size - 1), so the first row of the
    # group can be drawn too. Errors if size exceeds the group size.
    samp <- sample.int(group.counts[i + 1], size, replace = FALSE) - 1
    rows[, i] <- start.row + samp
  }
  sample.rows <- as.vector(rows)
  df[sample.rows, ]
}
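The offset arithmetic behind the sort-then-count idea is easy to check on a toy data set. The sketch below is only an illustration: the data are made up, and the names `counts`, `starts`, and `picked` are mine, not part of the function above. It shows how the cumulative group counts turn into the first row of each group, and how a per-group draw is shifted by that offset.

```r
# Toy data: three groups of unequal size (values are made up for illustration)
set.seed(42)
df <- data.frame(grp = rep(c("a", "b", "c"), times = c(5, 3, 4)),
                 x = rnorm(12))

# Sort by group so each group sits in one contiguous block of rows
df <- df[order(df$grp), ]

# Rows per group, and the first row of each block in the sorted data
counts <- as.vector(table(df$grp))
starts <- cumsum(c(0, head(counts, -1))) + 1

# Draw `size` distinct rows from each block: an offset in 0..(n - 1)
# added to the block's first row
size <- 2
picked <- unlist(lapply(seq_along(counts), function(i) {
  starts[i] + sample.int(counts[i], size) - 1
}))

sampled <- df[picked, ]
print(starts)
print(table(sampled$grp))
```

With groups of 5, 3, and 4 rows the blocks start at rows 1, 6, and 9, and every group contributes exactly `size` rows to `sampled`.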
Hi, I know this is an older post, but I found it really useful! Is it possible to do weighted sampling with a similar method? For instance, what if I wanted to subset on a few different variables and then select x% from those variables?
Hi,
Did you find out how to do the stratified weighted sampling?
Thank you,
Catarina
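One possible reading of the question above is proportional allocation: take the same fraction from every combination of several stratifying columns. The following is a rough sketch under that assumption, not a tested extension of the function in the post. The function name `stratified_fraction` and its arguments `strata` and `frac` are hypothetical, and the toy data are made up.

```r
# Sketch: sample the same fraction from every observed combination of
# the stratifying columns. All names here are illustrative.
stratified_fraction <- function(df, strata, frac) {
  # Row indices grouped by each observed combination of the strata columns
  idx <- split(seq_len(nrow(df)), df[strata], drop = TRUE)
  picked <- unlist(lapply(idx, function(i) {
    # At least one row per stratum, otherwise the rounded fraction
    n <- max(1, round(length(i) * frac))
    i[sample.int(length(i), n)]
  }), use.names = FALSE)
  # Return the sampled rows in their original order
  df[sort(picked), , drop = FALSE]
}

# Toy data: 2 regions x 2 sexes, 15 rows in each combination
set.seed(1)
toy <- data.frame(region = rep(c("n", "s"), each = 30),
                  sex = rep(c("f", "m"), times = 30),
                  y = seq_len(60))
out <- stratified_fraction(toy, c("region", "sex"), 0.2)
print(table(out$region, out$sex))
```

Indexing through `i[sample.int(length(i), n)]` rather than `sample(i, n)` avoids the surprise where `sample()` treats a single number as the range `1:n`. Unequal sampling weights within a stratum would need the `prob` argument of `sample.int`, which this sketch does not cover.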
Hi Adam,
I know this is old news now, but I recently came across this post and wanted to let you know that there have been some major additions to the function since it was originally written. I’m not surprised you ran into memory issues with my old version!
The biggest change is that the newer version (found here: https://gist.github.com/mrdwab/933ffeaa7a1d718bd10a) makes use of “data.table” so it is very efficient.
There are also enhancements in how it handles subsetting, using multiple variables for stratification, and so on.
And, hopefully, you won’t still be running into memory problems!
~ Ananda
Ananda,
Thanks for the update. I’ll try to find the time to check out the new code and update the post.
Adam
Ananda,
I checked out your new code and it looks good. I wrote a new function to simplify the sampling process and extend the functionality for arbitrarily sized and named buckets.
Let me know if you have any feedback or want to collaborate.
https://adammcelhinney.com/2014/11/25/update-to-stratified-sampling-in-r/
Excellent blog you have here, but I was curious whether you know of any user discussion forums that cover the same topics discussed here. I’d really love to be part of a community where I can get feedback from other experienced people who share the same interests. If you have any recommendations, please let me know.
Thanks a lot!