So I was trying to draw 1,000 random samples from each of 30 groups within roughly 30k rows of data. I came across this function:
http://news.mrdwab.com/2011/05/20/stratified-random-sampling-in-r-from-a-data-frame/
However, when I ran this function on my data, R ran out of memory. So I had to write my own stratified sampling function, one that would work on large data sets with many groups.
After some trial and error, the key turned out to be sorting the data by group and then computing per-group counts, so that each group occupies a contiguous block of rows and can be sampled by offset. The procedure is extremely fast, taking only 0.18 seconds on the ~30k-row data set described above. I welcome any feedback on how to improve!
stratified_sampling <- function(df, id, size) {
  # df is the data to sample from
  # id is the column to use for the groups to sample
  # size is the count you want to sample from each group

  # Order the data so each group occupies a contiguous block of rows
  df <- df[order(df[, id], decreasing = FALSE), ]

  # Get the unique groups and their counts (prepend a 0 so that the
  # running sum of counts gives each group's starting row)
  groups <- unique(df[, id])
  group.counts <- c(0, table(df[, id]))

  rows <- mat.or.vec(nr = size, nc = length(groups))

  # Generate a matrix of sampled row indices, one column per group
  for (i in 1:(length(group.counts) - 1)) {
    start.row <- sum(group.counts[1:i]) + 1
    # Sample 'size' offsets within this group's block of rows
    samp <- sample(group.counts[i + 1], size, replace = FALSE)
    rows[, i] <- start.row + samp - 1
  }

  sample.rows <- as.vector(rows)
  df[sample.rows, ]
}
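If you want to try it out, here is a quick sanity check on synthetic data shaped like the scenario above. The data frame and column names here (test.df, group, value) are made up purely for illustration:

# Build a toy data set: 30 groups of 1,000 rows each (~30k rows total)
set.seed(42)
test.df <- data.frame(
  group = rep(paste0("g", 1:30), each = 1000),
  value = rnorm(30000),
  stringsAsFactors = FALSE
)

# Draw 100 rows per group and time the call
system.time(samp.df <- stratified_sampling(test.df, "group", 100))

# Each group should appear exactly 100 times
table(samp.df$group)

Note that sample() is called with replace = FALSE, so the function will stop with an error if any group has fewer rows than the requested size.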