Adam On Analytics

~ Ramblings on analytics, business, statistics and anything else

Tag Archives: R

Update to Stratified Sampling in R

25 Tuesday Nov 2014

Posted by adamsanalytics in Statistics

≈ 1 Comment

Tags

analytical tools, Analytics, data science, R, statistics

Back in 2012, I wrote a short post describing some of the issues I was having doing sampling in R using the function written by Ananda. Ananda actually commented on my post, directing me to an updated version of the code he has written, which is greatly improved.

However, I still feel that there is room for improvement. A common example: during the model building process, you often want to split your data into training, testing and validation data sets. This should be extremely easy and done in a standardized way.

Thus, I have set out to simplify this process. My goal is to write a function in R to split data frames that:

1. Allows the user to specify any number of splits and the size of splits (the splits need not necessarily all be the same size).

2. Allows the user to specify the names of the resulting split data.

3. Provides default values for the most common use case.

4. Just “works”. Ideally this means that the function is intuitive and can be used without requiring the user to read any documentation. This also means that it should have minimal errors, produce useful error messages and protect against unintended usage.

5. Is written in a readable and well commented manner. This should facilitate debugging and extending functionality, even if this means performance is not 100% optimal.

I have written this code as part of my package that is in development called helpRFunctions, which is designed to make R programming as painless as possible.

The function takes just a few arguments:

1. df: The data frame to split.

2. pcts: Optional. The percentage of observations to put into each bucket.

3. set.names: Optional. What to name the resulting data sets. This must be the same length as the pcts vector.

4. seed: Optional. Define a seed to use for sampling. Defaults to NULL, which is just the normal random number generator in R.

The function then returns a list containing data frames named according to the set.names argument.
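To make the interface described above concrete, here is a minimal sketch of how a function with these arguments could be written in base R. It is not the code from helpRFunctions: the name split_data_sketch, the 70/30 default and the choice to express pcts as proportions that sum to 1 are all assumptions made for illustration.

```r
# Hypothetical sketch; NOT the helpRFunctions implementation.
# Assumes pcts are proportions summing to 1 (e.g. 0.7 and 0.3).
split_data_sketch <- function(df, pcts = c(0.7, 0.3),
                              set.names = c("training", "testing"),
                              seed = NULL) {
  # Guard against the most likely misuse up front.
  stopifnot(is.data.frame(df),
            length(pcts) == length(set.names),
            all(pcts >= 0),
            isTRUE(all.equal(sum(pcts), 1)))
  if (!is.null(seed)) set.seed(seed)

  n <- nrow(df)
  # Turn cumulative proportions into row-count boundaries so the
  # pieces always add up to n despite rounding.
  bounds <- round(cumsum(pcts) * n)
  sizes <- diff(c(0, bounds))

  # Shuffle the row numbers once, then label each position in the
  # shuffled order with the name of the set it falls into.
  shuffled <- sample.int(n)
  labels <- factor(rep(set.names, times = sizes), levels = set.names)
  row.sets <- split(shuffled, labels)

  # Return a named list of data frames, one per set.
  lapply(row.sets, function(idx) df[idx, , drop = FALSE])
}

df <- data.frame(matrix(rnorm(110), nrow = 11))
parts <- split_data_sketch(df, seed = 42)
sapply(parts, nrow)
```

Shuffling once and cutting the shuffled order into consecutive pieces guarantees that every row lands in exactly one set, which is easier to reason about than drawing each set independently.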

Here is a brief example on how to use the function.

install.packages('devtools') # Only needed if you don't have this installed.
library(devtools)
install_github('adam-m-mcelhinney/helpRFunctions')
library(helpRFunctions)
df <- data.frame(matrix(rnorm(110), nrow = 11))
splits <- split.data(df) # Named list of data frames
training <- splits$training
testing <- splits$testing

Let’s see how the function performs for a large-ish data frame.


df <- data.frame(matrix(rnorm(110*1000000), nrow = 11*1000000))
system.time(split.data(df))

Less than 3 seconds to split 11 million rows! Not too bad. The full code can be found here. As always, I welcome any feedback or pull requests.

R Function for Stratified Sampling

10 Tuesday Apr 2012

Posted by adamsanalytics in Uncategorized

≈ 7 Comments

Tags

R, stratified sampling

So I was trying to obtain 1000 random samples from 30 different groups within approximately 30k rows of data. I came across this function:

http://news.mrdwab.com/2011/05/20/stratified-random-sampling-in-r-from-a-data-frame/

However, when I ran this function on my data, I received an error that R ran out of memory. Therefore, I had to create my own stratified sampling function that would work for large data sets with many groups.

After some trial and error, the key turned out to be sorting based on the desired groups and then computing counts for those groups. The procedure is extremely fast, taking only 0.18 seconds on a large data set. I welcome any feedback on how to improve!
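The sort-then-count idea can be seen on a toy example. This is only an illustration with made-up data: once the rows are sorted by group, the running total of the group counts gives the number of rows that come before each group, and a sampled position within a group plus that offset is a row number in the sorted data. The names g, counts, offsets and picked are hypothetical.

```r
# Toy data: a hypothetical group column that is already sorted.
g <- c("a", "a", "a", "b", "b", "c", "c", "c", "c")

# Rows per group, and the number of rows that precede each group.
counts <- table(g)
offsets <- cumsum(c(0, counts))[seq_along(counts)]
names(offsets) <- names(counts)

# Draw 2 row numbers per group: position within the group + offset.
set.seed(1)
picked <- lapply(names(counts), function(k) {
  offsets[[k]] + sample.int(counts[[k]], 2)
})
names(picked) <- names(counts)

offsets
picked
```

No per-group subsetting of the data is needed; only the counts and one integer sample per group are computed.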

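stratified_sampling <- function(df, id, size) {
  # df is the data to sample from
  # id is the column to use for the groups to sample
  # size is the count you want to sample from each group

  # Order the data based on the groups
  df <- df[order(df[, id], decreasing = FALSE), ]

  # Get unique groups and the number of rows in each group.
  # The leading 0 makes the running sum give each group's row offset.
  groups <- unique(df[, id])
  group.counts <- c(0, table(df[, id]))

  # One column of sampled row numbers per group
  rows <- matrix(0, nrow = size, ncol = length(groups))

  # Generate matrix of sample rows for each group
  for (i in 1:(length(group.counts) - 1)) {
    # Number of rows that come before group i in the sorted data
    offset <- sum(group.counts[1:i])
    # Sample positions 1..n within the group, so the first row of the
    # group can be drawn too (errors if size exceeds the group size)
    samp <- sample.int(group.counts[i + 1], size, replace = FALSE)

    rows[, i] <- offset + samp
  }

  sample.rows <- as.vector(rows)
  df[sample.rows, ]
}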
