• About
  • Visual Basic for Applications (VBA)

Adam On Analytics

~ Ramblings on analytics, business, statistics and anything else

Adam On Analytics

Tag Archives: analytical tools

Update to Stratified Sampling in R

25 Tuesday Nov 2014

Posted by adamsanalytics in Statistics

≈ 1 Comment

Tags

analytical tools, Analytics, data science, R, statistics

Back in 2012, I wrote a short post describing some of the issues I was having doing sampling in R using the function written by Ananda. Anada actually commented on my post directing me to an updated version of the code he has written, which is greatly improved.

However, I still feel that there is room for improvement. One common example is many times during the model building process, you wish to split data into testing, training and validation data sets. This should be extremely easy and done in a standardized way.

Thus, I have set out to simplify this process. My goals are to write a function in to split data frames R that:

1. Allows the user to specify any number of splits and the size of splits (the splits need not necessarily all be the same size).

2. Specify the names of the resulting split data.

3. Provide default values for the most common use case.

4. Just “work”. Ideally this means that the function is intuitive and can be used without requiring the user to read any documentation. This also means that it should have minimal errors, produce useful error messages and protect against unintended usage.

5. Be written in a readable and well commented manner. This should facilitate debugging and extending functionality, even if this means performance is not 100% optimal.

I have written this code as part of my package that is in development called helpRFunctions, which is designed to make R programming as painless as possible.

The function takes just a few arguments:

1. df : The data frame to split

2.  pcts Optional. The percentage of observations to put into each bucket.

3. set.names Optional. What to name the resulting data sets. This must be the same length as the pcts vector.

4. seed Optional. Define a seed to use for sampling. Defaults to NULL which is just the normal random number generator in R

The function then returns a list containing data frames named according to the set.names argument.

Here is a brief example on how to use the function.

install.package('devtools') # Only needed if you dont have this installed.
library(devtools)
install_github('adam-m-mcelhinney/helpRFunctions')
df <- data.frame(matrix(rnorm(110), nrow = 11))
t <- split.data(df)
training <- t$training
testing <- t$testing

Let’s see how the function performs for a large-ish data frame.


df <- data.frame(matrix(rnorm(110*1000000), nrow = 11*1000000))
system.time(split.data(df))

Less than 3 seconds to split 11 million rows! Not too bad. The full code can be found here. I always, I welcome any feedback or pull requests.

Advertisement

Using Mathematica to Solve a Chess-Board Riddle

17 Friday Oct 2014

Posted by adamsanalytics in Uncategorized

≈ Leave a comment

Tags

analytical tools, Mathematica

At work, every Friday we have a mailing list set up where a riddle is announced and then the participants are invited to submit answers to the riddle on Tuesday.

A few weeks ago, there was an interesting riddle presented.

Riddle: On an 8 x 8 chessboard, define two squares to be neighbors if they share a common side. Some squares will have two neighbors, some will have three, and some will have four. Now suppose each square contains a number subject to the following condition: The number in a square equals the average of the numbers of all its neighbors. If the square with coordinates [1, 1] (i.e. a corner square) contains the number 10, then find (with proof) all possible values that the square with coordinates [8, 8] (i.e. opposite corner) can have?

After some consideration, there is an easy proof by contradiction. However, I am trying to learn a new programming language, specifically Mathematica. I thought that attempting to solve this riddle by developing a system of equations in Mathematica would be good practice.

The Mathematica code is located below and also posted on my Github account. By the way, the answer is 10.

(* ::Package:: *)

(* Parameters *)
size = 8;
givenVal = 10;
(* Corners *)
bottemLeftCornerEq[{i_, j_}] := StringJoin[“(c”, ToString[i+1], ToString[j], ” + c”, ToString[i], ToString[j+1], “)/2 == c”, ToString[i], ToString[j]]
topRightCornerEq[{i_, j_}] := StringJoin[“(c”, ToString[i-1], ToString[j], ” + c”, ToString[i], ToString[j-1], “)/2 == c”, ToString[i], ToString[j]]
topLeftCornerEq[{i_, j_}] := StringJoin[“(c”, ToString[i+1], ToString[j], ” + c”, ToString[i], ToString[j-1], “)/2 == c”, ToString[i], ToString[j]]
bottmRightCornerEq[{i_, j_}] := StringJoin[“(c”, ToString[i-1], ToString[j], ” + c”, ToString[i], ToString[j+1], “)/2 == c”, ToString[i], ToString[j]]
(* Edges *)
bottomEdge[{i_, j_}]:= StringJoin[“(c”, ToString[i-1], ToString[j], ” + c”, ToString[i], ToString[j+1], ” + c”, ToString[i+1], ToString[j], “)/3 == c”, ToString[i], ToString[j]]
topEdge[{i_, j_}]:= StringJoin[“(c”, ToString[i-1], ToString[j], ” + c”, ToString[i], ToString[j-1], ” + c”, ToString[i+1], ToString[j], “)/3 == c”, ToString[i], ToString[j]]
leftEdge[{i_, j_}]:= StringJoin[“(c”, ToString[i], ToString[j+1], ” + c”, ToString[i+1], ToString[j], ” + c”, ToString[i], ToString[j-1], “)/3 == c”, ToString[i], ToString[j]]
rightEdge[{i_, j_}]:= StringJoin[“(c”, ToString[i-1], ToString[j], ” + c”, ToString[i], ToString[j+1], ” + c”, ToString[i], ToString[j-1], “)/3 == c”, ToString[i], ToString[j]]
(* Middle *)
middleEq[{i_, j_}] := StringJoin[“(c”, ToString[i-1], ToString[j], ” + c”, ToString[i], ToString[j-1], ” + c”, ToString[i+1], ToString[j],” + c”, ToString[i+1], ToString[j+1], “)/4 == c”, ToString[i], ToString[j]]
equationSelector[{i_, j_}] := Which[ i == 1 && j == 1
, bottemLeftCornerEq[{i, j}]
, i == size && j == 1
, bottmRightCornerEq[{i, j}]
, i == size && j == size
, topRightCornerEq[{i, j}]
, i == 1 && j == size
, topLeftCornerEq[{i, j}]
, i == 1
, leftEdge[{i, j}]
, i == size
, rightEdge[{i, j}]
, j == 1
, bottomEdge[{i, j}]
, j == size
, topEdge[{i, j}]
, i != 1 && i != size && j != 1 && j != size
, middleEq[{i, j}]
]

(* Test All Cases
i = 1; j = 1;
bottemLeftCornerEq[{i, j}] \[Equal] equationSelector[{i, j}]
i = 3; j = 1;
bottmRightCornerEq[{i, j}] \[Equal] equationSelector[{i, j}]
i = 1; j = 3;
topLeftCornerEq[{i, j}] \[Equal] equationSelector[{i, j}]
i = 3; j = 3;
topRightCornerEq[{i, j}] \[Equal] equationSelector[{i, j}]
i = 2; j = 1;
bottomEdge[{i, j}] \[Equal] equationSelector[{i, j}]
i = 2; j = 3;
topEdge[{i, j}] \[Equal] equationSelector[{i, j}]
i = 3; j = 2;
rightEdge[{i, j}] \[Equal] equationSelector[{i, j}]
i = 1; j = 2;
leftEdge[{i, j}] \[Equal] equationSelector[{i, j}]
i = 2; j = 2;
middleEq[{i, j}] \[Equal] equationSelector[{i, j}]
*)
(* Test on simple 2×2 case
eq1 = bottemLeftCornerEq[{1, 1}]
eq2 = bottmRightCornerEq[{2, 1}]
eq3 = topRightCornerEq[{2, 2}]
eq4 = topLeftCornerEq[{1, 2}]
Solve[ToExpression[eq1] && ToExpression[eq2] && ToExpression[eq3] && ToExpression[eq4] && c11 \[Equal] 10, {c11, c12, c21, c22}]
*)

(* Create System *)

(* Iterate over range *)
f[x_, y_] := {x, y}
genVar[{i_, j_}] := StringJoin[“c”, ToString[i], ToString[j]];
varList = Map[genVar, grid];
grid = Flatten[Outer[f, Range[size], Range[size]], 1];
eqPart1 = StringReplace[ToString[Map[equationSelector, grid]], {“{” -> “”, “}” -> “”, “,” -> ” &&”}];
SolveStr = StringJoin[eqPart1, ” && c11 ==”, ToString[givenVal]]

Solve[ToExpression[SolveStr], ToExpression[varList]]

(* Solve without a given value
*)

Solve[ToExpression[eqPart1], ToExpression[varList]]

(*
Solve[(c21 + c12)/2 == c11 && (c22 + c11)/2 == c12 && (c11 + c22)/2 == c21 && (c12 + c21)/2 == c22 && c11\[Equal]10, {c11,c12,c21,c22}] *)

(*
*)
(*
startCell = c11
c11 = 10
bottemLeftCornerEq[1, 1]
topRightCornerEq[8, 8]
topLeftCornerEq[1, 8]
bottmRightCornerEq[8, 1]
bottomEdge[{2,1}]
topEdge[4,8]
leftEdge[1, 4]
rightEdge[8, 4]
Map[bottomEdge, {{2,1}}]
eq1
eq2 *)

Beyond R: New Open Source Tools for Analytics

21 Thursday Aug 2014

Posted by adamsanalytics in Uncategorized

≈ Leave a comment

Tags

analytical tools, Analytics

Today I am delighted to be presenting at the BrightTalk Analytics Summit. I will be presenting on some new open-source tools that are available for doing analytics. In particular, I will be focusing on Python Scikit-Learn, PyStats, Orange, Julia and Octave.

I hope to help people answer the questions:

1. Why is there a need for new analytical tools?

2. What are the potential alternatives?

3. What are some of the pro’s and con’s of each alternative?

4. Demonstrate some specific use-cases for each tool

5. Provide a roadmap to get started using these new tools

My full presentation will be available here.

 

Subscribe

  • Entries (RSS)
  • Comments (RSS)

Archives

  • June 2019
  • May 2019
  • April 2019
  • November 2014
  • October 2014
  • August 2014
  • April 2012
  • August 2011

Categories

  • Neural Networks
  • Optimization
  • Real Estate
  • SAS
  • Statistics
  • Uncategorized
  • VBA

Meta

  • Register
  • Log in

Blog at WordPress.com.

Privacy & Cookies: This site uses cookies. By continuing to use this website, you agree to their use.
To find out more, including how to control cookies, see here: Cookie Policy
  • Follow Following
    • Adam On Analytics
    • Already have a WordPress.com account? Log in now.
    • Adam On Analytics
    • Customize
    • Follow Following
    • Sign up
    • Log in
    • Report this content
    • View site in Reader
    • Manage subscriptions
    • Collapse this bar