R Function for Stratified Sampling

Tags

So I was trying to obtain 1000 random samples from 30 different groups within approximately 30k rows of data. I came across this function:

http://news.mrdwab.com/2011/05/20/stratified-random-sampling-in-r-from-a-data-frame/

However, when I ran this function on my data, I received an error that R ran out of memory. Therefore, I had to create my own stratified sampling function that would work for large data sets with many groups.

After some trial and error, the key turned out to be sorting based on the desired groups and then computing counts for those groups. The procedure is extremely fast, taking only .18 seconds on a large data set. I welcome any feedback on how to improve!

stratified_sampling<-function(df,id, size) {
#df is the data to sample from
#id is the column to use for the groups to sample
#size is the count you want to sample from each group

# Order the data based on the groups
df<-df[order(df[,id],decreasing = FALSE),]

# Get unique groups
groups<-unique(df[,id])
group.counts<-c(0,table(df[,id]))
#group.counts<-table(df[,id])

rows<-mat.or.vec(nr=size, nc=length(groups))

# Generate Matrix of Sample Rows for Each Group
for (i in 1:(length(group.counts)-1)) {
start.row<-sum(group.counts[1:i])+1
samp<-sample(group.counts[i+1]-1,size,replace=FALSE)

rows[,i]<-start.row+samp

}

sample.rows<-as.vector(rows)
df[sample.rows,]
}

VBA Code to Standardize a User Specified Range by Column

Hey guys,

I wanted to share some new code with you. The code below allows a user to specify a range of data and then the code will output the standardized values (mean=0 and standard deviation 1) for each of the columns. This can be a big time saver over Excel’s standardize function, which requires the user to input the mean and standard deviation and only standardizes one cell at a time. Also, this allows the user to specify specifically how they want the standard deviation calculated.

I hope that you guys enjoy the code. I welcome any feedback!

Thanks,

```Sub standardize_range()
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
'This code allows a user to select a group of data organized by columns and it will provide a standardized
'output with mean 0, variance 1
'
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Dim variables As Range
Dim change As Range
Set variables = Application.InputBox("Select the row of variable names:", Default:=Range("a1:c1").Address, Type:=8)
Set change = Application.InputBox("Select the row of cells containing the data to standardize:", Default:=Range("a2:c14").Address, Type:=8)
Dim Total, Total_sq, Average, Variance As Double

numRow = change.Rows.Count
numCol = change.Columns.Count
'Debug.Print "numrow=" & numrow & " numcol=" & numCol

'1. Calculate Average and Std for all Columns
'Note: This calculation for Variance and StdDev differs from Excel's slightly

For j = 1 To numCol

ReDim col(numRow) As Double
Total = 0
Total_sq = 0

For i = 1 To numRow
col(i) = change.Cells(i, j)
'Debug.Print col(i)
Total = Total + col(i)
Total_sq = Total_sq + col(i) ^ 2
Next i

Average = Total / numRow
Variance = (Total_sq / numRow) - (Average) ^ 2
Std = Variance ^ (1 / 2)
'Debug.Print "Total="; Total & " Average="; Average & " Total_sq=" & Total_sq & " Variance=" & Variance _
'& " Std=" & Std

'Store values
change.Cells(numRow + 1, j).Value = Average
change.Cells(numRow + 2, j).Value = Variance
change.Cells(numRow + 3, j).Value = Std

Next j

'2. Create Sheet to store standardized values
Sheets("Standardized_Values").Delete

'Write variable names
For k = 1 To numCol
Sheets("Standardized_Values").Cells(1, k).Value = variables.Cells(1, k) & "_std"
Next k

'3. Compute standardized values
For n = 1 To numCol

For m = 1 To numRow

Sheets("Standardized_Values").Cells(m + 1, n).Value = (change.Cells(m, n).Value - change.Cells(numRow + 1, n).Value) / _
change.Cells(numRow + 3, n).Value
Next m
Next n

End Sub```

Comparison of Neural Network and Multivariate Regression Techniques for Residential Real Estate Valuation

The following below is a thesis I wrote while an undergrad at IU.
Adam McElhinneyResearch Conducted Under the Supervision of Roy Gardener
IU Department of Economics
I. Introduction

Real estate is undoubtedly one of the most important components of the modern economy. In particular, owning a house has long typified the American dream and a residential home is the largest asset for a majority of Americans#. The housing market had swelled 12.5% from 1990 to 200, eventually totaling 119.6 million units#. Along with this growth in the housing market, has come an even greater growth in the accurate assessment of housing prices.

In response to this demand, the consumer market has responded with many websites offer users the ability to get a free and quick estimate of their homes value by inputting several variables such as homes location, size, and features. However, the accuracy of these estimates in unknown and the formulas they are using for estimation are often kept proprietary. Despite this, the internet has allowed for an unprecedented amount of real estate transactions to be tracked, often given details regarding the homes age, size, location, and features.
These data sets are currently being used in the world of financial economics, which has long been interested in real estate valuation. They have employed a variety of statistical techniques in an attempt to accurately predict the values of properties. The bulk of the research has relied on multivariate regression to assign weights to the various features of the home. However, the limitations of these techniques have been found and new statistical techniques such as neural networks are being tested as replacements.
The accuracy of these neural networks in comparison to traditional multivariate method is still in debate; however research is showing a growing niche where neural network models are able to outperform traditional techniques.
II. Background

There are three main methods of valuing real estate. One method is the discounted cash flow method. This is primarily used in the valuation of commercial properties. This method calculates the present value of all the income that one expects to receive from the property. However, this method has several flaws. First, the discount rate used is highly subjective. Various complex statistical techniques have been developed for calculating the discount rate. Further, the discount rate assumes that interest rates will stay constant. Interest rates can be highly volatile, additionally even small changes in the interest rate can have large effects of the property valuation. Lastly, the discounted cash flow model faces encounters circular logic if the purchase of the property is to be funded with a loan. The value of the asset is needed to calculate the WACC; however it is that value we seek to find#.

Another method for real estate valuation is the cost method. This method takes the value of the land as if it was vacant, plus the deprecated value of any structures occupying the land.# However, these methods are very limited in their practicality, especially when applied to large scale valuation of residential properties.

Undoubtedly the most popular technique for real estate valuation is the market comparison approach. In a traditional market comparison approach, a property’s value is determined by finding other recently sold properties with similar physical characteristics, called comparables, or ‘comps’ for short. Then, weights are applied to the various features of the house and adjusted to as closely approximate the target house as possible.

Several practical issues arise from this method. First of all, there may not be a large number of comparable properties that have been sold recently. This is particularly true if the property is very different from the average properties in the area, or has some unique characteristic that does not allow for comparison#. In addition, by omitting many other sales from the analysis, the appraiser may be losing valuable information. In a similar vein, typically only a small number of comparable properties are available. This number of comparable properties may not be enough to assume a statistically normal distribution and thus price estimates may be flawed.

Furthermore, the differences in the properties may be quite significant. There are 5 main differences listed that need to be accounted for in this method; physical, location, market conditions, terms of financing, and conditions of sale. For example, the prime rate, the rate that is largely responsible for the rate a homebuyer will pay for a loan, dropped 2.5% the first six months in 2001#. Thus any property sales that were more than 1 or 2 months old would need to be reconfigured to reflect this new rate.

Lastly, the weighting assigned to each of the various comparable properties is often subject to scrutiny and debate. The accuracy of these weights is highly relevant to the fair pricing of the home.

In order to counteract these limitations, appraisers have begun to rely on statistical techniques such as multivariate regression. In a typical appraisal, the prices of a large number of sold properties are regressed against the characteristics that are seen as influencing the prices of the properties. The defining work on this is the book Real Estate Valuation Theory published by the Appraisal Institute and American Real Estate Society.

In more recent years, appraisers have begun to rely on graphical interface systems (GIS) to assemble databases of properties that incorporate geographic information. Common geographic variables include distance to major metropolitan areas, crime rates, distance to public transportation, etc. An important early study by Wyatt incorporated other buildings, car parks, footpaths, boundary of the urban areas, tree preservation orders, open spaces, conservation areas, water features, road edges and centre-lines and railway lines and stations into his model of commercial real estate prices#.

However, these multiple variable regressions (MVR) have several limitations. First, for small sample sizes, multivariate regressions have proven to be extremely poor estimators of prices#. Therefore, in rural areas or properties that do not have many properties suitable for compassion, neural networks (NN) hold great promise.

In addition, many characteristics of the house are highly inter-correlated. This leads to the possibility of multicolinearity occurring. Further, the data used for the comps is often taken from different times. Thus, hetroskedasticity arises, further diminishing the accuracy of the estimates. Lastly, in an increasingly complex economy, assumptions of linearity seem to be less and less feasible.

In an effort to obtain more accurate sales predictions, some researchers have turned to neural networks to improve performance. The results of using neural networks versus multivariate regressions are contradictory and scattered. Studies by researchers such as Tskudua and Baba, Ngyuen and Cripps, and Hanson, all demonstrate effective ways of using neural networks for valuation #. However, seemingly just as many researchers have found MVR superior to NN#.

III. Explanation of the Neural Network System

To understand the neural network model, one must understand its motivation for creation. The neural network model is a statistical technique modeled off the way the brain processes data. A non-rigorous definition of neural networks is given by Gurney as follows:

A neural network is an interconnected assembly of simple processing elements, units or nodes, whose functionality is loosely based on the animal neuron. The processing ability of the network is stored in the inter-unit connection strengths, or weights, obtained by a process of adaptation to, or learning from, a set of training patterns.

The goal of the method was to create a non-linear and non-parametric statistical technique#. Nonparametric statistics are those that do not make any underlying assumptions, such as normality, about the population distribution#.  This is a major weakness of traditional parametric statistical techniques, is that one must assume the underlying probability distribution. Non-parametric statistics are not bound by this constraint.

In order to better understand neural networks, it is helpful to have a little background knowledge on biological neurons, for which our model is based. For our purposes, biological neurons can be broken down into four simple parts. First, are the dendrites. Dendrites are responsible for receiving incoming signals from other neurons. Their structure typically resembles the branches of a tree, allowing them to interact with many different neurons in the brain. The second element is the soma, or cell body. It is the largest part of the cell and responsible for the cells vital functions.  Moving forward, the axon is a single truck-like structure that extends from the soma. It serves to carry the impulse outward to the next part of the neuron, the pre-synaptic terminals. The pre-synaptic terminals branch out from the axon, connecting the neuron with other neurons and transmitting the impulse when applicable.

There are two types of impulses that the neurons transmit, inhibitory and exhibitory. When a neuron receives an impulse in the dendrites, it is then transmitted to the cell body#. If the strength of the impulse is greater than some value, it is said to be exhibitory and the cell continues to transfer the impulse. Conversely, if the strength of the impulse is less than some value, the impulse is not transmitted and it is labeled as inhibitory.

The artificial neuron used for our computation follows a similar form. The first artificial neuron neural network was the threshold logic unit (TLU), developed my McCulloch and Pitts#. A neuron receives inputs and the inputs are weighted in some fashion. If the summed value of the impulses exceeds a certain threshold level, the exiting impulse is given a value of one. If not, the impulse is given a value of zero and does not continue.
It is the arrangement of multiple neurons of this type that gives us a neural ‘network’. Arrangement on the neurons can take many forms, but in its simplest is a series of inputs, which then converge onto a hidden layer, continuing on to an outer layer of neurons, which then eventually produces the outputs#.

The network described above is referred to as a feed forward network. However, these networks can take many shapes and have nodes that not only send signals forward, but may send signals laterally and in some cases backward. More details on various structures will be given later.

The weights given to each impulse by the neuron are an important source of their computing strength. These weights need not necessarily be static. Rather, the weights may be altered by a set of known data. One extremely useful process to alter these weights is called supervised learning.  In this process, the series of known data is passed through the neural network and the weights are adjusted according to a learning rule, to provide the desired outputs. This process is typically repeated many times until the weights very accurately produce the desired outputs. It is then the hope that when the network receives data it has not seen before, that it will be able to correctly identify the pattern and produce the correct outputs.

As one may notice, neural networks are not a very strongly defined statistical technique, thus giving them flexibility to be applied to a very wide range of areas and applications. Neural network models have been applied in diverse fields such as civil engineering#, pavement crack analysis#, soil and water retention# and many other seemingly unrelated areas. However, the diversity of the applications speaks to the utility of the method.

IV. Data Set
For this paper, data on133 real estate transactions, from the period from 3/20/2006 to 2/28/2007, was gathered from the website Zillow.com. Zillow is a free website designed to give consumers a quick estimate of their homes’ values. In order to calculate these estimates, Zillow contains a database of all home transactions for a particular region, going back to different periods, depending on the location. The transactions recorded are from the northwest suburbs of Chicago, primarily in the 60010, 60047, and 60042 zip codes.  In addition, the database lists the address of the home, the number of bedrooms, the number of bathrooms, the size of the home in square-feet, the size of the lot in square feet, the price the home sold for, the date the home was sold on, and the age of the home. The variables, size of home in square feet, size of lot in square feet, have both been divided by 1000 to better condition the numbers to be squared and cubed. Similarly, the variable for the age of the house has been divided by 10. These will be the variables used in our model to predict housing prices.
The data was randomly broken down into 79 “training observations” and 54 “validation” observations. This ratio of training observations to validation observations is consistent with previous literature Each model will be examined for their ability to predict the values of the 54 validation housing prices.

 Variable Name Description Price Listed in US Dollars BD Number of Bedrooms BA Number of Bathrooms SIZE (Square Footage of the House)/1000 LOT (Square Footage of the Lot)/1000 AGE (Number of years since the house was built)/10 LN(VAR) The natural log of the specified variable SQ(VAR) The specified variable squared CU(VAR) The specified variable cubed Q2 Dummy Variable indicating whether or not the house was sold in the second quarter of 2006 Q3 Dummy Variable indicating whether or not the house was sold in the third quarter of 2006 Q4 Dummy Variable indicating whether or not the house was sold in the fourth quarter of 2006 Q5 Dummy Variable indicating whether or not the house was sold in the first quarter of 2007

Descriptive statistics of the data are provided below.
===============================================================================
Variable        Mean         Std.Dev.        Minimum         Maximum      Cases
===============================================================================
——————————————————————————-
All observations in current sample
——————————————————————————-
PRICE     387570.301      189266.582      100000.000      1275000.00        133
BD        3.48872180      .794074088      2.00000000      6.00000000        133
BA        2.36466165      .755980749      1.00000000      5.50000000        133
SIZE      2095.35338      1112.12044      854.000000      10018.0000        133
LOT       10698.8797      8592.70512      70.0000000      52272.0000        133
AGE       35.7142857      24.1584327      3.00000000      140.000000        133
Q2        .368421053      .484200125      .000000000      1.00000000        133
Q3        .323308271      .469507675      .000000000      1.00000000        133
Q4        .157894737      .366020890      .000000000      1.00000000        133
Q5        .827067669E-01  .276479842      .000000000      1.00000000        133

As the graphs show, there is a lot of variability amongst the different variables and their effects on price. This is most likely due in large part to the fact that our model does not control any geographic variables. Anyone who knows the Chicago area can tell you that the closer one gets to the city, the higher housing prices are. Despite this, these complicated relationships are exactly what neural networks are designed to predict and thus this high variability should be a good testing grounds for a non-linear model.
V. Multivariate Regression Models
In accordance with previous literature, several multivariate regression models were tested. Previous literature has hypothesized that real estate data may follow a semi-log, or a log-log pattern. In addition, research has shown that age, property size, and house size may have a squared or even cubic relationship.

Lastly, there is a large possibility for hetroskedasticity to occur due to the fact that data is pulled from different time periods, thus all the models are repeated using White’s model for hetroskedasticity.  The “Un-Whitened” model formats are given below.

Linear: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +

Semi-Log: Ln(Selling price)= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +

Log-Log: Ln(Selling price)= +LN(BD) + LN(BA) +  LN (Size) +  LN (Lot) +  LN (Age) +  LN (Q2) +  LN (Q3) +  LN (Q4)+ LN (Q5) +

Age: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +

Lot: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +

Size: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +

Age,Lot, Size: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +

Age: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +

Lot: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +

Size: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +

Age,Lot, Size: Selling price= +(BD) + (BA) + (Size)  + (Lot)  + (Age)  + (Q2) + (Q3) + (Q4)+(Q5) +

Somewhat surprisingly, the simplest linear model without correcting for hetreoskedasticity proves to yield the most statistically significant variables. All of the variables, except for the quarter sold variables, are significant. However, previous studies utilizing much larger sample sizes have shown all these variables to be significant, thus we accept that they are relevant for our analysis.
Linear: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +
+———————————————————————–+
| Ordinary    least squares regression    Weighting variable = none     |
| Dep. var. = PRICE    Mean=   387570.3008    , S.D.=   189266.5817     |
| Model size: Observations =     133, Parameters =  10, Deg.Fr.=    123 |
| Residuals:  Sum of squares= .1861022040E+13, Std.Dev.=   123005.12383 |
| Fit:        R-squared=  .606423, Adjusted R-squared =          .57762 |
| Model test: F[  9,    123] =   21.06,    Prob value =          .00000 |
| Diagnostic: Log-L =  -1742.2784, Restricted(b=0) Log-L =   -1804.2882 |
|             LogAmemiyaPrCrt.=   23.512, Akaike Info. Crt.=     26.350 |
| Autocorrel: Durbin-Watson Statistic =   1.64819,   Rho =       .17591 |
+———————————————————————–+
+———+————–+—————-+——–+———+———-+
|Variable | Coefficient  | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+———+————–+—————-+——–+———+———-+
Constant    -226120.6909       72598.896   -3.115   .0023
BD           37659.81050       17601.113    2.140   .0344     3.4887218
BA           122116.0065       18335.652    6.660   .0000     2.3646617
SIZE         35.09598346       12.983202    2.703   .0078     2095.3534
LOT          2.715900837       1.3582596    2.000   .0478     10698.880
AGE          1245.204431       460.88289    2.702   .0079     35.714286
Q2           33570.17097       45364.143     .740   .4607     .36842105
Q3           54453.79975       45884.666    1.187   .2376     .32330827
Q4           68688.20522       50361.193    1.364   .1751     .15789474
Q5           68402.54427       57322.227    1.193   .2350  .82706767E-01

The model was then back tested using the 54 validation observations. The success or failure of the model was based upon its ability to achieve a low forecasting error.

Forecasting Error= |Actual Housing Price- Predicted Housing Price|

Actual Housing Price

Mean Forecasting Error of the Multiple Variable Regression Models

 Linear Semi-Log Log-Log Age Lot Size 0.254316 0.235697 0.259342 0.258416 0.252312 0.250122 Age,Lot, Size Age Lot Size Age,Lot, Size 0.254224 0.261117 0.253886 0.249586 0.260392

It is somewhat surprising that the Semi-Log model has the lowest mean forecasting error. In order to predict with a semi-log model, one must take the exponential of the predicted prices. This is a non-linear transformation and therefore introduces bias. However, in this case it appears that this bias is overcome by the increased predictive power of the model.

The lack of variability between the 11 models is striking. The different in mean forecasting error between the highest and the lowest models is only .025. However, given a larger sample size, one would expect this to increase as more subtle relationships between the independent variables are teased out.

The distribution of the forecasting errors for the semi-log model is shown below. The model does an impressive job, 32 of the observations have a forecasting error of less than .2 and 46 of the observations have a forecasting effort of less than .40.

VI. Neural Network Model Explanation
In order to contrast the performance of the two statistical techniques, a traditional feed forward neural network was utilized. In a feed forward network, the data is separated into two sets, the training set and the validation set. As stated previously, our data was randomly broken down into a training set of 79 observations and a validation set of 54 observations. Then, the network is ‘trained’, meaning the training data is passed through the network and the initial parameters are adjusted in some manner, for a specified number of iterations. Then the validation set is used to check the accuracy of the model specified by the training.  In this analysis, a sigmoid activation function was used for its simplicity and ease of differentiation.

Sigmoid function

The number of hidden layers, the number of neurons per hidden layer, and the number of training iterations to conduct are open to debate. Previous research on neural networks for real estate prediction has shown that one hidden layer is typically sufficient for accurate assessment#. Similarly, one hidden layer is used in our model. For the other parameters, there are a few ‘rules of thumb’ given for these parameters, but due to the low number of observations in our sample, numerous combinations were tested.

The neural network package utilized for this analysis automatically sets the initial values to random numbers within the data range. This results in quicker training than a pure random initialization#.

Neural Network Model Specifications

 Model Number Hidden Layers Neurons Per Hidden Layer 1 1 1 2 1 2 3 1 3 4 1 4 5 1 5 6 1 6 7 1 7 8 1 8 9 1 9 10 1 10 11 1 15 12 1 20 13 1 25

The following models were all run with 5, 10, 15, 20, 25, 30, 50 and 75 training iterations. One of the difficulties of the neural network model is estimating how many training iterations to use. If you underestimate the amount needed, the results are not as accurate as they could be. Conversely, if one uses too many training iterations the function may be over-fitted to the data. In turn, this over-fitting causes a loss of accuracy when new data is run through the model.

To compare the accuracy of the models, the validation data was run through both models and the mean forecasting error was computed.

Mean Forecasting Error of Various Neural Network Models

 Model Number Number of Neurons on One Hidden Layer 5 Training Iterations 10 Training Iterations 15 Training Iterations 20 Training Iterations 25 Training Iterations 30 Training Iterations 50 Training Iterations 75 Training Iterations 1 1 0.277338 0.277165 0.277342 0.277027 0.277411 0.276779 0.2761 0.2761 2 2 0.283716 0.277502 0.280274 0.285198 0.28401 0.282454 0.302764 0.301875 3 3 0.267825 0.273145 0.300541 0.304717 0.317211 0.329043 0.35199 0.350984 4 4 0.293282 0.343108 0.338736 0.360779 0.371395 0.406882 0.429371 0.425458 5 5 0.319706 0.382284 0.385009 0.380813 0.376205 0.382051 0.384963 0.388435 6 6 2.06627 9.448 1963.7 2052.98 2361.01 399139. 3436.7 4133.77 7 7 0.312132 0.287282 0.299599 0.309019 0.330474 0.349585 0.493541 0.635953 8 8 0.352379 0.393616 0.635656 0.959402 1.59276 0.538184 1.07871 6.10403 9 9 1.17289 1.57346 1.39654 93.6565 749.619 638.547 256.935 33.1817 10 10 0.297179 0.328543 0.354833 0.412025 0.461227 0.48286 0.703548 0.918779 11 15 0.33348 0.460503 0.489845 0.788403 0.917985 1.20471 1.13428 1.17958 12 20 0.417438 0.530489 0.692084 0.883574 1.16352 1.33265 1.39361 1.48593 13 25 0.657161 2.37524 3.20335 4.61429 5.74316 7.12753 11.5672 13.0967 FE<0.3 .3

As one can see, the neural network model presented here has the best results with a relatively low number of neurons per hidden layer and low training iterations. In fact, the model with the lowest mean forecasting error has three neurons per hidden layer and only 5 training iterations. This is not entirely surprising, due to the fact that our entire data set was only 133 observations, with nine explanatory variables. If one were to add more observations and more explanatory variables, then the number of training iterations and the number of neurons per hidden layer, and possibly even the number of hidden layers, would likely need to be increased. Interestingly, there is very little variability in mean forecasting error between models 1-5. Despite the parameters ranging from 1-5 neurons and 5-75 training iterations, the mean forecasting error varies by a scant .16.

For models 6-13, there appears to be massive over-training tacking place, resulting in ridiculous price predictions. The absurdity of these estimates underscores the importance of choosing the correct parameters when constructing a neural network. Interestingly, there are some shockingly precise estimates that occur within this range. In particular, model 10 with 5 training iterations and model 7 with 10 and 15 training iterations have mean forecasting errors below .30. This is most likely due to the randomness utilized by the neural network in selecting the initial training parameters and is not repeatable. Again, this serves as a cautionary example of the utilization of neural networks. One must not only concern themselves with the neural network predictions for their particular data set, but whether this network could be repeated in different circumstances.

The chart below shows the distribution of the forecasting errors for the model with the lowest mean forecasting error, model 13. As is apparent in the chart, the model is quite effective for the majority of the predictions. However, several outliers are undoubtedly pulling the mean forecasting error much higher. In fact, approximately 28 of the predicted prices have a forecasting error of less than .2 and an impressive 46 of the predicted prices have a forecasting error of less than .40. This is contrasted by the drastic increases in forecasting errors for predictions 50 and onward. Once again, this high variability is most likely a result of the geographic regions from which the data was taken, but none the less illustrates how a neural network will handle predictions of data with high variability.

VII. Comparison of the Two Models
In our analysis, it seems clear that the Semi-log MVREG model is the most accurate model for predicting the real estate prices. The difference in mean forecasting between the best multivariable regression model and the best neural network model was approximately .032. Additionally, the chart below shows how the distribution of forecasting errors is shown below.
The Semi-Log model beats the NN Model 13 in almost every observation. However, in the low to middle spectrum of the forecasting error, the NN 13 competes closely with the Semi-Log model, even showing superiority in some cases over the range 37-49. However, the largest difference between the two models is that the Semi-Log model seems to do a better job of dealing with outliers than the NN Model 13. This is undoubtedly where a large portion of the difference in mean forecasting error is created.
Additionally, the computational requirements of the neural network models further hinder its practicality. Despite the small sample size used in this analysis, it took the 3.2 GHz Pentium IV with Hyper-Threading approximately 22 minutes to run the 104 neural network models, or .21 minutes per model. However, the 11 multivariate regressions models ran almost instantly.

The ambiguity of a neural network model also presents a problem. The difference between the highest and lowest forecasting error in a multivariate regression model was .025. The difference in the neural network models was over 11. The neural network models were admittedly grossly over trained, but there is not this potential for error in a multivariate regression model.

The greater potential for error in a neural network, model stems from the fact that there are many more parameters that the operator must specify than with the regression model. The chart below summarizes the decisions that must be made to construct the two different models.

 Multivariable Regression Model Neural Network Model Data Transformations (e.g., log, semi-log, squared, etc) Model Structure (Feed-Forward, Radial Basis, Hopfield, etc.) Significance Level Hidden Layers Number of Neurons per Hidden Layer Number of Training Iterations Activation Function Initial Training Values

Although there is research suggesting values for these parameters in the neural network model, there are not hard rules and experimentation is required. One may also pose the question of whether once the appropriate parameters of a neural network model are found, will those parameters remain optimal with different data. If a large scale real estate company was deciding whether or not to purchase several properties, the NN model they had previously relied on may not give accurate predictions of the price for this new set of possible purchases. In contrast, the parameters for the multivariable regression model are much fewer and have long established ranges (e.g. p<.05 significance level for variables, etc).

Another concern with neural network models is variation among various neural network programs. As a result of being very computationally intensive, neural network models tend to be particularly sensitive to computer programming intricacies. Every neural network program comes with a large manual that details the construction of the software and studies have shown how the different programs can produce different results from the same inputs#.

In contrast, the reliability of most regression software is well known. The software is much less sensitive to differences between programs, due to the decreased computations necessary. Additionally, the software is available to a wider range of users, particularly as a result of the data analysis pack that is embedded in all new versions of Microsoft Excel. Neural network software must be purchased separately, often at significant cost.

VII. Conclusion

Our data set showed a neural network model to be inferior to a multivariable regression. There is greater potential for error due to the increased number of user specified parameters. Additionally, the repeatability of the results amongst various computer programs is disputable. Lastly, the computational requirements make the NN models longer to run than the MVR’s. However, there is still much work to be done in the area of neural networks. Neural network models should be studied utilizing much larger data sets that was used in this study. Additionally, rules for establishing the parameters of the neural network model need to be created and rigorously tested. Further, the performance of neural networks with various levels of sufficiency of variables needs to be evaluated. Lastly, our study emphasizes that fact that whatever model of prediction being used, there must be variables to account for geographic differences.

That being said, the regression model was first contemplated in 1885 and is still being refined today#. In contrast, the neural network model as a computational tool came into being in 1949#. Thus, it is still an infant statistical technique, in need of much more refinement to find large scale application.  Due to the ever increasing complexity of problems being studied by researchers and the exponential increases in computing power and technology, it is unlikely that neural networks will cease to be a promising area of study.

The complex nature of real estate valuation will certainly provide a plethora of applications for NN models. Once the technique evolves from its current state as something of an art form into a well defined statistical technique, we will see much more accurate real estate valuations.

Works Cited

“2002 Profile of US Real Estate Markets.” National Association of Realtors. 8 Apr. 2007 <http://www.realtor.org/cipshome.NSF/19A06EC3575CECE886256A1B00731E6F/AE4C9FEC10EFF28386256AE2006CE138?OpenDocument#Demographics&gt;.

Adeli, Hojjat. “Neural Networks in Civil Engineering: 1989−2000.” Computer-Aided Civil and Infrastructure Engineering (2001):  126-142.

“Bank Prime Loan Rate Changes: Historical Dates of Changes and Rates.” Federal Reserve Bank of Saint Louis. 8 Apr. 2007 <http://research.stlouisfed.org/fred2/data/PRIME.txt&gt;.

Benjamin, John D., Randall S. Guttery,  and C.f. Sirmans. “Mass Appraisal: an Introduction to Multiple Regression Analysis for Real Estate Apprasial.” Journal of Real Estate Practice and Education (2004).

D. O. Hebb. The Organization of Behaviour. John Wiley & Sons, New York, 1949.

Francis Galton. Presidential address, Section H, Anthropology. (1885)

Gurney, Kevin. An Introduction to Neural Networks. London: UCL P  Limited, 1997.

Hanson, James. “Australian House Prices; a Comparision of Hedonic and Repeat Sales Measures.” Reserve Bank of Australia (2006).

Hecht-Nielsen, R., Kolmogorov’s Mapping Neural Network Existence Theorem, Paper

presented at IEEE First International Conference on Neural Networks, San Diego, CA,1987.

Hoesli, Martin, Elion Jani,  and Andre´ Bender. “Monte Carlo Simulations for Real Estate Valuation.” Journal of Property Investment & Finance (2006).

Johnston, Sarah Jane. “Real Estate: the Most Imperfect Asset.” 30 Aug. 2004. Harvard Business School. 8 Apr. 2007 <http://hbswk.hbs.edu/item/4342.html&gt;.

Kershaw, Paul, and Peter Rossini. “Using Neural Networks to Estimate Constant Quality House Price Indices.” Fifth Annual Pacific-Rim Real Estate Society Conference (1999). 8 Apr. 2007.

Koekoek, E . J . W . “Neural Network Models to Predict Soil Water Retention.” European Journal of Soil Science Sept (1999):  489-495.

Lee, Byoung J. “Position-Invariant Neural Network for Digital.” Computer-Aided Civil and Infrastructure Engineering (2004):  105-118.

Nguyen, Nghiep, and Al Cripps. “Predicting Housing Value: a Comparison of Multiple Regression Analysis and Artificial Neural Network Analysis.” Journal of Real Estate Research 22 (2001):  313-336.

“Nonparametric Statistics.” StatSoft. 8 Apr. 2007 <http://www.statsoft.com/textbook/stnonpar.html&gt;.

Pullum, Dr. Laura L., Dr. Marjorie A Darrah, and Mr. Brian J. Taylor. “Independent Verification and Validation of Neural Networks.” 8 Apr. 2007 <https://www.softwaretechnews.com//stn7-2/ivv.html&gt;.

Segel, Arthur, and Sarah Jane Johnston. “Real Estate: the Most Imperfect Asset.” Harvard Business School (2004). 8 Apr. 2007 <http://hbswk.hbs.edu/item/4342.html&gt;.

Sjoberg, Jonas. Mathmatica Neural Networks. Champaign, Illinois: Wolfram Research, 2004. 8 Apr. 2007 <http://www.wolfram.com/products/applications/neuralnetworks/&gt;

“The Neuron.” UNAM, Instituto de Fisilogia Celular. 8 Apr. 2007 <http://www.ifisiol.unam.mx/Brain/neuron.htm&gt;.

Limdep Code File Used for Multiple Regressions
;nvar=11
;file = “E:\Econ Thesis\Data\real estate data 3-30-07.txt”
;names =1\$

create; if (quarter=2) q2=1\$
create; if (quarter=3) q3=1\$
create; if (quarter=4) q4=1\$
create; if (quarter=5) q5=1\$
dstats; rhs=price,bd, ba, size, lot, age, q2, q3, q4, q5\$

skip;
regress;
lhs=price;
rhs=one, bd, ba, size, lot, age, q2, q3, q4, q5\$ this is a linear regression

create; lnprice=log(price)\$
skip;
regress;
lhs=lnprice;
rhs=one, bd, ba, size, lot, age, q2, q3, q4, q5\$ this is a semi-log regression

create; lnbd=log(bd)\$
create; lnba=log(ba)\$
create; lnsize=log(size)\$
create; lnlot=log(lot)\$
create; lnage=log(age)\$

skip;
regress;
lhs=lnprice;
rhs=one, lnbd, lnba, lnsize, lnlot, lnage, q2, q3, q4, q5\$ this is a log-log regression

create; sqage=age*age\$
create; sqlot=lot*lot\$
create; sqsize=size*size\$

skip;

regress;
lhs=price;
rhs=one, bd, ba, size, lot, sqage, q2, q3, q4, q5\$ this is a linear regression with age^2
regress;
lhs=price;
rhs=one, bd, ba, size, sqlot, age, q2, q3, q4, q5\$ this is a linear regression with lot^2
regress;
lhs=price;
rhs=one, bd, ba, sqsize, lot, age, q2, q3, q4, q5\$ this is a linear regression with size^2
regress;
lhs=price;
rhs=one, bd, ba, sqsize, sqlot, sqage, q2, q3, q4, q5\$ this is a linear regression with age^2,lot^2, and size^2

create; cuage=age*age*age\$
create; culot=lot*lot*lot\$
create; cusize=size*size*lot\$

regress;
lhs=price;
rhs=one, bd, ba, size, lot, cuage, q2, q3, q4, q5\$ this is a linear regression with age^3
regress;
lhs=price;
rhs=one, bd, ba, size, culot, age, q2, q3, q4, q5\$ this is a linear regression with lot^3
regress;
lhs=price;
rhs=one, bd, ba, cusize, lot, age, q2, q3, q4, q5\$ this is a linear regression with size^3
regress;
lhs=price;
rhs=one, bd, ba, cusize, culot, cuage, q2, q3, q4, q5\$ this is a linear regression with age^3,lot^3, and size^3

skip;
regress;
lhs=price;
rhs=one, bd, ba, size, lot, age, q2, q3, q4, q5; hetro\$ this is a linear regression

skip;
regress;
lhs=lnprice;
rhs=one, bd, ba, size, lot, age, q2, q3, q4, q5; hetro\$ this is a semi-log regression

create; lnbd=log(bd)\$
create; lnba=log(ba)\$
create; lnsize=log(size)\$
create; lnlot=log(lot)\$
create; lnage=log(age)\$

skip;
regress;
lhs=lnprice;
rhs=one, lnbd, lnba, lnsize, lnlot, lnage, q2, q3, q4, q5; hetro\$ this is a log-log regression
skip;
regress;
lhs=price;
rhs=one, bd, ba, size, lot, sqage, q2, q3, q4, q5; hetro\$ this is a linear regression with age^2
regress;
lhs=price;
rhs=one, bd, ba, size, sqlot, age, q2, q3, q4, q5; hetro\$ this is a linear regression with lot^2
regress;
lhs=price;
rhs=one, bd, ba, sqsize, lot, age, q2, q3, q4, q5; hetro\$ this is a linear regression with size^2
regress;
lhs=price;
rhs=one, bd, ba, sqsize, sqlot, sqage, q2, q3, q4, q5; hetro\$ this is a linear regression with age^2,lot^2, and size^2

regress;
lhs=price;
rhs=one, bd, ba, size, lot, cuage, q2, q3, q4, q5; hetro\$ this is a linear regression with age^3
regress;
lhs=price;
rhs=one, bd, ba, size, culot, age, q2, q3, q4, q5; hetro\$ this is a linear regression with lot^3
regress;
lhs=price;
rhs=one, bd, ba, cusize, lot, age, q2, q3, q4, q5; hetro\$ this is a linear regression with size^3
regress;
lhs=price;
rhs=one, bd, ba, cusize, culot, cuage, q2, q3, q4, q5; hetro\$ this is a linear regression with age^3,lot^3, and size^3

Appendix of Regression Results

Linear: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +
+———————————————————————–+
| Ordinary    least squares regression    Weighting variable = none     |
| Dep. var. = PRICE    Mean=   387570.3008    , S.D.=   189266.5817     |
| Model size: Observations =     133, Parameters =  10, Deg.Fr.=    123 |
| Residuals:  Sum of squares= .1861022040E+13, Std.Dev.=   123005.12383 |
| Fit:        R-squared=  .606423, Adjusted R-squared =          .57762 |
| Model test: F[  9,    123] =   21.06,    Prob value =          .00000 |
| Diagnostic: Log-L =  -1742.2784, Restricted(b=0) Log-L =   -1804.2882 |
|             LogAmemiyaPrCrt.=   23.512, Akaike Info. Crt.=     26.350 |
| Autocorrel: Durbin-Watson Statistic =   1.64819,   Rho =       .17591 |
+———————————————————————–+
+———+————–+—————-+——–+———+———-+
|Variable | Coefficient  | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+———+————–+—————-+——–+———+———-+
Constant    -226120.6909       72598.896   -3.115   .0023
BD           37659.81050       17601.113    2.140   .0344     3.4887218
BA           122116.0065       18335.652    6.660   .0000     2.3646617
SIZE         35.09598346       12.983202    2.703   .0078     2095.3534
LOT          2.715900837       1.3582596    2.000   .0478     10698.880
AGE          1245.204431       460.88289    2.702   .0079     35.714286
Q2           33570.17097       45364.143     .740   .4607     .36842105
Q3           54453.79975       45884.666    1.187   .2376     .32330827
Q4           68688.20522       50361.193    1.364   .1751     .15789474
Q5           68402.54427       57322.227    1.193   .2350  .82706767E-01

Semi-Log: Ln(Selling price)= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +
+———————————————————————–+
| Ordinary    least squares regression    Weighting variable = none     |
| Dep. var. = LNPRICE  Mean=   12.77153525    , S.D.=   .4290662382     |
| Model size: Observations =     133, Parameters =  10, Deg.Fr.=    123 |
| Residuals:  Sum of squares= 9.657298755    , Std.Dev.=         .28020 |
| Fit:        R-squared=  .602595, Adjusted R-squared =          .57352 |
| Model test: F[  9,    123] =   20.72,    Prob value =          .00000 |
| Diagnostic: Log-L =    -14.3136, Restricted(b=0) Log-L =     -75.6798 |
|             LogAmemiyaPrCrt.=   -2.472, Akaike Info. Crt.=       .366 |
| Autocorrel: Durbin-Watson Statistic =   1.65052,   Rho =       .17474 |
+———————————————————————–+
+———+————–+—————-+——–+———+———-+
|Variable | Coefficient  | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+———+————–+—————-+——–+———+———-+
Constant     11.37970888       .16537966   68.810   .0000
BD           .1151867289   .40095183E-01    2.873   .0048     3.4887218
BA           .2576886939   .41768456E-01    6.169   .0000     2.3646617
SIZE      .7566868199E-04  .29575623E-04    2.558   .0117     2095.3534
LOT       .5521456699E-05  .30941035E-05    1.785   .0768     10698.880
AGE       .2349243288E-02  .10498872E-02    2.238   .0270     35.714286
Q2        .3073233877E-01      .10333913     .297   .7667     .36842105
Q3        .9723291039E-01      .10452487     .930   .3541     .32330827
Q4           .1497174823       .11472236    1.305   .1943     .15789474
Q5           .1535488191       .13057954    1.176   .2419  .82706767E-01

Log-Log: Ln(Selling price)= +LN(BD) + LN(BA) +  LN (Size) +  LN (Lot) +  LN (Age) +  LN (Q2) +  LN (Q3) +  LN (Q4)+ LN (Q5) +
+———————————————————————–+
| Ordinary    least squares regression    Weighting variable = none     |
| Dep. var. = LNPRICE  Mean=   12.77153525    , S.D.=   .4290662382     |
| Model size: Observations =     133, Parameters =  10, Deg.Fr.=    123 |
| Residuals:  Sum of squares= 10.20204619    , Std.Dev.=         .28800 |
| Fit:        R-squared=  .580179, Adjusted R-squared =          .54946 |
| Model test: F[  9,    123] =   18.89,    Prob value =          .00000 |
| Diagnostic: Log-L =    -17.9627, Restricted(b=0) Log-L =     -75.6798 |
|             LogAmemiyaPrCrt.=   -2.417, Akaike Info. Crt.=       .420 |
| Autocorrel: Durbin-Watson Statistic =   1.60387,   Rho =       .19806 |
+———————————————————————–+
+———+————–+—————-+——–+———+———-+
|Variable | Coefficient  | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+———+————–+—————-+——–+———+———-+
Constant     8.757137166       .63677360   13.752   .0000
LNBD         .2241713068       .15172804    1.477   .1421     1.2228521
LNBA         .5008477393       .10998272    4.554   .0000     .81084373
LNSIZE       .3844196249   .92943230E-01    4.136   .0001     7.5455685
LNLOT     .6238403231E-02  .31755881E-01     .196   .8446     8.9728713
LNAGE     .9113424286E-01  .36461800E-01    2.499   .0138     3.3225134
Q2        .3070924469E-01      .10623658     .289   .7730     .36842105
Q3        .8425880607E-01      .10815729     .779   .4375     .32330827
Q4           .1622575363       .11861502    1.368   .1738     .15789474
Q5           .1275544129       .13472863     .947   .3456  .82706767E-01

Age: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +
+———————————————————————–+
| Ordinary    least squares regression    Weighting variable = none     |
| Dep. var. = PRICE    Mean=   387570.3008    , S.D.=   189266.5817     |
| Model size: Observations =     133, Parameters =  10, Deg.Fr.=    123 |
| Residuals:  Sum of squares= .1894530209E+13, Std.Dev.=   124107.55265 |
| Fit:        R-squared=  .599337, Adjusted R-squared =          .57002 |
| Model test: F[  9,    123] =   20.44,    Prob value =          .00000 |
| Diagnostic: Log-L =  -1743.4651, Restricted(b=0) Log-L =   -1804.2882 |
|             LogAmemiyaPrCrt.=   23.530, Akaike Info. Crt.=     26.368 |
| Autocorrel: Durbin-Watson Statistic =   1.58509,   Rho =       .20745 |
+———————————————————————–+
+———+————–+—————-+——–+———+———-+
|Variable | Coefficient  | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+———+————–+—————-+——–+———+———-+
Constant    -198197.6762       71355.991   -2.778   .0063
BD           41632.61060       17697.361    2.352   .0202     3.4887218
BA           119922.0764       18461.383    6.496   .0000     2.3646617
SIZE         30.77013285       12.951058    2.376   .0191     2095.3534
LOT          2.819835392       1.3728617    2.054   .0421     10698.880
SQAGE        9.136372604       4.0879400    2.235   .0272     1854.7519
Q2           31407.95264       45749.284     .687   .4937     .36842105
Q3           56904.58301       46290.731    1.229   .2213     .32330827
Q4           63905.15657       50810.226    1.258   .2109     .15789474
Q5           64079.37547       57762.299    1.109   .2694  .82706767E-01

Lot: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +
+———————————————————————–+
| Ordinary    least squares regression    Weighting variable = none     |
| Dep. var. = PRICE    Mean=   387570.3008    , S.D.=   189266.5817     |
| Model size: Observations =     133, Parameters =  10, Deg.Fr.=    123 |
| Residuals:  Sum of squares= .1864381533E+13, Std.Dev.=   123116.09740 |
| Fit:        R-squared=  .605713, Adjusted R-squared =          .57686 |
| Model test: F[  9,    123] =   21.00,    Prob value =          .00000 |
| Diagnostic: Log-L =  -1742.3983, Restricted(b=0) Log-L =   -1804.2882 |
|             LogAmemiyaPrCrt.=   23.514, Akaike Info. Crt.=     26.352 |
| Autocorrel: Durbin-Watson Statistic =   1.65096,   Rho =       .17452 |
+———————————————————————–+
+———+————–+—————-+——–+———+———-+
|Variable | Coefficient  | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+———+————–+—————-+——–+———+———-+
Constant    -218881.7371       72780.583   -3.007   .0032
BD           42017.60864       17509.912    2.400   .0179     3.4887218
BA           119170.0447       18521.575    6.434   .0000     2.3646617
SIZE         36.16375705       12.928519    2.797   .0060     2095.3534
SQLOT     .5893264706E-04  .30354526E-04    1.941   .0545  .18774546E+09
AGE          1243.157772       461.28091    2.695   .0080     35.714286
Q2           33332.79329       45410.126     .734   .4643     .36842105
Q3           55029.73049       45919.835    1.198   .2331     .32330827
Q4           71429.12637       50299.962    1.420   .1581     .15789474
Q5           66239.11398       57337.025    1.155   .2502  .82706767E-01

Size: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +
+———————————————————————–+
| Ordinary    least squares regression    Weighting variable = none     |
| Dep. var. = PRICE    Mean=   387570.3008    , S.D.=   189266.5817     |
| Model size: Observations =     133, Parameters =  10, Deg.Fr.=    123 |
| Residuals:  Sum of squares= .1940633545E+13, Std.Dev.=   125608.55264 |
| Fit:        R-squared=  .589586, Adjusted R-squared =          .55956 |
| Model test: F[  9,    123] =   19.63,    Prob value =          .00000 |
| Diagnostic: Log-L =  -1745.0640, Restricted(b=0) Log-L =   -1804.2882 |
|             LogAmemiyaPrCrt.=   23.554, Akaike Info. Crt.=     26.392 |
| Autocorrel: Durbin-Watson Statistic =   1.59022,   Rho =       .20489 |
+———————————————————————–+
+———+————–+—————-+——–+———+———-+
|Variable | Coefficient  | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+———+————–+—————-+——–+———+———-+
Constant    -241337.0604       74193.454   -3.253   .0015
BD           49579.14015       17172.555    2.887   .0046     3.4887218
BA           135398.2084       17772.520    7.618   .0000     2.3646617
SQSIZE    .1810164194E-02  .12924632E-02    1.401   .1639     5618018.3
LOT          3.151652620       1.3736058    2.294   .0235     10698.880
AGE          1117.317431       467.27311    2.391   .0183     35.714286
Q2           39357.25904       46281.101     .850   .3968     .36842105
Q3           61364.81777       46755.053    1.312   .1918     .32330827
Q4           74204.44587       51370.025    1.445   .1511     .15789474
Q5           71571.08568       58518.552    1.223   .2236  .82706767E-01

Age,Lot, Size: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +
+———————————————————————–+
| Ordinary    least squares regression    Weighting variable = none     |
| Dep. var. = PRICE    Mean=   387570.3008    , S.D.=   189266.5817     |
| Model size: Observations =     133, Parameters =  10, Deg.Fr.=    123 |
| Residuals:  Sum of squares= .1964510934E+13, Std.Dev.=   126378.92867 |
| Fit:        R-squared=  .584537, Adjusted R-squared =          .55414 |
| Model test: F[  9,    123] =   19.23,    Prob value =          .00000 |
| Diagnostic: Log-L =  -1745.8772, Restricted(b=0) Log-L =   -1804.2882 |
|             LogAmemiyaPrCrt.=   23.567, Akaike Info. Crt.=     26.404 |
| Autocorrel: Durbin-Watson Statistic =   1.55620,   Rho =       .22190 |
+———————————————————————–+
+———+————–+—————-+——–+———+———-+
|Variable | Coefficient  | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+———+————–+—————-+——–+———+———-+
Constant    -209053.4763       72658.334   -2.877   .0047
BD           57570.61037       17043.151    3.378   .0010     3.4887218
BA           129251.6231       17991.802    7.184   .0000     2.3646617
SQSIZE    .1652037302E-02  .12929569E-02    1.278   .2038     5618018.3
SQLOT     .6912860879E-04  .31054480E-04    2.226   .0278  .18774546E+09
SQAGE        8.968801650       4.1642212    2.154   .0332     1854.7519
Q2           37103.33327       46532.998     .797   .4268     .36842105
Q3           63880.95430       47029.821    1.358   .1769     .32330827
Q4           72745.10419       51545.465    1.411   .1607     .15789474
Q5           65725.66481       58752.159    1.119   .2655  .82706767E-01

Age: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +
+———————————————————————–+
| Ordinary    least squares regression    Weighting variable = none     |
| Dep. var. = PRICE    Mean=   387570.3008    , S.D.=   189266.5817     |
| Model size: Observations =     133, Parameters =  10, Deg.Fr.=    123 |
| Residuals:  Sum of squares= .1925036441E+13, Std.Dev.=   125102.76885 |
| Fit:        R-squared=  .592885, Adjusted R-squared =          .56310 |
| Model test: F[  9,    123] =   19.90,    Prob value =          .00000 |
| Diagnostic: Log-L =  -1744.5274, Restricted(b=0) Log-L =   -1804.2882 |
|             LogAmemiyaPrCrt.=   23.546, Akaike Info. Crt.=     26.384 |
| Autocorrel: Durbin-Watson Statistic =   1.52685,   Rho =       .23657 |
+———————————————————————–+
+———+————–+—————-+——–+———+———-+
|Variable | Coefficient  | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+———+————–+—————-+——–+———+———-+
Constant    -188301.8461       71581.262   -2.631   .0096
BD           43262.82077       17864.936    2.422   .0169     3.4887218
BA           119268.5435       18603.691    6.411   .0000     2.3646617
SIZE         29047.71377       13050.325    2.226   .0278     2.0953534
LOT          2784.715719       1384.2298    2.012   .0464     10.698880
CUAGE        61.32949530       35.606852    1.722   .0875     126.19379
Q2           30085.44489       46106.441     .653   .5153     .36842105
Q3           57714.60912       46667.604    1.237   .2185     .32330827
Q4           61911.92839       51262.681    1.208   .2295     .15789474
Q5           60372.26504       58168.168    1.038   .3014  .82706767E-01

Lot: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +
+———————————————————————–+
| Ordinary    least squares regression    Weighting variable = none     |
| Dep. var. = PRICE    Mean=   387570.3008    , S.D.=   189266.5817     |
| Model size: Observations =     133, Parameters =  10, Deg.Fr.=    123 |
| Residuals:  Sum of squares= .1887007804E+13, Std.Dev.=   123860.91740 |
| Fit:        R-squared=  .600927, Adjusted R-squared =          .57173 |
| Model test: F[  9,    123] =   20.58,    Prob value =          .00000 |
| Diagnostic: Log-L =  -1743.2005, Restricted(b=0) Log-L =   -1804.2882 |
|             LogAmemiyaPrCrt.=   23.526, Akaike Info. Crt.=     26.364 |
| Autocorrel: Durbin-Watson Statistic =   1.64620,   Rho =       .17690 |
+———————————————————————–+
+———+————–+—————-+——–+———+———-+
|Variable | Coefficient  | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+———+————–+—————-+——–+———+———-+
Constant    -220786.5652       73216.452   -3.016   .0031
BD           43211.70481       17648.554    2.448   .0158     3.4887218
BA           119742.4105       18701.854    6.403   .0000     2.3646617
SIZE         37264.69827       12975.229    2.872   .0048     2.0953534
CULOT        .9963556600       .66434010    1.500   .1362     4978.6157
AGE          12253.39723       4639.4159    2.641   .0093     3.5714286
Q2           33928.23230       45686.087     .743   .4591     .36842105
Q3           55877.82472       46192.497    1.210   .2287     .32330827
Q4           74211.29656       50552.385    1.468   .1447     .15789474
Q5           64860.89912       57673.176    1.125   .2629  .82706767E-01

Size: Selling price= +(BD) + (BA) + (Size) + (Lot) + (Age) + (Q2) + (Q3) + (Q4)+(Q5) +
+———————————————————————–+
| Ordinary    least squares regression    Weighting variable = none     |
| Dep. var. = PRICE    Mean=   387570.3008    , S.D.=   189266.5817     |
| Model size: Observations =     133, Parameters =  10, Deg.Fr.=    123 |
| Residuals:  Sum of squares= .1925640795E+13, Std.Dev.=   125122.40495 |
| Fit:        R-squared=  .592757, Adjusted R-squared =          .56296 |
| Model test: F[  9,    123] =   19.89,    Prob value =          .00000 |
| Diagnostic: Log-L =  -1744.5482, Restricted(b=0) Log-L =   -1804.2882 |
|             LogAmemiyaPrCrt.=   23.547, Akaike Info. Crt.=     26.384 |
| Autocorrel: Durbin-Watson Statistic =   1.59886,   Rho =       .20057 |
+———————————————————————–+
+———+————–+—————-+——–+———+———-+
|Variable | Coefficient  | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+———+————–+—————-+——–+———+———-+
Constant    -221337.2167       76099.097   -2.909   .0043
BD           50303.69757       16942.158    2.969   .0036     3.4887218
BA           131567.8584       18073.353    7.280   .0000     2.3646617
CUSIZE       189.8603463       110.83285    1.713   .0892     75.419611
LOT          1613.499064       1676.2397     .963   .3377     10.698880
AGE          11162.99840       4646.6675    2.402   .0178     3.5714286
Q2           39134.77136       46066.731     .850   .3972     .36842105
Q3           59027.42881       46624.808    1.266   .2079     .32330827
Q4           73710.37655       51173.359    1.440   .1523     .15789474
Q5           69016.76995       58325.435    1.183   .2390  .82706767E-01

Age,Lot, Size: Selling price= +(BD) + (BA) + (Size)  + (Lot)  + (Age)  + (Q2) + (Q3) + (Q4)+(Q5) +
+———————————————————————–+
| Ordinary    least squares regression    Weighting variable = none     |
| Dep. var. = PRICE    Mean=   387570.3008    , S.D.=   189266.5817     |
| Model size: Observations =     133, Parameters =  10, Deg.Fr.=    123 |
| Residuals:  Sum of squares= .1981635825E+13, Std.Dev.=   126928.56403 |
| Fit:        R-squared=  .580915, Adjusted R-squared =          .55025 |
| Model test: F[  9,    123] =   18.94,    Prob value =          .00000 |
| Diagnostic: Log-L =  -1746.4544, Restricted(b=0) Log-L =   -1804.2882 |
|             LogAmemiyaPrCrt.=   23.575, Akaike Info. Crt.=     26.413 |
| Autocorrel: Durbin-Watson Statistic =   1.52703,   Rho =       .23648 |
+———————————————————————–+
+———+————–+—————-+——–+———+———-+
|Variable | Coefficient  | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+———+————–+—————-+——–+———+———-+
Constant    -179207.8145       74321.138   -2.411   .0174
BD           56041.05757       17411.842    3.219   .0016     3.4887218
BA           125686.3363       18235.331    6.892   .0000     2.3646617
CUSIZE       207.0734780       108.47541    1.909   .0586     75.419611
CULOT        .3907830999       .80803997     .484   .6295     4978.6157
CUAGE        61.71148077       36.127071    1.708   .0901     126.19379
Q2           34966.44174       46677.926     .749   .4552     .36842105
Q3           61530.57060       47310.295    1.301   .1958     .32330827
Q4           69789.42370       51816.146    1.347   .1805     .15789474
Q5           59428.41302       58942.872    1.008   .3153  .82706767E-01

Excel VBA Code to Generate Sensitivity Graphs Based on User Specified Ranges

Tags

Updated:

Many thanks to Joris Claassen who provided feedback on how to better optimize this code. Please find the updated code below.

Hey guys,

As I continue my quest to become a master VBA programmer, I would like to share a piece of code with you that I recently constructed. This could was the result of Excel Solver not being able to generate a satisfactory solution to my optimization problem. Thus, I wanted to run a sensitivity analysis to see how various initial values would affect my target cell. This code allows the user to specify a target cell, a set of change cells, a range of lower bounds (inclusive) and a range of upper bounds (inclusive), as well as a precision level. The code then one-by-one tests the sensitivity of each of your change cells on your target cell and generates graphs demonstrating the output. I would appreciate any feedback on the code!

Thanks,

```'Created by Adam McElhinney, 8/4/2011
'This code allows a user to specify a range of cells to adjust, a target cell as well as lower and upper bounds for their adjustments.
'Then the code generates a charts that determine how sensititive the target cell is to the input parameters. This can be useful for
'cases where Solver cannot generate a satisfactory solution.
Sub refit_model()
Call clear_sheet
Call calc_configs
Call ArrangeMyCharts

End Sub

'Clear Sheets
Sub clear_sheet()

Dim check As Worksheet

Sheets("output").Select
Sheets("output").Cells.Clear

Dim wsSheet As Worksheet
On Error Resume Next
Set wsSheet = Sheets("graphs")
If Not wsSheet Is Nothing Then
wsSheet.Delete

Else
End If

End Sub

Sub calc_configs()
Application.ScreenUpdating = True

'Timer
'Dim Starttime As Double, EndTime As Double
'Startime = Timer

'Prompt user to select range of coefficients, lower bounds and upper bounds
Worksheets("calculation_configs").Activate
Dim variables As Range
Dim change As Range
Dim lower As Range
Dim upper As Range
Dim current As Range
Dim target As Range
Dim precision As Integer
Dim test_mode As Integer
Set variables = Application.InputBox("Select the row of variable names:", Default:=Range("a6:a19").Address, Type:=8)
Set current = Application.InputBox("Select the row of cells that are the current configs:", Default:=Range("b6:b19").Address, Type:=8)
Set change = Application.InputBox("Select the row of cells to change:", Default:=Range("c6:c19").Address, Type:=8)
Set lower = Application.InputBox("Select the row of cells that are lower bound:", Default:=Range("d6:d19").Address, Type:=8)
Set upper = Application.InputBox("Select the row of cells that are upper bound:", Default:=Range("e6:e19").Address, Type:=8)
Set target = Application.InputBox("Select the target cell", Default:=Range("k16").Address, Type:=8)
precision = Application.InputBox("Enter in the graphics precision:", Default:=10)

test_mode = (MsgBox("Do you want to run in test mode? Click 'Yes' for test mode or 'No' for regular mode. In test mode, cells will not recalculate.", vbYesNo))

'Check to ensure that change, lower and upper are all same length
Dim j, k, l As Integer
j = change.Rows.Count
k = lower.Rows.Count
l = upper.Rows.Count
If j <> k Or k <> l Then
MsgBox ("The number of cells to change must equal the number of upper bounds and the number of lower bounds")
Exit Sub

End If

Application.ScreenUpdating = False

ReDim coeff(j) As Double
ReDim lowerb(j) As Double
ReDim upperb(j) As Double
Dim i As Integer
i = 1
For i = 1 To j
coeff(i) = change.Cells(i, 1).value
lowerb(i) = lower.Cells(i, 1).value
upperb(i) = upper.Cells(i, 1).value
'MsgBox (lower.Cells(i, 1).Value)
Next i

'QC Code
'ReDim testdisplay(j) As Double
'testdisplay = lowerb
'MsgBox testdisplay(1) & "-" & testdisplay(2) & "-" & testdisplay(3) & "-" & testdisplay(4) & "-" & testdisplay(5) & "-" & testdisplay(6) & "-" & testdisplay(7) & "-" & testdisplay(8) & "-" & testdisplay(9) & "-" & _
'testdisplay(10) & "-" & testdisplay(11) & "-" & testdisplay(12) & "-" & testdisplay(13) & "-" & testdisplay(14)

'Determine distance between upper and lower bounds, as well as upper bound and initial values and lower bound and initial values
ReDim distance(j) As Double
ReDim UpperDistance(j) As Double
ReDim LowerDistance(j) As Double
i = 0
For i = 1 To j
distance(i) = upperb(i) - lowerb(i)
UpperDistance(i) = upperb(i) - coeff(i)
LowerDistance(i) = coeff(i) - lowerb(i)
Next i

'Determine the graphing precision change intervals
ReDim interval(j) As Double
For i = 1 To j
interval(i) = distance(i) / precision
'MsgBox (interval(i))
Next i

'QC Code

'ReDim testdisplay2(j) As Double
'testdisplay2 = distance
'MsgBox testdisplay2(1) & "-" & testdisplay2(2) & "-" & testdisplay2(3) & "-" & testdisplay2(4) & "-" & testdisplay2(5) & "-" & testdisplay2(6) & "-" & testdisplay2(7) & "-" & testdisplay2(8) & "-" & testdisplay2(9) & "-" & _
'testdisplay2(10) & "-" & testdisplay2(11) & "-" & testdisplay2(12) & "-" & testdisplay2(13) & "-" & testdisplay2(14)

'Set all configs to current values
change.value = current.value
'Calc sum squares
If test_mode = vbNo Then
Application.CalculateFull
End If
'Record results

'Create index variable
Dim ObsCol As Range
'ObsCol=Sheets("output").

Dim ObsNum As Integer
ObsNum = 1

For i = 1 To j
Sheets("output").Cells(1, i).value = variables.Cells(i, 1)
Sheets("output").Cells(2, i).value = change.Cells(i, 1)
Next i
Sheets("output").Cells(1, j + 1).value = "Target"
Sheets("output").Cells(1, j + 2).value = "ObsNum"
Sheets("output").Cells(2, j + 1).value = target.Cells(1, 1).value
Sheets("output").Cells(2, j + 2).value = ObsNum

'New Counters
Dim n As Integer
Dim p As Integer
n = 1
p = 1

'Iterate through each variable one by one in increments equal to the precision, calculating the value of the target
i = 1
'Loop for each variable
For i = 1 To j
'Set all configs to lower bounds
change.value = lower.value

'Loop for each increment
For m = 1 To precision
change.Cells(i, 1).value = m * interval(i) + lowerb(i)
'MsgBox (m * interval(i) + lowerb(i))

If test_mode = vbNo Then
Application.CalculateFull
End If

'Print output
n = 1
For n = 1 To j
'Write change values
Sheets("output").Cells(2 + i * precision - (precision - m), n).value = change.Cells(n, 1).value
Next n
'Write target value
Sheets("output").Cells(2 + i * precision - (precision - m), j + 1).value = target.value
'Obs Num
Sheets("output").Cells(2 + i * precision - (precision - m), j + 2).value = i * precision - (precision - m) + 1

Next m
'Create graph of output
Worksheets("output").Activate
ActiveChart.SetSourceData Source:=Sheets("output").Range(Cells(i * precision - (precision - 1) + 2, j + 1), _
Cells(i * precision + 2, j + 1))

Debug.Print "Chart Target range:" & i * precision - (precision - 1) + 2 & "-" & j + 1

ActiveChart.ChartType = xlLine
ActiveChart.SeriesCollection(1).XValues = Sheets("output").Range(Cells(i * precision - (precision - 1) + 2, i), _
Cells(i * precision + 2, i))
ActiveChart.HasTitle = True
ActiveChart.ChartTitle.Text = "Change in Target for a Change in Variable " & variables(i)
ActiveChart.Location xlLocationAsObject, "graphs"
Debug.Print "Chart Series range:" & ii * precision - (precision - 1) + 2 & "-" & i
Next i

'Save workbook
'ActiveWorkbook.SaveAs Filename:="C:\Documents and Settings\amcelhinney\Desktop\" & test_file

EndTime = Timer
Debug.Print "Execution Time in seconds:", EndTime - Starttime

End Sub

Sub ArrangeMyCharts()
Worksheets("graphs").Activate

Dim iChart As Long
Dim nCharts As Long
Dim dTop As Double
Dim dLeft As Double
Dim dHeight As Double
Dim dWidth As Double
Dim nColumns As Long
dTop = 75 ' top of first row of charts
dLeft = 100 ' left of first column of charts
dHeight = 225 ' height of all charts
dWidth = 375 ' width of all charts
nColumns = 3 ' number of columns of charts

nCharts = ActiveSheet.ChartObjects.Count

Debug.Print nCharts

For iChart = 1 To nCharts
With ActiveSheet.ChartObjects(iChart)
.Height = dHeight
.Width = dWidth
.Top = dTop + Int((iChart - 1) / nColumns) * dHeight
.Left = dLeft + ((iChart - 1) Mod nColumns) * dWidth
End With

Next
End Sub

Sub ArrangeMyCharts_2()
Worksheets("Calculation_Configs").Activate

Dim iChart2 As Long
Dim nCharts2 As Long
Dim dTop As Double
Dim dLeft As Double
Dim dHeight As Double
Dim dWidth As Double
Dim nColumns As Long
dTop = 300 ' top of first row of charts
dLeft = 500 ' left of first column of charts
dHeight = 225 ' height of all charts
dWidth = 375 ' width of all charts
nColumns = 2 ' number of columns of charts

nCharts2 = ActiveSheet.ChartObjects.Count

Debug.Print nCharts2

For iChart2 = 1 To nCharts2
With ActiveSheet.ChartObjects(iChart2)
.Height = dHeight
.Width = dWidth
.Top = dTop + Int((iChart2 - 1) / nColumns) * dHeight
.Left = dLeft + ((iChart2 - 1) Mod nColumns) * dWidth
End With

Next
End Sub

```

SAS Certified Base Programmer

Today I took the SAS Certified Base Programmer Exam and passed! I wanted to share a few thoughts for people who are interested in taking the exam.

1. Ensure that the exam material is relevant how you use SAS. The certification is extremely detailed and mainly covers data importing, basic data manipulation and basic reporting (full description of exam content). If you use SAS primarily for predictive modeling purposes and already have an established method of getting data into SAS, this exam may not be a good use of your time.

2. Buy the exam book. Unless you have been using SAS for years, you won’t be able to pass the test without it (and even if you have been using SAS for years, I’d be shocked if you were able to pass without this book). Ensure that you read each chapter and make flashcards along the way. Also, take the practice exams at the end of each chapter and record which questions you got wrong. Then revisit those questions periodically until you are able to complete all of the practice exams with 90% or greater accuracy.

3. Consider purchasing the practice exam. I was not excited to shell out another \$55 for this test, on top of the cost of the book and the cost of the exam, but it was definitely worth it. This is far more robust than the book and is very representative of the type of questions you will see on the exam.

I hope that this was helpful and good luck with your studies!