Modeling the Prices of Homes

In this post I use data on homes sold from 2006 to 2010 in Ames, Iowa, to build a model for predicting sale prices, and I submit my predictions to a Kaggle competition.

The data, which was originally published by Dean De Cock, a professor at Truman State University, has 80 variables describing the quality of homes. The data was collected by the City of Ames for tax assessment purposes and contains the kind of information one would expect potential property buyers to care about, so it makes sense that it would influence price.

As of May 2018, De Cock’s 2011 publication in the Journal of Statistics Education, 19(3) can be found here: https://ww2.amstat.org/publications/jse/v19n3/decock.pdf

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import seaborn as sns

from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

After importing the necessary modules, I read the data from two CSV files, 'test.csv' and 'train.csv'.

test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

I started my exploratory data analysis by finding out how many rows and columns there are in the test and train sets.

As seen in the output of the following two cells, the training data has more rows (observations) than the test data, and it also has one more column. Upon further investigation, this extra column is 'SalePrice', the target variable. This makes sense since my objective is to predict sale price; my predictions will fill in SalePrice for the test data.

test.shape
(879, 80)
train.shape
(2051, 81)
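
A quick way to confirm which column the test set lacks is to compare the two sets of column names (a one-off check, assuming the frames loaded above):

# The only column present in train but not in test should be the target
set(train.columns) - set(test.columns)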

Next I visually inspected the first few rows of each data set.

train.head()
Id PID MS SubClass MS Zoning Lot Frontage Lot Area Street Alley Lot Shape Land Contour ... Screen Porch Pool Area Pool QC Fence Misc Feature Misc Val Mo Sold Yr Sold Sale Type SalePrice
0 109 533352170 60 RL NaN 13517 Pave NaN IR1 Lvl ... 0 0 NaN NaN NaN 0 3 2010 WD 130500
1 544 531379050 60 RL 43.0 11492 Pave NaN IR1 Lvl ... 0 0 NaN NaN NaN 0 4 2009 WD 220000
2 153 535304180 20 RL 68.0 7922 Pave NaN Reg Lvl ... 0 0 NaN NaN NaN 0 1 2010 WD 109000
3 318 916386060 60 RL 73.0 9802 Pave NaN Reg Lvl ... 0 0 NaN NaN NaN 0 4 2010 WD 174000
4 255 906425045 50 RL 82.0 14235 Pave NaN IR1 Lvl ... 0 0 NaN NaN NaN 0 3 2010 WD 138500

5 rows × 81 columns

test.head()
Id PID MS SubClass MS Zoning Lot Frontage Lot Area Street Alley Lot Shape Land Contour ... 3Ssn Porch Screen Porch Pool Area Pool QC Fence Misc Feature Misc Val Mo Sold Yr Sold Sale Type
0 2658 902301120 190 RM 69.0 9142 Pave Grvl Reg Lvl ... 0 0 0 NaN NaN NaN 0 4 2006 WD
1 2718 905108090 90 RL NaN 9662 Pave NaN IR1 Lvl ... 0 0 0 NaN NaN NaN 0 8 2006 WD
2 2414 528218130 60 RL 58.0 17104 Pave NaN IR1 Lvl ... 0 0 0 NaN NaN NaN 0 9 2006 New
3 1989 902207150 30 RM 60.0 8520 Pave NaN Reg Lvl ... 0 0 0 NaN NaN NaN 0 7 2007 WD
4 625 535105100 20 RL NaN 9500 Pave NaN IR1 Lvl ... 0 185 0 NaN NaN NaN 0 7 2009 WD

5 rows × 80 columns

I also opened the data in a spreadsheet program for a more thorough visual inspection (since the .head() method does not show all columns). This revealed that many of the columns had missing values stored as whitespace. In order to use these variables for modeling without errors, I converted these missing values to np.nan.

# This use of the DataFrame.replace() method uses a regular expression
# to replace cells which only contain whitespace with np.NaN.
# Credit for this implementation goes to Stack Overflow user Temak 
# on this thread: https://stackoverflow.com/questions/13445241/replacing-blank-values-white-space-with-nan-in-pandas 

train.replace(r'^\s+$', np.nan, regex=True, inplace=True)
test.replace(r'^\s+$', np.nan, regex=True, inplace=True)

Many of the column titles contain spaces, which I converted to underscores in the next cell because spaces prevent attribute-style access (e.g. train.Yr_Sold) and are awkward to work with in code.

train.columns = [col.strip().replace(' ', '_') for col in train.columns]
test.columns =  [col.strip().replace(' ', '_') for col in test.columns]

Next I plotted the distribution of the variable of most interest, the prediction target: SalePrice.

As seen in the graph below, it has a right skewed distribution with some large outliers.

sns.distplot(train.SalePrice);

[Figure: distribution of SalePrice]

To understand the distribution of SalePrice more precisely, I used the .describe() method to get summary statistics. In dollars, the mean sale price was ~181,000, the maximum sale price was ~611,000, the minimum was ~13,000, and the middle 50% of sale prices were between ~130,000 and ~214,000.

train.SalePrice.describe()
count      2051.000000
mean     181469.701609
std       79258.659352
min       12789.000000
25%      129825.000000
50%      162500.000000
75%      214000.000000
max      611657.000000
Name: SalePrice, dtype: float64

I wish houses were this cheap in New York or Vancouver.

The date of sale could also have a lot to do with the prices, so I looked at that next.

Plotting sale frequency by month:

# This graph was inspired by Lee Clemmer's blog post here: https://www.kaggle.com/leeclemmer/exploratory-data-analysis-of-housing-in-ames-iowa/code
train.groupby(['Yr_Sold','Mo_Sold']).Id.count().plot(kind='bar',fontsize=14,figsize=(15,5),color='b')
plt.title("Amount of Sales Per Month")
plt.xlabel('Year, Month')
plt.show()

[Figure: number of sales per month, by year and month of sale]

Sale dates appear seasonal, and the time period includes the 2008 housing crash. To check whether the national downturn had a notable impact on Ames house prices, I plotted mean sale prices over time and looked for a dip starting in 2008.

train.groupby(['Yr_Sold', 'Mo_Sold'])['SalePrice'].mean().plot(kind='bar',
                                                               color='b',
                                                               figsize=(15, 5));

[Figure: mean SalePrice by year and month of sale]

If the housing crisis impacted sale prices in Ames, the effect wasn't drastic. However, it appears that prices tended to be lower after December 2007 than before. This is a topic worthy of further study, though I decided not to create a variable marking when a sale happened in relation to the bubble, since any information such a variable could offer is already contained in the existing sale-date variables.
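
For illustration, the kind of indicator I decided against would be a deterministic function of the existing sale-date columns, which is why it adds no new information (this is a hypothetical flag, not used anywhere below):

# Hypothetical post-crash flag: 1 for sales after December 2007, 0 otherwise.
# It is fully determined by Yr_Sold, so the models never see anything new from it.
post_crash = (train.Yr_Sold >= 2008).astype(int)
post_crash.value_counts()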

In preparation for predicting SalePrice using regularized linear regression, I log-transformed SalePrice to reduce its skew and rein in its large outliers (which the author of the dataset even recommended removing). Log-transforming the target also makes cheap and expensive homes influence the model more equally, as argued by Julien Cohen-Solal in his analysis of the Ames data: https://www.kaggle.com/juliencs/a-study-on-regression-applied-to-the-ames-dataset/code.

# Log-transform SalePrice, the target.
train.SalePrice = np.log(train.SalePrice)
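
As a quick check on how much the transform helps, one can compare the skewness of the prices before and after the transform (SalePrice is already log-transformed at this point, so I exponentiate to recover the dollar scale for comparison):

# Skewness on the original dollar scale vs. the log scale
print("skew (dollars):    ", round(np.exp(train.SalePrice).skew(), 2))
print("skew (log dollars):", round(train.SalePrice.skew(), 2))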

Next I continued my preparation for linear regression by making dummy variables out of the categorical and ordinal variables.

# Make dummy variables for all columns of type object, plus MS_SubClass,
# which, while not encoded as an object, still represents a categorical variable.
# Credit for this function goes to Joe Dorfman from my DSI cohort
def get_dummied(train, test):
    # Collect the categorical columns (avoid shadowing the built-in `list`)
    cat_cols = ['MS_SubClass']
    for col in train.columns:
        if train[col].dtype == object:
            cat_cols.append(col)

    # Dummy-encode train and test together so both end up with the same columns
    full_data = pd.concat([train, test], axis=0)
    full_data = pd.get_dummies(full_data, columns=cat_cols, drop_first=True)

    # Split back into the original train and test rows
    X_dummied = full_data[:len(train)]
    test_dummied = full_data[len(train):]

    return X_dummied, test_dummied

train, test = get_dummied(train, test)
# review the columns to check that get_dummied really made the dummies
train.head()
test.head()
1st_Flr_SF 2nd_Flr_SF 3Ssn_Porch Bedroom_AbvGr BsmtFin_SF_1 BsmtFin_SF_2 Bsmt_Full_Bath Bsmt_Half_Bath Bsmt_Unf_SF Enclosed_Porch ... Misc_Feature_TenC Sale_Type_CWD Sale_Type_Con Sale_Type_ConLD Sale_Type_ConLI Sale_Type_ConLw Sale_Type_New Sale_Type_Oth Sale_Type_VWD Sale_Type_WD
0 908 1020 0 4 0.0 0.0 0.0 0.0 1020.0 112 ... 0 0 0 0 0 0 0 0 0 1
1 1967 0 0 6 0.0 0.0 0.0 0.0 1967.0 0 ... 0 0 0 0 0 0 0 0 0 1
2 664 832 0 3 554.0 0.0 1.0 0.0 100.0 0 ... 0 0 0 0 0 0 1 0 0 0
3 968 0 0 2 0.0 0.0 0.0 0.0 968.0 184 ... 0 0 0 0 0 0 0 0 0 1
4 1394 0 0 3 609.0 0.0 1.0 0.0 785.0 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 273 columns

Next, I added 2nd-degree polynomial features for the 10 variables most strongly correlated with SalePrice. I chose not to make polynomial features for all variables in order to keep the processing time of later steps manageable, leaving me time to test multiple models.

# Assess which variables are most correlated with sale price:
corr = train.corrwith(train.SalePrice)
print(corr.sort_values(ascending=False).head(11))
SalePrice         1.000000
Overall_Qual      0.822774
Gr_Liv_Area       0.687774
Garage_Cars       0.667780
Garage_Area       0.650755
Year_Built        0.624449
Total_Bsmt_SF     0.621416
Year_Remod/Add    0.599459
1st_Flr_SF        0.599086
Garage_Yr_Blt     0.581093
Full_Bath         0.565855
dtype: float64
# Make a list of the columns to be passed to PolynomialFeatures()
cols = ['Overall_Qual','Gr_Liv_Area','Garage_Cars','Garage_Area',
        'Year_Built','Total_Bsmt_SF','Year_Remod/Add', '1st_Flr_SF', 
        'Garage_Yr_Blt','Full_Bath']

# Instantiate PolynomialFeatures (degree 2 by default) with no bias column
poly = PolynomialFeatures(include_bias=False)

# Make 2nd-degree polynomial terms from the variables in cols,
# filling NaNs with 0 first because you can't multiply by a NaN,
# and turn the results back into DataFrames.
# Fit on the training data, then apply the same transformation to the test data.
train_poly = pd.DataFrame(poly.fit_transform(train[cols].fillna(0)))
test_poly = pd.DataFrame(poly.transform(test[cols].fillna(0)))

# I don't care about the column names because I'm not trying to make an
# inferential model, just a predictive one
# check that the features of both dataframes line up
test_poly.head()
train_poly.head()
0 1 2 3 4 5 6 7 8 9 ... 55 56 57 58 59 60 61 62 63 64
0 6.0 1479.0 2.0 475.0 1976.0 725.0 2005.0 725.0 1976.0 2.0 ... 4020025.0 1453625.0 3961880.0 4010.0 525625.0 1432600.0 1450.0 3904576.0 3952.0 4.0
1 7.0 2122.0 2.0 559.0 1996.0 913.0 1997.0 913.0 1997.0 2.0 ... 3988009.0 1823261.0 3988009.0 3994.0 833569.0 1823261.0 1826.0 3988009.0 3994.0 4.0
2 5.0 1057.0 1.0 246.0 1953.0 1057.0 2007.0 1057.0 1953.0 1.0 ... 4028049.0 2121399.0 3919671.0 2007.0 1117249.0 2064321.0 1057.0 3814209.0 1953.0 1.0
3 5.0 1444.0 2.0 400.0 2006.0 384.0 2007.0 744.0 2007.0 2.0 ... 4028049.0 1493208.0 4028049.0 4014.0 553536.0 1493208.0 1488.0 4028049.0 4014.0 4.0
4 6.0 1445.0 2.0 484.0 1900.0 676.0 1993.0 831.0 1957.0 2.0 ... 3972049.0 1656183.0 3900301.0 3986.0 690561.0 1626267.0 1662.0 3829849.0 3914.0 4.0

5 rows × 65 columns

# Concatenate the new polynomial features with the existing X features
X_test = pd.concat([test, test_poly], axis=1)
X_train = pd.concat([train, train_poly], axis=1)

# Check the result of the concatenation
X_train.head()
X_test.head()
1st_Flr_SF 2nd_Flr_SF 3Ssn_Porch Bedroom_AbvGr BsmtFin_SF_1 BsmtFin_SF_2 Bsmt_Full_Bath Bsmt_Half_Bath Bsmt_Unf_SF Enclosed_Porch ... 55 56 57 58 59 60 61 62 63 64
0 908 1020 0 4 0.0 0.0 0.0 0.0 1020.0 112 ... 3802500.0 1770600.0 3724500.0 3900.0 824464.0 1734280.0 1816.0 3648100.0 3820.0 4.0
1 1967 0 0 6 0.0 0.0 0.0 0.0 1967.0 0 ... 3908529.0 3888759.0 3908529.0 3954.0 3869089.0 3888759.0 3934.0 3908529.0 3954.0 4.0
2 664 832 0 3 554.0 0.0 1.0 0.0 100.0 0 ... 4024036.0 1331984.0 4024036.0 4012.0 440896.0 1331984.0 1328.0 4024036.0 4012.0 4.0
3 968 0 0 2 0.0 0.0 0.0 0.0 968.0 184 ... 4024036.0 1941808.0 3881610.0 2006.0 937024.0 1873080.0 968.0 3744225.0 1935.0 1.0
4 1394 0 0 3 609.0 0.0 1.0 0.0 785.0 0 ... 3853369.0 2736422.0 3853369.0 1963.0 1943236.0 2736422.0 1394.0 3853369.0 1963.0 1.0

5 rows × 338 columns

Continuing to prepare for modeling, I separated the training data and the prediction data into y_train, X_train, and y_pred, X_pred.

X_train = train[[col for col in train.columns if col!="SalePrice"]].fillna(0)
y_train = train['SalePrice'].fillna(0)

X_pred = test[[col for col in test.columns if col!="SalePrice"]].fillna(0)
y_pred = test['SalePrice'].fillna(0)

Next I used train_test_split to make my training data into a training set and a testing set so that I could check the variance of my model.

X_train, X_test, y_train, y_test = train_test_split(X_train,y_train)

I planned to use three different cross-validated regularization techniques, RidgeCV, LassoCV, and ElasticNetCV, all of which require standardizing the predictors so that variables on larger scales do not have a disproportionate influence on the models.

I set up pipelines that standardize the data with StandardScaler and then fit each of the models.

# Instantiate StandardScaler object
ss = StandardScaler()

RidgeCV Models:

# The 1st pipeline uses RidgeCV
ridge = RidgeCV()

pipe_ridge = Pipeline([
    ('ss', ss),
    ('ridge', ridge)
])

# Fit ridge model on training data with a target that is not log-transformed
pipe_ridge.fit(X_train, np.exp(y_train))

# Score ridge model on training data with a target that is not log-transformed
pipe_ridge.score(X_train, np.exp(y_train))
0.9393211124466319
# Score ridge model on testing data without log-transforming the target
pipe_ridge.score(X_test,np.exp(y_test))
0.9009603187169944

The difference between my ridge model's scores on the training and testing sets ranged from about 0.02 to 0.2 across runs, which indicates that the model's variance is substantial: depending on the train_test_split, it can be noticeably overfit to the training data. I submitted a run in which the R^2 values were above 0.9 for both the training and testing sets.
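
Since that gap depends on the particular split, a more stable picture comes from cross-validation on the training split. Here is a minimal sketch using the already-imported cross_val_score (a diagnostic only, not part of the submission):

# 5-fold cross-validated R^2 of the ridge pipeline on the non-log target
cv_scores = cross_val_score(pipe_ridge, X_train, np.exp(y_train), cv=5)
print(cv_scores.round(3), "mean:", cv_scores.mean().round(3))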

# Make predictions for the prediction set's SalePrice and format them
# for submission to the Kaggle contest.
# This code snippet was inspired by Jamie Quella in my DSI cohort.
ridge_prediction = pd.DataFrame(pipe_ridge.predict(X_pred), 
                                columns=['SalePrice'],
                                index=X_pred['Id']).sort_index()
# Output my RidgeCV predictions to a CSV
ridge_prediction.to_csv('./ridge_submission.csv')

RidgeCV with log transformed target:

On training data first:

# y_train was log transformed in the preprocessing stage
pipe_ridge.fit(X_train, y_train)

# Score model on training data
yhat = pipe_ridge.predict(X_train)
yhat = np.exp(yhat)
r2_score( np.exp(y_train), yhat)
0.954693949599155

Now on Test Data:

# Score the train-fitted pipeline on the held-out test split,
# exponentiating the predictions back to dollars before computing R^2
yhat = pipe_ridge.predict(X_test)
yhat = np.exp(yhat)
r2_score(np.exp(y_test), yhat)
0.9685287628198033

The consistently small (~0.02) difference between the R^2 scores suggests that variance is not a problem for this model. I also really like these R^2 scores because they indicate that the model explains roughly 95-96% of the variation in sale price in my training and testing data.

Next, I made predictions from this model and formatted them for submission:

# Make predictions for the prediction set's SalePrice 
yhat = pipe_ridge.predict(X_pred)
yhat = np.exp(yhat)

# Format predictions for submission to the Kaggle contest.
# This code snippet was inspired by Jamie Quella in my DSI cohort.
ridge_prediction_log_y = pd.DataFrame(yhat, 
                                columns=['SalePrice'],
                                index=X_pred['Id']).sort_index()
# Output the predictions of my RidgeCV model with a log-transformed target to a CSV
ridge_prediction_log_y.to_csv('./ridge_submission_log_y.csv')

LassoCV Models:

# The 2nd pipeline uses LassoCV
lasso = LassoCV()

pipe_lasso = Pipeline([
    ('ss', ss),
    ('lasso', lasso)
])

# Fit lasso model on training data with a target that is not log-transformed
pipe_lasso.fit(X_train,np.exp(y_train))

# Score lasso model on training data with a target that is not log-transformed
pipe_lasso.score(X_train,np.exp(y_train))
0.928759624447597
# Score lasso model on testing data without log-transforming the target
pipe_lasso.score(X_test,np.exp(y_test))
0.8031853311539953

For the lasso model, the gap between the training score (~0.93) and the testing score (~0.80) was larger than for the ridge model above, which indicates that this model is somewhat overfit to the training data.

Even so, considering its scores on both the training and the testing data, the LassoCV model without a log-transformed target performed about as well as the comparable (non-log-transformed) RidgeCV model, so I submitted its predictions too.

# Make predictions for the prediction set's SalePrice and format them
# for submission to the Kaggle contest.
# This code snippet was inspired by Jamie Quella in my DSI cohort.
lasso_prediction = pd.DataFrame(pipe_lasso.predict(X_pred), 
                                columns=['SalePrice'],
                                index=X_pred['Id']).sort_index()
# Output my LassoCV predictions to a CSV
lasso_prediction.to_csv('./lasso_submission.csv')

ElasticNetCV Models:

# The 3rd pipeline uses ElasticNetCV
enet = ElasticNetCV()

pipe_enet = Pipeline([
    ('ss', ss),
    ('enet', enet)
])

# Fit enet model on training data with a y that is not log-transformed
pipe_enet.fit(X_train,np.exp(y_train))

# Score enet model on training data with a target that is not log-transformed
pipe_enet.score(X_train,np.exp(y_train))
0.2658321655459208

The ElasticNetCV model had a much worse score than either of the other models, so I didn't submit it. I think it's my fault for not optimizing it, but…

I'm taking the GRE in two days, so I'm going to call it quits for now and be satisfied with my existing models.

(As of now my best model is in 6th place on the leaderboard out of about 70, which is among the best evidence I’ve ever seen for divine intervention on my behalf).

Worthy paths for further exploration include a better-optimized ElasticNetCV implementation and a LassoCV model with a log-transformed target.
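
If I were to continue, a minimal sketch of that ElasticNet tuning might look like the following, letting ElasticNetCV cross-validate over several l1_ratio values and fitting on the log-transformed target (the specific l1_ratio grid and max_iter value are illustrative guesses, not tuned choices):

# Let ElasticNetCV search over the mix between the L1 and L2 penalties
enet_tuned = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], max_iter=5000)

pipe_enet_tuned = Pipeline([
    ('ss', StandardScaler()),
    ('enet', enet_tuned)
])

# Fit and score on the log-transformed target, as with the second ridge model
pipe_enet_tuned.fit(X_train, y_train)
pipe_enet_tuned.score(X_test, y_test)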

Thank you for reading this post; I welcome your comments.

Written on June 30, 2018