In this post we are going to talk about Linear Regression, one of the most widely used statistical tools in Machine Learning. The idea is simple: we have some features, and we want to know how our predictions change as the feature values change. For house prices, the features are the square footage of the house, the number of bathrooms, the number of bedrooms, and so on, and the observation is the price of the house.
So we want to generate a model that takes features as input and outputs the predicted price.
First, let’s explain a naive method to create such an application. We will only consider the square feet of the house as a feature and we will try to predict the price of the house. Let’s take our observations and make a plot of them.
The X axis represents the “feature”, square feet. It is also called the “covariate” or “predictor”. The Y axis represents the “observation” that we collect. It is also called the “response” or “dependent variable”. Each point on the graph represents a previous house sale.
So the question is: how are we going to use these observations to estimate the price of a house? One way is to look at how big the house is and look for the price range of similarly sized houses, as shown below.
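A rough sketch of this naive approach in plain Python may help. The sales figures are made up for illustration, and `naive_estimate` is a hypothetical helper, not part of GraphLab Create. It averages the prices of the two past sales closest in size to the house we care about.

```python
# Made-up past sales for illustration: (square feet, sale price in dollars).
sales = [
    (1000, 200000),
    (1500, 290000),
    (2000, 410000),
    (2500, 480000),
    (3000, 620000),
]


def naive_estimate(sqft, sales, k=2):
    """Average the prices of the k sales closest in size to `sqft`."""
    nearest = sorted(sales, key=lambda sale: abs(sale[0] - sqft))[:k]
    return sum(price for _, price in nearest) / len(nearest)


# The two closest sales are the 2000 and 2500 square foot houses.
print(naive_estimate(2100, sales))  # 445000.0
```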
The problem with that approach is that we base our estimate on only 2 house sales. We are throwing out all the other house sales, and the question is whether that approach is reasonable.
Of course not. That approach ignores all the other observations, as if they had nothing to do with our prediction. Instead, we can model the relationship between the square footage of the house and its sale price. To do this we are going to use Linear Regression.
Our main goal is to understand the relationship between the square footage of the house and its sale price. The simplest model is a straight line fitted to the data.
This line is defined by: y = W0 + W1 * X
W0 is the intercept and W1 is the weight (the slope) on the feature X. The intercept and the slope are the parameters of our model.
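As a minimal sketch, the line model can be written as a one-line function, assuming the usual form of intercept plus slope times the feature. The name `predict_price` and the parameter values are hypothetical, chosen only to show how the two parameters produce a prediction.

```python
def predict_price(sqft, w0, w1):
    """Return the price predicted by the line: w0 + w1 * sqft."""
    return w0 + w1 * sqft


# Hypothetical parameters: a $50,000 intercept and $200 per square foot.
w0, w1 = 50000.0, 200.0
print(predict_price(2000, w0, w1))  # 450000.0
```

Changing `w0` shifts the whole line up or down, while changing `w1` changes how much each extra square foot adds to the predicted price.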
So now the question is: which line is the best line? To find the best fit, we need to define a cost for a given line. We will use the Root Mean Squared Error (RMSE) as our cost and look for the line that minimizes it.
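To make the cost concrete, here is a small sketch in plain Python with made-up data. `rmse` computes the cost of a given line, and `fit_line` uses the standard closed-form least-squares solution for a single feature. That closed form is an assumption chosen for illustration and is not necessarily the solver GraphLab Create uses internally.

```python
import math


def rmse(xs, ys, w0, w1):
    """Root mean squared error of the line w0 + w1 * x on the data."""
    squared_errors = [(y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def fit_line(xs, ys):
    """Closed-form least-squares intercept and slope for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w1 = covariance / variance
    w0 = mean_y - w1 * mean_x
    return w0, w1


# Made-up data for illustration: square feet and sale prices.
sqft = [1000, 1500, 2000, 2500, 3000]
price = [200000, 290000, 410000, 480000, 620000]

w0, w1 = fit_line(sqft, price)
best_cost = rmse(sqft, price, w0, w1)
guess_cost = rmse(sqft, price, 0.0, 200.0)  # an arbitrary hand-picked line
print(w0, w1)
print(best_cost < guess_cost)  # True
```

Minimizing RMSE picks the same line as minimizing the sum of squared errors, because the square root and the division by the number of points do not change which line comes out lowest.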
Now we are ready to make predictions with some real data. First of all, click here to download the dataset, which contains house sales in King County, the region where the city of Seattle, WA is located. Then open up your IPython Notebook. If you are not familiar with IPython Notebook and GraphLab Create, I strongly encourage you to read this post.
We start by importing the GraphLab Create library, then we load our data.
You can view the data in the IPython notebook by typing:
This will show the first few lines of the data.
Let’s explore the data a little more. We know that the house price is correlated with the number of square feet of living space. Let’s show this on a scatter plot.
Now it is time to create a simple linear regression model of sqft_living to price.
- We need to split the data into a training set and a test set.
- We will use seed = 0 so that everyone running this notebook gets the same results. In practice, you may set a random seed (or let GraphLab Create pick a random seed for you).
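The effect of a fixed seed can be sketched in plain Python. The helper `split_rows` is hypothetical and is not the GraphLab Create API, and the 80/20 fraction is an assumption made only for this example. The point is that the same seed always produces the same split.

```python
import random


def split_rows(rows, fraction, seed):
    """Randomly send roughly `fraction` of the rows to the first list."""
    rng = random.Random(seed)  # a private generator, so the split is repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))  # stand-in for the house sales rows
train_data, test_data = split_rows(rows, 0.8, seed=0)
train_again, test_again = split_rows(rows, 0.8, seed=0)

print(train_data == train_again)  # True
print(len(train_data) + len(test_data))  # 1000
```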
Now we can build our model using the linear_regression.create function with only sqft_living as a feature.
After that, we can evaluate our model to see how well we are doing.
The output will be very close to the following:
Wouldn’t it be great to see what our predictions look like? Surely it would. We will use Matplotlib, a Python plotting library, to visualize our predictions. You can install it with: ‘pip install matplotlib’
This is what the output will look like:
The blue dots represent the original data, and the green line represents the predictions from the simple regression.
Let’s create another model using more features in order to come up with better predictions.
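With several features, a linear model still predicts in the same way: an intercept plus one weight per feature. The sketch below is plain Python with a hypothetical house and hypothetical weights. The numbers are not fitted to the King County data.

```python
def predict(house, intercept, weights):
    """Linear prediction: intercept plus weight * value for each feature."""
    return intercept + sum(w * house[name] for name, w in weights.items())


# A hypothetical house and hypothetical model parameters.
house = {"sqft_living": 2000, "bedrooms": 3, "bathrooms": 2}

one_feature = predict(house, -40000.0, {"sqft_living": 280.0})
more_features = predict(
    house,
    60000.0,
    {"sqft_living": 300.0, "bedrooms": -50000.0, "bathrooms": 10000.0},
)

print(one_feature)    # 520000.0
print(more_features)  # 530000.0
```

Each extra feature gives the model another weight to adjust, which is why a model with more features can usually fit the training data more closely.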
We can also see a summary of our features with the .show() function.
Now we will find the most expensive zip code in our data set. For that, we will visualize the data in the BoxWhisker view.
Here is the output:
Pull the bar at the bottom to view more of the data. 98039 is the most expensive zip code.
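The same question can be answered numerically by grouping prices by zip code. The sketch below uses plain Python with made-up prices, and it ranks zip codes by median price, which is an assumption for this example (the center line of a box-and-whisker plot is the median).

```python
from collections import defaultdict
from statistics import median

# Made-up (zip code, price) rows for illustration; not the real data.
rows = [
    ("98039", 2000000), ("98039", 1500000), ("98039", 2500000),
    ("98004", 1200000), ("98004", 900000),
    ("98178", 300000), ("98178", 250000), ("98178", 350000),
]

# Collect all prices seen for each zip code.
prices_by_zip = defaultdict(list)
for zipcode, price in rows:
    prices_by_zip[zipcode].append(price)

# Rank zip codes by their median price.
median_by_zip = {z: median(p) for z, p in prices_by_zip.items()}
most_expensive = max(median_by_zip, key=median_by_zip.get)

print(most_expensive)  # 98039
```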
Next, we will compare our simple square feet model with the model that has a few more features.
As you can see from the output, the RMSE goes down from $255,196 to $179,542 with more features.
We can now build a new and even better regression model. Then we will compare all 3 models.
With this model we will have a lower RMSE and better predictions. Let’s evaluate it using the test set.
You will immediately see that the RMSE goes down to $156,813.
Now we come to the most fun part: applying the trained models to predict the price of a house.
We will choose a house from our test set. The first house that we will use is considered an “average” house in Seattle.
Let’s apply our models.
The model with more features provides a better prediction than the simpler model with only 1 feature. However, it seems we can make much better predictions.
Now let’s see how our advanced model is doing.
Our advanced model did a great job! The original price of the house was $2,200,000 and we predicted $2,115,905 which is pretty reasonable!
Finally, note that in some cases the model with more features may give a worse prediction than the simpler model with only 1 feature. On average, however, the model with more features is better. Your predictions may also differ slightly from the ones shown here.