R Squared – A Way Of Evaluating Regression
Regression is a way of fitting a function to a set of data. For instance, maybe you have been using satellites to count the number of cars in the parking lots of a bunch of Walmart stores for the past couple of years. You also know the quarterly sales that Walmart had during that time frame from their earnings reports. You want to find a function that relates the two, so that you can use your satellites to count the number of cars and predict Walmart’s quarterly earnings, in order to get an advantage in the stock market.
To generate that function you would use regression analysis. But after you have generated that function, how can you tell if it is any good? After all, if you are using it to try to predict the stock market, you will be betting real money on it. You need to know: is your model a good fit? A bad fit? Mediocre? One commonly used metric for determining goodness of fit is R squared, or more formally, the coefficient of determination.
This post goes over R2, and by the end you will understand what it is and how to calculate it, but unfortunately you won’t have a good rule of thumb for what R2 value is good enough for your analysis, because it is entirely problem dependent.
What Is R Squared?
We will get into the equation for R2 in a little bit, but first, what is R squared? It is a measure of how much better your regression line is than a simple horizontal line through the mean of the data. In the plot below, the blue line is the data that we are trying to fit a regression to, and the horizontal red line is the average of that data.
The red line is the value that gives the lowest summed squared error to the blue data points, if you had no information about the blue data points other than their y values. This is shown in the plot below. For the blue data points, only the y values are available. You don’t know anything else about those values. You can download the Excel file I used to generate these plots and tables here. If you want to select a single value that gives you the lowest summed squared error, the value that you would select is the mean value, shown as the red triangle. A different way to think about that assertion is this. If I took all 7 of the y points (0, 1, 4, 9, 16, 25, 36), rearranged them into a random order, and made you guess the value of all 7 points before revealing any of their order, what strategy would give you the minimum summed squared error? That strategy is to guess the mean value for all the points.
With regression the question is: now that you have more information (the x values in this case), can you make a better approximation than just guessing the mean value? The R squared value answers the question of how much better you did.
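One quick way to check that claim numerically is to try a range of constant guesses and see which one gives the lowest summed squared error. The following is a minimal sketch in Python. The function name `sum_squared_error` and the grid of candidate guesses are illustrative choices and are not taken from the Excel file.

```python
# The 7 y values from the example above
y_values = [0, 1, 4, 9, 16, 25, 36]

def sum_squared_error(values, guess):
    """Summed squared error when the same guess is used for every value."""
    return sum((v - guess) ** 2 for v in values)

mean_value = sum(y_values) / len(y_values)

# Try every guess from 0 to 36 in steps of 0.5 (an arbitrary grid for illustration)
candidates = [i / 2 for i in range(0, 73)]
best_guess = min(candidates, key=lambda g: sum_squared_error(y_values, g))

print(mean_value, best_guess)  # 13.0 13.0
```

Of all the guesses on the grid, the mean of 13 gives the smallest error. Any other constant guess adds extra error on top of it.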
That is actually a pretty intuitive understanding. First calculate how much error you would have if you don’t even try to do regression, and instead just guess the mean of all the values. That is the total error. It could be low if all the data is clustered together, or it could be high if the data is spread out.
SS stands for summed squared error, which is how the error is calculated. To get the total summed squared error you
- Start with the mean value
- For every data point, subtract that mean value from the data point value
- Square that difference
- Add up all of the squares. This results in the summed squared error
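The steps above can be sketched in a few lines of Python. This is only an illustration, using the 7 y values from earlier in the post (0, 1, 4, 9, 16, 25, 36). The variable names are my own.

```python
y_values = [0, 1, 4, 9, 16, 25, 36]

# Start with the mean value
mean_value = sum(y_values) / len(y_values)

# For every data point, subtract the mean and square the difference
squared_differences = [(y - mean_value) ** 2 for y in y_values]

# Add up all of the squares to get the total summed squared error
ss_total = sum(squared_differences)

print(ss_total)  # 1092.0
```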
Next calculate the error in your regression values against the true values. This is your regression error. Ideally the regression error is very low, near zero.
The ratio of the regression error to the total error tells you how much of the total error remains in your regression model. Subtracting that ratio from 1.0 gives how much error you removed using the regression analysis. That is R2.
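In actual equation form it is

$$R^2 = 1 - \frac{SS_{\text{regression}}}{SS_{\text{total}}}$$

To get the total error, you subtract the mean value from each data point, square the results, and add them up

$$SS_{\text{total}} = \sum_{i} (y_i - \bar{y})^2$$

For the summed squared regression error, the equation is the same except you use the regression prediction instead of the mean value

$$SS_{\text{regression}} = \sum_{i} (y_i - \hat{y}_i)^2$$

Here $y_i$ is each data value, $\bar{y}$ is the mean of the data, and $\hat{y}_i$ is the regression prediction for that point.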
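Put together, the calculation might look like the sketch below. The function name `r_squared` and its arguments are hypothetical names chosen for this illustration. The function assumes the total error is not zero.

```python
def r_squared(actual, predicted):
    """R squared = 1 - (regression error / total error)."""
    mean_value = sum(actual) / len(actual)
    # Total error: squared distance of each point from the mean
    ss_total = sum((y - mean_value) ** 2 for y in actual)
    # Regression error: squared distance of each point from its prediction
    ss_regression = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1.0 - ss_regression / ss_total

y_values = [0, 1, 4, 9, 16, 25, 36]

# Predicting every point exactly leaves no error behind
print(r_squared(y_values, y_values))  # 1.0

# Predicting the mean (13) for every point removes none of the error
print(r_squared(y_values, [13] * len(y_values)))  # 0.0
```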
What is a Good R Squared?
In most statistics books, you will see that an R squared value is always between 0 and 1, and that the best value is 1.0. That is only partially true. The lower the error in your regression analysis relative to total error, the higher the R2 value will be. And the best R2 value is 1.0. To get that value you have to have zero error in your regression analysis.
However R2 is not truly limited to a lower bound of zero.
For practical purposes, the lowest R2 you can get is zero, but only because the assumption is that if your regression line is not better than using the mean, then you will just use the mean value.
Theoretically, however, you could use something else. Let’s say that you wanted to make a prediction on the population of one of the states in the United States. I am not giving you any information other than the population of all the states, based on the 2010 census. That is, I am not telling you the name of the state you are trying to make the prediction on; you just have to guess the population (in millions) of all the states in a random order. The best you could do here is guess the mean value for every state. Your total squared error would be 2247.2.
If you used something other than the mean, for instance the median, the summed squared error would be higher, at 2298.2.
Which when converted into R2 is

$$R^2 = 1 - \frac{2298.2}{2247.2} \approx -0.023$$

And you get a negative R2 number.
The assertion that the R squared value has to be greater than or equal to zero is based on the assumption that if you get a negative R squared value, you will dump whatever regression calculation you are using and just go with the mean value.
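The same effect can be reproduced on a smaller scale. The state population data is not reproduced in this post, so the sketch below uses the 7 toy y values from earlier (0, 1, 4, 9, 16, 25, 36) instead. It guesses the median for every point and compares that against guessing the mean.

```python
import statistics

y_values = [0, 1, 4, 9, 16, 25, 36]

mean_value = statistics.mean(y_values)      # 13
median_value = statistics.median(y_values)  # 9

# Error when guessing the mean for every point (the total error)
ss_total = sum((y - mean_value) ** 2 for y in y_values)
# Error when guessing the median for every point
ss_median = sum((y - median_value) ** 2 for y in y_values)

r2_median = 1 - ss_median / ss_total
print(ss_total, ss_median, round(r2_median, 4))  # 1092 1204 -0.1026
```

The median guess has more summed squared error than the mean guess, so the ratio is greater than 1 and R2 comes out below zero.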
The takeaway
- An R2 of 1.0 is the best. It means you have no error in your regression.
- An R2 of 0 means your regression is no better than taking the mean value, i.e. you are not using any information from the other variables.
- A negative R2 means you are doing worse than the mean value. However, maybe summed squared error isn’t the metric that matters most to you, and this is OK.
R Squared Example
As an example, let’s look at this data
x: 0, 1, 2, 3, 4, 5, 6
y: 0, 1, 4, 9, 16, 25, 36
This data is just the numbers 0 through 6, with the y value being the square of those numbers. The linear regression equation for this data is
y = 6x-5
and is plotted on the graph below
Excel has calculated the R2 of this equation to be 0.9231. How can we duplicate that manually?
Well, the equation is

$$R^2 = 1 - \frac{SS_{\text{regression}}}{SS_{\text{total}}}$$
So we need to find the total summed squared error (based off the mean) and the summed squared error based off the regression line.
The mean value of the y values of the data (0, 1, 4, 9, 16, 25, 36) is 13
To find the total summed squared error, we will subtract 13 from each of the y values, square that result, and add up all of the squares. Graphically, this is shown below. At every data point, the distance between the red line and the blue line is squared, and then all of those squares are summed up.
The total summed squared error is 1092, with most of the error coming from the edges of the chart, where the mean is the farthest away from the true value.
Now we need to find the values that our regression line of y = 6x-5 predicts, and get the summed squared error of those. For the summed squared value, we will subtract each y regression value from the true value, take the square, and sum up all of the squares.
So the summed squared error of the linear regression is 84, and the total summed squared error, based on the mean value, is 1092.
Plugging these numbers into the R2 equation we get

$$R^2 = 1 - \frac{84}{1092} = 0.9231$$
Which is the same value that Excel calculated.
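The whole manual calculation can also be reproduced outside of Excel. The sketch below uses plain Python and the standard least squares formulas for a straight line. It is just one way to check the numbers and is not how Excel computes them internally.

```python
x_values = [0, 1, 2, 3, 4, 5, 6]
y_values = [x ** 2 for x in x_values]  # 0, 1, 4, 9, 16, 25, 36

n = len(x_values)
x_mean = sum(x_values) / n
y_mean = sum(y_values) / n

# Ordinary least squares slope and intercept for a straight line
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(x_values, y_values))
         / sum((x - x_mean) ** 2 for x in x_values))
intercept = y_mean - slope * x_mean
print(slope, intercept)  # 6.0 -5.0

predicted = [slope * x + intercept for x in x_values]

# Total error (against the mean) and regression error (against the line)
ss_total = sum((y - y_mean) ** 2 for y in y_values)
ss_regression = sum((y - p) ** 2 for y, p in zip(y_values, predicted))

r2 = 1 - ss_regression / ss_total
print(ss_total, ss_regression, round(r2, 4))  # 1092.0 84.0 0.9231
```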
A different way to think about the same result is that we have 84/1092 = 7.69% of the total error remaining. Basically, if someone had just given us the y values, and then told us that they were going to put those y values in a random order and we had to guess what they all were, the best guess we could have made was the mean for each one. But if they now give us the x value, and tell us to try to guess the y value, we can use the regression line and remove 92.31% of the error from our guess.
But Wait, Can’t We Do Better?
We just showed a linear regression line that produced an R squared value of 0.9231 and said that it was the best linear fit we could make based on the summed squared error metric. But couldn’t we do better with a different regression fit?
Well, the answer is yes, of course we could. We used a linear regression, i.e. a straight line, on this data. However, the data itself wasn’t linear. Y is the square of x. So if we used a quadratic regression, and in fact just used the equation y = x^2, we would get a much better result.
Here we have an R2 of 1.0, because the regression line exactly matches the data, and there is no remaining error. However the fact that we were able to do this is somewhat beside the point for this R2 explanation. We were able to find an exact match for this data only because it is a toy data set. The Y values were built as the square of the X values, so it is no surprise that making a regression that utilized that fact gave a good match.
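As a small illustration, the two fits can be compared side by side. The helper `r_squared` below is a hypothetical function written for this sketch. The linear predictions use the y = 6x-5 line from the example, and the quadratic predictions use x squared.

```python
def r_squared(actual, predicted):
    """R squared = 1 - (regression error / total error)."""
    mean_value = sum(actual) / len(actual)
    ss_total = sum((y - mean_value) ** 2 for y in actual)
    ss_regression = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_regression / ss_total

x_values = [0, 1, 2, 3, 4, 5, 6]
y_values = [0, 1, 4, 9, 16, 25, 36]

linear_fit = [6 * x - 5 for x in x_values]  # straight line from the example
quadratic_fit = [x ** 2 for x in x_values]  # matches the data exactly

print(round(r_squared(y_values, linear_fit), 4))  # 0.9231
print(r_squared(y_values, quadratic_fit))         # 1.0
```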
For most data sets, an exact match cannot be generated, because the data will be noisy and will not follow a simple equation. For instance, an economist might be doing a study to determine what attributes of a person correlate to their adult profession and income. Some of those attributes could be height, childhood interests, parents’ incomes, school grades, SAT scores, etc. It is unlikely that any of those will have a perfect R2 value; in fact, some of them might be quite low. But there are times when even a low R2 value could be of interest.
Any R2 value above 0.0 indicates that there could be some correlation between the variable and the result, although very low values are likely just random noise.
An Odd Special Case
Just for fun, what do you think the R2 of this linear regression line for this data is?
Here we have a pure horizontal line. All the data is 5.0.
As it turns out, the R squared value of a linear regression on this data is undefined. Excel will display the value as N/A.
What has happened in this example is that the total summed squared error is equal to zero. All the data values exactly equal the mean value. So there is zero error if you just estimate the mean value.
Of course, there is also zero error for the regression line. You end up with a zero divided by zero term in the R squared equation, which is undefined.
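If you compute R2 in your own code, this case has to be handled explicitly or the division will fail. Below is a minimal sketch. Returning NaN is a choice made for this illustration, the function name `r_squared` is hypothetical, and the use of seven data points is an assumption, since only the value 5.0 is given above.

```python
import math

def r_squared(actual, predicted):
    """Return R squared, or NaN when the total error is zero (the 0 / 0 case)."""
    mean_value = sum(actual) / len(actual)
    ss_total = sum((y - mean_value) ** 2 for y in actual)
    ss_regression = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    if ss_total == 0:
        # Every value equals the mean, so there is no error to explain
        return math.nan
    return 1 - ss_regression / ss_total

flat_data = [5.0] * 7        # all the data is 5.0 (seven points assumed)
flat_prediction = [5.0] * 7  # the regression line is also a flat 5.0

print(r_squared(flat_data, flat_prediction))  # nan
```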