Regression Analysis is a statistical process that involves identifying the relationship between independent and dependent attributes. The first studies of it dates to the early 19th Century by Legendre using Least Square Methods. Since then, various methods of estimating the coefficients of regression were developed. The purpose of this article is to build and understand the internal working of linear regression using a gradient descent approach. The goal is two-fold- a) to give an intuitive understanding of linear regression internals using gradient descent approach and b) To be comfortable with the concept of gradient descent which is very most frequently used for building Neural Network architectures.
Assume that we have data that has m data points and n features/attributes. Therefore, our data is now a matrix of size mXn. But, for computational purposes, it is better to arrange the samples by columns and features/attributes as rows
Let’s understand our goal- in a supervised approach given X, we need to predict y. If the model is good, then the difference between the actual and the predicted values would be small else the difference would be large. This difference is known as error. For a regression problem, the difference is computed as a squared error. The squared error is chosen because of the convex nature of the function.
For each data point i, the regression equation that predicts the outcome is y ̂ = w1X1 + w2X2 + …… + wmXm + C. From the equation it is evident that if there are m independent attributes, then we need m W’s (weights/coefficients which signify the importance of X’s) and a constant/intercept. This y ̂ is technically a dot product between a vector of W’s and vector of X’s (remember vector of X’s are nothing but the value of each attribute for that data point. So, it can be vectorially represented as
where w1, w2 …. are the weights of the features. Xnm represents the nth feature value of mth data point, b is the intercept value to be added.
Observe the dimensions of the W’s matrix (1 X n), samples matrix (n X m), intercept ( 1 X m) and predictions matrix (1 X m). Mathematically, the equation is written as
Here to make predictions, we have X’s (from the data) and all we need to compute are W’s and b. Now, the question boils down to how to get them such that they make a good model (the error is small). The loss function is defined as the error for a single data point is (yi-y ̂i)2 . Since we focus on the overall error on the data and not for an individual data point, this is computed as
where m is the number of data points. This is known as cost function and it should be minimized. The parameters are W’s and b (from which y ̂ is computed) which impact the cost.
Here is a step-by-step procedure to solve the problem
We start with random initialization of W’s and b. Let’s say all W’s and b’s be initialized with 0
With these initialized values, we compute the y ̂ using (eq. 1).
Then compute the cost function using (eq.2)
Observe that this cost function is impacted/influenced by all the W’s and b. Now, to what extent each of Wi or b are affecting the cost function i.e. How much is the impact of W1 on cost, how much is the impact of W2 on cost and so on. How can we compute it? a. This can be done by computing a partial derivative of the cost function w.r.t each of the Wi. b. This is important because we can use this computed value to update corresponding Wi and b, and the updated W and b are used to compute the y ̂ again. The intuition is that when the loss is used to update W’s and b, we are driving the W’s and b in the direction of minimizing the cost. This approach is known as gradient descent.
Here is the ideal W is where cost is minimum. Observe that, this is for one variable Wi (hence is represented in a 2-D) but in our case, there are multiple Ws and hence the plot is difficult to put on paper, but the intuition remains the same.
5. So how do we update the weights? a. As mentioned previously, we have computed the partial derivatives of cost w.r.t W’s and b. During the update phase, we need to decide how much should we update. Should add/subtract the entire value (∂cost/ ∂Wi) for all i which is equal to the number of features/attributes from their corresponding Wi or only a part of it?
i. Imagine if we need to reach point B from point A. If we take large strides, then there is a possibility that we might cross point B and if we take very small strides, then it might take a long time to reach point B. So, our step size should be appropriate. This step size in this context is known as Learning Rate (denoted by α).
b. We now update the value of W’s and b as follows Wupdated(i) = Wi – α*∂cost/ ∂Wi bupdated = bi – α*∂cost/ ∂b (eq. 3 & 4)
c. Let’s assume we have a function z= xy. Then ∂z/ ∂x =y ∂x/ ∂x and ∂z/ ∂y = x ∂y/ ∂y . You might know this from differential calculus. Similarly, if W2 is coefficient for feature X2 then ∂cost/ ∂W2 = X2 and so on.
Here “*” represents the matrix multiplication between the matrix X and the vector of residuals or errors for all the data points (observe that the Y are in upper case which implies that it is a vector of y’s of all data points) and similarly
6. With the updated values of W’s and b, the Y ̂ are predicted and cost is computed from (eq. 1) and (eq. 2) respectively. Then the W’s and b are updated using (eq. 3 & 4) and this process continues iteratively, till cost function is minimized.
7. The values of W’s and b at the least cost function value would then be the coefficients of the features/attributes and intercept respectively as shown in image 1.
Algorithm for Linear Regression using Gradient Descent
Initialize W’s and b
Step 2: Compute the y ̂ using the function WTX + b
Step 3: Compute the cost from the actual and predicted values using function
Step 4: Calculate gradient of cost function w.r.t w’s and b. Then update the w’s and b using the equations
Wupdated(i) = Wi – α*∂cost/ ∂Wi bupdated = bi – α*∂cost/ ∂b
5: Repeat Step 2 to Step 4 until the
cost is minimized