I am new to machine learning and am currently studying the gradient descent method and its application to linear regression. Gradient descent is an iterative method for finding $\underset{\theta}{\operatorname{argmin}}\,J(\theta)$, the parameters minimizing the least-squares cost function $$ J(\theta)=\frac{1}{2}\sum_{i=1}^{n}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^{2} \tag1$$
However, I noticed an explicit, non-iterative scheme in Andrew Ng's lecture notes (https://see.stanford.edu/materials/aimlcs229/cs229-notes1.pdf), where he concludes on page $11$ that the minimizing vector $\theta$ is given in closed form by $$ \theta=(X^{T}X)^{-1}X^{T}y \tag2$$
Terminology:
- $\theta$ are the parameters.
- $n$ is the number of training examples.
- $x$ denotes the input features and $y$ the output variable.
- $(x^{(i)},y^{(i)})$ is the $i^{th}$ training example.
- $X$ is the design matrix whose $i^{th}$ row is the training input $(x^{(i)})^{T}$: $\mathbf{X}=\begin{bmatrix} --(x^{(1)})^{T}-- \\ --(x^{(2)})^{T}-- \\ \vdots\\ --(x^{(n)})^{T}-- \end{bmatrix} $
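As I understand it, assuming the linear hypothesis $h_{\theta}(x)=\theta^{T}x$, equation $(2)$ comes from writing $(1)$ in matrix form and setting its gradient to zero (the "normal equations" derivation leading up to page $11$ of the notes): $$ J(\theta)=\frac{1}{2}(X\theta-y)^{T}(X\theta-y),\qquad \nabla_{\theta}J(\theta)=X^{T}X\theta-X^{T}y=0\;\Longrightarrow\;\theta=(X^{T}X)^{-1}X^{T}y $$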
I have two questions.
$(1)$: Given the three points $(1,2)$, $(2,1.5)$, and $(3,2.5)$, could someone demonstrate the procedure for finding $\theta$ by solving $\theta=(X^{T}X)^{-1}X^{T}y$?
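For concreteness, my own attempt at setting this up (assuming a hypothesis with an intercept term, $h_{\theta}(x)=\theta_{0}+\theta_{1}x$, so each input is augmented with a leading $1$) is $$ X=\begin{bmatrix}1 & 1\\ 1 & 2\\ 1 & 3\end{bmatrix},\quad y=\begin{bmatrix}2\\ 1.5\\ 2.5\end{bmatrix},\quad X^{T}X=\begin{bmatrix}3 & 6\\ 6 & 14\end{bmatrix},\quad X^{T}y=\begin{bmatrix}6\\ 12.5\end{bmatrix}, $$ which, if my arithmetic is right, gives $\theta=(X^{T}X)^{-1}X^{T}y=\begin{bmatrix}1.5\\ 0.25\end{bmatrix}$, i.e. the fitted line $y=1.5+0.25x$. I would appreciate confirmation that this is the intended procedure.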
$(2)$: I would also appreciate a Python, MATLAB, or C++ implementation of this procedure.
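For reference, here is a minimal Python/NumPy sketch of the kind of implementation I have in mind (the intercept column and the use of `np.linalg.solve` instead of an explicit inverse are my own assumptions, not something taken from the notes):

```python
import numpy as np

# Training examples (x, y) from question (1).
points = np.array([[1.0, 2.0],
                   [2.0, 1.5],
                   [3.0, 2.5]])

# Design matrix X: a leading column of ones for the intercept theta_0,
# followed by the input feature x.
X = np.column_stack([np.ones(len(points)), points[:, 0]])
y = points[:, 1]

# Normal equation: theta = (X^T X)^(-1) X^T y.
# Solving the linear system X^T X theta = X^T y avoids forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)

print(theta)  # expected: [1.5  0.25], i.e. the fitted line y = 1.5 + 0.25 x
```

I understand `np.linalg.lstsq(X, y, rcond=None)` would solve the same least-squares problem more robustly, but I would like to see the normal-equation version spelled out.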