I am new to machine learning and am currently studying the gradient descent method and its application to linear regression. Gradient descent is an iterative method for finding $\underset{\theta}{\operatorname{argmin}}\,J(\theta)$, the parameters minimizing the least-squares cost function $$ J(\theta)=\frac{1}{2}\sum_{i=1}^{n}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^{2} \tag1$$
However, I noticed an explicit, non-iterative scheme in Andrew Ng's lecture notes (https://see.stanford.edu/materials/aimlcs229/cs229-notes1.pdf), where he concludes on page $11$ that the minimizing vector $\theta$ is given in closed form by $$ \theta=(X^{T}X)^{-1}X^{T}y \tag2$$
Terminology:
- $\theta$ are the parameters.
- $n$ is the number of training examples.
- $x$ denotes the input features and $y$ the output variable.
- $(x^{(i)},y^{(i)})$ is the $i^{th}$ training example.
- $X$ is the design matrix whose $i^{th}$ row is the training input $(x^{(i)})^{T}$: $\mathbf{X}=\begin{bmatrix} --(x^{(1)})^{T}-- \\ --(x^{(2)})^{T}-- \\ \vdots\\ --(x^{(n)})^{T}-- \end{bmatrix} $
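As I understand it, assuming the linear hypothesis $h_{\theta}(x)=\theta^{T}x$, equation $(2)$ comes from writing $(1)$ in matrix form and setting its gradient to zero (the "normal equations" derivation leading up to page $11$ of the notes): $$ J(\theta)=\frac{1}{2}(X\theta-y)^{T}(X\theta-y),\qquad \nabla_{\theta}J(\theta)=X^{T}X\theta-X^{T}y=0\;\Longrightarrow\;\theta=(X^{T}X)^{-1}X^{T}y $$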
I have two questions.
$(1)$: Given the three points $(1,2)$, $(2,1.5)$, and $(3,2.5)$, could someone demonstrate the procedure for finding $\theta$ by solving $\theta=(X^{T}X)^{-1}X^{T}y$?
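For concreteness, my own attempt at setting this up (assuming a hypothesis with an intercept term, $h_{\theta}(x)=\theta_{0}+\theta_{1}x$, so each input is augmented with a leading $1$) is $$ X=\begin{bmatrix}1 & 1\\ 1 & 2\\ 1 & 3\end{bmatrix},\quad y=\begin{bmatrix}2\\ 1.5\\ 2.5\end{bmatrix},\quad X^{T}X=\begin{bmatrix}3 & 6\\ 6 & 14\end{bmatrix},\quad X^{T}y=\begin{bmatrix}6\\ 12.5\end{bmatrix}, $$ which, if my arithmetic is right, gives $\theta=(X^{T}X)^{-1}X^{T}y=\begin{bmatrix}1.5\\ 0.25\end{bmatrix}$, i.e. the fitted line $y=1.5+0.25x$. I would appreciate confirmation that this is the intended procedure.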
$(2)$: I would also appreciate a Python, MATLAB, or C++ implementation of this procedure.
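For reference, here is a minimal Python/NumPy sketch of the kind of implementation I have in mind (the intercept column and the use of `np.linalg.solve` instead of an explicit inverse are my own assumptions, not something taken from the notes):

```python
import numpy as np

# Training examples (x, y) from question (1).
points = np.array([[1.0, 2.0],
                   [2.0, 1.5],
                   [3.0, 2.5]])

# Design matrix X: a leading column of ones for the intercept theta_0,
# followed by the input feature x.
X = np.column_stack([np.ones(len(points)), points[:, 0]])
y = points[:, 1]

# Normal equation: theta = (X^T X)^(-1) X^T y.
# Solving the linear system X^T X theta = X^T y avoids forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)

print(theta)  # expected: [1.5  0.25], i.e. the fitted line y = 1.5 + 0.25 x
```

I understand `np.linalg.lstsq(X, y, rcond=None)` would solve the same least-squares problem more robustly, but I would like to see the normal-equation version spelled out.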