When I studied regression using linear least squares (LLS), the solution to $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$ was always presented as
$$\boldsymbol{\hat{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \tag{1}$$
but I have recently learnt that a general solution to the least squares problem is given by
$$\boldsymbol{\beta}_{LS} = \mathbf{X}^{+}\mathbf{y} + (\mathbf{I} -\mathbf{X}^{+}\mathbf{X})\mathbf{z} \tag{2}$$
where $\mathbf{X}^{+}$ is the Moore–Penrose (pseudo-)inverse (obtained, for example, via SVD of $\mathbf{X}$).
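To convince myself that $(2)$ really is a family of least-squares solutions, I ran a small NumPy sketch (the design matrix and response below are arbitrary made-up numbers, with `np.linalg.pinv` playing the role of $\mathbf{X}^{+}$; the helper `rss` is just my own shorthand):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary rank-deficient design: 6 rows, 4 columns, but column 4 = column 1 + column 2
X = rng.normal(size=(6, 3)) @ np.array([[1, 0, 0, 1],
                                        [0, 1, 0, 1],
                                        [0, 0, 1, 0]])
y = rng.normal(size=6)

X_pinv = np.linalg.pinv(X)      # Moore-Penrose pseudo-inverse (computed via SVD)
beta_pinv = X_pinv @ y          # the particular solution X^+ y

# Any z plugged into the homogeneous term of (2) gives another solution
z = rng.normal(size=4)
beta_general = X_pinv @ y + (np.eye(4) - X_pinv @ X) @ z

rss = lambda b: np.sum((y - X @ b) ** 2)   # residual sum of squares
print(rss(beta_pinv), rss(beta_general))   # the two values agree (same minimal RSS)
```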
As I understand it, in cases where $\mathbf{X}$ has linearly independent columns, the matrix $\mathbf{X}^{T}\mathbf{X}$ is invertible and the pseudo-inverse matrix is equal to
$$\mathbf{X}^{+} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$$
which means that $(1)$ equals the first term on the right-hand side of $(2)$. But why do we never see the second term on the right-hand side of $(2)$ being used or even mentioned in LLS solutions? It seems that this homogeneous part can be used to derive solutions that have other desirable properties (in addition to being least-squares), like the minimum-variance solution given by:
$$\boldsymbol{\beta}_{MV} = \mathbf{X}^{+}\mathbf{y} + \mu(\mathbf{I} -\mathbf{X}^{+}\mathbf{X})\mathbf{1} \tag{3}$$ where
$$ \mu = \frac{\mathbf{1}^T\mathbf{X}^{+}\mathbf{y}}{\mathbf{1}^T\mathbf{X}^{+}\mathbf{X}\mathbf{1}}.$$
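Since $(3)$ is just $(2)$ with $\mathbf{z} = \mu\mathbf{1}$, it should still minimize the residual sum of squares. Here is the quick check I did (same arbitrary rank-deficient design as above; I only verify the RSS and look at the spread of the coefficients, without claiming anything about optimality):

```python
import numpy as np

rng = np.random.default_rng(1)

# Same kind of made-up rank-deficient design: 6 rows, 4 columns of rank 3
X = rng.normal(size=(6, 3)) @ np.array([[1, 0, 0, 1],
                                        [0, 1, 0, 1],
                                        [0, 0, 1, 0]])
y = rng.normal(size=6)
ones = np.ones(4)

X_pinv = np.linalg.pinv(X)
beta_ls = X_pinv @ y                                       # plain pseudo-inverse solution

mu = (ones @ X_pinv @ y) / (ones @ X_pinv @ X @ ones)      # scalar from (3)
beta_mv = X_pinv @ y + mu * (np.eye(4) - X_pinv @ X) @ ones

rss = lambda b: np.sum((y - X @ b) ** 2)
print(rss(beta_ls), rss(beta_mv))        # equal: (3) is still a least-squares solution
print(np.var(beta_ls), np.var(beta_mv))  # spread of the coefficient entries differs
```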
And what happens when $\mathbf{X}$ does not have linearly independent columns? In this case, I think I was taught that there is "multicollinearity in the system" (or something like that) and that there are no solutions (or, if they are computed anyway, they will be "volatile", or the model will be "overfitted", or the standard errors will be large, or some other "dangers"). But $(2)$ tells us that there are solutions even in this case. I am confused: are there solutions in this case or not? Is there something "special" about the solutions in this case (if they exist)? And how do we reconcile this with all of those "dangers" we were warned about?
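To illustrate the tension I am asking about, here is a small sketch with perfect multicollinearity (again arbitrary numbers): the textbook formula $(1)$ cannot be evaluated because $\mathbf{X}^T\mathbf{X}$ is singular, yet the pseudo-inverse still returns a perfectly finite solution.

```python
import numpy as np

rng = np.random.default_rng(2)

# Perfect multicollinearity: the third column is the sum of the first two
A = rng.normal(size=(10, 2))
X = np.column_stack([A, A.sum(axis=1)])
y = rng.normal(size=10)

print(np.linalg.matrix_rank(X.T @ X))     # 2, not 3: X'X is singular
# np.linalg.inv(X.T @ X)                  # raises LinAlgError or returns numerical garbage

beta = np.linalg.pinv(X) @ y              # the SVD route still gives an answer
print(beta, np.sum((y - X @ beta) ** 2))  # finite coefficients, finite (minimal) RSS
```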
Add 1 (Following the comments)
Another case taught in elementary classes is that of an underdetermined system, where we are told that when there are more unknowns than equations, there exists an infinite number of solutions and one has to impose some further constraint to obtain a "particular" solution. One such constraint is that the solution has the minimum norm among all possible solutions, i.e. the one that minimizes $||\boldsymbol{\beta}||_2$, and we are given the following form for it:
$$\boldsymbol{\beta}_{MN} = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y} \tag{4}$$
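For a "fat" $\mathbf{X}$ with full row rank, formula $(4)$ appears to coincide with the pseudo-inverse solution $\mathbf{X}^{+}\mathbf{y}$. A minimal numerical check (random made-up system of 3 equations in 5 unknowns):

```python
import numpy as np

rng = np.random.default_rng(3)

# Underdetermined system: 3 equations, 5 unknowns, full row rank
X = rng.normal(size=(3, 5))
y = rng.normal(size=3)

beta_mn   = X.T @ np.linalg.inv(X @ X.T) @ y   # formula (4)
beta_pinv = np.linalg.pinv(X) @ y              # SVD / pseudo-inverse route

print(np.allclose(beta_mn, beta_pinv))         # True: (4) is just X^+ y here
print(np.allclose(X @ beta_mn, y))             # True: residuals are zero up to rounding
```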
Yet the form of the general LS solution in $(2)$ remains unchanged when we go from a "thin" $\mathbf{X}$ (more rows than columns) to a "fat" $\mathbf{X}$ (more columns than rows). So I suspect it is not the number of rows versus columns that matters, but something else (the column/row rank, perhaps?).
(It also means that $(4)$ is itself a least-squares solution, since the sum of squared residuals is $0$ in this case, i.e. the minimum possible, but this is never mentioned either.)
So, how come the SVD approach always gives us a solution, while the "classical" formulas give one only when the matrix has full rank? And what happens to the solutions as we move "between" full column rank and full row rank?
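For reference, this is the little experiment behind the question (the helper `describe` and all the matrices are just my own improvised examples): the pseudo-inverse solution exists in all three regimes, while $(1)$ only applies in the first and $(4)$ only in the third.

```python
import numpy as np

rng = np.random.default_rng(4)

def describe(X, y):
    """Fit via the pseudo-inverse and report shape, rank, and residual sum of squares."""
    n, p = X.shape
    r = np.linalg.matrix_rank(X)
    beta = np.linalg.pinv(X) @ y                 # always defined, whatever the rank
    print(f"n={n}, p={p}, rank={r}, RSS={np.sum((y - X @ beta) ** 2):.4f}")

A = rng.normal(size=(8, 3))
y_tall = rng.normal(size=8)

describe(A, y_tall)                                        # tall, full column rank: (1) also works
describe(np.column_stack([A, A[:, 0] + A[:, 1]]), y_tall)  # tall, rank-deficient: (1) is not computable
describe(rng.normal(size=(3, 8)), rng.normal(size=3))      # fat, full row rank: (4) also works, RSS = 0
```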