Suppose each row of the $m\times n$ matrix $X$ (with $m>n$) is the feature vector of one data point, and the $m\times 1$ vector $y$ holds the target value of each data point. As is well known, we can fit the data linearly with a parameter vector $w$ of dimension $n\times 1$ by solving the minimization problem $$\min_{w} \Vert y-Xw \Vert^2,$$ which yields the optimal solution $w^*=(X^TX)^{-1}X^Ty=X^{\dagger}y$ (assuming $X^TX$ is invertible). This is also known as the least-squares estimate.
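For concreteness, here is a minimal numpy sketch of this single-output case; the data are random and the variable names are my own, purely to illustrate the formula above:

```python
import numpy as np

# Illustrative sketch: fit w by least squares for a tall X (m > n) and targets y.
rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.normal(size=(m, n))        # each row is a feature vector
y = rng.normal(size=m)             # scalar target per data point

# Closed-form solution w* = (X^T X)^{-1} X^T y = X^+ y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)
w_pinv   = np.linalg.pinv(X) @ y
w_lstsq  = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(w_closed, w_pinv), np.allclose(w_pinv, w_lstsq))
```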
But my question is: what if the parameters are not an $n\times 1$ vector but an $n\times k$ matrix $W$ instead?
So the problem is now formulated with the data values collected in $Y$, an $m\times k$ matrix; that is, the target for each data point is $k$-dimensional. The parameters to be estimated, $W$, form an $n\times k$ matrix. Will the estimate of $W$ still be $W^*=(X^TX)^{-1}X^TY=X^{\dagger}Y$? If this makes sense, what is the physical meaning of this solution? Or is $W^*$ the optimal solution to some expression, just like the optimization problem defined above?
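To make the question concrete, this is the kind of numerical check I have in mind (again with random data and my own names): compare $X^{\dagger}Y$ against fitting each column of $Y$ separately by ordinary least squares.

```python
import numpy as np

# Hypothetical check: does W* = X^+ Y agree with solving k separate
# least-squares problems, one per column of Y?
rng = np.random.default_rng(1)
m, n, k = 100, 5, 3
X = rng.normal(size=(m, n))
Y = rng.normal(size=(m, k))

W_matrix  = np.linalg.pinv(X) @ Y                      # candidate W* = X^+ Y
W_columns = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0]        # column-by-column fits
     for j in range(k)]
)

print(np.allclose(W_matrix, W_columns))
```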