Suppose each row of the $m\times n$ matrix $X$ (with $m>n$) is the feature vector of one data point, and the $m\times 1$ vector $y$ holds the target value of each data point. As is well known, we can fit the data linearly with a parameter vector $w$ of dimension $n\times 1$ by solving the minimization problem $$\min_{w} \Vert y-Xw \Vert^2,$$ which yields the optimal solution $w^*=(X^TX)^{-1}X^Ty=X^{\dagger}y$ (assuming $X^TX$ is invertible). This is also known as the least-squares estimate.
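For concreteness, here is a minimal numpy sketch of this single-output case; the data are random and the variable names are my own, purely to illustrate the formula above:

```python
import numpy as np

# Illustrative sketch: fit w by least squares for a tall X (m > n) and targets y.
rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.normal(size=(m, n))        # each row is a feature vector
y = rng.normal(size=m)             # scalar target per data point

# Closed-form solution w* = (X^T X)^{-1} X^T y = X^+ y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)
w_pinv   = np.linalg.pinv(X) @ y
w_lstsq  = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(w_closed, w_pinv), np.allclose(w_pinv, w_lstsq))
```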
But my question is: what if the parameters are not an $n\times 1$ vector but an $n\times k$ matrix $W$ instead?
So the problem is now formulated with the data values collected in $Y$, an $m\times k$ matrix; that is, the target for each data point is $k$-dimensional. The parameters to be estimated, $W$, form an $n\times k$ matrix. Will the estimate of $W$ still be $W^*=(X^TX)^{-1}X^TY=X^{\dagger}Y$? If this makes sense, what is the physical meaning of this solution? Or is $W^*$ the optimal solution to some expression, just like the optimization problem defined above?
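To make the question concrete, this is the kind of numerical check I have in mind (again with random data and my own names): compare $X^{\dagger}Y$ against fitting each column of $Y$ separately by ordinary least squares.

```python
import numpy as np

# Hypothetical check: does W* = X^+ Y agree with solving k separate
# least-squares problems, one per column of Y?
rng = np.random.default_rng(1)
m, n, k = 100, 5, 3
X = rng.normal(size=(m, n))
Y = rng.normal(size=(m, k))

W_matrix  = np.linalg.pinv(X) @ Y                      # candidate W* = X^+ Y
W_columns = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0]        # column-by-column fits
     for j in range(k)]
)

print(np.allclose(W_matrix, W_columns))
```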