To begin with, it is worth making some comments. Given a probability space $(\Omega,\mathcal{U},\textbf{P})$, we can think of $\mathcal{U}$ as the information we have at hand about the random phenomenon we are interested in. More precisely, the $\sigma$-algebra $\mathcal{U}$ tells us which events we can observe the occurrence of. So, when we consider a sub-$\sigma$-algebra $\mathcal{V}\subseteq\mathcal{U}$, we are restricting the information available about the random phenomenon under study.
Based on this interpretation, we can regard the conditional expectation $\textbf{E}[X\mid\mathcal{V}]$ as the random variable that best approximates $X$ given the knowledge encoded in $\mathcal{V}\subseteq\mathcal{U}$. This means that $Y := \textbf{E}[X\mid\mathcal{V}]$ should be $\mathcal{V}$-measurable, and $Y$ and $X$ should coincide on average over every measurable set $A\in\mathcal{V}$. That is why $\textbf{E}[X\mid\mathcal{U}]$ equals $X$ ($\textbf{P}$-a.s.): the best approximation of $X$ given all the available information is $X$ itself.
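Making the "coincide on average" requirement precise, these two conditions are exactly the defining properties of conditional expectation: $Y$ is $\mathcal{V}$-measurable and
\begin{align*}
\int_{A}Y\,d\textbf{P} = \int_{A}X\,d\textbf{P}\quad\text{for every }A\in\mathcal{V}.
\end{align*}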
To make this clearer, let us consider the particular case where $Y$ is a simple random variable. This means that we can write $Y$ as a linear combination of indicator functions of measurable sets $D_{1},\ldots,D_{n}$ that partition the sample space $\Omega$:
\begin{align*}
Y(\omega) = \sum_{i=1}^{n}y_{i}1_{D_{i}}(\omega)
\end{align*}
In this context, if we let $\mathcal{D}_{Y} = \{D_{1},D_{2},\ldots,D_{n}\}$, then the conditional expectation is given by:
\begin{align*}
\textbf{E}[X\mid Y](\omega) = \textbf{E}[X \mid \mathcal{D}_{Y}](\omega) = \sum_{i=1}^{n}\textbf{E}[X\mid D_{i}]1_{D_{i}}(\omega)
\end{align*}
In other words, for every $\omega\in D_{i}$ we approximate $X$ by the constant $\textbf{E}[X\mid D_{i}] = \textbf{E}[X1_{D_{i}}]/\textbf{P}(D_{i})$. Approximating $X$ by a constant on each $D_{i}$ is rather crude, but it is the best approximation among those of this type.
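Here is a minimal numerical sketch of the formula above, assuming a finite, uniform sample space and illustrative values for $X$ and $Y$ (none of these numbers come from the discussion, they are just for demonstration):

```python
import numpy as np

# A minimal sketch on a finite sample space Omega = {0, ..., 5}.
# Y is constant on each block D_i of the partition it induces,
# so E[X | Y] is computed by averaging X over each block.

p = np.full(6, 1 / 6)                         # uniform probability measure
X = np.array([1.0, 4.0, 2.0, 8.0, 5.0, 7.0])  # values of X at each omega
Y = np.array([0, 0, 1, 1, 1, 2])              # Y induces the partition D_1, D_2, D_3

E_X_given_Y = np.empty_like(X)
for y in np.unique(Y):
    D = (Y == y)                              # the event D_i = {Y = y_i}
    # E[X | D_i] = E[X 1_{D_i}] / P(D_i)
    E_X_given_Y[D] = np.sum(X[D] * p[D]) / np.sum(p[D])

print(E_X_given_Y)  # [2.5 2.5 5.  5.  5.  7. ] -- constant on each block
```

Note that the result is itself a random variable, constant on each $D_{i}$, exactly as the formula prescribes.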
Generally speaking, given a probability space $(\Omega,\mathcal{U},\textbf{P})$, an integrable $\mathcal{U}$-measurable random variable $X$, and any random variable $Y$ on the same space, we can define the conditional expectation of $X$ given $Y$ as follows:
\begin{align*}
\textbf{E}[X\mid Y] = \textbf{E}[X\mid\sigma(Y)]
\end{align*}
where $\sigma(Y)$ is the $\sigma$-algebra generated by $Y$ (note that $\sigma(Y)\subseteq\mathcal{U}$). From this definition you can recover the usual, elementary definition of conditional expectation that you are acquainted with.
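For instance, when $Y$ is the simple random variable above (with the $y_{i}$ distinct), $\sigma(Y)$ consists exactly of the unions of blocks of the partition:
\begin{align*}
\sigma(Y) = \Big\{\bigcup_{i\in I}D_{i} : I\subseteq\{1,\ldots,n\}\Big\},
\end{align*}
so conditioning on $Y$ is the same as conditioning on the partition $\mathcal{D}_{Y}$.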
Finally, as @OliverDíaz has mentioned, you can formalize this discussion in terms of best approximation in the mean-square sense.
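Concretely, when $X$ is square-integrable, $\textbf{E}[X\mid\mathcal{V}]$ is the orthogonal projection of $X$ onto $L^{2}(\Omega,\mathcal{V},\textbf{P})$:
\begin{align*}
\textbf{E}[X\mid\mathcal{V}] = \operatorname*{arg\,min}_{Z\in L^{2}(\Omega,\mathcal{V},\textbf{P})}\textbf{E}\big[(X-Z)^{2}\big],
\end{align*}
that is, the ($\textbf{P}$-a.s. unique) $\mathcal{V}$-measurable random variable minimizing the mean-square distance to $X$.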