This is just an intuitive overview; Michael Zubilevsky's lectures go through the material in more technical detail.
Consider the quadratic objective $f(x) = \tfrac{1}{2} x^T Q x - b^T x$ with $Q$ symmetric positive definite; minimizing $f$ is equivalent to solving $Qx = b$. Geometrically, CG tries to "unwarp" the geometry induced by the quadratic form $Q$.
To build intuition, consider $Q \in \mathbb{R}^{2\times 2}$. If $Q = I$, the contours of $f$ are circles. If $Q$ is diagonal, the contours are axis-aligned ellipses. If $Q$ also has off-diagonal components, there is "correlation" between the $x$ and $y$ directions, so the contours are ellipses whose principal axes are combinations of the $x$- and $y$-axis.
Now consider performing coordinate descent in the $Q = I$ or $Q$ diagonal case. We can reach the optimum in at most $2$ steps by viewing the problem as two $1$-dimensional minimization problems, one per coordinate. This works because the coordinate axes are orthogonal and "align" with the geometry of $Q$: for diagonal $Q$, they are exactly its eigenvector directions.
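As a small sanity check of this claim (an illustrative setup I chose, not from the notes): for a diagonal $Q$, each coordinate step of exact coordinate descent is a 1-D minimization, and two steps land exactly on $x^* = Q^{-1}b$.

```python
import numpy as np

# Illustrative example: exact coordinate descent on f(x) = 1/2 x^T Q x - b^T x
# with a diagonal SPD Q. Each step is an exact 1-D minimization over one
# coordinate, so two steps reach the optimum x* = Q^{-1} b.
Q = np.diag([4.0, 1.0])          # diagonal quadratic form (assumed values)
b = np.array([2.0, 3.0])
x = np.zeros(2)

for i in range(2):
    r = b - Q @ x                # negative gradient of f at x
    x[i] += r[i] / Q[i, i]       # exact 1-D minimization along coordinate i

x_star = np.linalg.solve(Q, b)
print(np.allclose(x, x_star))    # True: two coordinate steps suffice
```

With off-diagonal entries in $Q$, the same two coordinate steps would no longer land on the optimum, which is exactly the gap CG fills.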
When $Q$ has off-diagonal components, CG finds directions that are orthogonal in $Q$'s geometry, i.e., in the inner product induced by $Q$: directions $p_i, p_j$ with $p_i^T Q p_j = 0$ for $i \neq j$, called *conjugate* directions. Along these directions, coordinate descent again works. We can find them by running Gram-Schmidt in the $Q$-inner product; if the candidate directions we orthogonalize are the successive gradients, the update rule simplifies dramatically.
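This construction can be sketched directly (a minimal sketch of the conjugate-direction idea via explicit Gram-Schmidt, not the streamlined CG recurrence; the matrix and vector values are assumptions for illustration):

```python
import numpy as np

# Q-orthogonalize the successive gradients with Gram-Schmidt, using the
# inner product <u, v>_Q = u^T Q v, then do exact line searches along the
# resulting conjugate directions.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])       # SPD with off-diagonal "correlation"
b = np.array([1.0, 5.0])

x = np.zeros(2)
dirs = []
for _ in range(2):
    r = b - Q @ x                          # current negative gradient
    p = r.copy()
    for d in dirs:                         # subtract Q-projections onto old dirs
        p -= (d @ Q @ r) / (d @ Q @ d) * d
    alpha = (p @ r) / (p @ Q @ p)          # exact line search along p
    x += alpha * p
    dirs.append(p)

print(abs(dirs[0] @ Q @ dirs[1]) < 1e-9)     # directions are Q-orthogonal
print(np.allclose(x, np.linalg.solve(Q, b))) # 2 steps solve the 2-D problem
```

The explicit inner loop over all previous directions is what the CG recurrence avoids: when the candidates are the gradients, all but the most recent projection term vanishes.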
CG outperforms gradient descent because conjugacy of the search directions ensures we never undo progress along a direction we have already searched. This eliminates the "zig-zagging" of GD, and in exact arithmetic CG minimizes an $n$-dimensional quadratic in at most $n$ steps.
[EDIT: some additional notes here]