Define the data set to be a sequence $\{(Y_i, X_i)\}_{i=1}^n$ of pairs of real numbers. Fitting a line to this sequence amounts to finding $\theta = (\alpha, \beta) \in \mathbb R^2$ such that the equation $Y_i = \alpha X_i + \beta$ "captures" the relationship between $Y_i$ and $X_i$ for each $i$ in some sense. But in what sense? One way to make this question meaningful is to assume a statistical model. Namely, assume that, for each $i$, $$Y_i = \alpha X_i + \beta + e_i,$$ where each $e_i$ is a random variable with known distribution. In other words, assume that each $X_i$ is a deterministic variable and that $Y_i$ is related to $X_i$ in a linear but "noisy" way. Under this assumption, finding $\theta$ can be cast as finding the $\theta$ that maximizes the "likelihood" of having observed $\{(Y_i, X_i)\}_{i=1}^n$ given that $\theta$ was the parameter of the above statistical model.
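To make the model concrete, here is a minimal sketch in Python that simulates data from it. The particular values of $\alpha$, $\beta$, $\sigma$, and $n$ are made up purely for illustration and are not part of the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) parameters of the statistical model.
alpha, beta, sigma = 2.0, 1.0, 0.5
n = 100

# Deterministic X_i, and noisy Y_i = alpha * X_i + beta + e_i
# with e_i ~ N(0, sigma^2) independent across i.
X = np.linspace(0.0, 10.0, n)
e = rng.normal(loc=0.0, scale=sigma, size=n)
Y = alpha * X + beta + e
```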
One way to define the likelihood is by use of a conditional probability density function (p.d.f.), which we denote by $f(Y_i \; | \; X_i, \theta)$ for each $i$. (I use this notation to emphasize that $X_i$ is deterministic, so it admits no useful notion of probability density. Each $X_i$ will instead play a role in parametrizing the distribution of $Y_i$.) For example, if each $e_i$ is a normal random variable with zero mean and variance $\sigma^2$, and the $e_i$ are all mutually independent, then the $Y_i$ are also independent normal variables with variance $\sigma^2$, but each $Y_i$ has mean $\alpha X_i + \beta$; that is,
$$f(Y_i \; | \; X_i, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(\frac{-(Y_i - \alpha X_i - \beta)^2}{2\sigma^2}\right).$$
Hence, we may define the likelihood of the data to be the following joint conditional p.d.f.:
\begin{align*}
\ell((Y,X); \theta) &= \prod_{i=1}^n f(Y_i \; | \; X_i, \theta) \\
&= \frac{1}{(2\pi)^{n/2}\sigma^n}\exp\left(\frac{-1}{2\sigma^2}\left(\sum_{i=1}^n(Y_i - \alpha X_i - \beta)^2\right)\right).
\end{align*}
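As a sanity check, this joint p.d.f. can be evaluated numerically. The sketch below reuses the simulated `X`, `Y`, `sigma`, `alpha`, and `beta` from the previous snippet and compares the closed-form expression (on the log scale) with the sum of per-observation normal log-densities from `scipy.stats`:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(theta, X, Y, sigma):
    """Log of the joint conditional p.d.f. ell((Y, X); theta)."""
    a, b = theta
    resid = Y - a * X - b
    n = len(Y)
    return (-0.5 * n * np.log(2 * np.pi)
            - n * np.log(sigma)
            - np.sum(resid**2) / (2 * sigma**2))

# Equivalent computation: sum over i of log f(Y_i | X_i, theta).
theta = (alpha, beta)
direct = norm.logpdf(Y, loc=alpha * X + beta, scale=sigma).sum()
assert np.isclose(log_likelihood(theta, X, Y, sigma), direct)
```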
Note that the mutual independence of the $e_i$ was essential for writing this joint p.d.f.; otherwise it would not factorize in the first step above. Since the logarithm is increasing and the constant factor in front of the exponential does not depend on $\theta$, maximizing this likelihood function over $\theta$ is equivalent to minimizing
$$\sum_{i=1}^n(Y_i - \alpha X_i - \beta)^2$$
over $\theta$. (Again, keep in mind that $\theta = (\alpha, \beta)$.) This last expression is the squared-error criterion used in least-squares fitting; the qualifier "least" indicates that we are minimizing it (or, equivalently, maximizing the likelihood above). This equivalence between minimizing squared error and maximizing likelihood holds precisely because the $e_i$ in the statistical model are i.i.d. normally distributed with zero mean and common variance $\sigma^2$. Hence, the normality assumption is essential in motivating the use of least-squares fitting from a statistical standpoint. However, as mentioned by Pantelis Sopasakis in the comments, it is not essential to the actual computation of $\theta$, which is the focus of the derivations in the link you posted.
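To illustrate the equivalence numerically, here is a sketch that again reuses the variables from the snippets above. It maximizes the likelihood with a generic optimizer and, separately, minimizes the squared error via ordinary least squares (`np.polyfit` is just one convenient way to do the latter); both recover essentially the same $(\alpha, \beta)$:

```python
import numpy as np
from scipy.optimize import minimize

# Maximize the likelihood by minimizing the negative log-likelihood.
# The value of sigma does not affect the argmax over (alpha, beta).
def neg_log_lik(theta):
    return -log_likelihood(theta, X, Y, sigma)

mle = minimize(neg_log_lik, x0=np.array([0.0, 0.0])).x

# Minimize the squared-error formula directly (ordinary least squares).
ols = np.polyfit(X, Y, deg=1)  # returns [slope, intercept]

print(mle, ols)  # the two estimates of (alpha, beta) agree
```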