I was looking at Andrew Ng's machine learning course, where for linear regression he defines a hypothesis function $h(x) = \theta_0 + \theta_1x_1 + \dots + \theta_nx_n$, where $x$ is a vector of feature values. The goal of linear regression is to find the $\theta$ that most closely estimates the real result. To measure how wrong the hypothesis is compared to how the data is actually distributed, he uses the squared error $$ \mathrm{error} = (h(x) - y)^2, $$ where $y$ is the real result. Since there are $m$ training examples in total, he aggregates them so that every error is accounted for, defining the cost function $$ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h(x_i) - y_i\bigr)^2, $$ where $x_i$ is a single training example.
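For concreteness, here is a minimal NumPy sketch of how I understand this cost function; the data `X`, `y`, and `theta` are made up for illustration and are not from the course:

```python
import numpy as np

# Hypothetical data: m = 4 training examples, n = 2 features.
# A leading column of ones lets theta_0 act as the intercept term.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 0.5],
              [1.0, 4.0, 2.0],
              [1.0, 3.0, 1.5]])
y = np.array([6.0, 2.5, 9.0, 6.5])
theta = np.array([0.5, 1.0, 1.0])

def cost(theta, X, y):
    """Least-squares cost J(theta) = (1 / (2m)) * sum_i (h(x_i) - y_i)^2."""
    m = len(y)
    residuals = X @ theta - y   # h(x_i) - y_i for every training example
    return residuals @ residuals / (2 * m)

print(cost(theta, X, y))
```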
He states that $J(\theta)$ is convex, with only one local optimum. I want to know why this function is convex.