
I’m confused about the steps to go from a simple linear regression to logistic regression.

If we have a dataset consisting of a column of x values and a column of y values (the values we want to predict), then we can run a simple linear regression to get a predictive model such that y_pred = B1x + c, where B1 is the coefficient for our inputs, x, and c is the intercept of the line.
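
Just to make that concrete, here is roughly what I mean in Python (a minimal sketch; the data values are made up):

```python
import numpy as np

# made-up dataset: one column of x values and one column of y values
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.1, 2.8, 4.2, 4.9])

# ordinary least squares fit of y_pred = B1*x + c
B1, c = np.polyfit(x, y, deg=1)
y_pred = B1 * x + c
```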

Now let’s say y is categorical such that it is either 1 or 0: 1 if the event occurs, and 0 if it does not. Many of the videos I’ve watched tell me to think of y_pred as a probability even though it’s not. If we think of it as a probability, it makes no sense because, assuming the regression line has positive slope, for very large values of x we get y_pred values that can go to infinity. Also, for small values of x, depending on the regression line, we may have negative predicted values. Neither of those makes sense for a probability, so we throw linear regression out and try something else.
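
Here is a toy sketch of what I mean (made-up 0/1 labels; the fitted line predicts values below 0 and above 1 for extreme x):

```python
import numpy as np

# made-up binary outcome: the event tends to occur for larger x
x = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 0, 1, 1, 1])

B1, c = np.polyfit(x, y, deg=1)   # fit a straight line to the 0/1 labels
print(B1 * 100 + c)               # far above 1 for large x
print(B1 * (-100) + c)            # negative for small x
```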

As a next step, they say to start calling y_pred “z” and to think of a function that can take in the z values we got from our linear regression output and map them to values between 0 and 1. A sigmoid does this well and is described as P = 1/(1+e^(-z)). If we now make a new column of data, P, using all our z values, then we have a column of data that tells us the probability of a 1 or 0 based on the independent variable. But to fit the data better, we rewrite P = 1/(1+e^(-z)) as ln(P/(1-P)) = z, as they are equivalent. Then we perform something called maximum likelihood estimation to get a new coefficient for x and a new intercept c so the curve fits the data better... or am I wrong, and you simply do a linear regression because of the linear relationship between the log odds and z?
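
In code, what I understand so far is that the sigmoid and the log odds are inverses of each other (a quick sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    return np.log(p / (1.0 - p))

z = np.linspace(-5.0, 5.0, 11)      # e.g. z values from a linear regression
P = sigmoid(z)                       # always strictly between 0 and 1
print(np.allclose(log_odds(P), z))   # True: ln(P/(1-P)) recovers z
```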

I think my confusion is this: why do we care about log odds? Why not just pass z through the sigmoid, fit it, and use that to get probabilities for varying x values? Am I just so lost that I’m going in circles with misunderstandings? Can someone help with thinking through the steps above?

1 Answer


In a binary classification problem, we are given a training dataset consisting of feature vectors $x_1, \ldots, x_N \in \mathbb R^d$ and corresponding labels. Let's think of the label for example $i$ as being a random variable $Y_i$ with two possible values, $0$ or $1$. Moreover, let's assume that the random variables $Y_i$ are independent and that there exists a vector $\beta^\star \in \mathbb R^{d+1}$ such that $$ P(Y_i = 1) = \sigma(\hat x_i^T \beta^\star) \quad \text{for } i = 1, \ldots, N. $$ Here $\sigma(u) = \frac{1}{1 + e^{-u}}$ is the sigmoid function and $\hat x_i$ is the "augmented" feature vector obtained by prepending a $1$ to the feature vector $x_i$.
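
If it helps to see this setup in code, here is a minimal sketch with made-up numbers, where the rows of `X_hat` are the augmented feature vectors $\hat x_i^T$:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# N = 3 examples, d = 2 features (made-up numbers)
X = np.array([[0.5, 1.2],
              [1.5, 0.3],
              [2.5, 2.1]])
X_hat = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a 1 to each row
beta = np.array([0.1, -0.4, 0.7])                 # beta_0, beta_1, beta_2
probs = sigmoid(X_hat @ beta)                     # P(Y_i = 1) for each i
```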

Let $y_i$ be the observed value of the random variable $Y_i$. Notice that \begin{align} P(Y_i = y_i \text{ for } i = 1, \ldots, N) &= \prod_{i=1}^N P(Y_i = y_i) \\ &= \prod_{i=1}^N \sigma(\hat x_i^T \beta^\star)^{y_i}(1 - \sigma(\hat x_i^T \beta^\star))^{1 - y_i}. \end{align} (The first equality uses the assumption that the random variables $Y_i$ are independent. Parse the last expression carefully. It gives the correct value if $y_i = 0$, and it also gives the correct value if $y_i = 1$. Admittedly, expressing $P(Y_i = y_i)$ in this way is a "slick" thing to do. It's something that you would only think of with the benefit of hindsight, after making a lot of effort to simplify this calculation.)
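
That product can be transcribed directly; continuing the sketch above with made-up observed labels `y`:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

X_hat = np.array([[1.0, 0.5, 1.2],
                  [1.0, 1.5, 0.3],
                  [1.0, 2.5, 2.1]])   # augmented feature vectors as rows
y = np.array([0, 1, 1])               # observed labels y_i
beta = np.array([0.1, -0.4, 0.7])

p = sigmoid(X_hat @ beta)                          # P(Y_i = 1)
likelihood = np.prod(p**y * (1.0 - p)**(1 - y))    # product of P(Y_i = y_i)
```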

It seems natural to estimate $\beta^\star$ by finding the vector $\beta$ that maximizes the function $$ L(\beta) = \prod_{i=1}^N \sigma(\hat x_i^T \beta)^{y_i} \left(1 - \sigma(\hat x_i^T \beta)\right)^{1 - y_i}, $$ which is called the "likelihood function". But maximizing $L(\beta)$ is equivalent to maximizing $$ \log L(\beta) = \sum_{i=1}^N y_i \log\left(\sigma(\hat x_i^T \beta) \right) + (1 - y_i) \log \left(1 - \sigma(\hat x_i^T \beta) \right). $$ This is the objective function that we maximize when training a logistic regression model.
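
For concreteness, here is a minimal sketch of maximizing $\log L(\beta)$ with plain gradient ascent; a real implementation would use a better optimizer, and the step size and iteration count below are arbitrary:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def log_likelihood(beta, X_hat, y):
    p = sigmoid(X_hat @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic(X_hat, y, lr=0.1, steps=1000):
    beta = np.zeros(X_hat.shape[1])
    for _ in range(steps):
        p = sigmoid(X_hat @ beta)
        grad = X_hat.T @ (y - p)   # gradient of the log-likelihood
        beta = beta + lr * grad    # ascend, since we are maximizing
    return beta
```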

littleO
  • I lost you when you said prepending a 1 to xi. I’m also a bit lost on Yi and yi... if I knew these things I have a feeling I could understand what you’ve written... also a bit confused about why P(Yi = yi) is equal to a multiplication of probabilities... phew! – King Squirrel Apr 09 '21 at 06:19
  • "Prepending a $1$" just means you stick a $1$ at the beginning. It would probably have been clearer if I had said that $P(Y_i = 1) = \sigma(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_d x_{id})$, where $x_{i1}, \ldots, x_{id}$ are the components of the feature vector $x_i$. But to save writing, I wrote the expression $\beta_0 + \beta_1 x_{i1} + \cdots + \beta_d x_{id}$ using vector notation as $\hat x_i^T \beta$, where $\hat x_i = \begin{bmatrix} 1 \\ x_{i1} \\ \vdots \\ x_{id} \end{bmatrix}$ and $\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{bmatrix}$. – littleO Apr 09 '21 at 06:35
  • Regarding the expression $P(Y_i = y_i) = \sigma(\hat x_i^T \beta)^{y_i} (1 - \sigma(\hat x_i^T \beta))^{1 - y_i}$, think about the cases $y_i = 1$ and $y_i = 0$ separately. If $y_i = 1$, this expression states that $P(Y_i = 1) = \sigma(\hat x_i^T \beta)$. If $y_i = 0$, this expression states that $P(Y_i = 0) = 1 - \sigma(\hat x_i^T \beta)$. It's just a clever way to handle both cases at once. – littleO Apr 09 '21 at 06:42
  • Thank you. I think the only thing I now don’t understand is where you say P(Yi = yi) is the multiplication from i = 1 to N of those two terms. Could you explain that? – King Squirrel Apr 09 '21 at 14:55
  • I’m also wondering if my logic in the post is valid... 1. Linear regression on, say, two columns of data, 2. Obtain your column of y_pred values, 3. Input those into the sigmoid, 4. Now you have a function that takes in independent variables and outputs the probability of the event occurring. Correct? – King Squirrel Apr 09 '21 at 14:58
  • For example this article seems to make this very simple... did they oversimplify and skip MLE??? https://towardsai.net/p/machine-learning/logistic-regression-with-mathematics – King Squirrel Apr 09 '21 at 14:59
  • That article doesn't provide the MLE viewpoint, but that's ok. You can write down the logistic regression cost function based on intuition, without using MLE, if you accept that cross-entropy is the natural way to measure how well a predicted probability agrees with a ground truth probability. – littleO Apr 09 '21 at 16:58
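
To spell that last comment out: the per-example cross-entropy is just the negative of the corresponding term in the log-likelihood above, so minimizing the average cross-entropy and maximizing the log-likelihood give the same $\beta$. A quick sketch:

```python
import numpy as np

def cross_entropy(y_true, p_pred):
    # cross-entropy between a ground-truth label (0 or 1) and a predicted probability;
    # this is the negative of one term of the log-likelihood written above
    return -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print(cross_entropy(1, 0.9))  # small: confident and correct
print(cross_entropy(0, 0.9))  # large: confident and wrong
```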