
A logistic regression involves a linear combination of features to predict the log-odds of a binary, yes/no-style event. That log-odds can then be transformed to a probability. If $\hat L_i$ is the predicted log-odds of observation $i$, then define the predicted probability of observation $i$, $\hat p_i$, by $\hat p_i=\frac{1}{1+e^{-\hat L_i}}$.
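For concreteness, here is that transformation in a few lines of Python (a minimal sketch with made-up log-odds values):

```python
import numpy as np

# Hypothetical predicted log-odds for five observations
log_odds = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])

# Logistic (sigmoid) transform: p = 1 / (1 + exp(-L))
probs = 1 / (1 + np.exp(-log_odds))
print(probs)  # approximately [0.12, 0.38, 0.50, 0.73, 0.95]
```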

How, then, do data scientists obtain accuracy values for such models? Every predicted probability is strictly between zero and one, while the binary outcomes in this framework are coded as $0$ and $1$, so every prediction is at least a little bit wrong. Yet I routinely see data scientists claim nonzero accuracy for such models.

Dave

1 Answer


You need a threshold value $t$ to assign a class based on the predicted probability: if $p < t$, assign class 0; if $p \ge t$, assign class 1. You can then compute the accuracy on a test set from the assigned classes and the true classes: run the model on the test set, assign a class to each predicted probability, and divide the number of correct predictions by the total number of predictions.
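For instance, a minimal sketch in Python (NumPy only; the probabilities and labels here are made up):

```python
import numpy as np

# Hypothetical predicted probabilities from the model, and true labels
probs = np.array([0.9, 0.3, 0.6, 0.2, 0.8])
y_true = np.array([1, 0, 1, 1, 0])

# Assign class 1 where p >= t, class 0 otherwise
t = 0.5
y_pred = (probs >= t).astype(int)

# Accuracy: fraction of assigned classes that match the true classes
accuracy = (y_pred == y_true).mean()
print(accuracy)  # 0.6 here: 3 of the 5 assignments are correct
```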

For the threshold value, $t = 0.5$ is the common default. Alternatively, you can tune the threshold on a held-out validation set, i.e., choose the value that maximizes accuracy (or any other metric) on that validation data.
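One way to do that tuning (a sketch, assuming `val_probs` and `val_labels` hold the model's validation-set probabilities and the true validation labels):

```python
import numpy as np

def best_threshold(val_probs, val_labels, grid=np.linspace(0.01, 0.99, 99)):
    """Return the threshold in `grid` with the highest validation accuracy."""
    accs = [((val_probs >= t).astype(int) == val_labels).mean() for t in grid]
    return grid[int(np.argmax(accs))]

# Hypothetical validation data
val_probs = np.array([0.15, 0.4, 0.55, 0.7, 0.9])
val_labels = np.array([0, 1, 0, 1, 1])
print(best_threshold(val_probs, val_labels))  # accuracy-maximizing threshold for this toy data
```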

noe