3

I'm training a classifier on the DAIGT dataset. The objective is to differentiate human from AI text, so this is a binary classification problem. As a baseline before I move on to an LLM classifier, I am using a pipeline of a TF-IDF vectorizer followed by a logistic regression classifier. However, when I classify the data this way I get extremely high metrics. For example, the following code:

import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, test_idx in kf.split(daigt_v2["text"], daigt_v2["label"]):
    X_train, y_train = daigt_v2.iloc[train_idx]["text"], daigt_v2.iloc[train_idx]["label"]
    X_test, y_test = daigt_v2.iloc[test_idx]["text"], daigt_v2.iloc[test_idx]["label"]

    baseline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', LogisticRegression())
    ])

    baseline.fit(X_train, y_train)

    y_pred = baseline.predict(X_test)
    print(classification_report(y_test, y_pred, target_names=["Human", "AI"]))

gives the following output:

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5475
          AI       1.00      0.98      0.99      3499

    accuracy                           0.99      8974
   macro avg       0.99      0.99      0.99      8974
weighted avg       0.99      0.99      0.99      8974

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       0.99      0.98      0.99      3500

    accuracy                           0.99      8974
   macro avg       0.99      0.99      0.99      8974
weighted avg       0.99      0.99      0.99      8974

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       1.00      0.98      0.99      3500

    accuracy                           0.99      8974
   macro avg       0.99      0.99      0.99      8974
weighted avg       0.99      0.99      0.99      8974

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       0.99      0.98      0.99      3499

    accuracy                           0.99      8973
   macro avg       0.99      0.99      0.99      8973
weighted avg       0.99      0.99      0.99      8973

              precision    recall  f1-score   support

       Human       0.99      1.00      0.99      5474
          AI       1.00      0.98      0.99      3499

    accuracy                           0.99      8973
   macro avg       0.99      0.99      0.99      8973
weighted avg       0.99      0.99      0.99      8973

So we see a 0.99 F1 score and 0.99 classification accuracy on every fold, which obviously seems way too high. However, when I try using cross_validate like this:

baseline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression())
])

scores = cross_validate(baseline, daigt_v2["text"], daigt_v2["label"], cv=10, scoring=["accuracy", "f1", "recall", "precision", "roc_auc", "average_precision"])

summary = {key : float(np.mean(value)) for key, value in scores.items()}

summary comes back as:

{'fit_time': 13.48662896156311,
 'score_time': 5.418254947662353,
 'test_accuracy': 0.8590308329341341,
 'test_f1': 0.8367589483608666,
 'test_recall': 0.9277524353897032,
 'test_precision': 0.7674348038361346,
 'test_roc_auc': 0.9595275583634191,
 'test_average_precision': 0.9446004784576681}

These are much more modest scores. Obviously I trust the second result more, but can anyone explain the discrepancy here?

2 Answers

3

When you supply cv=<int> to cross_validate(), it will use a splitting regime without shuffling; shuffle=False by default. Since the rows of your dataset are ordered (roughly 50% label=0 followed by label=1), the model gets trained on label=0 data before being tested on label=1, which skews the results.

One solution is to define a splitter, and use it for both of your code snippets:

# Define a splitter for all CV analyses
splitter = StratifiedKFold(5, shuffle=True, random_state=0)
...
... = cross_validate(..., cv=splitter)

Note that random_state=0 will ensure that it randomises the same way each call.
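
Put together with the pipeline from the question, the full call would look roughly like this (a sketch, assuming the same daigt_v2 DataFrame with "text" and "label" columns):

# Shared splitter: the same shuffled, stratified folds on every call
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

baseline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression())
])

# Pass the splitter object instead of a bare integer so the folds are shuffled
scores = cross_validate(
    baseline,
    daigt_v2["text"], daigt_v2["label"],
    cv=splitter,
    scoring=["accuracy", "f1", "roc_auc"]
)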

You could alternatively shuffle your data upon loading, which then permits you to use a non-randomising splitter like in cross_validate(..., cv=5).
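
For instance, a one-line shuffle of the already-loaded DataFrame would do it (a sketch using pandas; daigt_v2 is the DataFrame from the question):

# Shuffle the rows once up front; a non-shuffling splitter such as cv=5
# then sees both labels mixed into every fold
daigt_v2 = daigt_v2.sample(frac=1, random_state=0).reset_index(drop=True)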

-2

I do not think your process is wrong; I believe there is a flaw in how Python computes the F1 value in the confusion matrix calculation. I raised the same issue around 5 years back and had hoped it would be resolved by now, but seeing these results I think that bug still exists in the software. Let me explain why I think so.

I developed a logit model for a business problem, and Python reported the accuracy as 100%. Being a statistician by education, the moment I saw 100% I felt I must be missing something, so I recalculated the results against the actuals by hand and got only 96% accuracy. Digging further, I found that instead of comparing row by row, it was comparing the column totals of the results and the actuals. For example, with actuals [1, 0, 1, 0] and predictions from the code [0, 0, 1, 1], comparing the two arrays element by element gives an accuracy of 50%, but both columns sum to 2, so the difference is 0 and the Python code reports the accuracy as 100%.

I hope I have explained this clearly. I may well be wrong in my understanding of how Python does the calculation, but this is what I understood while reading the code of the logit model. Brickbats on my understanding are welcome :-)
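
As a quick check of that example, sklearn's accuracy_score compares the two arrays element by element (a minimal sketch):

from sklearn.metrics import accuracy_score

actuals = [1, 0, 1, 0]
predicted = [0, 0, 1, 1]

# Positions 1 and 2 match, positions 0 and 3 do not: 2 out of 4 correct
print(accuracy_score(actuals, predicted))  # 0.5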