
I have a dataset where I have to detect anomalies. I take a subset of the data (let's call it subset A) and apply the DBSCAN algorithm to detect anomalies on set A. Once the anomalies are detected, I use the DBSCAN labels to create a label variable (anomaly: 1, non-anomaly: 0) in dataset A. I then train a supervised algorithm on dataset A to predict the anomalies, using that label as the dependent/target variable, and finally use the trained supervised model to predict anomalies on the rest of the data (the complement of A).
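For concreteness, here is a minimal sketch of the pipeline I am describing (scikit-learn is assumed; the data, subset size, choice of classifier, and DBSCAN parameters are just placeholders):

```python
# Sketch only: placeholder data and parameters, not the real pipeline settings.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(10000, 5)            # placeholder feature matrix
idx = np.random.choice(len(X), 2000, replace=False)
A = X[idx]                               # subset A used for DBSCAN
rest = np.delete(X, idx, axis=0)         # the complement of A

# DBSCAN marks noise points with label -1; treat those as anomalies (1).
db = DBSCAN(eps=0.5, min_samples=5).fit(A)
y_A = (db.labels_ == -1).astype(int)

# Train a supervised model on A with the DBSCAN-derived labels,
# then score the remaining data.
clf = RandomForestClassifier(random_state=0).fit(A, y_A)
pred_rest = clf.predict(rest)
```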

While this seems to be a fair approach to me, I am wondering whether any data leakage is happening at some stage. Please note that I am using the same set of variables/features at both stages (unsupervised and supervised). The reason for posting is that when I train the supervised model, I get a very high ROC-AUC score, around 0.99XX, which seems suspicious.

Note that I cannot run the DBSCAN algorithm on the entire dataset because of computational constraints, and I cannot train a supervised model directly because I do not have labels.

2 Answers


Please take care of stratification while sampling data for training: pass stratify=y to train_test_split.
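For example, a minimal sketch (assuming scikit-learn, with placeholder data standing in for subset A and its DBSCAN-derived labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for subset A and its DBSCAN-derived labels.
X_A = np.random.rand(2000, 5)
y_A = (np.random.rand(2000) < 0.02).astype(int)   # ~2% anomalies

# stratify=y keeps the anomaly/non-anomaly ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X_A, y_A, test_size=0.2, stratify=y_A, random_state=42
)
```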

A very high ROC-AUC may also be pointing to an imbalanced dataset - when-is-an-auc-score-misleadingly-high

Also check this out - k-fold-cross-validation-auc-score-vs-test-auc-score

Sandeep Bhutani

Without knowing your dependent variable, your explanatory variables, and the volume of data you have at hand, it is hard to give a good diagnosis.

However, scores of 99%+ often hide problems within models. To rule out data leakage: before doing anything, start by keeping a hold-out set that participates in neither the unsupervised learning nor the supervised one, and evaluate on it.
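A rough sketch of that protocol, assuming scikit-learn and placeholder data/parameters (the hold-out rows are carved out before DBSCAN ever sees the data):

```python
# Sketch only: placeholder data and parameters.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(10000, 5)                    # placeholder feature matrix
rng = np.random.default_rng(0)
holdout_idx = rng.choice(len(X), 1000, replace=False)
mask = np.zeros(len(X), dtype=bool)
mask[holdout_idx] = True
X_holdout, X_work = X[mask], X[~mask]           # hold-out never touches training

# Unsupervised stage on the working data only (noise label -1 -> anomaly 1).
db = DBSCAN(eps=0.5, min_samples=5).fit(X_work)
y_work = (db.labels_ == -1).astype(int)

# Supervised stage on the same working data.
clf = RandomForestClassifier(random_state=0).fit(X_work, y_work)

# Score the untouched hold-out; reference labels for it would have to come
# from a separate source (e.g. manual review), which is the hard part.
holdout_pred = clf.predict(X_holdout)
```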

Blenz