
I have a dataset where I have to detect anomalies. I take a subset of the data (let's call it subset A) and apply the DBSCAN algorithm to detect anomalies on set A. Once the anomalies are detected, I use the DBSCAN labels to create a label variable (anomaly: 1, non-anomaly: 0) in dataset A. I then train a supervised algorithm on dataset A to predict the anomalies, using that label as the dependent/target variable, and finally use the trained supervised model to predict anomalies on the rest of the data (the complement of A).
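For concreteness, here is a minimal sketch of the pipeline I am describing (scikit-learn is assumed; the data, subset size, choice of classifier, and DBSCAN parameters are just placeholders):

```python
# Sketch only: placeholder data and parameters, not the real pipeline settings.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(10000, 5)            # placeholder feature matrix
idx = np.random.choice(len(X), 2000, replace=False)
A = X[idx]                               # subset A used for DBSCAN
rest = np.delete(X, idx, axis=0)         # the complement of A

# DBSCAN marks noise points with label -1; treat those as anomalies (1).
db = DBSCAN(eps=0.5, min_samples=5).fit(A)
y_A = (db.labels_ == -1).astype(int)

# Train a supervised model on A with the DBSCAN-derived labels,
# then score the remaining data.
clf = RandomForestClassifier(random_state=0).fit(A, y_A)
pred_rest = clf.predict(rest)
```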

While this seems to be a fair approach to me, I am wondering whether any data leakage is happening at some stage. Please note that I am using the same set of variables/features at both stages (unsupervised and supervised). The reason for posting is that when I train the supervised model, I get a very high ROC-AUC score, around 0.99XX, which seems suspicious.

Note that I cannot run the DBSCAN algorithm on the entire dataset because of computational constraints, and I cannot train a supervised model directly because I do not have labels.

2 Answers


Please take care of stratification while sampling data for training: pass stratify=y to train_test_split.
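For example, a minimal sketch (assuming scikit-learn, with placeholder data standing in for subset A and its DBSCAN-derived labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for subset A and its DBSCAN-derived labels.
X_A = np.random.rand(2000, 5)
y_A = (np.random.rand(2000) < 0.02).astype(int)   # ~2% anomalies

# stratify=y keeps the anomaly/non-anomaly ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X_A, y_A, test_size=0.2, stratify=y_A, random_state=42
)
```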

A very high ROC-AUC may also be pointing to an imbalanced dataset - when-is-an-auc-score-misleadingly-high

Also check this out - k-fold-cross-validation-auc-score-vs-test-auc-score

Sandeep Bhutani

Without knowing your dependent variable, your explanatory variables, and the volume of data you have at hand, it is hard to give a good diagnosis.

However, scores of 99%+ often hide problems within models. To rule out data leakage: before doing anything, start by keeping a hold-out set that participates in neither the unsupervised learning nor the supervised one, and evaluate on it.
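A rough sketch of that protocol, assuming scikit-learn and placeholder data/parameters (the hold-out rows are carved out before DBSCAN ever sees the data):

```python
# Sketch only: placeholder data and parameters.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(10000, 5)                    # placeholder feature matrix
rng = np.random.default_rng(0)
holdout_idx = rng.choice(len(X), 1000, replace=False)
mask = np.zeros(len(X), dtype=bool)
mask[holdout_idx] = True
X_holdout, X_work = X[mask], X[~mask]           # hold-out never touches training

# Unsupervised stage on the working data only (noise label -1 -> anomaly 1).
db = DBSCAN(eps=0.5, min_samples=5).fit(X_work)
y_work = (db.labels_ == -1).astype(int)

# Supervised stage on the same working data.
clf = RandomForestClassifier(random_state=0).fit(X_work, y_work)

# Score the untouched hold-out; reference labels for it would have to come
# from a separate source (e.g. manual review), which is the hard part.
holdout_pred = clf.predict(X_holdout)
```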

Blenz