I’m studying the basics of ML and trying to train a random forest model on a .csv dataset in which each row contains the pixel values of the red, green, and blue bands (each ranging from 0 to 255) plus a binary target variable (0 or 1). The pixels were sampled without replacement from 20 RGB images.

My question is: I have 2,000,000 rows, which are split into train and test data. However, I noticed that some combinations of values appear more than once, and in both the train and test partitions. For example:

Train dataset (70%)

R G B Class
20 30 40 0
30 40 50 1
50 20 5 0

Test dataset (30%)

R G B Class
20 30 40 0
30 40 50 1
30 40 50 1
10 20 30 1

Is it OK for a random forest model to be trained and tested on a dataset whose train/test partitions contain occurrences of the same combination of values? Or is this a data leakage scenario? If so, how should I deal with it? Thank you very much.
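To put a number on the overlap, one option is to count how many test rows have an (R, G, B) combination that also occurs in the train partition. This is only a minimal sketch: it assumes pandas, the column names `R`, `G`, `B`, `Class` from the example tables, and a hypothetical helper named `overlap_fraction`.

```python
import pandas as pd

# Column names are assumed from the example tables in the question.
COLS = ["R", "G", "B"]

def overlap_fraction(train: pd.DataFrame, test: pd.DataFrame) -> float:
    """Fraction of test rows whose (R, G, B) triple also occurs in train."""
    # Deduplicate the train keys so the left join keeps one row per test row.
    train_keys = train[COLS].drop_duplicates()
    merged = test.merge(train_keys, on=COLS, how="left", indicator=True)
    return float((merged["_merge"] == "both").mean())

# The small example partitions from the question.
train = pd.DataFrame(
    {"R": [20, 30, 50], "G": [30, 40, 20], "B": [40, 50, 5], "Class": [0, 1, 0]}
)
test = pd.DataFrame(
    {"R": [20, 30, 30, 10], "G": [30, 40, 40, 20], "B": [40, 50, 50, 30],
     "Class": [0, 1, 1, 1]}
)

print(overlap_fraction(train, test))
```

On the example partitions above, three of the four test rows have a combination that also appears in the train partition.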

Kol Rocket

1 Answer

The 20 images may share a common color pattern (e.g., green forest), which would be reflected in your data. So, in theory, you should expect some <R, G, B> combinations to occur quite often.

Also, you should not remove the repeated points in this case. These are pixels from images, not regular numerical or sample data, and removing them could badly affect your model's performance.
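If you keep the repeated points but still want to see how much they influence the test score, one possible check is to report accuracy separately for test rows whose (R, G, B) combination was seen during training and for those that were not. The sketch below is only an illustration under assumptions: it uses pandas and NumPy, the column names `R`, `G`, `B`, `Class` from the question, a hypothetical helper `accuracy_by_overlap`, and made-up predictions `y_pred` standing in for your model's output.

```python
import numpy as np
import pandas as pd

# Column names are assumed from the example tables in the question.
COLS = ["R", "G", "B"]

def accuracy_by_overlap(train: pd.DataFrame, test: pd.DataFrame, y_pred) -> dict:
    """Accuracy on test rows split by whether their RGB triple occurs in train."""
    # A left join on deduplicated train keys keeps the test row order and count.
    merged = test.merge(
        train[COLS].drop_duplicates(), on=COLS, how="left", indicator=True
    )
    seen = (merged["_merge"] == "both").to_numpy()
    correct = np.asarray(y_pred) == test["Class"].to_numpy()
    return {
        "seen": float(correct[seen].mean()) if seen.any() else float("nan"),
        "unseen": float(correct[~seen].mean()) if (~seen).any() else float("nan"),
    }

# Example partitions from the question; y_pred is invented for illustration.
train = pd.DataFrame(
    {"R": [20, 30, 50], "G": [30, 40, 20], "B": [40, 50, 5], "Class": [0, 1, 0]}
)
test = pd.DataFrame(
    {"R": [20, 30, 30, 10], "G": [30, 40, 40, 20], "B": [40, 50, 50, 30],
     "Class": [0, 1, 1, 1]}
)
y_pred = [0, 1, 0, 1]

print(accuracy_by_overlap(train, test, y_pred))
```

A large gap between the two numbers would suggest that the overall test score leans on combinations the model has already seen.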

Aviral Verma