I am experimenting with a simple MLPClassifier and one-hot encoding in SKLearn.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv("./Synthetic_data.csv", header=0)
filtered_data = data.drop(['diag_binary'], axis=1)

X = filtered_data[['sex', 'age', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q']]
y = filtered_data[['diag_multi']]

datasets = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
train_data, test_data, train_labels, test_labels = datasets

enc = OneHotEncoder(handle_unknown='ignore')
train_data = enc.fit_transform(train_data)
test_data = enc.transform(test_data)

mlp = MLPClassifier(max_iter=1000, batch_size=32, random_state=42)
mlp.fit(train_data, train_labels.values.ravel())
print(mlp.score(test_data, test_labels))

The score of the MLP in this setup is higher than the score of the same MLP trained without one-hot encoded data. I do not understand how the encoding can change the output of my model.

To provide context on the dataset: the features 'a' to 'z' all represent questions from a survey with 4 possible answers: 'yes', 'no', 'don't know' and 'skipped', encoded as 1, 2, 3 and 9. Since the MLP could learn an ordinal relation between these codes, I one-hot encode them. I also one-hot encode 'sex', since, like the other features, it is encoded as 1, 2 and 3. The same goes for 'age', even though I am not sure whether one-hot encoding is the correct way to handle age values. The dataset is also very imbalanced, so this may be a problem; I'm not sure.
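To illustrate the ordinality concern, here is a minimal sketch of what the encoder does to a single survey column (assuming scikit-learn >= 1.2, where the dense-output flag is spelled sparse_output; older versions spell it sparse):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One survey column: 1 = 'yes', 2 = 'no', 3 = "don't know", 9 = 'skipped'
col = np.array([[1], [2], [3], [9], [2]])

enc = OneHotEncoder(sparse_output=False)
print(enc.fit_transform(col))
# Each code becomes its own binary column, so no ordering between
# 1, 2, 3 and 9 is implied:
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]
#  [0. 1. 0. 0.]]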

Leandro
1 Answer


Applying an encoding could increase your model's performance, or it could introduce noise to the point that the model won't be able to classify properly.

Applying one-hot encoding to your features will remove any ordinal relationship in their values (if some of your features do have a genuine ordinal relationship, apply an ordinal encoder to those instead, as sketched below).
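A minimal sketch of that alternative, with a hypothetical ordered feature for illustration:

from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered categories; pass the order explicitly so the
# encoder does not just sort the values alphabetically
ordinal_enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
print(ordinal_enc.fit_transform([['low'], ['high'], ['medium']]))
# [[0.]
#  [2.]
#  [1.]]  -- the integer codes preserve the ordering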

In addition, if a categorical feature has an imbalanced representation in your data (e.g. a feature called type_business takes the value "Logistics" in more than 40% of your rows), that value is clearly a dominant category within the feature, and the model might give it more importance without you knowing; one-hot encoding will prevent that.

I don't know about MLPs specifically, but some models, such as logistic regression, perform better on properly encoded numeric data, so encoding is generally a good idea.

I am unsure why you encode the feature "age"; just keep it numerical and ordinal. You can do that while still one-hot encoding the rest, as sketched below.
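A minimal sketch of this, using a ColumnTransformer and the column names from your question (applied to the raw, pre-encoding splits):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

categorical_cols = ['sex', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
                    'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q']

# One-hot encode the survey answers and 'sex', leave 'age' numeric
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough')  # 'age' is passed through unchanged

train_enc = preprocess.fit_transform(train_data)
test_enc = preprocess.transform(test_data)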

NOTE: since you have imbalanced data, it is recommended to use precision and recall (or the area under the precision-recall curve) to measure your model's performance, rather than accuracy alone.
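A minimal sketch, reusing the fitted mlp and the encoded test split from your code:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 for the multi-class target
pred = mlp.predict(test_data)
print(classification_report(test_labels.values.ravel(), pred))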

NOTE2: usually we do the encoding before the train/test split. Why? Because if a category exists in the test set but not in the training set, enc.transform() will crash, since it does not know how to encode the new category. handle_unknown='ignore' avoids the crash by encoding such unseen categories as all zeros, but if you have a small dataset, those rows effectively lose information.
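A minimal sketch of that ordering, reusing the variable names from your question:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on all of X first, so every category that appears
# anywhere in the data gets its own column, then split
enc = OneHotEncoder(handle_unknown='ignore')
X_1hot = enc.fit_transform(X)

train_data, test_data, train_labels, test_labels = train_test_split(
    X_1hot, y, test_size=0.25, random_state=42, stratify=y)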

I hope this helps!

Ali Massoud