
I need your help to find a flaw in my model, since its accuracy (95%) is not realistic.

I'm working on a classification problem using Random Forest, with around 2500 positive cases, 15000 negative ones, and 75 independent variables. Here's the core of my code:

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 900, criterion = 'gini', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

I've optimized the hyperparameters through grid search and performed k-fold cross-validation, which reported a mean accuracy of 0.9444. Confusion matrix:

[[3390,   85],
 [ 101,  516]]

showing roughly 95.5% accuracy.
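For reference, the tuning step looks roughly like this (a simplified sketch; the grid values below are placeholders, not the exact ones I searched):

# Simplified sketch of the grid search and k-fold cross-validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

param_grid = {'n_estimators': [300, 600, 900], 'max_depth': [None, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, scoring='accuracy', cv=5)
grid.fit(X_train, y_train)

# k-fold CV on the best estimator (the fold count here is a placeholder);
# the mean reported was 0.9444
scores = cross_val_score(grid.best_estimator_, X, y,
                         cv=StratifiedKFold(n_splits=10), scoring='accuracy')
print(scores.mean())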

Did I miss something?

NOTE: the dataset consists of the financial reports of 2500 Italian mafia firms (positive cases) and 15000 lawful firms randomly sampled from the same regions (negative cases).

Thank you guys!

EDIT: I've uploaded the metrics and confusion matrix. The model is actually performing well, but looking at the metrics and the confusion matrix, the log loss and recall show more realistic values, so I assume it is fine. [images: evaluation metrics and confusion matrix]

4 Answers


To understand whether the model is performing well, I would first do the following:

  1. Plot the distribution of the classes, to understand whether resampling mechanisms are required.

  2. If the classes are not evenly distributed, do stratified sampling during the train-test split (see the sketch after this list).

  3. After the prediction, plot the confusion matrix; this is supported by libraries such as matplotlib and seaborn.

  4. Based on the class distribution, it's also important to understand which metrics are appropriate: micro-averaged, macro-averaged, or weighted precision, recall, and F1 score.

This should help you evaluate whether your model is truly learning the features, or whether class imbalance is causing such a spike in the results.
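A minimal sketch of steps 2-4, assuming X and y are your feature matrix and labels, and a recent scikit-learn (>= 1.0) for ConfusionMatrixDisplay:

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

# Step 2: stratify=y keeps the class ratio identical in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=900, random_state=0)
clf.fit(X_train, y_train)

# Step 3: plot the confusion matrix (rendered with matplotlib)
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
plt.show()

# Step 4: per-class precision, recall and F1, plus macro/weighted averages
print(classification_report(y_test, clf.predict(X_test)))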

Nischal Hp

Random Forests are built from decision trees, which are sensitive to the distribution of the classes. Besides stratification, you can use oversampling, undersampling, or assign greater weights to the less frequent class to mitigate this effect. A detailed discussion you can study is on Cross Validated.

You might also want to consider metrics other than accuracy to measure your classification model, since your data are heavily imbalanced.
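For example, a minimal sketch assuming the X_train/y_train/y_test split from the question: RandomForestClassifier accepts a class_weight parameter that reweights the classes, and balanced accuracy is one imbalance-aware alternative to plain accuracy:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

# 'balanced' weights each class inversely to its frequency in y_train
weighted = RandomForestClassifier(n_estimators=900, class_weight='balanced',
                                  random_state=0)
weighted.fit(X_train, y_train)

# Balanced accuracy is the mean of the per-class recalls: a model that always
# predicts the negative class scores 0.5 here, despite ~86% plain accuracy
# on data with a 15000:2500 class ratio
print(balanced_accuracy_score(y_test, weighted.predict(X_test)))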

JoPapou13

Below I summarise several ways to help you train and validate your model with as little bias as possible:

  1. Usually a good way to assess the classification performance is to compare with some very basic models. If your validation metrics are worse than (or close to) those, it is obvious that the model needs improvement. E.g. in your case you could compare with:

    • random model (each observation is randomly classified to each class with probability 1/2)
    • model that always predicts negative class
  2. Another way to ensure that the high validation numbers you get aren't biased by the way the training set and test set are separated is to use cross-validation. In cross-validation, the data is split into training and test sets multiple times through an iterative process, and the final validation metrics are calculated as an average over the iterations. Here is an example of how you can perform cross-validation in Python using scikit-learn.

  3. In addition to accuracy, I would also calculate and compare other validation metrics in order to get a more complete picture of the model's performance (e.g. precision, recall, or combined ones such as the F-score). Accuracy is not a recommended metric when most of the observations belong to one class. You can read more about performance metrics here and here. Scikit-learn can calculate some of them automatically (see here), but you can compute any of them from the confusion matrix.

  4. SMOTE is a resampling technique (implemented in the imbalanced-learn library) popularly used with unbalanced datasets like yours - it creates synthetic minority-class samples to produce a balanced dataset. You can read more here. A combined sketch of points 1, 2 and 4 follows this list.
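The sketch below is a minimal illustration of points 1, 2 and 4, assuming X and y are your features and labels, the positive class is labelled 1, and the imbalanced-learn package is installed; the fold count is a placeholder:

from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from imblearn.over_sampling import SMOTE

# Point 1: baselines - 'uniform' classifies at random with probability 1/2,
# 'most_frequent' always predicts the (negative) majority class
for strategy in ('uniform', 'most_frequent'):
    baseline = DummyClassifier(strategy=strategy)
    print(strategy, cross_val_score(baseline, X, y, cv=5, scoring='f1').mean())

# Point 2: cross-validated F1 of the actual model, averaged over the folds
forest = RandomForestClassifier(n_estimators=900, random_state=0)
print('forest', cross_val_score(forest, X, y, cv=5, scoring='f1').mean())

# Point 4: SMOTE synthesises new minority-class samples; to avoid leakage,
# resample the training split only, never the test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)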

missrg

Random Forest generally works well out of the box. In your case, the data is not balanced, which is causing this misleadingly high accuracy. How do you balance the data? There are multiple techniques you can choose from, but the simplest ones are "Up-Sample" and "Under-Sample".

Sample Code:

import pandas as pd
from sklearn.utils import resample

# Separate the two classes ('value of ...' are placeholders for your labels)
minority_df = df[df.Col1 == 'value of Italian mafia firm']
majority_df = df[df.Col1 == 'value of lawful firm']

# Up-sample the minority class to 15k with replacement. You could instead
# down-sample the majority class, but you already have little data, so I
# wouldn't suggest that. To avoid leakage, ideally resample only the
# training portion of the data.
minority_df = resample(minority_df, replace=True, n_samples=15000, random_state=123)

# Concatenate to create the new balanced dataset
df_balanced = pd.concat([majority_df, minority_df])

Use this new balanced dataset for your model training; everything else in your code looks standard. Let me know if I can help more. Cheers!

mannu