
I was inspired by this notebook, and I'm experimenting with the IsolationForest algorithm using scikit-learn==0.22.2.post1 in an anomaly-detection context, on the SF version of the KDDCUP99 dataset, which includes 4 attributes. The data is fetched directly from sklearn and, after preprocessing (label encoding the categorical feature), is passed to the IF algorithm with the default setup.

The full code is as follows:

from sklearn import datasets
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score, roc_curve, roc_auc_score, f1_score, precision_recall_curve, auc
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score

import pandas as pd
import numpy as np
import seaborn as sns
import shap
import itertools
import matplotlib.pyplot as plt
import datetime

%matplotlib inline

def byte_decoder(val):
    # decodes byte literals to strings
    return val.decode('utf-8')

# load the KDDCUP99 dataset from sklearn
target = 'target'
sf = datasets.fetch_kddcup99(subset='SF', percent10=True)  # percent10=True for convenience's sake
dfSF = pd.DataFrame(sf.data, columns=["duration", "service", "src_bytes", "dst_bytes"])
assert len(dfSF) > 0, "SF dataset not loaded."

dfSF[target] = sf.target
anomaly_rateSF = 1.0 - len(dfSF.loc[dfSF[target] == b'normal.']) / len(dfSF)

"SF Anomaly Rate is:"+"{:.1%}".format(anomaly_rateSF) #'SF Anomaly Rate is: 0.45%'

# data preprocessing
toDecodeSF = ['service']

# convert all abnormal target types to a single anomaly class

dfSF['binary_target'] = [1 if x==b'normal.' else -1 for x in dfSF[target]]

# apply label encoding to fields of type string
leSF = preprocessing.LabelEncoder()

for f in toDecodeSF:
    dfSF[f + " (encoded)"] = list(map(byte_decoder, dfSF[f]))
    dfSF[f + " (encoded)"] = leSF.fit_transform(dfSF[f + " (encoded)"])

for f in toDecodeSF:
    dfSF.drop(f, axis=1, inplace=True)

dfSF.drop(target, axis=1, inplace=True)

# check the anomaly rate, to inform the contamination parameter of IF
dfSF["binary_target"].value_counts() / np.sum(dfSF["binary_target"].value_counts())

# data split
X_train_sf, X_test_sf, y_train_sf, y_test_sf = train_test_split(dfSF.drop('binary_target', axis=1),
                                                                dfSF['binary_target'],
                                                                test_size=0.33,
                                                                random_state=11,
                                                                stratify=dfSF['binary_target'])

# print(y_test_sf.value_counts())
# 1    230899
# -1     1114
# Name: binary_target, dtype: int64

# y_test_sf.value_counts() / np.sum(y_test_sf.value_counts())
# 1    0.954984
# -1   0.045016
# Name: binary_target, dtype: float64

# GridSearch over IF parameters (SF)
scoring = {'AUC': 'roc_auc',
           'Recall': make_scorer(recall_score,  # or f1_score
                                 pos_label=-1)}

gs_cont_sf = GridSearchCV(IsolationForest(n_jobs=-1),
                          param_grid={'n_estimators': [2],  # [2**i for i in range(1, 9)]
                                      'max_samples': np.arange(0.1, 1.0, 0.2),
                                      'contamination': [0.001, 0.003, 0.005, 0.01, 0.1, 0.2, 0.3]},
                          scoring=scoring,
                          refit='Recall',
                          return_train_score=True,
                          cv=3,
                          verbose=1,
                          n_jobs=-1)
gs_cont_sf.fit(X_train_sf, y_train_sf)
results = gs_cont_sf.cv_results_

results_df = pd.DataFrame(results)
best_row = results_df.iloc[np.argmax(results_df["mean_test_Recall"])]
contamination, max_samples, n_estimators = best_row[["param_contamination", "param_max_samples", "param_n_estimators"]].tolist()
contamination, max_samples, n_estimators

# training the IF model (SF ver.) and predicting the outliers/anomalies on the test set with the final GridSearchCV results
iso_for_sf = IsolationForest(random_state=11,
                             n_estimators=n_estimators,    # 2
                             max_samples=max_samples,      # 0.1
                             contamination=contamination,  # 0.3, while the real rate is 0.045!
                             n_jobs=-1)
iso_for_sf.fit(X_train_sf, y_train_sf)

# create SHAP values and plot the outlier summary_plot for the test set

X_explain = X_test_sf
shap_values = shap.TreeExplainer(iso_for_sf).shap_values(X_explain)
shap.summary_plot(shap_values, X_explain)

# plot 2
sampled_data = X_train_sf.sample(100)
shap.initjs()
explainer = shap.TreeExplainer(iso_for_sf)
shap_values = explainer.shap_values(sampled_data)
shap.force_plot(explainer.expected_value, shap_values, sampled_data)

  • Why are the contributions of 3 features depicted in a grey color, which is outside the bar's color range?
  • What is the interpretation of the following shap.summary_plot and shap.force_plot in terms of outliers?
  • Is it clear how the SHAP toolset can make the contribution of features to outliers/anomalies transparent?

shap.summary_plot for all samples in the test set:

[image: SHAP summary plot of the test set]

shap.force_plot for 100 samples in the train set:

[image: SHAP force plot of 100 training samples]

I am probably missing something here; any help would be highly appreciated.


1 Answer


I had the same question, and according to this link:

Grey represents the categorical values which cannot be scaled in high or low.
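
To illustrate (a minimal sketch on synthetic data, not the question's dataset): summary_plot uses the feature matrix only for coloring, so any value it cannot place on a low-to-high color scale, e.g. NaN, is drawn grey.

import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import IsolationForest

# synthetic data: three numeric features, same model family as the question
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.randn(200, 3), columns=["a", "b", "c"])
model = IsolationForest(random_state=0).fit(X)
shap_values = shap.TreeExplainer(model).shap_values(X)

# give column "c" values the color scale cannot order (NaN);
# its dots in the summary plot should then render grey
X_color = X.copy()
X_color["c"] = np.nan
shap.summary_plot(shap_values, X_color)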

Concerning the other questions, I found this link: https://github.com/slundberg/shap/issues/960, where slundberg states:

In the linear model SHAP does indeed give high importance to outlier feature values.

  1. For a linear (or additive) model SHAP values trace out the partial dependence plot for each feature. So a positive SHAP value tells you that your value for that feature increases the model's output relative to typical values for that feature. For example if you have systolic blood pressure of 150, the average BP is 120 and higher blood pressure is bad for you then you will get a positive SHAP value because your BP is worse than average. But if you have a BP of 110 you will get a negative SHAP value because your BP is better than average (lowers your risk relative to average). SHAP values tell you about the informational content of each of your features, they don't tell you how to change the model output by manipulating the inputs (other than what would happen if you "hide" those feature values). To know how the model will change as you change the inputs you would need to trace out a dependence_plot of many SHAP values.
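
Following that last point, here is a minimal sketch of tracing such a dependence for the isolation forest from the question (it assumes the iso_for_sf and X_test_sf defined above; src_bytes is just an example feature):

# SHAP value of src_bytes plotted against its raw value;
# dependence_plot picks an interacting feature for the coloring by default
explainer = shap.TreeExplainer(iso_for_sf)
shap_values = explainer.shap_values(X_test_sf)
shap.dependence_plot("src_bytes", shap_values, X_test_sf)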