This is my code:
```python
import pandas as pd  # needed for the report DataFrame below
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MaxAbsScaler
from sklearn.metrics import classification_report

# Logistic-regression-style linear model trained with SGD.
sgd_classifier = SGDClassifier(loss='log',  # renamed to 'log_loss' in scikit-learn >= 1.1
                               penalty='elasticnet', max_iter=30, n_jobs=60,
                               alpha=1e-6, l1_ratio=0.7,
                               class_weight='balanced', random_state=0)

# Character 4-grams (word-boundary aware), keeping n-grams that occur
# in at least 10 documents.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4), min_df=10)
X_train = vectorizer.fit_transform(X_text_train.ravel())
X_test = vectorizer.transform(X_text_test.ravel())
print('TF-IDF number of features:', len(vectorizer.get_feature_names_out()))  # get_feature_names() in scikit-learn < 1.0

# Scale each feature to [-1, 1] without breaking sparsity.
scaler = MaxAbsScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print('Inputs shape:', X_train.shape)

sgd_classifier.fit(X_train, y_train)
y_predicted = sgd_classifier.predict(X_test)
y_predicted_prob = sgd_classifier.predict_proba(X_test)

results_report = classification_report(y_test, y_predicted, labels=classes_trained,
                                       digits=2, output_dict=True)
df_results_report = pd.DataFrame.from_dict(results_report)
pd.set_option('display.max_rows', 300)
print(df_results_report.transpose())
```
X_text_train and X_text_test have shapes (2M, 2) and (100k, 2), respectively.
The first column holds the description of a financial transaction; each description generally consists of 5-15 words. The second column is a categorical variable that simply holds the name of the bank associated with the transaction.
I merge these two columns into one description (sketched below), so X_text_train and X_text_test then have shapes (2M,) and (100k,), respectively.
Then I apply TF-IDF, and the resulting feature matrices have shapes (2M, 50k) and (100k, 50k), respectively.
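For reference, the merge step looks roughly like this (the two-row toy array and min_df=1 are placeholders, not my real data or settings):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the real (2M, 2) array: column 0 is the transaction
# description, column 1 is the bank name.
X_text_train = np.array([
    ["card payment grocery store", "alpha_bank"],
    ["monthly rent transfer", "beta_bank"],
])

# Merge the two columns into a single text per row -> shape (n,).
merged_train = np.array([" ".join(row) for row in X_text_train])

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4), min_df=1)
X_train = vectorizer.fit_transform(merged_train)
print(X_train.shape)  # (n, vocabulary_size)
```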
What I observe is that when the second column contains an unseen value (i.e., a new bank name shows up in the merged description), SGDClassifier returns very different, seemingly random predictions compared to what it returns if I drop the bank-name column entirely.
The same happens if I apply TF-IDF only to the descriptions and keep the bank names as a separate categorical variable.
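Concretely, the comparison I am running looks roughly like this (variable names are illustrative; it assumes the 2-column X_text_train / X_text_test arrays, y_train, and the sgd_classifier defined above):

```python
import numpy as np
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer

# Model A sees "description + bank name", model B sees the description only;
# their predictions are compared on test rows whose bank name never
# appeared in training.
seen_banks = set(X_text_train[:, 1])
unseen = np.array([bank not in seen_banks for bank in X_text_test[:, 1]])

def fit_predict(train_texts, test_texts):
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4), min_df=10)
    clf = clone(sgd_classifier)  # fresh, unfitted copy of the model above
    clf.fit(vec.fit_transform(train_texts), y_train)
    return clf.predict(vec.transform(test_texts))

pred_a = fit_predict([" ".join(row) for row in X_text_train],
                     [" ".join(row) for row in X_text_test])
pred_b = fit_predict(X_text_train[:, 0], X_text_test[:, 0])

# Fraction of unseen-bank test rows where the two setups disagree:
print((pred_a[unseen] != pred_b[unseen]).mean())
```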
Why does this happen with SGDClassifier?
Is it that SGD in general cannot handle unseen values well because it converges in this stochastic way?
The interesting thing is that with TF-IDF the vocabulary is fixed at fit time, so unseen values in the test set are not reflected in the features at all (i.e., all the respective features are just 0), but the SGD predictions still break.
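To sanity-check that claim, here is a tiny self-contained example (toy strings, not my real data) showing that transform() maps text whose character n-grams all fall outside the fitted vocabulary to an all-zero row:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on one toy document so the vocabulary is small and known.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4), min_df=1)
vec.fit(["card payment alpha_bank"])

# Transform a string that shares no character 4-grams with the training text.
row = vec.transform(["zzzz qqqq"])
print(row.nnz)  # 0 -> every feature in this row is exactly zero
```

Since zero-valued features contribute nothing to a linear model's decision function, this is exactly why the behavior above surprises me.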
(I also posted this on scikit-learn's GitHub: https://github.com/scikit-learn/scikit-learn/issues/21906)