How do I correctly split my dataset into train, test and validation?

Question

I'm attempting to split my data set into 70% training, 15% testing and 15% validation.

train_X, test_X, train_Y, test_Y = train_test_split(data, labels, test_size=0.3, train_size=0.7,random_state=1,stratify = labels)
test_X, val_X, test_Y, val_Y = train_test_split(test_X, test_Y, test_size=0.5,
                                                    random_state=1,stratify = labels)

But I'm not sure whether this code splits the test set in half. Also, I keep getting this error:

     29 def main():
     30     data, labels = load_data()
---> 31     train_X, train_Y, val_X, val_Y, test_X, test_Y = process_data(data, labels)
     32 
     33     best_model, best_k = select_knn_model(train_X, val_X, train_Y, val_Y)
/tmp/ipykernel_50/3409802801.py in process_data(data, labels)
     45     X_counts = vectorizer.fit_transform(train_X)
     46     X_count = vectorizer.transform(test_X)
---> 47     Xval = vectorizer.transform(Val_X)
     48     # Return the training, validation, and test set inputs and labels
     49
NameError: name 'Val_X' is not defined

How do I fix this?

score 1 · Answer 1 · answered Sep 25 '21 at 08:08

Do not split the test set in half in the second train_test_split. Instead, first split your whole dataset into a train set and a test set. Then split the train set into train and validation sets, as shown below. The sizes used here give a 60/20/20 split; adjust test_size for other ratios.

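from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Carve the validation set out of the remaining 80%: 0.25 x 0.8 = 0.2
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)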
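
For the 70/15/15 split asked about in the question, the same test-first approach could look roughly like the sketch below. The toy data and labels are made up for illustration, and passing an absolute sample count as test_size is just one way to avoid fractional rounding. Note that stratify should receive the labels of the subset being split (y_train in the second call), not the full labels list.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 200 samples with two balanced classes.
data = [f"sample {i}" for i in range(200)]
labels = [i % 2 for i in range(200)]

# 15% of the full dataset, as an absolute number of samples.
n_holdout = round(0.15 * len(data))

# Step 1: hold out the test set, stratified on the full label list.
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=n_holdout, random_state=1, stratify=labels
)

# Step 2: carve the validation set out of the remaining training data.
# stratify uses y_train, the labels that match X_train in length.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=n_holdout, random_state=1, stratify=y_train
)

print(len(X_train), len(X_val), len(X_test))
```

Passing the full labels list to stratify in the second call, as in the question, would fail because its length no longer matches the subset being split.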

Regarding the error: your second split defines val_X, but the vectorizer call uses Val_X. Python names are case-sensitive, so these are two different names. Change Val_X to val_X and the NameError should go away.
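
As a rough illustration of the vectorizing step with consistent names, here is a minimal sketch. It assumes the question's vectorizer is a CountVectorizer (the question does not say which one is used), and the toy sentences and the *_counts variable names are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy text standing in for the train_X, val_X and test_X splits.
train_X = ["the cat sat", "the dog ran", "a cat ran"]
val_X = ["the cat ran"]
test_X = ["a dog sat"]

# Assumption: the question's `vectorizer` is a CountVectorizer.
vectorizer = CountVectorizer()

# Learn the vocabulary from the training text only, then reuse it
# unchanged for the validation and test text.
train_counts = vectorizer.fit_transform(train_X)
val_counts = vectorizer.transform(val_X)  # val_X, not Val_X
test_counts = vectorizer.transform(test_X)

print(train_counts.shape, val_counts.shape, test_counts.shape)
```

Fitting on the training split alone keeps the validation and test sets from influencing the vocabulary, and all three matrices end up with the same number of columns.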
