I'm attempting to split my data set into 70% training, 15% testing and 15% validation.

train_X, test_X, train_Y, test_Y = train_test_split(data, labels, test_size=0.3, train_size=0.7,random_state=1,stratify = labels)

test_X, val_X, test_Y, val_Y = train_test_split(test_X, test_Y, test_size=0.5, random_state=1,stratify = labels)

But I'm not sure whether this code splits the test set in half. Also, I keep getting this error:

     29 def main():
     30     data, labels = load_data()
---> 31     train_X, train_Y, val_X, val_Y, test_X, test_Y = process_data(data, labels)
     32 
     33     best_model, best_k = select_knn_model(train_X, val_X, train_Y, val_Y)

/tmp/ipykernel_50/3409802801.py in process_data(data, labels)
     45     X_counts = vectorizer.fit_transform(train_X)
     46     X_count = vectorizer.transform(test_X)
---> 47     Xval = vectorizer.transform(Val_X)
     48     # Return the training, validation, and test set inputs and labels
     49

NameError: name 'Val_X' is not defined

How do I fix this?

Sara

1 Answer

Instead of splitting the test set in half with the second train_test_split, first split your whole data set into train and test sets. Then split the train set into train and validation sets, as shown below.

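from sklearn.model_selection import train_test_split

# First split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Second split: 0.25 x 0.8 = 0.2 of the full data goes to validation (60/20/20 overall)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)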
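
For the 70/15/15 split asked about in the question, here is a minimal sketch that keeps the question's two-step approach. The toy `data` and `labels` arrays and the names `rest_X` and `rest_Y` are made up for illustration. It assumes that `stratify` in the second call should be the labels of the subset being split (`rest_Y`) rather than the full `labels` array, since the lengths have to match.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the question's `data` and `labels`.
data = np.arange(200).reshape(100, 2)
labels = np.array([0] * 50 + [1] * 50)

# Step 1: keep 70% for training and hold out the remaining 30%.
train_X, rest_X, train_Y, rest_Y = train_test_split(
    data, labels, test_size=0.3, random_state=1, stratify=labels
)

# Step 2: split the held-out 30% in half (15% test, 15% validation).
# Stratify on rest_Y, the labels that belong to rest_X.
test_X, val_X, test_Y, val_Y = train_test_split(
    rest_X, rest_Y, test_size=0.5, random_state=1, stratify=rest_Y
)

print(len(train_X), len(test_X), len(val_X))
```

Checking the lengths of the three sets is a quick way to confirm that the second call really does halve the held-out portion.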

Regarding the error: you defined val_X in your second split, but you refer to Val_X when calling the vectorizer. Python names are case-sensitive, so change the uppercase V to lowercase and you should be fine!
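
A small sketch of why the NameError appears: Python treats `val_X` and `Val_X` as two unrelated names. The list assigned here is only a hypothetical stand-in for the real validation data.

```python
# Hypothetical stand-in for the validation inputs from the second split.
val_X = ["a sample document"]

try:
    Val_X  # capital V: a different name that was never assigned
except NameError as err:
    message = str(err)

print(message)
```

Using `val_X` consistently, with the same spelling as in the split, avoids the exception.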

spectre