Questions tagged [test]
14 questions
2
votes
2 answers
Dataset and why use evaluate()?
I am starting out in Machine Learning, and I have doubts about some concepts. I've read that we need to split our dataset into training, validation and test sets. I'll ask four questions related to them.
1 - Training set: It is used in .fit() for our model…
Murilo
- 125
- 3
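A minimal sketch of the three-way split the question asks about, using scikit-learn (the data here is synthetic and all names are illustrative, not the asker's actual code):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset
X, y = make_classification(n_samples=1000, random_state=42)

# Carve out the held-out test set first, then split the rest into train/validation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)               # training set: consumed by .fit()
val_score = model.score(X_val, y_val)     # validation set: model/hyperparameter selection
test_score = model.score(X_test, y_test)  # test set: final, one-time evaluation
```

Each set has one job: `.fit()` only ever sees the training data, the validation score guides choices during development, and the test score is reported once at the end.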
1
vote
1 answer
When the same combination of variable values appears in train and test datasets?
I’m studying the basics of ML and trying to train a random forest model on a .csv dataset in which each row contains the pixel values of the red, green and blue bands (each ranging from 0 to 255), plus a binary (0/1) target variable.…
Kol Rocket
- 13
- 3
1
vote
1 answer
I don't understand this way of having a stable train/test split even after updating the dataset
from zlib import crc32
import numpy as np
def is_id_in_test_set(identifier, test_ratio):
return crc32(np.int64(identifier)) < test_ratio * 2**32
def split_data_with_id_hash(data, test_ratio, id_column):
ids = data[id_column]
in_test_set =…
samsamradas
- 115
- 3
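The excerpt above is cut off by the listing; a completed sketch of the same idea follows (the example DataFrame is hypothetical). The trick: hashing a stable row identifier assigns each row to train or test deterministically, so appending new rows never moves an old row across the split.

```python
from zlib import crc32

import numpy as np
import pandas as pd

def is_id_in_test_set(identifier, test_ratio):
    # crc32 maps the id to a deterministic pseudo-random 32-bit value;
    # comparing it against test_ratio * 2**32 puts roughly test_ratio
    # of all ids in the test set, and an id's assignment never changes.
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Hypothetical dataset: 1000 rows keyed by a stable integer id
data = pd.DataFrame({"id": range(1000), "value": np.arange(1000) * 2})
train_set, test_set = split_data_with_id_hash(data, 0.2, "id")
```

Re-running the split on the same (or a grown) dataset reproduces the same test-set membership for existing ids, which is exactly the stability a random shuffle lacks.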
1
vote
1 answer
Imputation in train or test data
I have a rather simple question.
Let's say I want to do a median imputation. I've read in some places that you should do:
imputer = SimpleImputer(strategy='median')
train_imputed = pd.DataFrame(imputer.fit_transform(train[feature_columns]),…
Guilherme Raibolt
- 35
- 5
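The pattern the question is circling: fit the imputer on the training data only, then apply the learned statistic to both sets. A small sketch with a hypothetical numeric feature:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"age": [np.nan, 25.0]})

imputer = SimpleImputer(strategy="median")
# Fit on train only: the median (30.0 here) is learned from training data
train_imputed = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
# The test set reuses the train median, so no test information leaks backwards
test_imputed = pd.DataFrame(imputer.transform(test), columns=test.columns)
```

Calling `fit_transform` on the test set instead would compute a second median from test data, which is the leakage the usual advice warns against.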
1
vote
1 answer
Test score higher than train score
I implemented a Gaussian Naive Bayes classifier and got a test score (99.99%) higher than the train score (96.87%).
Is this normal, or does it mean that my model is underfitting?
Thank you.
biihu
- 21
- 1
- 3
1
vote
1 answer
How to address label imbalance in deciding train/test splits?
I'm working on a dataset that isn't split into test and train sets by default, and I'm a bit concerned about the imbalance between the 'label' distributions in the two sets and how it might affect the trained model's performance. Let me note that I use…
civy
- 111
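One common answer to this concern is a stratified split, which preserves the label proportions in both sets; a minimal sketch with made-up imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 90/10 class ratio in both the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```

Without `stratify`, a purely random 20% split could easily land with too few (or zero) minority-class samples in the test set.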
1
vote
2 answers
Updating a train/val/test set
It is considered best practice to split your data into a train and test set at the start of a data science / machine learning project (and then to split your train set further into a validation set for hyperparameter optimisation).
If it turns out that the…
Aesir
- 458
- 1
- 6
- 15
0
votes
1 answer
Is it a problem to use the test dataset for hyperparameter tuning when I want to compare 2 classification algorithms on 10 different datasets?
I know that we should use the validation set to perform hyperparameter tuning, and that the test dataset is no longer really a test if it is used for hyperparameter tuning. But is this a problem if I want to compare the performance of 2 algorithms…
John B
- 1
0
votes
2 answers
Why shouldn't we try to balance the test set?
Most advice I have found online is that we must not balance the test set. The test set should remain unseen.
However, I fail to see how balancing the test set would cause us to leak knowledge about the test set into the training set.…
Fraïssé
- 119
- 3
0
votes
1 answer
Fairness metrics in the test set when wrong distribution
There is a question my colleagues and I have been discussing for weeks, and I wanted your opinion. I have a model for diagnosis of a disease and I want to know if it is fair. I train the model with one cohort and I use another cohort for testing.…
0
votes
1 answer
The meaning of p and degrees of freedom in a t-test
I read about T-Test and how we can use it to compare between 2 models (https://towardsdatascience.com/paired-t-test-to-evaluate-machine-learning-classifiers-1f395a6c93fa)
There are some issues I'm not sure I understand correctly:
I saw that we…
user3668129
- 769
- 4
- 15
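A sketch of the setup the linked article describes, using SciPy's paired t-test on per-fold scores of two models (the accuracy numbers below are invented for illustration). With n matched pairs, the degrees of freedom are n − 1, and p is the probability of a t-statistic at least this extreme under the null hypothesis of no difference:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies of two classifiers on the same 10 CV folds
scores_a = np.array([0.80, 0.82, 0.79, 0.81, 0.83, 0.80, 0.78, 0.82, 0.81, 0.80])
scores_b = np.array([0.76, 0.78, 0.77, 0.75, 0.79, 0.76, 0.74, 0.78, 0.77, 0.75])

# Paired t-test: the folds are matched pairs, so df = n - 1 = 9
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
df = len(scores_a) - 1
```

The pairing matters: it tests the per-fold differences, which removes fold-to-fold variation that an unpaired test would count as noise.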
0
votes
1 answer
How to demonstrate two variables are orthogonal with respect to the output in a 3-D Python dataset?
I have a Python dataset with 300 samples and 3 columns: 2 independent integer variables X, Y and the dependent continuous variable F (output).
The X variable can only take 3 values, but Y can take up to 1024 different values. Based on my sample, I…
0
votes
2 answers
Why label encoding before split is data leakage?
I want to ask why Label Encoding before the train/test split is considered data leakage.
From my point of view, it is not. Because, for example, you encode "good" to 2, "neutral" to 1 and "bad" to 0. It will be the same for both train and test sets.
So, why…
Anar
- 73
- 5
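The asker's intuition holds when the category set is fixed and known in advance; the usual objection is that fitting the encoder on all the data quietly uses test-set information, which surfaces when the test set contains a category absent from the training set. A small illustration (labels are made up):

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["good", "neutral", "good"]
test_labels = ["bad", "good"]  # "bad" never appears in the training data

encoder = LabelEncoder()
encoder.fit(train_labels)  # leakage-free: fit on train only

try:
    encoder.transform(test_labels)
    saw_error = False
except ValueError:
    # Fitting on train+test would have hidden this unseen-category problem --
    # that silent peek at the test data is the leakage being discussed
    saw_error = True
```

An encoder fitted on the combined data would map "bad" without complaint, making the pipeline look more robust than it would be on genuinely new data.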
0
votes
1 answer
Is it good to use .fit to xtest when we use PolynomialFeatures() of sklearn?
My teacher did this in class, and I'm wondering: is it OK to use .fit_transform with xtest? Shouldn't it just be poly.transform(xtest)?
Teacher's Code
from sklearn.preprocessing import PolynomialFeatures
poly =…
JEAN LEONARDO
- 103
- 2
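A sketch of the pattern the question contrasts (the arrays are illustrative, not the teacher's actual data). For PolynomialFeatures, `fit` only records the input width and the combinations to generate, so the conventional transformer discipline is `fit_transform` on the training data and plain `transform` on the test data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x_train = np.array([[1.0], [2.0], [3.0]])
x_test = np.array([[4.0], [5.0]])

poly = PolynomialFeatures(degree=2)
train_poly = poly.fit_transform(x_train)  # fit (record the feature structure) + transform
test_poly = poly.transform(x_test)        # reuse the fitted structure on the test set
```

With this stateless transformer the numbers happen to come out the same either way, but using `transform` on the test set keeps the habit that matters for stateful transformers like scalers and imputers.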