Questions tagged [test]
14 questions
2
votes
2 answers
Dataset and why use evaluate()?
I am starting out in Machine Learning, and I have doubts about some concepts. I've read that we need to split our dataset into training, validation and test sets. I'll ask four questions related to them.
1 - Training set: It is used in .fit() for our model…
Murilo
- 125
- 3
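A minimal sketch of the three-way split the question asks about, using scikit-learn (the data here is synthetic and all names are illustrative, not the asker's actual code):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset
X, y = make_classification(n_samples=1000, random_state=42)

# Carve out the held-out test set first, then split the rest into train/validation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)               # training set: consumed by .fit()
val_score = model.score(X_val, y_val)     # validation set: model/hyperparameter selection
test_score = model.score(X_test, y_test)  # test set: final, one-time evaluation
```

Each set has one job: `.fit()` only ever sees the training data, the validation score guides choices during development, and the test score is reported once at the end.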
1
vote
1 answer
When the same combination of variable values appears in train and test datasets?
I’m studying the basics of ML and trying to train a random forest model on a .csv dataset in which each row contains the pixel values of the red, green and blue bands (each ranging from 0 to 255), plus a binary (0/1) target variable.…
Kol Rocket
- 13
- 3
1
vote
1 answer
I don't understand this way of having a stable train/test split even after updating the dataset
from zlib import crc32
import numpy as np
def is_id_in_test_set(identifier, test_ratio):
return crc32(np.int64(identifier)) < test_ratio * 2**32
def split_data_with_id_hash(data, test_ratio, id_column):
ids = data[id_column]
in_test_set =…
samsamradas
- 115
- 3
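The excerpt above is cut off by the listing; a completed sketch of the same idea follows (the example DataFrame is hypothetical). The trick: hashing a stable row identifier assigns each row to train or test deterministically, so appending new rows never moves an old row across the split.

```python
from zlib import crc32

import numpy as np
import pandas as pd

def is_id_in_test_set(identifier, test_ratio):
    # crc32 maps the id to a deterministic pseudo-random 32-bit value;
    # comparing it against test_ratio * 2**32 puts roughly test_ratio
    # of all ids in the test set, and an id's assignment never changes.
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Hypothetical dataset: 1000 rows keyed by a stable integer id
data = pd.DataFrame({"id": range(1000), "value": np.arange(1000) * 2})
train_set, test_set = split_data_with_id_hash(data, 0.2, "id")
```

Re-running the split on the same (or a grown) dataset reproduces the same test-set membership for existing ids, which is exactly the stability a random shuffle lacks.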
1
vote
1 answer
Imputation in train or test data
I have a rather simple question.
Let's say I want to do a median imputation. I've read in some places that you should do:
imputer = SimpleImputer(strategy='median')
train_imputed = pd.DataFrame(imputer.fit_transform(train[feature_columns]),…
Guilherme Raibolt
- 35
- 5
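The pattern the question is circling: fit the imputer on the training data only, then apply the learned statistic to both sets. A small sketch with a hypothetical numeric feature:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"age": [np.nan, 25.0]})

imputer = SimpleImputer(strategy="median")
# Fit on train only: the median (30.0 here) is learned from training data
train_imputed = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
# The test set reuses the train median, so no test information leaks backwards
test_imputed = pd.DataFrame(imputer.transform(test), columns=test.columns)
```

Calling `fit_transform` on the test set instead would compute a second median from test data, which is the leakage the usual advice warns against.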
1
vote
1 answer
Test score higher than train score
I implemented a Gaussian Naive Bayes classifier and got a test score (99.99%) higher than the train score (96.87%).
Is this normal, or does it mean that my model is underfitting?
Thank you.
biihu
- 21
- 1
- 3
1
vote
1 answer
How to address label imbalance in deciding train/test splits?
I'm working on a dataset that isn't split into test and train sets by default, and I'm a bit concerned about the imbalance between the 'label' distributions in the two sets and how it might affect the trained model's performance. Let me note that I use…
civy
- 111
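One common answer to this concern is a stratified split, which preserves the label proportions in both sets; a minimal sketch with made-up imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 90/10 class ratio in both the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```

Without `stratify`, a purely random 20% split could easily land with too few (or zero) minority-class samples in the test set.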
1
vote
2 answers
Updating a train/val/test set
It is considered best practice to split your data into a train and test set at the start of a data science / machine learning project (and then to split your train set further into a validation set for hyperparameter optimisation).
If it turns out that the…
Aesir
- 458
- 1
- 6
- 15
0
votes
1 answer
Is it a problem to use the test dataset for hyperparameter tuning when I want to compare 2 classification algorithms on 10 different datasets?
I know that we should use the validation set to perform hyperparameter tuning, and that the test dataset is no longer really a test if it is used for hyperparameter tuning. But is this a problem if I want to compare the performance of 2 algorithms…
John B
- 1
0
votes
2 answers
Why shouldn't we try to balance the test set?
Most advice I have found online is that we must not balance the test set. The test set should remain unseen.
However, I fail to see how balancing the test set would cause us to leak knowledge about the test set into the training set.…
Fraïssé
- 119
- 3
0
votes
1 answer
Fairness metrics in the test set when wrong distribution
There is a question my colleagues and I have been discussing for weeks, and I wanted your opinion. I have a model for diagnosis of a disease and I want to know if it is fair. I train the model with one cohort and I use another cohort for testing.…
0
votes
1 answer
The meaning of p and degrees of freedom in a t-test
I read about T-Test and how we can use it to compare between 2 models (https://towardsdatascience.com/paired-t-test-to-evaluate-machine-learning-classifiers-1f395a6c93fa)
There are some issues I'm not sure I understand correctly:
I saw that we…
user3668129
- 769
- 4
- 15
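A sketch of the setup the linked article describes, using SciPy's paired t-test on per-fold scores of two models (the accuracy numbers below are invented for illustration). With n matched pairs, the degrees of freedom are n − 1, and p is the probability of a t-statistic at least this extreme under the null hypothesis of no difference:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies of two classifiers on the same 10 CV folds
scores_a = np.array([0.80, 0.82, 0.79, 0.81, 0.83, 0.80, 0.78, 0.82, 0.81, 0.80])
scores_b = np.array([0.76, 0.78, 0.77, 0.75, 0.79, 0.76, 0.74, 0.78, 0.77, 0.75])

# Paired t-test: the folds are matched pairs, so df = n - 1 = 9
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
df = len(scores_a) - 1
```

The pairing matters: it tests the per-fold differences, which removes fold-to-fold variation that an unpaired test would count as noise.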
0
votes
1 answer
How to demonstrate two variables are orthogonal with respect to the output in a 3-D Python dataset?
I have a Python dataset with 300 samples and 3 columns: 2 independent integer variables X, Y and the dependent continuous variable F (output).
The X variable can only take 3 values, but Y can take up to 1024 different values. Based on my sample, I…
0
votes
2 answers
Why label encoding before split is data leakage?
I want to ask why Label Encoding before the train/test split is considered data leakage.
From my point of view, it is not. Because, for example, you encode "good" to 2, "neutral" to 1 and "bad" to 0. It will be the same for both train and test sets.
So, why…
Anar
- 73
- 5
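The asker's intuition holds when the category set is fixed and known in advance; the usual objection is that fitting the encoder on all the data quietly uses test-set information, which surfaces when the test set contains a category absent from the training set. A small illustration (labels are made up):

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["good", "neutral", "good"]
test_labels = ["bad", "good"]  # "bad" never appears in the training data

encoder = LabelEncoder()
encoder.fit(train_labels)  # leakage-free: fit on train only

try:
    encoder.transform(test_labels)
    saw_error = False
except ValueError:
    # Fitting on train+test would have hidden this unseen-category problem --
    # that silent peek at the test data is the leakage being discussed
    saw_error = True
```

An encoder fitted on the combined data would map "bad" without complaint, making the pipeline look more robust than it would be on genuinely new data.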
0
votes
1 answer
Is it good to use .fit to xtest when we use PolynomialFeatures() of sklearn?
My teacher did this in class, and I'm wondering: is it OK to use .fit_transform with xtest? Shouldn't it just be poly.transform(xtest)?
Teacher's Code
from sklearn.preprocessing import PolynomialFeatures
poly =…
JEAN LEONARDO
- 103
- 2
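A sketch of the pattern the question contrasts (the arrays are illustrative, not the teacher's actual data). For PolynomialFeatures, `fit` only records the input width and the combinations to generate, so the conventional transformer discipline is `fit_transform` on the training data and plain `transform` on the test data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x_train = np.array([[1.0], [2.0], [3.0]])
x_test = np.array([[4.0], [5.0]])

poly = PolynomialFeatures(degree=2)
train_poly = poly.fit_transform(x_train)  # fit (record the feature structure) + transform
test_poly = poly.transform(x_test)        # reuse the fitted structure on the test set
```

With this stateless transformer the numbers happen to come out the same either way, but using `transform` on the test set keeps the habit that matters for stateful transformers like scalers and imputers.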