Questions tagged [preprocessing]

Data preprocessing is a data mining technique that involves transforming raw data into a more understandable or more useful format.

Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing or predictive modeling.

540 questions
63 votes • 4 answers

Difference between OrdinalEncoder and LabelEncoder

I was going through the official scikit-learn documentation after reading a book on ML and came across the following: the documentation describes sklearn.preprocessing.OrdinalEncoder(), whereas in the book it was given…
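
A minimal sketch of the practical difference, assuming scikit-learn is available: LabelEncoder accepts only a 1-D array and is intended for the target column, while OrdinalEncoder works on a 2-D feature matrix and encodes every column at once.

    # Sketch: LabelEncoder vs OrdinalEncoder (assumes scikit-learn).
    from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

    y = ["cat", "dog", "cat", "bird"]            # 1-D target vector
    X = [["cat", "small"], ["dog", "large"],     # 2-D feature matrix
         ["cat", "small"], ["bird", "small"]]

    le = LabelEncoder()                          # for targets: 1-D input only
    print(le.fit_transform(y))                   # [1 2 1 0]

    oe = OrdinalEncoder()                        # for features: 2-D input
    print(oe.fit_transform(X))                   # one integer code per column
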
51 votes • 3 answers

StandardScaler before or after splitting data - which is better?

When I was reading about using StandardScaler, most of the recommendations were saying that you should use StandardScaler before splitting the data into train/test, but when I was checking some of the code posted online (using sklearn) there were…
tsumaranaina • 725 • 1 • 6 • 17
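
The standard practice is to estimate the scaling statistics on the training split only and reuse them on the test split, so no information from the test set leaks into preprocessing. A minimal sketch, assuming scikit-learn:

    # Sketch: fit StandardScaler on the training split, reuse it on the test split.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X = np.random.rand(100, 5)
    y = np.random.randint(0, 2, size=100)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # mean/std estimated on training data only
    X_test_scaled = scaler.transform(X_test)        # same statistics applied to the test data
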
41 votes • 2 answers

How to prepare/augment images for neural network?

I would like to use a neural network for image classification. I'll start with pre-trained CaffeNet and train it for my application. How should I prepare the input images? In this case, all the images are of the same object but with variations…
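
The question itself targets Caffe, but as a hedged illustration of a typical augmentation pipeline (torchvision is assumed here purely for the sketch; 227x227 is CaffeNet's usual crop size):

    # Sketch: a common augmentation pipeline for same-object images with pose/lighting variation.
    from torchvision import transforms

    augment = transforms.Compose([
        transforms.Resize(256),                                 # scale the shorter side to 256
        transforms.RandomResizedCrop(227),                      # random crop at CaffeNet's input size
        transforms.RandomHorizontalFlip(),                      # random mirroring
        transforms.ColorJitter(brightness=0.2, contrast=0.2),   # mild lighting variation
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],        # ImageNet channel statistics
                             std=[0.229, 0.224, 0.225]),
    ])
    # augmented = augment(pil_image)  # applied per image during training
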
26 votes • 4 answers

Different Test Set and Training Set Distribution

I am working on a data science competition for which the distribution of my test set is different from the training set. I want to subsample observations from the training set that closely resemble the test set. How can I do this?
Pooja • 261 • 1 • 3 • 3
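
One hedged sketch of a common approach to this (adversarial-validation style reweighting; scikit-learn is assumed, and the features are assumed to be numeric): train a classifier to tell train rows from test rows, then sample training rows in proportion to how test-like the classifier thinks they are.

    # Sketch: subsample training rows so the sample resembles the test distribution.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    def subsample_like_test(train_df, test_df, n_samples, random_state=0):
        # Label rows by origin: 0 = train, 1 = test.
        combined = pd.concat([train_df, test_df], ignore_index=True)
        origin = np.r_[np.zeros(len(train_df)), np.ones(len(test_df))]

        clf = RandomForestClassifier(n_estimators=200, random_state=random_state)
        clf.fit(combined, origin)

        # P(row looks like test) becomes the sampling weight of each training row.
        weights = clf.predict_proba(train_df)[:, 1]
        weights /= weights.sum()

        rng = np.random.default_rng(random_state)
        idx = rng.choice(len(train_df), size=n_samples, replace=False, p=weights)
        return train_df.iloc[idx]
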
23 votes • 2 answers

Loading own train data and labels in dataloader using pytorch?

I have x_data and labels separately. How can I combine and load them in the model using torch.utils.data.DataLoader? I have a dataset that I created; the training data has 20k samples and the labels are also separate. Let's say I want to load a…
Amarnath • 361 • 1 • 2 • 5
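
A minimal sketch, assuming the samples and labels can be converted to tensors: wrap them in a TensorDataset and hand that to DataLoader.

    # Sketch: combining separate data and label arrays into one DataLoader (PyTorch + NumPy assumed).
    import numpy as np
    import torch
    from torch.utils.data import TensorDataset, DataLoader

    x_data = np.random.rand(20000, 10).astype(np.float32)   # stand-in for the 20k samples
    labels = np.random.randint(0, 2, size=20000)             # stand-in for the separate labels

    dataset = TensorDataset(torch.from_numpy(x_data),
                            torch.from_numpy(labels).long())
    loader = DataLoader(dataset, batch_size=64, shuffle=True)

    for batch_x, batch_y in loader:
        pass  # feed batch_x, batch_y to the model here
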
21 votes • 3 answers

Image resizing and padding for CNN

I want to train a CNN for image recognition. The training images do not have a fixed size. I want the input size for the CNN to be 50x100 (height x width), for example. When I resize some small images (for example 32x32) to the input size, the content…
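
A hedged sketch of one way to reach a fixed 50x100 input without distorting small images: scale to fit, then pad the border (Pillow is assumed; the target size comes from the question).

    # Sketch: resize while preserving aspect ratio, then pad to the target size.
    from PIL import Image, ImageOps

    def resize_with_padding(img, target_h=50, target_w=100, fill=0):
        # Scale so the image fits inside target_w x target_h without distortion.
        scale = min(target_w / img.width, target_h / img.height)
        new_w, new_h = round(img.width * scale), round(img.height * scale)
        img = img.resize((new_w, new_h))

        # Pad the remaining border symmetrically: (left, top, right, bottom).
        pad_w, pad_h = target_w - new_w, target_h - new_h
        border = (pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2)
        return ImageOps.expand(img, border, fill=fill)

    # padded = resize_with_padding(Image.open("example.png"))  # hypothetical file name
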
17 votes • 2 answers

One Hot Encoding vs Word Embedding - When to choose one or another?

A colleague of mine is facing an interesting situation: he has quite a large set of possible values for a categorical feature (roughly 300 distinct values). The usual data science approach would be to perform one-hot encoding. However, wouldn't…
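
As a hedged sketch of the alternative being weighed (PyTorch assumed; roughly 300 categories as in the question), an embedding layer maps each integer-encoded category to a small dense vector instead of a 300-wide one-hot vector:

    # Sketch: a learned embedding for a categorical feature with ~300 distinct values.
    import torch
    import torch.nn as nn

    n_categories = 300        # roughly the cardinality mentioned in the question
    embedding_dim = 16        # much smaller than a 300-wide one-hot vector

    embed = nn.Embedding(n_categories, embedding_dim)

    category_ids = torch.tensor([0, 42, 299])   # integer-encoded category values
    dense_vectors = embed(category_ids)         # shape (3, 16), learned jointly with the model
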
15 votes • 2 answers

Preprocessing for Text Classification in Transformer Models (BERT variants)

This might be a silly question, but should one carry out the conventional text preprocessing steps when training one of the transformer models? I remember that for training Word2Vec or GloVe, we needed to perform extensive text cleaning…
TwinPenguins • 4,429 • 3 • 22 • 54
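
A minimal sketch of the point usually relevant here (the Hugging Face transformers package is assumed): BERT-style models come with their own subword tokenizer, so raw text is typically passed to it with little or no manual cleaning.

    # Sketch: feeding raw, uncleaned text to a BERT tokenizer.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    raw = "Hello!!! This is RAW text, with punctuation & casing kept."
    encoded = tokenizer(raw, truncation=True, padding="max_length", max_length=32)

    print(encoded["input_ids"])     # subword ids, including [CLS]/[SEP]
    print(tokenizer.tokenize(raw))  # WordPiece handles punctuation and casing itself
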
12 votes • 5 answers

Please review my sketch of the Machine Learning process

It's amazingly difficult to find an outline of the end-to-end machine learning process. As a total beginner, I find this lack of information frustrating, so I decided to try piecing together my own process by looking at a lot of tutorials that all do…
11 votes • 1 answer

Data preprocessing: Should we normalise images pixel-wise?

Let me present a toy example and some reasoning I had about image normalisation: suppose we have a CNN architecture to classify NxN grayscale images into two categories. Pixel values range from 0 (black) to 255 (white). Class 0: Images that…
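
To make the two options concrete, a hedged NumPy sketch contrasting global scaling with pixel-wise normalisation (one mean and standard deviation per pixel position, estimated on the training set):

    # Sketch: global vs pixel-wise normalisation for a stack of NxN grayscale images.
    import numpy as np

    train_images = np.random.randint(0, 256, size=(1000, 28, 28)).astype(np.float32)

    # Global scaling: one constant for the whole dataset.
    global_scaled = train_images / 255.0

    # Pixel-wise: a separate mean/std for every pixel position, from the training set only.
    pixel_mean = train_images.mean(axis=0)          # shape (28, 28)
    pixel_std = train_images.std(axis=0) + 1e-8     # avoid division by zero
    pixel_scaled = (train_images - pixel_mean) / pixel_std
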
9 votes • 2 answers

Effect of Stop-Word Removal on Transformers for Text Classification

The domain here is essentially topic classification, so not necessarily a problem where stop-words have an impact on the analysis (as opposed to, say, sentiment analysis where structure can affect meaning). With respect to the positional encoding…
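
For concreteness, a small hedged sketch (NLTK with the stopwords corpus and Hugging Face transformers are assumed) of what removing stop-words actually changes in the sequence the model sees:

    # Sketch: how stop-word removal shortens and shifts the token sequence a BERT tokenizer produces.
    from nltk.corpus import stopwords
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    stops = set(stopwords.words("english"))

    text = "the model was trained on a large collection of news articles"
    filtered = " ".join(w for w in text.split() if w not in stops)

    print(tokenizer.tokenize(text))      # original word order and positions
    print(tokenizer.tokenize(filtered))  # shorter sequence, so positional encodings shift
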
9 votes • 1 answer

Extracting individual emails from an email thread

Most open-source datasets are well formatted, i.e. each email message is cleanly separated, as in the Enron email dataset. But out in the real world it is highly difficult to separate the top email message from a thread of emails. For example, consider…
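
A hedged sketch of a simple regex heuristic for this (the patterns below are illustrative assumptions, not an exhaustive rule set; real threads usually need more robust handling):

    # Sketch: cut the top message off a reply chain using common quote markers.
    import re

    QUOTE_MARKERS = re.compile(
        r"^(On .+ wrote:|From: .+|-{2,} ?Original Message ?-{2,}|> )",
        re.MULTILINE | re.IGNORECASE,
    )

    def top_message(raw_email: str) -> str:
        match = QUOTE_MARKERS.search(raw_email)
        return raw_email[:match.start()].strip() if match else raw_email.strip()

    thread = (
        "Thanks, that works for me.\n\n"
        "On Mon, 3 Jun 2019 at 10:15, Alice <alice@example.com> wrote:\n"
        "> Could you confirm the meeting time?\n"
    )
    print(top_message(thread))   # -> "Thanks, that works for me."
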
9 votes • 1 answer

How to approach the numer.ai competition with anonymous scaled numerical predictors?

Numer.ai has been around for a while now and there seem to be only a few posts or other discussions about it on the web. The system has changed from time to time, and the set-up today is the following: train (N=96K) and test (N=33K) data with 21…
8 votes • 1 answer

Encoding with OrdinalEncoder: how to give levels as user input?

I am trying to do ordinal encoding using: from sklearn.preprocessing import OrdinalEncoder I will try to explain my problem with a simple dataset. X = pd.DataFrame({'animals':['low','med','low','high','low','high']}) enc =…
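
One way to supply the level order explicitly is the categories parameter of OrdinalEncoder (scikit-learn and pandas assumed; the column mirrors the one in the question):

    # Sketch: give OrdinalEncoder an explicit level order via `categories`.
    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    X = pd.DataFrame({'animals': ['low', 'med', 'low', 'high', 'low', 'high']})

    # One list of levels per column, in the order they should be encoded.
    enc = OrdinalEncoder(categories=[['low', 'med', 'high']])
    X['animals_encoded'] = enc.fit_transform(X[['animals']])
    print(X)   # low -> 0.0, med -> 1.0, high -> 2.0
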
8 votes • 1 answer

sklearn SimpleImputer too slow for categorical data represented as string values

I have a data set with categorical features represented as string values and I want to fill in missing values in it. I've tried to use sklearn's SimpleImputer, but it takes far longer than pandas to complete the task. Both methods produce…
vlc146543 • 83 • 1 • 4
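
A hedged sketch of the pandas alternative usually reached for in this situation: fill each string column with its most frequent value directly, shown next to the SimpleImputer equivalent for comparison (pandas, NumPy and scikit-learn assumed).

    # Sketch: most-frequent imputation for string-valued categorical columns.
    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({
        'colour': ['red', np.nan, 'blue', 'red', np.nan],
        'size':   ['S', 'M', np.nan, 'M', 'M'],
    })

    # pandas: fill each column with its mode.
    filled_pd = df.fillna(df.mode().iloc[0])

    # scikit-learn: same strategy, often slower on large object (string) arrays.
    imputer = SimpleImputer(strategy='most_frequent')
    filled_sk = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
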