Questions tagged [feature-engineering]

The process of using domain knowledge of the data to create features that improve the performance of machine learning algorithms.

648 questions
177 votes · 4 answers

When to use One Hot Encoding vs LabelEncoder vs DictVectorizer?

I have been building models with categorical data for a while now, and in this situation I basically default to using scikit-learn's LabelEncoder to transform the data prior to building a model. I understand the difference between OHE,…
asked by anthr (1,893 rep)
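A minimal sketch of how the three encoders differ, assuming scikit-learn ≥ 1.2 (where `sparse_output` replaced the older `sparse` argument on OneHotEncoder):

```python
# Sketch: the same categorical column through three scikit-learn encoders.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_extraction import DictVectorizer

colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# LabelEncoder: meant for *targets*; maps each class to an arbitrary integer,
# which imposes a fake ordering if used on input features.
print(LabelEncoder().fit_transform(colors["color"]))   # [2 1 0 1]

# OneHotEncoder: one binary column per category, no ordering implied.
ohe = OneHotEncoder(sparse_output=False)  # needs scikit-learn >= 1.2
print(ohe.fit_transform(colors[["color"]]))

# DictVectorizer: one-hot for string values, pass-through for numbers;
# convenient when rows arrive as dicts.
rows = colors.to_dict(orient="records")
print(DictVectorizer(sparse=False).fit_transform(rows))
```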
48 votes · 6 answers

Encoding features like month and hour as categorical or numeric?

Is it better to encode features like month and hour as factor or numeric in a machine learning model? On the one hand, I feel numeric encoding might be reasonable, because time is a forward-progressing process (the fifth month is followed by the…
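One middle ground worth knowing here is a cyclical sine/cosine encoding, which keeps month 12 adjacent to month 1; a minimal NumPy sketch:

```python
# Sketch: encode a cyclic feature (month 1-12) as a sine/cosine pair so that
# December and January end up close together in feature space.
import numpy as np

months = np.array([1, 2, 6, 12])
month_sin = np.sin(2 * np.pi * (months - 1) / 12)
month_cos = np.cos(2 * np.pi * (months - 1) / 12)
# hour of day would use / 24 instead of / 12
```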
33 votes · 1 answer

Ways to deal with longitude/latitude feature

I am working on a fictional dataset with 25 features. Two of the features are the latitude and longitude of a place, and the others are pH values, elevation, windSpeed, etc., with varying ranges. I can perform normalization on the other features but how do I…
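One standard trick for coordinates is to project them onto the unit sphere, which yields three bounded features and handles the longitude wrap-around at ±180°; a minimal sketch with a hypothetical point:

```python
# Sketch: map latitude/longitude (degrees) onto the unit sphere, giving three
# bounded features (x, y, z) where nearby places get nearby values even
# across the 180-degree longitude wrap-around.
import numpy as np

lat, lon = np.radians(40.7), np.radians(-74.0)  # hypothetical point
x = np.cos(lat) * np.cos(lon)
y = np.cos(lat) * np.sin(lon)
z = np.sin(lat)
```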
31 votes · 1 answer

Should one-hot vectors be scaled with numerical attributes?

In the case of having a combination of categorical and numerical attributes, I usually convert the categorical attributes to one-hot vectors. My question is: do I leave those vectors as is and scale the numerical attributes through…
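A minimal sketch of the usual arrangement, scaling only the numeric columns while the one-hot columns come out of a separate branch; column names are hypothetical:

```python
# Sketch: scale only the numeric columns and leave the 0/1 indicator columns
# untouched, using a ColumnTransformer.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]   # hypothetical column names
categorical_cols = ["city"]

pre = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
# pre.fit_transform(df) yields scaled numerics next to unscaled indicators
```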
30 votes · 3 answers

Why do we convert skewed data into a normal distribution?

I was going through a solution of the Housing prices competition on Kaggle (Human Analog's kernel on House Prices: Advanced Regression Techniques) and came across this part: # Transform the skewed numeric features by taking log(feature + 1). # This…
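A minimal sketch of that transformation, selecting columns by skewness and applying log1p; the 0.75 threshold and toy data are illustrative:

```python
# Sketch: apply log(1 + x) to numeric columns whose skewness exceeds a
# threshold, as in the quoted kernel comment.
import numpy as np
import pandas as pd
from scipy.stats import skew

df = pd.DataFrame({"LotArea": [8450, 9600, 11250, 215000]})  # toy data
numeric = df.select_dtypes(include=[np.number])
skewness = numeric.apply(lambda s: skew(s.dropna()))
skewed_cols = skewness[skewness > 0.75].index
df[skewed_cols] = np.log1p(df[skewed_cols])
```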
27 votes · 3 answers

Encoding categorical variables using likelihood estimation

I am trying to understand how I can encode categorical variables using likelihood estimation, but have had little success so far. Any suggestions would be greatly appreciated.
asked by small dwarf (271 rep)
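A minimal sketch of one common variant, target-mean ("likelihood") encoding with additive smoothing; the helper and its `m` parameter are hypothetical, and in practice the mapping must be fit on training data only:

```python
# Sketch: replace each category with a smoothed mean of the target,
# shrinking rare categories toward the global mean to limit overfitting.
import pandas as pd

def target_encode(train, col, target, m=10):
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return train[col].map(smoothed).fillna(global_mean)

df = pd.DataFrame({"city": ["a", "a", "b", "b", "b"], "y": [1, 0, 1, 1, 0]})
df["city_enc"] = target_encode(df, "city", "y")
```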
25 votes · 4 answers

Is feature engineering still useful when using XGBoost?

I was reading the material related to XGBoost. It seems that this method does not require any variable scaling, since it is tree-based and trees can capture complex non-linear patterns and interactions. And it can handle both numerical and…
asked by KevinKim (635 rep)
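Scaling is indeed moot for trees, but their splits are axis-aligned, so a relationship like a ratio between two columns is still worth adding by hand; a toy sketch with hypothetical column names:

```python
# Sketch: a tree splits on one feature at a time, so a ratio such as
# debt / income is hard to approximate from the raw columns but trivial
# once added explicitly.
import pandas as pd

df = pd.DataFrame({"debt": [100, 50, 300], "income": [200, 200, 300]})
df["debt_to_income"] = df["debt"] / df["income"]
```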
22 votes · 3 answers

How to perform feature engineering on unknown features?

I am participating in a Kaggle competition. The dataset has around 100 features, all unknown (in terms of what they actually represent). Basically they are just numbers. People are performing a lot of feature engineering on these features. I…
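With anonymous columns the usual recourse is mechanical: row-wise aggregates and pairwise interactions, kept or dropped by validation score; a minimal sketch:

```python
# Sketch: blind feature engineering on anonymous numeric columns -
# row-wise statistics plus pairwise interaction terms.
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((5, 4)), columns=[f"f{i}" for i in range(4)])
row_mean, row_std = X.mean(axis=1), X.std(axis=1)
X["row_mean"], X["row_std"] = row_mean, row_std
interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False).fit_transform(X)
```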
20 votes · 1 answer

What is the difference between one-hot encoding and leave-one-out encoding?

I am reading a presentation and it recommends not using leave-one-out encoding, but it is okay with one-hot encoding. I thought they were the same. Can anyone describe what the differences between them are?
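They are quite different; a minimal sketch of leave-one-out encoding, which replaces each category with the mean target of the other rows in that category (one-hot, by contrast, just creates indicator columns):

```python
# Sketch: leave-one-out target encoding - each row's own label is excluded
# from the category mean so it never leaks into its own feature value.
import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "a", "b", "b"], "y": [1, 0, 1, 1, 0]})
sums = df.groupby("cat")["y"].transform("sum")
counts = df.groupby("cat")["y"].transform("count")
df["cat_loo"] = (sums - df["y"]) / (counts - 1)
```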
18 votes · 2 answers

List of feature engineering techniques

Is there any resource with a list of feature engineering techniques? A mapping from data type and model to feature engineering technique would be a gold mine.
15 votes · 3 answers

Why does frequency encoding work?

Frequency encoding is a widely used technique in Kaggle competitions, and often proves to be a very reasonable way of dealing with categorical features of high cardinality. I really don't understand why it works. Does it work in very…
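A minimal sketch of the technique itself: each category is replaced by its relative frequency, collapsing arbitrary cardinality into one numeric column:

```python
# Sketch: frequency encoding of a high-cardinality categorical column.
import pandas as pd

s = pd.Series(["nyc", "nyc", "sf", "la", "nyc", "sf"])
freq = s.value_counts(normalize=True)
encoded = s.map(freq)   # nyc -> 0.5, sf -> 0.333..., la -> 0.166...
```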
12 votes · 1 answer

What feature engineering is necessary with tree based algorithms?

I understand data hygiene, which is probably the most basic feature engineering. That is making sure all your data is properly loaded, making sure N/As are treated as a special value rather than a number between -1 and 1, and tagging your…
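A minimal sketch of that N/A point for tree models: a sentinel outside the observed range lets a single split isolate missingness; the -999 value is arbitrary:

```python
# Sketch: for trees, fill N/A with a sentinel far from the data's range and
# optionally keep an explicit missingness indicator.
import numpy as np
import pandas as pd

s = pd.Series([0.2, np.nan, -0.7, 0.9])
missing_flag = s.isna().astype(int)   # optional indicator column
s_filled = s.fillna(-999)             # sentinel outside the [-1, 1] range
```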
11 votes · 2 answers

Dismissing features based on correlation with target variable

Is it valid to dismiss features based on their Pearson correlation values with the target variable in a classification problem? Say, for instance, I have a dataset with the following format, where the target variable takes 1 or 0: >>> dt.head() ID …
asked by MedAli (275 rep)
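A cautionary sketch: Pearson correlation only sees linear marginal relationships, so a feature can correlate near zero with the target and still determine it completely:

```python
# Sketch: a feature with ~0 Pearson correlation that fully determines the
# target - dropping it on correlation alone would be a mistake.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = (np.abs(x) > 1).astype(int)          # target depends on |x|, not x
print(pd.Series(x).corr(pd.Series(y)))   # near 0
```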
10 votes · 1 answer

Should I rescale tfidf features?

I have a dataset which contains both text and numeric features. I have encoded the text ones using the TfidfVectorizer from sklearn. I would now like to apply logistic regression to the resulting dataframe. My issue is that the numeric features…
asked by ignoring_gravity (793 rep)
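A minimal sketch of one common setup: tf-idf rows are already L2-normalized by default, so standardizing only the numeric branch is often enough; column names are hypothetical:

```python
# Sketch: tf-idf on the text column, StandardScaler on the numeric columns,
# feeding a logistic regression.
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pre = ColumnTransformer([
    ("text", TfidfVectorizer(), "description"),   # single text column
    ("num", StandardScaler(), ["price", "views"]),
])
model = make_pipeline(pre, LogisticRegression(max_iter=1000))
# model.fit(df, y)
```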
10 votes · 4 answers

Is this a good practice of feature engineering?

I have a practical question about feature engineering... say I want to predict house prices using logistic regression with a bunch of features, including zip code. Then by checking the feature importance, I realize zip is a pretty good…
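If zip code does earn its importance, one hedge against its cardinality is feature hashing rather than a full one-hot expansion; a minimal sketch:

```python
# Sketch: hash a high-cardinality categorical like zip code into a fixed
# number of columns, avoiding thousands of one-hot levels.
from sklearn.feature_extraction import FeatureHasher

zips = [{"zip": "10001"}, {"zip": "94103"}, {"zip": "10001"}]
X = FeatureHasher(n_features=32, input_type="dict").transform(zips)
```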