I have learned that, when building a regression model, we have to handle categorical variables by converting them into dummy variables. For example, suppose our data set contains a variable like location:

Location 
----------
California
NY
Florida

We have to convert it like this:

1  0  0
0  1  0
0  0  1
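This conversion can be sketched with pandas (the data frame and its values are illustrative):

```python
import pandas as pd

# Small illustrative frame matching the Location column above
df = pd.DataFrame({"Location": ["California", "NY", "Florida"]})

# pd.get_dummies creates one indicator column per level
# (columns come out in sorted order: California, Florida, NY)
dummies = pd.get_dummies(df["Location"])
print(dummies)
```

Each row has exactly one 1, marking that observation's location.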

However, it was suggested that we have to discard one dummy variable, no matter how many dummy variables there are.

Why do we need to discard one dummy variable?

Ethan
Mithun Sarker

2 Answers

Simply put: one level of your categorical feature (here, location) becomes the reference group during dummy encoding for regression, so it is redundant. Quoting from here: "A categorical variable of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables. This amounts to a linear hypothesis on the level means."
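The redundancy is visible directly in the design matrix: with an intercept column, the K dummy columns sum to the intercept, so the matrix loses a rank and X'X becomes singular. A minimal numpy sketch (the observations are made up for illustration):

```python
import numpy as np

# Intercept plus ALL 3 dummies (two observations per location):
# the three dummy columns sum to the intercept column.
X_full = np.array([
    [1, 1, 0, 0],  # intercept, California, Florida, NY
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
])
print(np.linalg.matrix_rank(X_full))  # 3, although there are 4 columns

# Dropping one dummy (the reference level) restores full column rank,
# so X'X is invertible and OLS has a unique solution.
X_reduced = X_full[:, [0, 2, 3]]
print(np.linalg.matrix_rank(X_reduced))  # 3 = number of columns
```

With the full matrix, ordinary least squares has infinitely many solutions; dropping one level makes the coefficients identifiable, each measuring a difference from the reference level.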

This is already discussed at this very nice stats.stackexchange answer.

I was told there is an advanced course by Yandex on Coursera that covers this subject in more detail if you still have doubts; see here. Note that you can always audit the course content for free. ;-)

Another nice post, if you want a thorough explanation with lots of examples from a statistical perspective that is not limited to dummy coding, is this one from UCLA (in R).

Note that if you are using pandas.get_dummies, there is a parameter, drop_first, that controls whether to get k-1 dummies out of k categorical levels by removing the first level. Note that the default is False, meaning the reference level is not dropped and k dummies are created out of k categorical levels!
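A quick sketch of the difference (data are illustrative; with string levels the alphabetically first one, here "California", is the level that gets dropped):

```python
import pandas as pd

df = pd.DataFrame({"Location": ["California", "NY", "Florida"]})

# Default drop_first=False: k dummies for k levels
full = pd.get_dummies(df["Location"])
print(list(full.columns))     # ['California', 'Florida', 'NY']

# drop_first=True: k-1 dummies; the dropped level becomes the reference
reduced = pd.get_dummies(df["Location"], drop_first=True)
print(list(reduced.columns))  # ['Florida', 'NY']
```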

TwinPenguins

You don't need to drop a level, depending on your use case.

See
In which cases shouldn't we drop the first level of categorical variables?
and the much more general question
In supervised learning, why is it bad to have correlated features?
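One common such case (my own sketch, not taken from the linked answers): with L2 regularization, the penalized normal equations stay solvable even when all k dummies plus an intercept are kept, so dropping a level is not strictly required. A numpy illustration with made-up targets:

```python
import numpy as np

# Intercept plus ALL 3 dummies: X'X is singular, so plain OLS has no
# unique solution -- but ridge regression's penalized system does.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
], dtype=float)
y = np.array([1.0, 1.2, 2.0, 2.1, 3.0, 2.9])

lam = 1.0
# (X'X + lam*I) is positive definite, hence invertible
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(beta)
```

Tree-based models similarly don't care about the linear dependence among dummy columns.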

Ben Reiniger