I have a dataset of house prices, for example. It contains features such as house size, monthly rent, house colour, location, and the year the house was built.

I want to cluster the data using all of these attributes. The problem is how to represent the categorical features, such as colour. How does a clustering algorithm work with these categorical variables?

Another question: what happens if, say, the monthly rent is 0? How does this affect the clustering?

shepan6
user102751

1 Answer

Thank you for your question, which asks how to represent categorical variables in clustering.

The main way to represent a categorical variable is as a one-hot encoding, where each category gets its own 0/1 feature (https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/).
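
As a rough illustration of one-hot encoding, here is a minimal sketch in plain Python. The `one_hot` helper and the `colours` list are made up for this example and are not from any library; in practice you would probably use an existing encoder instead.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 vector with one slot per category."""
    categories = sorted(set(values))  # sorted so the column order is reproducible
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for value in values:
        vec = [0] * len(categories)
        vec[index[value]] = 1  # exactly one slot is "hot" per example
        vectors.append(vec)
    return categories, vectors


# Hypothetical colour column for four houses.
colours = ["red", "blue", "white", "blue"]
categories, encoded = one_hot(colours)

print(categories)  # ['blue', 'red', 'white']
print(encoded)     # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]
```

Each house ends up with a vector whose length equals the number of distinct colours, so no artificial ordering between colours is implied.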

After converting the categorical variables to one-hot encodings, we simply concatenate these vectors with the numerical values (which should be normalised!) to make one long, high-dimensional vector for each example in the dataset.
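
The concatenation step could be sketched as follows. This is only an illustration under assumptions: the three houses are invented, min-max scaling is one possible choice of normalisation (the answer does not prescribe a specific one), and the helper names `min_max_scale` and `one_hot` are hypothetical.

```python
def min_max_scale(column):
    """Rescale a numeric column to the range [0, 1] (assumed normalisation)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: avoid division by zero
    return [(x - lo) / (hi - lo) for x in column]


def one_hot(values):
    """One 0/1 slot per category, with categories in sorted order."""
    categories = sorted(set(values))
    return [[1.0 if v == cat else 0.0 for cat in categories] for v in values]


# Hypothetical data for three houses.
sizes = [50.0, 100.0, 150.0]      # house size
rents = [0.0, 800.0, 1600.0]      # monthly rent, including a rent of 0
colours = ["red", "blue", "red"]  # categorical feature

scaled_sizes = min_max_scale(sizes)
scaled_rents = min_max_scale(rents)
colour_vectors = one_hot(colours)

# One feature vector per house: normalised numerics followed by the one-hot part.
features = [
    [size, rent] + colour
    for size, rent, colour in zip(scaled_sizes, scaled_rents, colour_vectors)
]

for row in features:
    print(row)
```

The resulting rows are what you would pass to a distance-based clustering algorithm. Note that in this sketch a rent of 0 is treated like any other number: it simply becomes the minimum of the scaled column.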

shepan6