I have unlabeled data (i.e., the data has no target variable from which I can learn its prior behaviour). It is a mix of continuous and categorical variables. I want to classify the test data into three categories on the basis of this unlabeled data.

The approach I took is to first cluster the unlabeled data, then use the resulting cluster assignments as labels for training a new model that predicts on top of them.

I want to know whether this approach is correct, or whether there is a better way to classify the test set. Is there a particular algorithm I should follow?

I am doing this in R.

The idea is to modify the training set so that it can be used to predict the test data properly. The target variable is missing in both the train and test sets.

Rahul Sharma

3 Answers

1

There are many algorithms you can use to group unlabeled data.

This is a very broad topic, but if you need a specific algorithm recommendation, see whether self-organizing maps (SOM) help with your problem. In R, try the kohonen package.

K-means is another popular clustering algorithm.

No matter which method you use, consider converting your categorical data to numerical data for clustering, as it may alleviate some of your mixed data-type woes.
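To make the K-means suggestion concrete, here is a minimal sketch in plain Python (standard library only) rather than R, so it runs without extra packages. It assumes the data has already been converted to numeric rows, and it uses a simple farthest-point initialisation that I picked only to keep the example deterministic. The toy `rows` are made up for illustration. In R, the built-in `kmeans` function covers the same ground.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(rows, k):
    """Deterministic initialisation (an assumption of this sketch):
    start from the first row, then repeatedly add the row that is
    farthest from the centroids chosen so far."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = max(
            rows,
            key=lambda row: min(squared_distance(row, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(rows, k, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = farthest_point_init(rows, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up toy data: three well-separated groups of three points each.
rows = [
    [1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
    [9.0, 1.0], [9.1, 0.9], [8.9, 1.1],
]
labels, centroids = kmeans(rows, k=3)
```

The returned `labels` can then serve as the target variable for a supervised model, which is the approach described in the question. With real mixed data you would probably also want to scale the continuous columns first so that no single column dominates the distance.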

1

It's really a broad topic, but I think you are on the right track.

I solved a similar problem a couple of months back, where I classified documents into multiple categories using a centroid-based algorithm. I clustered the training dataset using spherical K-means, and the centroid of each resulting cluster represented a category. Later, when predicting the category of a new document, I would compare the document with all the centroids and assign the category based on SSE, i.e. the centroid with the smallest sum of squared errors.
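The prediction step can be sketched roughly as follows, in plain Python rather than R. This is not my original code: the category names, centroid values and the `assign_category` helper are all hypothetical, and the sketch assumes the centroids were already found by a clustering step. Vectors are scaled to unit length, as spherical K-means does, before the squared error is computed.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def squared_error(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_category(vec, centroids):
    """Return the name of the centroid with the smallest squared error."""
    unit = normalize(vec)
    return min(centroids, key=lambda name: squared_error(unit, centroids[name]))


# Hypothetical centroids, one per category, as a clustering step might produce.
centroids = {
    "sports": normalize([5.0, 1.0, 0.0]),
    "finance": normalize([0.0, 4.0, 1.0]),
    "travel": normalize([1.0, 0.0, 6.0]),
}

# Hypothetical feature vector for a new, unseen document.
new_document = [3.0, 0.0, 1.0]
predicted = assign_category(new_document, centroids)
```

For unit-length vectors the squared error equals 2 minus twice the cosine similarity, so picking the smallest squared error is the same as picking the most similar centroid.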

Rohan
1

Michael is right - K-means clustering could possibly work for you, but K-means isn't designed to handle categorical variables.

If you don't have too many categories, then you could choose to represent them as dummy variables. Here's a link to a post where I explain dummy variables in Python Pandas. I also found a Stack Overflow answer that explains how to create dummy variables in R.
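As a rough illustration of what dummy variables look like, here is a small plain-Python sketch. The `dummy_encode` helper and the example records are made up for this answer. In practice `pandas.get_dummies` in Python or `model.matrix` in R would do this for you.

```python
def dummy_encode(records, categorical_columns):
    """Replace each categorical column with one 0/1 column per level."""
    # Collect the distinct levels of every categorical column.
    levels = {
        col: sorted({rec[col] for rec in records})
        for col in categorical_columns
    }
    encoded = []
    for rec in records:
        # Keep the non-categorical columns unchanged.
        row = {k: v for k, v in rec.items() if k not in categorical_columns}
        # Add one indicator column per level of each categorical column.
        for col in categorical_columns:
            for level in levels[col]:
                row[f"{col}_{level}"] = 1 if rec[col] == level else 0
        encoded.append(row)
    return encoded


# Made-up example records with one continuous and one categorical column.
records = [
    {"age": 34, "plan": "basic"},
    {"age": 51, "plan": "premium"},
    {"age": 27, "plan": "basic"},
]
encoded = dummy_encode(records, ["plan"])
```

Each level becomes its own column, so a variable with many categories adds many columns, which is why this works best when the number of categories is small.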

Andrew