2

I have a sparse matrix beside my data. I think these data could be beneficial but because it is sparse in usual usage is useless. Is there any method to transform it? for example, PCA, do you have any experience?


additional detail:

I working on titanic competition from Kaggle.com, I add nationality (base Name feature) to data. my nationality is something like: "Nordic,Scandinavian,Norway" so I have converted it to dummy variable, but then I have a sparse matrix.

I think usually sparse matrix could not be helpful for classification and we should transform it. (??? Is it true?)

I think in some way I should merge my features to reduce sparsity.

parvij
  • 791
  • 5
  • 18

2 Answers2

1

Your whole question makes sense, except for:

because it is sparse in usual usage is useless

Therefore, I will try to answer this from different angles.

Implementation of sparse matrix

A sparse matrix is not useless because many packages contain algorithms that accept a sparse matrix as input.

You can look at Sklearn's SVMs as an example.

You need a special implementation to deal with sparse matrix because they are stored differently from a normal matrix. However, as mentioned in the comments by @Majid Mortazavi you can follow this answer to convert your sparse matrix to a normal one and use any other implementations out there. You just need to be careful with your RAM usage.

Theoretical application of sparse matrix

A sparse matrix is not useless because many fields in machine learning make use of sparse features.

You can look at NLP as an example. A common feature used in text categorization is TF-IDF which is a sparse representation of a document.

Since the data is sparse, it is often enough to use linear models for classification, since most of the features for a data point will be $0$. This can be considered a good thing because they are easier and faster to compute and no polynomial kernels are needed.

Solving your challenge

You also mentioned PCA for dimensionality reduction but I don't understand why you would need it at this points. Seems like you are creating a sparse matrix to encode a categorical feature.

Maybe, instead of reducing your sparse matrix, you can have a look at better ways to deal with categorical features.

Bruno Lubascher
  • 3,618
  • 1
  • 14
  • 36
0

You can use TruncatedSVD instead of PCA for sparse matrices. The logic is similar, but it accepts sparse data such as NLP text.

fuwiak
  • 1,373
  • 8
  • 14
  • 26