
I have a sparse matrix $X$ created by TfidfVectorizer; its shape is $(500000, 200000)$. I want to convert $X$ to a data frame, but I keep getting a memory error.

I tried

pd.DataFrame(X.toarray(), columns=tokens)

and

pd.read_csv(X.toarray().astype("float32"), columns=tokens, chunksize=...).

In both cases, the error seems to occur when $X$ is converted to a dense numpy array with X.toarray().

Is there an easy solution for this? Is there any way I can create a sparse data frame from $X$ without a memory error?

I have been running my code on Google Colab Pro, which I believe provides less than 100 GB of RAM.

3 Answers


You can use pandas.DataFrame.sparse.from_spmatrix. It creates a DataFrame whose columns are backed by pd.arrays.SparseArray, directly from a scipy sparse matrix.

Pandas used to have a dedicated SparseDataFrame class, but modern versions no longer have that concept; there is only the normal pd.DataFrame populated with sparse data.
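For example, a minimal sketch reusing the X and tokens names from the question (assumed to be the TF-IDF matrix and its vocabulary):

import pandas as pd

# Build a sparse-backed DataFrame directly from the scipy sparse matrix,
# without ever materializing a dense array via X.toarray().
df = pd.DataFrame.sparse.from_spmatrix(X, columns=tokens)

# Each column is backed by pd.arrays.SparseArray, so memory stays roughly
# proportional to the number of non-zero entries.
print(df.dtypes.head())    # Sparse[float64, 0] for each column
print(df.sparse.density)   # fraction of stored (non-zero) values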

zachdj

I have had to deal with huge data frames like the one you mention; in my case the problem was "solved" by storing the data frame as a pickle with pd.to_pickle() rather than as a CSV.

Memory usage was reduced by 60%.

I also recently heard about a format named Feather.

For reference:

https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d

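As a minimal sketch of the two options, assuming an existing DataFrame df (the file names are placeholders, and Feather needs the optional pyarrow dependency):

import pandas as pd

# Pickle preserves pandas dtypes (including sparse-backed columns) and is
# typically much smaller than a CSV of the same frame.
df.to_pickle("frame.pkl")
df_back = pd.read_pickle("frame.pkl")

# Feather is a fast columnar format; requires pyarrow to be installed.
df.to_feather("frame.feather")
df_back = pd.read_feather("frame.feather")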

Multivac

Apart from keeping the matrix sparse, you can also use the max_df, min_df, or max_features parameters of TfidfVectorizer to shrink the vocabulary and therefore the number of columns.
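A minimal sketch, where corpus stands in for your list of documents and the thresholds are illustrative values rather than recommendations:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    min_df=5,            # drop terms that appear in fewer than 5 documents
    max_df=0.8,          # drop terms that appear in more than 80% of documents
    max_features=50000,  # keep only the 50,000 most frequent remaining terms
)
X = vectorizer.fit_transform(corpus)  # still a scipy sparse matrix, but with far fewer columns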

MANU