
I have a sparse matrix $X$ created by TfidfVectorizer; its shape is $(500000, 200000)$. I want to convert $X$ to a data frame, but I keep getting a memory error.

I tried

pd.DataFrame(X.toarray(), columns=tokens)

and

pd.read_csv(X.toarray().astype("float32"), columns=tokens, chunksize=...).

In both cases, the error seems to occur when $X$ is converted to a dense numpy array with X.toarray().

Is there an easy solution for this? Is there any way I can create a sparse data frame from $X$ without a memory error?

I have been running my code on Google Colab Pro, which I believe provides less than 100 GB of RAM.

3 Answers


You can use pandas.DataFrame.sparse.from_spmatrix. It creates a DataFrame whose columns are backed by pd.arrays.SparseArray, directly from a scipy sparse matrix.

Pandas used to have a dedicated SparseDataFrame class, but modern versions no longer have that concept; there is only the normal pd.DataFrame populated with sparse data.
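For example, a minimal sketch reusing the X and tokens names from the question (assumed to be the TF-IDF matrix and its vocabulary):

import pandas as pd

# Build a sparse-backed DataFrame directly from the scipy sparse matrix,
# without ever materializing a dense array via X.toarray().
df = pd.DataFrame.sparse.from_spmatrix(X, columns=tokens)

# Each column is backed by pd.arrays.SparseArray, so memory stays roughly
# proportional to the number of non-zero entries.
print(df.dtypes.head())    # Sparse[float64, 0] for each column
print(df.sparse.density)   # fraction of stored (non-zero) values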

zachdj

I have had to deal with huge data frames like the one you mention; in my case the problem was "solved" by storing the data frame as a pickle with pd.to_pickle() rather than as a CSV.

Memory usage was reduced by 60%.

I also recently heard about a format named Feather.

For reference:

https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d

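As a minimal sketch of the two options, assuming an existing DataFrame df (the file names are placeholders, and Feather needs the optional pyarrow dependency):

import pandas as pd

# Pickle preserves pandas dtypes (including sparse-backed columns) and is
# typically much smaller than a CSV of the same frame.
df.to_pickle("frame.pkl")
df_back = pd.read_pickle("frame.pkl")

# Feather is a fast columnar format; requires pyarrow to be installed.
df.to_feather("frame.feather")
df_back = pd.read_feather("frame.feather")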

Multivac

Apart from keeping the matrix sparse, you can also use the max_df, min_df, or max_features parameters of TfidfVectorizer to shrink the vocabulary and therefore the number of columns.
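A minimal sketch, where corpus stands in for your list of documents and the thresholds are illustrative values rather than recommendations:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    min_df=5,            # drop terms that appear in fewer than 5 documents
    max_df=0.8,          # drop terms that appear in more than 80% of documents
    max_features=50000,  # keep only the 50,000 most frequent remaining terms
)
X = vectorizer.fit_transform(corpus)  # still a scipy sparse matrix, but with far fewer columns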

MANU