2

I have about 100 MB of CSV data, cleaned and used for training in Keras, stored as a pandas DataFrame. What is a good (simple) way of saving it for fast reads? I don't need to query it or load only part of it.

Some options appear to be:

  • HDFS
  • HDF5
  • HDFS3
  • PyArrow
B Seven

3 Answers

6

With 100 MB of data, you can store it on any filesystem as CSV, since reading it will take well under a second.

Most of that time will be spent by the DataFrame runtime parsing the text and building the in-memory data structures.
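A minimal sketch to check this on your own data (the DataFrame here is a random stand-in of roughly 100 MB, and the file names are illustrative, not from the answer):

```python
import time

import numpy as np
import pandas as pd

# Hypothetical stand-in for the ~100 MB of cleaned training data.
df = pd.DataFrame(np.random.rand(1_000_000, 12),
                  columns=[f"col{i}" for i in range(12)])

df.to_csv("train.csv", index=False)
df.to_pickle("train.pkl")  # binary dump, no text parsing on load

start = time.perf_counter()
pd.read_csv("train.csv")
print(f"read_csv:    {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
pd.read_pickle("train.pkl")
print(f"read_pickle: {time.perf_counter() - start:.2f} s")
```

The gap between the two timings is essentially the CSV parsing cost; for a file this size either option is usually fast enough.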

Shamit Verma
4

You can find a nice benchmark of each approach here.

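If the linked image is unavailable, a rough benchmark is easy to reproduce yourself; here is a hedged sketch (the DataFrame is synthetic, the file names are mine, and Parquet/Feather assume pyarrow is installed):

```python
import time

import numpy as np
import pandas as pd

# Hypothetical ~100 MB DataFrame to benchmark with.
df = pd.DataFrame(np.random.rand(1_000_000, 12),
                  columns=[f"col{i}" for i in range(12)])

# (writer, reader, path) triples for a few common on-disk formats.
formats = [
    (lambda p: df.to_csv(p, index=False), pd.read_csv,     "data.csv"),
    (df.to_pickle,                        pd.read_pickle,  "data.pkl"),
    (df.to_parquet,                       pd.read_parquet, "data.parquet"),
    (df.to_feather,                       pd.read_feather, "data.feather"),
]

for write, read, path in formats:
    write(path)
    start = time.perf_counter()
    read(path)
    print(f"{path:14s} read in {time.perf_counter() - start:.2f} s")
```

Binary columnar formats (Parquet, Feather) typically read noticeably faster than CSV, but the exact numbers depend on your data and hardware.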

2

Your data is not that large, but there are some debates about this whenever you deal with big data; see What is the best way to store data in Python and Optimized I/O operations in Python. It all depends on how the serialisation is done and on the policies applied at the different layers, for instance security and transactional validity. I think the latter link in particular can help you when dealing with large data.

Green Falcon