2

I have about 100 MB of CSV data, cleaned and used for training in Keras, stored as a pandas DataFrame. What is a good (simple) way of saving it for fast reads? I don't need to query it or load only part of it.

Some options appear to be:

  • HDFS
  • HDF5
  • HDFS3
  • PyArrow
B Seven

3 Answers

6

With 100 MB of data, you can store it on any filesystem as CSV, since reading it will take well under a second.

Most of that time will be spent by the DataFrame runtime parsing the text and building the in-memory data structures.
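A minimal sketch to check this on your own data (the DataFrame here is a random stand-in of roughly 100 MB, and the file names are illustrative, not from the answer):

```python
import time

import numpy as np
import pandas as pd

# Hypothetical stand-in for the ~100 MB of cleaned training data.
df = pd.DataFrame(np.random.rand(1_000_000, 12),
                  columns=[f"col{i}" for i in range(12)])

df.to_csv("train.csv", index=False)
df.to_pickle("train.pkl")  # binary dump, no text parsing on load

start = time.perf_counter()
pd.read_csv("train.csv")
print(f"read_csv:    {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
pd.read_pickle("train.pkl")
print(f"read_pickle: {time.perf_counter() - start:.2f} s")
```

The gap between the two timings is essentially the CSV parsing cost; for a file this size either option is usually fast enough.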

Shamit Verma
4

You can find a nice benchmark of each approach here.

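If the linked image is unavailable, a rough benchmark is easy to reproduce yourself; here is a hedged sketch (the DataFrame is synthetic, the file names are mine, and Parquet/Feather assume pyarrow is installed):

```python
import time

import numpy as np
import pandas as pd

# Hypothetical ~100 MB DataFrame to benchmark with.
df = pd.DataFrame(np.random.rand(1_000_000, 12),
                  columns=[f"col{i}" for i in range(12)])

# (writer, reader, path) triples for a few common on-disk formats.
formats = [
    (lambda p: df.to_csv(p, index=False), pd.read_csv,     "data.csv"),
    (df.to_pickle,                        pd.read_pickle,  "data.pkl"),
    (df.to_parquet,                       pd.read_parquet, "data.parquet"),
    (df.to_feather,                       pd.read_feather, "data.feather"),
]

for write, read, path in formats:
    write(path)
    start = time.perf_counter()
    read(path)
    print(f"{path:14s} read in {time.perf_counter() - start:.2f} s")
```

Binary columnar formats (Parquet, Feather) typically read noticeably faster than CSV, but the exact numbers depend on your data and hardware.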

2

Your data is not that large, but there are some debates about this whenever you deal with big data; see What is the best way to store data in Python and Optimized I/O operations in Python. It all depends on how the serialisation is done and on the policies applied at the different layers, for instance security and transactional validity. I think the latter link in particular can help you when dealing with large data.

Green Falcon