
In short, what would be the best methods/tricks/techniques/tools for performing ad hoc SQL-style queries on 350 TB of CSV data? Are there other options, tool-wise, that would do it faster if we dropped the "SQL" requirement?

Is my best option Hive and as many servers as I can get my hands on? Would Spark be of any benefit, since a dataset of this size wouldn't come close to fitting into memory? I have other related questions/ideas but don't want to bog this down.

Background

The data is originally stored in a binary format that gets converted into an ASCII file. The ASCII file is then turned into CSV (actually tab-separated, but whatever). 1 binary file = 1 ASCII file = 1 CSV file. The data format is pretty simple: 200 to 500 header columns, and each row is a sample of every attribute at 1-second intervals.

Example (first row is header):
t,attribute1,attribute2...<attributeX>
1,val1,val2...<valx>
2,val1,val2...<valx>
3,val1,val2...<valx>
...
X,val1,val2...<valx>

Example queries that could be run:

1) Take a t value (t is non-unique, as each file starts at 1 and ends at some arbitrary time) and compare the values of all the attributes that were sampled over the t range 100-1000 (a rough sketch of what this could look like follows the list).

2) There is a UTC timestamp as one of the attribute columns. Same idea as #1: we might want to compare all attributes between X dates and compare those attributes to data from a different UTC range Y.

3) We might want to pick out a single particular attribute and trend it from the earliest time we saw it, relative to a different particular attribute.
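
For concreteness, here is a rough sketch of what #1 and #2 might look like with Spark SQL (PySpark). The file path and the "utc" column name are just placeholders, and on real data you'd supply an explicit schema rather than inferring it:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("adhoc-queries").getOrCreate()

    # Read every tab-separated file; the header row supplies the column names.
    df = spark.read.csv("/data/converted/*.csv", sep="\t", header=True, inferSchema=True)
    df.createOrReplaceTempView("samples")

    # Query #1: all attributes sampled in the t range 100-1000.
    q1 = spark.sql("SELECT * FROM samples WHERE t BETWEEN 100 AND 1000")

    # Query #2: same idea over a UTC range ('utc' is a guessed column name).
    q2 = spark.sql("SELECT * FROM samples WHERE utc BETWEEN '2015-01-01' AND '2015-01-31'")

    q1.show()
    q2.show()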

These are my best examples as provided to me, but other types of queries could exist. If you're wondering why we don't cut out the middleman and skip the binary-to-CSV conversion: theoretically we could, but the program that converts the files is quite old and very involved.

Kevin Vasko

2 Answers


There are many tools that can support such queries (Hive or Spark, as you mentioned), and it really comes down to your requirements in terms of the number of queries, the number of people who will query the data, what kind of BI or reporting tools you might want to use with it, etc.

More than that, your requirements will probably change in the future. The easier you make it to query the data, the more people around you will use it.

For example, if you choose to load the data into Amazon Redshift (https://aws.amazon.com/redshift/), it will give you full SQL flexibility, very fast performance, and the ability to connect to several BI, visualization, and reporting tools. When you COPY the data into Redshift it is compressed (column-based encoding) 5-8x, and you could put all your data in a cluster of 6 DS2.8XL Redshift nodes for a couple of weeks for your POC.
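
As a rough illustration only (the cluster endpoint, the "samples" table, the S3 bucket, and the IAM role below are placeholders, and the files are assumed to already be staged in S3), kicking off such a COPY from Python could look like this:

    import psycopg2

    # Connection details are placeholders; Redshift speaks the PostgreSQL protocol,
    # and a "samples" table matching the file columns is assumed to exist already.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="...",
    )

    copy_sql = """
        COPY samples
        FROM 's3://my-bucket/converted/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        DELIMITER '\\t'
        IGNOREHEADER 1;
    """

    # Redshift applies the column encodings (compression) as part of the COPY.
    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)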

The concept of compression can be applied in other solutions as well. For example, instead of using the CSV format, you can convert the data to Parquet (https://parquet.apache.org/), saving on storage and increasing performance because you move around and scan fewer bytes. Once your data is in Parquet, you can query it with Presto, for example, as Netflix does: http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
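
A one-time conversion of the tab-separated files to Parquet can be a short PySpark job along these lines (the paths are placeholders; in practice you'd define an explicit schema for 350 TB rather than inferring it):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tsv-to-parquet").getOrCreate()

    # Read the tab-separated files with their header row.
    df = spark.read.csv("/data/converted/*.csv", sep="\t", header=True, inferSchema=True)

    # Write columnar, compressed Parquet (Snappy is the default codec),
    # which Presto or Spark can then query directly.
    df.write.mode("overwrite").parquet("/data/parquet/samples")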

Regarding Spark, it can help even if all the data doesn't fit into memory. Spark also pairs well with Zeppelin (https://zeppelin.incubator.apache.org/) as an interactive notebook with nice visualization features, and you can write code in Scala, Python, or even R with SparkR.
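
For example, a notebook paragraph could be as small as the following sketch (the Parquet path and the attribute1 column name are placeholders reusing the question's example):

    from pyspark.sql import SparkSession

    # In a Zeppelin %pyspark paragraph the `spark` session is already provided;
    # getOrCreate() reuses it, or creates one when run outside the notebook.
    spark = SparkSession.builder.getOrCreate()

    # The aggregation is computed partition by partition, so the full dataset never
    # has to fit in memory -- only the small aggregated result does.
    df = spark.read.parquet("/data/parquet/samples")

    summary = df.where("t BETWEEN 100 AND 1000").groupBy("t").avg("attribute1")
    summary.show()   # Zeppelin can also chart this result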

Guy

Talking about the SQL world: from the point of view of your requirements, a parallel column-based DB like Vertica seems to fit well. It is not free of charge, though. An even more powerful solution, if you have a good budget, is Teradata.

You'd need at least 70 spindles to store your data on 5 TB HDDs on site (350 TB / 5 TB per drive), plus or minus compression and redundancy. Maybe add more RAM or SSD if you can cache. That could fit in 3-10 boxes, depending on the processing power you require.

The rest is a software question about the trade-off between data management, feature set, optimization, your own time spent, and price. Teradata and Vertica will surely reduce the time you spend writing and optimizing queries and managing the data. If you need joins, that advantage becomes even bigger.

Diego