Big data face-off: Spark vs. Impala vs. Hive vs. Presto

Today AtScale released its Q4 benchmark results for the major big data SQL engines: Spark, Impala, Hive/Tez, and Presto.

The findings confirm much of what we already know: Impala is better at finding needles in moderate-size haystacks, even with a lot of concurrent users. Presto also does well here. Hive and Spark do better on long-running analytics queries.

I spoke to Joshua Klar, AtScale’s vice president of product management, and he noted that many of the company’s customers use two engines. Generally they view Hive as more stable and tend to run their long-running queries on it. All of their Hive customers use Tez, and none use MapReduce any longer.

In my experience, the stability gap between Spark and Hive closed a while ago, so long as you’re smart about memory management. As I noted recently, I don’t see a long-term future for Hive on Tez, because Impala and Presto are better for typical BI queries, and Spark generally performs better for analytics queries (that is, for finding smaller haystacks inside of huge haystacks). In an era of cheap memory, if you can afford to do large-scale analytics, you can afford to do it in memory, and everything else is more of a BI pattern.

While all of the engines have shown improvement over the last AtScale benchmark, Hive/Tez with the new LLAP (Live Long and Process) feature has made impressive gains across the board. The performance still hasn’t caught up with Impala and Spark, but according to this benchmark, it isn’t as slow and unwieldy as before — and at least Hive/Tez with LLAP is now practical to use in BI scenarios.

The full benchmark report is worth reading, but key highlights include:

  • Spark 2.0 improved its large query performance by an average of 2.4X over Spark 1.6 (so upgrade!). Small query performance was already good and remained roughly the same.

  • Impala 2.6 is 2.8X as fast for large queries as version 2.3. Small query performance was already good and remained roughly the same.

  • Hive 2.1 with LLAP is over 3.4X faster than Hive 1.2, and its small query performance doubled. If you’re using Hive, this isn’t an upgrade you can afford to skip.
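
To put those multipliers in terms of wall-clock time, here is a rough back-of-the-envelope sketch. It assumes that "N X faster" means old runtime divided by new runtime, which is my reading and not something the summary spells out. The helper name `runtime_fraction` is made up for illustration, and the factors are simply the ones quoted in the bullets above (Hive's is a lower bound, since the report says "over 3.4X").

```python
def runtime_fraction(speedup: float) -> float:
    """Fraction of the old runtime left after an N-times speedup.

    Assumes speedup = old_runtime / new_runtime (an interpretation,
    not a definition taken from the benchmark report).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


# Large-query speedup factors as quoted in the highlights above.
reported = {
    "Spark 2.0 vs. 1.6": 2.4,
    "Impala 2.6 vs. 2.3": 2.8,
    "Hive 2.1 with LLAP vs. 1.2": 3.4,  # "over 3.4X", so a lower bound
}

for label, factor in reported.items():
    remaining = runtime_fraction(factor)
    saved = 1.0 - remaining
    print(f"{label}: {factor}X -> about {remaining:.0%} of the old runtime "
          f"({saved:.0%} less waiting)")
```

Under that reading, a query that took 10 minutes on the older release would take 10 times `runtime_fraction(factor)` minutes on the newer one, all else being equal.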

What the report doesn’t really analyze is whether SQL is always the right way to go and how, say, a functional approach in Spark would compare. You need to take these benchmarks within the scope in which they are presented.

The bottom line is that all of these engines have dramatically improved in one year. Both Impala and Presto continue to lead in BI-type queries, and Spark leads performance-wise in large analytics queries. I’d like to see what could be done to address the concurrency issue with memory tuning, but that’s actually consistent with what I observed in the Google Dataflow/Spark Benchmark released by my former employer earlier this year. Either way, it is time to upgrade!

Source: InfoWorld Big Data