Posted to user@hive.apache.org by "Saumitra Shahapure (Vizury)" <sa...@vizury.com> on 2015/01/22 12:03:39 UTC

Re: Spark performance for small queries

Hello,

We have been comparing the performance of some of our production Hive
queries on Hive and on Spark. We compared Hive (0.13) + Hadoop (1.2.1)
against both Spark 0.9 and 1.1, and saw good performance gains with
Spark.

We tried a very simple query,
select count(*) from T where col3=123
in both Spark SQL and Hive (with hive.map.aggr=true) and found that Spark
was 2x faster than Hive (Hive took 120 sec, Spark 60 sec). Table T is
stored in S3 and consists of a single 600 MB GZIP file.

My question is: why is Spark faster than Hive here? In both cases, the
file is downloaded, uncompressed, and its lines are counted by a single
process. In the Hive case, the reducer is effectively an identity function,
since hive.map.aggr is true.
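
To make that reasoning concrete, here is a minimal Python sketch of what a
single mapper with map-side aggregation does for this query. It is only an
illustration, not Hive's or Spark's actual code path: the function names,
the tab delimiter, and the sample rows are all assumptions made for the
example.

```python
import gzip
import io


def map_side_count(gz_bytes, col_index, value, sep="\t"):
    """Hypothetical mapper: decompress one gzip file and return a single
    partial count instead of emitting one record per matching row."""
    count = 0
    # A gzip file is not splittable, so one process reads it end to end.
    with gzip.open(io.BytesIO(gz_bytes), mode="rt") as lines:
        for line in lines:
            fields = line.rstrip("\n").split(sep)
            if len(fields) > col_index and fields[col_index] == value:
                count += 1
    return count


def reduce_counts(partial_counts):
    """Hypothetical reducer: sums the partial counts from all mappers."""
    return sum(partial_counts)


# Made-up sample data standing in for table T (col3 is the third field).
rows = ["a\tb\t123", "c\td\t456", "e\tf\t123"]
gz_bytes = gzip.compress("\n".join(rows).encode("utf-8"))

# One gzip file means one mapper, so the reducer receives one value.
partials = [map_side_count(gz_bytes, col_index=2, value="123")]
total = reduce_counts(partials)
print(total)
```

With a single input file there is only one partial count, so the reduce
step simply passes that value through, which is why it behaves like an
identity function here.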

Note that disk spills and network I/O are very low in the Hive case as well.

Re: Spark performance for small queries

Posted by sjayatheertha <sj...@gmail.com>.
I'm not answering your question, but could you give me more insight into where and how you use Spark? I know that Spark has in-memory capabilities.

Also, I have a similar question about ways to optimize Hive queries and file storage. Which is better, ORC or Parquet, and when should compression be used?
