Posted to user@spark.apache.org by Flavio Pompermaier <po...@okkam.it> on 2014/06/22 13:32:00 UTC

Shark vs Impala

Hi folks,
I was looking at the benchmark provided by Cloudera at
http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/
.
Is it true that Shark cannot execute some queries if you don't have enough
memory?
And is it true/reliable that Impala outperforms Spark by so much when
executing complex queries?

Best,
Flavio

Re: Shark vs Impala

Posted by Toby Douglass <to...@avocet.io>.

On Sun, Jun 22, 2014 at 5:53 PM, Debasish Das <de...@gmail.com>
wrote:

> 600s for Spark vs 5s for Redshift... The numbers look very different from
> the AMPLab benchmark:
>
> https://amplab.cs.berkeley.edu/benchmark/
>
> Is it something like SSDs that's helping Redshift, or is the whole dataset
> in memory when you run the query? Could you publish the query?
>

I think we'll blog it when it's done.  Still working on it.  This was done
with HDD nodes, not SSD.

The query is very simple:

select id, count(*) from data_table group by id;

This is on 52.13 GB of gzipped data, with about 150 distinct IDs.
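
For reference, the aggregation itself is small; the cost lies in decompressing and scanning the input. Below is a minimal Python sketch of what the query computes. It assumes, hypothetically (the post does not describe the file layout), comma-separated records with the id in the first column. `count_by_id` is an illustrative name, not part of Spark or of any other system discussed here.

```python
import csv
import gzip
from collections import Counter


def count_by_id(sources):
    """Equivalent of: select id, count(*) from data_table group by id.

    `sources` is an iterable of gzip file paths or binary file objects.
    Assumes comma-separated rows with the id in column 0 (hypothetical layout).
    """
    counts = Counter()
    for source in sources:
        # A gzip stream has to be decompressed from its start, so every byte is read.
        with gzip.open(source, "rt", newline="") as handle:
            for row in csv.reader(handle):
                if row:  # skip blank lines
                    counts[row[0]] += 1
    return counts
```

With about 150 distinct IDs the result stays tiny however large the input is, so in a setup like this the running time is likely dominated by I/O and decompression rather than by the grouping.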

Re: Shark vs Impala

Posted by Toby Douglass <to...@avocet.io>.

On Mon, Jun 23, 2014 at 8:50 AM, Aaron Davidson <il...@gmail.com> wrote:

> Note that regarding a "long load time", data format means a whole lot in
> terms of query performance. If you load all your data into compressed,
> columnar Parquet files on local hardware, Spark SQL would also perform far,
> far better than it would reading from gzipped S3 files.
>

Yes.  We're comparing our particular use cases; if we used Spark, we'd like
to query gzipped files directly in S3 for the sheer convenience of it.  Having
to pre-process data (the equivalent of the load phase with NewSQL) is a
pain in the neck.  One of the reasons for using post-Hadoop (rather than
NewSQL) systems is to avoid this.

> You must also be careful about your queries; certain queries can be
> answered much more efficiently due to specific optimizations implemented in
> the query engine. For instance, Parquet keeps statistics, so you could
> theoretically do a count(*) over petabytes of data in less than a second,
> blowing away any competition that resorts to actually reading data.
>

Yes.  I posted the query just now.  The Redshift table was only ordered by
timestamp, so in all cases the database should perform a single full table
scan.

Re: Shark vs Impala

Posted by Aaron Davidson <il...@gmail.com>.

Note that regarding a "long load time", data format means a whole lot in
terms of query performance. If you load all your data into compressed,
columnar Parquet files on local hardware, Spark SQL would also perform far,
far better than it would reading from gzipped S3 files. You must also be
careful about your queries; certain queries can be answered much more
efficiently due to specific optimizations implemented in the query engine.
For instance, Parquet keeps statistics, so you could theoretically do a
count(*) over petabytes of data in less than a second, blowing away any
competition that resorts to actually reading data.
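
To illustrate the point about statistics: a Parquet file's footer records the number of rows in each row group, so an unfiltered count(*) can in principle be answered by summing those numbers without touching the data pages. The sketch below only models that idea in plain Python. `RowGroupMeta`, `FileFooter`, `count_star` and `bytes_skipped` are made-up names for illustration, not a real Parquet or Spark API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupMeta:
    num_rows: int    # row count stored in the footer for this row group
    data_bytes: int  # size of the data pages (never read below)


@dataclass
class FileFooter:
    row_groups: List[RowGroupMeta]


def count_star(footers):
    """Answer an unfiltered count(*) from footer metadata alone."""
    return sum(group.num_rows for footer in footers for group in footer.row_groups)


def bytes_skipped(footers):
    """Data-page bytes that a metadata-only count never has to read."""
    return sum(group.data_bytes for footer in footers for group in footer.row_groups)
```

This shortcut applies to a bare count(*). A count grouped by id still has to read the id column, although a columnar format lets the engine skip the other columns.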



On Sun, Jun 22, 2014 at 6:24 PM, Matei Zaharia <ma...@gmail.com>
wrote:

> In this benchmark, the problem wasn’t that Shark could not run without
> enough memory; Shark spills some of the data to disk and can run just fine.
> The issue was that the in-memory form of the RDDs was larger than the
> cluster’s memory, although the raw Parquet / ORC files did fit in memory,
> so Cloudera did not want to run an “RDD” number where some of the RDD is
> not in memory. But the wording “could not complete” is confusing — the
> queries complete just fine.
>
> We do plan to update the AMPLab benchmark with Spark SQL as well, and
> expand it to include more of TPC-DS.
>
> Matei
>
> On Jun 22, 2014, at 9:53 AM, Debasish Das <de...@gmail.com>
> wrote:
>
> 600s for Spark vs 5s for Redshift... The numbers look very different from
> the AMPLab benchmark:
>
> https://amplab.cs.berkeley.edu/benchmark/
>
> Is it something like SSDs that's helping Redshift, or is the whole dataset
> in memory when you run the query? Could you publish the query?
>
> Also, with Spark SQL now available, are we planning to add Spark SQL
> runtimes to the AMPLab benchmark as well?
>
>
>
> On Sun, Jun 22, 2014 at 9:13 AM, Toby Douglass <to...@avocet.io> wrote:
>
>> I've just benchmarked Spark and Impala.  Same data (in S3), same query,
>> same cluster.
>>
>> Impala has a long load time, since it cannot load directly from S3.  I
>> have to create a Hive table on S3, then insert from that into an Impala
>> table.  Spark took about 600s for the query and Impala 250s, but Impala
>> required 6k seconds to load the data from S3.  If you're going to go the
>> long-initial-load-then-quick-queries route, go for Redshift.  On
>> equivalent hardware, that took about 4k seconds to load, but then queries
>> take about 5s each.
>>
>>
>
>

Re: Shark vs Impala

Posted by Matei Zaharia <ma...@gmail.com>.

In this benchmark, the problem wasn’t that Shark could not run without enough memory; Shark spills some of the data to disk and can run just fine. The issue was that the in-memory form of the RDDs was larger than the cluster’s memory, although the raw Parquet / ORC files did fit in memory, so Cloudera did not want to run an “RDD” number where some of the RDD is not in memory. But the wording “could not complete” is confusing — the queries complete just fine.

We do plan to update the AMPLab benchmark with Spark SQL as well, and expand it to include more of TPC-DS.

Matei

On Jun 22, 2014, at 9:53 AM, Debasish Das <de...@gmail.com> wrote:

> 600s for Spark vs 5s for Redshift... The numbers look very different from the AMPLab benchmark:
> 
> https://amplab.cs.berkeley.edu/benchmark/
> 
> Is it something like SSDs that's helping Redshift, or is the whole dataset in memory when you run the query? Could you publish the query?
> 
> Also, with Spark SQL now available, are we planning to add Spark SQL runtimes to the AMPLab benchmark as well?
> 
> 
> 
> On Sun, Jun 22, 2014 at 9:13 AM, Toby Douglass <to...@avocet.io> wrote:
> I've just benchmarked Spark and Impala.  Same data (in S3), same query, same cluster.
> 
> Impala has a long load time, since it cannot load directly from S3.  I have to create a Hive table on S3, then insert from that into an Impala table.  Spark took about 600s for the query and Impala 250s, but Impala required 6k seconds to load the data from S3.  If you're going to go the long-initial-load-then-quick-queries route, go for Redshift.  On equivalent hardware, that took about 4k seconds to load, but then queries take about 5s each.
> 
>

Re: Shark vs Impala

Posted by Debasish Das <de...@gmail.com>.

600s for Spark vs 5s for Redshift... The numbers look very different from
the AMPLab benchmark:

https://amplab.cs.berkeley.edu/benchmark/

Is it something like SSDs that's helping Redshift, or is the whole dataset
in memory when you run the query? Could you publish the query?

Also, with Spark SQL now available, are we planning to add Spark SQL
runtimes to the AMPLab benchmark as well?

On Sun, Jun 22, 2014 at 9:13 AM, Toby Douglass <to...@avocet.io> wrote:

> I've just benchmarked Spark and Impala.  Same data (in S3), same query,
> same cluster.
>
> Impala has a long load time, since it cannot load directly from S3.  I
> have to create a Hive table on S3, then insert from that into an Impala
> table.  Spark took about 600s for the query and Impala 250s, but Impala
> required 6k seconds to load the data from S3.  If you're going to go the
> long-initial-load-then-quick-queries route, go for Redshift.  On
> equivalent hardware, that took about 4k seconds to load, but then queries
> take about 5s each.
>
>

Re: Shark vs Impala

Posted by Toby Douglass <to...@avocet.io>.

I've just benchmarked Spark and Impala.  Same data (in S3), same query,
same cluster.

Impala has a long load time, since it cannot load directly from S3.  I have
to create a Hive table on S3, then insert from that into an Impala table.
Spark took about 600s for the query and Impala 250s, but Impala required
6k seconds to load the data from S3.  If you're going to go the
long-initial-load-then-quick-queries route, go for Redshift.  On
equivalent hardware, that took about 4k seconds to load, but then queries
take about 5s each.
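
As a rough worked example of the trade-off, using only the approximate figures above (600s per query for Spark with no load step, 6k seconds of load plus 250s per query for Impala, 4k seconds of load plus 5s per query for Redshift), the following Python sketch computes how many queries it takes before a load phase pays for itself. It assumes every query costs the same as the one measured and that nothing is cached between runs, which is a simplification.

```python
# (load_seconds, seconds_per_query): approximate figures from the benchmark above
SYSTEMS = {
    "spark": (0, 600),
    "impala": (6000, 250),
    "redshift": (4000, 5),
}


def total_seconds(system, n_queries):
    """Load time plus n_queries runs of the measured query."""
    load, per_query = SYSTEMS[system]
    return load + per_query * n_queries


def break_even(loaded, unloaded, limit=10_000):
    """Smallest query count at which `loaded` is cheaper overall than `unloaded`.

    Returns None if that never happens within `limit` queries.
    """
    for n in range(limit + 1):
        if total_seconds(loaded, n) < total_seconds(unloaded, n):
            return n
    return None


if __name__ == "__main__":
    print(break_even("impala", "spark"))    # 18
    print(break_even("redshift", "spark"))  # 7
```

Under these assumptions Impala becomes cheaper overall than Spark from the 18th query onward, and Redshift from the 7th.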

Re: Shark vs Impala

Posted by Bertrand Dechoux <de...@gmail.com>.

For the second question, I would say it is mainly because the projects do
not have the same aim. Impala does have a "cost-based optimizer and predicate
propagation capability", which is natural because it interprets
pseudo-SQL queries. In the realm of relational databases, it is often not a
good idea to compete against the optimizer; this is of course also true for
'BigData'.

Bertrand

On Sun, Jun 22, 2014 at 1:32 PM, Flavio Pompermaier <po...@okkam.it>
wrote:

> Hi folks,
> I was looking at the benchmark provided by Cloudera at
> http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/
> .
> Is it true that Shark cannot execute some queries if you don't have enough
> memory?
> And is it true/reliable that Impala outperforms Spark by so much when
> executing complex queries?
>
> Best,
> Flavio
>