Posted to user@spark.apache.org by Soumya Simanta <so...@gmail.com> on 2014/11/01 00:04:56 UTC

SparkSQL performance

I was really surprised to see the results here, esp. SparkSQL "not
completing"
http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style

I was under the impression that SparkSQL performs really well because it
can optimize the RDD operations and load only the columns that are
required. This essentially means in most cases SparkSQL should be as fast
as Spark is.
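
As a rough illustration of the "load only the columns that are required"
idea, here is a toy sketch in plain Python. It is not Spark code and makes
no claim about how SparkSQL is implemented; the sample data and the helper
names (to_columnar, scan_rows, scan_columns) are made up for this example.
It only counts how many values a scan has to touch in a row layout versus
a columnar layout.

```python
# Toy illustration of column pruning. Hypothetical data and helpers,
# not Spark APIs.
rows = [
    {"user": "a", "country": "DE", "amount": 10},
    {"user": "b", "country": "US", "amount": 25},
    {"user": "c", "country": "DE", "amount": 5},
]


def to_columnar(rows):
    """Pivot row records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def scan_rows(rows, needed):
    """Row-oriented scan: every field of every row is read."""
    values_read = sum(len(row) for row in rows)
    projected = [{name: row[name] for name in needed} for row in rows]
    return projected, values_read


def scan_columns(columns, needed):
    """Columnar scan with pruning: only the requested columns are read."""
    values_read = sum(len(columns[name]) for name in needed)
    n = len(columns[needed[0]]) if needed else 0
    projected = [{name: columns[name][i] for name in needed} for i in range(n)]
    return projected, values_read


columns = to_columnar(rows)
row_result, row_cost = scan_rows(rows, ["amount"])
col_result, col_cost = scan_columns(columns, ["amount"])
print(row_cost, col_cost)
```

Both scans return the same projection, but the columnar one touches only
the "amount" values. Whether that saving shows up in a real benchmark
depends on the storage format and on how the engine is configured.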

I would be very interested to hear what others in the group have to say
about this.

Thanks
-Soumya

Re: SparkSQL performance

Posted by Soumya Simanta <so...@gmail.com>.

I agree. My personal experience with Spark core is that it performs really
well once you tune it properly.

As far as I understand, SparkSQL under the hood performs many of these
optimizations (e.g. the order of Spark operations) and uses a more
efficient storage format. Is this assumption correct?

Has anyone compared SparkSQL with Impala? The fact that many
of the queries don't even finish in the benchmark is quite surprising and
hard to believe.

A few months ago there were a few emails about Spark not being able to
handle large volumes (TBs) of data. That myth was busted recently when the
folks at Databricks published their sorting record results.

Thanks
-Soumya

On Fri, Oct 31, 2014 at 7:35 PM, Du Li <li...@yahoo-inc.com> wrote:

>   We have seen all kinds of results published that often contradict each
> other. My take is that the authors often know more tricks about how to tune
> their own/familiar products than the others. So the product on focus is
> tuned for ideal performance while the competitors are not. The authors are
> not necessarily biased but as a consequence the results are.
>
>  Ideally it’s critical for the user community to be informed of all the
> in-depth tuning tricks of all products. However, realistically, there is a
> big gap in terms of documentation. Hope the Spark folks will make a
> difference. :-)
>
>  Du
>
>
>   From: Soumya Simanta <so...@gmail.com>
> Date: Friday, October 31, 2014 at 4:04 PM
> To: "user@spark.apache.org" <us...@spark.apache.org>
> Subject: SparkSQL performance
>
>   I was really surprised to see the results here, esp. SparkSQL "not
> completing"
> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>
>  I was under the impression that SparkSQL performs really well because it
> can optimize the RDD operations and load only the columns that are
> required. This essentially means in most cases SparkSQL should be as fast
> as Spark is.
>
>  I would be very interested to hear what others in the group have to say
> about this.
>
>  Thanks
> -Soumya
>
>
>

Re: SparkSQL performance

Posted by Marius Soutier <mp...@gmail.com>.

I did some simple experiments with Impala and Spark, and Impala came out ahead. But it was also less flexible: it couldn’t handle irregular schemas, didn't support JSON, and so on.

On 01.11.2014, at 02:20, Soumya Simanta <so...@gmail.com> wrote:

> I agree. My personal experience with Spark core is that it performs really well once you tune it properly. 
> 
> As far as I understand, SparkSQL under the hood performs many of these optimizations (e.g. the order of Spark operations) and uses a more efficient storage format. Is this assumption correct?
> 
> Has anyone compared SparkSQL with Impala? The fact that many of the queries don't even finish in the benchmark is quite surprising and hard to believe.
> 
> A few months ago there were a few emails about Spark not being able to handle large volumes (TBs) of data. That myth was busted recently when the folks at Databricks published their sorting record results. 
>  
> 
> Thanks
> -Soumya
> 
> On Fri, Oct 31, 2014 at 7:35 PM, Du Li <li...@yahoo-inc.com> wrote:
> We have seen all kinds of results published that often contradict each other. My take is that the authors often know more tricks about how to tune their own/familiar products than the others. So the product on focus is tuned for ideal performance while the competitors are not. The authors are not necessarily biased but as a consequence the results are.
> 
> Ideally it’s critical for the user community to be informed of all the in-depth tuning tricks of all products. However, realistically, there is a big gap in terms of documentation. Hope the Spark folks will make a difference. :-)
> 
> Du
> 
> 
> From: Soumya Simanta <so...@gmail.com>
> Date: Friday, October 31, 2014 at 4:04 PM
> To: "user@spark.apache.org" <us...@spark.apache.org>
> Subject: SparkSQL performance
> 
> I was really surprised to see the results here, esp. SparkSQL "not completing"
> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
> 
> I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only the columns that are required. This essentially means in most cases SparkSQL should be as fast as Spark is. 
> 
> I would be very interested to hear what others in the group have to say about this. 
> 
> Thanks
> -Soumya
> 
> 
>

Re: SparkSQL performance

Posted by Du Li <li...@yahoo-inc.com.INVALID>.

We have seen all kinds of published results that often contradict each other. My take is that authors often know more tricks for tuning their own or familiar products than for tuning the others. So the product in focus is tuned for ideal performance while the competitors are not. The authors are not necessarily biased, but as a consequence the results are.

Ideally, the user community would be informed of all the in-depth tuning tricks for all products. Realistically, however, there is a big gap in documentation. Hope the Spark folks will make a difference. :-)

Du

From: Soumya Simanta <so...@gmail.com>
Date: Friday, October 31, 2014 at 4:04 PM
To: "user@spark.apache.org" <us...@spark.apache.org>
Subject: SparkSQL performance

I was really surprised to see the results here, esp. SparkSQL "not completing"
http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style

I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only the columns that are required. This essentially means in most cases SparkSQL should be as fast as Spark is.

I would be very interested to hear what others in the group have to say about this.

Thanks
-Soumya