Posted to user@spark.apache.org by Tin Vu <tv...@ucr.edu> on 2018/03/29 00:03:59 UTC

[SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

Hi,

I am executing a benchmark to compare performance of SparkSQL, Apache Drill
and Presto. My experimental setup:

   - TPCDS dataset with scale factor 100 (size 100GB).
   - Spark, Drill, and Presto have the same number of workers: 12.
   - Each worker has the same amount of allocated memory: 4GB.
   - Data is stored in Hive using the ORC format.

I executed a very simple SQL query: "SELECT * FROM table_name".
The issue is that for some small tables (even tables with only a few dozen
records), SparkSQL still required about 7-8 seconds to finish, while Drill
and Presto needed less than 1 second.
For other large tables with billions of records, SparkSQL performance was
reasonable, requiring 20-30 seconds to scan the whole table.
Do you have any idea or a reasonable explanation for this issue?

Thanks,

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

Posted by Tin Vu <tv...@ucr.edu>.
Hi Gaurav,

Thank you for your response. Here are the answers to your questions:
1. Spark 2.3.0
2. I was using the 'spark-sql' command, for example: 'spark-sql --master
spark:/*:7077 --database tpcds_bin_partitioned_orc_100 -f $file_name', where
file_name is the file that contains the SQL script ("select * from table_name").
3. Hadoop 2.9.0

I am using the JDBC connector to Drill from the Hive Metastore. SparkSQL is
also connecting to the ORC database via Hive.

Thanks so much!

Tin

On Sat, Mar 31, 2018 at 11:41 AM, Gourav Sengupta <gourav.sengupta@gmail.com
> wrote:


Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Tin,

This sounds interesting. While I would prefer to think that Presto and
Drill have

Can you please provide the following details:
1. SPARK version
2. The exact code used in SPARK (the full code that was used)
3. HADOOP version

I do think that SPARK and DRILL have complementary and different use
cases. Have you tried using the JDBC connector to Drill from within SPARKSQL?
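For reference, mounting a Drill table inside SparkSQL can be sketched as a JDBC-backed temporary view. The ZooKeeper endpoint and schema path below are illustrative assumptions, not details from this thread, and the Drill JDBC driver jar would need to be on the Spark classpath:

```sql
-- hypothetical ZooKeeper endpoint and schema path
CREATE TEMPORARY VIEW store_sales_via_drill
USING jdbc
OPTIONS (
  url 'jdbc:drill:zk=zkhost:2181',
  driver 'org.apache.drill.jdbc.Driver',
  dbtable 'hive.tpcds_bin_partitioned_orc_100.store_sales'
);

SELECT COUNT(*) FROM store_sales_via_drill;
```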

Regards,
Gourav Sengupta



Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

Posted by Tin Vu <tv...@ucr.edu>.
Thanks for your response.  What do you mean when you said "immediately
return"?

On Wed, Mar 28, 2018, 10:33 PM Jörn Franke <jo...@gmail.com> wrote:


Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

Posted by Jörn Franke <jo...@gmail.com>.
I don’t think SELECT * is a good benchmark. You should run a more complex operation; otherwise, the optimizer may see that the query does no real work and return immediately (similarly, a count may return immediately by using table statistics).
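For example, an aggregation of roughly this shape (the column names are standard TPC-DS store_sales columns; the query itself is only an illustration) forces per-row work, a shuffle, and a sort rather than a bare scan:

```sql
-- illustrative benchmark query: aggregate instead of SELECT *
SELECT ss_store_sk,
       COUNT(*)         AS sales,
       SUM(ss_net_paid) AS revenue
FROM store_sales
WHERE ss_quantity > 1
GROUP BY ss_store_sk
ORDER BY revenue DESC;
```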


Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

Posted by "Lalwani, Jayesh" <Ja...@capitalone.com>.
It depends on how you have loaded the data. Ideally, if you have dozens of records, your input data should have them in one partition. If the input has one partition and the data is small enough, Spark will keep it in one partition (as far as possible).

If you cannot control your data, you need to repartition it when you load it. This will (eventually) cause a shuffle, and all the data will be moved into the number of partitions that you specify. Subsequent operations will run on the repartitioned dataframe and should use that number of tasks. A shuffle has costs associated with it. You will need to make a call on whether you want to take the upfront cost of a shuffle, or you want to live with a large number of tasks.
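From the spark-sql shell used earlier in the thread, one hedged way to materialize a small table into a single partition is a CTAS with DISTRIBUTE BY, which forces the shuffle described above (table and column names here are placeholders, not from the thread):

```sql
-- shrink the shuffle so the result lands in one partition
SET spark.sql.shuffle.partitions=1;

-- placeholder table/column names; DISTRIBUTE BY triggers the shuffle
CREATE TABLE small_table_compact
USING ORC
AS SELECT * FROM small_table DISTRIBUTE BY some_key;
```

Subsequent scans of small_table_compact then read one partition instead of fanning out across many tiny splits.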

From: Tin Vu <tv...@ucr.edu>
Date: Thursday, March 29, 2018 at 10:47 AM
To: "Lalwani, Jayesh" <Ja...@capitalone.com>
Cc: "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

You are right. Too many tasks were created. How can we reduce the number of tasks?

________________________________________________________

The information contained in this e-mail is confidential and/or proprietary to Capital One and/or its affiliates and may only be used solely in performance of work or services for Capital One. The information transmitted herewith is intended only for use by the individual or entity to which it is addressed. If the reader of this message is not the intended recipient, you are hereby notified that any review, retransmission, dissemination, distribution, copying or other use of, or taking of any action in reliance upon this information is strictly prohibited. If you have received this communication in error, please contact the sender and delete the material from your computer.

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

Posted by Tin Vu <tv...@ucr.edu>.
You are right. Too many tasks were created. How can we reduce the number of tasks?

On Thu, Mar 29, 2018, 7:44 AM Lalwani, Jayesh <Ja...@capitalone.com>
wrote:


Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

Posted by "Lalwani, Jayesh" <Ja...@capitalone.com>.
Without knowing too many details, I can only guess. It could be that Spark is creating a lot of tasks even though there are few records. Creating and distributing tasks has a noticeable overhead on smaller datasets.

You might want to look at the driver logs, or the Spark application detail UI.
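The number of scan tasks Spark creates is driven by file sizes and the spark.sql.files.* settings rather than row counts, which is why a tiny table spread across many small ORC files can still fan out into many tasks. A rough pure-Python sketch of that sizing heuristic follows; the constants mirror Spark's documented defaults, but the assumed parallelism value and the packing loop are simplifications, not Spark source:

```python
# Rough sketch (not Spark source) of how Spark 2.x sizes file-scan partitions.
# Defaults mirror spark.sql.files.maxPartitionBytes (128 MB) and
# spark.sql.files.openCostInBytes (4 MB); default_parallelism=24 is an
# assumed total core count, not a value from this thread.

def max_split_bytes(file_sizes, max_partition_bytes=128 * 1024 * 1024,
                    open_cost=4 * 1024 * 1024, default_parallelism=24):
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total / default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

def estimate_scan_partitions(file_sizes, max_partition_bytes=128 * 1024 * 1024,
                             open_cost=4 * 1024 * 1024, default_parallelism=24):
    target = max_split_bytes(file_sizes, max_partition_bytes,
                             open_cost, default_parallelism)
    # cut each file into splits no larger than the target size
    splits = []
    for size in file_sizes:
        while size > target:
            splits.append(target)
            size -= target
        if size > 0:
            splits.append(size)
    # greedily pack splits into partitions; each opened file adds open_cost
    splits.sort(reverse=True)
    partitions, current = 0, 0.0
    for split in splits:
        if current > 0 and current + split > target:
            partitions += 1
            current = 0.0
        current += split + open_cost
    return partitions + (1 if current > 0 else 0)

# a single small file stays in one scan task...
print(estimate_scan_partitions([2_000_000]))
# ...but the same few records spread over 200 tiny files fan out into dozens
print(estimate_scan_partitions([10_000] * 200))
```

Under this model, reducing the task count means either consolidating the underlying files (fewer, larger ORC files) or raising spark.sql.files.openCostInBytes so Spark packs more small files into each partition.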

From: Tin Vu <tv...@ucr.edu>
Date: Wednesday, March 28, 2018 at 8:04 PM
To: "user@spark.apache.org" <us...@spark.apache.org>
Subject: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

