Posted to user@spark.apache.org by Jeff Thompson <je...@gmail.com> on 2015/10/03 20:08:31 UTC

performance difference between Thrift server and SparkSQL?

Hi,

I'm running a simple SQL query over a ~700 million row table of the form:

SELECT * FROM my_table WHERE id = '12345';

When I submit the query via beeline & the JDBC thrift server it returns in
35s
When I submit the exact same query using sparkSQL from a pyspark shell
(sqlContext.sql("SELECT * FROM ....")) it returns in 3s.

Both times are obtained from the spark web UI.  The query only returns 43
rows, a small amount of data.

The table was created by saving a sparkSQL dataframe as a parquet file and
then calling createExternalTable.
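
In case it helps, here is roughly what I did (a sketch only; the path and the
dataframe name are placeholders, not my actual ones):

    # pyspark shell, Spark 1.4.1 (sqlContext is a HiveContext)
    # save the source dataframe as parquet, then register it as an external table
    df.write.parquet("hdfs:///warehouse/my_table")
    sqlContext.createExternalTable("my_table", "hdfs:///warehouse/my_table")

    # the same statement I submit through beeline, issued from pyspark
    result = sqlContext.sql("SELECT * FROM my_table WHERE id = '12345'")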

I have tried to ensure that all relevant cluster parameters are equivalent
across the two queries:
spark.executor.memory = 6g
spark.executor.instances = 100
no explicit caching (storage tab in web UI is empty)
spark version: 1.4.1
Hadoop v2.5.0-cdh5.3.0, running spark on top of YARN
jobs run on the same physical cluster (on-site hardware)

From the web UIs, I can see that the query plans are clearly different, and
I think this may be the source of the performance difference.

Thrift server job:
1 stage only, stage 1 (35s) map -> Filter -> mapPartitions

SparkSQL job:
2 stages, stage 1 (2s): map -> filter -> Project -> Aggregate -> Exchange,
stage 2 (0.4s): Exchange -> Aggregate -> mapPartitions

Is this a known issue?  Is there anything I can do to get the Thrift server
to use the same query optimizer as the one used by sparkSQL?  I'd love to
pick up a ~10x performance gain for my jobs submitted via the Thrift server.

Best regards,

Jeff

Re: performance difference between Thrift server and SparkSQL?

Posted by Jeff Thompson <je...@gmail.com>.
Thanks for the suggestion.  The output from EXPLAIN is indeed equivalent in
both sparkSQL and via the Thrift server.  I did some more testing.  The
source of the performance difference is in the way I was triggering the
sparkSQL query.  I was using .count() instead of .collect().  When I use
.collect() I get the same performance as the Thrift server.  My table has
28 columns.  I guess that .count() only required one column to be loaded
into memory, whereas .collect() required all columns to be loaded?
Curiously, it doesn't appear to matter how many rows are returned.  The
speed is the same even if I adjust the query to return 0 rows.  Anyway,
looks like it was a poor comparison on my part.  No real performance
difference between Thrift and SparkSQL.  Thanks for the help.
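
For the record, the only difference between the two runs was the action that
triggered the query (same placeholder table name as in my first mail):

    # ~3s: count() presumably only needs to read a single column
    sqlContext.sql("SELECT * FROM my_table WHERE id = '12345'").count()

    # ~35s, matching the Thrift server: collect() materializes all 28 columns
    sqlContext.sql("SELECT * FROM my_table WHERE id = '12345'").collect()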

-Jeff

On Sat, Oct 3, 2015 at 1:26 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> Underneath the covers, the thrift server is just calling
> <https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L224>
> hiveContext.sql(...) so this is surprising.  Maybe running EXPLAIN or
> EXPLAIN EXTENDED in both modes would be helpful in debugging?
>
>
>
> On Sat, Oct 3, 2015 at 1:08 PM, Jeff Thompson <
> jeffreykeatingthompson@gmail.com> wrote:
>
>> Hi,
>>
>> I'm running a simple SQL query over a ~700 million row table of the form:
>>
>> SELECT * FROM my_table WHERE id = '12345';
>>
>> When I submit the query via beeline & the JDBC thrift server it returns
>> in 35s
>> When I submit the exact same query using sparkSQL from a pyspark shell
>> (sqlContext.sql("SELECT * FROM ....")) it returns in 3s.
>>
>> Both times are obtained from the spark web UI.  The query only returns 43
>> rows, a small amount of data.
>>
>> The table was created by saving a sparkSQL dataframe as a parquet file
>> and then calling createExternalTable.
>>
>> I have tried to ensure that all relevant cluster parameters are
>> equivalent across the two queries:
>> spark.executor.memory = 6g
>> spark.executor.instances = 100
>> no explicit caching (storage tab in web UI is empty)
>> spark version: 1.4.1
>> Hadoop v2.5.0-cdh5.3.0, running spark on top of YARN
>> jobs run on the same physical cluster (on-site hardware)
>>
>> From the web UIs, I can see that the query plans are clearly different,
>> and I think this may be the source of the performance difference.
>>
>> Thrift server job:
>> 1 stage only, stage 1 (35s) map -> Filter -> mapPartitions
>>
>> SparkSQL job:
>> 2 stages, stage 1 (2s): map -> filter -> Project -> Aggregate ->
>> Exchange, stage 2 (0.4s): Exchange -> Aggregate -> mapPartitions
>>
>> Is this a known issue?  Is there anything I can do to get the Thrift
>> server to use the same query optimizer as the one used by sparkSQL?  I'd
>> love to pick up a ~10x performance gain for my jobs submitted via the
>> Thrift server.
>>
>> Best regards,
>>
>> Jeff
>>
>
>

Re: performance difference between Thrift server and SparkSQL?

Posted by Michael Armbrust <mi...@databricks.com>.
Underneath the covers, the thrift server is just calling
<https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L224>
hiveContext.sql(...) so this is surprising.  Maybe running EXPLAIN or
EXPLAIN EXTENDED in both modes would be helpful in debugging?
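
Something along these lines in each mode should show whether the plans
actually differ (a sketch, using the table name from your mail):

    -- in beeline, against the Thrift server
    EXPLAIN EXTENDED SELECT * FROM my_table WHERE id = '12345';

    # in the pyspark shell
    for row in sqlContext.sql(
            "EXPLAIN EXTENDED SELECT * FROM my_table WHERE id = '12345'").collect():
        print(row[0])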



On Sat, Oct 3, 2015 at 1:08 PM, Jeff Thompson <
jeffreykeatingthompson@gmail.com> wrote:

> Hi,
>
> I'm running a simple SQL query over a ~700 million row table of the form:
>
> SELECT * FROM my_table WHERE id = '12345';
>
> When I submit the query via beeline & the JDBC thrift server it returns in
> 35s
> When I submit the exact same query using sparkSQL from a pyspark shell
> (sqlContext.sql("SELECT * FROM ....")) it returns in 3s.
>
> Both times are obtained from the spark web UI.  The query only returns 43
> rows, a small amount of data.
>
> The table was created by saving a sparkSQL dataframe as a parquet file and
> then calling createExternalTable.
>
> I have tried to ensure that all relevant cluster parameters are equivalent
> across the two queries:
> spark.executor.memory = 6g
> spark.executor.instances = 100
> no explicit caching (storage tab in web UI is empty)
> spark version: 1.4.1
> Hadoop v2.5.0-cdh5.3.0, running spark on top of YARN
> jobs run on the same physical cluster (on-site hardware)
>
> From the web UIs, I can see that the query plans are clearly different,
> and I think this may be the source of the performance difference.
>
> Thrift server job:
> 1 stage only, stage 1 (35s) map -> Filter -> mapPartitions
>
> SparkSQL job:
> 2 stages, stage 1 (2s): map -> filter -> Project -> Aggregate -> Exchange,
> stage 2 (0.4s): Exchange -> Aggregate -> mapPartitions
>
> Is this a known issue?  Is there anything I can do to get the Thrift server
> to use the same query optimizer as the one used by sparkSQL?  I'd love to
> pick up a ~10x performance gain for my jobs submitted via the Thrift server.
>
> Best regards,
>
> Jeff
>