Posted to dev@spark.apache.org by Maciej Bryński <ma...@brynski.pl> on 2016/06/29 20:48:54 UTC

Spark 2.0 Performance drop

Hi,
Has anyone measured the performance of Spark 2.0 vs Spark 1.6?

I ran some tests on a parquet file with many nested columns (about 30 GB
in 400 partitions), and Spark 2.0 is sometimes 2x slower.

I tested the following queries:
1) select count(*) where id > some_id
In this query we get predicate pushdown (PPD) and performance is similar (about 1 sec).

2) select count(*) where nested_column.id > some_id
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min
Is it normal that neither version did PPD?

3) PySpark:
df.where('id > some_id').rdd \
    .flatMap(lambda r: [r.id] if not r.id % 100000 else []).collect()
Spark 1.6 -> 2.3 min
Spark 2.0 -> 4.6 min (2x slower)

I used BasicProfiler for this task and the cumulative time was:
Spark 1.6 - 4300 sec
Spark 2.0 - 5800 sec
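
(Roughly how I enable the profiler, in case it matters -- a sketch with the
context setup simplified:)

# spark.python.profile must be set, otherwise the profiler is a no-op
from pyspark import SparkConf, SparkContext
from pyspark.profiler import BasicProfiler

conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(conf=conf, profiler_cls=BasicProfiler)  # BasicProfiler is the default
# ... run the job ...
sc.show_profiles()  # prints cumulative time per RDD from the Python workers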

Should I expect such a drop in performance?

BTW: why did DataFrame lose its map and flatMap methods in Spark 2.0?
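
(In 1.6 I could call them directly on the DataFrame; the only way I see in 2.0
is to go through .rdd explicitly -- a sketch with a made-up transformation:)

# Spark 1.6:
#   df.map(lambda r: r.id).take(5)
# Spark 2.0 -- DataFrame.map/flatMap are gone from the Python API, so:
df.rdd.map(lambda r: r.id).take(5)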

I don't know how to prepare sample data that reproduces the problem.
Any ideas? Or is there public data with many nested columns?
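
Generating nested data itself is easy enough -- a rough sketch of what I have in
mind (made-up schema, and I'm not sure it reproduces the slowdown):

from pyspark.sql import Row

def make_row(i):
    # a flat id plus a nested struct, similar in shape to my real data
    return Row(id=i,
               nested_column=Row(id=i, name="name_%d" % i, value=i * 0.1))

# row count and nesting depth would need tuning to get to ~30 GB in 400 partitions
rdd = sc.range(0, 10 ** 7, numSlices=400).map(make_row)
df = sqlContext.createDataFrame(rdd)
df.write.parquet("/tmp/nested_sample")  # hypothetical path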

I'd like to create a JIRA for it, but the Apache server is down at the moment.

Regards,
-- 
Maciek Bryński


Re: Spark 2.0 Performance drop

Posted by Maciej Bryński <ma...@brynski.pl>.
I filed 2 JIRAs:
1) Performance when querying nested columns
https://issues.apache.org/jira/browse/SPARK-16320

2) PySpark performance
https://issues.apache.org/jira/browse/SPARK-16321

I found existing JIRAs for:
1) PPD on nested columns
https://issues.apache.org/jira/browse/SPARK-5151

2) Dropped support for df.map etc. in PySpark
https://issues.apache.org/jira/browse/SPARK-13594

2016-06-30 0:47 GMT+02:00 Michael Allman <mi...@videoamp.com>:
> The patch we use in production is for 1.5. We're porting the patch to master (and downstream to 2.0, which is presently very similar) with the intention of submitting a PR "soon". We'll push it here when it's ready: https://github.com/VideoAmp/spark-public.
>
> Regarding benchmarking, we have a suite of Spark SQL regression tests which we run to check correctness and performance. I can share our findings when I have them.
>
> Cheers,
>
> Michael
>
>> On Jun 29, 2016, at 2:39 PM, Maciej Bryński <ma...@brynski.pl> wrote:
>>
>> 2016-06-29 23:22 GMT+02:00 Michael Allman <mi...@videoamp.com>:
>>> I'm sorry I don't have any concrete advice for you, but I hope this helps shed some light on the current support in Spark for projection pushdown.
>>>
>>> Michael
>>
>> Michael,
>> Thanks for the answer. This resolves one of my questions.
>> Which Spark version have you patched? 1.6? Are you planning to
>> publish this patch, or is it just for the 2.0 branch?
>>
>> I'd gladly help with some benchmarks in my environment.
>>
>> Regards,
>> --
>> Maciek Bryński
>



-- 
Maciek Bryński


Re: Spark 2.0 Performance drop

Posted by Michael Allman <mi...@videoamp.com>.
The patch we use in production is for 1.5. We're porting the patch to master (and downstream to 2.0, which is presently very similar) with the intention of submitting a PR "soon". We'll push it here when it's ready: https://github.com/VideoAmp/spark-public.

Regarding benchmarking, we have a suite of Spark SQL regression tests which we run to check correctness and performance. I can share our findings when I have them.

Cheers,

Michael

> On Jun 29, 2016, at 2:39 PM, Maciej Bryński <ma...@brynski.pl> wrote:
> 
> 2016-06-29 23:22 GMT+02:00 Michael Allman <mi...@videoamp.com>:
>> I'm sorry I don't have any concrete advice for you, but I hope this helps shed some light on the current support in Spark for projection pushdown.
>> 
>> Michael
> 
> Michael,
> Thanks for the answer. This resolves one of my questions.
> Which Spark version have you patched? 1.6? Are you planning to
> publish this patch, or is it just for the 2.0 branch?
> 
> I'd gladly help with some benchmarks in my environment.
> 
> Regards,
> -- 
> Maciek Bryński


Re: Spark 2.0 Performance drop

Posted by Maciej Bryński <ma...@brynski.pl>.
2016-06-29 23:22 GMT+02:00 Michael Allman <mi...@videoamp.com>:
> I'm sorry I don't have any concrete advice for you, but I hope this helps shed some light on the current support in Spark for projection pushdown.
>
> Michael

Michael,
Thanks for the answer. This resolves one of my questions.
Which Spark version have you patched? 1.6? Are you planning to
publish this patch, or is it just for the 2.0 branch?

I'd gladly help with some benchmarks in my environment.

Regards,
-- 
Maciek Bryński


Re: Spark 2.0 Performance drop

Posted by Michael Allman <mi...@videoamp.com>.
Hi Maciej,

In Spark, projection pushdown is currently limited to top-level columns (StructFields). VideoAmp has very large parquet-based tables (many billions of records accumulated per day) with deeply nested schema (four or five levels), and we've spent a considerable amount of time optimizing query performance on these tables.
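
For example (hypothetical path and column names, Python just for illustration),
a query that touches a single nested field still reads the whole struct:

# nested_column may have dozens of fields, but this query needs only one of them
df = sqlContext.read.parquet("/data/events")
df.select("nested_column.id").explain()
# The requested schema in the parquet scan still contains the entire
# nested_column struct, so every leaf under it is read and decoded.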

We have a patch internally that extends Spark to support projection pushdown for arbitrarily nested fields. This has resulted in a *huge* performance improvement for many of our queries, like 10x to 100x in some cases.

I'm still putting the finishing touches on our port of this patch to Spark master and 2.0. We haven't done any specific benchmarking between versions, but I will do that when our patch is complete. We hope to contribute this functionality to the Spark project at some point in the near future, but it is not ready yet.

I'm sorry I don't have any concrete advice for you, but I hope this helps shed some light on the current support in Spark for projection pushdown.

Michael

> On Jun 29, 2016, at 1:48 PM, Maciej Bryński <ma...@brynski.pl> wrote:
> 
> Hi,
> Has anyone measured the performance of Spark 2.0 vs Spark 1.6?
> 
> I ran some tests on a parquet file with many nested columns (about 30 GB
> in 400 partitions), and Spark 2.0 is sometimes 2x slower.
> 
> I tested the following queries:
> 1) select count(*) where id > some_id
> In this query we get predicate pushdown (PPD) and performance is similar (about 1 sec).
> 
> 2) select count(*) where nested_column.id > some_id
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Is it normal that neither version did PPD?
> 
> 3) PySpark:
> df.where('id > some_id').rdd \
>     .flatMap(lambda r: [r.id] if not r.id % 100000 else []).collect()
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> 
> I used BasicProfiler for this task and the cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> 
> Should I expect such a drop in performance?
> 
> BTW: why did DataFrame lose its map and flatMap methods in Spark 2.0?
> 
> I don't know how to prepare sample data that reproduces the problem.
> Any ideas? Or is there public data with many nested columns?
> 
> I'd like to create a JIRA for it, but the Apache server is down at the moment.
> 
> Regards,
> -- 
> Maciek Bryński
> 