Posted to user@spark.apache.org by Pallavi Singh <pa...@persistent.com> on 2018/04/26 15:49:09 UTC

Spark Optimization

Hi Team,

We are currently working on a POC based on Spark and Scala.
We have to read 18 million records from a Parquet file and perform 25 user-defined aggregations based on grouping keys.
We used the Spark high-level DataFrame API for the aggregation. On a two-node cluster we can finish the end-to-end job (Read + Aggregation + Write) in 2 minutes.
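
For reference, the shape of the job is roughly as in the sketch below; the column names (key1, key2, amount), the paths, and the aggregates shown are simplified placeholders for our actual schema and the 25 UDAFs:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("AggregationPOC").getOrCreate()

// Read the ~18 million input records from Parquet.
val input = spark.read.parquet("/data/input/records.parquet")

// Group by the grouping keys and apply the aggregations.
// A few built-in aggregates stand in here for the 25 UDAFs.
val result = input
  .groupBy("key1", "key2")
  .agg(
    sum("amount").as("total_amount"),
    avg("amount").as("avg_amount"),
    count("amount").as("record_count")
  )

// Write the aggregated result back out as Parquet.
result.write.mode("overwrite").parquet("/data/output/aggregates.parquet")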

Cluster Information:
Number of Nodes: 2
Total Cores: 28
Total RAM: 128 GB

Component:
Spark Core

Scenario:
How-to

Tuning Parameter:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.default.parallelism 24
spark.sql.shuffle.partitions 24
spark.executor.extraJavaOptions -XX:+UseG1GC
spark.speculation true
spark.executor.memory 16G
spark.driver.memory 8G
spark.sql.codegen true
spark.sql.inMemoryColumnarStorage.batchSize 100000
spark.locality.wait 1s
spark.ui.showConsoleProgress false
spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
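
For anyone who wants to see them in one place, most of these settings can also be applied programmatically when building the session; a minimal sketch with a subset of the values above (driver and executor memory are better passed via spark-submit or spark-defaults.conf, since the driver JVM is already running when this code executes):

import org.apache.spark.sql.SparkSession

// The same tuning values as listed above, set on the session builder.
val spark = SparkSession.builder()
  .appName("AggregationPOC")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.default.parallelism", "24")
  .config("spark.sql.shuffle.partitions", "24")
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
  .config("spark.speculation", "true")
  .config("spark.sql.inMemoryColumnarStorage.batchSize", "100000")
  .config("spark.locality.wait", "1s")
  .config("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
  .getOrCreate()
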
Please let us know if you have any ideas or tuning parameters we could use to finish the job in less than one minute.


Regards,
Pallavi

RE: Spark Optimization

Posted by Pallavi Singh <pa...@persistent.com>.
Thanks for your reply.

It is 64GB per node. We will try using UseParallelGC.
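
Concretely, that would just mean swapping the G1 flag in our settings for the parallel collector, e.g.:

spark.executor.extraJavaOptions -XX:+UseParallelGC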

From: CPC [mailto:achalil@gmail.com]
Sent: Thursday, April 26, 2018 11:44 PM
To: vincent gromakowski <vi...@gmail.com>
Cc: Pallavi Singh <pa...@persistent.com>; user <us...@spark.apache.org>
Subject: Re: Spark Optimization

I would recommend UseParallelGC since this is a batch job. Parallelism should be 2-3x the number of cores. Also, if those are physical machines, I would recommend a network MTU of 9000. Is it 128 GB per node or 64 GB per node?

Re: Spark Optimization

Posted by CPC <ac...@gmail.com>.
I would recommend UseParallelGC since this is a batch job. Parallelism
should be 2-3x the number of cores. Also, if those are physical machines,
I would recommend a network MTU of 9000. Is it 128 GB per node or 64 GB per node?
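
With the 28 cores mentioned in the original mail, 2-3x works out to roughly 56-84 partitions, so something like the following instead of 24 (64 is just one value in that range):

spark.default.parallelism 64
spark.sql.shuffle.partitions 64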


Re: Spark Optimization

Posted by vincent gromakowski <vi...@gmail.com>.
Ideal parallelism is 2-3x the number of cores, but it depends on the number
of partitions of your source and the operations you use (shuffle or not). It
can be worth paying the extra cost of an initial repartition to match your
cluster, but it clearly depends on your DAG (a rough sketch follows the list below).
Optimizing Spark apps depends on a lot of things; it's hard to answer without knowing:
- cluster size
- scheduler
- Spark version
- transformation graph (DAG)
...
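
A minimal Scala sketch of that initial repartition, reusing the placeholder path and column names from the sketch in the original mail:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("AggregationPOC").getOrCreate()
val input = spark.read.parquet("/data/input/records.parquet")

// 2-3x the cluster's 28 cores; 56 is one value in that range.
val targetPartitions = 56

// Pay the cost of one shuffle up front so the scan and partial aggregation
// are spread evenly across the cluster (the groupBy shuffle itself still
// uses spark.sql.shuffle.partitions).
val result = input
  .repartition(targetPartitions)
  .groupBy("key1", "key2")
  .agg(sum("amount").as("total_amount"))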
