Posted to user@spark.apache.org by Brandon White <bw...@gmail.com> on 2016/07/27 06:26:03 UTC

Setting spark.sql.shuffle.partitions Dynamically

Hello,

My platform runs hundreds of Spark jobs every day, each with its own
data size ranging from 20 MB to 20 TB. This means that we need to set
resources dynamically. One major pain point for doing this is
spark.sql.shuffle.partitions, the number of partitions to use when
shuffling data for joins or aggregations. It seems to be arbitrarily
hard-coded to 200. The only way to set this config is on the
spark-submit command line or in the SparkConf before the SparkContext
is created.
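
For concreteness, both options have to happen before the job starts
(a minimal Scala sketch; the app name and the 400 are arbitrary
example values):

    // Option 1: fixed for the whole application at submit time:
    //   spark-submit --conf spark.sql.shuffle.partitions=400 ...

    // Option 2: in SparkConf, before any context exists:
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf()
      .setAppName("my-job")                       // arbitrary example
      .set("spark.sql.shuffle.partitions", "400") // arbitrary example

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)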

This creates a lot of problems when I want to set this config
dynamically based on the in-memory size of a DataFrame. I only learn
the in-memory size of the DataFrame halfway through the Spark job, so
I would need to stop the context and recreate it just to change this
one config.
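
The only workaround I can see looks roughly like this (a sketch that
continues from the snippet above; estimateSizeInBytes stands in for
whatever size metric the job can observe, and 128 MB per partition is
just an assumed heuristic):

    // df is whatever DataFrame the job has built up to this point.
    // Hypothetical helper: however the job measures the DataFrame's
    // in-memory size halfway through.
    val sizeInBytes: Long = estimateSizeInBytes(df)
    val targetPartitions =
      math.max(1, (sizeInBytes / (128L * 1024 * 1024)).toInt)

    // Tear the whole context down just to change one setting.
    // All cached data and running state are lost.
    val newConf = sc.getConf // getConf returns a copy of the conf
      .set("spark.sql.shuffle.partitions", targetPartitions.toString)
    sc.stop()
    val newSc = new SparkContext(newConf)
    val newSqlContext = new SQLContext(newSc)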

Is there any better way to set this? And how does
spark.sql.shuffle.partitions differ from .repartition()?
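
To make the second question concrete (dfA and dfB are illustrative):

    // .repartition(n) reshuffles one specific DataFrame to exactly
    // n partitions, at the point where it is called:
    val reshaped = dfA.repartition(1000)

    // spark.sql.shuffle.partitions is the default partition count
    // for every shuffle the SQL planner inserts (joins,
    // aggregations), so this join produces 200 post-shuffle
    // partitions unless the config is changed:
    val joined = dfA.join(dfB, "id")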

Brandon

Re: Setting spark.sql.shuffle.partitions Dynamically

Posted by Takeshi Yamamuro <li...@gmail.com>.
Hi,

How about trying adaptive execution in Spark?
https://issues.apache.org/jira/browse/SPARK-9850
It picks the number of post-shuffle partitions at runtime, but it is
turned off by default because it is still experimental.
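
For example (a minimal sketch; these config names are from the
1.6/2.0 line, so please check them against your version):

    // Let Spark choose the post-shuffle partition count at runtime
    // from the actual map output sizes:
    sqlContext.setConf("spark.sql.adaptive.enabled", "true")

    // Target bytes per post-shuffle partition (64 MB is an assumed
    // example value):
    sqlContext.setConf(
      "spark.sql.adaptive.shuffle.targetPostShuffleInputSize",
      (64L * 1024 * 1024).toString)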

// maropu



-- 
---
Takeshi Yamamuro