Posted to user@spark.apache.org by Chirag Aggarwal <Ch...@guavus.com> on 2014/09/01 14:02:18 UTC

Value of SHUFFLE_PARTITIONS

Hi,

Currently, the number of shuffle partitions is a config-driven parameter (SHUFFLE_PARTITIONS). This means that anyone running a Spark SQL query must first
analyze what value of SHUFFLE_PARTITIONS would give the best performance for that particular query.
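
For concreteness, here is a minimal sketch (assuming the Spark 1.x SQLContext API; the table name my_table is just a placeholder) of how the value has to be tuned by hand today:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("shuffle-partitions-demo"))
  val sqlContext = new SQLContext(sc)

  // Default is 200; the user has to guess a better value per query.
  sqlContext.sql("SET spark.sql.shuffle.partitions=5")

  // Any query that triggers a shuffle (e.g. GROUP BY) now uses 5 reduce-side partitions.
  sqlContext.sql("SELECT key, COUNT(*) FROM my_table GROUP BY key").collect()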

Shouldn't there be logic in Spark SQL that can figure out the best value automatically, while still giving preference to a user-specified value?
I believe this could be worked out from the number of partitions in the original data, roughly as sketched below.
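
A purely hypothetical sketch of that heuristic (the method name chooseShufflePartitions is mine, not an existing API): use the user-specified value if one was set, otherwise fall back to the input's partition count.

  // Hypothetical heuristic: prefer an explicitly set spark.sql.shuffle.partitions value,
  // otherwise default to the number of partitions in the original data.
  def chooseShufflePartitions(inputPartitions: Int, userSetValue: Option[Int]): Int =
    userSetValue.getOrElse(math.max(1, inputPartitions))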

I ran some queries with the default value (200) for shuffle partitions; when I changed this value to 5, the time taken by the query dropped by nearly 35%.

Thanks,
Chirag