Posted to user@spark.apache.org by "Fang, Mike" <ch...@paypal.com.INVALID> on 2015/08/05 07:47:43 UTC

control the number of reducers for groupby in data frame

Hi,

Does anyone know how I could control the number of reducers when we do an operation such as groupBy on a DataFrame?
I can set spark.sql.shuffle.partitions in SQL, but I am not sure how to do it with the df.groupBy("XX") API.

Thanks,
Mike

Re: control the number of reducers for groupby in data frame

Posted by "Fang, Mike" <ch...@paypal.com.INVALID>.
Thanks Farooqui.

From: Sameer Farooqui <sa...@databricks.com>
Date: Wednesday, August 5, 2015 at 2:46 PM
To: "Fang, Mike" <ch...@paypal.com.invalid>
Cc: "user@spark.apache.org" <us...@spark.apache.org>, "Fang, Mike" <ch...@gmail.com>
Subject: Re: control the number of reducers for groupby in data frame

Hi Mike,

When using DataFrames, the same parameter, spark.sql.shuffle.partitions, determines how many partitions (and hence how many reducer tasks are launched) the RDD will have after a shuffle operation such as groupBy.

When using DataFrames, Spark does not determine the number of partitions automatically. Whenever it needs to shuffle, it simply uses spark.sql.shuffle.partitions as the number of partitions (reducers) in the next RDD.

In a future version of Spark, DataFrames may dynamically decide the ideal number of partitions for the post-shuffle RDD.
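
For example, here is a minimal sketch of that in the Scala DataFrame API (assuming a Spark 1.x SQLContext; the input path and the column name "XX" are hypothetical):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)              // sc: an existing SparkContext
    val df = sqlContext.read.json("people.json")     // hypothetical input

    // Set the number of post-shuffle partitions (the default is 200).
    sqlContext.setConf("spark.sql.shuffle.partitions", "10")

    // The groupBy shuffle now produces 10 partitions, i.e. 10 reducer tasks.
    val counts = df.groupBy("XX").count()
    println(counts.rdd.partitions.length)            // prints 10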



On Wed, Aug 5, 2015 at 1:47 PM, Fang, Mike <ch...@paypal.com.invalid> wrote:
Hi,

Does anyone know how I could control the number of reducers when we do an operation such as groupBy on a DataFrame?
I can set spark.sql.shuffle.partitions in SQL, but I am not sure how to do it with the df.groupBy("XX") API.

Thanks,
Mike


Re: control the number of reducers for groupby in data frame

Posted by Sameer Farooqui <sa...@databricks.com>.
Hi Mike,

When using DataFrames, the same parameter, spark.sql.shuffle.partitions, determines how many partitions (and hence how many reducer tasks are launched) the RDD will have after a shuffle operation such as groupBy.

When using DataFrames, Spark does not determine the number of partitions automatically. Whenever it needs to shuffle, it simply uses spark.sql.shuffle.partitions as the number of partitions (reducers) in the next RDD.

In a future version of Spark, DataFrames may dynamically decide the ideal number of partitions for the post-shuffle RDD.
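
As a side note, the SQL route from the original question sets the very same conf, so a SET statement also applies to later DataFrame shuffles. A small sketch (again assuming an existing sqlContext, and a hypothetical DataFrame df with a column "XX"):

    // Equivalent to sqlContext.setConf(...): SET updates the same property.
    sqlContext.sql("SET spark.sql.shuffle.partitions=50")

    // Both a SQL GROUP BY and a DataFrame groupBy now shuffle into 50 partitions.
    df.registerTempTable("people")                   // hypothetical table name
    val viaSql = sqlContext.sql("SELECT XX, count(*) FROM people GROUP BY XX")
    val viaDf  = df.groupBy("XX").count()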



On Wed, Aug 5, 2015 at 1:47 PM, Fang, Mike <ch...@paypal.com.invalid>
wrote:

> Hi,
>
> Does anyone know how I could control the number of reducers when we do an
> operation such as groupBy on a DataFrame?
> I can set spark.sql.shuffle.partitions in SQL, but I am not sure how to do
> it with the df.groupBy("XX") API.
>
> Thanks,
> Mike
>