Posted to user@spark.apache.org by Aniket Bhatnagar <an...@gmail.com> on 2016/11/11 17:22:14 UTC

Dataset API | Setting number of partitions during join/groupBy

Hi

I can't seem to find a way to pass the number of partitions while joining 2
Datasets or doing a groupBy operation on a Dataset. There is an option of
repartitioning the resultant Dataset, but it's inefficient to repartition
after the Dataset has already been joined/grouped into the default number
of partitions. With the RDD API this was easy, as the functions accepted a
numPartitions parameter. The only way to do this seems to be
sparkSession.conf.set(SQLConf.SHUFFLE_PARTITIONS.key, <num partitions>),
but this means that all join/groupBy operations going forward will have
the same number of partitions.
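
To illustrate, here is a minimal sketch of the difference (placeholder
names rdd1/rdd2/ds1/ds2, assuming Spark 2.0 and the Scala API):

// RDD API: parallelism can be chosen per operation
val joinedRdd = rdd1.join(rdd2, 1000)        // 1000 output partitions
val groupedRdd = rdd1.groupByKey(1000)

// Dataset API: no numPartitions parameter, only the global shuffle setting
sparkSession.conf.set("spark.sql.shuffle.partitions", "1000")
val joinedDs = ds1.join(ds2, "key")          // planned with 1000 shuffle partitions
val groupedDs = ds1.groupBy("key").count()   // ...and so is every later shuffle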

Thanks,
Aniket

Re: Dataset API | Setting number of partitions during join/groupBy

Posted by Aniket Bhatnagar <an...@gmail.com>.
Hi Shreya

The initial Datasets had more than 1000 partitions, and after a group by
operation the resultant Dataset had only 200 partitions (because the
default number of shuffle partitions is 200). Any further operations on
the resultant Dataset therefore run with a maximum parallelism of 200,
resulting in inefficient use of the cluster.

I am performing multiple join & group by operations on Datasets that are
huge (8TB+), and low parallelism severely affects the time it takes to run
the data pipeline. The workaround of setting
sparkSession.conf.set(SQLConf.SHUFFLE_PARTITIONS.key, <num partitions>)
works, but it would be ideal to set the number of partitions per
join/group by operation, like we could with the RDD API.

Thanks,
Aniket

On Fri, Nov 11, 2016 at 6:27 PM Shreya Agarwal <sh...@microsoft.com>
wrote:

> Curious – why do you want to repartition? Is there a subsequent step
> which fails because the number of partitions is too low? Or do you want
> to do it for a perf gain?
>
>
>
> Also, what were your initial Dataset partitions and how many did you
> have for the result of the join?
>
>
>
> *From:* Aniket Bhatnagar [mailto:aniket.bhatnagar@gmail.com]
> *Sent:* Friday, November 11, 2016 9:22 AM
> *To:* user <us...@spark.apache.org>
> *Subject:* Dataset API | Setting number of partitions during join/groupBy
>
>
>
> Hi
>
>
>
> I can't seem to find a way to pass the number of partitions while
> joining 2 Datasets or doing a groupBy operation on a Dataset. There is
> an option of repartitioning the resultant Dataset, but it's inefficient
> to repartition after the Dataset has already been joined/grouped into
> the default number of partitions. With the RDD API this was easy, as the
> functions accepted a numPartitions parameter. The only way to do this
> seems to be sparkSession.conf.set(SQLConf.SHUFFLE_PARTITIONS.key, <num
> partitions>), but this means that all join/groupBy operations going
> forward will have the same number of partitions.
>
>
>
> Thanks,
>
> Aniket
>

RE: Dataset API | Setting number of partitions during join/groupBy

Posted by Shreya Agarwal <sh...@microsoft.com>.
Curious – why do you want to repartition? Is there a subsequent step which fails because the number of partitions is too low? Or do you want to do it for a perf gain?

Also, what were your initial Dataset partitions and how many did you have for the result of the join?

From: Aniket Bhatnagar [mailto:aniket.bhatnagar@gmail.com]
Sent: Friday, November 11, 2016 9:22 AM
To: user <us...@spark.apache.org>
Subject: Dataset API | Setting number of partitions during join/groupBy

Hi

I can't seem to find a way to pass the number of partitions while joining 2 Datasets or doing a groupBy operation on a Dataset. There is an option of repartitioning the resultant Dataset, but it's inefficient to repartition after the Dataset has already been joined/grouped into the default number of partitions. With the RDD API this was easy, as the functions accepted a numPartitions parameter. The only way to do this seems to be sparkSession.conf.set(SQLConf.SHUFFLE_PARTITIONS.key, <num partitions>), but this means that all join/groupBy operations going forward will have the same number of partitions.

Thanks,
Aniket