Posted to dev@spark.apache.org by "Ulanov, Alexander" <al...@hp.com> on 2015/06/18 23:26:00 UTC

Increase partition count (repartition) without shuffle

Hi,

Is there a way to increase the number of partitions of an RDD without causing a shuffle? I've found the JIRA issue https://issues.apache.org/jira/browse/SPARK-5997, but there is no implementation yet.

In case it helps: I am reading data from ~300 big binary files, which results in 300 partitions. I then need to sort my RDD, but it crashes with an OutOfMemory exception. If I change the number of partitions to 2000, the sort works fine, but the repartition itself takes a lot of time due to the shuffle.
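
Roughly what I am doing, as a minimal sketch (the path, the use of binaryFiles, and the sort key are placeholders for my actual input code):

val raw = sc.binaryFiles("hdfs:///data/*.bin")   // ~300 files -> ~300 partitions
// Sorting directly on 300 large partitions dies with OutOfMemory:
//   raw.sortBy { case (path, _) => path }
// Repartitioning first lets the sort finish, but the extra shuffle is slow:
val sorted = raw.repartition(2000).sortBy { case (path, _) => path }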

Best regards, Alexander

Re: Increase partition count (repartition) without shuffle

Posted by Mridul Muralidharan <mr...@gmail.com>.
If you can scan the input twice, you can of course do a per-partition count and
build a custom RDD which can repartition without a shuffle.
But there is nothing off the shelf, as Sandy mentioned.
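
A rough sketch of the idea (SlicePartition and SplitPartitionsRDD are made-up
names, not anything in Spark; each output slice re-reads its parent partition,
which is the price of avoiding the shuffle):

import scala.reflect.ClassTag
import org.apache.spark.{Dependency, NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// One child partition = one contiguous slice of one parent partition.
class SlicePartition(override val index: Int, val parent: Partition,
                     val from: Int, val until: Int) extends Partition

class SplitPartitionsRDD[T: ClassTag](
    prev: RDD[T],
    counts: Array[Long],      // element count per parent partition (first scan)
    slicesPerParent: Int)
  extends RDD[T](prev.context, Nil) {

  // Every child partition depends on exactly one parent partition,
  // so the dependency is narrow and no shuffle is planned.
  override def getDependencies: Seq[Dependency[_]] = Seq(
    new NarrowDependency[T](prev) {
      override def getParents(partitionId: Int): Seq[Int] =
        Seq(partitionId / slicesPerParent)
    })

  override def getPartitions: Array[Partition] =
    prev.partitions.indices.flatMap { p =>
      val n = counts(p).toInt
      (0 until slicesPerParent).map { s =>
        new SlicePartition(p * slicesPerParent + s, prev.partitions(p),
          s * n / slicesPerParent, (s + 1) * n / slicesPerParent)
      }
    }.toArray[Partition]

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val s = split.asInstanceOf[SlicePartition]
    // Second scan: re-read the parent partition and keep only this slice.
    prev.iterator(s.parent, context).slice(s.from, s.until)
  }
}

// First scan: per-partition counts (collect() preserves partition order).
val counts = rdd.mapPartitions(it => Iterator(it.size.toLong)).collect()
val wider = new SplitPartitionsRDD(rdd, counts, slicesPerParent = 7)  // ~300 -> ~2100 partitions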

Regards
Mridul

On Thursday, June 18, 2015, Sandy Ryza <sa...@cloudera.com> wrote:

> Hi Alexander,
>
> There is currently no way to create an RDD with more partitions than its
> parent RDD without causing a shuffle.
>
> However, if the files are splittable, you can set the Hadoop
> configurations that control split size to something smaller so that the
> HadoopRDD ends up with more partitions.
>
> -Sandy
>
> On Thu, Jun 18, 2015 at 2:26 PM, Ulanov, Alexander <
> alexander.ulanov@hp.com> wrote:
>
>>  Hi,
>>
>>
>>
>> Is there a way to increase the number of partitions of an RDD without
>> causing a shuffle? I’ve found the JIRA issue
>> https://issues.apache.org/jira/browse/SPARK-5997, but there is no
>> implementation yet.
>>
>>
>>
>> In case it helps: I am reading data from ~300 big binary files, which
>> results in 300 partitions. I then need to sort my RDD, but it crashes with
>> an OutOfMemory exception. If I change the number of partitions to 2000, the
>> sort works fine, but the repartition itself takes a lot of time due to the
>> shuffle.
>>
>>
>>
>> Best regards, Alexander
>>
>
>

Re: Increase partition count (repartition) without shuffle

Posted by Sandy Ryza <sa...@cloudera.com>.
Hi Alexander,

There is currently no way to create an RDD with more partitions than its
parent RDD without causing a shuffle.

However, if the files are splittable, you can set the Hadoop configurations
that control split size to something smaller so that the HadoopRDD ends up
with more partitions.
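
For example, with the new MapReduce API something like this should do it (a
sketch only; the InputFormat, key/value classes, and path below are
placeholders for whatever your input actually uses):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
// Cap each input split at 64 MB so every large file yields several partitions.
conf.set("mapreduce.input.fileinputformat.split.maxsize", (64L * 1024 * 1024).toString)

val rdd = sc.newAPIHadoopFile(
  "hdfs:///path/to/binary/files",
  classOf[SequenceFileInputFormat[LongWritable, BytesWritable]],
  classOf[LongWritable],
  classOf[BytesWritable],
  conf)

(With the old mapred API, passing a larger minPartitions to sc.textFile or
sc.hadoopFile has a similar effect for splittable input.)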

-Sandy

On Thu, Jun 18, 2015 at 2:26 PM, Ulanov, Alexander <al...@hp.com>
wrote:

>  Hi,
>
>
>
> Is there a way to increase the number of partitions of an RDD without
> causing a shuffle? I’ve found the JIRA issue
> https://issues.apache.org/jira/browse/SPARK-5997, but there is no
> implementation yet.
>
>
>
> In case it helps: I am reading data from ~300 big binary files, which
> results in 300 partitions. I then need to sort my RDD, but it crashes with
> an OutOfMemory exception. If I change the number of partitions to 2000, the
> sort works fine, but the repartition itself takes a lot of time due to the
> shuffle.
>
>
>
> Best regards, Alexander
>