Posted to user@spark.apache.org by Artur R <ar...@gpnxgroup.com> on 2017/03/17 21:52:01 UTC

How to redistribute dataset without full shuffle

Hi!

I use Spark heavily for various workloads and often run into the situation
where there is a skewed dataset (without any partitioner assigned) and I
just want to "redistribute" its data more evenly.

For example, say there is an RDD of X partitions with Y rows each, except
for one large partition with Y * 10 rows. I don't want to change the number
of partitions, only redistribute the data. Obviously, such an operation
should not need to send more than ~Y * 9 rows across the network.
But the only option available is repartition, which requires a full shuffle
that moves ALL (X * Y) rows.
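
For reference, a minimal sketch of what I mean (local mode, made-up sizes
X = 8 and Y = 1000, names just for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("skew-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // X partitions of ~Y rows each, plus one partition (index 0) with Y * 10 rows.
    val X = 8
    val Y = 1000
    val skewed = sc.parallelize(0 until X, X).flatMap { p =>
      Iterator.fill(if (p == 0) Y * 10 else Y)(p)
    }

    // The only built-in way to even things out shuffles every row:
    val rebalanced = skewed.repartition(X)
    println(rebalanced.mapPartitions(it => Iterator(it.size)).collect().mkString(", "))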

The question: why is there no such operation as "redistribute"?
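
The closest workaround I can think of (my own sketch, not an existing Spark
API) is to key every row by its current partition index and apply a custom
Partitioner that keeps the ordinary partitions in place and spreads only the
heavy one. Spark still schedules a shuffle stage for this, so I am not sure
it actually avoids the network cost in practice, which is exactly why a
first-class "redistribute" would help:

    import org.apache.spark.Partitioner

    // Keeps every partition where it is, except the one oversized partition,
    // whose rows are spread round-robin by their position within that partition.
    class SpreadOnePartitioner(numParts: Int, heavy: Int) extends Partitioner {
      def numPartitions: Int = numParts
      def getPartition(key: Any): Int = key match {
        case (src: Int, i: Int) => if (src != heavy) src else i % numParts
      }
    }

    // NOTE: partitionBy still runs a shuffle stage; whether the unmoved rows
    // cross the network depends on scheduling, so this only approximates
    // the "redistribute" operation I am asking about.
    val redistributed = skewed
      .mapPartitionsWithIndex { (idx, it) =>
        it.zipWithIndex.map { case (row, i) => ((idx, i), row) }
      }
      .partitionBy(new SpreadOnePartitioner(X, heavy = 0))
      .values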