Posted to user@spark.apache.org by Alexander Pivovarov <ap...@gmail.com> on 2015/08/14 04:56:45 UTC

Reduce number of partitions before saving to file. coalesce or repartition?

Hi Everyone

Which one should be faster, coalesce or repartition, if I need to reduce the
number of partitions from 5000 to 3 before saving an RDD with saveAsTextFile?

The total data size is about 400MB on disk in text format.

Thank you

Re: Reduce number of partitions before saving to file. coalesce or repartition?

Posted by Anish Haldiya <an...@sigmoidanalytics.com>.

Hi,

If you are decreasing the number of partitions in this RDD, consider
using coalesce, which can avoid performing a shuffle.

However, if you're doing a drastic coalesce, e.g. to numPartitions =
1, this may result in your computation taking place on fewer nodes
than you like (e.g. one node in the case of numPartitions = 1). To
avoid this, you can pass shuffle = true. This will add a shuffle step,
but means the current upstream partitions will be executed in parallel
(per whatever the current partitioning is).
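
To make the two options concrete, here is a minimal sketch in Python
(PySpark-style RDD calls). The helper name save_with_few_partitions and the
paths in the usage note are hypothetical and only for illustration. It relies
on coalesce(n, shuffle=True) being what repartition(n) does on an RDD, so the
shuffle flag is the only thing that separates the two choices in the question.

```python
def save_with_few_partitions(rdd, path, num_partitions=3, shuffle=False):
    """Reduce `rdd` to `num_partitions` partitions, then write it as text.

    shuffle=False: plain coalesce. No shuffle is added, but the upstream
        work in the same stage runs in only `num_partitions` tasks.
    shuffle=True: equivalent to rdd.repartition(num_partitions). A shuffle
        is added, but the upstream partitions are still computed in
        parallel at their current partitioning.
    """
    reduced = rdd.coalesce(num_partitions, shuffle=shuffle)
    # saveAsTextFile writes one part file per partition of `reduced`.
    reduced.saveAsTextFile(path)
    return reduced
```

Assuming a live SparkContext named sc, usage might look like
save_with_few_partitions(sc.textFile("in_dir"), "out_dir", 3, shuffle=True).
Which setting is faster for the 5000 to 3 case likely depends on how expensive
the upstream computation is, so it is worth timing both on the real job.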

Regards,

anish
