Posted to user@spark.apache.org by אורן שמון <or...@gmail.com> on 2017/10/31 15:01:59 UTC

Bucket vs repartition

Hi all,
I have two Spark jobs: the first is a pre-process job and the second is the
process job. The process job needs to run a calculation for each user in the data.
I want to avoid a shuffle (such as the one caused by groupBy), so I am
considering either saving the pre-process result in Parquet bucketed by user,
or repartitioning by user and then saving the result.
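
To make the goal concrete, here is a minimal plain-Python sketch of the property I am after. It is not Spark code: the function names, the sample rows, and the use of crc32 as the hash are all assumptions made up for illustration. The idea is that once rows are grouped by a hash of the user, every row for a given user sits in the same partition, so the per-user calculation can run inside each partition without moving data between them.

```python
from collections import defaultdict
from zlib import crc32


def partition_by_user(rows, num_partitions):
    """Assign each (user, value) row to a partition by hashing the user key."""
    partitions = [[] for _ in range(num_partitions)]
    for user, value in rows:
        # crc32 is only a stable stand-in; a real engine uses its own hash.
        idx = crc32(user.encode("utf-8")) % num_partitions
        partitions[idx].append((user, value))
    return partitions


def per_user_totals(partition):
    """Aggregate inside one partition only; no other partition is consulted."""
    totals = defaultdict(int)
    for user, value in partition:
        totals[user] += value
    return dict(totals)


# Hypothetical sample data standing in for the pre-process output.
rows = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4), ("bob", 5)]
partitions = partition_by_user(rows, 4)
results = [per_user_totals(p) for p in partitions]
```

Because the partition index depends only on the user, each user shows up in exactly one entry of results. Both bucketing and repartitioning by user aim at this layout, as far as I understand; my question is which of the two lets the second job actually take advantage of it.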

Which is preferred, and why?
Thanks in advance,
Oren