Posted to user@spark.apache.org by אורן שמון <or...@gmail.com> on 2017/10/31 13:17:43 UTC
Hi all,
I have two Spark jobs: a pre-process job and a process job.
The process job needs to run a calculation for each user in the data.
I want to avoid a shuffle (such as the one a groupBy triggers), so I am
thinking of either saving the pre-process result bucketed by user in
Parquet, or repartitioning by user and then saving the result.
Which is preferred, and why?
Thanks in advance,
Oren
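
To make the bucketing idea concrete, here is a minimal plain-Python sketch (not Spark code) of what "bucket by user" buys the second job: the pre-process step routes every record to a bucket chosen by a stable hash of the user id, so the process step can aggregate each bucket on its own without moving records between buckets. The bucket count, function names, and sample records are all made up for illustration. In Spark itself the corresponding writer call is DataFrameWriter.bucketBy, which, as far as I know, has to be combined with saveAsTable so that the bucketing metadata is recorded; a plain repartition followed by a Parquet write does not record how the data was partitioned, so a later groupBy on the user column would most likely still shuffle.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for this sketch


def bucket_of(user_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Stable hash of the user id, reduced to a bucket index."""
    return zlib.crc32(user_id.encode("utf-8")) % num_buckets


def preprocess(records, num_buckets: int = NUM_BUCKETS):
    """Pre-process job: route each (user_id, value) record to its bucket once."""
    buckets = [[] for _ in range(num_buckets)]
    for user_id, value in records:
        buckets[bucket_of(user_id, num_buckets)].append((user_id, value))
    return buckets


def process_bucket(bucket):
    """Process job: per-user totals computed from a single bucket only."""
    totals = defaultdict(int)
    for user_id, value in bucket:
        totals[user_id] += value
    return dict(totals)


# Made-up sample data for illustration.
records = [("alice", 3), ("bob", 5), ("alice", 4), ("carol", 1), ("bob", 2)]
buckets = preprocess(records)

# Each bucket is handled independently. Merging is conflict-free because
# a given user never appears in more than one bucket.
per_user = {}
for bucket in buckets:
    per_user.update(process_bucket(bucket))
```

The point of the sketch is that the cost of grouping by user is paid once, at write time, and every later job that reads the bucketed output can reuse that layout.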
Re: Hi all,
Posted by אורן שמון <or...@gmail.com>.
Hi Jean,
We prepare the data for all the other jobs. We have a lot of jobs that are
scheduled at different times, but all of them need to read the same raw data.
On Fri, Nov 3, 2017 at 12:49 PM Jean Georges Perrin <jp...@lumeris.com>
wrote:
> Hi Oren,
>
> Why don’t you want to use a GroupBy? You can cache or checkpoint the
> result and use it in your process, keeping everything in Spark and avoiding
> save/ingestion...
Re: Hi all,
Posted by Jean Georges Perrin <jp...@lumeris.com>.
Hi Oren,
Why don’t you want to use a GroupBy? You can cache or checkpoint the result and use it in your process, keeping everything in Spark and avoiding save/ingestion...