Posted to user@spark.apache.org by Jean Georges Perrin <jp...@lumeris.com> on 2017/11/03 10:48:52 UTC

Re: Hi all,

Hi Oren,

Why don’t you want to use a GroupBy? You can cache or checkpoint the result and use it in your process, keeping everything in Spark and avoiding save/ingestion...


> On Oct 31, 2017, at 08:17, אורן שמון <oren.shamun@gmail.com> wrote:
> 
> I have two Spark jobs: one does the pre-processing and the second does the processing.
> The processing job needs to run a per-user calculation over the data.
> I want to avoid a shuffle such as groupBy, so I am considering either saving the pre-processing result bucketed by user in Parquet, or repartitioning by user and saving the result.
> 
> Which is preferable, and why?
> Thanks in advance,
> Oren


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Hi all,

Posted by אורן שמון <or...@gmail.com>.
Hi Jean,
We prepare the data for all of the other jobs. We have many jobs scheduled
at different times, but they all need to read the same raw data.

On Fri, Nov 3, 2017 at 12:49 PM Jean Georges Perrin <jp...@lumeris.com>
wrote:

> Hi Oren,
>
> Why don’t you want to use a GroupBy? You can cache or checkpoint the
> result and use it in your process, keeping everything in Spark and avoiding
> save/ingestion...
>
>
> > On Oct 31, 2017, at 08:17, אורן שמון <oren.shamun@gmail.com> wrote:
> >
> > I have two Spark jobs: one does the pre-processing and the second does
> the processing. The processing job needs to run a per-user calculation
> over the data.
> > I want to avoid a shuffle such as groupBy, so I am considering either
> saving the pre-processing result bucketed by user in Parquet, or
> repartitioning by user and saving the result.
> >
> > Which is preferable, and why?
> > Thanks in advance,
> > Oren
>
>