Posted to user@pig.apache.org by Marco Cadetg <ma...@zattoo.com> on 2015/01/08 10:27:37 UTC

how to optimize multiple stores

Hi there,

I have a big Pig script which first generates an expensive intermediate
result, on which I then run multiple GROUP BY statements and multiple STOREs.
Something like this:

Register UDFs etc
A = LOAD....
B = LOAD....
C = LOAD....

-- do lots of transformations with A, B and C to get the intermediate result INTER_RES
result1 = FOREACH (GROUP INTER_RES BY (...
STORE result1 INTO '....
result2 = FOREACH (GROUP INTER_RES BY (...
STORE result2 INTO '....
result3 = FOREACH (GROUP INTER_RES BY (...
STORE result3 INTO '....
result4 = FOREACH (GROUP INTER_RES BY (...
STORE result4 INTO '....
...
...

Note that the stored results are independent of each other, meaning they
are not used as input for anything further down and they also do not alter
INTER_RES.

Am I correct that Pig should only need to LOAD A, B and C once? From what I
can see in the command-line output, it looks like the expensive intermediate
is recomputed for each store. I did a quick test, and if I STORE the
intermediate and then LOAD it again, the script seems to be faster. Is there a
way to avoid storing the expensive intermediate?

Cheers,
-Marco
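
For reference, the store-and-reload workaround described above could look
roughly like the sketch below. All paths, field names, the FILTER stand-in and
the grouping keys are hypothetical, since the original script elides them. The
sketch also assumes that Pig runs the STORE before the LOAD that reads the same
path.

```pig
-- Hypothetical input and schema; the real script loads A, B and C.
A = LOAD '/data/a' USING PigStorage('\t')
    AS (key1:chararray, key2:chararray, value:long);

-- Stand-in for the expensive transformations that produce INTER_RES.
INTER_RES = FILTER A BY value > 0;

-- Materialize the intermediate once (hypothetical path).
STORE INTER_RES INTO '/tmp/inter_res' USING PigStorage('\t');

-- Read the materialized copy back under a new alias.
INTER_RES_MAT = LOAD '/tmp/inter_res' USING PigStorage('\t')
    AS (key1:chararray, key2:chararray, value:long);

-- Each independent grouping now reads the stored copy.
result1 = FOREACH (GROUP INTER_RES_MAT BY key1)
          GENERATE group, SUM(INTER_RES_MAT.value);
STORE result1 INTO '/tmp/result1';

result2 = FOREACH (GROUP INTER_RES_MAT BY key2)
          GENERATE group, COUNT(INTER_RES_MAT);
STORE result2 INTO '/tmp/result2';
```

The trade-off is one extra write and read of the intermediate in exchange for
not repeating the transformations for every store.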

Re: how to optimize multiple stores

Posted by Marco Cadetg <ma...@zattoo.com>.
Hi Rodrigo,

Thanks for your suggestion, though I don't see how the MultiStorage UDF
helps here.

> Register UDFs etc
> A = LOAD....
> B = LOAD....
> C = LOAD....
>
> -- do lots of transformations with A and B and C get intermediate result
> INTER_RES
> result1 = FOREACH (GROUP INTER_RES BY (...
> STORE result1 INTO '....
> result2 = FOREACH (GROUP INTER_RES BY (...
> STORE result2 INTO '....
> result3 = FOREACH (GROUP INTER_RES BY (...
> STORE result3 INTO '....
> result4 = FOREACH (GROUP INTER_RES BY (...
> STORE result4 INTO '....
> ...
> ...
>

The different projections (groupings) are not done in the intermediate
result INTER_RES; they are done later...

Cheers,
-Marco

On Thu, Jan 8, 2015 at 12:04 PM, Rodrigo Ferreira <we...@gmail.com> wrote:

> Marco,
>
> check out this UDF:
>
> http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
>
> I think it can get the job done without having to group everything.
>
> Cheers,
> Rodrigo

Re: how to optimize multiple stores

Posted by Rodrigo Ferreira <we...@gmail.com>.
Marco,

check out this UDF:
http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/MultiStorage.html

I think it can get the job done without having to group everything.

Cheers,
Rodrigo