Posted to user@pig.apache.org by Vincent BARAT <vi...@ubikod.com> on 2009/10/07 16:54:44 UTC

storing intermediate results ?

Hello,

I'm new to Pig, and I have a bunch of statements that process the same input, which is actually the result of a JOIN between two very big data sets (millions of entries).

I wonder if it is better (faster) to save the result of this JOIN into a Hadoop file and then to LOAD it, instead of just relying on Pig's optimizations?

Thanks a lot for your help.
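For concreteness, the two alternatives can be sketched in Pig Latin (all relation and file names below are hypothetical, just to illustrate the shape of the question):

```pig
-- Alternative 1: a single script, relying on Pig's multi-query optimization
big_a  = LOAD 'input_a' USING PigStorage(',') AS (key:chararray, f1:long);
big_b  = LOAD 'input_b' USING PigStorage(',') AS (key:chararray, f2:long);
joined = JOIN big_a BY key, big_b BY key;
out1   = FOREACH joined GENERATE big_a::key, f1 + f2;  -- pipeline 1
out2   = GROUP joined BY big_a::key;                   -- pipeline 2, same join
STORE out1 INTO 'out1';
STORE out2 INTO 'out2';

-- Alternative 2: materialize the join once, then reload it everywhere
joined = JOIN big_a BY key, big_b BY key;
STORE joined INTO 'joined_data';
reloaded = LOAD 'joined_data' AS (big_a::key:chararray, big_a::f1:long,
                                  big_b::key:chararray, big_b::f2:long);
```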

Re: storing intermediate results ?

Posted by Alan Gates <ga...@yahoo-inc.com>.
The optimizer runs when Pig is invoked from Java. However, until recently, join and multi-query optimization did not work together. See http://issues.apache.org/jira/browse/PIG-983

Alan.

On Oct 8, 2009, at 6:33 AM, Vincent BARAT wrote:



Re: storing intermediate results ?

Posted by Vincent BARAT <vi...@ubikod.com>.
Ok, then I did some testing.

Actually, if I store my first JOIN into a file, I see a 50% speed increase in all my subsequent computations.

I guess it may be related to the fact that I use Pig from Java (maybe the optimizer doesn't work in that mode?).

Here is my code (including just the JOIN and the first computation):

Data loading:
-------------

         Analytics.pigServer.registerQuery(
           "start_sessions = LOAD 'startSession_sample' USING PigStorage(',') "
             + "AS (sid:chararray, infoid:chararray, imei:chararray, start:long);");
         Analytics.pigServer.registerQuery(
           "end_sessions = LOAD 'endSession_sample' USING PigStorage(',') "
             + "AS (sid:chararray, infoid:chararray, imei:chararray, end:long);");

First Join (with storage):
---------------------------

         Analytics.pigServer.registerQuery(
           "sessions = JOIN start_sessions BY sid, end_sessions BY sid;");
         Analytics.pigServer.store("sessions", "sessions");
         Analytics.pigServer.registerQuery(
           "sessions = LOAD 'sessions' "
             + "AS (start_sessions::sid:chararray, start_sessions::infoid:chararray, "
             + "start_sessions::imei:chararray, start_sessions::start:long, "
             + "end_sessions::sid:chararray, end_sessions::infoid:chararray, "
             + "end_sessions::imei:chararray, end_sessions::end:long);");

First join (without storage):
-----------------------------

         Analytics.pigServer.registerQuery(
           "sessions = JOIN start_sessions BY sid, end_sessions BY sid;");

First computation:
------------------

         Analytics.pigServer.registerQuery(
           "session_periods = FOREACH sessions "
             + "GENERATE FLATTEN(SessionPeriods('" + timeBucket.toString() + "', start, end)) "
             + "AS (periodid:int, inner_length:long, outer_length:long);");
         Analytics.pigServer.registerQuery(
           "period_sessions = GROUP session_periods BY periodid;");
         Analytics.pigServer.registerQuery(
           "session_count_and_length = FOREACH period_sessions "
             + "GENERATE group, COUNT(session_periods), "
             + "SUM(session_periods.inner_length), "
             + "SUM(session_periods.outer_length);");

         Analytics.pigServer.store("session_count_and_length",
           Analytics.getHadoopOutputFile("session_count_and_length", timeBucket));




Re: storing intermediate results ?

Posted by Thejas Nair <te...@yahoo-inc.com>.
Hi Zaki,
Please file a jira if you are able to identify the problem you were facing
and the steps to reproduce it.
Thanks,
Thejas




On 10/7/09 1:08 PM, "zaki rahaman" <za...@gmail.com> wrote:



Re: storing intermediate results ?

Posted by zaki rahaman <za...@gmail.com>.
Vincent,

I've run into this problem before. If you know beforehand that you're going
to recycle this joined dataset for several different operations or
pipelines, it is worth your time to simply store it intermediately. While
Pig can definitely handle this and the multi-query optimizer is great, I've
run into problems with it before (I can't remember exactly what now), and
pre-joining has worked well for me.

Hopefully you found some part of that useful.

On Wed, Oct 7, 2009 at 12:33 PM, Ashutosh Chauhan <ashutosh.chauhan@gmail.com> wrote:




-- 
Zaki Rahaman

Re: storing intermediate results ?

Posted by Vincent BARAT <vi...@ubikod.com>.
Hello,

Thanks for your answer.

Actually, I use Pig by running it from Java (using a set of
registerQuery() calls). The exec you mention cannot be used in
that context (AFAIK).


Re: storing intermediate results ?

Posted by Ashutosh Chauhan <as...@gmail.com>.
Hi Vincent,

Pig has a multi-query optimization which, if it fires, will automatically
figure out that the join needs to be done only once, so there will be no
repetition of work. If Pig determines that it is not safe to apply that
optimization, then it is possible that your join is getting computed more
than once. If that is the case, it is better to do the join once and store
the result. You can do that within the same script using "exec":
http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#exec
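A minimal sketch of that pattern in a Pig script (relation and path names here are hypothetical; see the linked docs for the exact semantics of exec): the statements before exec are forced to run first, so the join is computed exactly once and later pipelines read the stored result:

```pig
big_a  = LOAD 'input_a' USING PigStorage(',') AS (key:chararray, f1:long);
big_b  = LOAD 'input_b' USING PigStorage(',') AS (key:chararray, f2:long);
joined = JOIN big_a BY key, big_b BY key;
STORE joined INTO 'joined_data';

exec;  -- run everything above now; the join is materialized on HDFS

-- later pipelines load the stored result instead of recomputing the join
joined2 = LOAD 'joined_data' AS (big_a::key:chararray, big_a::f1:long,
                                 big_b::key:chararray, big_b::f2:long);
```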

You can read more about multi-query optimization here:
http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#Multi-Query+Execution

Hope it helps,
Ashutosh
