Posted to dev@pig.apache.org by Gang Luo <lg...@yahoo.com.cn> on 2010/07/25 18:27:28 UTC

split operator

Hi all,
According to the VLDB '09 paper, the split operator and all of its successor
operators reside in memory without any blocking in between. However, the source
code (version 0.7) shows that an MR job actually ends when it reaches the split
operator, and multiple new MR jobs are created, each representing one branch.
This write-once-read-multiple-times method is different from the in-memory
method described in that paper. Did Pig change its strategy for split, or is
there still an in-memory version of split that I didn't discover?

Thanks,
-Gang

Re: split operator

Posted by Daniel Dai <ji...@yahoo-inc.com>.

Hi, Gang,
Yes, that's what MultiQueryOptimizer addresses. For a split, we first break
the script into smaller combinable pieces, and MultiQueryOptimizer then
combines as many splitters and splittees as possible into the same map-reduce
job. So after SplitInserter you might see more jobs, but you will end up with
fewer jobs. The algorithm for MultiQueryOptimizer is: for every
splitter, find as many combinable splittees as possible, and combine them
into the same map-reduce job. You can find more details at
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification

Daniel
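
A minimal sketch of the rule described above (for every splitter, find the
combinable splittees and fold them into the splitter's job), written as a toy
Python model. The `Job` class, the `is_splitter` and `combinable` flags, and
`merge_splittees` are hypothetical names invented for illustration; they are
not Pig classes, and Pig's actual combinability rules are not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical stand-in for one map-reduce job in a compiled plan."""
    name: str
    inputs: list                  # names of the jobs whose output this job reads
    is_splitter: bool = False     # assumption: splitters are simply flagged
    combinable: bool = True       # assumption: a flag decides mergeability
    nested: list = field(default_factory=list)  # splittees folded into this job


def merge_splittees(jobs):
    """For every splitter, fold its combinable splittees into the splitter."""
    merged = set()
    for splitter in (job for job in jobs if job.is_splitter):
        for job in jobs:
            # A splittee is modeled as a job that reads only from the splitter.
            if job.inputs == [splitter.name] and job.combinable:
                splitter.nested.append(job.name)
                merged.add(job.name)
    # Jobs that were folded into a splitter no longer run on their own.
    return [job for job in jobs if job.name not in merged]


plan = [
    Job("split", [], is_splitter=True),
    Job("branch1", ["split"]),
    Job("branch2", ["split"]),
    Job("branch3", ["split"], combinable=False),
]
remaining = merge_splittees(plan)
print([job.name for job in remaining], plan[0].nested)
```

In this model the four jobs produced around the split shrink to two: the
splitter (now carrying two nested branches) and the one branch that could not
be combined.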

Re: split operator

Posted by Gang Luo <lg...@yahoo.com.cn>.

Hi Daniel,
I asked this question a while ago, but I now have some more thoughts on
it. In a query as simple as this:

A = LOAD 'input';
B = FILTER A BY $1 == 1;
C = COGROUP A BY $0, B BY $0;

the optimizer will insert a split operator to reuse A. According to the source
code, a map-reduce job ends when it reaches the split and writes its result to
A1 and A2, which are then used by two subsequent jobs to process B and C. In
this case, the first job does nothing meaningful except copy the source 'input'
twice. Is some optimization applied here (like the MultiQueryOptimizer you
mentioned previously)? If so, how?

Since I haven't looked at the MultiQueryOptimizer, it would be a great help if
you could briefly describe how it works. Thanks a lot.

-Gang

----- Original Message ----
From: Daniel Dai <ji...@yahoo-inc.com>
To: "pig-dev@hadoop.apache.org" <pi...@hadoop.apache.org>
Sent: 2010/7/26 (Mon) 4:58:49 PM
Subject: Re: split operator

Hi, Gang,
It is about multiquery optimization. In MRCompiler, we create a
map-reduce boundary for split; later, in MultiQueryOptimizer, we
merge several splits into one map-reduce job. In this map-reduce job, we
nest several split plans.

Daniel

Re: split operator

Posted by Daniel Dai <ji...@yahoo-inc.com>.

Hi, Gang,
It is about multiquery optimization. In MRCompiler, we create a
map-reduce boundary for split; later, in MultiQueryOptimizer, we
merge several splits into one map-reduce job. In this map-reduce job, we
nest several split plans.

Daniel
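
As a rough illustration of what "nesting several split plans" in one job could
look like, here is a toy Python sketch in which a single pass over the input
feeds every record to each branch. The function `run_nested_split`, the plan
names, and the sample records are assumptions made up for this example; the
second plan mirrors a filter of the form `FILTER A BY $1 == 1`. It is not how
Pig implements its operators.

```python
def run_nested_split(records, plans):
    """Feed each record to every nested plan during one pass over the input."""
    outputs = {name: [] for name in plans}
    for record in records:                 # the input is read only once
        for name, plan in plans.items():
            result = plan(record)          # a plan returns a record, or None to drop it
            if result is not None:
                outputs[name].append(result)
    return outputs


records = [("k1", 1), ("k2", 0), ("k1", 2)]
plans = {
    "A": lambda r: r,                          # pass-through branch
    "B": lambda r: r if r[1] == 1 else None,   # keeps records whose second field is 1
}
outputs = run_nested_split(records, plans)
print(outputs["A"])
print(outputs["B"])
```

The point of the sketch is only that both branches are served from one scan,
so no intermediate copy of the input is needed between them.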

Re: split operator

Posted by Gang Luo <lg...@yahoo.com.cn>.

Hi Daniel,
In Section 4.3.1, the example and Figure 6 show this. The last paragraph of
Section 5.1 says the split operator maintains a one-tuple buffer for each
branch and discusses how to synchronize multiple branches. I do think that is
the in-memory split.

Here is the paper: http://www.vldb.org/pvldb/2/vldb09-1074.pdf

-Gang

----- Original Message ----
From: Daniel Dai <ji...@yahoo-inc.com>
To: "pig-dev@hadoop.apache.org" <pi...@hadoop.apache.org>
Sent: 2010/7/26 (Mon) 2:09:25 PM
Subject: Re: split operator

Hi, Gang,
Which part of the paper are you talking about? We don't do in-memory split. We
dump the split result to a temporary file and start a new map-reduce job. Split
does create a map-reduce boundary (though that is not entirely true: the
multiquery optimizer may combine some of these jobs).

Daniel

Re: split operator

Posted by Daniel Dai <ji...@yahoo-inc.com>.

Hi, Gang,
Which part of the paper are you talking about? We don't do in-memory
split. We dump the split result to a temporary file and start a new
map-reduce job. Split does create a map-reduce boundary (though that is not
entirely true: the multiquery optimizer may combine some of these jobs).

Daniel
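
The write-once-read-multiple-times behavior described above can be sketched
as a toy Python example: one step writes the split result to a temporary
file, and each branch then re-reads that file as a separate step. The names
`materialize` and `run_branch`, the JSON-lines file format, and the sample
rows are assumptions for illustration only; Pig's temporary storage is not
modeled.

```python
import json
import os
import tempfile


def materialize(records):
    """First step: write the split result once to a temporary file."""
    with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
        for record in records:
            tmp.write(json.dumps(record) + "\n")
    return tmp.name


def run_branch(path, keep):
    """Follow-up step: re-read the temporary file and apply one branch."""
    with open(path) as handle:
        rows = (json.loads(line) for line in handle)
        return [row for row in rows if keep(row)]


path = materialize([["k1", 1], ["k2", 0]])
try:
    branch_all = run_branch(path, lambda row: True)              # pass-through branch
    branch_filtered = run_branch(path, lambda row: row[1] == 1)  # filtering branch
finally:
    os.remove(path)  # clean up the temporary file

print(branch_all)
print(branch_filtered)
```

Each branch pays for a full re-read of the materialized data, which is the
cost that combining jobs is meant to avoid.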
