Posted to dev@pig.apache.org by Gang Luo <lg...@yahoo.com.cn> on 2010/08/24 01:20:39 UTC
Re: split operator
Hi Daniel,
I asked about this a while ago, but I have had some more thoughts on
it. In a query as simple as this:
A = LOAD 'input';
B = FILTER A BY $1 == 1;
C = COGROUP A BY $0, B BY $0;
the optimizer will insert a split operator to reuse A. According to the source
code, a map-reduce job ends when it reaches the split and writes its output to
A1 and A2, which are then used by two subsequent jobs to process B and C. In this
case, the first job does nothing meaningful but copy the source 'input' twice. Is
there some optimization applied here (like the MultiQueryOptimizer you mentioned
previously)? If so, how does it work?
Since I haven't looked at the MultiQueryOptimizer, it would be a great help if
you could briefly describe how it works. Thanks a lot.
-Gang
----- Original Message ----
From: Daniel Dai <ji...@yahoo-inc.com>
To: "pig-dev@hadoop.apache.org" <pi...@hadoop.apache.org>
Sent: 2010/7/26 (Mon) 4:58:49 PM
Subject: Re: split operator
Hi, Gang,
It is about multiquery optimization. In MRCompiler, we create a
map-reduce boundary for each split; later, in MultiQueryOptimizer, we
merge several splits into one map-reduce job. In this map-reduce job, we
nest several split plans.
Daniel
Gang Luo wrote:
> Hi Daniel,
> In section 4.3.1, the example and Figure 6 show this. The last paragraph of 5.1
> says the split operator maintains a one-tuple buffer for each branch and talks
> about how to synchronize multiple branches. I do think that is the in-memory split.
>
> here is the paper: http://www.vldb.org/pvldb/2/vldb09-1074.pdf
>
>
> -Gang
>
>
>
> ----- Original Message ----
> From: Daniel Dai <ji...@yahoo-inc.com>
> To: "pig-dev@hadoop.apache.org" <pi...@hadoop.apache.org>
> Sent: 2010/7/26 (Mon) 2:09:25 PM
> Subject: Re: split operator
>
> Hi, Gang,
> Which part of the paper are you talking about? We don't do in-memory split. We
> dump the split result to a temporary file and start a new map-reduce job. Split
> does create a map-reduce boundary (though that is not entirely true: the multiquery
> optimizer may combine some of these jobs).
>
> Daniel
>
> Gang Luo wrote:
>
>> Hi all,
>> According to the VLDB 09 paper, the split operator and all its successive
>> operators reside in memory without any blocking in between. However, the source
>> code (version 0.7) shows that an MR job is actually ended when it meets the split
>> operator, and multiple new MR jobs are created, each representing one branch.
>> This write-once-read-multiple-times method is different from the in-memory
>> method mentioned in that paper. Did Pig change the strategy for split, or is
>> there still an in-memory version of split I didn't discover?
>>
>> Thanks,
>> -Gang
Re: split operator
Posted by Daniel Dai <ji...@yahoo-inc.com>.
Hi, Gang,
Yes, that's what MultiQueryOptimizer addresses. With split insertion, we break
the script into smaller combinable pieces, and MultiQueryOptimizer then
combines as many splitters and splittees as possible into the same map-reduce
job. So after SplitInserter you might see more jobs, but you will end up with
fewer jobs. The algorithm for MultiQueryOptimizer is: for every
splitter, find as many combinable splittees as possible, and combine them into
the same map-reduce job. You can find more details at
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification
Daniel
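
The merge step described in this reply ("for every splitter, find combinable splittees and combine them into the same map-reduce job") can be sketched roughly as follows. This is a toy model in Python, not Pig's actual Java implementation: the Job class, the combinable flag, and merge_splittees are hypothetical names made up for illustration, and the real rules for which splittees are combinable are not modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Toy stand-in for a map-reduce job (hypothetical, not a Pig class)."""
    name: str
    operators: list
    is_splitter: bool = False
    # Assumption: a single flag decides whether a job may be nested
    # under a splitter. Pig's real criteria are more involved.
    combinable: bool = True
    successors: list = field(default_factory=list)


def merge_splittees(jobs):
    """For every splitter, fold its combinable successors into it as nested plans."""
    merged_away = set()
    for job in jobs:
        if not job.is_splitter:
            continue
        remaining = []
        for succ in job.successors:
            if succ.combinable:
                # Nest the splittee's operators inside the splitter job.
                job.operators.append(("nested", succ.name, succ.operators))
                merged_away.add(succ.name)
            else:
                remaining.append(succ)
        job.successors = remaining
    # Jobs that were folded into a splitter no longer run on their own.
    return [j for j in jobs if j.name not in merged_away]


# Illustrative plan: one splitter feeding two branches (made-up operators).
splitter = Job("splitter", ["LOAD 'input'", "SPLIT"], is_splitter=True)
branch_1 = Job("branch_1", ["FILTER BY $1 == 1"])
branch_2 = Job("branch_2", ["FILTER BY $1 != 1"])
splitter.successors = [branch_1, branch_2]

plan = merge_splittees([splitter, branch_1, branch_2])
print([job.name for job in plan])  # prints ['splitter']
```

In this sketch, three jobs exist after split insertion but only one remains after merging, which matches the "more jobs first, fewer jobs in the end" behaviour described above.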