Posted to dev@pig.apache.org by Gang Luo <lg...@yahoo.com.cn> on 2010/07/25 18:27:28 UTC
split operator
Hi all,
According to the VLDB '09 paper, the split operator and all of its successor
operators reside in memory without any blocking in between. However, the source
code (version 0.7) shows that an MR job actually ends when it meets the split
operator, and multiple new MR jobs are created, each representing one branch.
This write-once-read-multiple-times method is different from the in-memory
method described in that paper. Did Pig change its strategy for split, or is
there still an in-memory version of split that I haven't discovered?
Thanks,
-Gang
Re: split operator
Posted by Daniel Dai <ji...@yahoo-inc.com>.
Hi, Gang,
Yes, that's what MultiQueryOptimizer addresses. After splitting, the script
is divided into smaller combinable pieces, and MultiQueryOptimizer will
combine as many splitters and splittees as possible into the same
map-reduce job. So after SplitInserter you might see more jobs, but you
will end up with fewer jobs. The algorithm for MultiQueryOptimizer is: for
every splitter, find as many combinable splittees as possible, and combine
them into the same map-reduce job. You can find more details at
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification
Daniel
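The merge step described above ("for every splitter, find the combinable splittees and combine them into the same job") can be sketched as a toy model in Python. This is not Pig's code: the names `Job`, `is_combinable`, and `merge_splittees` are hypothetical, and the combinability rule (a splittee is mergeable only when the splitter is its sole input) is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    inputs: list                                  # names of jobs this job reads from
    nested: list = field(default_factory=list)    # names of plans merged into this job


def is_combinable(splitter, splittee):
    # Assumed rule for this sketch only: a splittee can be merged when the
    # splitter's output is its sole input.
    return splittee.inputs == [splitter.name]


def merge_splittees(jobs, splitter_names):
    """For every splitter, fold its combinable splittees into it."""
    remaining = list(jobs)
    for s_name in splitter_names:
        splitter = next(j for j in remaining if j.name == s_name)
        merged = {j.name for j in remaining if is_combinable(splitter, j)}
        splitter.nested.extend(sorted(merged))
        # Merged splittees no longer run as separate jobs.
        remaining = [j for j in remaining if j.name not in merged]
        # Jobs that read a merged splittee now read the combined job instead.
        for j in remaining:
            j.inputs = [s_name if i in merged else i for i in j.inputs]
    return remaining


jobs = [
    Job("split", []),
    Job("branch_1", ["split"]),
    Job("branch_2", ["split"]),
    Job("other", ["split", "elsewhere"]),  # second input, so not combinable here
]
after = merge_splittees(jobs, ["split"])
print([j.name for j in after])   # job names left after merging
print(after[0].nested)           # plans nested inside the splitter job
```

In this toy run the four jobs produced by splitting collapse to two, which matches the "more jobs first, fewer jobs in the end" behaviour described in the message.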
Re: split operator
Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi Daniel,
This is a question from a while ago, but I have since come up with some more
thoughts on it. In a query as simple as this:
A = LOAD 'input';
B = FILTER A BY $1 == 1;
C = COGROUP A BY $0, B BY $0;
the optimizer will insert a split operator to reuse A. According to the source
code, a map-reduce job ends when it sees the split and writes the result to
A1 and A2, which are then used by two subsequent jobs to process B and C. In this
case, the first job does nothing meaningful except copy the source 'input' twice. Is
there some optimization applied here (like the MultiQueryOptimizer you mentioned
previously)? If so, how does it work?
Since I haven't looked at the MultiQueryOptimizer, it would be a great help if
you could briefly describe how it works. Thanks a lot.
-Gang
----- Original Message ----
From: Daniel Dai <ji...@yahoo-inc.com>
To: "pig-dev@hadoop.apache.org" <pi...@hadoop.apache.org>
Sent: 2010/7/26 (Mon) 4:58:49 PM
Subject: Re: split operator
Hi, Gang,
It is about multiquery optimization. In MRCompiler, we create a
map-reduce boundary for each split; later, in MultiQueryOptimizer, we
merge several splits into one map-reduce job. In this map-reduce job, we
nest several split plans.
Daniel
Re: split operator
Posted by Daniel Dai <ji...@yahoo-inc.com>.
Hi, Gang,
It is about multiquery optimization. In MRCompiler, we create a
map-reduce boundary for each split; later, in MultiQueryOptimizer, we
merge several splits into one map-reduce job. In this map-reduce job, we
nest several split plans.
Daniel
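To illustrate what "nesting several split plans" in one job could look like, here is a minimal Python sketch, assuming each branch is a per-tuple function applied inside a single pass over the input. It is a conceptual model only, not Pig's operator code; `run_nested_split` and the branch names are hypothetical, and the two branches mirror the A and B relations from the COGROUP query earlier in the thread.

```python
def run_nested_split(records, branches):
    """Read the input once and push every tuple through each nested plan."""
    outputs = {name: [] for name in branches}
    for rec in records:                      # single scan of the input
        for name, plan in branches.items():
            out = plan(rec)                  # a plan returns None to drop the tuple
            if out is not None:
                outputs[name].append(out)
    return outputs


# Toy input: field 0 is the grouping key ($0), field 1 is the filter column ($1).
records = [("k1", 1), ("k2", 0), ("k1", 2)]
branches = {
    "A": lambda r: r,                          # A passes every tuple through
    "B": lambda r: r if r[1] == 1 else None,   # B = FILTER A BY $1 == 1
}
result = run_nested_split(records, branches)
print(result["A"])
print(result["B"])
```

The point of the sketch is the contrast with the unmerged plan: the input is scanned once and both branches are fed in the same pass, instead of being copied to two temporary outputs that later jobs re-read.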
Re: split operator
Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi Daniel,
In section 4.3.1, the example and Figure 6 show this. The last paragraph of
section 5.1 says the split operator maintains a one-tuple buffer for each branch
and discusses how to synchronize multiple branches. I do think that is the
in-memory split.
Here is the paper: http://www.vldb.org/pvldb/2/vldb09-1074.pdf
-Gang
----- Original Message ----
From: Daniel Dai <ji...@yahoo-inc.com>
To: "pig-dev@hadoop.apache.org" <pi...@hadoop.apache.org>
Sent: 2010/7/26 (Mon) 2:09:25 PM
Subject: Re: split operator
Hi, Gang,
Which part of the paper are you talking about? We don't do an in-memory split. We
dump the split result to a temporary file and start a new map-reduce job. Split
does create a map-reduce boundary (though that is not entirely true: the multiquery
optimizer may combine some of these jobs).
Daniel
Re: split operator
Posted by Daniel Dai <ji...@yahoo-inc.com>.
Hi, Gang,
Which part of the paper are you talking about? We don't do an in-memory
split. We dump the split result to a temporary file and start a new
map-reduce job. Split does create a map-reduce boundary (though that is not
entirely true: the multiquery optimizer may combine some of these jobs).
Daniel