Posted to dev@pig.apache.org by "Zhang, Liyun" <li...@intel.com> on 2015/03/05 08:46:36 UTC
a question about MRCompiler#visitSort
Hi all:
I am currently working on PIG-4438 <https://issues.apache.org/jira/browse/PIG-4438> (Can not work when in "limit after sort" situation in spark mode).
testlimit.pig
a = load './testlimit.txt' as (x:int, y:chararray);
b = order a by x;
c = limit b 1;
store c into './testlimit.out';
explain c;
I have read the code of MRCompiler#visitSort. Can anyone tell me what org.apache.pig.impl.builtin.RandomSampleLoader and org.apache.pig.impl.builtin.FindQuantiles do, and why a sampling job is needed when using POSort?
I would appreciate it if someone could point me to a design document for the MRCompiler#visitSort implementation.
The following is the MapReduce plan:
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-11
Map Plan
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp694083214:org.apache.pig.impl.io.InterStorage) - scope-12
|
|---a: New For Each(false,false)[bag] - scope-7
| |
| Cast[int] - scope-2
| |
| |---Project[bytearray][0] - scope-1
| |
| Cast[chararray] - scope-5
| |
| |---Project[bytearray][1] - scope-4
|
|---a: Load(hdfs://zly1.sh.intel.com:8020/user/root/testlimit.txt:org.apache.pig.builtin.PigStorage) - scope-0--------
Global sort: false
----------------
MapReduce node scope-14
Map Plan
b: Local Rearrange[tuple]{tuple}(false) - scope-18
| |
| Constant(all) - scope-17
|
|---New For Each(false)[tuple] - scope-16
| |
| Project[int][0] - scope-15
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp694083214:org.apache.pig.impl.builtin.RandomSampleLoader('org.apache.pig.impl.io.InterStorage','100')) - scope-13--------
Reduce Plan
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp300898425:org.apache.pig.impl.io.InterStorage) - scope-27
|
|---New For Each(false)[tuple] - scope-26
| |
| POUserFunc(org.apache.pig.impl.builtin.FindQuantiles)[tuple] - scope-25
| |
| |---Project[tuple][*] - scope-24
|
|---New For Each(false,false)[tuple] - scope-23
| |
| Constant(2) - scope-22
| |
| Project[bag][1] - scope-20
|
|---Package(Packager)[tuple]{chararray} - scope-19--------
Global sort: false
Secondary sort: true
----------------
MapReduce node scope-29
Map Plan
b: Local Rearrange[tuple]{int}(false) - scope-30
| |
| Project[int][0] - scope-8
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp694083214:org.apache.pig.impl.io.InterStorage) - scope-28--------
Combine Plan
Local Rearrange[tuple]{int}(false) - scope-35
| |
| Project[int][0] - scope-8
|
|---Limit - scope-34
|
|---New For Each(true)[tuple] - scope-33
| |
| Project[bag][1] - scope-32
|
|---Package(LitePackager)[tuple]{int} - scope-31--------
Reduce Plan
c: Store(hdfs://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp538566422:org.apache.pig.impl.io.InterStorage) - scope-10
|
|---Limit - scope-39
|
|---New For Each(true)[tuple] - scope-38
| |
| Project[bag][1] - scope-37
|
|---Package(LitePackager)[tuple]{int} - scope-36--------
Global sort: true
Quantile file: hdfs://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp300898425
----------------
MapReduce node scope-40
Map Plan
b: Local Rearrange[tuple]{int}(false) - scope-42
| |
| Project[int][0] - scope-43
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp538566422:org.apache.pig.impl.io.InterStorage) - scope-41--------
Reduce Plan
c: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-49
|
|---Limit - scope-48
|
|---New For Each(true)[bag] - scope-47
| |
| Project[tuple][1] - scope-46
|
|---Package(LitePackager)[tuple]{int} - scope-45--------
Global sort: false
----------------
Best regards
Zhang,Liyun
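One way to read the plan above: Limit appears in the combine plan (scope-34) and the reduce plan (scope-39) of the global-sort job, and again in the reduce plan of the last job (scope-48). The sketch below illustrates that pattern in plain Python, not Pig code. The function name limit_after_sort is hypothetical, and it assumes the last job merges all trimmed partitions in a single place, which the plan does not state explicitly.

```python
import heapq
from itertools import islice


def limit_after_sort(sorted_partitions, n):
    """Return the first n keys overall from partitions that are each sorted.

    Hypothetical illustration only; this is not how Pig implements it.
    """
    # Step 1: each partition keeps at most n rows. No partition can
    # contribute more than n rows to the final answer, so this is safe.
    trimmed = [part[:n] for part in sorted_partitions]

    # Step 2: a final step merges the trimmed, still-sorted partitions
    # and applies the limit once more to get exactly n rows overall.
    return list(islice(heapq.merge(*trimmed), n))


if __name__ == "__main__":
    partitions = [[1, 4, 7], [2, 5], [3, 6]]
    print(limit_after_sort(partitions, 4))
```

Trimming per partition first keeps the data that reaches the final step small: at most n rows per partition instead of the whole sorted output.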
Re: a question about MRCompiler#visitSort
Posted by Mohit Sabharwal <mo...@cloudera.com>.
Hi Kelly,
I haven't looked at the code in detail, but I assume it's using some
version of 'distributed sample sort' (
http://en.wikipedia.org/wiki/Samplesort) and that the purpose of the sampling
job is to find the distribution of keys, so that each bucket can be of similar
size (i.e. buckets will cover key ranges of different widths, like a-b, c-m,
n-z, etc., but hold a similar number of items) and reducers get a similar
amount of work to do.
thanks,
Mohit
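The sample-sort idea described in this reply can be sketched as follows. This is a minimal illustration in plain Python, assuming the general samplesort scheme from the linked Wikipedia article; it is not taken from the Pig source. The names find_quantiles, partition_for and sample_sort are hypothetical (find_quantiles only borrows its name from Pig's FindQuantiles, it does not reproduce it).

```python
import bisect
import random


def find_quantiles(sample, num_partitions):
    """Pick num_partitions - 1 split points from a sample of the keys."""
    ordered = sorted(sample)
    if not ordered or num_partitions < 2:
        return []
    # Evenly spaced positions in the sorted sample approximate the
    # quantiles of the full key distribution.
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Map a key to a bucket index: keys below the first split go to 0."""
    return bisect.bisect_right(split_points, key)


def sample_sort(keys, num_partitions, sample_size, seed=0):
    """Range-partition keys using sampled split points, then sort each bucket."""
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    splits = find_quantiles(sample, num_partitions)

    buckets = [[] for _ in range(num_partitions)]
    for key in keys:
        buckets[partition_for(key, splits)].append(key)

    # Each bucket plays the role of one reducer sorting its own key range.
    for bucket in buckets:
        bucket.sort()
    return buckets


if __name__ == "__main__":
    data = [5, 3, 9, 1, 7, 3, 8, 2, 6, 4]
    for i, bucket in enumerate(sample_sort(data, 3, 100)):
        print(i, bucket)
```

Because the buckets cover disjoint, increasing key ranges, concatenating them in bucket order gives a globally sorted result, and choosing split points from a sample keeps the buckets roughly equal in size even when the keys are skewed.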