Posted to user@pig.apache.org by Desai Dharmendra <de...@gmail.com> on 2009/11/20 21:03:18 UTC

Pig Relation Sorting, labeling, partitioning

I am using Pig and this is what I am trying to do:

1) Sort a relation A into B by a field x. The smallest value of x is first.
Just use ORDER BY.

2) Label each tuple in B with a number denoting its order in the sorted
relation. So the first tuple would be labeled with a 1, the second tuple
with a 2, the third with a 3 and so on. Not certain how to do this.

3) Derive a relation C where each row is a bag of tuples. The first row
contains the first n1 tuples from relation B, the second row contains the
tuples from B labeled (n1 + 1) to n2, the third row contains the tuples
from B labeled (n2 + 1) to n3 and so on to n100. This step is simple (just
use filter) once we've labeled each tuple in B with a number.

The question: how do I do step 2)?

thanks

Re: Pig Relation Sorting, labeling, partitioning

Posted by Thejas Nair <te...@yahoo-inc.com>.

If you are OK with dividing A approximately into 100 parts in sorted order,
you could do:
B = order A by x parallel 100;
That will generate 100 part files, with the data distributed (somewhat)
evenly across them. Pig samples the input data first and builds a histogram
so that it can spread the data evenly across reducers.
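
The sampling idea can be illustrated with a small plain-Java sketch. This is not Pig's actual implementation; the class and method names (RangePartitionSketch, cutPoints, partitionOf) are hypothetical, and keys are assumed to be ints. It picks cut points from a sorted sample and sends each key to the range it falls in, so ranges that are equal in the sample should be roughly equal in the full data.

```java
import java.util.Arrays;

// Hypothetical sketch of sample-based range partitioning; not Pig's code.
class RangePartitionSketch {
    // Choose parts-1 cut points so each range covers about the same
    // share of the sample.
    static int[] cutPoints(int[] sample, int parts) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] cuts = new int[parts - 1];
        for (int i = 1; i < parts; i++) {
            cuts[i - 1] = sorted[(int) ((long) i * sorted.length / parts)];
        }
        return cuts;
    }

    // Partition index for a key: the number of cut points <= key.
    static int partitionOf(int key, int[] cuts) {
        int p = 0;
        while (p < cuts.length && key >= cuts[p]) {
            p++;
        }
        return p;
    }

    public static void main(String[] args) {
        int[] sample = {5, 1, 9, 3, 7, 2, 8, 4, 6, 0};
        int[] cuts = cutPoints(sample, 4);
        System.out.println(Arrays.toString(cuts));
        System.out.println(partitionOf(6, cuts));
    }
}
```

Because the cut points come from a sample, the resulting part sizes are only approximately equal, which matches the "(somewhat) evenly" caveat above.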

-Thejas



On 11/20/09 12:53 PM, "Alan Gates" <ga...@yahoo-inc.com> wrote:

> Item 2 is not currently easy to do in Pig in a parallel fashion.  This
> is because you don't know how many records each map task is going to
> get, so you don't know which number to start on in map 2 and greater.
> You could write a complex two-pass algorithm that would first count the
> number of tuples and then do the splits again, but it would involve
> implementing your own Slicer and LoadFunc.
> 
> If your data set is small enough that you can do the labeling on a
> single machine you can do something like:
> 
> A = load 'data';
> B = group A all parallel 1;
> C = foreach B {
>         D = order A by sortkey;
>         generate flatten(lablerUDF(D));
> }
> 
> where lablerUDF is a UDF you write that walks through the bag and
> appends the position to each tuple.  If you are doing this on trunk I
> would strongly suggest using the new accumulator interface for this UDF,
> as it will make it much more efficient.  But again, this depends on
> pulling all of your data onto one machine, which defeats the purpose
> of parallel systems like Hadoop.
> 
> Alan.

Re: Pig Relation Sorting, labeling, partitioning

Posted by Alan Gates <ga...@yahoo-inc.com>.

Item 2 is not currently easy to do in Pig in a parallel fashion.  This
is because you don't know how many records each map task is going to
get, so you don't know which number to start on in map 2 and greater.
You could write a complex two-pass algorithm that would first count the
number of tuples and then do the splits again, but it would involve
implementing your own Slicer and LoadFunc.
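
The arithmetic behind the two-pass idea can be sketched in plain Java, independent of Pig. This is only an illustration under assumptions: the names TwoPassLabeler, startingLabels and labelSplit are hypothetical, the splits are assumed to already be in sorted order, and the Slicer/LoadFunc work needed to make the second pass see the same splits is not shown. Pass 1 yields a record count per split; a running sum of those counts gives each split its starting label; pass 2 labels each split independently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the two-pass labeling arithmetic; not Pig code.
class TwoPassLabeler {
    // counts[i] is the number of records in split i (result of pass 1).
    // Returns the 1-based label of the first record in each split.
    static long[] startingLabels(long[] counts) {
        long[] starts = new long[counts.length];
        long next = 1;
        for (int i = 0; i < counts.length; i++) {
            starts[i] = next;
            next += counts[i];
        }
        return starts;
    }

    // Pass 2: label the records of one split, beginning at its start label.
    static List<String> labelSplit(List<String> records, long start) {
        List<String> out = new ArrayList<>(records.size());
        long label = start;
        for (String record : records) {
            out.add(label + "\t" + record);
            label++;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] starts = startingLabels(new long[]{3, 0, 2});
        // Each split can now be labeled without looking at the others.
        System.out.println(labelSplit(Arrays.asList("p", "q"), starts[2]));
    }
}
```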

If your data set is small enough that you can do the labeling on a  
single machine you can do something like:

A = load 'data';
B = group A all parallel 1;
C = foreach B {
        D = order A by sortkey;
        generate flatten(lablerUDF(D));
}

where lablerUDF is a UDF you write that walks through the bag and
appends the position to each tuple.  If you are doing this on trunk I
would strongly suggest using the new accumulator interface for this UDF,
as it will make it much more efficient.  But again, this depends on
pulling all of your data onto one machine, which defeats the purpose
of parallel systems like Hadoop.
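
The core of such a labeling UDF might look like the following plain-Java sketch. It deliberately has no Pig dependency: a bag is modeled as a List of tuples and a tuple as a List<Object>, and the names LabelerSketch and label are hypothetical. A real UDF would extend Pig's EvalFunc and work with DataBag and Tuple instead, and that wiring is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the body of lablerUDF; not an actual Pig UDF.
class LabelerSketch {
    // Walks a bag that is already sorted and appends a 1-based position
    // to a copy of each tuple, leaving the input untouched.
    static List<List<Object>> label(List<List<Object>> sortedBag) {
        List<List<Object>> out = new ArrayList<>(sortedBag.size());
        long position = 1;
        for (List<Object> tuple : sortedBag) {
            List<Object> labeled = new ArrayList<>(tuple);
            labeled.add(position++);
            out.add(labeled);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> bag = Arrays.asList(
            Arrays.<Object>asList("a", 10),
            Arrays.<Object>asList("b", 20));
        System.out.println(label(bag));
    }
}
```

The only state the loop needs is the running position, which is why processing the bag in chunks (as the accumulator interface allows) should work without holding the whole bag in memory.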

Alan.