Posted to user@pig.apache.org by Vijay Rao <ra...@gmail.com> on 2010/05/10 21:16:40 UTC

Question about SORT

Hello,

I am new to Hadoop and Pig and have been reading whatever I could lay my
hands on. If I need to sort a dataset using Pig, is the ORDER statement alone
sufficient?

For example, here is what I came up with to sort a dataset of users by their
login count:

records = LOAD 'input/sample.txt' AS (username:chararray);

grpd = GROUP records BY username;

cntd = FOREACH grpd GENERATE
          group, COUNT(records) AS cnt;

srtd = ORDER cntd BY cnt;

STORE srtd INTO 'output';

Is this sufficient to sort a dataset, or does something else need to be
done? I read about the partition/combine steps for sorting when reading about
MapReduce, hence my confusion.

Any help is greatly appreciated.

Thanks
VJ

Re: Question about SORT

Posted by Vijay Rao <ra...@gmail.com>.

Awesome, thanks. Where can I find more information about how Hadoop
assembles the intermediate output for the final reduce, given that the data
is local to the slaves? Also, I know the input data is usually stored as
3 copies (the default). Is the output also stored 3 times?

Thanks
VJ

On Mon, May 10, 2010 at 12:35 PM, Thejas Nair <te...@yahoo-inc.com> wrote:

> Yes, "order" in Pig Latin is sufficient: it will sort the file globally
> (not just within each part file).
>
> An "order" statement results in two MR jobs. The first takes a sample of
> the order-by keys to figure out their distribution and decide how to
> partition the data across reducers; the second MR job does the sorting.
>
> -Thejas

Re: Question about SORT

Posted by Thejas Nair <te...@yahoo-inc.com>.

Yes, "order" in Pig Latin is sufficient: it will sort the file globally
(not just within each part file).

An "order" statement results in two MR jobs. The first takes a sample of
the order-by keys to figure out their distribution and decide how to
partition the data across reducers; the second MR job does the sorting.
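
To make the two-pass idea concrete, here is a rough Python sketch of sample-based range partitioning. It is only an illustration of the general technique, not Pig's actual implementation: the function names, the sample size, and the sample data are all made up for the example, and real Pig also handles details (such as heavily skewed keys) that this ignores.

```python
import bisect
import random


def pick_boundaries(keys, num_reducers, sample_size=100, seed=0):
    """Pass 1 (hypothetical): sample the sort keys and pick split points."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    # Evenly spaced quantiles of the sample become the partition boundaries.
    return [sample[(i * len(sample)) // num_reducers]
            for i in range(1, num_reducers)]


def range_partition_sort(records, key, num_reducers):
    """Pass 2 (hypothetical): route records by key range, then sort locally."""
    boundaries = pick_boundaries([key(r) for r in records], num_reducers)
    parts = [[] for _ in range(num_reducers)]
    for r in records:
        # Every key in partition i is <= every key in partition i + 1.
        parts[bisect.bisect_right(boundaries, key(r))].append(r)
    # Each "reducer" sorts only its own partition.
    return [sorted(p, key=key) for p in parts]


# Made-up (username, login count) pairs standing in for the cntd relation.
counts = [("alice", 7), ("bob", 2), ("carol", 9),
          ("dave", 4), ("erin", 1), ("frank", 5)]
parts = range_partition_sort(counts, key=lambda r: r[1], num_reducers=3)

# Reading the part files in order yields a globally sorted result.
merged = [r for p in parts for r in p]
print(merged)
```

Because the boundaries split the key space into ordered ranges, concatenating the per-reducer outputs in partition order is already globally sorted, which is why no extra merge step is needed after the second job.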

-Thejas
