Posted to user@pig.apache.org by Vijay Rao <ra...@gmail.com> on 2010/05/10 21:16:40 UTC
Question about SORT
Hello,
I am new to Hadoop and Pig and have been reading whatever I could lay my
hands on. If I need to sort a dataset using Pig, is the ORDER statement alone
sufficient?
For example, here is what I came up with to sort a dataset of users by their
login count:
records = LOAD 'input/sample.txt' AS (username:chararray);
grpd = GROUP records BY username;
cntd = FOREACH grpd GENERATE group, COUNT(records) AS cnt;
srtd = ORDER cntd BY cnt;
STORE srtd INTO 'output';
Is this sufficient to sort a dataset, or does something else need to be
done? I read about partition/combine for sorting when I was reading about
MapReduce, and hence was confused.
Any help is greatly appreciated.
Thanks
VJ
Re: Question about SORT
Posted by Vijay Rao <ra...@gmail.com>.
Awesome, thanks. Where can I find more information about how Hadoop
assembles the intermediate output to produce the final reduce output, given
that the data is local to the slaves? Also, I know the input data is usually
stored with 3 copies (the default). Is the output also stored 3 times?
Thanks
VJ
On Mon, May 10, 2010 at 12:35 PM, Thejas Nair <te...@yahoo-inc.com> wrote:
> Yes, "order" in pig-latin is sufficient - it will sort the file globally
> (not just within each part file).
>
> An "order" statement results in two MR jobs, the first one takes sample of
> the order-by keys to figure out the distribution and decide how to
> partition the data across reducers in the 2nd MR job which does the
> sorting.
>
> -Thejas
Re: Question about SORT
Posted by Thejas Nair <te...@yahoo-inc.com>.
Yes, "order" in Pig Latin is sufficient - it will sort the file globally
(not just within each part file).
An "order" statement results in two MR jobs: the first takes a sample of
the order-by keys to figure out their distribution and decide how to
partition the data across reducers, and the second does the sorting.
-Thejas
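The two-job approach described above can be illustrated with a small, single-process Python sketch. This is not Pig's actual implementation; the function names (choose_boundaries, total_order_sort), the sample size, and the number of reducers are all assumptions made for illustration. It only shows why sampling the keys and range-partitioning them yields a globally sorted result when the sorted part files are read in order.

```python
import bisect
import random


def choose_boundaries(keys, num_reducers, sample_size=100, seed=0):
    """Job 1 (sketch): sample the sort keys and pick num_reducers - 1 split points."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Evenly spaced positions in the sorted sample approximate the key distribution.
    return [sample[(i * len(sample)) // num_reducers] for i in range(1, num_reducers)]


def total_order_sort(records, key, num_reducers=3):
    """Job 2 (sketch): range-partition by the boundaries, then sort each partition.

    Returns one sorted list per hypothetical reducer ("part file").
    """
    if not records:
        return [[] for _ in range(num_reducers)]
    boundaries = choose_boundaries([key(r) for r in records], num_reducers)
    partitions = [[] for _ in range(num_reducers)]
    for record in records:
        # Keys below boundaries[0] go to partition 0, the next range to 1, and so on.
        partitions[bisect.bisect_right(boundaries, key(record))].append(record)
    # Each "reducer" sorts only its own partition.
    return [sorted(part, key=key) for part in partitions]


# Hypothetical (username, cnt) tuples, standing in for the cntd relation.
counts = [("alice", 5), ("bob", 2), ("carol", 9), ("dave", 2), ("erin", 7), ("frank", 1)]
parts = total_order_sort(counts, key=lambda r: r[1], num_reducers=3)
flattened = [record for part in parts for record in part]
```

Because every key in one partition is smaller than every key in the next, concatenating the parts in order gives the same ordering as sorting the whole dataset at once.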