Posted to user@pig.apache.org by sonia gehlot <so...@gmail.com> on 2012/07/03 01:59:18 UTC

One file with sorted results.

Hi Guys,

I have a use case where I need to generate a data feed using a Pig script.
The data feed is about 12 GB in total.

I want the Pig script to generate one file, and the data in that file should
be sorted as well. I know I can run it with one reducer, but because the
dataset is big and the script has a lot of joins, it takes forever to finish.

What are the other options for getting one sorted file with better performance?

Thanks in advance,

Sonia

RE: One file with sorted results.

Posted by "Duckworth, Will" <wd...@comscore.com>.
Have you tried breaking it into 2 jobs?  The first would do all the pre-sort work, followed by a final job that does only the sort with a single reducer.
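
A minimal sketch of that two-job split, assuming two separate Pig scripts. The input paths, the intermediate path 'joined_tmp', the field names and types, and the parallel value are all hypothetical placeholders, not details from this thread:

```pig
-- Script 1 (hypothetical): do the expensive joins with many reducers
-- and store the intermediate result.
W = load 'w_input' as (a:chararray, v:chararray);
X = load 'x_input' as (b:chararray, u:chararray);
Y = join W by a, X by b parallel 100;
store Y into 'joined_tmp';

-- Script 2 (hypothetical), run after script 1 has finished:
-- only the sort goes through the single reducer.
Y = load 'joined_tmp' as (a:chararray, v:chararray, b:chararray, u:chararray);
Z = order Y by a parallel 1;
store Z into 'onefile';
```

Note that 'onefile' ends up as an output directory holding a single part file, not a bare file.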




Re: One file with sorted results.

Posted by sonia gehlot <so...@gmail.com>.
Thanks Alan,

I will try this.

-Sonia

On Tue, Jul 3, 2012 at 7:56 AM, Alan Gates <ga...@hortonworks.com> wrote:

> You can set different levels of parallelism in different parts of your
> script by attaching a parallel clause to individual operations.  For example:
>
> Y = join W by a, X by b parallel 100;
> Z = order Y by a parallel 1;
> store Z into 'onefile';
>
> If your output is big, I would suggest also trying the order in parallel
> and then concatenating the part files with HDFS's cat command in a separate
> pass, to see if that is faster.  The data gets written twice, but no single
> reducer is flooded with all of it.
>
> Alan.
>

Re: One file with sorted results.

Posted by Alan Gates <ga...@hortonworks.com>.
You can set different levels of parallelism in different parts of your script by attaching a parallel clause to individual operations.  For example:

Y = join W by a, X by b parallel 100;
Z = order Y by a parallel 1;
store Z into 'onefile';

If your output is big, I would suggest also trying the order in parallel and then concatenating the part files with HDFS's cat command in a separate pass, to see if that is faster.  The data gets written twice, but no single reducer is flooded with all of it.
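
A rough sketch of that parallel-order variant, under some assumptions: the input paths, field names, the output path 'sorted_parts', and the parallel value of 20 are placeholders chosen for illustration. It relies on Pig's order producing a total order across part files, so that concatenating them in file-name order keeps the data sorted.

```pig
-- Hypothetical inputs; replace with the real relations and schemas.
W = load 'w_input' as (a:chararray, v:chararray);
X = load 'x_input' as (b:chararray, u:chararray);

Y = join W by a, X by b parallel 100;
-- Sort with several reducers instead of one (20 is an arbitrary value).
Z = order Y by a parallel 20;
store Z into 'sorted_parts';

-- Separate pass, run from a shell once the script has finished:
--   hadoop fs -cat 'sorted_parts/part-*' | hadoop fs -put - onefile
-- or, to land the merged file on the local filesystem instead:
--   hadoop fs -getmerge sorted_parts onefile
```

Whether this beats a single-reducer sort depends on the cluster and the data, so it is worth timing both.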

Alan.
