Posted to user@pig.apache.org by Stanley Xu <we...@gmail.com> on 2012/10/31 09:20:21 UTC

Is it possible to use Pig to do an optimized secondary sort?

Dear buddies,

We are trying to write some UDFs to do machine learning work. As a simple
experiment, we calculated the AUC through a UDF like the code in this gist:

https://gist.github.com/3985764

The map-reduce job itself takes only a few minutes, but then waits for
hours in cleanup.

My guess is that the sort inside the foreach spills a lot of data to the
local file system, and that cleaning up those spill files takes a long time.

In a Java map-reduce program, we could handle this with a secondary sort: we
make model + ctr the key, so that each model's ctr values arrive sorted, and
group by only the model name part, so the sort is already done by the time
shuffling finishes.

Is it possible to do that kind of optimization in Pig as well?
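
The composite-key idea described above can be sketched in plain Java. This is
only a simulation under assumptions, not Hadoop API code and not the code from
the gist: the class and method names (SecondarySortSketch, ModelCtr,
sortThenGroup) and the sample values are made up for illustration. Records are
sorted on the full key (model, then ctr) and afterwards grouped on the model
part only, so each group's ctr values are already in order when they are
consumed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondarySortSketch {

    /** Hypothetical composite key: the model name plus the ctr to sort on. */
    public static final class ModelCtr {
        final String model;
        final double ctr;

        public ModelCtr(String model, double ctr) {
            this.model = model;
            this.ctr = ctr;
        }
    }

    // Sort order over the full composite key: model first, then ctr ascending.
    static final Comparator<ModelCtr> SORT_ORDER =
            Comparator.comparing((ModelCtr k) -> k.model)
                      .thenComparingDouble(k -> k.ctr);

    /**
     * Sorts on the composite key, then groups on the model part only.
     * Because the input to the grouping step is already sorted, each
     * per-model list comes out ordered by ctr without a separate sort.
     */
    public static Map<String, List<Double>> sortThenGroup(List<ModelCtr> keys) {
        List<ModelCtr> sorted = new ArrayList<>(keys);
        sorted.sort(SORT_ORDER);

        // LinkedHashMap keeps the groups in the order they are first seen.
        Map<String, List<Double>> groups = new LinkedHashMap<>();
        for (ModelCtr k : sorted) {
            groups.computeIfAbsent(k.model, m -> new ArrayList<>()).add(k.ctr);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<ModelCtr> input = List.of(
                new ModelCtr("m2", 0.3),
                new ModelCtr("m1", 0.2),
                new ModelCtr("m2", 0.1),
                new ModelCtr("m1", 0.05));

        // Prints {m1=[0.05, 0.2], m2=[0.1, 0.3]}
        System.out.println(sortThenGroup(input));
    }
}
```

In a real Hadoop job the same split is usually expressed by sorting on the
whole key while partitioning and grouping on the model part only, so the
framework's shuffle does the sorting instead of an in-memory list.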

Re: Is it possible to use Pig to do an optimized secondary sort?

Posted by Stanley Xu <we...@gmail.com>.
I posted the code as a gist link in my original mail. It is a simplified
version of the real code. Will that script trigger a secondary sort
automatically?

If so, are there any other places I should check to understand why the
cleanup of the map-reduce job takes so long?

Thanks.

Best wishes,
Stanley Xu



On Wed, Oct 31, 2012 at 11:21 PM, Alan Gates <ga...@hortonworks.com> wrote:

> Seeing your Pig Latin script will help us determine whether this will work
> in your case.  But in general Pig uses secondary sort when you do an order
> by in a nested foreach.  So if you are grouping you could order within that
> group and then pass it to your UDF.
>
> Alan.

Re: Is it possible to use Pig to do an optimized secondary sort?

Posted by Russell Jurney <ru...@gmail.com>.
I'd love to see an example of a secondary sort in a nested foreach.
Does anyone have one?

Russell Jurney http://datasyndrome.com


Re: Is it possible to use Pig to do an optimized secondary sort?

Posted by Alan Gates <ga...@hortonworks.com>.
Seeing your Pig Latin script will help us determine whether this will work in your case.  But in general Pig uses secondary sort when you do an order by in a nested foreach.  So if you are grouping you could order within that group and then pass it to your UDF.

Alan.
