Posted to user@pig.apache.org by Stanley Xu <we...@gmail.com> on 2012/11/01 06:03:15 UTC

Re: Is that possible to use Pig to do an optimized secondary sort.

I posted the code as a gist link in my earlier mail. I simplified the real
code to keep it short. Will that trigger a secondary sort automatically?

If so, is there anywhere else I should check to understand why the
cleanup of the MapReduce job takes such a long time?
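
To make the question concrete, here is a minimal sketch of the pattern as I
understand it from the description below: group, order inside the nested
foreach, then pass the ordered bag to the UDF. The relation names, field
names, paths, and the com.example.AUC class are placeholders for
illustration, not the actual code from the gist.

```pig
-- Hypothetical names throughout; com.example.AUC stands in for the real UDF.
DEFINE AUC com.example.AUC();

-- Assumed input layout: one row per impression, with model name, ctr, and label.
raw = LOAD 'input' AS (model:chararray, ctr:double, clicked:int);

-- Group by model only, so each reducer call sees one model's rows.
grouped = GROUP raw BY model;

result = FOREACH grouped {
    -- ORDER inside the nested FOREACH is the construct that should let Pig
    -- push the sort into the shuffle instead of sorting the bag in memory.
    sorted = ORDER raw BY ctr DESC;
    GENERATE group AS model, AUC(sorted);
};

STORE result INTO 'output';
```

If I read it correctly, running EXPLAIN on the result alias should show
whether the plan actually uses a secondary key for the nested ORDER.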

Thanks.

Best wishes,
Stanley Xu



On Wed, Oct 31, 2012 at 11:21 PM, Alan Gates <ga...@hortonworks.com> wrote:

> Seeing your Pig Latin script will help us determine whether this will work
> in your case.  But in general Pig uses secondary sort when you do an order
> by in a nested foreach.  So if you are grouping you could order within that
> group and then pass it to your UDF.
>
> Alan.
>
> On Oct 31, 2012, at 1:20 AM, Stanley Xu wrote:
>
> > Dear buddies,
> >
> > We are trying to write some UDFs to do machine learning work. We did a
> > simple experiment to calculate the AUC through a UDF, as in the code in
> > the following gist:
> >
> > https://gist.github.com/3985764
> >
> > The map-reduce job itself only takes a few minutes, but then it waits
> > for hours doing the cleanup.
> >
> > I guess the reason is that the sort inside the foreach spills a lot of
> > data to the local fs, and cleaning that up takes a long time.
> >
> > In a Java map-reduce program, we could do this as a secondary sort. We
> > make model + ctr the key, so that each model's ctr values are sorted,
> > and group by only the model name part, so the data is already sorted
> > after shuffling.
> >
> > I am wondering if we could do that kind of optimization in Pig as well?
>
>