Posted to user@pig.apache.org by James Newhaven <ja...@gmail.com> on 2012/09/03 18:32:42 UTC

UDF Performance Problem

Hi,

I'd appreciate it if anyone has ideas or pointers about a Pig script and
custom UDF I have written. I've found it runs too slowly on my Hadoop
cluster to be useful.

I have two million records inside a single 600MB file.

For each record, I need to query a web service to retrieve additional data
for that record.

The web service supports batch requests of up to 50 records.

I split the two million records into bags of 50 items (using the DataFu
BagSplit UDF) and then pass each bag to a custom UDF I have written, which
queries the web service for the whole batch.

I noticed that when my script reaches my UDF, only one reducer is used and
the job takes forever to complete (in fact, it has never finished; I
terminate it after a few hours).

My script looks like this:

REGISTER 'datafu.jar';  -- the DataFu jar providing BagSplit
DEFINE BagSplit datafu.pig.bags.BagSplit();

A = LOAD 'records.txt' USING PigStorage('\t') AS (recordId:int);
B = GROUP A ALL;
SPLITS = FOREACH B GENERATE FLATTEN(BagSplit(50, A));
COMPLETE_RECORDS = FOREACH SPLITS GENERATE FLATTEN(MyCustomUDF($0));
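
For reference, here is a minimal sketch of what a batch UDF along these
lines might look like; the actual MyCustomUDF isn't shown in the thread,
and batchLookup below is only a stand-in for the real web service call:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Illustrative sketch only; not the actual MyCustomUDF from this thread.
public class MyCustomUDF extends EvalFunc<DataBag> {

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.get(0) == null) {
            return null;
        }
        DataBag batch = (DataBag) input.get(0);       // bag of up to 50 records
        List<Integer> ids = new ArrayList<Integer>();
        for (Tuple t : batch) {
            ids.add((Integer) t.get(0));              // recordId
        }
        // One web service request for the whole batch, not one per record.
        List<String> extras = batchLookup(ids);
        DataBag out = BagFactory.getInstance().newDefaultBag();
        for (int i = 0; i < ids.size(); i++) {
            Tuple t = TupleFactory.getInstance().newTuple(2);
            t.set(0, ids.get(i));
            t.set(1, extras.get(i));
            out.add(t);
        }
        return out;
    }

    // Placeholder for the real batched web service call.
    private List<String> batchLookup(List<Integer> ids) throws IOException {
        List<String> results = new ArrayList<String>(ids.size());
        for (Integer id : ids) {
            results.add("extra-data-for-" + id);      // stubbed response
        }
        return results;
    }
}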

Thanks,

James

Re: UDF Performance Problem

Posted by James Newhaven <ja...@gmail.com>.
Thanks, Dmitriy. All sorted now.

James

Re: UDF Performance Problem

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
That's because you used "group all", which puts everything into a single
group, and a single group can by definition only go to one reducer.

What if instead you group into some large-enough number of buckets?

A = LOAD 'records.txt' USING PigStorage('\t') AS (recordId:int);
A_PRIME = FOREACH A GENERATE *, ROUND(RANDOM() * 1000) AS bucket;
B = GROUP A_PRIME BY bucket PARALLEL $parallelism;
SPLITS = FOREACH B GENERATE FLATTEN(BagSplit(50, A_PRIME));
COMPLETE_RECORDS = FOREACH SPLITS GENERATE FLATTEN(MyCustomUDF($0));
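
(Note: $parallelism is a Pig parameter you would supply at launch, e.g.
"pig -param parallelism=50 script.pig". With two million records split
into roughly 40,000 bags of 50, spread across 1,000 random buckets, the
web service calls are distributed over many reducers instead of all
landing on one.)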

D

