Posted to user@pig.apache.org by Norbert Burger <no...@gmail.com> on 2012/05/29 19:20:15 UTC

LOAD function vs. UDF eval

We're analyzing sessions using Pig and HBase.  The session data is
currently stored in a single HBase table, where the rowkey is a
sessionid-eventid combination (a tall table).  I'm trying to optimize the
"extract all events for a given session" step of our workflow.

This could be a simple JOIN.  But that seems inefficient even for a
replicated join, since the vast majority of the replicated-join mappers
would produce no output.  It also seems inefficient because all of a
session's events are stored contiguously in the table, and a generic JOIN
would ignore this locality.
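
For concreteness, the JOIN version I have in mind looks roughly like this
(table name, column names, and the rowkey delimiter are invented; this is a
sketch, not tested):

```pig
-- Load the tall HBase table; rowkey is a sessionid-eventid combo
raw    = LOAD 'hbase://session_events'
         USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:event', '-loadKey')
         AS (rowkey:chararray, event:chararray);

-- Derive the sessionid from the composite rowkey (assuming a '-' delimiter)
events = FOREACH raw GENERATE
         SUBSTRING(rowkey, 0, INDEXOF(rowkey, '-', 0)) AS sessionid, event;

wanted = LOAD 'sessionids' AS (sessionid:chararray);

-- Replicated join: every map task still scans the whole table even though
-- only a small fraction of rows belong to the sessions we want
joined = JOIN events BY sessionid, wanted BY sessionid USING 'replicated';
```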

It might be cleaner to model this as a LOAD function, which would use
the HBase client API to issue several scans in parallel.  But to handle our
use case, I'd have to be able to pass tens of thousands of sessionids to
the loader (perhaps via UDFContext?)

So I'm back to writing an eval UDF to handle this, which seems wonky.

But this made me think -- is there any performance benefit to modeling
these "load" style steps as LOAD functions vs. generic UDFs?  In both
cases, they'd return a bag per row.

Norbert

Re: LOAD function vs. UDF eval

Posted by Norbert Burger <no...@gmail.com>.
Thanks, Raghu.  Maybe another benefit of the UDF route is that it could
support the Accumulator interface.

Since both approaches would use the HBase client API directly, there's no
Pig-specific benefit to using a loader, right?

Norbert

Re: LOAD function vs. UDF eval

Posted by Raghu Angadi <an...@gmail.com>.
I would still use a UDF; it is a lot more flexible.

Passing a large number of ids to the loader is part of the problem.

Your UDF would take a bag of ids and return bag{(session, events:bag{})}.

You can pass the bag of ids in various ways:
   - load the ids as a relation, then GROUP ALL to put all of them in a
     single bag
        - in fact, you can use RANDOM() to batch the ids into small bags
          (this avoids buffering all of the output in the UDF)
   - or put them in a JSON file and load that

etc...
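A rough sketch of what I mean (the UDF name FetchSessionEvents and the
batch count are hypothetical; untested):

```pig
-- Hypothetical eval UDF: takes a bag of sessionids, scans HBase, returns
-- a bag of (session, events:bag{}) tuples
DEFINE FetchSessionEvents com.example.FetchSessionEvents();

ids = LOAD 'sessionids' AS (sessionid:chararray);

-- Option 1: one bag with all the ids (single UDF call, buffers everything)
all_ids = GROUP ids ALL;
events  = FOREACH all_ids GENERATE FLATTEN(FetchSessionEvents(ids));

-- Option 2: batch the ids into ~100 random groups, so no single UDF call
-- has to buffer all of the output
batched = GROUP ids BY (int) (RANDOM() * 100);
events2 = FOREACH batched GENERATE FLATTEN(FetchSessionEvents(ids));
```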
