Posted to user@pig.apache.org by Sandeep Mellacheruvu <sa...@insideview.com> on 2015/07/29 15:51:59 UTC

Need help on writing a UDF

Hi,

I need to write a Pig UDF that takes a string and a personId as its input tuple. The personId is the key used to query HBase from within the UDF. I create the HBase connection when the UDF class loads.

The problem is that PigStorage treats each row as a tuple, so I have to query each personId independently and cannot do a bulk query against HBase. For example, I want to query 1000 personIds at a time. The goal is to cut round-trip overhead; I have already built a prototype of this and seen a drastic improvement.
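To make the batching idea concrete, here is a minimal sketch in plain Java of buffering ids and resolving them in groups with one bulk call per group. The class name BatchedLookup and the bulkFetch function are hypothetical and are not part of Pig or HBase; in a real UDF, bulkFetch would presumably wrap a batched HBase get, which is an assumption here. It does not show how to wire this into a Pig loader or UDF.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Pig or HBase: buffers keys and resolves
// them in groups of batchSize, issuing one bulk call per group.
class BatchedLookup<K, V> {
    private final int batchSize;
    // Stand-in for a bulk query (assumed to wrap a batched HBase get).
    private final Function<List<K>, Map<K, V>> bulkFetch;
    private final List<K> pending = new ArrayList<>();
    private final Map<K, V> results = new LinkedHashMap<>();

    BatchedLookup(int batchSize, Function<List<K>, Map<K, V>> bulkFetch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.bulkFetch = bulkFetch;
    }

    // Queue one key; trigger a bulk fetch once the buffer is full.
    void add(K key) {
        pending.add(key);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Fetch whatever is still buffered; call once after the last key.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        results.putAll(bulkFetch.apply(new ArrayList<>(pending)));
        pending.clear();
    }

    Map<K, V> results() {
        return results;
    }

    public static void main(String[] args) {
        int[] roundTrips = {0};
        BatchedLookup<String, String> lookup =
            new BatchedLookup<String, String>(1000, keys -> {
                roundTrips[0]++; // one simulated round trip per batch
                Map<String, String> out = new HashMap<>();
                for (String k : keys) {
                    out.put(k, "row-for-" + k);
                }
                return out;
            });
        for (int i = 0; i < 2500; i++) {
            lookup.add("person" + i);
        }
        lookup.flush();
        // prints "3 round trips for 2500 ids"
        System.out.println(roundTrips[0] + " round trips for "
            + lookup.results().size() + " ids");
    }
}
```

With a batch size of 1000, the 2500 ids above are resolved in three bulk calls instead of 2500 single lookups, which is the round-trip saving described in the message.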

I tried extending PigStorage with an NLineStorage class that overrides the getNext() method. However, getNext() returns a tuple, and what I want is a bag of tuples. Even if I implement this as a tuple, I can't implement the tuple's methods such as getType, because those methods apply to individual columns, not to an entire tuple (and what I have is a list of tuples).

I am stuck on this and cannot proceed. Can someone please help? Am I doing something wrong here? By the way, I cannot use the Accumulator interface, because I have no way to use a GROUP BY.

Any help with this will be deeply appreciated.


Thanks,
Sandeep



Re: Need help on writing a UDF

Posted by Aaron Zimmerman <az...@sproutsocial.com>.
I don't entirely follow what your problem is, but it sounds as though you
might do better to load all of the data from HBase into its own relation
and then join against it. What is the overall objective of the Pig script
you are writing?
