Posted to user@pig.apache.org by Dexin Wang <wa...@gmail.com> on 2012/06/26 19:27:31 UTC

Passing a BAG to Pig UDF constructor?

Is it possible to pass a bag to a Pig UDF constructor?

Basically in the constructor I want to initialize some hash map so that on
every exec operation, I can use the hashmap to do a lookup and find the
value I need, and apply some algorithm to it.

I realize I could just do a replicated join to achieve similar things but
the algorithm is more than a few lines and there are some edge cases so I
would rather wrap that logic inside a UDF function. I also realize I could
just pass a file path to the constructor and read the files to initialize
the hashmap but my files are on Amazon's S3 and I don't want to deal with
the S3 API to read the file.

Is this possible, or are there alternative ways to achieve the same
thing?

Thanks.
Dexin

Re: Passing a BAG to Pig UDF constructor?

Posted by Jonathan Coveney <jc...@gmail.com>.
I would run a perf test, but compared to the many other costs, I think it
will be minimal (unless it's a really massive bag). Pig should probably
allow for more graceful initialization in cases like this, but in my
experience I haven't noticed any serious degradation from this sort of
thing.


RE: Passing a BAG to Pig UDF constructor?

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.

> -----Original Message-----
> From: Dexin Wang [mailto:wangdexin@gmail.com]
> Sent: Wednesday, June 27, 2012 11:00 PM
> To: user@pig.apache.org
> Subject: Re: Passing a BAG to Pig UDF constructor?
> 
> That's a good idea (to pass the bag to UDF and initialize it on first
> UDF invocation). Thanks.
> 
> Why do you think it is expensive Mridul?


You will be passing the bag with each tuple, but using it only for the first invocation per mapper/reducer.
If other computations are more expensive, then it will get amortized over time, but it is a cost nonetheless. Only a perf test will tell you if it is small enough to ignore!


Regards,
Mridul



Re: Passing a BAG to Pig UDF constructor?

Posted by Abhinav Neelam <ab...@gmail.com>.
You're right I guess. There's no reason why the two steps should happen on
the same nodes. To get around this, you'd have to make the hash available
on all the nodes - through the distributed cache or by putting it on HDFS
as Mridul suggested. Speaking of which, what's wrong with Mridul's
solution? (#2 in this thread)

If you absolutely have to go with the bag+UDF approach, you can try this if
your 'bag1' has a small number of tuples:

    bag1 = LOAD 'somefile' AS (f1, f2, f3);
    bag_grouped = GROUP bag1 ALL;
    -- copy bag_grouped into each tuple of a
    a_bag = FOREACH a GENERATE bag_grouped.bag1, $0 ..;
    -- build and use your hash
    b = FOREACH a_bag GENERATE myUDF($0 ..);

I don't think this'd be a good approach because you'd 1) be passing 'bag1'
with every tuple in a, and 2) be building your hash every time. Maybe you can
avoid 2) by saving the hash on the first run in a globally available store
and reusing it, but this wouldn't be much different from Mridul's solution
of making the contents of 'bag1' available on HDFS in the first place.

Thanks,
Abhinav


Re: Passing a BAG to Pig UDF constructor?

Posted by Dexin Wang <wa...@gmail.com>.
This (your second method) is very neat, thanks a lot Abhinav.

Some problems though. First, I would have to do a STORE or DUMP of
bag_dummy. Otherwise, Pig won't even run the bag_dummy line.

Another problem: is it possible that the "build" step (which iterates
through bag1) and the "check" step (which iterates through the data bag "a")
happen in different JVMs or even on different compute nodes? If that happens,
the "check" step will not have access to all the hashes it needs.

Hi Jonathan, you initially mentioned passing a bag to the UDF. How would you
do that? Is what Abhinav said something similar to what you had in mind?

Thanks.


Re: Passing a BAG to Pig UDF constructor?

Posted by Abhinav Neelam <ab...@gmail.com>.
You're not passing a bag to your UDF, you're passing a relation. I believe
FOREACH ... GENERATE looks for columns within the relation being iterated
on, meaning that it's looking for 'bag1' within the schema of 'a'.

One way of doing this is generating a bag containing all the tuples in
relation 'bag1', and passing that to the UDF:

    bag1 = LOAD 'somefile' AS (f1, f2, f3);
    bag_grouped = GROUP bag1 ALL;
    -- build your hash here
    bag_dummy = FOREACH bag_grouped GENERATE myUDF(bag1);
    -- write some logic into the UDF to check if it's receiving a bag or two
    -- scalars, if you wish to reuse it
    b = FOREACH a GENERATE myUDF(a1, a2);

The problem here is the GROUP ... ALL statement, as it uses only one reducer
in the reduce phase. You can make your myUDF algebraic (if possible) to speed
up the hash-building FOREACH ... GENERATE step.

Another way of doing this (I'm just throwing this one out there) is maybe
to simply FOREACH ... GENERATE over the relation 'bag1', and in the exec
function build your hash using the input tuples of bag1 (f1, f2, f3).
(Do you need all the tuples in bag1 at one time to build your hash?)

    bag1 = LOAD 'somefile' AS (f1, f2, f3);
    -- build your hash here, perhaps use some identifier if you wish to reuse
    -- your UDF
    bag_dummy = FOREACH bag1 GENERATE myUDF('build', f1, f2, f3);
    -- now use the hash
    b = FOREACH a GENERATE myUDF('check', a1, a2);


Regards,
Abhinav

Re: Passing a BAG to Pig UDF constructor?

Posted by Dexin Wang <wa...@gmail.com>.
Actually, how do you pass a bag to a UDF? I did this:

    a = LOAD 'file_a' AS (a1, a2, a3);

    bag1 = LOAD 'somefile' AS (f1, f2, f3);

    b = FOREACH a GENERATE myUDF(bag1, a1, a2);

But I got this error:

    Invalid scalar projection: bag1 : A column needs to be projected from a relation for it to be used as a scalar

What is the right way of doing this? Thanks.



Re: Passing a BAG to Pig UDF constructor?

Posted by Dexin Wang <wa...@gmail.com>.
That's a good idea (to pass the bag to the UDF and initialize it on first UDF
invocation). Thanks.

Why do you think it is expensive, Mridul?


RE: Passing a BAG to Pig UDF constructor?

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.

> -----Original Message-----
> From: Jonathan Coveney [mailto:jcoveney@gmail.com]
> Sent: Wednesday, June 27, 2012 3:12 AM
> To: user@pig.apache.org
> Subject: Re: Passing a BAG to Pig UDF constructor?
> 
> You can also just pass the bag to the UDF, and have a lazy initializer
> in exec that loads the bag into memory.


Can you elaborate on what you mean by passing the bag to the UDF?
Pass it as part of the input to the UDF in exec and initialize it only once (the first time)? (If yes, this is expensive.)
Or something else?


Regards,
Mridul




Re: Passing a BAG to Pig UDF constructor?

Posted by Jonathan Coveney <jc...@gmail.com>.
You can also just pass the bag to the UDF, and have a lazy initializer in
exec that loads the bag into memory.
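
For reference, a minimal sketch of the lazy-initializer idea in plain Java.
It is not a real Pig UDF: the class name LazyLookupUdf, the buildCount field,
and the use of a List<String[]> in place of Pig's DataBag are all assumptions
made so the example runs without Pig on the classpath. In an actual UDF the
same null check would sit at the top of EvalFunc's exec(Tuple) method.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a UDF that builds its lookup table on the first exec() call.
// All names are hypothetical; List<String[]> stands in for Pig's DataBag.
class LazyLookupUdf {
    private Map<String, String> lookup;  // stays null until the first exec() call
    int buildCount = 0;                  // exposed only so the demo can show "built once"

    // Mirrors exec(Tuple): the bag arrives with every call but is read only once.
    String exec(List<String[]> bag, String key) {
        if (lookup == null) {
            lookup = new HashMap<>();
            for (String[] row : bag) {
                lookup.put(row[0], row[1]);  // first field is the key, second the value
            }
            buildCount++;
        }
        String value = lookup.get(key);
        // The real algorithm and its edge cases would go here.
        return value == null ? "UNKNOWN" : value.toUpperCase();
    }

    public static void main(String[] args) {
        List<String[]> bag = Arrays.asList(
                new String[] {"k1", "alpha"},
                new String[] {"k2", "beta"});
        LazyLookupUdf udf = new LazyLookupUdf();
        System.out.println(udf.exec(bag, "k1"));  // ALPHA
        System.out.println(udf.exec(bag, "k3"));  // UNKNOWN
        System.out.println(udf.buildCount);       // 1
    }
}
```

The map is built once per UDF instance even though the bag is handed to every
call. The bag is still shipped with each tuple, which is the cost Mridul
raises elsewhere in this thread.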


RE: Passing a BAG to Pig UDF constructor?

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
You could dump the data into a DFS file and pass the location of the file as a parameter to your UDF in DEFINE, so that it initializes itself using that data.


- Mridul
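
A rough sketch of what such a self-initializing UDF could look like, in plain
Java so that it runs standalone. The class name FileBackedLookupUdf, the
tab-separated key/value layout, and reading a local path through java.nio are
all assumptions; a real UDF would extend EvalFunc and would likely read the
file through Hadoop's FileSystem API instead. Pig passes the quoted arguments
of a DEFINE statement to the UDF constructor as strings, which is what the
constructor here mimics.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of a UDF that receives a file location as a constructor
// argument and loads its lookup table from that file. Names are hypothetical.
class FileBackedLookupUdf {
    private final String location;
    private Map<String, String> lookup;  // loaded on the first exec() call

    FileBackedLookupUdf(String location) {
        this.location = location;  // only remember the path; no I/O in the constructor
    }

    String exec(String key) throws IOException {
        if (lookup == null) {
            lookup = load(location);
        }
        return lookup.get(key);  // null when the key is absent
    }

    // Assumed file layout: one "key<TAB>value" pair per line.
    private static Map<String, String> load(String location) throws IOException {
        Map<String, String> map = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(location))) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                map.put(parts[0], parts[1]);
            }
        }
        return map;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("lookup", ".tsv");
        Files.write(tmp, Arrays.asList("k1\talpha", "k2\tbeta"));
        FileBackedLookupUdf udf = new FileBackedLookupUdf(tmp.toString());
        System.out.println(udf.exec("k2"));  // beta
        Files.delete(tmp);
    }
}
```

The constructor only stores the path and the file is read on the first exec
call, so constructing the object stays cheap and the data never travels with
the input tuples.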

