Posted to user@pig.apache.org by Shibu Thomas <sh...@microsoft.com> on 2012/02/23 05:57:08 UTC

Retaining state in PIG UDF

Hi,

Is there any mechanism of retaining state between PIG UDF invocations?

Thanks

Shibu Thomas
MSCIS-IS
Office :  +91 (40) 669 32660
Mobile: +91 95811 51116 


Re: Retaining state in PIG UDF

Posted by Jonathan Coveney <jc...@gmail.com>.
I don't think you need multiple invocations of the UDF for this, though
I'm not sure. For example, if you did something like the following:

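-- GenIdNumber stands for the ID-generating UDF; it receives the whole bag of x values
a = load 'thing' as (x:int);
-- "group a all" puts every row into a single group, so the UDF is called once
b = foreach (group a all) generate flatten(GenIdNumber(a.x));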

then your output would be a bunch of ID numbers, all produced by one
invocation. The caveat is that it wouldn't be terribly efficient, since
grouping on all sends every row to a single reducer. For efficiency you'd
want to make the UDF Algebraic, but then you're going to have a bunch of
invocations.

Generally, you can't make any assumptions about how many invocations there
are going to be, because this is M/R and you don't know how the work is
going to be split up. So once again, it depends on what specifically you're
trying to do. My guess is that there is an efficient way to achieve it
that doesn't explicitly need one global invocation (and you can see why
that would go against distributed processing as a paradigm).
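
To make the single-invocation idea above concrete, here is a minimal sketch in plain Java of what a UDF like GenIdNumber might do once it is handed the whole bag. It is only an illustration under assumptions: the class and method names are made up, a List stands in for Pig's bag, and no Pig classes are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for GenIdNumber; plain Java, not a real Pig EvalFunc.
final class GenIdNumberSketch {

    // Receives every value of the grouped bag in ONE call and pairs each
    // value with a running ID. Each returned row is {id, value}.
    static List<long[]> assignIds(List<Integer> bag) {
        List<long[]> rows = new ArrayList<>(bag.size());
        long nextId = 0; // a local counter is enough because there is only one call
        for (int x : bag) {
            rows.add(new long[] {nextId++, x});
        }
        return rows;
    }

    public static void main(String[] args) {
        for (long[] row : assignIds(Arrays.asList(42, 7, 19))) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

Because the counter lives inside one call, the IDs are contiguous, but all the data has to pass through that one call, which is the efficiency caveat mentioned above.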

2012/2/22 Shibu Thomas <sh...@microsoft.com>

> Hi Jonathan,
>
> The PIG UDF will generate a block of sequence numbers.
> The parent PIG script will call this UDF in a foreach statement and the
> UDF has to return the next number from the sequence.
>
> Thanks
>
> Shibu Thomas
> MSCIS-IS
> Office :  +91 (40) 669 32660
> Mobile: +91 95811 51116

RE: Retaining state in PIG UDF

Posted by Shibu Thomas <sh...@microsoft.com>.
Hi Jonathan,

The PIG UDF will generate a block of sequence numbers.
The parent PIG script will call this UDF in a foreach statement and the UDF has to return the next number from the sequence.

Thanks

Shibu Thomas
MSCIS-IS
Office :  +91 (40) 669 32660
Mobile: +91 95811 51116 


-----Original Message-----
From: Jonathan Coveney [mailto:jcoveney@gmail.com] 
Sent: Thursday, February 23, 2012 10:46 AM
To: user@pig.apache.org
Subject: Re: Retaining state in PIG UDF

You need to be clearer about what you hope to achieve


Re: Retaining state in PIG UDF

Posted by Jonathan Coveney <jc...@gmail.com>.
You need to be clearer about what you hope to achieve

2012/2/22 Shibu Thomas <sh...@microsoft.com>

> Hi,
>
> Is there any mechanism of retaining state between PIG UDF invocations?
>
> Thanks
>
> Shibu Thomas
> MSCIS-IS
> Office :  +91 (40) 669 32660
> Mobile: +91 95811 51116
>
>

Re: Retaining state in PIG UDF

Posted by Alan Gates <ga...@hortonworks.com>.
You mean if you use the same UDF in multiple places in the Pig Latin script?  UDFContext allows you to pass a signature when using it so you can distinguish between different invocations of the same UDF.

Alan.

On Feb 23, 2012, at 2:08 PM, Jonathan Coveney wrote:

> Alan, with the work that was done to get Schema information on the backend,
> is UDFContext now safe to use? It used to create issues if you used the UDF
> in multiple places because it was static.


Re: Retaining state in PIG UDF

Posted by Jonathan Coveney <jc...@gmail.com>.
Alan, with the work that was done to get Schema information on the backend,
is UDFContext now safe to use? It used to create issues if you used the UDF
in multiple places because it was static.

2012/2/23 Alan Gates <ga...@hortonworks.com>

> You can use UDFContext to pass information from the frontend to the
> backend.  That is, if you want the UDF to generate some sequence number
> during the parse/planning stage and pass that to itself for use during
> execution, you can do that.
>
> You cannot pass information between invocations of a UDF once you are
> running in the job.  MapReduce does not provide any way for tasks to
> communicate with each other beyond the data.  Doing so would require
> synchronization and a number of other features MapReduce doesn't provide.
>
> Alan.

Re: Retaining state in PIG UDF

Posted by Alan Gates <ga...@hortonworks.com>.
You can use UDFContext to pass information from the frontend to the backend.  That is, if you want the UDF to generate some sequence number during the parse/planning stage and pass that to itself for use during execution, you can do that.

You cannot pass information between invocations of a UDF once you are running in the job.  MapReduce does not provide any way for tasks to communicate with each other beyond the data.  Doing so would require synchronization and a number of other features MapReduce doesn't provide.

Alan.
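
Given the constraint described above (tasks cannot talk to each other), one way to still hand out non-colliding sequence numbers is to give each task its own disjoint block and keep the counter local to that task. The following is a rough sketch in plain Java, not Pig code. It assumes that each task can find out its own index and that a block size larger than any task's row count is chosen up front; the class name, constructor parameters, and block size are all hypothetical.

```java
// Hypothetical sketch: per-task ID blocks that need no cross-task coordination.
final class TaskLocalSequence {
    private final long blockStart; // first ID in this task's block
    private final long blockSize;  // assumed upper bound on IDs needed per task
    private long used = 0;         // lives only as long as this instance does

    TaskLocalSequence(int taskIndex, long blockSize) {
        this.blockStart = (long) taskIndex * blockSize;
        this.blockSize = blockSize;
    }

    // Returns the next ID from this task's block.
    long next() {
        if (used >= blockSize) {
            throw new IllegalStateException("block exhausted for this task");
        }
        return blockStart + used++;
    }

    public static void main(String[] args) {
        TaskLocalSequence task0 = new TaskLocalSequence(0, 1000);
        TaskLocalSequence task3 = new TaskLocalSequence(3, 1000);
        System.out.println(task0.next()); // first ID of task 0's block
        System.out.println(task3.next()); // first ID of task 3's block
    }
}
```

The IDs produced this way are unique but not contiguous: each task leaves a gap at the end of its block. If a gap-free sequence is required, the data has to go through a single invocation instead.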

On Feb 22, 2012, at 8:57 PM, Shibu Thomas wrote:

> Hi,
> 
> Is there any mechanism of retaining state between PIG UDF invocations?
> 
> Thanks
> 
> Shibu Thomas
> MSCIS-IS
> Office :  +91 (40) 669 32660
> Mobile: +91 95811 51116 
>