Posted to user@pig.apache.org by Mark Laczin <ma...@gmail.com> on 2011/04/20 15:27:12 UTC

Question about bags and UDFs

I'm trying to do something like this:
(if 'data' is a set of tuples loaded from a file containing fields a, b and
c)
(if 'M' is another set of tuples loaded from a file)

data = FOREACH data GENERATE *, someUDF(a, b, M);

What I'm looking for is to generate a value (in this case, a string) based on a and
b, using the contents of M inside the UDF.

The UDF looks like this, in pseudocode:

foreach element x in M {
  if a matches x or b matches x {
    return "something"
  }
}
return "something else"

Is this possible?  I keep getting errors related to "Scalars can only be
used with projections" and the like.
The thing holding me back from using filters is that I won't know what's in
M until it's read, and since (in this case) they'll be regular expressions,
I'd need to be able to join/group with regex matching which I don't think
Pig can do.

-Mark
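
The matching logic in the pseudocode above could be sketched in plain Java roughly as follows. This is only an illustration under assumptions: RegexLabeler and label are hypothetical names, M is modelled as a plain list of regex strings rather than a Pig bag, and "matches" is taken to mean a whole-string regex match. In a real UDF this logic would presumably live inside the exec(Tuple) method of an EvalFunc<String>.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Pig. */
final class RegexLabeler {
    private RegexLabeler() {}

    /**
     * Returns "something" if a or b matches any regex in patterns,
     * otherwise "something else". A null field never matches.
     */
    static String label(String a, String b, List<String> patterns) {
        for (String regex : patterns) {
            Pattern p = Pattern.compile(regex);
            if (matches(p, a) || matches(p, b)) {
                return "something";
            }
        }
        return "something else";
    }

    private static boolean matches(Pattern p, String s) {
        // Matcher.matches() requires the whole string to match the regex.
        return s != null && p.matcher(s).matches();
    }

    public static void main(String[] args) {
        List<String> m = Arrays.asList("foo.*", "[0-9]+");
        System.out.println(label("foobar", "x", m)); // prints "something"
        System.out.println(label("x", "y", m));      // prints "something else"
    }
}
```

Compiling each regex on every call keeps the sketch short; with a fixed list it would likely be cheaper to compile the patterns once and reuse them.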

Re: Question about bags and UDFs

Posted by Mark Laczin <ma...@gmail.com>.

For example, running with -param path=filename.txt and this definition:

DEFINE MYUDF org.package.MYUDF('$path') SHIP('$path');

doesn't seem to work...

On Thu, Apr 21, 2011 at 11:18 AM, Mark Laczin <ma...@gmail.com> wrote:

> Does anyone know how to ship the config file in this situation?
> I'm encountering problems with file not found exceptions when trying to run
> this over a cluster.

Re: Question about bags and UDFs

Posted by Xiaomeng Wan <sh...@gmail.com>.

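I used something like this before, and it worked.

In the Pig script:
set mapred.cache.archives /user/shawn/share/xxx.dat#cached

In the UDF:
import org.apache.pig.impl.util.UDFContext;

// Cache entry set in the script, e.g. "/user/shawn/share/xxx.dat#cached".
String cachedfiles = UDFContext.getUDFContext().getJobConf()
        .get("mapred.cache.archives");
int endoffilename = cachedfiles.lastIndexOf("#");
// Name after the '#', here "cached".
String cachepath = cachedfiles.substring(endoffilename + 1);
// File name including its leading '/', here "/xxx.dat".
String cachedfile =
        cachedfiles.substring(cachedfiles.lastIndexOf("/"), endoffilename);

// Relative local path, here "cached/xxx.dat".
String localpath = cachepath + cachedfile;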

Shawn
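
The string handling in the snippet above can be tried on its own, without a cluster. The sketch below is a standalone variant under assumptions: CacheEntry and localPath are hypothetical names, and the property is assumed to hold a single entry of the form /dir/file#link (a comma-separated list of entries is not handled).

```java
/** Hypothetical helper mirroring the string handling in the snippet above. */
final class CacheEntry {
    private CacheEntry() {}

    /**
     * Turns "/dir/name.ext#link" into "link/name.ext".
     * Assumes a single entry that contains both a '/' and a '#'.
     */
    static String localPath(String entry) {
        int hash = entry.lastIndexOf('#');
        // Search for the last '/' before the '#', so a '/' in the link name is ignored.
        int slash = entry.lastIndexOf('/', hash);
        if (hash < 0 || slash < 0) {
            throw new IllegalArgumentException("expected /path/file#link: " + entry);
        }
        String link = entry.substring(hash + 1);
        String fileName = entry.substring(slash, hash); // keeps the leading '/'
        return link + fileName;
    }

    public static void main(String[] args) {
        // prints "cached/xxx.dat"
        System.out.println(localPath("/user/shawn/share/xxx.dat#cached"));
    }
}
```

Unlike the original snippet, this version rejects entries without a '#' or '/' instead of failing later with an index error.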

On Fri, Apr 22, 2011 at 6:10 AM, Mark Laczin <ma...@gmail.com> wrote:
> Follow-up question: how do you add it to the cache in a Pig script, and once
> it's in there, can you access it from the UDF using regular Java file I/O?
> That is, is it as simple as saying:
>
> copyFromLocal $localFilePath udfFile.txt
> DEFINE someudf org.someudf CACHE('udfFile.txt#udfFile.txt');
>
> And then the UDF can read it using regular Java file streams/etc?
>
> Thanks for your help so far - the mailing list has been fairly kind to me in
> this regard, especially considering my lack of Pig experience.
>
> -Mark

Re: Question about bags and UDFs

Posted by Mark Laczin <ma...@gmail.com>.

Follow-up question: how do you add it to the cache in a Pig script, and once
it's in there, can you access it from the UDF using regular Java file I/O?
That is, is it as simple as saying:

copyFromLocal $localFilePath udfFile.txt
DEFINE someudf org.someudf CACHE('udfFile.txt#udfFile.txt');

And then the UDF can read it using regular Java file streams/etc?

Thanks for your help so far - the mailing list has been fairly kind to me in
this regard, especially considering my lack of Pig experience.

-Mark

On Fri, Apr 22, 2011 at 7:40 AM, Mark Laczin <ma...@gmail.com> wrote:

> I think I may have to go with your second option - but thanks for the info,
> I'll keep an eye on 0.9.0.

Re: Question about bags and UDFs

Posted by Mark Laczin <ma...@gmail.com>.

I think I may have to go with your second option - but thanks for the info,
I'll keep an eye on 0.9.0.

On Thu, Apr 21, 2011 at 4:16 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> Starting with Pig 0.9 (not yet released but you can build it off the
> branch) a UDF can specify a file to put in the distributed cache.  You could
> thus have your UDF pick up the file locally on your box and put it in the
> distributed cache, and then read it from the distributed cache on the back
> end.  If running with an un-released version isn't an option for you, you
> could manually load the file into the distributed cache and then read it
> from your UDF.
>
> Alan.

Re: Question about bags and UDFs

Posted by Alan Gates <ga...@yahoo-inc.com>.

Starting with Pig 0.9 (not yet released but you can build it off the  
branch) a UDF can specify a file to put in the distributed cache.  You  
could thus have your UDF pick up the file locally on your box and put  
it in the distributed cache, and then read it from the distributed  
cache on the back end.  If running with an un-released version isn't  
an option for you, you could manually load the file into the  
distributed cache and then read it from your UDF.

Alan.

On Apr 21, 2011, at 8:18 AM, Mark Laczin wrote:

> Does anyone know how to ship the config file in this situation?
> I'm encountering problems with file not found exceptions when trying to run
> this over a cluster.

Re: Question about bags and UDFs

Posted by Mark Laczin <ma...@gmail.com>.

Does anyone know how to ship the config file in this situation?
I'm encountering problems with file not found exceptions when trying to run
this over a cluster.

On Wed, Apr 20, 2011 at 1:03 PM, Mark Laczin <ma...@gmail.com> wrote:

> I kind of solved it by reading the data in my UDF constructor. It's just a
> file with a list of about 10 regular expressions, so I did manual file I/O: I
> pass the path in as a parameter, store the contents, and then loop over them
> and test a and b by hand.  It's not the MapReduce way, but it will work for
> this application, considering the small size of the file.
>
> If anyone knows how my "patch" might fail, or if there is a better way -
> feel free to speak up.
>
> -Mark

Re: Question about bags and UDFs

Posted by Mark Laczin <ma...@gmail.com>.

I kind of solved it by reading the data in my UDF constructor. It's just a
file with a list of about 10 regular expressions, so I did manual file I/O: I
pass the path in as a parameter, store the contents, and then loop over them
and test a and b by hand.  It's not the MapReduce way, but it will work for
this application, considering the small size of the file.

If anyone knows how my "patch" might fail, or if there is a better way -
feel free to speak up.

-Mark
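
A rough sketch of the file-reading approach described above, with one variation: the file is read on first use rather than in the constructor. The reasoning, as far as I know, is that Pig may also instantiate a UDF on the client while planning the job, where a cluster-side path might not exist, so deferring the read avoids a file-not-found error at construction time. PatternFile and its methods are hypothetical names, and the format is assumed to be one regex per line with surrounding whitespace ignored.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical holder for the regex list; not part of Pig. */
final class PatternFile {
    private final String path;
    private List<Pattern> patterns; // stays null until the first call to patterns()

    PatternFile(String path) {
        this.path = path; // no file I/O here, so construction cannot fail on a missing file
    }

    /** Reads one regex per line, skipping blank lines, and caches the compiled result. */
    synchronized List<Pattern> patterns() throws IOException {
        if (patterns == null) {
            List<Pattern> loaded = new ArrayList<>();
            for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
                String regex = line.trim();
                if (!regex.isEmpty()) {
                    loaded.add(Pattern.compile(regex));
                }
            }
            patterns = loaded;
        }
        return patterns;
    }

    /** True if the whole value matches at least one regex from the file. */
    boolean matchesAny(String value) throws IOException {
        if (value == null) {
            return false;
        }
        for (Pattern p : patterns()) {
            if (p.matcher(value).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

The path still has to resolve on every node that runs a task, which is the part a distributed cache entry or a shipped file would need to take care of.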

On Wed, Apr 20, 2011 at 12:51 PM, Bill Graham <bi...@gmail.com> wrote:

> You could try doing GROUP ALL on the contents of M, which would
> produce a single bag containing each record and then joining M with
> data using a surrogate constant key. Or CROSS would also work instead
> of the join I suspect. Then you'd have a tuple like this to work with:
>
> (a, b, M:bag)
>
> I'm not sure if things would blow up if M is too large to fit into
> memory in your UDF though.

Re: Question about bags and UDFs

Posted by Bill Graham <bi...@gmail.com>.

You could try doing GROUP ALL on the contents of M, which would
produce a single bag containing each record and then joining M with
data using a surrogate constant key. Or CROSS would also work instead
of the join I suspect. Then you'd have a tuple like this to work with:

(a, b, M:bag)

I'm not sure if things would blow up if M is too large to fit into
memory in your UDF though.


On Wed, Apr 20, 2011 at 6:27 AM, Mark Laczin <ma...@gmail.com> wrote:
> I'm trying to do something like this:
> (if 'data' is a set of tuples loaded from a file containing fields a, b and
> c)
> (if 'M' is another set of tuples loaded from a file)
>
> data = FOREACH data GENERATE *, someUDF(a, b, M);
>
> What I'm looking for is to generate a value (in this case, a string) based on a and
> b, using the contents of M inside the UDF.
>
> The UDF looks like this, in pseudocode:
>
> foreach element x in M {
>  if a matches x or b matches x {
>    return "something"
>  }
> }
> return "something else"
>
> Is this possible?  I keep getting errors related to "Scalars can only be
> used with projections" and the like.
> The thing holding me back from using filters is that I won't know what's in
> M until it's read, and since (in this case) they'll be regular expressions,
> I'd need to be able to join/group with regex matching which I don't think
> Pig can do.
>
> -Mark
>