Posted to user@pig.apache.org by Brian Adams <Br...@chacha.com> on 2010/12/06 21:25:42 UTC

Regex Match Tagger UDF?

I have a list of regex patterns that I would like to run against a data
set, and if it matches a particular pattern in the list, tag it with the
predefined tag for that pattern.
Has this been done, or is it available somewhere?
I've not written any UDFs, and although I'm not against doing so, I
probably don't have the time to write one at this point.

If this isn't available somewhere I can work around this roadblock, but
it would be awesome if someone has cooked up this functionality
somewhere.
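
For readers landing on this question, here is a minimal sketch of the tagging logic in plain Java, the language Pig UDFs are usually written in. It is not an existing Pig or Piggybank function: the class name RegexTagger, its method names, and any patterns fed to it are made up for illustration, and it would still need to be wrapped in a Pig UDF before it could be called from a script.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig or Piggybank: maps each tag to a
// compiled regex and returns every tag whose pattern occurs in the input.
class RegexTagger {
    // LinkedHashMap keeps tags in the order they were registered.
    private final Map<String, Pattern> patterns = new LinkedHashMap<>();

    // Registers a tag and the regex that should trigger it.
    RegexTagger add(String tag, String regex) {
        patterns.put(tag, Pattern.compile(regex));
        return this;
    }

    // Returns the tags of all patterns found in the input, in registration order.
    List<String> tag(String input) {
        List<String> tags = new ArrayList<>();
        if (input == null) {
            return tags; // fields can be null; treat that as "no tags"
        }
        for (Map.Entry<String, Pattern> e : patterns.entrySet()) {
            // find() matches anywhere in the input, unlike matches()
            if (e.getValue().matcher(input).find()) {
                tags.add(e.getKey());
            }
        }
        return tags;
    }
}
```

Compiling each pattern once up front matters here, because a UDF is called once per record and recompiling a regex for every row would be wasteful.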

-----Original Message-----
From: Anze [mailto:anzenews@volja.net] 
Sent: Monday, December 06, 2010 3:09 PM
To: user@pig.apache.org
Subject: Re: Easy question...difference between this::form and
this.form?


Sorry to hijack your question, Jonathan, but while we are at it... :) 

Is there a way to tell Pig NOT to add "base_alias::"? Almost half my code
consists of FOREACH... GENERATE that just remove these prefixes.

Thanks,

Anze

On Monday 06 December 2010, Daniel Dai wrote:
> After join, cross, foreach flatten, Pig will automatically add
> "base_alias::" prefix. All other cases use "."
> 
> Daniel
> 
> Jonathan Coveney wrote:
> > It's very hard to search for this among the docs because it's so generic,
> > so I thought I'd ask... I'm sure the answer is painfully easy.
> > 
> > Taking a look at this code that I found online, for example
> > 
> > --
> > -- Read in a bag of tuples (timeseries for this example) and divide the
> > -- numeric column by its maximum.
> > --
> > %default DATABAG 'data/timeseries.tsv'
> > 
> > data       = LOAD '$DATABAG' AS (month:chararray, count:int);
> > accumulate = GROUP data ALL;
> > calc_max   = FOREACH accumulate GENERATE FLATTEN(data),
> > MAX(data.count) AS max_count;
> > normalize  = FOREACH calc_max GENERATE data::month AS month,
> > data::count AS count, (float)data::count / (float)max_count AS
> > normed_count;
> > DUMP normalize;
> > 
> > What purpose does data::month serve versus data.count?
> > 
> > Thanks


Re: Regex Match Tagger UDF?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
All good questions. I'll put all of this into a readme in the project, and
on the Pig wiki.
Thanks for your willingness to contribute!

0) all contributions should have the apache license
1) fork and make a pull request. The docs I will write up will include
something along the lines of "by sending a pull request you implicitly
confirm that you have the right to release this code under the apache 2.0
license"
2) I like camel-cased UDFs and generally follow the standard Sun code
conventions, though I prefer a two-space indentation. One of the pain points
for folks contributing to the main piggybank has been an overabundance of
requirements; I think for wild-west piggybank, we will be a lot more
lenient. Which has its costs, granted...
3) no restrictions, though a more generic package would be cool. LinkedIn
already contributed stuff under com.linkedin so there's precedent. If folks
feel strongly about the implicit attribution, I am cool with that.
4) assuming they are apache, just change the ant build file. Ivy preferred
over checking in jars.
5) If your UDF is not a Load/Store func, the interface is the same, so it
doesn't matter. Most likely when I pull in the real piggybank, we'll just
change version compatibility to 8.

-D

On Tue, Dec 7, 2010 at 11:34 AM, Zach Bailey <za...@dataclip.com> wrote:

>
>  Dmitriy,
>
>
> I'm happy to contribute those UDF classes to that Github repo. Are there
> instructions anywhere on how I should go about doing so? Of main concern
> are:
>
>
> * how to get repo access (should I fork and do a pull request?),
> * style/format/naming restrictions/suggestions (java code format -
> checkstyle, should the UDFs be upper cased, camel cased, etc.)
> * java package restrictions/suggestions (can the UDFs stay in
> com.dataclip.piggybank or should they be repackaged elsewhere)
> * how to handle repackaged code/libraries (one of my UDFs depends on a
> repackaged implementation of the Aho-Corasick algorithm)
> * pig version compatibility (the repo has 0.6.1, mine are written against
> 0.7.0)
>
> Thanks,
> Zach

Re: Regex Match Tagger UDF?

Posted by Zach Bailey <za...@dataclip.com>.
 Dmitriy,


I'm happy to contribute those UDF classes to that Github repo. Are there instructions anywhere on how I should go about doing so? Of main concern are:


* how to get repo access (should I fork and do a pull request?),
* style/format/naming restrictions/suggestions (java code format - checkstyle, should the UDFs be upper cased, camel cased, etc.)
* java package restrictions/suggestions (can the UDFs stay in com.dataclip.piggybank or should they be repackaged elsewhere)
* how to handle repackaged code/libraries (one of my UDFs depends on a repackaged implementation of the Aho-Corasick algorithm)
* pig version compatibility (the repo has 0.6.1, mine are written against 0.7.0)

Thanks,
Zach


On Monday, December 6, 2010 at 9:26 PM, Dmitriy Ryaboy wrote:

> Zach,
> Do you mind contributing that directly to the Piggybank's upcoming home,
> https://github.com/wilbur/Piggybank ?
> 
> D



Re: Regex Match Tagger UDF?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Zach,
Do you mind contributing that directly to the Piggybank's upcoming home,
https://github.com/wilbur/Piggybank ?

D

On Mon, Dec 6, 2010 at 2:25 PM, Zach Bailey <za...@dataclip.com> wrote:

>
>  Here you go:
>
>
> https://github.com/znbailey/Dataclip-Piggybank
>
>
> The UDF you'll be interested in is here:
>
>
>
> https://github.com/znbailey/Dataclip-Piggybank/blob/master/src/java/com/dataclip/piggybank/AHO_CORASICK.java
>
>
> I would recommend grabbing the entire repo as that UDF depends on the
> repackaged version of Aho-Corasick in org/arabidopsis/ahocorasick
>
>
> Enjoy,
> Zach

Re: Regex Match Tagger UDF?

Posted by Zach Bailey <za...@dataclip.com>.
 Here you go:


https://github.com/znbailey/Dataclip-Piggybank


The UDF you'll be interested in is here:


https://github.com/znbailey/Dataclip-Piggybank/blob/master/src/java/com/dataclip/piggybank/AHO_CORASICK.java


I would recommend grabbing the entire repo as that UDF depends on the repackaged version of Aho-Corasick in org/arabidopsis/ahocorasick


Enjoy,
Zach


On Monday, December 6, 2010 at 4:55 PM, Brian Adams wrote:

> No problem.
> Sounds good. And no worry about messy code. We are all well aware that code often loses elegance when you are just trying to get it out the door.
> -----Original Message-----
> From: Zach Bailey [mailto:zach.bailey@dataclip.com] 
> Sent: Monday, December 06, 2010 4:46 PM
> To: user@pig.apache.org
> Subject: Re: Regex Match Tagger UDF?
> 
> 
>  Great. Let me clean up the code a bit and I'd be happy to post it. I'm definitely open to some alternatives in terms of how this UDF would be initialized, whether it is via a file sitting on HDFS, etc. The current initialization scheme is admittedly crude but was simple to code and works for us for now.
> 
> Cheers,
> Zach



RE: Regex Match Tagger UDF?

Posted by Brian Adams <Br...@chacha.com>.
No problem.
Sounds good. And no worry about messy code. We are all well aware that code often loses elegance when you are just trying to get it out the door.
-----Original Message-----
From: Zach Bailey [mailto:zach.bailey@dataclip.com] 
Sent: Monday, December 06, 2010 4:46 PM
To: user@pig.apache.org
Subject: Re: Regex Match Tagger UDF?


 Great. Let me clean up the code a bit and I'd be happy to post it. I'm definitely open to some alternatives in terms of how this UDF would be initialized, whether it is via a file sitting on HDFS, etc. The current initialization scheme is admittedly crude but was simple to code and works for us for now.

Cheers,
Zach
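
One of the alternatives mentioned above, initializing from a file, might look roughly like this in Java. The sketch reads from any java.io.Reader so that it stays independent of HDFS. The tab-separated "tag<TAB>pattern" line format, the "#" comment convention, and the class name PatternFileLoader are assumptions made for illustration; they are not the format the Dataclip UDF uses.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical loader: expects one "tag<TAB>pattern" pair per line.
class PatternFileLoader {
    // Returns tag -> pattern in file order; the reader is closed when done.
    static Map<String, String> load(Reader source) {
        Map<String, String> byTag = new LinkedHashMap<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // skip blank lines and comments
                }
                int tab = line.indexOf('\t');
                if (tab <= 0) {
                    throw new IllegalArgumentException(
                        "Expected tag<TAB>pattern, got: " + line);
                }
                byTag.put(line.substring(0, tab), line.substring(tab + 1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return byTag;
    }
}
```

Keeping the patterns in a file rather than in the constructor string would let the list grow without editing the script, at the cost of having to ship that file to every task.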


On Monday, December 6, 2010 at 4:15 PM, Brian Adams wrote:

> That is an interesting approach. I like it. Not ideal, but I think it could work for what I am doing.
> 
> In general I think that is useful to the community and you should github it. 
> By all means, I would love to use this.
> 
> I think I could extend/fork this for my need.
> 
> Thank you Zach!
> 
> -----Original Message-----
> From: Zach Bailey [mailto:zach.bailey@dataclip.com]
> Sent: Monday, December 06, 2010 3:38 PM
> To: user@pig.apache.org
> Subject: Re: Regex Match Tagger UDF?
> 
> 
>  Does the UDF have to support regular expressions? If not, I have adapted the Aho-Corasick algorithm [1] to do something similar to what you're asking for. It works as follows:
> 
> 
> 1.) Initialize the Aho-Corasick UDF with a list of tokens to search for, and a result to output when that token is found:
> 
> 
> define AC_MATCHER com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]')
> 
> 
> 2.) apply the AC_MATCHER to a tuple
> 
> 
> strings = LOAD 'myfile.txt' as (string:chararray);
> tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;
> 
> 
> The tagged_strings will then contain the original line along with a bag of matches. For instance if we had the following in myfile.txt:
> 
> 
> terrier parakeet
> hello
> goodbye
> tabby
> pit bull
> 
> 
> after running the commands in #2 tagged_strings would look like (pardon the ad-hoc notation):
> 
> 
> { string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
> { string: 'hello', tags: {} }
> { string: 'goodbye', tags: {} }
> { string: 'tabby', tags: { 'cats' } }
> { string: 'pit bull', tags: { 'dogs' } }
> 
> 
> If this is something you'd be interested in using/extending I can put it up on github for your forking pleasure.
> 
> Cheers,
> Zach
> 
> 
> On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:
> 
> 
> >  I have a list of regex patterns that I would like to run against a data
> >  set, and if it matches a particular pattern in the list, tag it with
> >  the predefined tag for that pattern.
> >  Has this been done, or available somewhere?
> >  I've not written any UDF's, and although I'm not against doing so, I
> >  probably don't have the time to write one at this point.
> > 
> >  If this isn't available somewhere I can work around this roadblock,
> >  but it would be awesome if someone has cooked up this functionality
> >  somewhere.
> > 
> >  -----Original Message-----
> >  From: Anze [mailto:anzenews@volja.net]
> >  Sent: Monday, December 06, 2010 3:09 PM
> >  To: user@pig.apache.org
> >  Subject: Re: Easy question...difference between this::form and  
> > this.form?
> > 
> > 
> >  Sorry to hijack your question, Jonathan, but while we are at it... :)
> >
> >  Is there a way to tell Pig NOT to add "base_alias::"? Almost half my
> >  code consists of FOREACH... GENERATE that just remove these prefixes.
> > 
> >  Thanks,
> > 
> >  Anze
> > 
> >  On Monday 06 December 2010, Daniel Dai wrote:
> > 
> > > After join, cross, foreach flatten, Pig will automatically add 
> > > "base_alias::" prefix. All other cases use "."
> > > 
> > > Daniel
> > > 
> > > Jonathan Coveney wrote:
> > > > It's very hard to search for this among the docs because it's so generic,
> > > > so I thought I'd ask... I'm sure the answer is painfully easy.
> > > > 
> > > > Taking a look at this code that I found online, for example
> > > > 
> > > > --
> > > > -- Read in a bag of tuples (timeseries for this example) and divide the
> > > > -- numeric column by its maximum.
> > > > --
> > > > %default DATABAG 'data/timeseries.tsv'
> > > > 
> > > > data = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > > accumulate = GROUP data ALL;
> > > > calc_max = FOREACH accumulate GENERATE FLATTEN(data),
> > > > MAX(data.count) AS max_count;
> > > > normalize = FOREACH calc_max GENERATE data::month AS month,
> > > > data::count AS count, (float)data::count / (float)max_count AS
> > > > normed_count;
> > > > DUMP normalize;
> > > > 
> > > > What purpose does data::month serve versus data.count?
> > > > 
> > > > Thanks
> > > 
> > > 
> > 
> > 
> > 
> > 
> > 
> > 
> 
> 
> 
> 




Re: Regex Match Tagger UDF?

Posted by Zach Bailey <za...@dataclip.com>.
 Great. Let me clean up the code a bit and I'd be happy to post it. I'm definitely open to some alternatives in terms of how this UDF would be initialized, whether it is via a file sitting on HDFS, etc. The current initialization scheme is admittedly crude but was simple to code and works for us for now.

Cheers,
Zach


On Monday, December 6, 2010 at 4:15 PM, Brian Adams wrote:

> That is an interesting approach. I like it. Not ideal, but I think it could work for what I am doing.
> 
> In general I think that is useful to the community and you should github it. 
> By all means, I would love to use this.
> 
> I think I could extend/fork this for my need.
> 
> Thank you Zach!
> 
> -----Original Message-----
> From: Zach Bailey [mailto:zach.bailey@dataclip.com] 
> Sent: Monday, December 06, 2010 3:38 PM
> To: user@pig.apache.org
> Subject: Re: Regex Match Tagger UDF?
> 
> 
>  Does the UDF have to support regular expressions? If not, I have adapted the Aho-Corasick algorithm [1] to do something similar to what you're asking for. It works as follows:
> 
> 
> 1.) Initialize the Aho-Corasick UDF with a list of tokens to search for, and a result to output when that token is found:
> 
> 
> define AC_MATCHER com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]')
> 
> 
> 2.) apply the AC_MATCHER to a tuple
> 
> 
> strings = LOAD 'myfile.txt' as (string:chararray);
> tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;
> 
> 
> The tagged_strings will then contain the original line along with a bag of matches. For instance if we had the following in myfile.txt:
> 
> 
> terrier parakeet
> hello
> goodbye
> tabby
> pit bull
> 
> 
> after running the commands in #2 tagged_strings would look like (pardon the ad-hoc notation):
> 
> 
> { string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
> { string: 'hello', tags: {} }
> { string: 'goodbye', tags: {} }
> { string: 'tabby', tags: { 'cats' } }
> { string: 'pit bull', tags: { 'dogs' } }
> 
> 
> If this is something you'd be interested in using/extending I can put it up on github for your forking pleasure.
> 
> Cheers,
> Zach
> 
> 
> On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:
> 
> 
> >  I have a list of regex patterns that I would like to run against a data
> >  set, and if it matches a particular pattern in the list, tag it with 
> >  the predefined tag for that pattern.
> >  Has this been done, or available somewhere? 
> >  I've not written any UDF's, and although I'm not against doing so, I 
> >  probably don't have the time to write one at this point.
> > 
> >  If this isn't available somewhere I can work around this roadblock, 
> >  but it would be awesome if someone has cooked up this functionality 
> >  somewhere.
> > 
> >  -----Original Message-----
> >  From: Anze [mailto:anzenews@volja.net]
> >  Sent: Monday, December 06, 2010 3:09 PM
> >  To: user@pig.apache.org
> >  Subject: Re: Easy question...difference between this::form and 
> >  this.form?
> > 
> > 
> >  Sorry to hijack your question, Jonathan, but while we are at it... :)
> > 
> >  Is there a way to tell Pig NOT to add "base_alias::"? Almost half my 
> >  code consists of FOREACH... GENERATE that just remove these prefixes.
> > 
> >  Thanks,
> > 
> >  Anze
> > 
> >  On Monday 06 December 2010, Daniel Dai wrote:
> > 
> > > After join, cross, foreach flatten, Pig will automatically add 
> > > "base_alias::" prefix. All other cases use "."
> > > 
> > > Daniel
> > > 
> > > Jonathan Coveney wrote:
> > > > It's very hard to search for this among the docs because it's so generic,
> > > > so I thought I'd ask... I'm sure the answer is painfully easy.
> > > > 
> > > > Taking a look at this code that I found online, for example
> > > > 
> > > > --
> > > > -- Read in a bag of tuples (timeseries for this example) and divide the
> > > > -- numeric column by its maximum.
> > > > --
> > > > %default DATABAG 'data/timeseries.tsv'
> > > > 
> > > > data = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > > accumulate = GROUP data ALL;
> > > > calc_max = FOREACH accumulate GENERATE FLATTEN(data),
> > > > MAX(data.count) AS max_count;
> > > > normalize = FOREACH calc_max GENERATE data::month AS month,
> > > > data::count AS count, (float)data::count / (float)max_count AS
> > > > normed_count;
> > > > DUMP normalize;
> > > > 
> > > > What purpose does data::month serve versus data.count?
> > > > 
> > > > Thanks
> > > 
> > > 
> > 
> > 
> > 
> > 
> > 
> > 
> 
> 
> 
> 




RE: Regex Match Tagger UDF?

Posted by Brian Adams <Br...@chacha.com>.
That is an interesting approach. I like it. Not ideal, but I think it could work for what I am doing.

In general I think that is useful to the community and you should github it. 
By all means, I would love to use this.

I think I could extend/fork this for my need.

Thank you Zach!

-----Original Message-----
From: Zach Bailey [mailto:zach.bailey@dataclip.com] 
Sent: Monday, December 06, 2010 3:38 PM
To: user@pig.apache.org
Subject: Re: Regex Match Tagger UDF?


 Does the UDF have to support regular expressions? If not, I have adapted the Aho-Corasick algorithm [1] to do something similar to what you're asking for. It works as follows:


1.) Initialize the Aho-Corasick UDF with a list of tokens to search for, and a result to output when that token is found:


define AC_MATCHER com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]')


2.) apply the AC_MATCHER to a tuple


strings = LOAD 'myfile.txt' as (string:chararray);
tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;


The tagged_strings will then contain the original line along with a bag of matches. For instance if we had the following in myfile.txt:


terrier parakeet
hello
goodbye
tabby
pit bull


after running the commands in #2 tagged_strings would look like (pardon the ad-hoc notation):


{ string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
{ string: 'hello', tags: {} }
{ string: 'goodbye', tags: {} }
{ string: 'tabby', tags: { 'cats' } }
{ string: 'pit bull', tags: { 'dogs' } }


If this is something you'd be interested in using/extending I can put it up on github for your forking pleasure.

Cheers,
Zach


On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:

> I have a list of regex patterns that I would like to run against a data
> set, and if it matches a particular pattern in the list, tag it with 
> the predefined tag for that pattern.
> Has this been done, or available somewhere? 
> I've not written any UDF's, and although I'm not against doing so, I 
> probably don't have the time to write one at this point.
> 
> If this isn't available somewhere I can work around this roadblock, 
> but it would be awesome if someone has cooked up this functionality 
> somewhere.
> 
> -----Original Message-----
> From: Anze [mailto:anzenews@volja.net]
> Sent: Monday, December 06, 2010 3:09 PM
> To: user@pig.apache.org
> Subject: Re: Easy question...difference between this::form and 
> this.form?
> 
> 
> Sorry to hijack your question, Jonathan, but while we are at it... :)
> 
> Is there a way to tell Pig NOT to add "base_alias::"? Almost half my 
> code consists of FOREACH... GENERATE that just remove these prefixes.
> 
> Thanks,
> 
> Anze
> 
> On Monday 06 December 2010, Daniel Dai wrote:
> 
> >  After join, cross, foreach flatten, Pig will automatically add  
> > "base_alias::" prefix. All other cases use "."
> > 
> >  Daniel
> > 
> >  Jonathan Coveney wrote:
> > > It's very hard to search for this among the docs because it's so generic,
> > > so I thought I'd ask... I'm sure the answer is painfully easy.
> > > 
> > > Taking a look at this code that I found online, for example
> > > 
> > > --
> > > -- Read in a bag of tuples (timeseries for this example) and divide the
> > > -- numeric column by its maximum.
> > > --
> > > %default DATABAG 'data/timeseries.tsv'
> > > 
> > > data = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > accumulate = GROUP data ALL;
> > > calc_max = FOREACH accumulate GENERATE FLATTEN(data),
> > > MAX(data.count) AS max_count;
> > > normalize = FOREACH calc_max GENERATE data::month AS month,
> > > data::count AS count, (float)data::count / (float)max_count AS
> > > normed_count;
> > > DUMP normalize;
> > > 
> > > What purpose does data::month serve versus data.count?
> > > 
> > > Thanks



Re: Regex Match Tagger UDF?

Posted by Zach Bailey <za...@dataclip.com>.
 Does the UDF have to support regular expressions? If not, I have adapted the Aho-Corasick algorithm [1] to do something similar to what you're asking for. It works as follows:


1.) Initialize the Aho-Corasick UDF with a list of tokens to search for, and a result to output when that token is found:


define AC_MATCHER com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]');


2.) Apply AC_MATCHER to each tuple:


strings = LOAD 'myfile.txt' as (string:chararray);
tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;


The tagged_strings will then contain the original line along with a bag of matches. For instance if we had the following in myfile.txt:


terrier parakeet
hello
goodbye
tabby
pit bull


after running the commands in #2 tagged_strings would look like (pardon the ad-hoc notation):


{ string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
{ string: 'hello', tags: {} }
{ string: 'goodbye', tags: {} }
{ string: 'tabby', tags: { 'cats' } }
{ string: 'pit bull', tags: { 'dogs' } }
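
To make the intended behaviour concrete, here is a rough Python sketch of the tagging semantics only. It is not the UDF itself and it does not implement Aho-Corasick: it uses a naive substring scan, and the function names parse_spec and tag_string are made up for illustration. The spec format is assumed to be exactly the 'tag=[token|token];...' string shown in step 1.

```python
def parse_spec(spec):
    """Parse 'tag=[tok1|tok2];tag2=[...]' into a dict of tag -> token list."""
    tags = {}
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        tag, _, body = entry.partition("=")
        # Drop the surrounding brackets, then split the alternatives on '|'.
        tokens = [t for t in body.strip().strip("[]").split("|") if t]
        tags[tag.strip()] = tokens
    return tags


def tag_string(tags, text):
    """Return each tag (in spec order) that has at least one token in text.

    Naive substring scan; a real Aho-Corasick matcher would find all
    tokens in a single pass over the text instead.
    """
    return [tag for tag, tokens in tags.items()
            if any(tok in text for tok in tokens)]


SPEC = ("dogs=[terrier|retriever|pit bull];"
        "cats=[tabby|mainecoon|tuxedo];"
        "birds=[parakeet|parrot|cuckoo]")
TAGS = parse_spec(SPEC)

if __name__ == "__main__":
    for line in ["terrier parakeet", "hello", "goodbye", "tabby", "pit bull"]:
        print(line, tag_string(TAGS, line))
```

Each input line maps to a (possibly empty) list of tags, which corresponds to the bag of matches in the listing above.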


If this is something you'd be interested in using/extending I can put it up on github for your forking pleasure.
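
If regular expressions do turn out to be required, the same "tag on match" idea can be sketched with a list of (tag, pattern) pairs. This is only an illustration in plain Python using the standard re module; the patterns and the name regex_tags are hypothetical, and the logic would still have to be wrapped as a Pig UDF to be used from a script.

```python
import re

# Hypothetical (tag, pattern) pairs, chosen only for illustration.
PATTERNS = [
    ("dogs", re.compile(r"\b(terrier|retriever|pit\s+bull)s?\b", re.IGNORECASE)),
    ("cats", re.compile(r"\b(tabby|maine\s?coon|tuxedo)\b", re.IGNORECASE)),
    ("birds", re.compile(r"\b(parakeet|parrot|cuckoo)s?\b", re.IGNORECASE)),
]


def regex_tags(text, patterns=PATTERNS):
    """Return the tag of every pattern that matches somewhere in text."""
    return [tag for tag, rx in patterns if rx.search(text)]


if __name__ == "__main__":
    for line in ["Two Pit  Bulls and a parrot", "Maine Coon", "hello"]:
        print(line, regex_tags(line))
```

Unlike the token list, each tag here costs one regex search per input line, so a long pattern list will be slower than a single multi-token pass.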

Cheers,
Zach


On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:

> I have a list of regex patterns that I would like to run against a data
> set, and if it matches a particular pattern in the list, tag it with the
> predefined tag for that pattern.
> Has this been done, or available somewhere? 
> I've not written any UDF's, and although I'm not against doing so, I
> probably don't have the time to write one at this point.
> 
> If this isn't available somewhere I can work around this roadblock, but
> it would be awesome if someone has cooked up this functionality
> somewhere.
> 
> -----Original Message-----
> From: Anze [mailto:anzenews@volja.net] 
> Sent: Monday, December 06, 2010 3:09 PM
> To: user@pig.apache.org
> Subject: Re: Easy question...difference between this::form and
> this.form?
> 
> 
> Sorry to hijack your question, Jonathan, but while we are at it... :) 
> 
> Is there a way to tell Pig NOT to add "base_alias::"? Almost half my
> code consists of FOREACH... GENERATE that just remove these prefixes.
> 
> Thanks,
> 
> Anze
> 
> On Monday 06 December 2010, Daniel Dai wrote:
> 
> >  After join, cross, foreach flatten, Pig will automatically add
> >  "base_alias::" prefix. All other cases use "."
> > 
> >  Daniel
> > 
> >  Jonathan Coveney wrote:
> > > It's very hard to search for this among the docs because it's so generic,
> > > so I thought I'd ask... I'm sure the answer is painfully easy.
> > > 
> > > Taking a look at this code that I found online, for example
> > > 
> > > --
> > > -- Read in a bag of tuples (timeseries for this example) and divide the
> > > -- numeric column by its maximum.
> > > --
> > > %default DATABAG 'data/timeseries.tsv'
> > > 
> > > data = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > accumulate = GROUP data ALL;
> > > calc_max = FOREACH accumulate GENERATE FLATTEN(data),
> > > MAX(data.count) AS max_count;
> > > normalize = FOREACH calc_max GENERATE data::month AS month,
> > > data::count AS count, (float)data::count / (float)max_count AS
> > > normed_count;
> > > DUMP normalize;
> > > 
> > > What purpose does data::month serve versus data.count?
> > > 
> > > Thanks