Posted to users@opennlp.apache.org by Stuart Robinson <st...@gmail.com> on 2014/05/22 01:43:05 UTC

custom token classes for NER model training

Hi, all. I'm using OpenNLP to recognize phone numbers, for which there
isn't a pre-existing model. I've been training my own and have gotten
pretty decent results so far with the simple tokenizer and out-of-the-box
features but I'd now like to improve the features that it's training on. In
particular, I'd like to define some token classes that are specific to the
domain of phone numbers. From what I've read so far (e.g., in Taming Text),
the out-of-the-box token classes are:

1. token is lowercase alphabetic
2. token is two digits
3. token is four digits
4. token contains a number and a letter
5. token contains a number and a hyphen
6. token contains a number and a slash
7. token contains a number and a comma
8. token contains a number and a period
9. token contains a number
10. token is all caps, single letter
11. token is all caps, multiple letters
12. token's initial letters are caps
13. other

I'd like to be able to define features like the following:

a. token is five digits
b. token is six digits
c. token is seven digits
d. token is eight digits
e. token is more than eight digits
etc.

I know that you can override features when calling NameFinderME.train by
passing in your own AggregatedFeatureGenerator object, but it's not clear
how an individual feature generator could use custom token classes.
Pointers to the appropriate entry point in the code (and any other
suggestions or advice) would be greatly appreciated.

Thanks in advance.

Regards,
Stuart

Re: custom token classes for NER model training

Posted by Mark G <gi...@gmail.com>.

Perhaps you could write your own AdaptiveFeatureGenerator implementation. I
think this would let you add features to the tokens according to your own
rules. The interface is in the opennlp.tools.util.featuregen package. Take a
look at some of its current implementations. Hope this helps.
MG
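
A minimal sketch of what such a generator might look like, assuming the
createFeatures(features, tokens, index, previousOutcomes) method of
AdaptiveFeatureGenerator. To stay dependency-free, the class below does not
actually implement the OpenNLP interface; the class name, the "dl=" feature
prefix, and the class labels ("5d", "9+d") are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch. In a real project this class would implement
// opennlp.tools.util.featuregen.AdaptiveFeatureGenerator (assumption:
// the method below mirrors that interface's createFeatures signature).
final class DigitLengthFeatureGenerator {

    // Maps an all-digit token to a length class such as "5d" or "9+d".
    // Returns null when the token is empty or contains a non-digit.
    static String digitLengthClass(String token) {
        if (token.isEmpty()) {
            return null;
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return null;
            }
        }
        int n = token.length();
        return n > 8 ? "9+d" : n + "d";
    }

    // Adds one feature for the token at `index` if it is all digits.
    // previousOutcomes is unused here but kept to match the interface shape.
    void createFeatures(List<String> features, String[] tokens, int index,
                        String[] previousOutcomes) {
        String cls = digitLengthClass(tokens[index]);
        if (cls != null) {
            features.add("dl=" + cls);
        }
    }

    // Convenience helper for trying the generator on one token.
    static List<String> featuresFor(String[] tokens, int index) {
        List<String> features = new ArrayList<>();
        new DigitLengthFeatureGenerator()
                .createFeatures(features, tokens, index, new String[0]);
        return features;
    }
}
```

An implementation of the real interface could then presumably be combined
with the default generators in the AggregatedFeatureGenerator that is passed
to NameFinderME.train, so the new digit-length classes add to the built-in
token classes rather than replace them.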


On Thu, May 22, 2014 at 1:25 PM, Stuart Robinson <stuartprobinson1@gmail.com> wrote:

> Hi, Mark. Thanks for your suggestion. My initial approach was to use
> regular expressions, but I'm looking at social media and there is a lot
> more variation in the formatting of phone numbers than you would expect (as
> well as various kinds of obfuscation). So I think a named entity recognizer
> will ultimately be more robust. Hence my interest in custom token classes.
>
> Best,
> Stuart

Re: custom token classes for NER model training

Posted by Stuart Robinson <st...@gmail.com>.

Hi, Mark. Thanks for your suggestion. My initial approach was to use
regular expressions, but I'm looking at social media and there is a lot
more variation in the formatting of phone numbers than you would expect (as
well as various kinds of obfuscation). So I think a named entity recognizer
will ultimately be more robust. Hence my interest in custom token classes.

Best,
Stuart


On Wed, May 21, 2014 at 6:09 PM, Mark Giaconia <gi...@gmail.com> wrote:

>
>
> Sounds like you could use a regexnamefinder since these patterns are so
> well defined with a set of rules.

Re: custom token classes for NER model training

Posted by Mark Giaconia <gi...@gmail.com>.


Sounds like you could use a RegexNameFinder, since these patterns are so well defined with a set of rules.
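
A rough sketch of the rule-based idea, using only java.util.regex. As far as
I know, OpenNLP's RegexNameFinder (in opennlp.tools.namefind) is driven by
compiled patterns of this kind, but the class name and the pattern below are
illustrative assumptions, covering only a simple North American layout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
final class PhoneRegexSketch {

    // Illustrative pattern: a 3-digit area code, either in parentheses or
    // followed by an optional separator, then 3 digits, an optional
    // separator, and 4 digits.
    static final Pattern PHONE = Pattern.compile(
            "(?:\\(\\d{3}\\)\\s*|\\d{3}[-. ]?)\\d{3}[-. ]?\\d{4}");

    // Returns every substring of `text` that matches PHONE, in order.
    static List<String> findPhones(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }
}
```

A pattern like this only catches the layouts it was written for; a number
that is partly spelled out, for instance, would not match.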