Posted to dev@opennlp.apache.org by Rodrigo Agerri <ra...@apache.org> on 2016/03/07 12:07:42 UTC

Re: Question about deprecated NameFinderME constructors

Hi,

You can do all those tasks by using the create method in the
TokenNameFinderFactory:

http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/namefind/TokenNameFinderFactory.java?revision=1712553&view=markup#l100

For that you need to:

1. Provide the name of the factory class you are using, it could be
the same factory class: TokenNameFinderFactory.class.getName()
2. Create an XML descriptor and pass it as a byte[] array
3. Load the resources (e.g., clusters) in a resources map consisting
of the id of the resource and the serializer.
4. The sequenceCodec: BIO or BILOU.
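The inputs listed above could be prepared roughly as follows. This is only a sketch that uses the JDK alone, so the OpenNLP call itself appears in a comment. The class FactoryArguments and the helper descriptorBytes are hypothetical names, and the create(...) call in the comment should be checked against the OpenNLP version in use.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;

class FactoryArguments {

    // Hypothetical helper: checks that the descriptor is well-formed XML with a
    // <generators> root and returns it as a UTF-8 byte[] (step 2).
    static byte[] descriptorBytes(String xml) {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        String root;
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement()
                    .getNodeName();
        } catch (Exception e) {
            throw new IllegalArgumentException("descriptor is not well-formed XML", e);
        }
        if (!"generators".equals(root)) {
            throw new IllegalArgumentException("unexpected root element: " + root);
        }
        return bytes;
    }

    public static void main(String[] args) {
        String xml = "<generators><cache><generators>"
                + "<window prevLength=\"2\" nextLength=\"2\"><token/></window>"
                + "</generators></cache></generators>";
        byte[] descriptor = descriptorBytes(xml);

        // Step 3: resource id -> loaded resource object. Left empty here; the id
        // presumably has to match the dict attribute used in the descriptor.
        Map<String, Object> resources = new HashMap<>();

        System.out.println(descriptor.length + " descriptor bytes, "
                + resources.size() + " resources");

        // With opennlp-tools on the classpath, steps 1 to 4 would then come
        // together roughly like this (not compiled as part of this sketch):
        //
        //   TokenNameFinderFactory factory = TokenNameFinderFactory.create(
        //       TokenNameFinderFactory.class.getName(),  // step 1
        //       descriptor,                              // step 2
        //       resources,                               // step 3
        //       new BioCodec());                         // step 4
    }
}
```

Validating the descriptor before handing it over is a design choice of this sketch, not something the factory requires; it simply surfaces XML mistakes earlier.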

The Namefinder documentation was already updated:

http://svn.apache.org/viewvc/opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml?view=markup

There is sample code to do that in the CLI class:

http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/namefind/TokenNameFinderTrainerTool.java?revision=1674262&view=markup

and to run it from the CLI:

1. Create an XML feature descriptor, e.g., brown-feature.xml

<generators>
  <cache>
    <generators>
      <window prevLength = "2" nextLength = "2">
        <token/>
      </window>
      <window prevLength = "2" nextLength = "2">
        <brownclustertoken dict="brownBllipClusters" />
      </window>
    </generators>
  </cache>
</generators>

2. Put your clustering lexicon(s) in a directory, e.g., clusters
3. Train: bin/opennlp TokenNameFinderTrainer -featuregen brown-feature.xml
-resources clusters/ -params lang/ml/PerceptronTrainerParams.txt -lang
en -model brown.bin -data
~/experiments/nerc/opennlp/en/conll03/en-testb.opennlp -encoding UTF-8

If you open the brown.bin model you will see the cluster lexicon
serialized inside the model.
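
One way to look inside the model is to list its entries. The sketch below assumes the model file is a zip archive, which is how OpenNLP packages models as far as I know. The class name ModelEntries is hypothetical, and the entry names you see will depend on your descriptor and resources.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class ModelEntries {

    // Returns the entry names of a zip-packaged model read from the given stream.
    static List<String> list(InputStream in) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the model file name used in the training command.
        String path = args.length > 0 ? args[0] : "brown.bin";
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            list(in).forEach(System.out::println);
        }
    }
}
```

If the assumption holds, the cluster lexicon should appear as one of the printed entries.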

You can now use it like any other model: the TokenNameFinderFactory
will reload all the required resources when the model is loaded in
the NameFinderME class.

HTH,

R






On Mon, Feb 15, 2016 at 7:59 AM, Cohan Sujay Carlos <co...@aiaioo.com> wrote:
> Hi,
>
> I noticed that in the OpenNLP SVN 'trunk', the formerly deprecated
> constructors for the class *NameFinderME*:
>
> *public NameFinderME(TokenNameFinderModel model, AdaptiveFeatureGenerator
> generator, int beamSize, SequenceValidator<String> sequenceValidator);*
>
> and
>
>
> *public NameFinderME(TokenNameFinderModel model, AdaptiveFeatureGenerator
> generator, int beamSize)*
>
> have been removed, along with
>
> *public NameFinderME(TokenNameFinderModel model, int beamSize)*
>
> The deprecation comments said:
>
> @deprecated the beam size is now configured during training time in the
> trainer parameter file via beamSearch.beamSize
>
> and
>
> @deprecated Use {@link #NameFinderME(TokenNameFinderModel)} instead and use
> the {@link TokenNameFinderFactory} to configure it.
>
> I wanted to point out a few potential problems:
>
> 1.  The corresponding train methods have not been removed.  So, it is
> possible to train a NameFinderME using a *custom* AdaptiveFeatureGenerator
> class to do feature engineering, but once a model has been so trained,
> there is no way to load and use the stored model with the same
> AdaptiveFeatureGenerator.
>
> 2.  There is still no documentation on the TokenNameFinderFactory which is
> supposed to replace the constructor with the AdaptiveFeatureGenerator.
>
> 3.  I went over the code of TokenNameFinderFactory and a few places where
> it is used and it seemed to be designed for working with an XML
> specification of feature combinations.  I have also in the references
> included a mailing list conversation that says this class should be used
> with an XML file.  However, it turns out that custom feature sets for
> sequential classification are often important, so might we be dropping
> valuable feature engineering support?
>
> Finally, in light of the above, could we keep the deprecated constructors
> around until the alternative constructor (using TokenNameFinderFactory)
> enters into production, and examples and documentation for it become widely
> available?
>
> References:
>
> On the TokenNameFinderFactory using XML:
> https://mail-archives.apache.org/mod_mbox/opennlp-dev/201410.mbox/%3CCAKvDkVDfAx5BMvwVOrbvpZm7xV9erRQzrzbCDpfd+Cq6m=xqQw@mail.gmail.com%3E
>
> Relevant JIRA issues:
> https://issues.apache.org/jira/browse/OPENNLP-718
> https://issues.apache.org/jira/browse/OPENNLP-717
>
> Thank you,
>
> Cohan Sujay Carlos

Re: Question about deprecated NameFinderME constructors

Posted by Joern Kottmann <ko...@gmail.com>.

There is a custom XML element that can load a user-defined class
for feature generation.

So you would add an element like this:
<custom
class="com.x.y.AgglutinativeSouthIndianLanguageMorphologicalSuffixFeatureGenerator"/>

https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
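
For illustration only, a class referenced by such an element might be shaped like the sketch below. Everything here is hypothetical: the interface is a local stand-in modeled on the createFeatures method of opennlp.tools.util.featuregen.AdaptiveFeatureGenerator so that the code compiles without opennlp-tools, and the plain character suffixes are a crude placeholder for real morphological analysis. It also assumes that a class loaded by name needs a no-argument constructor.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for the OpenNLP interface. The real interface also declares
// methods for adaptive data, which are omitted here.
interface FeatureGeneratorSketch {
    void createFeatures(List<String> features, String[] tokens, int index,
                        String[] previousOutcomes);
}

// Hypothetical generator: emits the last 1..maxLength characters of the
// current token as features, never the whole token.
class SuffixFeatureGeneratorSketch implements FeatureGeneratorSketch {

    private final int maxLength;

    // No-argument constructor, assumed to be needed for loading by class name.
    SuffixFeatureGeneratorSketch() {
        this(3);
    }

    SuffixFeatureGeneratorSketch(int maxLength) {
        this.maxLength = maxLength;
    }

    @Override
    public void createFeatures(List<String> features, String[] tokens, int index,
                               String[] previousOutcomes) {
        String token = tokens[index];
        for (int n = 1; n <= maxLength && n < token.length(); n++) {
            features.add("suf" + n + "=" + token.substring(token.length() - n));
        }
    }

    public static void main(String[] args) {
        List<String> features = new ArrayList<>();
        new SuffixFeatureGeneratorSketch()
                .createFeatures(features, new String[] {"walking", "home"}, 0, new String[0]);
        System.out.println(features);
    }
}
```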

I think we should remove the deprecated training methods so it is no longer
possible to train models which can't be loaded.

Jörn


Re: Question about deprecated NameFinderME constructors

Posted by Cohan Sujay Carlos <co...@aiaioo.com>.

Dear Rodrigo,

Thank you for the informative reply.

I just wanted to say I feel there is a use-case that the new constructor
still does not support.  Let me explain with an example.

Let's first take the example of brown-feature.xml, which is defined as ...

<generators>
  <cache>
    <generators>
      <window prevLength = "2" nextLength = "2">
        <token/>
      </window>
      <window prevLength = "2" nextLength = "2">
        <brownclustertoken dict="brownBllipClusters" />
      </window>
    </generators>
  </cache>
</generators>

... In this feature generator, I believe "window" maps to the
WindowFeatureGenerator
<https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/WindowFeatureGenerator.html>
and "token" maps to TokenFeatureGenerator
<https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/TokenFeatureGenerator.html>.

It's clear that we can create new feature generators that are combinations
of existing feature generators.

However, let's say I have a task / language where none of the existing
feature generators or combinations work very well.

Say, for example, that I want to create a new feature generator that pulls
out morphemes from agglutinative South Indian languages ... let's call it
"AgglutinativeSouthIndianLanguageMorphologicalSuffixFeatureGenerator".

It's not clear how one could create XML tags for this feature generator
using the new constructor.

The same thing is easy to do programmatically using the old constructors:
I would just extend the AdaptiveFeatureGenerator
<https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/AdaptiveFeatureGenerator.html>.

So, I was wondering ... are we giving up some API flexibility and
simplicity by removing the constructors that enable me to use subclasses of
AdaptiveFeatureGenerator
<https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/AdaptiveFeatureGenerator.html>,
while there is no easy way to create something like an
AgglutinativeSouthIndianLanguageMorphologicalSuffixFeatureGenerator and use
it as a feature generator in the NameFinderME using the new constructor's
XML specification?

Cohan Sujay Carlos
Aiaioo Labs, +91-77605-80015, http://www.aiaioo.com
