Posted to dev@opennlp.apache.org by Rodrigo Agerri <ra...@apache.org> on 2014/10/03 11:58:47 UTC

[opennlp-dev] TokenNameFinderFactory new features and extension

Hello,

I have implemented a number of new features for the name finder. These
include Brown cluster features (duplicated per Brown path for each
activated feature involving a token) and Clark cluster features
(similar to the WordClusterFeatureGenerator currently available), among
other local extra features which interact well with the clustering
ones.
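
To illustrate the general technique (the path-prefix lengths and feature
names below are invented for the example, not necessarily what my patch
uses), Brown cluster features are usually prefixes of the token's Brown
bit-path, emitted at several fixed lengths:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BrownClusterFeatures {

    // Bit-path prefix lengths to emit; 4/6/10/20 are common choices
    // in the NER literature and are used here only for illustration.
    private static final int[] PATH_PREFIX_LENGTHS = {4, 6, 10, 20};

    private final Map<String, String> tokenToPath;

    public BrownClusterFeatures(Map<String, String> tokenToPath) {
        this.tokenToPath = tokenToPath;
    }

    /** Emits one feature per available bit-path prefix of the token. */
    public List<String> createFeatures(String token) {
        List<String> features = new ArrayList<>();
        String path = tokenToPath.get(token);
        if (path == null) {
            return features; // token not covered by the clustering
        }
        for (int len : PATH_PREFIX_LENGTHS) {
            if (path.length() >= len) {
                features.add("browncluster=" + path.substring(0, len));
            }
        }
        return features;
    }
}
```

The same prefixes can then be appended to every other feature that
involves the token, which is what "duplicated per Brown path" means here.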

I think it would be nice to include them before the new release. I will
open an issue for each of them. What do you think?

In the meantime, I am in the process of testing these new features
locally, but I have run into a number of issues/questions about how to
proceed with the extension of the TokenNameFinderFactory:

1. I add the new features to the GeneratorFactory.
2. I create a new feature descriptor with some of the new features.
3. I extend the TokenNameFinderFactory and instantiate the subclass
via the TokenNameFinderFactory.create(subclassName,
featureGeneratorBytes, resources, sequenceCodec) method.
4. I override the TokenNameFinderFactory.createFeatureGenerators()
method in the extended class.
5. At this point, I do not have access to the featureGeneratorBytes[]
because the TokenNameFinderFactory does not provide a getter, so I add
a getter to the TokenNameFinderFactory class.
Should we do this? Or am I extending the TokenNameFinderFactory in the
wrong way?
6. *Some* of the new features work. If an Element name in the
descriptor does not match in the GeneratorFactory, then
TokenNameFinderFactory.createFeatureGenerators() returns null, and
TokenNameFinderFactory.createContextGenerator() silently stops
the feature creation and falls back to
NameFinderME.createFeatureGenerator().
Is this the desired behaviour? Perhaps we could add a log somewhere
to inform of the backoff to the default features if a descriptor
element does not match?

Any comment appreciated.

Thanks,

Rodrigo

Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/07/2014 06:40 PM, Rodrigo Agerri wrote:
> Hello,
>
> One question regarding the WordClusterFeatureGenerator implementation
> which I am using as template for the Brown features and so on. I
> cannot seem to make it work, it complains all the time that the value
> of the attribute "dict" I provide is not an instance of a
> W2VClassesDictionary:
>
> Exception in thread "main"
> opennlp.tools.namefind.TokenNameFinderModel$FeatureGeneratorCreationError:
> opennlp.tools.util.InvalidFormatException: Not a W2VClassesDictionary
> resource for key: opennlp.tools.util.featuregen.W2VClassesDictionary
>
> I have tried both from the CLI and programmatically and I get the same result.
> From the CLI I write an element like this:
> <w2vwordcluster dict="opennlp.tools.util.featuregen.W2VClassesDictionary" />

The dict parameter expects the name of the w2v dictionary resource. The
way it works is that you create a resource directory, e.g.
"model-resources", and place a file in this directory. The name of the
file goes in the dict attribute.

For example:

<w2vwordcluster dict="xyz-classes.txt" />


> which I add to the default descriptor. I also pass the relevant
> directory containing the word2vec clusters via the -resources
> parameter.

Yes, that has to be done, otherwise it doesn't know where to look for 
the resources.

The entire resource directory is included in the model, and there can be 
multiple
feature generators using files from these resources.
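
Such a classes file is commonly a plain two-column token/cluster mapping,
one pair per line. The exact format W2VClassesDictionary accepts may
differ, so treat this parser as a sketch of the idea only:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

public class ClassesFileParser {

    /**
     * Parses a whitespace-separated "token clusterId" file, one entry
     * per line, into a token -> cluster map. Lines that do not have
     * exactly two fields are skipped.
     */
    public static Map<String, String> parse(Reader in) throws IOException {
        Map<String, String> tokenToCluster = new HashMap<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                tokenToCluster.put(parts[0], parts[1]);
            }
        }
        return tokenToCluster;
    }
}
```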

Jörn

Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Rodrigo Agerri <ro...@ehu.es>.
Hello,

One question regarding the WordClusterFeatureGenerator implementation
which I am using as template for the Brown features and so on. I
cannot seem to make it work, it complains all the time that the value
of the attribute "dict" I provide is not an instance of a
W2VClassesDictionary:

Exception in thread "main"
opennlp.tools.namefind.TokenNameFinderModel$FeatureGeneratorCreationError:
opennlp.tools.util.InvalidFormatException: Not a W2VClassesDictionary
resource for key: opennlp.tools.util.featuregen.W2VClassesDictionary

I have tried both from the CLI and programmatically and I get the same result.
From the CLI I write an element like this:
<w2vwordcluster dict="opennlp.tools.util.featuregen.W2VClassesDictionary" />

which I add to the default descriptor. I also pass the relevant
directory containing the word2vec clusters via the -resources
parameter.

Programmatically I build the same element like this:

Element word2vecClusterFeatures = new Element("w2vwordcluster");
InputStream inputStream = new FileInputStream(word2vecClusterPath);
ArtifactSerializer serializer =
    new W2VClassesDictionary.W2VClassesDictionarySerializer();
word2vecClusterFeatures.setAttribute("dict",
    serializer.create(inputStream).getClass().getName());

and it prints the same descriptor as above as well as the same error
in the GeneratorFactory class.

This is the command: bin/opennlp TokenNameFinderTrainer -featuregen
lang/en/namefinder/en-namefinder.xml -factory
opennlp.tools.namefind.TokenNameFinderFactory -resources ~/word2vec/
-params lang/ml/PerceptronTrainerParams.txt -lang en -model test.bin
-data ~/experiments/nerc/opennlp/data/en/conll2003/opennlp-eng.train

Thanks,

Rodrigo

On Fri, Oct 3, 2014 at 12:40 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 10/03/2014 11:58 AM, Rodrigo Agerri wrote:
>>
>> I have implemented a number of new features for the name finder. These
>> include Brown cluster features (duplicated per Brown path for each
>> activated feature involving a token) and Clark cluster features
>> (similar to the WordClusterFeatureGenerator currently available), among
>> other local extra features which interact well with the clustering
>> ones.
>>
>> I think it would be nice to include them before the new release. I will
>> open an issue for each of them. What do you think?
>
>
> Yes please open issues for them. It would be really nice to receive them as
> a contribution.
>
> There are two things you need to do:
> 1. Implement the feature generators
> - Implement AdaptiveFeatureGenerator or extend CustomFeatureGenerator if you
> need to pass parameters to it
>
> 2. Implement support for loading and serializing the data they need
> - This class should implement SerializableArtifact
> - And if you want to load and use it, the Feature Generator should
> implement ArtifactToSerializerMapper, which tells the loader which
> class to use to read the data file
>
> The above is the procedure you should use if you want to have a real custom
> feature generator which is not part of
> the OpenNLP Tools jar.
>
> When you contribute it, things are slightly different. You should add a
> XmlFeatureGeneratorFactory inside the GeneratorFactory
> class. This factory creates the feature generator based on a defined xml
> element inside the descriptor.
>
>> 6. *Some* of the new features work. If an Element name in the
>> descriptor does not match in the GeneratorFactory, then the
>> TokenNameFinderFactory.createFeatureGenerators() gives a null and the
>> TokenNameFinderFactory.createContextGenerator() automatically stops
>> the feature creation and goes for the
>> NameFinderME.createFeatureGenerator().
>> Is this the desired behaviour? Perhaps we could add a log somewhere?
>> To inform of the backoff to the default features if one descriptor
>> element does not match?
>
>
> That sounds really bad. If there is a problem in the mapping it should fail
> hard and throw an
> exception. The user should be forced to decide by himself what to do, either
> fix his descriptor
> or use defaults.
>
> The steps 4 and 5 you describe should not be necessary to add new feature
> generators.
>
> The idea is that we always use the xml descriptor to define the feature
> generation, that way we can have different
> configurations without changing the OpenNLP code itself, and don't need
> special user code to integrate a
> customized name finder model. If a model makes use of external classes these
> of course need to be on the classpath
> since we can't ship them as a part of the model.
>
> HTH,
> Jörn

Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Rodrigo Agerri <ro...@ehu.es>.
On Mon, Oct 6, 2014 at 4:35 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 10/04/2014 12:53 AM, Rodrigo Agerri wrote:
>>
>> Hi,
>>
>> As a follow-up, it turns out that currently the -featuregen parameter
>> only works if you also provide a subclass via the -factory parameter.
>> I do not know if that is intended.
>
>
> It is not. The featuregen parameter works (or should) without the -factory
> parameter.
> I will have a look. What happens if you don't provide the -factory
> parameter?
> Is there an exception?

No, it just silently trains with the default generator.

Rodrigo

Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/04/2014 12:53 AM, Rodrigo Agerri wrote:
> Hi,
>
> As a follow-up, it turns out that currently the -featuregen parameter
> only works if you also provide a subclass via the -factory parameter.
> I do not know if that is intended.

It is not. The featuregen parameter works (or should) without the 
-factory parameter.
I will have a look. What happens if you don't provide the -factory 
parameter?
Is there an exception?

Jörn

Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Mark G <ma...@apache.org>.
Rodrigo, thanks for this fix, let me know if you want me to test, or just
put a second set of eyes on any particular parts.
MG

On Wed, Oct 8, 2014 at 2:32 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> Well done, this was a serious regression.
>
> You observed this earlier:
> > Only one issue remains: The requirement to add -factory parameter for
> > the -featuregen parameter to work and its backing-off to default
> > features without warning if the -factory param is not used.
>
> Is that still the case with the fix you applied?
>
> We need to add a test case to be able to detect bugs in the setup of the
> feature generation. I wouldn't be surprised if something similar happens
> again at some point.
>
> Jörn
>
> On Wed, 2014-10-08 at 17:27 +0100, Rodrigo Agerri wrote:
> > Hello,
> >
> > On Wed, Oct 8, 2014 at 8:17 AM, Jörn Kottmann <ko...@gmail.com>
> wrote:
> > >
> > > +1 for the first option.
> >
> > Great, I have committed and closed the issue.
> >
> > Thanks!
> >
> > Rodrigo
>
>
>

Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Rodrigo Agerri <ro...@ehu.es>.
Hello,

On Wed, Oct 8, 2014 at 7:32 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> You observed this earlier:
>> Only one issue remains: The requirement to add -factory parameter for
>> the -featuregen parameter to work and its backing-off to default
>> features without warning if the -factory param is not used.
>
> Is that still the case with the fix you applied?

Yes, that is still the case. I have opened an issue about this too.

https://issues.apache.org/jira/browse/OPENNLP-718

Any comments anyone?

> We need to add a test case to be able to detect bugs in the setup of the
> feature generation. I wouldn't be surprised if something similar happens
> again at some point.

Yes, perhaps adding a test for features that are not the default ones,
or even one test per type of feature:

1. A test for local features different from the default ones.
2. A test for features requiring external resources.

That way, the descriptor parsing, resource loading and factory
creation would all be exercised.

@markg, this also addresses your comments.

What do you think?

Cheers,

Rodrigo

Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Jörn Kottmann <ko...@gmail.com>.
Well done, this was a serious regression.

You observed this earlier:
> Only one issue remains: The requirement to add -factory parameter for
> the -featuregen parameter to work and its backing-off to default
> features without warning if the -factory param is not used.

Is that still the case with the fix you applied?

We need to add a test case to be able to detect bugs in the setup of the
feature generation. I wouldn't be surprised if something similar happens
again at some point.

Jörn

On Wed, 2014-10-08 at 17:27 +0100, Rodrigo Agerri wrote:
> Hello,
> 
> On Wed, Oct 8, 2014 at 8:17 AM, Jörn Kottmann <ko...@gmail.com> wrote:
> >
> > +1 for the first option.
> 
> Great, I have committed and closed the issue.
> 
> Thanks!
> 
> Rodrigo



Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Rodrigo Agerri <ro...@ehu.es>.
Hello,

On Wed, Oct 8, 2014 at 8:17 AM, Jörn Kottmann <ko...@gmail.com> wrote:
>
> +1 for the first option.

Great, I have committed and closed the issue.

Thanks!

Rodrigo

Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/06/2014 11:35 PM, Rodrigo Agerri wrote:
> Hi,
>
> On Mon, Oct 6, 2014 at 11:19 PM, Jörn Kottmann <ko...@gmail.com> wrote:
>
>> I see two ways to fix this:
>> - The way you suggested, by extracting the XML descriptor from the
>> TokenNameFinderFactory
>> - Or by returning the XML descriptor as part of the
>> TokenNameFinder.getResources() method.
> The first one is conceptually easier for me as it treats the three
> components of the factory (the resources, the seqcodec and the
> featureGenerator) equally. Also, it takes two lines of code to do.
> Having said that, removing code is always tempting, so the second
> option is also good.


The feature generator descriptor is just a resource too. Anyway, I think
it is good to leave it as it is today, and treat it separately. We can
still easily change that in the future if there is a need.

>> The second option is probably better, because it gives more control to
>> the TokenNameFinderFactory. It would be possible to change the format of
>> the feature generator description with a different factory
>> implementation. And it might simplify the TokenNameFinderModel
>> implementation, since most code dealing with the feature generation
>> could be removed.
> Something I do not fully understand is why it would not be possible in
> the first option to change the feature generator description with a
> different factory implementation.

Yes, you are right. The factory still has full control over parsing the
descriptor.

+1 for the first option.

Sorry for my late reply; I answered this mail already, but suspect it
didn't get sent.

Jörn



Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Rodrigo Agerri <ro...@ehu.es>.
Hi,

On Mon, Oct 6, 2014 at 11:19 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> I see two ways to fix this:
> - The way you suggested, by extracting the XML descriptor from the
> TokenNameFinderFactory
> - Or by returning the XML descriptor as part of the
> TokenNameFinder.getResources() method.

The first one is conceptually easier for me as it treats the three
components of the factory (the resources, the seqcodec and the
featureGenerator) equally. Also, it takes two lines of code to do.
Having said that, removing code is always tempting, so the second
option is also good.

> The second option is probably better, because it gives more control to
> the TokenNameFinderFactory. It would be possible to change the format of
> the feature generator description with a different factory
> implementation. And it might simplify the TokenNameFinderModel
> implementation, since most code dealing with the feature generation
> could be removed.

Something I do not fully understand is why it would not be possible in
the first option to change the feature generator description with a
different factory implementation.

Cheers,

R

Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Jörn Kottmann <ko...@gmail.com>.
Hello,

now I understand it much better. Previously it was
only possible to provide the XML descriptor as bytes or an instance
of a Feature Generator to the train method. 

The method which accepted the XML descriptor then instantiated it and
called the train method which takes a Feature Generator; after the
training was done it updated the model so that it included the XML
descriptor artifact.

With the recent change, the train method taking the Feature Generator
was changed to instead accept the TokenNameFinderFactory. Since then it
no longer writes the XML descriptor into the model.

I see two ways to fix this:
- The way you suggested, by extracting the XML descriptor from the
TokenNameFinderFactory
- Or by returning the XML descriptor as part of the
TokenNameFinder.getResources() method. 

The second option is probably better, because it gives more control to
the TokenNameFinderFactory. It would be possible to change the format of
the feature generator description with a different factory
implementation. And it might simplify the TokenNameFinderModel
implementation, since most code dealing with the feature generation
could be removed.

Any opinions?

Jörn

On Mon, 2014-10-06 at 17:57 +0200, Rodrigo Agerri wrote:
> Hi,
> 
> On Mon, Oct 6, 2014 at 5:41 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> >
> > Isn't that how it is implemented today? The feature generators can't be
> > shared
> > and therefore we have the createFeatureGenerators method in the
> > TokenNameFinderFactory
> > which creates a new feature generator every time one is needed.
> > That one tries to read the xml descriptor from the model and creates the
> > feature generators.
> 
> Yes, but with one exception: it all goes well until it arrives at line
> 361 of NameFinderME:
> 
> return new TokenNameFinderModel(languageCode, nameFinderModel,
> beamSize, null, factory.getResources(), manifestInfoEntries,
> factory.getSequenceCodec());
> 
> That "null" parameter is the featureGenerator. The init() method in
> the TokenNameFinderModel class gets that null and returns the default
> feature generator.
> 
> What is needed is to pass the featureGenerator created by the
> TokenNameFinder.createContext() as a parameter. That is why I added a
> getter in the TokenNameFinderFactory for the private byte[]
> featureGeneratorBytes field. I just added the getter, and to create the
> TokenNameFinderModel above in NameFinderME I write:
> 
>  return new TokenNameFinderModel(languageCode, nameFinderModel,
> beamSize, factory.getFeatureGenerator(), factory.getResources(),
> manifestInfoEntries, factory.getSequenceCodec());
> 
> and it all works as expected.
> 
> > I will try to reproduce the bug you see.
> >
> > How can I do that?
> >
> > First train a model with this command:
> 
> > bin/opennlp TokenNameFinderTrainer -featuregen bigram.xml -factory
> > opennlp.tools.namefind.TokenNameFinderFactory -sequenceCodec BIO
> > -params lang/ml/PerceptronTrainerParams.txt -lang nl -model test.bin
> > -data ~/experiments/nerc/opennlp/data/nl/conll2002/nl_opennlp.testa.train
> >
> > and this feature generator config:
> > <generators>
> >   <cache>
> >     <generators>
> >       <window prevLength = "2" nextLength = "2">
> >         <tokenclass/>
> >       </window>
> >       <window prevLength = "2" nextLength = "2">
> >         <token/>
> >       </window>
> >       <definition/>
> >       <prevmap/>
> >       <bigram/>
> >       <sentence begin="true" end="false"/>
> >       <prefix/>
> >       <suffix/>
> >     </generators>
> >   </cache>
> > </generators>
> >
> > Did you use the command line tool for the evaluation too?
> > Maybe you can post the command for that.
> 
> Yes, and then try to train with the default featureGenerator in the
> lang/en/namefind directory.
> 
> bin/opennlp TokenNameFinderEvaluator -model test.bin -data
> ~/experiments/nerc/opennlp/data/nl/conll2002/opennlp-nl.testb
> 
> Cheers,
> 
> Rodrigo



Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Rodrigo Agerri <ro...@ehu.es>.
Hi,

On Mon, Oct 6, 2014 at 5:41 PM, Jörn Kottmann <ko...@gmail.com> wrote:
>
> Isn't that how it is implemented today? The feature generators can't be
> shared
> and therefore we have the createFeatureGenerators method in the
> TokenNameFinderFactory
> which creates a new feature generator every time one is needed.
> That one tries to read the xml descriptor from the model and creates the
> feature generators.

Yes, but with one exception: it all goes well until it arrives at line
361 of NameFinderME:

return new TokenNameFinderModel(languageCode, nameFinderModel,
beamSize, null, factory.getResources(), manifestInfoEntries,
factory.getSequenceCodec());

That "null" parameter is the featureGenerator. The init() method in
the TokenNameFinderModel class gets that null and returns the default
feature generator.

What is needed is to pass the featureGenerator created by the
TokenNameFinder.createContext() as a parameter. That is why I added a
getter in the TokenNameFinderFactory for the private byte[]
featureGeneratorBytes field. I just added the getter, and to create the
TokenNameFinderModel above in NameFinderME I write:

 return new TokenNameFinderModel(languageCode, nameFinderModel,
beamSize, factory.getFeatureGenerator(), factory.getResources(),
manifestInfoEntries, factory.getSequenceCodec());

and it all works as expected.

> I will try to reproduce the bug you see.
>
> How can I do that?
>
> First train a model with this command:

> bin/opennlp TokenNameFinderTrainer -featuregen bigram.xml -factory
> opennlp.tools.namefind.TokenNameFinderFactory -sequenceCodec BIO
> -params lang/ml/PerceptronTrainerParams.txt -lang nl -model test.bin
> -data ~/experiments/nerc/opennlp/data/nl/conll2002/nl_opennlp.testa.train
>
> and this feature generator config:
> <generators>
>   <cache>
>     <generators>
>       <window prevLength = "2" nextLength = "2">
>         <tokenclass/>
>       </window>
>       <window prevLength = "2" nextLength = "2">
>         <token/>
>       </window>
>       <definition/>
>       <prevmap/>
>       <bigram/>
>       <sentence begin="true" end="false"/>
>       <prefix/>
>       <suffix/>
>     </generators>
>   </cache>
> </generators>
>
> Did you use the command line tool for the evaluation too?
> Maybe you can post the command for that.

Yes, and then try to train with the default featureGenerator in the
lang/en/namefind directory.

bin/opennlp TokenNameFinderEvaluator -model test.bin -data
~/experiments/nerc/opennlp/data/nl/conll2002/opennlp-nl.testb

Cheers,

Rodrigo

Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/06/2014 04:49 PM, Rodrigo Agerri wrote:
> As I said, I have issue 717 solved by adding a getter for the
> featureGenerator in the TokenNameFinderFactory and using that getter
> to correctly parametrize the creation of the TokenNameFinderModel
> after training.

Isn't that how it is implemented today? The feature generators can't be 
shared
and therefore we have the createFeatureGenerators method in the 
TokenNameFinderFactory
which creates a new feature generator every time one is needed.
That one tries to read the xml descriptor from the model and creates the 
feature generators.

You say it uses the default feature generation, that can only happen if 
the createFeatureGenerator
method returns null. Is that true in your case?

In which place, exactly, did you add the getter method to fix the 
problem, and where in TokenNameFinderModel
did you call it? The TokenNameFinderFactory doesn't have an instance 
variable called featureGenerator.
I am just trying to understand how your proposed fix works.

Usually the model is created by using one of the constructors which take 
an InputStream,
File or URL. Did you use a different constructor to create the model?

I will try to reproduce the bug you see.

How can I do that?

First train a model with this command:
bin/opennlp TokenNameFinderTrainer -featuregen bigram.xml -factory
opennlp.tools.namefind.TokenNameFinderFactory -sequenceCodec BIO
-params lang/ml/PerceptronTrainerParams.txt -lang nl -model test.bin
-data ~/experiments/nerc/opennlp/data/nl/conll2002/nl_opennlp.testa.train

and this feature generator config:
<generators>
   <cache>
     <generators>
       <window prevLength = "2" nextLength = "2">
         <tokenclass/>
       </window>
       <window prevLength = "2" nextLength = "2">
         <token/>
       </window>
       <definition/>
       <prevmap/>
       <bigram/>
       <sentence begin="true" end="false"/>
       <prefix/>
       <suffix/>
     </generators>
   </cache>
</generators>
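
The <window prevLength="2" nextLength="2"> elements in this descriptor
wrap an inner generator so that its features are also produced for up to
two previous and two following tokens, tagged with their offset. Roughly
like this self-contained sketch (a simplification of the idea, not the
actual OpenNLP implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class WindowFeatureSketch {

    /**
     * Emits a token feature for the current position plus offset-tagged
     * copies for up to prevLength previous and nextLength following
     * tokens, mimicking what a <window> descriptor element does for an
     * inner <token/> generator.
     */
    public static List<String> windowFeatures(String[] tokens, int index,
                                              int prevLength, int nextLength) {
        List<String> features = new ArrayList<>();
        features.add("w=" + tokens[index]);
        for (int i = 1; i <= prevLength && index - i >= 0; i++) {
            features.add("p" + i + "w=" + tokens[index - i]);
        }
        for (int i = 1; i <= nextLength && index + i < tokens.length; i++) {
            features.add("n" + i + "w=" + tokens[index + i]);
        }
        return features;
    }
}
```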

Did you use the command line tool for the evaluation too?
Maybe you can post the command for that.

Jörn

Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Rodrigo Agerri <ro...@ehu.es>.
Cool,

As I said, I have issue 717 solved by adding a getter for the
featureGenerator in the TokenNameFinderFactory and using that getter to
correctly parametrize the creation of the TokenNameFinderModel after
training.

But maybe another solution is possible, of course.

R

On Mon, Oct 6, 2014 at 4:46 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 10/06/2014 02:04 PM, Rodrigo Agerri wrote:
>>
>> All these problems are solved as per this issue:
>>
>> https://issues.apache.org/jira/browse/OPENNLP-717
>>
>> Only one issue remains: The requirement to add -factory parameter for
>> the -featuregen parameter to work and its backing-off to default
>> features without warning if the -factory param is not used.
>
>
> I will have a look at the code to see what went wrong in OPENNLP-717; I
> suspect that the bug was introduced through some recent changes when we
> added the factory support for the name finder.
>
> Jörn

Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/06/2014 02:04 PM, Rodrigo Agerri wrote:
> All these problems are solved as per this issue:
>
> https://issues.apache.org/jira/browse/OPENNLP-717
>
> Only one issue remains: The requirement to add -factory parameter for
> the -featuregen parameter to work and its backing-off to default
> features without warning if the -factory param is not used.

I will have a look at the code to see what went wrong in OPENNLP-717; I
suspect that the bug was introduced through some recent changes when we
added the factory support for the name finder.

Jörn

Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Rodrigo Agerri <ra...@apache.org>.
Hi again,

All these problems are solved as per this issue:

https://issues.apache.org/jira/browse/OPENNLP-717

Only one issue remains: the requirement to add the -factory parameter
for the -featuregen parameter to work, and its backing off to default
features without warning if the -factory param is not used.

Thanks,

Rodrigo

On Sat, Oct 4, 2014 at 12:53 AM, Rodrigo Agerri <ro...@ehu.es> wrote:
> Hi,
>
> As a follow-up, it turns out that currently the -featuregen parameter
> only works if you also provide a subclass via the -factory parameter.
> I do not know if that is intended. Also, I
> have noticed a very weird behaviour: I pass several descriptors via
> CLI (starting with token features only, then adding tokenclass, etc.)
> and it all goes well until I add either the Prefix or the
> SuffixFeatureGenerator, at which point the performance drops alarmingly
> to 49.65 F1 with prefix and suffix added to the default descriptor:
>
> bin/opennlp TokenNameFinderTrainer -featuregen bigram.xml -factory
> opennlp.tools.namefind.TokenNameFinderFactory -sequenceCodec BIO
> -params lang/ml/PerceptronTrainerParams.txt -lang nl -model test.bin
> -data ~/experiments/nerc/opennlp/data/nl/conll2002/nl_opennlp.testa.train
>
> I get this behaviour with all four CoNLL02 and CoNLL03 datasets.
>
> <generators>
>   <cache>
>     <generators>
>       <window prevLength = "2" nextLength = "2">
>         <tokenclass/>
>       </window>
>       <window prevLength = "2" nextLength = "2">
>         <token/>
>       </window>
>       <definition/>
>       <prevmap/>
>       <bigram/>
>       <sentence begin="true" end="false"/>
>       <prefix/>
>       <suffix/>
>     </generators>
>   </cache>
> </generators>
>
> Cheers,
>
> Rodrigo
>
> On Fri, Oct 3, 2014 at 5:55 PM, Rodrigo Agerri <ro...@ehu.es> wrote:
>> Hi Jörn,
>>
>> On Fri, Oct 3, 2014 at 12:40 PM, Jörn Kottmann <ko...@gmail.com> wrote:
>>>
>>> There are two things you need to do:
>>> 1. Implement the feature generators
>>> - Implement AdaptiveFeatureGenerator or extend CustomFeatureGenerator if you
>>> need to pass parameters to it
>>
>> OK, let's say I have this descriptor:
>>
>> <generators>
>>   <cache>
>>     <generators>
>>       <window prevLength="2" nextLength="2">
>>         <token />
>>       </window>
>>       <custom class="es.ehu.si.ixa.pipe.nerc.features.Prefix34FeatureGenerator"
>> />
>>     </generators>
>>   </cache>
>> </generators>
>>
>> Now, if I understand correctly the implementation (and your comments):
>>
>> 1. I should just create a Prefix34FeatureGenerator class extending
>> FeatureGeneratorAdapter.
>> 2. If I wanted to pass parameters, e.g. descriptor attributes, then I
>> should extend CustomFeatureGenerator.
>> 3. If I load such descriptor as argument of -featuregen in the CLI,
>> the CLI should complain if such class is not in the classpath, I
>> guess. If it is in the classpath, then it should use the custom
>> generator.
>> 4. As it is now, no matter what value you pass to -featuregen, it
>> always trains with the default features. It does not complain even if the
>> custom feature generator is not well-formed. Even if I only pass the
>> token features, it still loads the default generator. With version
>> 1.5.3 it works fine though. I am looking into it, but any hints
>> welcome :)
>> 5. When I do this programmatically, e.g., load the featuregenerator
>> descriptor to an extension of the TokenNameFinderFactory, it seems to
>> load the custom generators; the GeneratorFactory loads the descriptor
>> I pass, e.g., if only tokens then it trains successfully only with
>> tokens. However, if I pass a custom generator, it does not complain;
>> it trains and the performance drops to 40 F1. For the record, I build
>> the descriptor programmatically like this:
>>
>>  Element prefixFeature = new Element("custom");
>>  prefixFeature.setAttribute("class", Prefix34FeatureGenerator.class.getName());
>>  generators.addContent(prefixFeature);
>>
>> and then the GeneratorFactory does get it without errors.
>>
>>> 2. Implement support for loading and serializing the data they need
>>> - This class should implement SerializableArtifact
>>> - And if you want to load and use it, the Feature Generator should
>>> implement ArtifactToSerializerMapper, which tells the loader which
>>> class to use to read the data file
>>
>> This is only for the clustering features resources and such, I guess.
>>
>>> The above is the procedure you should use if you want to have a real custom
>>> feature generator which is not part of
>>> the OpenNLP Tools jar.
>>
>> Yes, what I do is include OpenNLP as a Maven dependency in an uber jar,
>> e.g., with all classes inside, including opennlp and my custom feature
>> generators. The classpath should be ok in this case, but I still
>> cannot make them work.
>>
>>>
>>>> 6. *Some* of the new features work. If an Element name in the
>>>> descriptor does not match in the GeneratorFactory, then the
>>>> TokenNameFinderFactory.createFeatureGenerators() gives a null and the
>>>> TokenNameFinderFactory.createContextGenerator() automatically stops
>>>> the feature creation and goes for the
>>>> NameFinderME.createFeatureGenerator().
>>>> Is this the desired behaviour? Perhaps we could add a log somewhere?
>>>> To inform of the backoff to the default features if one descriptor
>>>> element does not match?
>>>
>>>
>>> That sounds really bad. If there is a problem in the mapping it should fail
>>> hard and throw an
>>> exception. The user should be forced to decide by himself what to do, either
>>> fix his descriptor
>>> or use defaults.
>>
>> I can open an issue and look into it.
>>
>>> The idea is that we always use the xml descriptor to define the feature
>>> generation, that way we can have different
>>> configurations without changing the OpenNLP code itself, and don't need
>>> special user code to integrate a
>>> customized name finder model. If a model makes use of external classes these
>>> of course need to be on the classpath
>>> since we can't ship them as a part of the model.
>>
>> OK, but I think what I did above is what you meant, is it not?
>>
>> Thanks,
>>
>> Rodrigo

Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Rodrigo Agerri <ro...@ehu.es>.
Hi,

As a follow-up, it turns out that currently the -featuregen parameter
only takes effect if you also provide a subclass via the -factory
parameter. I do not know if that is intended. Also, I have noticed a
very weird behaviour: I pass several descriptors via the CLI (starting
with token features only, then adding tokenclass, etc.) and it all
goes well until I add the Prefix or the SuffixFeatureGenerator, at
which point the performance drops alarmingly to 49.65 F1 with prefix
and suffix added to the default descriptor:

bin/opennlp TokenNameFinderTrainer -featuregen bigram.xml -factory
opennlp.tools.namefind.TokenNameFinderFactory -sequenceCodec BIO
-params lang/ml/PerceptronTrainerParams.txt -lang nl -model test.bin
-data ~/experiments/nerc/opennlp/data/nl/conll2002/nl_opennlp.testa.train

I get this behaviour with all conll03 and conll02 four datasets.

<generators>
  <cache>
    <generators>
      <window prevLength = "2" nextLength = "2">
        <tokenclass/>
      </window>
      <window prevLength = "2" nextLength = "2">
        <token/>
      </window>
      <definition/>
      <prevmap/>
      <bigram/>
      <sentence begin="true" end="false"/>
      <prefix/>
      <suffix/>
    </generators>
  </cache>
</generators>
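For reference, this is roughly the kind of feature strings a prefix/suffix
generator produces (an illustrative, self-contained sketch only; the exact
feature string format used by OpenNLP's built-in generators may differ):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: emits prefix and suffix features of length 1..4,
// the way a <prefix/> or <suffix/> descriptor element conceptually works.
public class PrefixSuffixSketch {

    // Prefix features of increasing length, capped at 4 characters.
    static List<String> prefixFeatures(String token) {
        List<String> features = new ArrayList<>();
        for (int i = 1; i <= 4 && i <= token.length(); i++) {
            features.add("pre=" + token.substring(0, i));
        }
        return features;
    }

    // Suffix features of increasing length, capped at 4 characters.
    static List<String> suffixFeatures(String token) {
        List<String> features = new ArrayList<>();
        for (int i = 1; i <= 4 && i <= token.length(); i++) {
            features.add("suf=" + token.substring(token.length() - i));
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(prefixFeatures("Amsterdam")); // [pre=A, pre=Am, pre=Ams, pre=Amst]
        System.out.println(suffixFeatures("Amsterdam")); // [suf=m, suf=am, suf=dam, suf=rdam]
    }
}
```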

Cheers,

Rodrigo

On Fri, Oct 3, 2014 at 5:55 PM, Rodrigo Agerri <ro...@ehu.es> wrote:
> Hi Jörn,
>
> On Fri, Oct 3, 2014 at 12:40 PM, Jörn Kottmann <ko...@gmail.com> wrote:
>>
>> There are two things you need to do:
>> 1. Implement the feature generators
>> - Implement AdaptiveFeatureGenerator or extend CustomFeatureGenerator if you
>> need to pass parameters to it
>
> OK, lets say I have this descriptor:
>
> <generators>
>   <cache>
>     <generators>
>       <window prevLength="2" nextLength="2">
>         <token />
>       </window>
>       <custom class="es.ehu.si.ixa.pipe.nerc.features.Prefix34FeatureGenerator"
> />
>     </generators>
>   </cache>
> </generators>
>
> Now, if I understand correctly the implementation (and your comments):
>
> 1. I should just create a Prefix34FeatureGenerator class extending
> FeatureGeneratorAdapter.
> 2. If I wanted to pass parameters, e.g. descriptor attributes, then I
> should extend CustomFeatureGenerator.
> 3. If I load such descriptor as argument of -featuregen in the CLI,
> the CLI should complain if such class is not in the classpath, I
> guess. If it is in the classpath, then it should use the custom
> generator.
> 4. As it is now, no matter what value you pass to -featuregen, it
> always trains with the default features. It does not complain even if
> the custom feature generator is not well-formed. Even if I only pass
> the token features, it still loads the default generator. With version
> 1.5.3 it works fine though. I am looking into it, but any hints
> welcome :)
> 5. When I do this programmatically, e.g., load the feature generator
> descriptor into an extension of the TokenNameFinderFactory, it seems to
> load the custom generators, and the GeneratorFactory loads the descriptor
> I pass, e.g., if only tokens then it trains successfully only with
> tokens. However, if I pass a custom generator, it does not complain;
> it trains and the performance drops to 40 F1. For the record, I build
> the descriptor programmatically like this
>
>  Element prefixFeature = new Element("custom");
>  prefixFeature.setAttribute("class", Prefix34FeatureGenerator.class.getName());
>  generators.addContent(prefixFeature);
>
> and then the GeneratorFactory does get it without errors.
>
>> 2. Implement support for loading and serializing the data they need
>> - This class should implement SerializableArtifact
>> - And if you want to load and use it, the Feature Generator should implement
>> ArtifactToSerializerMapper, which tells
>> the loader which class to use to read the data file
>
> This is only for the clustering features resources and such, I guess.
>
>> The above is the procedure you should use if you want to have a real custom
>> feature generator which is not part of
>> the OpenNLP Tools jar.
>
> Yes, what I do is include opennlp as a maven dependency in an uber jar,
> i.e., with all classes inside, including opennlp and my custom feature
> generators. The classpath should be ok in this case, but I still
> cannot make them work.
>
>>
>>> 6. *Some* of the new features work. If an Element name in the
>>> descriptor does not match in the GeneratorFactory, then
>>> TokenNameFinderFactory.createFeatureGenerators() returns null and
>>> TokenNameFinderFactory.createContextGenerator() automatically stops
>>> the feature creation and falls back to
>>> NameFinderME.createFeatureGenerator().
>>> Is this the desired behaviour? Perhaps we could add a log message
>>> somewhere to report the backoff to the default features if a
>>> descriptor element does not match?
>>
>>
>> That sounds really bad. If there is a problem in the mapping it should fail
>> hard and throw an
>> exception. The user should be forced to decide by himself what to do, either
>> fix his descriptor
>> or use defaults.
>
> I can open an issue and look into it.
>
>> The idea is that we always use the xml descriptor to define the feature
>> generation, that way we can have different
>> configurations without changing the OpenNLP code itself, and don't need
>> special user code to integrate a
>> customized name finder model. If a model makes use of external classes these
>> of course need to be on the classpath
>> since we can't ship them as a part of the model.
>
> OK, but I think what I did above is what you meant, is it not?
>
> Thanks,
>
> Rodrigo

Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Posted by Jörn Kottmann <ko...@gmail.com>.
On 10/03/2014 11:58 AM, Rodrigo Agerri wrote:
> I have implemented a number of new features for the name finder. These
> include Brown clusters features (duplicated per Brown path for each
> feature activated involving a token) and Clark cluster features
> (similar to the WordClusterFeatureGenerator currently available) among
> other local extra features which interact well with the clustering
> ones.
>
> I think it will be nice to include them before the new release. I will
> open issues about each of them. What do you think?

Yes please open issues for them. It would be really nice to receive them 
as a contribution.

There are two things you need to do:
1. Implement the feature generators
- Implement AdaptiveFeatureGenerator or extend CustomFeatureGenerator if 
you need to pass parameters to it

2. Implement support for loading and serializing the data they need
- This class should implement SerializableArtifact
- And if you want to load and use it, the Feature Generator should implement
ArtifactToSerializerMapper, which tells
the loader which class to use to read the data file

The above is the procedure you should use if you want to have a real 
custom feature generator which is not part of
the OpenNLP Tools jar.
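As a self-contained sketch of step 1 (the interface here is only a stand-in
that mirrors the createFeatures method of AdaptiveFeatureGenerator so the
sketch compiles on its own, and the feature strings are invented for
illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for opennlp.tools.util.featuregen.AdaptiveFeatureGenerator so the
// sketch is self-contained; the real interface also declares the adaptive
// data callbacks (updateAdaptiveData/clearAdaptiveData).
interface FeatureGenerator {
    void createFeatures(List<String> features, String[] tokens, int index,
            String[] previousOutcomes);
}

// Emits prefix features of length 3 and 4 for the current token; a class like
// this is what a <custom class="..."/> descriptor element would point at.
class Prefix34FeatureGenerator implements FeatureGenerator {
    @Override
    public void createFeatures(List<String> features, String[] tokens, int index,
            String[] previousOutcomes) {
        String token = tokens[index];
        for (int length = 3; length <= 4 && length <= token.length(); length++) {
            features.add("pre34=" + token.substring(0, length));
        }
    }
}

public class CustomGeneratorDemo {
    public static void main(String[] args) {
        List<String> features = new ArrayList<>();
        new Prefix34FeatureGenerator().createFeatures(
                features, new String[] {"Rotterdam"}, 0, null);
        System.out.println(features); // [pre34=Rot, pre34=Rott]
    }
}
```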

When you contribute it, things are slightly different. You should add an 
XmlFeatureGeneratorFactory inside the GeneratorFactory
class. This factory creates the feature generator based on a defined xml 
element inside the descriptor.
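Conceptually it is a mapping from descriptor element names to small factories,
roughly like this (a simplified sketch, not the actual OpenNLP code; strings
stand in for the created feature generator instances):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Simplified sketch of the element-name -> factory idea; the real
// GeneratorFactory parses DOM Elements and passes their attributes and
// resources to each registered factory.
public class GeneratorRegistrySketch {

    static final Map<String, Supplier<String>> FACTORIES = new HashMap<>();
    static {
        // Each entry maps a descriptor element name to a generator factory.
        FACTORIES.put("token", () -> "TokenFeatureGenerator");
        FACTORIES.put("tokenclass", () -> "TokenClassFeatureGenerator");
        FACTORIES.put("prefix", () -> "PrefixFeatureGenerator");
    }

    static String create(String elementName) {
        Supplier<String> factory = FACTORIES.get(elementName);
        if (factory == null) {
            // Fail hard instead of silently backing off to the defaults.
            throw new IllegalArgumentException("Unexpected element: " + elementName);
        }
        return factory.get();
    }

    public static void main(String[] args) {
        System.out.println(create("token"));
        System.out.println(create("prefix"));
    }
}
```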

> 6. *Some* of the new features work. If an Element name in the
> descriptor does not match in the GeneratorFactory, then
> TokenNameFinderFactory.createFeatureGenerators() returns null and
> TokenNameFinderFactory.createContextGenerator() automatically stops
> the feature creation and falls back to
> NameFinderME.createFeatureGenerator().
> Is this the desired behaviour? Perhaps we could add a log message
> somewhere to report the backoff to the default features if a
> descriptor element does not match?

That sounds really bad. If there is a problem in the mapping it should 
fail hard and throw an
exception. The user should be forced to decide by himself what to do, 
either fix his descriptor
or use defaults.

Steps 4 and 5 that you describe should not be necessary to add new 
feature generators.

The idea is that we always use the xml descriptor to define the feature 
generation, that way we can have different
configurations without changing the OpenNLP code itself, and don't need 
special user code to integrate a
customized name finder model. If a model makes use of external classes 
these of course need to be on the classpath
since we can't ship them as a part of the model.

HTH,
Jörn