Posted to dev@opennlp.apache.org by "william.colen@gmail.com" <wi...@gmail.com> on 2011/06/14 04:23:54 UTC

Custom feature generators

Hi,

Currently, custom feature generators that can be passed from the command
line are implemented only for NameFinder, but it would be very nice to
have them for all tools.
The Thai sentence detector customization is nice and simple, but to do
something for other languages the user would need to branch the code. We
should allow users to pass a factory class name from the command line. Maybe
we could do it for every tool that doesn't use a sequence feature generator.
It would also be nice to save the factory class name to the model to make
sure we are using the same feature generator during runtime and evaluation.
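
A minimal sketch of this idea, assuming the factory class name is stored as
a plain key/value entry alongside the model and resolved by reflection when
the model is loaded. Every name here (FeatureGeneratorFactory, FactoryLoader,
the "factory" key) is hypothetical and only illustrates the mechanism; none
of it is the OpenNLP API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical contract for a pluggable feature generator.
interface FeatureGeneratorFactory {
    List<String> features(String[] tokens, int index);
}

// Used when the model does not name a factory, so existing models keep working.
class DefaultFeatureGeneratorFactory implements FeatureGeneratorFactory {
    public List<String> features(String[] tokens, int index) {
        List<String> out = new ArrayList<>();
        out.add("w=" + tokens[index]);
        return out;
    }
}

// A user-supplied factory that would live in an extension jar on the classpath.
class SuffixFeatureGeneratorFactory implements FeatureGeneratorFactory {
    public List<String> features(String[] tokens, int index) {
        String token = tokens[index];
        List<String> out = new ArrayList<>();
        out.add("w=" + token);
        out.add("suf=" + token.substring(Math.max(0, token.length() - 2)));
        return out;
    }
}

class FactoryLoader {
    // Assumed key under which the class name is saved with the model.
    static final String FACTORY_KEY = "factory";

    // Instantiates the named factory, or falls back to the default one.
    static FeatureGeneratorFactory load(String className) {
        if (className == null || className.isEmpty()) {
            return new DefaultFeatureGeneratorFactory();
        }
        try {
            return Class.forName(className)
                    .asSubclass(FeatureGeneratorFactory.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load factory: " + className, e);
        }
    }

    // Reads the class name back from the properties saved with the model.
    static FeatureGeneratorFactory fromModel(Properties manifest) {
        return load(manifest.getProperty(FACTORY_KEY));
    }
}

class FactoryDemo {
    public static void main(String[] args) {
        Properties manifest = new Properties();
        // At training time the class name would come from a command line argument.
        manifest.setProperty(FactoryLoader.FACTORY_KEY,
                SuffixFeatureGeneratorFactory.class.getName());
        FeatureGeneratorFactory factory = FactoryLoader.fromModel(manifest);
        System.out.println(factory.features(new String[] {"walking"}, 0));
    }
}
```

Because only the class name is stored, the same factory is resolved at
training, evaluation and runtime, provided the class is on the classpath.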

What do you think? Maybe you have thought of a better solution for that.

Thanks
William

Re: Custom feature generators

Posted by James Kosin <ja...@gmail.com>.

Hi again,

I think Jorn was going to expand this to the other models as well once
we got a handle on the XML and creation.  I'll have to look into that
again and see if we are saving the information to the model... which
would allow us to reload the model with the same feature generator as
the training.

That aside, we wouldn't be able to support everything, and it would take
some creative work to support the newer infrastructure.  I don't
think it would be a bad thing.

I think Jorn chose the NameFinder first because it was simpler to
expand to the XML architecture, and many users already need to define
their own series of feature generators.

James

On 6/13/2011 11:35 PM, william.colen@gmail.com wrote:
> Hi James,
>
> On Mon, Jun 13, 2011 at 11:32 PM, James Kosin <ja...@gmail.com> wrote:
>
>> On 6/13/2011 10:23 PM, william.colen@gmail.com wrote:
>>> Hi,
>>>
>>> Currently we only have implemented custom feature generators that we can
>>> pass from command line only for NameFinder, but it would be very nice to
>>> have it for all tools.
>>> The Thai sentence detector customization is nice and simple, but to do
>>> something for other languages the user would need to branch the code. We
>>> should allow users to pass a factory class name from command line. Maybe
>> we
>>> could do it for every tool that doesn't use sequence feature generator.
>> Also
>>> would be nice to save the factory class name to the model to make sure we
>>> are using the same feature generator during runtime and evaluation.
>>>
>>> What do you think? Maybe you have thought a better solution for that.
>>>
>>> Thanks
>>> William
>>>
>> William,
>>
>> We discussed various options, unfortunately, most involved some security
>> risk for the Java engine; including allowing the saving of the actual
>> feature generator constructor itself to the model.  Maybe the XML option
>> may be a better route for the long run.  We could even save the copy of
>> the XML document in the model itself.  But again that opens us up for
>> issues if someone writes bad XML to cause issues.
>>
> Yes, it is very nice with the NameFinder because we can reuse code using the
> XML descriptors.
>
>
>> Maybe, we could have the feature generator a generic class that needed a
>> constructor.  Then each implementing language could have a new
>> constructor that correctly built the feature generator.  Unfortunately,
>> it means a change would break any models.
>>
> I can't see why it would break the models. We could by default use the
> current feature generators.
> If we use factory to create the feature generator, the user is free to
> create it using any resource (another dictionary implementation for example)
>
>> We may need to re-open the issue when Jorn comes back or at least get
>> another discussion going so we can try and weed out the issues with the
>> options available.
>>
> Thanks
> William
>

Re: Custom feature generators

Posted by "william.colen@gmail.com" <wi...@gmail.com>.

Hi James,

On Mon, Jun 13, 2011 at 11:32 PM, James Kosin <ja...@gmail.com> wrote:

> On 6/13/2011 10:23 PM, william.colen@gmail.com wrote:
> > Hi,
> >
> > Currently we only have implemented custom feature generators that we can
> > pass from command line only for NameFinder, but it would be very nice to
> > have it for all tools.
> > The Thai sentence detector customization is nice and simple, but to do
> > something for other languages the user would need to branch the code. We
> > should allow users to pass a factory class name from command line. Maybe
> we
> > could do it for every tool that doesn't use sequence feature generator.
> Also
> > would be nice to save the factory class name to the model to make sure we
> > are using the same feature generator during runtime and evaluation.
> >
> > What do you think? Maybe you have thought a better solution for that.
> >
> > Thanks
> > William
> >
> William,
>
> We discussed various options, unfortunately, most involved some security
> risk for the Java engine; including allowing the saving of the actual
> feature generator constructor itself to the model.  Maybe the XML option
> may be a better route for the long run.  We could even save the copy of
> the XML document in the model itself.  But again that opens us up for
> issues if someone writes bad XML to cause issues.
>

Yes, it is very nice with the NameFinder because we can reuse code using the
XML descriptors.


>
> Maybe, we could have the feature generator a generic class that needed a
> constructor.  Then each implementing language could have a new
> constructor that correctly built the feature generator.  Unfortunately,
> it means a change would break any models.
>

I can't see why it would break the models. We could by default use the
current feature generators.
If we use a factory to create the feature generator, the user is free to
create it using any resource (another dictionary implementation, for example).

> We may need to re-open the issue when Jorn comes back or at least get
> another discussion going so we can try and weed out the issues with the
> options available.
>

Thanks
William

Re: Custom feature generators

Posted by James Kosin <ja...@gmail.com>.

On 6/13/2011 10:23 PM, william.colen@gmail.com wrote:
> Hi,
>
> Currently we only have implemented custom feature generators that we can
> pass from command line only for NameFinder, but it would be very nice to
> have it for all tools.
> The Thai sentence detector customization is nice and simple, but to do
> something for other languages the user would need to branch the code. We
> should allow users to pass a factory class name from command line. Maybe we
> could do it for every tool that doesn't use sequence feature generator. Also
> would be nice to save the factory class name to the model to make sure we
> are using the same feature generator during runtime and evaluation.
>
> What do you think? Maybe you have thought a better solution for that.
>
> Thanks
> William
>
William,

We discussed various options; unfortunately, most involved some security
risk for the Java engine, including saving the actual feature generator
constructor itself to the model.  The XML option may be a better route
in the long run.  We could even save a copy of the XML document in the
model itself.  But again, that opens us up to problems if someone writes
bad XML.

Maybe we could make the feature generator a generic class that needs a
constructor.  Then each implementing language could have a new
constructor that correctly builds the feature generator.  Unfortunately,
such a change would break existing models.

We may need to re-open the issue when Jorn comes back or at least get
another discussion going so we can try and weed out the issues with the
options available.

James

Re: Custom feature generators

Posted by "william.colen@gmail.com" <wi...@gmail.com>.

Hi, Jörn,

On Tue, Feb 7, 2012 at 12:56 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 2/7/12 3:35 PM, william.colen@gmail.com wrote:
>
>> And what about sequence validators? How to alternate from the default one?
>>
>
> Maybe we should make a default Factory which people can sub-class if they
> want
> to modify the sequence validator they create a different one. Anyway that
> could
> also be done via sub-classing the component itself (the way we are
> currently doing it)
> and then the Factory would only be responsible to instantiate the
> sub-classed component.
>
>
>
I like the option of sub-classing the component itself; it would be more
flexible. But what about the static train methods?

William

Re: Custom feature generators

Posted by Aliaksandr Autayeu <al...@autayeu.com>.

> I would like to work on that now, passing a Factory class name to the CLI
> tools and saving it to the model as a configuration.
> Do you still think it is a good idea? Or we should find a better way to
> load custom feature generator and custom sequence validators? I would like
> to do it for SentenceDetector and POS Tagger for now.
>
+1


> A very important point is that we can reuse the code to instantiate a
> component
> over and over again without modifying it for a customization.
> This way all the models will work anywhere OpenNLP is integrated and
> the extension jar files are on the classpath.
>

I like these two. I needed this myself a year or so ago, had to invent
workarounds.

Aliaksandr

Re: Custom feature generators

Posted by Jörn Kottmann <ko...@gmail.com>.

On 2/7/12 3:35 PM, william.colen@gmail.com wrote:
> And what about sequence validators? How to alternate from the default one?

Maybe we should make a default Factory which people can sub-class; if
they want to modify the sequence validator, they create a different one.
Anyway, that could also be done via sub-classing the component itself
(the way we are currently doing it), and then the Factory would only be
responsible for instantiating the sub-classed component.
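
A rough sketch of the first option, assuming a default factory with one
overridable method per customization point. The names ComponentFactory,
TagSequenceValidator and NoRepeatFactory are made up for illustration, and
the validator signature is deliberately simpler than anything in OpenNLP.

```java
// Hypothetical, simplified validator: decides whether a candidate tag
// may follow the tags chosen so far.
interface TagSequenceValidator {
    boolean valid(String[] previousTags, String candidate);
}

// Default factory: accepts every sequence. Users sub-class it and
// override only the parts they want to change.
class ComponentFactory {
    TagSequenceValidator createSequenceValidator() {
        return (previousTags, candidate) -> true;
    }
}

// Example customization: the same tag may not appear twice in a row.
class NoRepeatFactory extends ComponentFactory {
    @Override
    TagSequenceValidator createSequenceValidator() {
        return (previousTags, candidate) ->
                previousTags.length == 0
                        || !previousTags[previousTags.length - 1].equals(candidate);
    }
}

class ValidatorDemo {
    public static void main(String[] args) {
        TagSequenceValidator validator = new NoRepeatFactory().createSequenceValidator();
        System.out.println(validator.valid(new String[] {"DET"}, "NOUN"));
        System.out.println(validator.valid(new String[] {"DET"}, "DET"));
    }
}
```

With this shape the component code that calls createSequenceValidator()
stays unchanged; only the factory instance differs between setups.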

> The factory should be used to load custom resources, like a different
> implementation of a dictionary, am I right?
>
Yes, but it could also be something else.

A very important point is that we can reuse the code to instantiate a 
component
over and over again without modifying it for a customization.
This way all the models will work anywhere OpenNLP is integrated and
the extension jar files are on the classpath.

Jörn

Re: Custom feature generators

Posted by "william.colen@gmail.com" <wi...@gmail.com>.

And what about sequence validators? How can we switch from the default one?

The factory should be used to load custom resources, like a different
implementation of a dictionary, am I right?

Thank you,
William

On Tue, Feb 7, 2012 at 11:57 AM, Joern Kottmann <ko...@gmail.com> wrote:

> Yes, let's see what we could do.
>
> The name finder already supports custom feature generation,
> the same feature generation code could be reused by the POS Tagger.
> This is actually already half done.
>
> One of the current limitations is that we cannot store "custom" resources
> in
> a model. If we specify some kind of Factory class it would be nice if it
> can help
> us to locate the Artifact Serializer for a custom resource.
>
> We could define one Factory class per component which is able to influence
> how this component is created from the model.
>
> What do you think?
>
> Jörn
>
> On Tue, Feb 7, 2012 at 2:17 PM, william.colen@gmail.com <
> william.colen@gmail.com> wrote:
>
> > Hi,
> >
> > I would like to work on that now, passing a Factory class name to the CLI
> > tools and saving it to the model as a configuration.
> > Do you still think it is a good idea? Or we should find a better way to
> > load custom feature generator and custom sequence validators? I would
> like
> > to do it for SentenceDetector and POS Tagger for now.
> >
> > Thanks,
> > William
> >
> > On Tue, Jun 21, 2011 at 11:58 AM, Jörn Kottmann <ko...@gmail.com>
> > wrote:
> >
> > > On 6/14/11 4:23 AM, william.colen@gmail.com wrote:
> > >
> > >> Hi,
> > >>
> > >> Currently we only have implemented custom feature generators that we
> can
> > >> pass from command line only for NameFinder, but it would be very nice
> to
> > >> have it for all tools.
> > >> The Thai sentence detector customization is nice and simple, but to do
> > >> something for other languages the user would need to branch the code.
> We
> > >> should allow users to pass a factory class name from command line.
> Maybe
> > >> we
> > >> could do it for every tool that doesn't use sequence feature
> generator.
> > >> Also
> > >> would be nice to save the factory class name to the model to make sure
> > we
> > >> are using the same feature generator during runtime and evaluation.
> > >>
> > >> What do you think? Maybe you have thought a better solution for that.
> > >>
> > >
> > > The first approach OpenNLP came up with to customize the feature
> > generation
> > > of a component is to simply pass in a context generator. Well, that
> does
> > > not
> > > really work with the new model packages and the command line.
> > > We never really came up with a solution to this problem or discussed
> it.
> > >
> > > William suggests that we should use a class name to load a factory
> class.
> > > And I think we then should also remove the support to pass in a context
> > > generator.
> > >
> > > I believe it is a good way of solving the issue, since the model can
> then
> > > be used
> > > by any code which integrates OpenNLP and has an additional jar on the
> > > classpath.
> > > That will for example work well with our UIMA integration.
> > >
> > > These models might not be well suited for distribution to a wider group
> > of
> > > people
> > > since they always need the factory class which we cannot put inside the
> > > model because
> > > of security issues.
> > >
> > > For components where we need to adapt the feature generation to a
> > language
> > > I still
> > > suggest that we continue to define default feature generation which is
> > > dependent on
> > > the language, as we already do for Thai in the sentence detector.
> > >
> > > Well, I am not yet sure how it should be done for the parser, doccat
> and
> > > coref.
> > >
> > > Jörn
> > >
> >
>

Re: Custom feature generators

Posted by Joern Kottmann <ko...@gmail.com>.

Yes, let's see what we could do.

The name finder already supports custom feature generation,
the same feature generation code could be reused by the POS Tagger.
This is actually already half done.

One of the current limitations is that we cannot store "custom" resources
in
a model. If we specify some kind of Factory class it would be nice if it
can help
us to locate the Artifact Serializer for a custom resource.
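
One way this could look, sketched under the assumption that the factory
exposes a map from a resource name suffix to the serializer that reads and
writes that resource. ResourceSerializer, WordListSerializer and
CustomResourceFactory are hypothetical names for this sketch and do not
mirror the real Artifact Serializer interface.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical serializer contract for a resource stored inside a model.
interface ResourceSerializer<T> {
    T read(InputStream in) throws IOException;
    void write(T resource, OutputStream out) throws IOException;
}

// Custom resource: a word list stored as one word per line.
class WordListSerializer implements ResourceSerializer<Set<String>> {
    public Set<String> read(InputStream in) throws IOException {
        Set<String> words = new TreeSet<>();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            if (!line.isEmpty()) {
                words.add(line);
            }
        }
        return words;
    }

    public void write(Set<String> resource, OutputStream out) throws IOException {
        for (String word : resource) {
            out.write((word + "\n").getBytes(StandardCharsets.UTF_8));
        }
    }
}

// The factory tells the model loader which serializer handles which suffix.
class CustomResourceFactory {
    Map<String, ResourceSerializer<?>> serializers() {
        Map<String, ResourceSerializer<?>> byExtension = new HashMap<>();
        byExtension.put("wordlist", new WordListSerializer());
        return byExtension;
    }
}

class SerializerDemo {
    public static void main(String[] args) throws IOException {
        WordListSerializer serializer = new WordListSerializer();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        serializer.write(new TreeSet<>(Arrays.asList("ham", "eggs")), out);
        System.out.println(serializer.read(new ByteArrayInputStream(out.toByteArray())));
    }
}
```

The loader would only need the factory class name from the model to find
the right serializer for each stored entry.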

We could define one Factory class per component which is able to influence
how this component is created from the model.

What do you think?

Jörn

On Tue, Feb 7, 2012 at 2:17 PM, william.colen@gmail.com <
william.colen@gmail.com> wrote:

> Hi,
>
> I would like to work on that now, passing a Factory class name to the CLI
> tools and saving it to the model as a configuration.
> Do you still think it is a good idea? Or we should find a better way to
> load custom feature generator and custom sequence validators? I would like
> to do it for SentenceDetector and POS Tagger for now.
>
> Thanks,
> William
>
> On Tue, Jun 21, 2011 at 11:58 AM, Jörn Kottmann <ko...@gmail.com>
> wrote:
>
> > On 6/14/11 4:23 AM, william.colen@gmail.com wrote:
> >
> >> Hi,
> >>
> >> Currently we only have implemented custom feature generators that we can
> >> pass from command line only for NameFinder, but it would be very nice to
> >> have it for all tools.
> >> The Thai sentence detector customization is nice and simple, but to do
> >> something for other languages the user would need to branch the code. We
> >> should allow users to pass a factory class name from command line. Maybe
> >> we
> >> could do it for every tool that doesn't use sequence feature generator.
> >> Also
> >> would be nice to save the factory class name to the model to make sure
> we
> >> are using the same feature generator during runtime and evaluation.
> >>
> >> What do you think? Maybe you have thought a better solution for that.
> >>
> >
> > The first approach OpenNLP came up with to customize the feature
> generation
> > of a component is to simply pass in a context generator. Well, that does
> > not
> > really work with the new model packages and the command line.
> > We never really came up with a solution to this problem or discussed it.
> >
> > William suggests that we should use a class name to load a factory class.
> > And I think we then should also remove the support to pass in a context
> > generator.
> >
> > I believe it is a good way of solving the issue, since the model can then
> > be used
> > by any code which integrates OpenNLP and has an additional jar on the
> > classpath.
> > That will for example work well with our UIMA integration.
> >
> > These models might not be well suited for distribution to a wider group
> of
> > people
> > since they always need the factory class which we cannot put inside the
> > model because
> > of security issues.
> >
> > For components where we need to adapt the feature generation to a
> language
> > I still
> > suggest that we continue to define default feature generation which is
> > dependent on
> > the language, as we already do for Thai in the sentence detector.
> >
> > Well, I am not yet sure how it should be done for the parser, doccat and
> > coref.
> >
> > Jörn
> >
>

Re: Custom feature generators

Posted by "william.colen@gmail.com" <wi...@gmail.com>.

Hi,

I would like to work on that now, passing a Factory class name to the CLI
tools and saving it to the model as a configuration.
Do you still think it is a good idea? Or should we find a better way to
load custom feature generators and custom sequence validators? I would like
to do it for SentenceDetector and POS Tagger for now.

Thanks,
William

On Tue, Jun 21, 2011 at 11:58 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 6/14/11 4:23 AM, william.colen@gmail.com wrote:
>
>> Hi,
>>
>> Currently we only have implemented custom feature generators that we can
>> pass from command line only for NameFinder, but it would be very nice to
>> have it for all tools.
>> The Thai sentence detector customization is nice and simple, but to do
>> something for other languages the user would need to branch the code. We
>> should allow users to pass a factory class name from command line. Maybe
>> we
>> could do it for every tool that doesn't use sequence feature generator.
>> Also
>> would be nice to save the factory class name to the model to make sure we
>> are using the same feature generator during runtime and evaluation.
>>
>> What do you think? Maybe you have thought a better solution for that.
>>
>
> The first approach OpenNLP came up with to customize the feature generation
> of a component is to simply pass in a context generator. Well, that does
> not
> really work with the new model packages and the command line.
> We never really came up with a solution to this problem or discussed it.
>
> William suggests that we should use a class name to load a factory class.
> And I think we then should also remove the support to pass in a context
> generator.
>
> I believe it is a good way of solving the issue, since the model can then
> be used
> by any code which integrates OpenNLP and has an additional jar on the
> classpath.
> That will for example work well with our UIMA integration.
>
> These models might not be well suited for distribution to a wider group of
> people
> since they always need the factory class which we cannot put inside the
> model because
> of security issues.
>
> For components where we need to adapt the feature generation to a language
> I still
> suggest that we continue to define default feature generation which is
> dependent on
> the language, as we already do for Thai in the sentence detector.
>
> Well, I am not yet sure how it should be done for the parser, doccat and
> coref.
>
> Jörn
>

Re: Custom feature generators

Posted by Jörn Kottmann <ko...@gmail.com>.

On 6/14/11 4:23 AM, william.colen@gmail.com wrote:
> Hi,
>
> Currently we only have implemented custom feature generators that we can
> pass from command line only for NameFinder, but it would be very nice to
> have it for all tools.
> The Thai sentence detector customization is nice and simple, but to do
> something for other languages the user would need to branch the code. We
> should allow users to pass a factory class name from command line. Maybe we
> could do it for every tool that doesn't use sequence feature generator. Also
> would be nice to save the factory class name to the model to make sure we
> are using the same feature generator during runtime and evaluation.
>
> What do you think? Maybe you have thought a better solution for that.

The first approach OpenNLP came up with to customize the feature generation
of a component is to simply pass in a context generator. Well, that does not
really work with the new model packages and the command line.
We never really came up with a solution to this problem or discussed it.

William suggests that we should use a class name to load a factory class.
And I think we then should also remove the support to pass in a context 
generator.

I believe it is a good way of solving the issue, since the model can then
be used by any code which integrates OpenNLP and has an additional jar on
the classpath.
That will for example work well with our UIMA integration.

These models might not be well suited for distribution to a wider group 
of people
since they always need the factory class which we cannot put inside the 
model because
of security issues.

For components where we need to adapt the feature generation to a 
language I still
suggest that we continue to define default feature generation which is 
dependent on
the language, as we already do for Thai in the sentence detector.

Well, I am not yet sure how it should be done for the parser, doccat and 
coref.

Jörn