Posted to dev@opennlp.apache.org by "william.colen@gmail.com" <wi...@gmail.com> on 2012/05/02 19:00:26 UTC

Create POS Tagger during training / evaluation

Hi,

I am thinking of adding a new feature for the POS Tagger component and I
would appreciate some comments.

The POS Tagger is considerably more effective with a POSDictionary, but
today the only option is for the user to provide one. It would be nice if
we could induce the dictionary from the training data, or expand an
existing dictionary with it.

To activate this, the user could pass in a cutoff value. Only word + tag
pairs with a frequency higher than the cutoff would be added to the
dictionary. While performing cross validation, we should keep in mind that
we can only expand / create the dictionary using the training portion of
the corpus.
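
To make the cutoff rule concrete, here is a minimal sketch in plain Java.
It does not use the OpenNLP API: the class DictionaryInduction, the
TaggedSentence holder and the induce method are hypothetical names chosen
for illustration, and a plain Map stands in for the POSDictionary.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names come from OpenNLP.
public class DictionaryInduction {

    /** One training sentence: tokens and their tags, aligned by index. */
    public static final class TaggedSentence {
        final String[] tokens;
        final String[] tags;

        public TaggedSentence(String[] tokens, String[] tags) {
            if (tokens.length != tags.length) {
                throw new IllegalArgumentException("tokens and tags must align");
            }
            this.tokens = tokens;
            this.tags = tags;
        }
    }

    /**
     * Returns word -> allowed tags, keeping only (word, tag) pairs seen
     * strictly more than cutoff times in the given sentences.
     */
    public static Map<String, Set<String>> induce(List<TaggedSentence> sentences, int cutoff) {
        // First pass: count every (word, tag) pair.
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (TaggedSentence s : sentences) {
            for (int i = 0; i < s.tokens.length; i++) {
                counts.computeIfAbsent(s.tokens[i], w -> new HashMap<>())
                      .merge(s.tags[i], 1, Integer::sum);
            }
        }
        // Second pass: keep only the pairs whose frequency exceeds the cutoff.
        Map<String, Set<String>> dictionary = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> word : counts.entrySet()) {
            for (Map.Entry<String, Integer> tag : word.getValue().entrySet()) {
                if (tag.getValue() > cutoff) {
                    dictionary.computeIfAbsent(word.getKey(), w -> new TreeSet<>())
                              .add(tag.getKey());
                }
            }
        }
        return dictionary;
    }
}
```

In a cross-validation run, only the training sentences of the current fold
would be passed to induce, so that the held-out portion never contributes
entries.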

The only problem I see now is how we should create / expand this dictionary
if we are using the new Factory mechanism. One issue is that the tools
cannot access the dictionary directly. Also, depending on the dictionary
implementation in use, maybe the factory itself should perform the task of
populating it. The base Factory implementation would do this for the
default POSDictionary.

In this case, I would add the following methods to the POSTaggerFactory:

1) expandPOSDictionary( TrainingSampleStream<POSSample> samples, Integer
cutoff, boolean keepOriginal );
This method would expand / create the dictionary using the data from
samples, respecting the cutoff. The argument keepOriginal tells the
implementation that it should back up the original dictionary.

2) restorePOSDictionary();
Restores the dictionary from the backup so that another cross-validation
round can start.


What do you think? I am not sure this feature would help others. Also, I
don't like the POSTaggerFactory taking on this responsibility, but I can't
see a cleaner option right now.

Thank you,
William

Re: Create POS Tagger during training / evaluation

Posted by Jörn Kottmann <ko...@gmail.com>.
On 05/02/2012 11:00 PM, william.colen@gmail.com wrote:
> I don't see how to do it. Does creating a new one mean loading it from
> the model?

While doing cross validation we don't have a model. So we just load
it again with the help of the factory from its original source, e.g. the
file on disk.
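
A rough sketch of that idea in plain Java, assuming a Supplier stands in
for "the factory loading the dictionary from its original source". The
class FreshDictionaryPerFold and its run method are hypothetical and not
part of OpenNLP; each fold gets a fresh dictionary, which is then expanded
with that fold's training items only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names come from OpenNLP.
public class FreshDictionaryPerFold {

    /**
     * Runs the given number of folds. For each fold, a new dictionary is
     * obtained from source (a stand-in for the factory re-reading the
     * original file) and expanded with that fold's training pairs only.
     * Each element of wordTagPairs is {word, tag}.
     * Returns the dictionary built for each fold.
     */
    public static List<Map<String, Set<String>>> run(
            Supplier<Map<String, Set<String>>> source,
            List<String[]> wordTagPairs,
            int folds) {
        List<Map<String, Set<String>>> perFold = new ArrayList<>();
        for (int fold = 0; fold < folds; fold++) {
            // Fresh copy: nothing added in an earlier fold survives.
            Map<String, Set<String>> dictionary = source.get();
            for (int i = 0; i < wordTagPairs.size(); i++) {
                if (i % folds == fold) {
                    continue; // held-out item, must not leak into the dictionary
                }
                String[] pair = wordTagPairs.get(i);
                dictionary.computeIfAbsent(pair[0], w -> new HashSet<>()).add(pair[1]);
            }
            // A real run would train and evaluate the tagger at this point.
            perFold.add(dictionary);
        }
        return perFold;
    }
}
```

With this shape no backup / restore step is needed, because the expanded
dictionary of one fold is simply discarded.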

Jörn

Re: Create POS Tagger during training / evaluation

Posted by "william.colen@gmail.com" <wi...@gmail.com>.
Hi, Jörn,

On Wed, May 2, 2012 at 5:25 PM, Jörn Kottmann <ko...@gmail.com> wrote:
>
>
> Well, what you need is a mutable dictionary.
>
> A user who provides a custom dictionary must also provide
> support for serialization of it. He could decide to implement an
> interface to make the dictionary mutable (just an option how that could be
> done)
>

Yes, thanks. With that we can check whether it is a mutable dictionary and,
if it is, add the new words without having to ask the Factory to do that.
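
The check could look roughly like the sketch below. All names here
(TagLookup, MutableTagLookup, InMemoryDictionary, tryExpand) are made up
for illustration and are not existing OpenNLP interfaces; the point is
only that mutability is an optional interface the training code tests for.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names come from OpenNLP.
public class MutableDictionarySketch {

    /** Read-only view, standing in for whatever lookup the tagger needs. */
    public interface TagLookup {
        Set<String> tagsFor(String word);
    }

    /** Optional capability that a custom dictionary may choose to implement. */
    public interface MutableTagLookup extends TagLookup {
        void addTag(String word, String tag);
    }

    /** A simple in-memory dictionary that opts in to mutation. */
    public static final class InMemoryDictionary implements MutableTagLookup {
        private final Map<String, Set<String>> entries = new HashMap<>();

        @Override
        public Set<String> tagsFor(String word) {
            return entries.getOrDefault(word, Collections.emptySet());
        }

        @Override
        public void addTag(String word, String tag) {
            entries.computeIfAbsent(word, w -> new HashSet<>()).add(tag);
        }
    }

    /** Adds the pair if the dictionary is mutable; returns whether it did. */
    public static boolean tryExpand(TagLookup dictionary, String word, String tag) {
        if (dictionary instanceof MutableTagLookup) {
            ((MutableTagLookup) dictionary).addTag(word, tag);
            return true;
        }
        return false; // immutable dictionary: leave it untouched
    }
}
```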

> In the cross validation case I would just create a new one with
> the help of the factory from the original data.


I don't see how to do it. Does creating a new one mean loading it from the
model?

Re: Create POS Tagger during training / evaluation

Posted by Jörn Kottmann <ko...@gmail.com>.
On 05/02/2012 07:00 PM, william.colen@gmail.com wrote:
> What do you think? I am not sure this feature would help others, also I
> don't like the POSTaggerFactory to take this responsibility, but I can't
> see a cleaner option right now.

Well, what you need is a mutable dictionary.

A user who provides a custom dictionary must also provide
support for serialization of it. He could decide to implement an
interface to make the dictionary mutable (just one option for how that
could be done).

In the cross validation case I would just create a new one with
the help of the factory from the original data.

What do you think?

Jörn