
Posted to users@opennlp.apache.org by William Colen <co...@apache.org> on 2012/02/02 05:13:46 UTC

Morphological analyser (featurizer)

Hi,

I am trying to develop an OpenNLP-based learnable featurizer. It attaches
tags such as gender, number, mood, person and verb tense. The input is the
sentence tokens and their POS tags.
The context generator I am using is based on the Chunker's, plus
some prefix and suffix features.
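
To illustrate what such a context generator might emit, here is a minimal
sketch, not the actual featurizer code. It assumes a one-token window of
words and POS tags plus prefixes and suffixes of up to three characters; the
class name, the feature names (w0, t_1, pre, suf) and the affix length are
all hypothetical, and the real Chunker context generator uses a different
feature set.

```java
import java.util.ArrayList;
import java.util.List;

final class AffixContextSketch {

    // Longest prefix/suffix to emit (an assumption of this sketch).
    static final int MAX_AFFIX = 3;

    /** Returns the feature strings for the token at position i. */
    static List<String> context(int i, String[] tokens, String[] posTags) {
        List<String> features = new ArrayList<>();
        String word = tokens[i];

        // Current token and its POS tag.
        features.add("w0=" + word);
        features.add("t0=" + posTags[i]);

        // Previous and next token with their POS tags, when they exist.
        if (i > 0) {
            features.add("w_1=" + tokens[i - 1]);
            features.add("t_1=" + posTags[i - 1]);
        }
        if (i + 1 < tokens.length) {
            features.add("w1=" + tokens[i + 1]);
            features.add("t1=" + posTags[i + 1]);
        }

        // Prefixes and suffixes strictly shorter than the word itself.
        for (int n = 1; n <= MAX_AFFIX && n < word.length(); n++) {
            features.add("pre=" + word.substring(0, n));
            features.add("suf=" + word.substring(word.length() - n));
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = { "as", "meninas", "cantaram" };
        String[] posTags = { "art", "n", "v-fin" };
        System.out.println(context(1, tokens, posTags));
    }
}
```

Suffixes such as "as" or "nas" are the kind of cue that can signal gender
and number, which is the reason for adding affix features at all.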

The current accuracy is 95.395%, but I think I can improve it using a
sequence validator.

Question:
Is it possible to create a sequence validator that, besides the tokens,
also knows the POS tags? I would like to check whether the combination of
POS tag and features is valid (for example, tense tags only on verbs).

Thank you in advance. If it works, and you think it is a good tool, I will
contribute the featurizer to OpenNLP.

William

Re: Morphological analyser (featurizer)

Posted by William Colen <co...@apache.org>.

Hi,

It turns out that this was easy to do.
I created a class TokenTag to hold a token and its POS tag. Then I changed
the Featurizer to work with BeamSearch<TokenTag> and
SequenceValidator<TokenTag>. With this change, both the token and its POS
tag are accessible from inside the sequence validator.

For now I am only validating the features using a tag dictionary.
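
A minimal sketch of how this could look, under stated assumptions and not
the actual featurizer code. The SequenceValidator interface below is a local
stand-in modelled on OpenNLP's SequenceValidator<T>, so that the example
compiles without the library. The dictionary keyed by POS tag, the class
names other than TokenTag, and the tag labels ("F=S", "PS=3S=IND", ...) are
hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class FeaturizerValidatorSketch {

    /** Holds a token together with its POS tag. */
    static final class TokenTag {
        final String token;
        final String posTag;

        TokenTag(String token, String posTag) {
            this.token = token;
            this.posTag = posTag;
        }
    }

    /** Local stand-in modelled on OpenNLP's SequenceValidator<T>. */
    interface SequenceValidator<T> {
        boolean validSequence(int i, T[] inputSequence,
                              String[] outcomesSequence, String outcome);
    }

    /** Rejects any feature tag the dictionary does not list for the POS tag. */
    static final class TagDictionaryValidator implements SequenceValidator<TokenTag> {
        private final Map<String, Set<String>> allowedByPosTag;

        TagDictionaryValidator(Map<String, Set<String>> allowedByPosTag) {
            this.allowedByPosTag = allowedByPosTag;
        }

        @Override
        public boolean validSequence(int i, TokenTag[] inputSequence,
                                     String[] outcomesSequence, String outcome) {
            Set<String> allowed = allowedByPosTag.get(inputSequence[i].posTag);
            // Assumption: POS tags missing from the dictionary are left unconstrained.
            return allowed == null || allowed.contains(outcome);
        }
    }

    /** Builds a validator over a tiny hypothetical dictionary. */
    static SequenceValidator<TokenTag> demoValidator() {
        Map<String, Set<String>> dict = new HashMap<>();
        dict.put("n", new HashSet<>(Arrays.asList("M=S", "F=S", "M=P", "F=P")));
        dict.put("v-fin", new HashSet<>(Arrays.asList("PR=3S=IND", "PS=3S=IND")));
        return new TagDictionaryValidator(dict);
    }

    public static void main(String[] args) {
        TokenTag[] sentence = { new TokenTag("casa", "n"), new TokenTag("caiu", "v-fin") };
        SequenceValidator<TokenTag> validator = demoValidator();
        String[] previous = new String[0];
        // A noun may carry gender/number, but not a tense tag.
        System.out.println(validator.validSequence(0, sentence, previous, "F=S"));
        System.out.println(validator.validSequence(0, sentence, previous, "PS=3S=IND"));
    }
}
```

Because the validator receives the whole TokenTag array, it can veto an
outcome during the beam search instead of filtering results afterwards.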

In a 10-fold cross-validation on the Brazilian corpus, the accuracy is now
97.142%.

The accuracy should increase further if I modify the evaluator: if the
Featurizer selects, for example, masculine as the gender of a token that the
corpus annotates with two genders, the evaluator currently counts it as an
error.
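
A sketch of the more lenient comparison, assuming (hypothetically) that the
corpus encodes an ambiguous annotation as alternatives separated by "/",
such as "M/F". The class and method names are made up for illustration and
are not part of OpenNLP.

```java
final class LenientFeatureMatch {

    /** True if the prediction equals any of the gold alternatives. */
    static boolean matches(String gold, String predicted) {
        for (String alternative : gold.split("/")) {
            if (alternative.equals(predicted)) {
                return true;
            }
        }
        return false;
    }

    /** Fraction of tokens whose prediction matches a gold alternative. */
    static double accuracy(String[] gold, String[] predicted) {
        if (gold.length != predicted.length) {
            throw new IllegalArgumentException("sequences differ in length");
        }
        if (gold.length == 0) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < gold.length; i++) {
            if (matches(gold[i], predicted[i])) {
                correct++;
            }
        }
        return (double) correct / gold.length;
    }

    public static void main(String[] args) {
        String[] gold = { "M/F", "F", "M" };
        String[] predicted = { "M", "F", "F" };
        System.out.println(accuracy(gold, predicted));
    }
}
```

With a strict string comparison the first token above would count as an
error; here only the third does.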

Thank you,
William
