Posted to dev@opennlp.apache.org by Jörn Kottmann <ko...@gmail.com> on 2011/08/04 12:58:19 UTC

AlphaNumOpt in tokenizer

Hi William,

I saw your change to the alpha num optimization in the
tokenizer.

I am aware that it is not perfect currently, especially
for non-English languages. In my opinion we should use Unicode
to determine what is a letter and what is a digit.

Since it is only a performance optimization, I think we should
undo the change you made and instead look into the Unicode approach.
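
For illustration, a check based on java.lang.Character could look
roughly like this (just a sketch; the actual integration point in the
tokenizer is not shown, and isAlphaNumeric is a made-up name):

    // Sketch: Character.isLetterOrDigit follows the Unicode character
    // database, so this also classifies non-English scripts correctly.
    static boolean isAlphaNumeric(String token) {
        // iterate over code points so supplementary characters work too
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (!Character.isLetterOrDigit(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }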

What do you think?

We might want more options anyway, e.g. a tokenization dictionary for
some frequent cases. In such a dictionary the tokenizer could look up how
a certain input char sequence should be tokenized.
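
Roughly what I have in mind, as a minimal sketch (the class and the
example entry are made up, nothing like this exists in the code base yet):

    import java.util.HashMap;
    import java.util.Map;

    import opennlp.tools.tokenize.Tokenizer;

    // Hypothetical: map frequent char sequences directly to their
    // tokenization and fall back to the statistical model on a miss.
    class DictionaryTokenizer {

        private final Map<String, String[]> tokenDict = new HashMap<String, String[]>();
        private final Tokenizer fallback;

        DictionaryTokenizer(Tokenizer fallback) {
            this.fallback = fallback;
            tokenDict.put("can't", new String[] {"ca", "n't"});  // example entry
        }

        String[] tokenize(String text) {
            String[] cached = tokenDict.get(text);
            return cached != null ? cached : fallback.tokenize(text);
        }
    }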

Jörn

Re: AlphaNumOpt in tokenizer

Posted by "william.colen@gmail.com" <wi...@gmail.com>.
On Thu, Aug 4, 2011 at 7:58 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> Hi William,
>
> I saw your change to the alpha num optimization in the
> tokenizer.
>
> I am aware that it is not perfect currently, especially
> for non-English languages. In my opinion we should use Unicode
> to determine what is a letter and what is a digit.
>
> Since it is only a performance optimization, I think we should
> undo the change you made and instead look into the Unicode approach.
>
> What do you think?
>

+1, but I don't know much about the Unicode approach.


>
> We might want more options anyway, e.g. a tokenization dictionary for
> some frequent cases. In such a dictionary the tokenizer could look up how
> a certain input char sequence should be tokenized.
>

Yes. The F score of the models I create with the OpenNLP tokenizer is high
(>99%), but it fails in some cases, maybe because my training data doesn't
have enough of these cases.
I added the abbreviation dictionary, but it is not helping much.
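
For reference, in later OpenNLP releases the abbreviation dictionary is
wired in at training time through the TokenizerFactory. A sketch, where
'samples' is assumed to be an existing ObjectStream<TokenSample> over the
training data and "en" is a placeholder language code:

    import java.io.IOException;

    import opennlp.tools.dictionary.Dictionary;
    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenizerFactory;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.TrainingParameters;

    class TrainWithAbbrevDict {

        // Train with an abbreviation dictionary and the alphanumeric
        // optimization enabled; null selects the default alphanumeric pattern.
        static TokenizerModel train(ObjectStream<TokenSample> samples,
                Dictionary abbreviations) throws IOException {
            TokenizerFactory factory =
                new TokenizerFactory("en", abbreviations, true, null);
            return TokenizerME.train(samples, factory,
                TrainingParameters.defaultParams());
        }
    }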