Posted to users@opennlp.apache.org by Nikolai Krot <ta...@gmail.com> on 2019/01/17 10:08:06 UTC

Tokenizing untokenizable (French)

Hello OpenNLPists,

We have trained a word tokenizer model for French on our own data and are
seeing weird cases where splitting occurs in the middle of a word, like this:

Portsmouth --> Ports mouth

This is a word from the testing corpus, which is normal French text found on
the web, though the word itself is not French.

I wonder why the word tokenizer attempts to split *between* two alphabetic
characters. I can imagine cases where splitting in the middle of a word is
indeed useful, for example with proclitics and enclitics, but I would like
to handle those in a separate step, so that the word tokenizer targets only
punctuation marks. Is this somehow configurable in OpenNLP?

Best regards,
Nikolai

Re: Tokenizing untokenizable (French)

Posted by Joern Kottmann <ko...@gmail.com>.
Yes, it is configurable. There is the so-called alphanumeric
optimisation: if it is set to true, the tokenizer will not split
between characters of the same character class.
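As a sketch of how this might look with the command-line trainer (the file
names below are placeholders for your own corpus and model; check the CLI
reference for your OpenNLP version to confirm the exact flags):

```shell
# Train a French tokenizer model with the alphanumeric optimisation
# enabled via -alphaNumOpt, so the resulting tokenizer does not split
# between two alphanumeric characters such as "Ports|mouth".
opennlp TokenizerTrainer \
  -model fr-token.bin \
  -lang fr \
  -data fr-token.train \
  -encoding UTF-8 \
  -alphaNumOpt true
```

If you train through the Java API instead, the same setting is passed to
TokenizerFactory as its useAlphaNumericOptimization argument and is then
baked into the saved model.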

Jörn

On Thu, Jan 17, 2019 at 11:08 AM Nikolai Krot <ta...@gmail.com> wrote:
>
> Hello OpenNLPists,
>
> We have trained a word tokenizer model for French on our own data and are
> seeing weird cases where splitting occurs in the middle of a word, like this:
>
> Portsmouth --> Ports mouth
>
> This is a word from the testing corpus, which is normal French text found on
> the web, though the word itself is not French.
>
> I wonder why the word tokenizer attempts to split *between* two alphabetic
> characters. I can imagine cases where splitting in the middle of a word is
> indeed useful, for example with proclitics and enclitics, but I would like
> to handle those in a separate step, so that the word tokenizer targets only
> punctuation marks. Is this somehow configurable in OpenNLP?
>
> Best regards,
> Nikolai