Posted to users@opennlp.apache.org by Joern Kottmann <ko...@gmail.com> on 2019/02/06 19:17:30 UTC

Re: Tokenizing untokenizable (French)

Yes, it is configurable. There is the so-called alphanumeric
optimisation: if it is set to true, the tokenizer will not split
between characters of the same character class.
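
For example, with the Java API the switch corresponds to the
useAlphaNumericOptimization parameter of TokenizerFactory. Here is a
minimal training sketch against the OpenNLP 1.8+ API (the file names
and the training-data path are placeholders):

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import opennlp.tools.tokenize.*;
    import opennlp.tools.util.*;

    public class TrainFrenchTokenizer {
        public static void main(String[] args) throws IOException {
            // Training data: one sentence per line, token boundaries
            // marked with <SPLIT> tags.
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new MarkableFileInputStreamFactory(new File("fr-token.train")),
                    StandardCharsets.UTF_8);

            try (ObjectStream<TokenSample> samples = new TokenSampleStream(lines)) {
                // Third argument = useAlphaNumericOptimization: with 'true'
                // the tokenizer never proposes a split between two characters
                // of the same class, e.g. inside "Portsmouth".
                TokenizerFactory factory = new TokenizerFactory(
                        "fr", null /* no abbreviation dictionary */, true,
                        null /* default alphanumeric pattern */);

                TokenizerModel model = TokenizerME.train(
                        samples, factory, TrainingParameters.defaultParams());

                try (OutputStream out = new BufferedOutputStream(
                        new FileOutputStream("fr-token.bin"))) {
                    model.serialize(out);
                }
            }
        }
    }

If I remember correctly, the command-line trainer exposes the same
switch as -alphaNumOpt true on the TokenizerTrainer tool.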

Jörn

On Thu, Jan 17, 2019 at 11:08 AM Nikolai Krot <ta...@gmail.com> wrote:
>
> Hello OpenNLPists,
>
> We have trained a word tokenizer model for French on our own data and see
> weird cases where splitting occurs in the middle of a word, like this:
>
> Portsmouth --> Ports mouth
>
> This word comes from the testing corpus, which is normal French text found
> on the web, though the word itself is not French.
>
> I wonder why the word tokenizer attempts to split *between* two alphabetic
> characters. I can imagine cases where splitting in the middle of a word is
> indeed useful, as with proclitics and enclitics, but I would like to handle
> those as a separate step, so that the word tokenizer targets only
> punctuation marks. Is this somehow configurable in OpenNLP?
>
> Best regards,
> Nikolai