Posted to users@opennlp.apache.org by Damiano Porta <da...@gmail.com> on 2017/09/04 11:24:07 UTC

Custom tokenizer?

Hello everybody,

I have to build a custom tokenizer that has one more class, NOSPLIT.
At the moment the tokenizer only supports the SPLIT class; I should extend
it because I have special product codes that must end up in a single token
(but unfortunately they contain whitespace).

What approach should I use to extend it?
Or should I create a simple classifier, like the POS tagger, that takes
the context into account?

Something like:

Sentence:
The product valvole x 158 78 9

Training:
The_SPLIT product_SPLIT valvole_NOSPLIT x_NOSPLIT 158_NOSPLIT 78_NOSPLIT 9_NOSPLIT

Result:
"the", "product", "valvole x 158 78 9"

What do you think?
Thank you!

Damiano