Posted to dev@opennlp.apache.org by "william.colen@gmail.com" <wi...@gmail.com> on 2012/03/29 00:38:20 UTC

DetokenizationOperation.MERGE_BOTH

Hi!

I need something like DetokenizationOperation.MERGE_BOTH to train a
Tokenizer from NameFinder data. A sample of the data is:

... devolva - me o livro .... (give the book back to me)

I need to detokenize it to "devolva-me o livro".

So I would need to add the hyphen to the detokenizer dictionary and
configure it with something like MERGE_BOTH, but we don't have such an
option. Do you see another way of doing this, or should I extend
DetokenizationOperation?
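
Just to make the proposal concrete, here is a rough sketch of how I would
expect to use it. It is only a sketch: the MERGE_BOTH operation and the
corresponding MOVE_BOTH dictionary entry do not exist yet, that is exactly
what I am proposing, and the rest is the existing DictionaryDetokenizer
API as far as I recall it:

import opennlp.tools.tokenize.DetokenizationDictionary;
import opennlp.tools.tokenize.Detokenizer.DetokenizationOperation;
import opennlp.tools.tokenize.DictionaryDetokenizer;

public class HyphenMergeBothSketch {

    public static void main(String[] args) {
        // Dictionary entry for the hyphen, using the proposed MOVE_BOTH
        // operation (not part of the current API).
        DetokenizationDictionary dict = new DetokenizationDictionary(
                new String[] { "-" },
                new DetokenizationDictionary.Operation[] {
                        DetokenizationDictionary.Operation.MOVE_BOTH });

        DictionaryDetokenizer detokenizer = new DictionaryDetokenizer(dict);

        // Tokens as they appear in the NameFinder training data.
        String[] tokens = { "devolva", "-", "me", "o", "livro" };

        // With MOVE_BOTH in the dictionary the hyphen should be assigned
        // the proposed MERGE_BOTH operation, so the detokenized text
        // becomes "devolva-me o livro".
        DetokenizationOperation[] ops = detokenizer.detokenize(tokens);

        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " -> " + ops[i]);
        }
    }
}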

Thanks
William

Re: DetokenizationOperation.MERGE_BOTH

Posted by Jörn Kottmann <ko...@gmail.com>.
+1 to add a MERGE_BOTH.

I recently worked with newspaper texts where the quotation marks were
never separated from the surrounding words by white space. For that case
it would be nice to use MERGE_BOTH to retrain the tokenizer.
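
To make that case concrete, here is a small stand-alone sketch (plain
Java, not the OpenNLP API; the token flags are made up for illustration)
of what MERGE_BOTH semantics would mean for a quote that is glued to its
neighbours on both sides:

public class MergeBothQuoteSketch {

    public static void main(String[] args) {
        // Tokens as they come out of the tokenized training data.
        String[] tokens = { "he", "said", "\"", "hello", "\"", "to", "me" };

        // Hypothetical per-token flag: true where the detokenizer
        // dictionary would assign the proposed MERGE_BOTH operation
        // (here: the quotes).
        boolean[] mergeBoth = { false, false, true, false, true, false, false };

        StringBuilder text = new StringBuilder(tokens[0]);
        for (int i = 1; i < tokens.length; i++) {
            // MERGE_BOTH means the token attaches to both neighbours, so
            // no space is written directly before or after such a token.
            boolean noSpaceBefore = mergeBoth[i] || mergeBoth[i - 1];
            if (!noSpaceBefore) {
                text.append(' ');
            }
            text.append(tokens[i]);
        }

        // Prints: he said"hello"to me
        System.out.println(text);
    }
}

Something like that raw surface form, with no space around the quotes, is
what the newspaper data looks like, and being able to reproduce it from
tokens is what would let us retrain the tokenizer on it.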

Jörn
