Posted to dev@opennlp.apache.org by "william.colen@gmail.com" <wi...@gmail.com> on 2012/03/29 00:38:20 UTC
DetokenizationOperation.MERGE_BOTH
Hi!
I need something like DetokenizationOperation.MERGE_BOTH to train a
Tokenizer from NameFinder data. A sample of the data is:
... devolva - me o livro .... (give the book back to me)
I need to detokenize it to "devolva-me o livro".
So I would need to add the hyphen to the detokenizer dictionary and
configure it with something like MERGE_BOTH, but we don't have such an option.
Do you see another way of doing it, or should I extend
DetokenizationOperation?
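To illustrate what I mean, here is a minimal sketch in plain Java (not the
OpenNLP API; the enum and rule table below are just stand-ins for the
detokenizer dictionary) of how a MERGE_BOTH operation would glue a token to
both of its neighbours:

```java
import java.util.Map;

// Sketch only: MERGE_BOTH attaches a token to both the preceding and the
// following token, so "devolva - me" becomes "devolva-me".
public class MergeBothSketch {

    enum Operation { NO_OPERATION, MERGE_TO_LEFT, MERGE_TO_RIGHT, MERGE_BOTH }

    // Hypothetical rule table standing in for the detokenizer dictionary.
    static final Map<String, Operation> DICT = Map.of("-", Operation.MERGE_BOTH);

    static String detokenize(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        boolean suppressSpace = false;
        for (int i = 0; i < tokens.length; i++) {
            Operation op = DICT.getOrDefault(tokens[i], Operation.NO_OPERATION);
            // Merge to the left: drop the space before this token.
            boolean mergeLeft = op == Operation.MERGE_TO_LEFT || op == Operation.MERGE_BOTH;
            if (i > 0 && !mergeLeft && !suppressSpace) {
                sb.append(' ');
            }
            sb.append(tokens[i]);
            // Merge to the right: drop the space before the next token.
            suppressSpace = op == Operation.MERGE_TO_RIGHT || op == Operation.MERGE_BOTH;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"devolva", "-", "me", "o", "livro"};
        System.out.println(detokenize(tokens)); // devolva-me o livro
    }
}
```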
Thanks
William
Re: DetokenizationOperation.MERGE_BOTH
Posted by Jörn Kottmann <ko...@gmail.com>.
+1 to add a MERGE_BOTH.
I recently worked with newspaper texts where the quotation mark was never
separated from the adjacent words by whitespace. For that case it would be
nice to use MERGE_BOTH to retrain the tokenizer.
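For the quotation-mark case, a dictionary entry might then look something
like this (sketched from memory of the detokenizer dictionary XML format;
MERGE_BOTH is the proposed, not yet existing, operation value):

```xml
<dictionary>
  <!-- Proposed: glue the quotation mark to the tokens on both sides. -->
  <entry operation="MERGE_BOTH">
    <token>"</token>
  </entry>
</dictionary>
```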
Jörn
On 03/29/2012 12:38 AM, william.colen@gmail.com wrote:
> Hi!
>
> I need something like DetokenizationOperation.MERGE_BOTH to train a
> Tokenizer from NameFinder data. A sample of the data is:
>
> ... devolva - me o livro .... (give the book back to me)
>
> I need to detokenize it to "devolva-me o livro".
>
> So I would need to add the hyphen to the detokenizer dictionary and
> configure it with something like MERGE_BOTH, but we don't have such an option.
> Do you see another way of doing it, or should I extend
> DetokenizationOperation?
>
> Thanks
> William
>