Posted to issues@opennlp.apache.org by "Damiano (JIRA)" <ji...@apache.org> on 2015/08/29 16:32:45 UTC

[jira] [Created] (OPENNLP-809) Detokenize instead of splitting string with whitespaces

Damiano created OPENNLP-809:
-------------------------------

             Summary: Detokenize instead of splitting string with whitespaces
                 Key: OPENNLP-809
                 URL: https://issues.apache.org/jira/browse/OPENNLP-809
             Project: OpenNLP
          Issue Type: Bug
          Components: Name Finder
    Affects Versions: 1.6.0
            Reporter: Damiano
            Priority: Critical


Hello,
I do not understand why RegexNameFinder joins the tokens back together with a whitespace. It seems pointless to me.

When we call `find(String[] tokens)`, you rebuild the string by appending a whitespace after each token. Why?
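Just to be clear about what I mean, here is a rough, hypothetical sketch of that joining behaviour (my own illustration, not the actual RegexNameFinder source):

{code:java}
// Hypothetical sketch of the behaviour described above: the sentence is
// rebuilt by joining the tokens with a single whitespace.
static String joinWithWhitespace(String[] tokens) {
  StringBuilder sb = new StringBuilder();
  for (int i = 0; i < tokens.length; i++) {
    sb.append(tokens[i]);
    if (i < tokens.length - 1) {
      sb.append(' '); // a space after every token, even before punctuation
    }
  }
  // {"I", "am", "visiting", "Rome", "."} -> "I am visiting Rome ."
  return sb.toString();
}
{code}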

I am saying that because the original string may have been tokenized with the *SimpleTokenizer*, and, as you know, that tokenizer splits (for example) a *word* and a *period* into separate tokens, so a whitespace ends up between them. Example:

Original:
I am visiting Rome.

Tokenized:
I am visiting Rome*[SPLIT]*.

Regex is applied to: 
I am visiting Rome . 
(instead of the original)
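
To make the difference concrete, here is a small, hypothetical snippet (the pattern and strings are only my illustration):

{code:java}
import java.util.regex.Pattern;

public class RegexOnRebuiltString {
  public static void main(String[] args) {
    // A pattern written against the original text no longer matches once
    // the tokens have been re-joined with whitespaces.
    Pattern p = Pattern.compile("Rome\\.");

    String original = "I am visiting Rome.";
    String rebuilt  = "I am visiting Rome ."; // tokens joined with a whitespace

    System.out.println(p.matcher(original).find()); // true
    System.out.println(p.matcher(rebuilt).find());  // false
  }
}
{code}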

In this version you have introduced a find() method that accepts a String instead of a String[], but in that case the caller passes the original string, not the rebuilt one, so the results are different.

Why not apply a *detokenize* method that does the *EXACT* inverse of the tokenization (and get back the original string instead of a modified one)?
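
Something like the following (only my own rough sketch, not a proposed patch; a real fix could perhaps reuse the existing opennlp.tools.tokenize.Detokenizer API instead):

{code:java}
// Rough, hypothetical sketch of a detokenize step: re-attach punctuation-only
// tokens to the preceding token instead of always inserting a whitespace.
// Only an illustration of the idea, not the exact inverse for every tokenizer.
static String detokenize(String[] tokens) {
  StringBuilder sb = new StringBuilder();
  for (String token : tokens) {
    if (sb.length() > 0 && !token.matches("\\p{Punct}+")) {
      sb.append(' '); // only add a whitespace before "normal" tokens
    }
    sb.append(token);
  }
  // {"I", "am", "visiting", "Rome", "."} -> "I am visiting Rome."
  return sb.toString();
}
{code}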

Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)