Posted to issues@opennlp.apache.org by "Damiano (JIRA)" <ji...@apache.org> on 2015/08/29 16:32:45 UTC
[jira] [Created] (OPENNLP-809) Detokenize instead of splitting string with whitespaces
Damiano created OPENNLP-809:
-------------------------------
Summary: Detokenize instead of splitting string with whitespaces
Key: OPENNLP-809
URL: https://issues.apache.org/jira/browse/OPENNLP-809
Project: OpenNLP
Issue Type: Bug
Components: Name Finder
Affects Versions: 1.6.0
Reporter: Damiano
Priority: Critical
Hello,
I do not understand why RegexNameFinder separates the tokens with a whitespace. It seems pointless to me.
When we call `find(String[] token)`, you rebuild the string by appending a whitespace at the end of each token. Why?
I ask because the original string may have been tokenized by the *SimpleTokenizer*, and, as you know, this tokenizer splits (for example) a *word* from the *period* that follows it, so the rebuilt string gets a whitespace between them. Example:
Original:
I am visiting Rome.
Tokenized:
I am visiting Rome*[SPLIT]*.
Regex is applied to:
I am visiting Rome .
(instead of the original)
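To make the mismatch concrete, here is a minimal sketch that uses only java.util.regex. The `tokenize` method is a hypothetical stand-in for a character-class tokenizer such as SimpleTokenizer, not OpenNLP's code, and `joinWithSpaces` only approximates the rebuild step described above (it omits the trailing whitespace).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JoinVsOriginal {

    // Hypothetical stand-in for a character-class tokenizer:
    // runs of letters, runs of digits, or a single other non-space character.
    static String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\p{L}+|\\p{N}+|[^\\p{L}\\p{N}\\s]").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    // Assumed approximation of the rebuild step: tokens joined by one space.
    static String joinWithSpaces(String[] tokens) {
        return String.join(" ", tokens);
    }

    // True if the pattern occurs anywhere in the text.
    static boolean matches(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        String original = "I am visiting Rome.";
        String rebuilt = joinWithSpaces(tokenize(original));
        Pattern pattern = Pattern.compile("Rome\\.");

        System.out.println(rebuilt);                    // prints "I am visiting Rome ."
        System.out.println(matches(pattern, original)); // prints "true"
        System.out.println(matches(pattern, rebuilt));  // prints "false"
    }
}
```

The same pattern matches the original text but not the rebuilt one, because the rebuilt text contains a whitespace that was never in the input.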
In this version you have introduced a find() method that accepts a String instead of a String[], but in that case the caller passes the original string, not the rebuilt one, so the results are different.
Why not apply a *detokenize* method that performs the *EXACT* inverse of the tokenization (and so recovers the original string instead of a modified one)?
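As a rough sketch of what I mean, the following rebuilds the text from the tokens plus their start offsets in the original string. The class and method names are hypothetical and this is not an existing OpenNLP API. It also assumes the caller still has the offsets, and it is exact only when every gap in the original was made of plain spaces.

```java
public class OffsetDetokenizer {

    // Rebuilds the text by placing each token at its original start offset.
    // Gaps between tokens are filled with spaces; tokens that were adjacent
    // in the original (such as "Rome" and ".") stay adjacent.
    static String detokenize(String[] tokens, int[] starts) {
        if (tokens.length != starts.length) {
            throw new IllegalArgumentException("tokens and starts must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            while (sb.length() < starts[i]) {
                sb.append(' ');
            }
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"I", "am", "visiting", "Rome", "."};
        int[] starts = {0, 2, 5, 14, 18}; // offsets in "I am visiting Rome."

        System.out.println(detokenize(tokens, starts)); // prints "I am visiting Rome."
    }
}
```

With only a String[] and no offsets, an exact inverse is not possible in general, so the method would need either the offsets or a detokenizer that knows the tokenizer's rules.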
Thanks.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)