Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2015/09/03 10:19:45 UTC

[jira] [Commented] (OPENNLP-809) Detokenize instead of splitting string with whitespaces

    [ https://issues.apache.org/jira/browse/OPENNLP-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728677#comment-14728677 ] 

Joern Kottmann commented on OPENNLP-809:
----------------------------------------

Exactly, the regular expressions are always applied against a whitespace-tokenized string.

The input to the find method has to be tokenized, so it is not possible to pass in the original, untokenized string.
Is there a disadvantage to writing regular expressions against a whitespace-tokenized string instead of the original untokenized string?

At the time the RegexNameFinder was written, the detokenizer didn't exist yet.
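
For illustration, a minimal sketch of what this means for callers: because find(String[]) matches against the tokens joined with single spaces, the pattern has to be written against that whitespace-joined form (note the space before the period). The Pattern[]-plus-type constructor and the "location" type name are assumptions made for this example.

    import java.util.regex.Pattern;

    import opennlp.tools.namefind.RegexNameFinder;
    import opennlp.tools.util.Span;

    public class RegexOverTokens {

        public static void main(String[] args) {

            // Whitespace-tokenized input, e.g. as produced by SimpleTokenizer:
            // the trailing period is a token of its own.
            String[] tokens = { "I", "am", "visiting", "Rome", "." };

            // The pattern is written against the whitespace-joined form
            // "I am visiting Rome ." - note the space before the period.
            Pattern pattern = Pattern.compile("Rome \\.");

            RegexNameFinder finder =
                new RegexNameFinder(new Pattern[] { pattern }, "location");

            // find(String[]) joins the tokens with single spaces internally
            // and maps each match back to token indices.
            for (Span span : finder.find(tokens)) {
                System.out.println(span); // token span covering "Rome" and "."
            }
        }
    }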



> Detokenize instead of splitting string with whitespaces
> -------------------------------------------------------
>
>                 Key: OPENNLP-809
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-809
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Name Finder
>    Affects Versions: 1.6.0
>            Reporter: Damiano
>            Assignee: Joern Kottmann
>            Priority: Critical
>
> Hello,
> I do not understand why you are joining the tokens with a whitespace in RegexNameFinder. It seems pointless to me.
> When we call `find(String[] token)` you rebuild the string by appending a whitespace to the end of each token. Why?
> I am saying that because the original string may have been tokenized by the *SimpleTokenizer*, and, as you know, this tokenizer splits (for example) a *word* from a following *period*. Example:
> Original:
> I am visiting Rome.
> Tokenized:
> I am visiting Rome*[SPLIT]*.
> Regex is applied to: 
> I am visiting Rome . 
> (instead of the original)
> In this version you have introduced a find() method that accepts a String instead of a String[], but in that case someone passes the original string, not the rebuilt string, so the results are different.
> Why not apply a *detokenize* method that does the *EXACT* inverse of the tokenization? (and get the original string back instead of a modified string)
> Thanks.
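
For reference, a minimal sketch of the detokenization approach requested above, using SimpleTokenizer plus a DictionaryDetokenizer with a tiny hand-built dictionary. The single-entry dictionary and the two-argument detokenize(String[], String) call are assumptions made for this example; a real setup would load a complete detokenizer dictionary.

    import opennlp.tools.tokenize.DetokenizationDictionary;
    import opennlp.tools.tokenize.DetokenizationDictionary.Operation;
    import opennlp.tools.tokenize.Detokenizer;
    import opennlp.tools.tokenize.DictionaryDetokenizer;
    import opennlp.tools.tokenize.SimpleTokenizer;

    public class DetokenizeRoundTrip {

        public static void main(String[] args) {

            String original = "I am visiting Rome.";

            // SimpleTokenizer splits the trailing period into its own token:
            // ["I", "am", "visiting", "Rome", "."]
            String[] tokens = SimpleTokenizer.INSTANCE.tokenize(original);

            // Tiny hand-built dictionary: attach "." to the token on its left.
            // (Assumed semantics of MOVE_LEFT; a full dictionary would also
            // cover commas, quotes, parentheses, etc.)
            DetokenizationDictionary dict = new DetokenizationDictionary(
                new String[] { "." },
                new Operation[] { Operation.MOVE_LEFT });

            Detokenizer detokenizer = new DictionaryDetokenizer(dict);

            // Rebuild the text: "Rome ." becomes "Rome." again.
            String rebuilt = detokenizer.detokenize(tokens, "");

            System.out.println(rebuilt); // I am visiting Rome.
        }
    }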



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)