Posted to issues@opennlp.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/04/02 19:25:00 UTC

[jira] [Commented] (OPENNLP-1185) Tokenizers should be able to output a new line token

    [ https://issues.apache.org/jira/browse/OPENNLP-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516380#comment-17516380 ] 

ASF GitHub Bot commented on OPENNLP-1185:
-----------------------------------------

jzonthemtn commented on a change in pull request #337:
URL: https://github.com/apache/opennlp/pull/337#discussion_r841110488



##########
File path: opennlp-tools/src/main/java/opennlp/tools/tokenize/SimpleTokenizer.java
##########
@@ -101,4 +107,12 @@ else if (Character.isDigit(c)) {
     }
     return tokens.toArray(new Span[tokens.size()]);
   }
+
+  private boolean isLineSeparator(char character) {
+    return character == Character.LINE_SEPARATOR || character == Character.LETTER_NUMBER;
+  }
+
+  public void setKeepNewLines(boolean keepNewLines) {
+    this.keepNewLines = keepNewLines;
+  }

Review comment:
       Good call. I wrote [OPENNLP-1364](https://issues.apache.org/jira/browse/OPENNLP-1364) to capture this.
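
A note for readers of the hunk above: `Character.LINE_SEPARATOR` and `Character.LETTER_NUMBER` are Unicode general-category constants with the values 13 and 10, so the comparison in `isLineSeparator` matches `'\r'` and `'\n'` only because those numeric values coincide with the character codes. The following is a minimal sketch of the same check written with explicit character literals, assuming that matching carriage return and line feed is the intended behaviour. The class name `LineSeparatorCheck` is made up for illustration and is not part of OpenNLP or of this pull request.

```java
// Sketch only: LineSeparatorCheck is a hypothetical helper, not an OpenNLP class.
final class LineSeparatorCheck {

  private LineSeparatorCheck() {
  }

  // Compares against the character literals directly instead of relying on
  // the numeric values of the Unicode general-category constants.
  static boolean isLineSeparator(char character) {
    return character == '\n' || character == '\r';
  }
}
```

Written this way, the check no longer depends on the numeric values of the category constants.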





> Tokenizers should be able to output a new line token
> ----------------------------------------------------
>
>                 Key: OPENNLP-1185
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1185
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Tokenizer
>            Reporter: Jörn Kottmann
>            Priority: Major
>              Labels: ctakes
>
> Some use cases need the tokenizers to also output new line tokens. This is needed, e.g., by cTAKES to process clinical notes, or by the name finder to process lists of names where each name is written on its own line. It also helps the name finder to process news articles.
> To fix this issue, add an option to all three tokenizers to emit new line tokens.
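
To illustrate the option the issue asks for, here is a rough sketch of a whitespace splitter that can emit line breaks as tokens. It is an assumption-laden illustration, not the implementation in the pull request: the class `NewLineTokenSketch`, its `tokenize` method, and the `keepNewLines` parameter as used here are hypothetical, and the sketch returns plain strings rather than OpenNLP `Span` objects.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a plain whitespace splitter with a hypothetical keepNewLines
// option. It does not use or reproduce the OpenNLP tokenizer classes.
final class NewLineTokenSketch {

  private NewLineTokenSketch() {
  }

  static List<String> tokenize(String text, boolean keepNewLines) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (Character.isWhitespace(c)) {
        // Any whitespace ends the token collected so far.
        if (current.length() > 0) {
          tokens.add(current.toString());
          current.setLength(0);
        }
        // When requested, each '\n' or '\r' becomes its own token,
        // so "\r\n" yields two tokens in this sketch.
        if (keepNewLines && (c == '\n' || c == '\r')) {
          tokens.add(String.valueOf(c));
        }
      } else {
        current.append(c);
      }
    }
    // Flush the last token if the text does not end in whitespace.
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }
}
```

With `keepNewLines` set to `true`, a list of names written one per line keeps its line boundaries in the token stream, which is the property the name finder use case relies on. With `false`, line breaks are dropped like any other whitespace.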


