Posted to issues@opennlp.apache.org by "Dippy Aggarwal (JIRA)" <ji...@apache.org> on 2018/06/12 00:31:00 UTC
[jira] [Created] (OPENNLP-1202) Word tokenization
Dippy Aggarwal created OPENNLP-1202:
---------------------------------------
Summary: Word tokenization
Key: OPENNLP-1202
URL: https://issues.apache.org/jira/browse/OPENNLP-1202
Project: OpenNLP
Issue Type: Bug
Components: language model
Environment: Windows Server 2016, R version 3.3.3
Reporter: Dippy Aggarwal
Attachments: openNLP-output.png, openNLPTest.r
I came across an issue with identifying words in a sentence. For words such as *can't*, tokenization with openNLP yields two tokens: "ca" and "n't".
As an example (captured in the screenshot), see the tokenization for the string
*When heard the Xenogears soundtrack, so can't really describe.*
Note the tokens marked by IDs 9 and 10 in the openNLP-output.png file.
I am not sure whether I am missing a parameter that would produce the correct result.
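For context, the "ca" / "n't" split looks consistent with Penn Treebank-style tokenization, where a contraction is separated into a stem and a clitic. If whole words are needed downstream, one possible workaround is to rejoin such pairs after tokenization. The following is only a minimal sketch in Python: `merge_clitics` and the `CLITICS` set are hypothetical helpers, not part of openNLP, and the token list is an assumed rendering of the tokenizer output for the example sentence.

```python
# Hypothetical post-processing helper; NOT part of openNLP.
# Assumption: the tokenizer emits Penn Treebank-style clitics such as "n't".
CLITICS = {"n't", "'s", "'re", "'ve", "'ll", "'d", "'m"}


def merge_clitics(tokens):
    """Rejoin clitic tokens (e.g. "n't") with the token that precedes them."""
    merged = []
    for tok in tokens:
        if tok.lower() in CLITICS and merged:
            # Attach the clitic to the previous token: "ca" + "n't" -> "can't"
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged


# Assumed token sequence for the example sentence in this report.
tokens = ["When", "heard", "the", "Xenogears", "soundtrack", ",",
          "so", "ca", "n't", "really", "describe", "."]
print(merge_clitics(tokens))
```

With this sketch, the pair "ca" and "n't" is joined back into a single "can't" token while all other tokens are left unchanged.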
I would appreciate any ideas or attention from the community on this issue. Thanks.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)