Posted to issues@opennlp.apache.org by "Dippy Aggarwal (JIRA)" <ji...@apache.org> on 2018/06/12 00:31:00 UTC

[jira] [Created] (OPENNLP-1202) Word tokenization

Dippy Aggarwal created OPENNLP-1202:
---------------------------------------

             Summary: Word tokenization 
                 Key: OPENNLP-1202
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1202
             Project: OpenNLP
          Issue Type: Bug
          Components: language model
         Environment: Windows Server 2016, R version 3.3.3
            Reporter: Dippy Aggarwal
         Attachments: openNLP-output.png, openNLPTest.r

I came across an issue with identifying words in a sentence. For words such as *can't*, tokenization with openNLP yields two tokens: "ca" and "n't".

As an example (captured in the screenshot), see the tokenization for the string

*When heard the Xenogears soundtrack, so can't really describe.*

Note the words marked by IDs 9 and 10 in the openNLP-output.png file.

I am not sure whether I am missing a parameter that would produce the expected result, i.e. *can't* as a single token.
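
The "ca" / "n't" split looks like the Penn Treebank convention for contractions, which the English tokenizer model presumably follows (an assumption, not confirmed here). If so, one possible workaround is to glue such tokens back together after tokenization. The sketch below is in Python purely for illustration and does not call openNLP; the function name `merge_contractions`, the `CLITICS` list, and the hand-written token list are all hypothetical (only "ca" and "n't" are confirmed by the screenshot).

```python
# Clitic tokens that Penn Treebank-style tokenizers split off.
# Assumed, non-exhaustive list chosen for this illustration.
CLITICS = {"n't", "'s", "'re", "'ve", "'ll", "'d", "'m"}


def merge_contractions(tokens):
    """Append each clitic token to the token that precedes it."""
    merged = []
    for tok in tokens:
        if merged and tok.lower() in CLITICS:
            merged[-1] += tok  # e.g. "ca" + "n't" -> "can't"
        else:
            merged.append(tok)
    return merged


# Hand-reconstructed token list for the example sentence; only the
# "ca" / "n't" pair is taken from the report itself.
observed = ["When", "heard", "the", "Xenogears", "soundtrack", ",",
            "so", "ca", "n't", "really", "describe", "."]

print(merge_contractions(observed))
```

Note that this also re-attaches possessive "'s", which may or may not be wanted depending on the downstream use.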

I would appreciate any ideas or attention from the community on this issue. Thanks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)