Posted to issues@opennlp.apache.org by "Bharani Sruthi (Jira)" <ji...@apache.org> on 2020/08/23 03:52:00 UTC
[jira] [Commented] (OPENNLP-1202) Word tokenization
[ https://issues.apache.org/jira/browse/OPENNLP-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182577#comment-17182577 ]
Bharani Sruthi commented on OPENNLP-1202:
-----------------------------------------
We have uploaded a fix for this bug and added some more contractions and their expansions. Please find the attached files:
1) contractionsdiff.txt - contains the code added to fix the issue
2) OpenNLPTest.py - contains the test class
3) OpenNLPSampleProgramOutput.png - contains the output of the test
Could you please let us know how to raise a pull request for this fix?
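
For readers without access to the attachments, here is a minimal sketch of the general idea: post-processing tokenizer output so that a split such as "ca" + "n't" is rejoined and then expanded. It is an illustration only and is not the contents of contractionsdiff.txt. The names CONTRACTION_EXPANSIONS, CLITIC_SUFFIXES, merge_split_contractions and expand_contractions are hypothetical, the mapping is a small hand-written sample, and the token list is assumed to be the tokenizer output for the sentence in this issue.

```python
# Hypothetical sample mapping of contractions to expansions.
# This is NOT the list from contractionsdiff.txt.
CONTRACTION_EXPANSIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "isn't": "is not",
}

# Suffix tokens that the tokenizer is assumed to split off a preceding word.
CLITIC_SUFFIXES = ("n't", "'s", "'re", "'ve", "'ll", "'d", "'m")


def merge_split_contractions(tokens):
    """Rejoin a clitic suffix token (e.g. "n't") with the token before it."""
    merged = []
    for token in tokens:
        if merged and token.lower() in CLITIC_SUFFIXES:
            # Glue the suffix back onto the previous token: "ca" + "n't" -> "can't".
            merged[-1] = merged[-1] + token
        else:
            merged.append(token)
    return merged


def expand_contractions(tokens, expansions=CONTRACTION_EXPANSIONS):
    """Replace known contractions with their expansions, one word per token."""
    expanded = []
    for token in tokens:
        replacement = expansions.get(token.lower())
        if replacement is None:
            expanded.append(token)
        else:
            expanded.extend(replacement.split())
    return expanded


# Assumed tokenizer output for the tail of the sentence reported in this issue.
tokens = ["so", "ca", "n't", "really", "describe", "."]
merged = merge_split_contractions(tokens)
expanded = expand_contractions(merged)
print(merged)
print(expanded)
```

Merging and expanding are kept as separate steps so that callers who only want the original surface form ("can't") can skip the expansion. Note that the merge step also rejoins possessives such as "John" + "'s", which may or may not be desired.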
> Word tokenization
> ------------------
>
> Key: OPENNLP-1202
> URL: https://issues.apache.org/jira/browse/OPENNLP-1202
> Project: OpenNLP
> Issue Type: Bug
> Components: language model
> Environment: Windows Server 2016, R version 3.3.3
> Reporter: Dippy Aggarwal
> Priority: Major
> Labels: Annotations
> Attachments: OpenNLPSampleProgramOutput.png, contractionsdiff.txt, openNLP-output.png, openNLPTest.py, openNLPTest.r
>
>
> I came across an issue with identifying words in a sentence. For words such as *can't*, tokenization with openNLP yields two tokens: "ca" and "n't".
> As an example (captured in the screenshot), see the tokenization for the string
> *When heard the Xenogears soundtrack, so can't really describe.*
> Note the words marked by IDs 9 and 10 in the openNLP-output.png file.
> I am not sure whether I am missing any parameters that would produce the correct result.
> Would appreciate any ideas/community's attention to this issue. Thanks.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)