Posted to issues@opennlp.apache.org by "Bharani Sruthi (Jira)" <ji...@apache.org> on 2020/08/23 03:52:00 UTC

[jira] [Commented] (OPENNLP-1202) Word tokenization

    [ https://issues.apache.org/jira/browse/OPENNLP-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182577#comment-17182577 ] 

Bharani Sruthi commented on OPENNLP-1202:
-----------------------------------------

We have uploaded a fix for this bug and added some more contractions and their expansions. Please find the following attachments:

1) contractionsdiff.txt - contains the code added to fix the issue

2) OpenNLPTest.py - contains the test class

3) OpenNLPSampleProgramOutput.png - contains the output of the test
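
For readers without access to the attachments, here is a minimal Python sketch of the general idea: rejoin the split tokens ("ca" + "n't") and then look the result up in a contraction table. This is only an illustration under stated assumptions, not the code in contractionsdiff.txt. The names CONTRACTIONS, CLITIC_SUFFIXES, merge_contractions and expand_contractions are hypothetical, and the table entries are examples rather than the list we added.

```python
# Hypothetical contraction table for illustration only;
# NOT the contents of contractionsdiff.txt.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "isn't": "is not",
}

# Assumption: the tokenizer emits these suffixes as separate tokens,
# as seen with "ca" + "n't" in the reported output.
CLITIC_SUFFIXES = ("n't", "'ll", "'re", "'ve", "'s", "'d", "'m")


def merge_contractions(tokens):
    """Rejoin suffix tokens such as "n't" with the token before them."""
    merged = []
    for token in tokens:
        if merged and token.lower() in CLITIC_SUFFIXES:
            merged[-1] += token
        else:
            merged.append(token)
    return merged


def expand_contractions(tokens):
    """Replace known contractions with their expansions, word by word."""
    expanded = []
    for token in tokens:
        expansion = CONTRACTIONS.get(token.lower())
        expanded.extend(expansion.split() if expansion else [token])
    return expanded


if __name__ == "__main__":
    tokens = ["so", "ca", "n't", "really", "describe", "."]
    merged = merge_contractions(tokens)
    print(merged)
    print(expand_contractions(merged))
```

Merging first keeps the lookup table keyed on the full written form ("can't"), so it does not need to know about the intermediate pieces produced by the tokenizer.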

 

Could you please let us know how to raise a pull request for this fix?

> Word tokenization 
> ------------------
>
>                 Key: OPENNLP-1202
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1202
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: language model
>         Environment: Windows Server 2016, R version 3.3.3
>            Reporter: Dippy Aggarwal
>            Priority: Major
>              Labels: Annotations
>         Attachments: OpenNLPSampleProgramOutput.png, contractionsdiff.txt, openNLP-output.png, openNLPTest.py, openNLPTest.r
>
>
> I came across an issue with identifying words in a sentence. For words such as *can't*, tokenization with openNLP yields two tokens: "ca" and "n't".
> As an example (captured in the screenshot), see the tokenization for the string
> *When heard the Xenogears soundtrack, so can't really describe.*
> Note the words marked by IDs 9 and 10 in the openNLP-output.png file.
> I am not sure whether I am missing a parameter that would produce the correct result.
> I would appreciate any ideas or attention from the community on this issue. Thanks.
>  


