Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2014/10/21 00:08:33 UTC

[jira] [Commented] (OPENNLP-711) SentenceDetectorME::sentPosDetect() with useTokenEnd=false

    [ https://issues.apache.org/jira/browse/OPENNLP-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177556#comment-14177556 ] 

Joern Kottmann commented on OPENNLP-711:
----------------------------------------

With useTokenEnd=false, the calculated position should be the start index of the next sentence, i.e. the index of the first char in that sentence.

The code in the issue description instead uses the index of the eos char as the start index of the next sentence. We should apply the proposed fix and add one to cint so that the useTokenEnd=false case is handled correctly.
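
To illustrate the off-by-one, here is a minimal standalone sketch. It does not use OpenNLP; the class name EosOffsetSketch and the helper firstNonWhitespace are hypothetical stand-ins, and the helper only assumes that getFirstNonWS skips whitespace starting at the given position.

```java
public class EosOffsetSketch {

    // Hypothetical stand-in for getFirstNonWS (an assumption, not the OpenNLP source):
    // returns the index of the first non-whitespace char at or after pos.
    static int firstNonWhitespace(String s, int pos) {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        String s = "I am hungry.Ich bin Mr. Bean.";
        int cint = s.indexOf('.'); // index of the first eos char

        // Starting the search at cint stops on the eos char itself.
        int current = firstNonWhitespace(s, cint);
        // Starting at cint + 1 lands on the first char of the next sentence.
        int proposed = firstNonWhitespace(s, cint + 1);

        System.out.println(current + " -> '" + s.charAt(current) + "'");
        System.out.println(proposed + " -> '" + s.charAt(proposed) + "'");
    }
}
```

In this sketch the first call returns the index of the '.' after "hungry", while the second returns the index of the 'I' in "Ich", which is the start of the next sentence.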



> SentenceDetectorME::sentPosDetect() with useTokenEnd=false
> ----------------------------------------------------------
>
>                 Key: OPENNLP-711
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-711
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Sentence Detector
>    Affects Versions: 1.6.0
>            Reporter: Eugen Hanussek
>            Priority: Minor
>             Fix For: 1.6.0
>
>
> I trained the SentenceModel with a German corpus and was surprised by the results for the following input (the marks indicate the expected splits):
> {code:xml}
> "I am hungry.Ich bin Mr. Bean.Ein guter Satz."
>              ^                ^
> {code}
> The result was 3 sentences. Good, but the split was not at the eosChar. It was after the token containing the eosChar: "I am hungry.Ich" , "bin Mr. Bean.Ein", ...
> After some debugging I found out that I have to set useTokenEnd=false in the SentenceDetectorFactory constructor.
> I then found a *little bug in SentenceDetectorME* where the span is calculated:
> {code:java}
>   public Span[] sentPosDetect(String s) {
> ...
>       if (bestOutcome.equals(SPLIT) && isAcceptableBreak(s, index, cint)) {
>         if (index != cint) {
>           if (useTokenEnd) {
>             positions.add(getFirstNonWS(s, getFirstWS(s,cint + 1)));
>           }
>           else {
>             positions.add(getFirstNonWS(s,cint)); // this should be positions.add(getFirstNonWS(s,cint + 1)); 
>           }
>           sentProbs.add(probs[model.getIndex(bestOutcome)]);
>         }
>         index = cint + 1;
>       }
> ...
> {code}
> This change only affects models with useTokenEnd=false.


