Posted to issues@opennlp.apache.org by "Eugen Hanussek (JIRA)" <ji...@apache.org> on 2014/08/07 13:43:11 UTC

[jira] [Updated] (OPENNLP-711) SentenceDetectorME::sentPosDetect() with useTokenEnd=false

     [ https://issues.apache.org/jira/browse/OPENNLP-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugen Hanussek updated OPENNLP-711:
-----------------------------------

    Description: 
I trained the SentenceModel with a German corpus and was surprised by the results for the following input (each caret marks an expected split):
{code:xml}
"I am hungry.Ich bin Mr. Bean.Ein guter Satz."
             ^                ^
{code}
The detector returned 3 sentences, which is correct, but it did not split at the eosChar. It split after the whole token containing the eosChar: "I am hungry.Ich", "bin Mr. Bean.Ein", ...

After some debugging I found that I have to pass useTokenEnd=false to the SentenceDetectorFactory constructor.
Then I found a *small bug in SentenceDetectorME* where the split position is calculated:
{code:java}
  public Span[] sentPosDetect(String s) {
...
      if (bestOutcome.equals(SPLIT) && isAcceptableBreak(s, index, cint)) {
        if (index != cint) {
          if (useTokenEnd) {
            positions.add(getFirstNonWS(s, getFirstWS(s,cint + 1)));
          }
          else {
            positions.add(getFirstNonWS(s,cint)); // BUG: this should be positions.add(getFirstNonWS(s,cint + 1));
          }
          sentProbs.add(probs[model.getIndex(bestOutcome)]);
        }
        index = cint + 1;
      }
...
{code}

This change only affects models with useTokenEnd=false.
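
To make the off-by-one concrete, here is a minimal standalone sketch of the three index calculations for the first eosChar in the example input. It does not use OpenNLP itself: the class name SplitIndexDemo is hypothetical, and getFirstWS / getFirstNonWS are simplified stand-ins that assume the real helpers just skip to the next whitespace or non-whitespace character.

```java
public class SplitIndexDemo {

    // Assumed behaviour: index of the first whitespace char at or after pos.
    static int getFirstWS(String s, int pos) {
        while (pos < s.length() && !Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    // Assumed behaviour: index of the first non-whitespace char at or after pos.
    static int getFirstNonWS(String s, int pos) {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        String s = "I am hungry.Ich bin Mr. Bean.Ein guter Satz.";
        int cint = s.indexOf('.'); // index of the first eosChar: 11

        // useTokenEnd=true: skip to the end of the token, then to the next token.
        int tokenEnd = getFirstNonWS(s, getFirstWS(s, cint + 1));
        // useTokenEnd=false as quoted above: the eosChar is not whitespace,
        // so the search stops on the eosChar itself.
        int current = getFirstNonWS(s, cint);
        // useTokenEnd=false with the proposed fix: start searching after the eosChar.
        int proposed = getFirstNonWS(s, cint + 1);

        System.out.println("useTokenEnd=true  -> " + tokenEnd); // 16, next sentence starts at "bin"
        System.out.println("current (false)   -> " + current);  // 11, next sentence starts at "."
        System.out.println("proposed (false)  -> " + proposed); // 12, next sentence starts at "Ich"
    }
}
```

Under these assumptions the quoted code puts the split on the eosChar itself, so the following sentence would begin with ".", while cint + 1 puts it on the first character after the eosChar.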


> SentenceDetectorME::sentPosDetect() with useTokenEnd=false
> ----------------------------------------------------------
>
>                 Key: OPENNLP-711
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-711
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Sentence Detector
>    Affects Versions: 1.6.0
>            Reporter: Eugen Hanussek
>            Priority: Minor


