Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2016/11/08 11:57:59 UTC

[jira] [Commented] (OPENNLP-772) Japanese end of sentence fix

    [ https://issues.apache.org/jira/browse/OPENNLP-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15647343#comment-15647343 ] 

Joern Kottmann commented on OPENNLP-772:
----------------------------------------

We are preparing the next release and this could easily be pulled in. Do you still need this change? Has it worked well for you?

> Japanese end of sentence fix
> ----------------------------
>
>                 Key: OPENNLP-772
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-772
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Sentence Detector
>    Affects Versions: tools-1.5.3
>            Reporter: Bar Perach
>              Labels: patch
>             Fix For: 1.7.0
>
>
> The end-of-sentence character list was wrong for Japanese.
> Removed duplicate code.
> Index: opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/Factory.java
> ===================================================================
> --- opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/Factory.java	(revision 1678426)
> +++ opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/Factory.java	(local)
> @@ -36,14 +36,12 @@
>  
>    public static final char[] thEosCharacters = new char[] { ' ','\n' };
>  
> +  // TODO add more sentence enders
> +  public static final char[] jpEosCharacters = new char[] {'。', '!', '?'};
> +
>    public EndOfSentenceScanner createEndOfSentenceScanner(String languageCode) {
> -    if ("th".equals(languageCode)) {
> -      return new DefaultEndOfSentenceScanner(new char[]{' ','\n'});
> -    } else if("pt".equals(languageCode)) {
> -      return new DefaultEndOfSentenceScanner(ptEosCharacters);
> -    }
>  
> -    return new DefaultEndOfSentenceScanner(defaultEosCharacters);
> +    return new DefaultEndOfSentenceScanner(getEOSCharacters(languageCode));
>    }
>  
>    public EndOfSentenceScanner createEndOfSentenceScanner(
> @@ -76,6 +74,8 @@
>        return thEosCharacters;
>      } else if ("pt".equals(languageCode)) {
>        return ptEosCharacters;
> +    } else if ("jp".equals(languageCode)) {
> +      return jpEosCharacters;
>      }
>  
>      return defaultEosCharacters;
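
For readers who want to try the idea in the patch without an OpenNLP checkout, here is a minimal, self-contained sketch of a per-language end-of-sentence lookup feeding a single scanning routine. It does not use the OpenNLP API: the class name EosSketch and the method getPositions are hypothetical, the default character set is an assumption (the patch does not show it), and the "pt" branch is omitted for the same reason. The "th" and "jp" sets and the "jp" language code are copied from the patch as submitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the lookup in the patch; not OpenNLP code.
public class EosSketch {

  // Copied from the patch.
  static final char[] TH_EOS = {' ', '\n'};
  // Copied from the patch; '\u3002' is the ideographic full stop.
  static final char[] JP_EOS = {'\u3002', '!', '?'};
  // Assumption: the patch does not show the default set.
  static final char[] DEFAULT_EOS = {'.', '!', '?'};

  // Single place that maps a language code to its sentence enders,
  // so callers no longer repeat the if/else chain.
  static char[] getEOSCharacters(String languageCode) {
    if ("th".equals(languageCode)) {
      return TH_EOS;
    } else if ("jp".equals(languageCode)) {
      return JP_EOS;
    }
    return DEFAULT_EOS;
  }

  // Returns the index of every character in text that is a sentence ender.
  static List<Integer> getPositions(String text, char[] eosCharacters) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      for (char eos : eosCharacters) {
        if (c == eos) {
          positions.add(i);
          break;
        }
      }
    }
    return positions;
  }

  public static void main(String[] args) {
    // Two sentences, each ended by an ideographic full stop.
    String text = "Kyou wa hare desu\u3002Ashita wa ame desu\u3002";
    System.out.println(getPositions(text, getEOSCharacters("jp")));
    System.out.println(getPositions(text, getEOSCharacters("en")));
  }
}
```

With the "jp" set the scanner finds both ideographic full stops, while the assumed default set finds none, which is the behavior the patch is meant to correct.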


