Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2016/12/20 22:37:58 UTC

[jira] [Updated] (OPENNLP-772) Japanese end of sentence fix

     [ https://issues.apache.org/jira/browse/OPENNLP-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joern Kottmann updated OPENNLP-772:
-----------------------------------
    Priority: Minor  (was: Major)

> Japanese end of sentence fix
> ----------------------------
>
>                 Key: OPENNLP-772
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-772
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Sentence Detector
>    Affects Versions: tools-1.5.3
>            Reporter: Bar Perach
>            Assignee: Joern Kottmann
>            Priority: Minor
>              Labels: patch
>             Fix For: 1.7.0
>
>
> The end-of-sentence character list was wrong for Japanese.
> Removed duplicate code.
> Index: opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/Factory.java
> ===================================================================
> --- opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/Factory.java	(revision 1678426)
> +++ opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/Factory.java	(local)
> @@ -36,14 +36,12 @@
>  
>    public static final char[] thEosCharacters = new char[] { ' ','\n' };
>  
> +  // TODO add more sentence enders
> +  public static final char[] jpEosCharacters = new char[] {'。', '!', '?'};
> +
>    public EndOfSentenceScanner createEndOfSentenceScanner(String languageCode) {
> -    if ("th".equals(languageCode)) {
> -      return new DefaultEndOfSentenceScanner(new char[]{' ','\n'});
> -    } else if("pt".equals(languageCode)) {
> -      return new DefaultEndOfSentenceScanner(ptEosCharacters);
> -    }
>  
> -    return new DefaultEndOfSentenceScanner(defaultEosCharacters);
> +    return new DefaultEndOfSentenceScanner(getEOSCharacters(languageCode));
>    }
>  
>    public EndOfSentenceScanner createEndOfSentenceScanner(
> @@ -76,6 +74,8 @@
>        return thEosCharacters;
>      } else if ("pt".equals(languageCode)) {
>        return ptEosCharacters;
> +    } else if ("jp".equals(languageCode)) {
> +      return jpEosCharacters;
>      }
>  
>      return defaultEosCharacters;
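The patch above routes every language through a single lookup of end-of-sentence characters and adds a Japanese entry to it. The following is a minimal, standalone sketch of that idea, not the OpenNLP implementation: the class EosLookupSketch, its constant names, and the eosPositions helper are hypothetical, and the default character set ('.', '!', '?') is an assumption, since the patch does not show defaultEosCharacters. The sketch keeps the "jp" key exactly as the patch writes it; note that the ISO 639-1 code for Japanese is "ja", so callers passing "ja" would fall through to the default set here.

```java
import java.util.ArrayList;
import java.util.List;

public class EosLookupSketch {

  // Assumed default set; the patch does not show the real defaultEosCharacters.
  static final char[] DEFAULT_EOS = {'.', '!', '?'};
  // Thai set as shown in the patch context.
  static final char[] TH_EOS = {' ', '\n'};
  // Japanese set from the patch; '\u3002' is the ideographic full stop.
  static final char[] JP_EOS = {'\u3002', '!', '?'};

  // Single lookup point, mirroring the structure of the patched method.
  public static char[] getEosCharacters(String languageCode) {
    if ("th".equals(languageCode)) {
      return TH_EOS;
    } else if ("jp".equals(languageCode)) { // key copied from the patch
      return JP_EOS;
    }
    return DEFAULT_EOS;
  }

  // Hypothetical helper: indices of all candidate sentence-end characters.
  public static List<Integer> eosPositions(String text, char[] eosChars) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < text.length(); i++) {
      for (char eos : eosChars) {
        if (text.charAt(i) == eos) {
          positions.add(i);
          break;
        }
      }
    }
    return positions;
  }

  public static void main(String[] args) {
    String text = "Kyou wa hare desu\u3002Ashita wa ame desu ka?";
    System.out.println(eosPositions(text, getEosCharacters("jp")));
    System.out.println(eosPositions(text, getEosCharacters("en")));
  }
}
```

With the "jp" set, both the ideographic full stop and the question mark are reported as candidate positions; with the assumed default set, only the question mark is. This is the behavioural difference the issue title refers to.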



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)