Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2015/09/10 11:08:45 UTC

[jira] [Commented] (OPENNLP-788) Add a language detection component

    [ https://issues.apache.org/jira/browse/OPENNLP-788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14738452#comment-14738452 ] 

Joern Kottmann commented on OPENNLP-788:
----------------------------------------

Maybe the following interface would be suitable:
public interface LanguageDetector {
  Language[] detectLanguage(CharSequence content);
  Set<String> getSupportedLanguages();
  String getLanguageCoding();
}

The doccat component can already perform language detection when it is given a custom factory. Maybe we can find a way to build the language detector on top of the doccat work, which would avoid a fair amount of code duplication.
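
To make the proposed interface more concrete, here is a minimal sketch of one possible implementation. Everything beyond the three interface methods is an assumption made for illustration: the Language class (a code plus a confidence score), the StopwordLanguageDetector class, the tiny stopword lists, and the choice of "ISO 639-3" as the value returned by getLanguageCoding() are all hypothetical. The sketch does not use doccat; the stopword scoring is only a stand-in for whatever model a doccat-based detector would use.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical value type: the proposal does not define Language.
final class Language {
  private final String lang;
  private final double confidence;

  Language(String lang, double confidence) {
    this.lang = lang;
    this.confidence = confidence;
  }

  String getLang() { return lang; }
  double getConfidence() { return confidence; }
}

// The interface as proposed in the comment.
interface LanguageDetector {
  Language[] detectLanguage(CharSequence content);
  Set<String> getSupportedLanguages();
  String getLanguageCoding();
}

// Toy detector (assumption): scores each language by the fraction of
// tokens that appear in a small, hard-coded stopword list.
final class StopwordLanguageDetector implements LanguageDetector {
  private final Map<String, Set<String>> stopwords = new LinkedHashMap<>();

  StopwordLanguageDetector() {
    stopwords.put("eng", Set.of("the", "and", "is", "of", "to"));
    stopwords.put("deu", Set.of("der", "die", "und", "ist", "das"));
    stopwords.put("fra", Set.of("le", "la", "et", "est", "les"));
  }

  @Override
  public Language[] detectLanguage(CharSequence content) {
    // Split on anything that is not a letter; empty tokens are skipped below.
    String[] tokens =
        content.toString().toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
    int total = 0;
    for (String token : tokens) {
      if (!token.isEmpty()) {
        total++;
      }
    }

    Language[] result = new Language[stopwords.size()];
    int i = 0;
    for (Map.Entry<String, Set<String>> entry : stopwords.entrySet()) {
      int hits = 0;
      for (String token : tokens) {
        if (entry.getValue().contains(token)) {
          hits++;
        }
      }
      double confidence = total == 0 ? 0.0 : (double) hits / total;
      result[i++] = new Language(entry.getKey(), confidence);
    }

    // Best guess first; the sort is stable, so ties keep insertion order.
    Arrays.sort(result,
        (a, b) -> Double.compare(b.getConfidence(), a.getConfidence()));
    return result;
  }

  @Override
  public Set<String> getSupportedLanguages() {
    return Collections.unmodifiableSet(stopwords.keySet());
  }

  @Override
  public String getLanguageCoding() {
    return "ISO 639-3"; // assumption: the proposal leaves the coding open
  }
}
```

A caller would create the detector, pass any CharSequence to detectLanguage(...), and read the first array element as the best guess. Returning an array rather than a single value lets callers inspect the runner-up languages for short or mixed-language input.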

> Add a language detection component
> ----------------------------------
>
>                 Key: OPENNLP-788
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-788
>             Project: OpenNLP
>          Issue Type: Improvement
>            Reporter: Joern Kottmann
>
> Many of the components in OpenNLP are sensitive to the input language. It would be nice if OpenNLP had a component to detect the language of an input text.
> Two commonly used solutions today are:
> Apache Tika's Language Identifier
> Language Detection by Shuyo Nakatani


