Posted to dev@tika.apache.org by "Ken Krugler (Resolved) (JIRA)" <ji...@apache.org> on 2012/02/02 01:59:53 UTC

[jira] [Resolved] (TIKA-855) Language Detection not working for Japanese and Chinese.

     [ https://issues.apache.org/jira/browse/TIKA-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler resolved TIKA-855.
------------------------------

    Resolution: Not A Problem
      Assignee: Ken Krugler

Tika currently doesn't support Japanese or Chinese for language identification. I'd recommend filing a feature request to add these languages to Tika.

I'm wondering whether Tika (or the CLI) should return no language at all when the match level is too low, or at least indicate that the match is weak. That way, when someone tries Tika on an unsupported language X, it's more obvious that the problem isn't a bad language profile but that the language simply isn't supported.
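
To illustrate the "no answer when the match is too weak" idea, here is a minimal sketch. It does not call Tika: the class name, the per-language distance map, and the cutoff value are all hypothetical, and it assumes a detector that reports one distance per language profile, where lower means a closer match.

```java
import java.util.Map;
import java.util.Optional;

public class ThresholdedDetection {

    // Hypothetical cutoff chosen for illustration only; not a Tika constant.
    static final double MAX_DISTANCE = 0.05;

    /**
     * Returns the language code with the smallest distance, or an empty
     * Optional when even the best match is farther away than maxDistance.
     */
    static Optional<String> detect(Map<String, Double> distances, double maxDistance) {
        return distances.entrySet().stream()
                .min(Map.Entry.comparingByValue())          // closest profile
                .filter(e -> e.getValue() <= maxDistance)   // reject weak matches
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        // Made-up distances: no profile is close, so no language is reported.
        Map<String, Double> weak = Map.of("lt", 0.30, "en", 0.40);
        System.out.println(detect(weak, MAX_DISTANCE).orElse("unknown"));

        // Made-up distances: "en" is well inside the cutoff.
        Map<String, Double> strong = Map.of("en", 0.01, "fr", 0.20);
        System.out.println(detect(strong, MAX_DISTANCE).orElse("unknown"));
    }
}
```

With a cutoff like this, a text in an unsupported language would produce "unknown" rather than whichever supported profile happens to be least far away.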
                
> Language Detection not working for Japanese and Chinese.
> --------------------------------------------------------
>
>                 Key: TIKA-855
>                 URL: https://issues.apache.org/jira/browse/TIKA-855
>             Project: Tika
>          Issue Type: Bug
>          Components: languageidentifier
>    Affects Versions: 1.0
>         Environment: Windows XP, Vista and Linux Ubuntu 11.10 using Sun Java 6 and Oracle Java 7
>            Reporter: James Sullivan
>            Assignee: Ken Krugler
>            Priority: Minor
>              Labels: Chinese, Japanese
>
> I have tried Tika 1.0 language detection (java -jar tika.jar -l .\Japanese.txt) on several Japanese files (both PDF and text files) and it consistently returns lt (Lithuanian???) instead of ja. I also tried on a Chinese file which similarly incorrectly returned lt. Both English language and French language detection worked correctly.
