Posted to dev@tika.apache.org by "Nick Burch (Commented) (JIRA)" <ji...@apache.org> on 2012/02/02 01:55:53 UTC

[jira] [Commented] (TIKA-855) Language Detection not working for Japanese and Chinese.

    [ https://issues.apache.org/jira/browse/TIKA-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198415#comment-13198415 ] 

Nick Burch commented on TIKA-855:
---------------------------------

I believe we're currently missing language profiles for Japanese and Chinese, which would explain the detection failure. We probably need someone with a large corpus of text in each of the two languages to help generate them.
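
To illustrate why a missing profile yields a wrong answer rather than "unknown", here is a minimal sketch of a profile-based detector. It is not Tika's implementation: the class ProfileSketch, its methods, and the tiny sample corpora are all hypothetical, and it assumes a detector that simply picks the closest character-trigram profile among those it has. Under that assumption, text in a language with no profile can only ever be mapped to one of the existing profiles.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy detector; NOT Tika's code. It only illustrates that a
// nearest-profile detector can never return a language it has no profile for.
class ProfileSketch {

    // Build a normalized character-trigram frequency profile from sample text.
    static Map<String, Double> profile(String text) {
        Map<String, Double> freq = new HashMap<>();
        String t = text.toLowerCase(Locale.ROOT);
        int total = 0;
        for (int i = 0; i + 3 <= t.length(); i++) {
            freq.merge(t.substring(i, i + 3), 1.0, Double::sum);
            total++;
        }
        if (total > 0) {
            final double n = total;
            freq.replaceAll((k, v) -> v / n);
        }
        return freq;
    }

    // Sum of absolute frequency differences over the union of trigrams.
    // Two profiles with no trigram in common are at the maximum distance (about 2).
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        double d = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            d += Math.abs(e.getValue() - b.getOrDefault(e.getKey(), 0.0));
        }
        for (Map.Entry<String, Double> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                d += e.getValue();
            }
        }
        return d;
    }

    // Return the name of the closest known profile. There is no "unknown"
    // answer, so unsupported languages fall through to some supported one.
    static String detect(String text, Map<String, Map<String, Double>> profiles) {
        Map<String, Double> p = profile(text);
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            double d = distance(p, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    // Tiny illustrative corpora; real profiles need far more text.
    static Map<String, Map<String, Double>> sampleProfiles() {
        Map<String, Map<String, Double>> profiles = new LinkedHashMap<>();
        profiles.put("en", profile("the quick brown fox jumps over the lazy dog and the cat"));
        profiles.put("fr", profile("le renard brun rapide saute par dessus le chien paresseux"));
        return profiles;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> profiles = sampleProfiles();
        // Japanese sample text, written with Unicode escapes.
        String japanese = "\u3053\u308c\u306f\u65e5\u672c\u8a9e\u306e\u6587\u7ae0\u3067\u3059";
        System.out.println(detect("the dog and the fox", profiles));
        // Prints one of the known profile names, never "ja".
        System.out.println(detect(japanese, profiles));
    }
}
```

In this sketch the Japanese sample shares no trigrams with either profile, so every candidate is at essentially the same maximum distance and the winner is arbitrary. A real detector would need a Japanese profile (or a certainty threshold) to avoid reporting an unrelated language.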
                
> Language Detection not working for Japanese and Chinese.
> --------------------------------------------------------
>
>                 Key: TIKA-855
>                 URL: https://issues.apache.org/jira/browse/TIKA-855
>             Project: Tika
>          Issue Type: Bug
>          Components: languageidentifier
>    Affects Versions: 1.0
>         Environment: Windows XP, Vista and Linux Ubuntu 11.10 using Sun Java 6 and Oracle Java 7
>            Reporter: James Sullivan
>            Priority: Minor
>              Labels: Chinese, Japanese
>
> I have tried Tika 1.0 language detection (java -jar tika.jar -l .\Japanese.txt) on several Japanese files (both PDF and text files), and it consistently returns lt (Lithuanian???) instead of ja. I also tried a Chinese file, which likewise incorrectly returned lt. Detection of both English and French worked correctly.
