Posted to dev@tika.apache.org by "Nick Burch (Commented) (JIRA)" <ji...@apache.org> on 2012/02/02 01:55:53 UTC
[jira] [Commented] (TIKA-855) Language Detection not working for Japanese and Chinese.
[ https://issues.apache.org/jira/browse/TIKA-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198415#comment-13198415 ]
Nick Burch commented on TIKA-855:
---------------------------------
I believe we're currently missing language profiles for those two, which would explain the detection issue. To generate them, we probably need help from someone with a large corpus of text in each of the two languages.
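For illustration, here is a minimal sketch of what generating and using an n-gram language profile can look like. This is not Tika's actual code or profile format: the class name NgramProfileSketch, the methods buildProfile, similarity and detect, the tiny sample sentences, and the use of cosine similarity are all assumptions made for the example. It does show one plausible reason a detector without a ja or zh profile still answers with some other language: a nearest-profile match can only ever return a label it has a profile for.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Tika's implementation or profile format.
public class NgramProfileSketch {

    // Count overlapping character n-grams (UTF-16 code units) in the text.
    // A real profile would be built from a large corpus, not one sentence.
    static Map<String, Integer> buildProfile(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two n-gram count profiles (0.0 if either is empty).
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double v = e.getValue();
            normA += v * v;
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += v * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Return the label of the most similar known profile.
    // Note: this always returns one of the known labels (or null if there
    // are no profiles at all), even when nothing in the text matches.
    static String detect(String text, Map<String, Map<String, Integer>> profiles, int n) {
        Map<String, Integer> sample = buildProfile(text, n);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = similarity(sample, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy "corpora"; only en and lt profiles exist, none for ja or zh.
        Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();
        profiles.put("en", buildProfile("the quick brown fox jumps over the lazy dog and the cat", 3));
        profiles.put("lt", buildProfile("labas rytas, kaip jums sekasi šiandien", 3));

        System.out.println(detect("the dog and the fox", profiles, 3));
        // Japanese input shares no trigrams with either profile, yet a
        // known label is still returned because there is nothing better.
        System.out.println(detect("これは日本語のテキストです", profiles, 3));
    }
}
```

In a sketch like this, adding Japanese and Chinese support would amount to calling buildProfile on a sufficiently large corpus for each language and registering the results, which is why the corpus is the missing piece.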
> Language Detection not working for Japanese and Chinese.
> --------------------------------------------------------
>
> Key: TIKA-855
> URL: https://issues.apache.org/jira/browse/TIKA-855
> Project: Tika
> Issue Type: Bug
> Components: languageidentifier
> Affects Versions: 1.0
> Environment: Windows XP, Vista and Linux Ubuntu 11.10 using Sun Java 6 and Oracle Java 7
> Reporter: James Sullivan
> Priority: Minor
> Labels: Chinese, Japanese
>
> I have tried Tika 1.0 language detection (java -jar tika.jar -l .\Japanese.txt) on several Japanese files (both PDF and text files) and it consistently returns lt (Lithuanian???) instead of ja. I also tried a Chinese file, which likewise incorrectly returned lt. Detection of both English and French worked correctly.