Posted to dev@tika.apache.org by "James Sullivan (Created) (JIRA)" <ji...@apache.org> on 2012/02/02 01:27:53 UTC

[jira] [Created] (TIKA-855) Language Detection not working for Japanese and Chinese.

Language Detection not working for Japanese and Chinese.
--------------------------------------------------------

                 Key: TIKA-855
                 URL: https://issues.apache.org/jira/browse/TIKA-855
             Project: Tika
          Issue Type: Bug
          Components: languageidentifier
    Affects Versions: 1.0
         Environment: Windows XP, Vista and Linux Ubuntu 11.10 using Sun Java 6 and Oracle Java 7
            Reporter: James Sullivan
            Priority: Minor


I have tried Tika 1.0 language detection (java -jar tika.jar -l .\Japanese.txt) on several Japanese files (both PDF and text files) and it consistently returns lt (Lithuanian???) instead of ja. I also tried on a Chinese file which similarly incorrectly returned lt. Both English language and French language detection worked correctly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] [Commented] (TIKA-855) Language Detection not working for Japanese and Chinese.

Posted by Oleg Tikhonov <ol...@gmail.com>.
For Chinese we need to create/get two profiles: Chinese Traditional and
Chinese Simplified.

Oleg

On Thu, Feb 2, 2012 at 6:13 AM, James Sullivan (Commented) (JIRA) <
jira@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/TIKA-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198521#comment-13198521]
>
> James Sullivan commented on TIKA-855:
> -------------------------------------
>
> If it is just a missing language profile issue, let me know what is needed,
> as at least for Japanese I am aware of a number of large publicly available
> corpora that might be suitable, and I may be able to help generate the
> profiles. However, it sounds like there might be more to it than just
> generating the profile... I have added this as feature request TIKA-856.
>
> > Language Detection not working for Japanese and Chinese.
> > --------------------------------------------------------
> >
> >                 Key: TIKA-855
> >                 URL: https://issues.apache.org/jira/browse/TIKA-855
> >             Project: Tika
> >          Issue Type: Bug
> >          Components: languageidentifier
> >    Affects Versions: 1.0
> >         Environment: Windows XP, Vista and Linux Ubuntu 11.10 using Sun
> Java 6 and Oracle Java 7
> >            Reporter: James Sullivan
> >            Assignee: Ken Krugler
> >            Priority: Minor
> >              Labels: Chinese, Japanese
> >
> > I have tried Tika 1.0 language detection (java -jar tika.jar -l
> .\Japanese.txt) on several Japanese files (both PDF and text files) and it
> consistently returns lt (Lithuanian???) instead of ja. I also tried on a
> Chinese file which similarly incorrectly returned lt. Both English language
> and French language detection worked correctly.

[jira] [Resolved] (TIKA-855) Language Detection not working for Japanese and Chinese.

Posted by "Ken Krugler (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler resolved TIKA-855.
------------------------------

    Resolution: Not A Problem
      Assignee: Ken Krugler

Tika currently doesn't support Japanese or Chinese for language identification. I'd recommend filing a feature request to add these languages to Tika.

I'm wondering whether Tika (or the CLI) should return no language at all when the match level is too low, or at least indicate that the match is weak. That way, when someone tries Tika on an unsupported language X, it's more obvious that the language simply isn't supported, rather than that its language profile is faulty.
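To illustrate the idea of returning no language for a weak match, here is a minimal, self-contained sketch. It is not Tika's implementation: the class ThresholdedDetector, the trigram-frequency profiles, the L1 distance and the 1.5 cut-off are all assumptions made up for this example. It only shows why a nearest-profile detector always names some language (such as lt) for unsupported input, and how a distance cut-off would avoid that.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a nearest-profile detector with a cut-off; not Tika code. */
final class ThresholdedDetector {
    private final Map<String, Map<String, Double>> profiles = new HashMap<>();
    private final double maxDistance; // assumed cut-off; the value is arbitrary

    ThresholdedDetector(double maxDistance) {
        this.maxDistance = maxDistance;
    }

    /** Relative frequencies of the character trigrams in the text. */
    static Map<String, Double> trigrams(String text) {
        Map<String, Double> freq = new HashMap<>();
        int total = Math.max(0, text.length() - 2);
        for (int i = 0; i < total; i++) {
            freq.merge(text.substring(i, i + 3), 1.0 / total, Double::sum);
        }
        return freq;
    }

    /** Registers a profile built from a (tiny) sample text. */
    void addProfile(String language, String sampleText) {
        profiles.put(language, trigrams(sampleText.toLowerCase()));
    }

    /** L1 distance: 0.0 for identical profiles, about 2.0 when no trigram is shared. */
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            sum += Math.abs(e.getValue() - b.getOrDefault(e.getKey(), 0.0));
        }
        for (Map.Entry<String, Double> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    /** Returns the closest language, or empty if even the closest one is too far away. */
    Optional<String> detect(String text) {
        Map<String, Double> target = trigrams(text.toLowerCase());
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Double>> p : profiles.entrySet()) {
            double d = distance(target, p.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = p.getKey();
            }
        }
        return (best != null && bestDistance <= maxDistance)
                ? Optional.of(best)
                : Optional.empty();
    }

    public static void main(String[] args) {
        ThresholdedDetector detector = new ThresholdedDetector(1.5);
        detector.addProfile("en",
                "the quick brown fox jumps over the lazy dog and then the dog chases the fox");
        detector.addProfile("fr",
                "le renard brun saute par dessus le chien paresseux et puis le chien poursuit le renard");
        System.out.println(detector.detect("the dog and the fox"));
        // A short Japanese sentence, written with Unicode escapes to avoid encoding issues.
        System.out.println(detector.detect("\u3053\u308c\u306f\u65e5\u672c\u8a9e\u3067\u3059"));
    }
}
```

With the cut-off effectively disabled (for example Double.MAX_VALUE), the same detector reports one of the registered languages for the Japanese input, which mirrors the behaviour described in this issue.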
                

[jira] [Commented] (TIKA-855) Language Detection not working for Japanese and Chinese.

Posted by "Christian Moen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211760#comment-13211760 ] 

Christian Moen commented on TIKA-855:
-------------------------------------

Thanks, James.  I've linked the issues.  Perhaps we can track this in TIKA-856.
                

[jira] [Commented] (TIKA-855) Language Detection not working for Japanese and Chinese.

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198415#comment-13198415 ] 

Nick Burch commented on TIKA-855:
---------------------------------

I believe we're currently missing language profiles for those two, which would explain the detection issue. I think we probably need someone with a large corpus of text in the two languages to help with generating them.
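As a rough illustration of what generating a profile from a corpus could involve, the sketch below counts character trigrams in a text and keeps the most frequent ones. This is an assumption-laden example, not Tika's profile builder or its profile file format: the class ProfileBuilder and its methods are hypothetical names, and a real profile would be built from a far larger corpus.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of building a trigram profile from corpus text; not Tika code. */
final class ProfileBuilder {

    /** Counts every character trigram (in UTF-16 code units) in the corpus text. */
    static Map<String, Integer> countTrigrams(String corpus) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= corpus.length(); i++) {
            counts.merge(corpus.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns up to `limit` trigrams, most frequent first; ties are ordered alphabetically. */
    static List<Map.Entry<String, Integer>> topTrigrams(String corpus, int limit) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(countTrigrams(corpus).entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return new ArrayList<>(entries.subList(0, Math.min(limit, entries.size())));
    }

    public static void main(String[] args) {
        // Tiny stand-in for a corpus; the Japanese word is written with Unicode escapes.
        String corpus = "\u65e5\u672c\u8a9e\u65e5\u672c\u8a9e";
        for (Map.Entry<String, Integer> e : topTrigrams(corpus, 3)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Because the counting works on characters rather than whitespace-separated words, the same approach applies to Japanese and Chinese text, which has no spaces between words.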
                

[jira] [Commented] (TIKA-855) Language Detection not working for Japanese and Chinese.

Posted by "James Sullivan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198521#comment-13198521 ] 

James Sullivan commented on TIKA-855:
-------------------------------------

If it is just a missing language profile issue, let me know what is needed, as at least for Japanese I am aware of a number of large publicly available corpora that might be suitable, and I may be able to help generate the profiles. However, it sounds like there might be more to it than just generating the profile... I have added this as feature request TIKA-856.
                