Posted to dev@tika.apache.org by "James Sullivan (Created) (JIRA)" <ji...@apache.org> on 2012/02/02 05:03:53 UTC

[jira] [Created] (TIKA-856) Support CJK (Chinese, Japanese and Korean) language detection

Support CJK (Chinese, Japanese and Korean) language detection
-------------------------------------------------------------

                 Key: TIKA-856
                 URL: https://issues.apache.org/jira/browse/TIKA-856
             Project: Tika
          Issue Type: New Feature
          Components: languageidentifier
    Affects Versions: 1.0
         Environment: All
            Reporter: James Sullivan


Support language detection of CJK (Chinese, Japanese and Korean).
Some estimates have Chinese users overtaking English users on the Internet, so it is important that these languages, used by a large number of people, be supported.

See TIKA-855


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-856) Support CJK (Chinese, Japanese and Korean) language detection

Posted by "Christian Moen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211762#comment-13211762 ] 

Christian Moen commented on TIKA-856:
-------------------------------------

James, could you share more information on how you built your corpus?  I'm thinking I should give this a shot using Wikipedia content, i.e. abstracts.
                
> Support CJK (Chinese, Japanese and Korean) language detection
> -------------------------------------------------------------
>
>                 Key: TIKA-856
>                 URL: https://issues.apache.org/jira/browse/TIKA-856
>             Project: Tika
>          Issue Type: New Feature
>          Components: languageidentifier
>    Affects Versions: 1.0
>         Environment: All
>            Reporter: James Sullivan
>              Labels: Chinese, Japanese
>
> Support language detection of CJK (Chinese, Japanese and Korean).
> Some estimates have Chinese users overtaking English users on the Internet, so it is important that these languages, used by a large number of people, be supported.
> See TIKA-855


[jira] [Updated] (TIKA-856) Support CJK (Chinese, Japanese and Korean) language detection

Posted by "James Sullivan (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Sullivan updated TIKA-856:
--------------------------------

    Attachment: ja.ngp

.ngp profile generated for Japanese language detection. It does not work, and I suspect we will need to review the current implementation of the language detection algorithm.
                

[jira] [Commented] (TIKA-856) Support CJK (Chinese, Japanese and Korean) language detection

Posted by "Jan Høydahl (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201321#comment-13201321 ] 

Jan Høydahl commented on TIKA-856:
----------------------------------

The command to create a profile is:
{code}
java -jar tika-app-1.0.jar --create-profile=ja -eUTF-8 japanese.txt
{code}

The input text should be about the same size as for the existing languages. I've found that if you sum the frequency counts in the 1,000 lines of each existing profile, you arrive at a number around 6.1 million, and I suppose that's where new language profiles should lie too, because of how Tika compares profiles; see TIKA-496.
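
To check where a newly generated profile lands relative to that figure, the counts can be summed with a few lines of Python. This is only a sketch: it assumes each data line of the .ngp file has the form "ngram count" separated by whitespace and that lines starting with "#" are comments, which should be confirmed against an actual profile. The function name and the sample lines are made up for illustration.

```python
def sum_profile_counts(lines):
    """Sum the counts of an n-gram profile given as an iterable of text lines."""
    total = 0
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines (assumed to start with '#').
        if not line or line.startswith("#"):
            continue
        # Assumed layout: "<ngram> <count>"; split from the right so the
        # count is always the last field.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue  # not a data line in the assumed layout
        total += int(parts[1])
    return total


# Hypothetical sample lines, not taken from a real Tika profile.
sample = [
    "# hypothetical profile header",
    "_th 120",
    "the 95",
    "he_ 80",
]
print(sum_profile_counts(sample))  # 295
```

For a real profile, pass an open file object instead of the sample list, for example the result of open("ja.ngp", encoding="utf-8").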
                

[jira] [Commented] (TIKA-856) Support CJK (Chinese, Japanese and Korean) language detection

Posted by "Jan Riewe (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208582#comment-13208582 ] 

Jan Riewe commented on TIKA-856:
--------------------------------

Maybe this is helpful:

http://code.google.com/p/language-detection/wiki/Tools

It is a tool for generating n-gram profiles from Wikipedia entries.
                

[jira] [Commented] (TIKA-856) Support CJK (Chinese, Japanese and Korean) language detection

Posted by "James Sullivan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218751#comment-13218751 ] 

James Sullivan commented on TIKA-856:
-------------------------------------

Actually, calling what I used a corpus may be overly generous. All I did was take the first 40,000 articles from the Japanese Wikipedia and remove the XML and non-Japanese content (of which there was a lot, given the nature of Wikipedia). 40,000 articles was enough to reach the roughly 6.1 million that Jan H. mentioned, but it still was not working. I will attach the generated ja.ngp.
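
For anyone wanting to repeat this kind of cleanup, here is a rough sketch of one way to strip markup and non-Japanese text. It is not the procedure that was actually used above; the function name, the regular expressions and the choice of Unicode ranges (hiragana, katakana and the CJK Unified Ideographs block) are assumptions made for illustration, and real Wikipedia dumps also contain wiki markup that this does not handle.

```python
import re

# Anything that looks like an XML/HTML tag.
TAG = re.compile(r"<[^>]+>")
# Runs of characters outside hiragana/katakana (U+3040-U+30FF) and
# CJK Unified Ideographs (U+4E00-U+9FFF). Note that the ideograph block
# is shared with Chinese, so this does not by itself prove text is Japanese.
NON_JAPANESE = re.compile(r"[^\u3040-\u30FF\u4E00-\u9FFF]+")


def clean_japanese(text):
    """Remove tags, then collapse every non-Japanese run into one space."""
    no_tags = TAG.sub(" ", text)
    return NON_JAPANESE.sub(" ", no_tags).strip()


print(clean_japanese("<p>日本語 text です</p>"))  # 日本語 です
```

Collapsing removed runs into a single space rather than deleting them keeps the fragments on either side from being joined into n-grams that never occurred in the source text.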
                

[jira] [Commented] (TIKA-856) Support CJK (Chinese, Japanese and Korean) language detection

Posted by "Pander Musubi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499480#comment-13499480 ] 

Pander Musubi commented on TIKA-856:
------------------------------------

Please see also https://issues.apache.org/jira/browse/TIKA-369 proposing to use https://code.google.com/p/language-detection/ for improved language detection.
                

[jira] [Commented] (TIKA-856) Support CJK (Chinese, Japanese and Korean) language detection

Posted by "James Sullivan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211664#comment-13211664 ] 

James Sullivan commented on TIKA-856:
-------------------------------------

I gave it a shot this weekend using Jan H.'s instructions with a Japanese corpus I put together, which coincidentally used the same Wikipedia entries but not the tool Jan R. mentions. I could not get good results, even after playing around a little with what was included in the corpus. I could well have screwed up something basic, but I suspect there is more to this than just generating a profile. Initially I thought the results would be perfect given the lack of overlap between the Latin and Japanese character sets, but looking at the only 1,000 lines in the .ngp file and knowing that there are 2,136 characters in Joyo Kanji alone, I suspect some modifications will need to be made to the current implementation for this to work.
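
One rough way to test that suspicion is to measure how much of a corpus survives when only the most frequent n-grams are kept. The sketch below counts plain character n-grams and reports the share of occurrences covered by the top k; it does not reproduce how Tika extracts or pads its n-grams, and the function names are hypothetical. Running it on a Japanese corpus with k set to 1000 would show how much a profile of that size leaves out.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def coverage_of_top_k(text, n, k):
    """Fraction of n-gram occurrences that fall within the k most frequent n-grams."""
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    kept = sum(count for _gram, count in counts.most_common(k))
    return kept / total


text = "aaabbc"
print(len(ngram_counts(text, 1)))      # 3 distinct unigrams: a, b, c
print(coverage_of_top_k(text, 1, 2))   # 5 of the 6 occurrences are kept
```

A low coverage figure for single characters would support the idea that a fixed 1,000-entry profile is too small for a script with thousands of distinct characters.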
                

[jira] [Commented] (TIKA-856) Support CJK (Chinese, Japanese and Korean) language detection

Posted by "Christian Moen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211440#comment-13211440 ] 

Christian Moen commented on TIKA-856:
-------------------------------------

Thanks, Jan R.  The {{language-detection}} library is similar to Tika's language identifier, and the command line mentioned in your link and the one Jan H. mentions above basically do the same thing.

Jan H., I'll see if I can put together some language profiles for CJK for Tika later this week.

                