Posted to dev@tika.apache.org by "Jan Høydahl (Commented JIRA)" <ji...@apache.org> on 2012/02/06 15:51:59 UTC

[jira] [Commented] (TIKA-856) Support CJK (Chinese, Japanese and Korean) language detection

    [ https://issues.apache.org/jira/browse/TIKA-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201321#comment-13201321 ] 

Jan Høydahl commented on TIKA-856:
----------------------------------

The command to create a profile is:
{code}
java -jar tika-app-1.0.jar --create-profile=ja -eUTF-8 japanese.txt
{code}

The input text should be about the same size as that used for the existing languages. I've found that if you sum the frequency numbers in the 1000 lines of the existing profiles, you arrive at a number around 6.1 million, and I suppose that's where new language profiles should lie too because of how Tika compares profiles; see TIKA-496.
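
To check a newly generated profile against that figure, the frequency numbers can be summed with a short script. This is only a sketch: it assumes each profile line has the form "<ngram> <count>" and that lines starting with "#" are comments, which is worth verifying against an existing profile first. The names total_frequency and profile_total are made up for this example and are not part of Tika.

```python
def total_frequency(lines):
    """Sum the counts from an iterable of profile lines.

    Assumed layout (not confirmed in this thread): "<ngram> <count>" per line,
    with blank lines and lines starting with "#" ignored.
    """
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Take the last space-separated token as the count.
        _ngram, _sep, count = line.rpartition(" ")
        total += int(count)
    return total


def profile_total(path):
    """Sum the counts in a profile file (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return total_frequency(handle)
```

Calling profile_total() on the file written by --create-profile and on one of the existing profiles should then give totals of a similar magnitude if the input text was large enough.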
                
> Support CJK (Chinese, Japanese and Korean) language detection
> -------------------------------------------------------------
>
>                 Key: TIKA-856
>                 URL: https://issues.apache.org/jira/browse/TIKA-856
>             Project: Tika
>          Issue Type: New Feature
>          Components: languageidentifier
>    Affects Versions: 1.0
>         Environment: All
>            Reporter: James Sullivan
>              Labels: Chinese, Japanese
>
> Support language detection of CJK (Chinese, Japanese and Korean).
> Some estimates have Chinese users overtaking English users on the Internet, so it is important that these languages, used by large numbers of people, be supported.
> See TIKA-855

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira