Posted to dev@nutch.apache.org by "KuroSaka TeruHiko (JIRA)" <ji...@apache.org> on 2006/10/14 04:48:04 UTC

[jira] Commented: (NUTCH-224) Nutch doesn't handle Korean text at all

    [ http://issues.apache.org/jira/browse/NUTCH-224?page=comments#action_12442140 ] 
            
KuroSaka TeruHiko commented on NUTCH-224:
-----------------------------------------


   [[ Old comment, sent by email on Tue, 13 Jun 2006 18:17:48 -0700 ]]



Thank you for taking care of this bug.
I can't read or write Korean.  I reported this bug because the code
does not look like it can handle Korean characters.
So, I can't really test the code.  Your code inspection would
be as good as mine.  Perhaps you can find some Korean
volunteers on nutch-user ML?

-kuro


> Nutch doesn't handle Korean text at all
> ---------------------------------------
>
>                 Key: NUTCH-224
>                 URL: http://issues.apache.org/jira/browse/NUTCH-224
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.7.1
>            Reporter: KuroSaka TeruHiko
>
> I was browsing NutchAnalysis.jj and found that
> Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means
> a Unicode character of the hex value xxxx) are not
> part of the LETTER or CJK class.  It seems to me that
> Nutch cannot handle Korean documents at all.
> I posted the above message at nutch-user ML and Cheolgoo Kang [appler@gmail.com]
> replied as:
> ------------------------------------------------------------------------------------
> There was a similar issue with Lucene's StandardTokenizer.jj.
> http://issues.apache.org/jira/browse/LUCENE-444
> and
> http://issues.apache.org/jira/browse/LUCENE-461
> I have almost no experience with Nutch, but you can handle it like
> those issues above.
> ------------------------------------------------------------------------------------
> Both fixes should probably be ported back to NutchAnalysis.jj.
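
For anyone who wants to check the range mentioned in the description without reading Korean, here is a minimal Java sketch. It is not Nutch or Lucene code: the class and method names (HangulSyllableCheck, isHangulSyllable, countHangulSyllables) are made up for illustration, and it only tests whether a code point falls in U+AC00 ... U+D7AF. It does not show the grammar change itself.

```java
// Hypothetical helper, not part of Nutch: checks code points against the
// Hangul Syllables range quoted in the issue description.
final class HangulSyllableCheck {

    // Range bounds taken from the issue description (U+AC00 ... U+D7AF).
    static final int HANGUL_SYLLABLES_START = 0xAC00;
    static final int HANGUL_SYLLABLES_END = 0xD7AF;

    // True if the code point lies inside the Hangul Syllables range.
    static boolean isHangulSyllable(int codePoint) {
        return codePoint >= HANGUL_SYLLABLES_START
            && codePoint <= HANGUL_SYLLABLES_END;
    }

    // Counts how many code points of the text fall in that range.
    static long countHangulSyllables(String text) {
        return text.codePoints()
                   .filter(HangulSyllableCheck::isHangulSyllable)
                   .count();
    }

    public static void main(String[] args) {
        // U+D55C and U+AE00 are Hangul syllables; "abc" is plain ASCII.
        String sample = "\uD55C\uAE00 abc";
        System.out.println(countHangulSyllables(sample));

        // Cross-check one code point against the JDK's own block table.
        System.out.println(Character.UnicodeBlock.of(0xD55C)
            == Character.UnicodeBlock.HANGUL_SYLLABLES);
    }
}
```

A check like this could help a non-Korean reader confirm that sample input really contains characters from the range that the LETTER and CJK classes are reported to miss.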

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira