Posted to dev@lucene.apache.org by "Cheolgoo Kang (JIRA)" <ji...@apache.org> on 2005/11/08 07:57:19 UTC

[jira] Created: (LUCENE-461) StandardTokenizer splitting all of Korean words into separate characters

StandardTokenizer splitting all of Korean words into separate characters
------------------------------------------------------------------------

         Key: LUCENE-461
         URL: http://issues.apache.org/jira/browse/LUCENE-461
     Project: Lucene - Java
        Type: Bug
  Components: Analysis  
 Environment: Analyzing Korean text with Apache Lucene, esp. with StandardAnalyzer.
    Reporter: Cheolgoo Kang
    Priority: Minor


StandardTokenizer splits every Korean word into separate single-character tokens. For example, "안녕하세요" is one Korean word that means "Hello", but StandardAnalyzer separates it into five tokens: "안", "녕", "하", "세", "요".
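
To make the difference concrete, here is a minimal standalone sketch that does not use Lucene at all. It assumes that "Korean" means the Unicode Hangul Syllables block (U+AC00 to U+D7A3); the class and method names are hypothetical and this is not the attached patch. The test word is written with Unicode escapes and is the five-syllable word from the example above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: contrasts per-character tokens with whole-word tokens.
final class HangulGrouping {
    // Assumption: only the Hangul Syllables block counts as Korean here.
    static boolean isHangulSyllable(char c) {
        return c >= '\uAC00' && c <= '\uD7A3';
    }

    // Reported behavior: every Hangul syllable becomes its own token.
    static List<String> perCharacter(String text) {
        List<String> tokens = new ArrayList<>();
        for (char c : text.toCharArray()) {
            if (isHangulSyllable(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        return tokens;
    }

    // Desired behavior: a run of consecutive syllables is one token.
    static List<String> grouped(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isHangulSyllable(c)) {
                run.append(c);
            } else if (run.length() > 0) {
                // Any other character ends the current run.
                tokens.add(run.toString());
                run.setLength(0);
            }
        }
        if (run.length() > 0) {
            tokens.add(run.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String hello = "\uC548\uB155\uD558\uC138\uC694"; // the word from the report
        System.out.println(perCharacter(hello).size()); // 5
        System.out.println(grouped(hello).size());      // 1
    }
}
```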

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Resolved: (LUCENE-461) StandardTokenizer splitting all of Korean words into separate characters

Posted by "Erik Hatcher (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/LUCENE-461?page=all ]
     
Erik Hatcher resolved LUCENE-461:
---------------------------------

    Fix Version: 1.9
     Resolution: Fixed

These patches have been applied, thanks! 

One thing to note: the token type emitted has changed from "<CJK>" to "<CJ>". Some code may rely on the old type, but the token type is brittle anyway because it comes straight from the JavaCC grammar definition. Since it is unlikely that anyone uses that token type, I consider this an acceptable break in full backwards compatibility.
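
For anyone who does compare against the type string, a small guard that accepts both spellings would survive the change. This is only a sketch: the class and method names are hypothetical, and in real code the string would presumably come from the token's type accessor rather than being passed in by hand.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustration only: tolerate the old and the new type string.
final class TokenTypeCheck {
    // The two type strings quoted in this message.
    private static final Set<String> CJ_TYPES =
            new HashSet<>(Arrays.asList("<CJK>", "<CJ>"));

    // True if the given type string is either the old or the new one.
    static boolean isCjType(String tokenType) {
        return CJ_TYPES.contains(tokenType);
    }

    public static void main(String[] args) {
        System.out.println(isCjType("<CJK>")); // true
        System.out.println(isCjType("<CJ>"));  // true
    }
}
```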

> StandardTokenizer splitting all of Korean words into separate characters
> ------------------------------------------------------------------------
>
>          Key: LUCENE-461
>          URL: http://issues.apache.org/jira/browse/LUCENE-461
>      Project: Lucene - Java
>         Type: Bug
>   Components: Analysis
>  Environment: Analyzing Korean text with Apache Lucene, esp. with StandardAnalyzer.
>     Reporter: Cheolgoo Kang
>     Priority: Minor
>      Fix For: 1.9
>  Attachments: StandardTokenizer_KoreanWord.patch, TestStandardAnalyzer_KoreanWord.patch
>
> StandardTokenizer splits every Korean word into separate single-character tokens. For example, "안녕하세요" is one Korean word that means "Hello", but StandardAnalyzer separates it into five tokens: "안", "녕", "하", "세", "요".

[jira] Closed: (LUCENE-461) StandardTokenizer splitting all of Korean words into separate characters

Posted by "Erik Hatcher (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/LUCENE-461?page=all ]
     
Erik Hatcher closed LUCENE-461:
-------------------------------


> StandardTokenizer splitting all of Korean words into separate characters
> ------------------------------------------------------------------------
>
>          Key: LUCENE-461
>          URL: http://issues.apache.org/jira/browse/LUCENE-461
>      Project: Lucene - Java
>         Type: Bug
>   Components: Analysis
>  Environment: Analyzing Korean text with Apache Lucene, esp. with StandardAnalyzer.
>     Reporter: Cheolgoo Kang
>     Priority: Minor
>      Fix For: 1.9
>  Attachments: StandardTokenizer_KoreanWord.patch, TestStandardAnalyzer_KoreanWord.patch
>
> StandardTokenizer splits every Korean word into separate single-character tokens. For example, "안녕하세요" is one Korean word that means "Hello", but StandardAnalyzer separates it into five tokens: "안", "녕", "하", "세", "요".

[jira] Updated: (LUCENE-461) StandardTokenizer splitting all of Korean words into separate characters

Posted by "Cheolgoo Kang (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/LUCENE-461?page=all ]

Cheolgoo Kang updated LUCENE-461:
---------------------------------

    Attachment: StandardTokenizer_KoreanWord.patch
                TestStandardAnalyzer_KoreanWord.patch

Here are patches that keep a Korean word as a single token instead of splitting it into individual characters. The attached TestStandardAnalyzer test case passes with the patched StandardTokenizer.

> StandardTokenizer splitting all of Korean words into separate characters
> ------------------------------------------------------------------------
>
>          Key: LUCENE-461
>          URL: http://issues.apache.org/jira/browse/LUCENE-461
>      Project: Lucene - Java
>         Type: Bug
>   Components: Analysis
>  Environment: Analyzing Korean text with Apache Lucene, esp. with StandardAnalyzer.
>     Reporter: Cheolgoo Kang
>     Priority: Minor
>  Attachments: StandardTokenizer_KoreanWord.patch, TestStandardAnalyzer_KoreanWord.patch
>
> StandardTokenizer splits every Korean word into separate single-character tokens. For example, "안녕하세요" is one Korean word that means "Hello", but StandardAnalyzer separates it into five tokens: "안", "녕", "하", "세", "요".
