
Posted to dev@lucene.apache.org by "John Wang (JIRA)" <ji...@apache.org> on 2005/12/08 01:55:09 UTC

[jira] Created: (LUCENE-478) CJK char list

CJK char list
-------------

         Key: LUCENE-478
         URL: http://issues.apache.org/jira/browse/LUCENE-478
     Project: Lucene - Java
        Type: Bug
  Components: Analysis  
    Versions: 1.4    
    Reporter: John Wang
    Priority: Minor


The character list in the CJK section of StandardTokenizer.jj seems to be incomplete. The following is a more complete list:

< CJK:                                          // non-alphabets
      [
       "\u1100"-"\u11ff",
       "\u3040"-"\u30ff",
       "\u3130"-"\u318f",
       "\u31f0"-"\u31ff",
       "\u3300"-"\u337f",
       "\u3400"-"\u4dbf",
       "\u4e00"-"\u9fff",
       "\uac00"-"\ud7a3",
       "\uf900"-"\ufaff",
       "\uff65"-"\uffdc"
      ]
  >
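
For reference, membership in the proposed list can be checked outside the grammar with a small Java sketch. The class and method names (ProposedCjkRanges, contains) are hypothetical and not part of Lucene; the ranges are copied from the list above.

```java
// Hypothetical helper, not part of Lucene: tests whether a code point
// falls inside one of the ranges proposed in this issue.
class ProposedCjkRanges {
    // Inclusive {start, end} pairs copied from the proposed <CJK> list.
    static final int[][] RANGES = {
        {0x1100, 0x11FF},
        {0x3040, 0x30FF},
        {0x3130, 0x318F},
        {0x31F0, 0x31FF},
        {0x3300, 0x337F},
        {0x3400, 0x4DBF},
        {0x4E00, 0x9FFF},
        {0xAC00, 0xD7A3},
        {0xF900, 0xFAFF},
        {0xFF65, 0xFFDC},
    };

    // Linear scan is enough for ten ranges.
    static boolean contains(int codePoint) {
        for (int[] r : RANGES) {
            if (codePoint >= r[0] && codePoint <= r[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(contains(0x4E2D)); // a CJK ideograph: true
        System.out.println(contains(0x0041)); // Latin capital A: false
    }
}
```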



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Assigned: (LUCENE-478) CJK char list

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/LUCENE-478?page=all ]

Otis Gospodnetic reassigned LUCENE-478:
---------------------------------------

    Assign To: Otis Gospodnetic



[jira] Commented: (LUCENE-478) CJK char list

Posted by "Daniel Naber (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/LUCENE-478?page=comments#action_12361679 ] 

Daniel Naber commented on LUCENE-478:
-------------------------------------

John, I'm not sure I understand: do you think that this issue can be closed now? If not, could you ask your i18n experts how your changes could be integrated into the current code (the one where K/Korean and CJ are separate things)?





[jira] Commented: (LUCENE-478) CJK char list

Posted by "Daniel Naber (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/LUCENE-478?page=comments#action_12361493 ] 

Daniel Naber commented on LUCENE-478:
-------------------------------------

This is how the code looks currently:

| < CJ:                                          // Chinese, Japanese
      [
       "\u3040"-"\u318f",
       "\u3300"-"\u337f",
       "\u3400"-"\u3d2d",
       "\u4e00"-"\u9fff",
       "\uf900"-"\ufaff"
      ]
  >
| < KOREAN:                                          // Korean
      [
       "\uac00"-"\ud7af"
      ]
  >

Are your suggested changes still needed and if so, where should which range be added (Chinese/Japanese or Korean)?
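
One way to approach this question mechanically is to subtract the current <CJ> and <KOREAN> ranges from the proposed list and see what remains. The following is a minimal sketch, assuming only the ranges quoted in this thread; RangeDiff and uncovered are hypothetical names, not Lucene code, and the comparison ignores other token classes such as <LETTER>.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: lists the parts of one set of
// ranges that another set of ranges does not cover.
class RangeDiff {
    // Current <CJ> and <KOREAN> ranges, merged and sorted by start.
    static final int[][] CURRENT = {
        {0x3040, 0x318F}, {0x3300, 0x337F}, {0x3400, 0x3D2D},
        {0x4E00, 0x9FFF}, {0xAC00, 0xD7AF}, {0xF900, 0xFAFF},
    };
    // Ranges from the proposed <CJK> list, already sorted by start.
    static final int[][] PROPOSED = {
        {0x1100, 0x11FF}, {0x3040, 0x30FF}, {0x3130, 0x318F},
        {0x31F0, 0x31FF}, {0x3300, 0x337F}, {0x3400, 0x4DBF},
        {0x4E00, 0x9FFF}, {0xAC00, 0xD7A3}, {0xF900, 0xFAFF},
        {0xFF65, 0xFFDC},
    };

    // Returns the sub-ranges of `wanted` that `have` does not cover.
    // Assumes `have` is sorted by start and non-overlapping.
    static List<String> uncovered(int[][] wanted, int[][] have) {
        List<String> gaps = new ArrayList<>();
        for (int[] w : wanted) {
            int next = w[0]; // first code point of w not yet shown to be covered
            for (int[] h : have) {
                if (h[1] < next) continue; // h ends before the cursor
                if (h[0] > w[1]) break;    // h starts after w ends
                if (h[0] > next) gaps.add(format(next, h[0] - 1));
                next = h[1] + 1;
                if (next > w[1]) break;    // w is fully consumed
            }
            if (next <= w[1]) gaps.add(format(next, w[1]));
        }
        return gaps;
    }

    static String format(int start, int end) {
        return String.format("U+%04X-U+%04X", start, end);
    }

    public static void main(String[] args) {
        System.out.println("Proposed but not current: " + uncovered(PROPOSED, CURRENT));
        System.out.println("Current but not proposed: " + uncovered(CURRENT, PROPOSED));
    }
}
```

Running the comparison in both directions separates the ranges that would have to be added from those that exist today but are absent from the proposed list.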




[jira] Updated: (LUCENE-478) CJK char list

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/LUCENE-478?page=all ]

Steven Rowe updated LUCENE-478:
-------------------------------

    Attachment: StandardTokenizer.jj.diff

Patch addressing the above-described issues



[jira] Commented: (LUCENE-478) CJK char list

Posted by "John Wang (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/LUCENE-478?page=comments#action_12361497 ] 

John Wang commented on LUCENE-478:
----------------------------------

Yes, I am.

Our i18n team has provided a more up-to-date list and I thought I'd contribute it back.

-John



Re: [jira] Resolved: (LUCENE-478) CJK char list

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi Steven,

I understood (and still do, actually) 5.b as "added U+3d2e - U+4DB5 and excluded....", so I expected to see the U+3d2e - U+4DB5 range in the patch, but I didn't see it.  The closest range was this:

+       "\u3400"-"\u4db5",

I'm about to go on vacation (no TV, no radio, no email, no Internet, no java, just sea, salt, sun), so please have a look at the version in the trunk and if any other ranges are missing, please send a patch.  Also feel free to look at those other ranges I left commented out in there.  Bob Carpenter should recognize them. :)

Otis

----- Original Message ---- 
From: Steven Rowe  
To: java-dev@lucene.apache.org 
Sent: Sunday, August 13, 2006 9:36:06 AM 
Subject: Re: [jira] Resolved: (LUCENE-478) CJK char list 

Otis Gospodnetic (JIRA) wrote: 
>      [ http://issues.apache.org/jira/browse/LUCENE-478?page=all ] 
>  
> Otis Gospodnetic resolved LUCENE-478. 
> ------------------------------------- 
>  
>     Resolution: Fixed 
>  
> Thanks, I committed Steven Rowe's patch, although it doesn't seem to 
> fully match what he said in comments above (e.g. in his patch, I 
> don't see the range he mentioned in 5.b). 

Hi Otis, 

Here's 5.b.: 

5. Character ranges in John's list that are missing in 
StandardTokenizer.jj, and that should be added to the newly 
re-labeled <CJ> section: 

   5.b. [ U+3d2e - U+4DB5 ] (non-chars [ U+4DB6 - U+4DBF ] excluded) 
        CJK Ideograph Extension A. 
        This range was introduced in Unicode 3.0. 

And here's the corresponding change from the patch: 

        "\u3300"-"\u337f", 
-       "\u3400"-"\u3d2d", 
+       "\u3400"-"\u4db5", 
        "\u4e00"-"\u9fff", 

I don't understand - it looks to me like the above change adds the range 
mentioned in 5.b. 

Are there other inconsistencies?  (You said that 5.b. was an example.) 

Steve 


Re: [jira] Resolved: (LUCENE-478) CJK char list

Posted by Steven Rowe <sa...@syr.edu>.

Otis Gospodnetic (JIRA) wrote:
>      [ http://issues.apache.org/jira/browse/LUCENE-478?page=all ]
> 
> Otis Gospodnetic resolved LUCENE-478.
> -------------------------------------
> 
>     Resolution: Fixed
> 
> Thanks, I committed Steven Rowe's patch, although it doesn't seem to
> fully match what he said in comments above (e.g. in his patch, I
> don't see the range he mentioned in 5.b).

Hi Otis,

Here's 5.b.:

5. Character ranges in John's list that are missing in
StandardTokenizer.jj, and that should be added to the newly
re-labeled <CJ> section:

   5.b. [ U+3d2e - U+4DB5 ] (non-chars [ U+4DB6 - U+4DBF ] excluded)
        CJK Ideograph Extension A.
        This range was introduced in Unicode 3.0.

And here's the corresponding change from the patch:

        "\u3300"-"\u337f",
-       "\u3400"-"\u3d2d",
+       "\u3400"-"\u4db5",
        "\u4e00"-"\u9fff",

I don't understand - it looks to me like the above change adds the range
mentioned in 5.b.
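
The equivalence can be checked with a few lines of Java: the old range [ U+3400 - U+3D2D ] and the 5.b range [ U+3D2E - U+4DB5 ] are adjacent, so their union is the single range that appears in the patch. AdjacentRanges and union are hypothetical names used only for this sketch.

```java
// Hypothetical sketch, not Lucene code: merges two inclusive ranges when
// they overlap or touch, to show that the patched range covers 5.b.
class AdjacentRanges {
    // Returns the union as {start, end}, or null if the ranges leave a gap.
    static int[] union(int[] a, int[] b) {
        int[] lo = a[0] <= b[0] ? a : b;
        int[] hi = (lo == a) ? b : a;
        if (hi[0] > lo[1] + 1) {
            return null; // at least one code point lies between the ranges
        }
        return new int[] {lo[0], Math.max(lo[1], hi[1])};
    }

    public static void main(String[] args) {
        int[] before = {0x3400, 0x3D2D}; // range before the patch
        int[] added = {0x3D2E, 0x4DB5};  // range described in 5.b
        int[] merged = union(before, added);
        System.out.printf("U+%04X-U+%04X%n", merged[0], merged[1]); // U+3400-U+4DB5
    }
}
```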

Are there other inconsistencies?  (You said that 5.b. was an example.)

Steve


[jira] Resolved: (LUCENE-478) CJK char list

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/LUCENE-478?page=all ]

Otis Gospodnetic resolved LUCENE-478.
-------------------------------------

    Resolution: Fixed

Thanks, I committed Steven Rowe's patch, although it doesn't seem to fully match what he said in comments above (e.g. in his patch, I don't see the range he mentioned in 5.b).



[jira] Commented: (LUCENE-478) CJK char list

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/LUCENE-478?page=comments#action_12361804 ] 

Steven Rowe commented on LUCENE-478:
------------------------------------

There are six classes of issues:

1. A character range in StandardTokenizer.jj that is missing in
John's list, and should be left as-is in StandardTokenizer.jj
(in the <CJ> section):

   1.a. [ U+3100 - U+312F ]
        BoPoMoFo (a.k.a. ZhuYin): Phonetic transcription symbols
        used in Taiwan; not used on mainland China.

2. A character range in StandardTokenizer.jj that is also in
John's list, but in the <LETTER> section rather than in the <CJ>
section, and should be left as-is:

   2.a. [ U+1100 - U+11FF ]
        Korean Jamo (phonetic symbols)

3. A character range in StandardTokenizer.jj that is not present in
John's list, and that should be removed from the <KOREAN> section
in StandardTokenizer.jj:

   3.a. [ U+D7A4 - U+D7AF ]
        Non-character range at the end of the pre-composed Hangul 
        (Korean) block

4. A character range in John's list that is missing in
StandardTokenizer.jj. It was not present in Unicode 3.0, so strictly
it should not be included when running on Java 1.4. However, since
this is a non-character range in Unicode 3.0, I think it should be
included in StandardTokenizer.jj (in the <CJ> section) for future
compatibility with Java 1.5 and Unicode 4.0:

   4.a. [ U+31F0 - U+31FF ]
        Japanese Katakana phonetic extensions; these were introduced
        in Unicode version 3.2 (see
        http://www.unicode.org/reports/tr28/tr28-3.html#10_3_katakana )

5. Character ranges in John's list that are missing in
StandardTokenizer.jj, and that should be added to the newly
re-labeled <CJ> section:

   5.a. [ U+FF65 - U+FF9F ]
        Half-width Japanese Katakana (phonetic symbols)
   
   5.b. [ U+3d2e - U+4DB5 ] (non-chars [ U+4DB6 - U+4DBF ] excluded)
        CJK Ideograph Extension A.  
        This range was introduced in Unicode 3.0.

6. A character range in John's list that is missing in
StandardTokenizer.jj, and that should be added to the <LETTER>
section, since it, like the [ U+1100 - U+11FF ] range already
included there, is a range of Korean Jamo (phonetic symbols):

   6.a. [ U+FFA0 - U+FFDC ]
        Half-width Korean Jamo (phonetic symbols)
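
The block labels used in the six points above can be cross-checked against the JDK's own Unicode tables. The sketch below is illustrative only: BlockCheck and blockOf are hypothetical names, while Character.UnicodeBlock.of(int) is a standard Java API (available since Java 1.5, so its answers reflect the Unicode version of the running JDK rather than Unicode 3.0).

```java
// Hypothetical sketch, not Lucene code: looks up the Unicode block that the
// running JDK assigns to the first code point of each range discussed above.
class BlockCheck {
    static String blockOf(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == null ? "(no block)" : block.toString();
    }

    public static void main(String[] args) {
        // Start of each range from points 1.a, 2.a, 3.a, 4.a, 5.a, 5.b and 6.a.
        int[] starts = {0x3100, 0x1100, 0xD7A4, 0x31F0, 0xFF65, 0x3D2E, 0xFFA0};
        for (int cp : starts) {
            System.out.printf("U+%04X %s%n", cp, blockOf(cp));
        }
    }
}
```

A block lookup only says which block a code point belongs to, not whether the code point is assigned, so it confirms the labels but not the non-character claims.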




[jira] Updated: (LUCENE-478) CJK char list

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/LUCENE-478?page=all ]

Steven Rowe updated LUCENE-478:
-------------------------------

    Attachment: StandardTokenizer.jj.diff

Removed stray comma - obsoletes previous patch

