Posted to dev@lucene.apache.org by "John Wang (JIRA)" <ji...@apache.org> on 2005/12/08 01:55:09 UTC
[jira] Created: (LUCENE-478) CJK char list
CJK char list
-------------
Key: LUCENE-478
URL: http://issues.apache.org/jira/browse/LUCENE-478
Project: Lucene - Java
Type: Bug
Components: Analysis
Versions: 1.4
Reporter: John Wang
Priority: Minor
The character list in the CJK section of StandardTokenizer.jj seems to be incomplete. The following is a more complete list:
< CJK: // non-alphabets
    [
     "\u1100"-"\u11ff",
     "\u3040"-"\u30ff",
     "\u3130"-"\u318f",
     "\u31f0"-"\u31ff",
     "\u3300"-"\u337f",
     "\u3400"-"\u4dbf",
     "\u4e00"-"\u9fff",
     "\uac00"-"\ud7a3",
     "\uf900"-"\ufaff",
     "\uff65"-"\uffdc"
    ]
>
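To make the proposed list easier to experiment with outside the grammar, here is a minimal Java sketch that tests whether a code point falls in any of the ranges above. The class name ProposedCjkRanges and the method contains are hypothetical and not part of Lucene; the ranges themselves are copied from the list in this issue.

```java
// Hypothetical helper, not part of Lucene: membership test for the ranges
// proposed in this issue (copied from the < CJK > list above).
final class ProposedCjkRanges {
    // Each row is an inclusive {start, end} pair of BMP code points.
    private static final int[][] RANGES = {
        {0x1100, 0x11FF},
        {0x3040, 0x30FF},
        {0x3130, 0x318F},
        {0x31F0, 0x31FF},
        {0x3300, 0x337F},
        {0x3400, 0x4DBF},
        {0x4E00, 0x9FFF},
        {0xAC00, 0xD7A3},
        {0xF900, 0xFAFF},
        {0xFF65, 0xFFDC},
    };

    /** Returns true if the code point lies in one of the proposed ranges. */
    static boolean contains(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(contains(0x4E2D)); // inside 4E00-9FFF: true
        System.out.println(contains(0x3100)); // between 30FF and 3130: false
        System.out.println(contains('A'));    // Latin letter: false
    }
}
```

A linear scan is enough for ten ranges; a sorted array with binary search would only matter for a much longer list.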
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
[jira] Assigned: (LUCENE-478) CJK char list
Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-478?page=all ]
Otis Gospodnetic reassigned LUCENE-478:
---------------------------------------
Assign To: Otis Gospodnetic
[jira] Commented: (LUCENE-478) CJK char list
Posted by "Daniel Naber (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-478?page=comments#action_12361679 ]
Daniel Naber commented on LUCENE-478:
-------------------------------------
John, I'm not sure I understand: do you think that this issue can be closed now? If not, could you ask your i18n experts how your changes could be integrated into the current code (the one where K/Korean and CJ are separate things)?
[jira] Commented: (LUCENE-478) CJK char list
Posted by "Daniel Naber (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-478?page=comments#action_12361493 ]
Daniel Naber commented on LUCENE-478:
-------------------------------------
This is how the code looks currently:
| < CJ: // Chinese, Japanese
      [
       "\u3040"-"\u318f",
       "\u3300"-"\u337f",
       "\u3400"-"\u3d2d",
       "\u4e00"-"\u9fff",
       "\uf900"-"\ufaff"
      ]
  >
| < KOREAN: // Korean
      [
       "\uac00"-"\ud7af"
      ]
  >
Are your suggested changes still needed? If so, which range should be added to which section (Chinese/Japanese or Korean)?
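For reference, the split between the two token classes shown above can be mimicked in plain Java. This is only a sketch under stated assumptions: the class CurrentTokenClasses, its Kind enum, and classify are hypothetical names, and only the <CJ> and <KOREAN> ranges quoted in this comment are modeled, so everything else (including whatever <LETTER> matches) is reported as OTHER.

```java
// Hypothetical sketch, not Lucene code: classifies a code point using only
// the < CJ > and < KOREAN > ranges quoted in this comment.
final class CurrentTokenClasses {
    enum Kind { CJ, KOREAN, OTHER }

    // Inclusive {start, end} pairs from the < CJ > definition.
    private static final int[][] CJ = {
        {0x3040, 0x318F},
        {0x3300, 0x337F},
        {0x3400, 0x3D2D},
        {0x4E00, 0x9FFF},
        {0xF900, 0xFAFF},
    };

    // Inclusive {start, end} pair from the < KOREAN > definition.
    private static final int[][] KOREAN = {
        {0xAC00, 0xD7AF},
    };

    private static boolean inAny(int[][] ranges, int codePoint) {
        for (int[] range : ranges) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    /** OTHER means "not matched by these two definitions", nothing more. */
    static Kind classify(int codePoint) {
        if (inAny(CJ, codePoint)) return Kind.CJ;
        if (inAny(KOREAN, codePoint)) return Kind.KOREAN;
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify(0x3042)); // Hiragana letter A: CJ
        System.out.println(classify(0xD55C)); // Hangul syllable: KOREAN
        System.out.println(classify(0x31F0)); // not in either list: OTHER
    }
}
```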
[jira] Updated: (LUCENE-478) CJK char list
Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-478?page=all ]
Steven Rowe updated LUCENE-478:
-------------------------------
Attachment: StandardTokenizer.jj.diff
Patch addressing the above-described issues
[jira] Commented: (LUCENE-478) CJK char list
Posted by "John Wang (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-478?page=comments#action_12361497 ]
John Wang commented on LUCENE-478:
----------------------------------
Yes, I am.
Our i18n team has provided a more up-to-date list and I thought I'd contribute it back.
-John
Re: [jira] Resolved: (LUCENE-478) CJK char list
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Steven,
I understood (and still do, actually) 5.b as "added U+3d2e - U+4DB5 and excluded....", so I expected to see the U+3d2e - U+4DB5 range in the patch, but I didn't see it. The closest range was this:
+ "\u3400"-"\u4db5",
I'm about to go on vacation (no TV, no radio, no email, no Internet, no java, just sea, salt, sun), so please have a look at the version in the trunk and if any other ranges are missing, please send a patch. Also feel free to look at those other ranges I left commented out in there. Bob Carpenter should recognize them. :)
Otis
----- Original Message ----
From: Steven Rowe
To: java-dev@lucene.apache.org
Sent: Sunday, August 13, 2006 9:36:06 AM
Subject: Re: [jira] Resolved: (LUCENE-478) CJK char list
Otis Gospodnetic (JIRA) wrote:
> [ http://issues.apache.org/jira/browse/LUCENE-478?page=all ]
>
> Otis Gospodnetic resolved LUCENE-478.
> -------------------------------------
>
> Resolution: Fixed
>
> Thanks, I committed Steven Rowe's patch, although it doesn't seem to
> fully match what he said in comments above (e.g. in his patch, I
> don't see the range he mentioned in 5.b).
Hi Otis,
Here's 5.b.:
5. Character ranges in John's list that are missing in
StandardTokenizer.jj, and that should be added to the newly
re-labeled <CJ> section:
5.b. [ U+3d2e - U+4DB5 ] (non-chars [ U+4DB6 - U+4DBF ] excluded)
CJK Ideograph Extension A.
This range was introduced in Unicode 3.0.
And here's the corresponding change from the patch:
   "\u3300"-"\u337f",
-  "\u3400"-"\u3d2d",
+  "\u3400"-"\u4db5",
   "\u4e00"-"\u9fff",
I don't understand - it looks to me like the above change adds the range
mentioned in 5.b.
Are there other inconsistencies? (You said that 5.b. was an example.)
Steve
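The point of disagreement above comes down to simple range arithmetic: raising the upper bound of "\u3400"-"\u3d2d" to "\u4db5" adds exactly the code points after the old bound. A minimal Java sketch of that calculation follows; the class RangeExtension and the method gained are hypothetical names used only for illustration.

```java
// Hypothetical illustration, not Lucene code: which code points are gained
// when an inclusive range keeps its start but gets a higher end.
final class RangeExtension {
    /**
     * Returns the inclusive {start, end} of the newly covered code points,
     * or null if the new end does not extend the range.
     */
    static int[] gained(int oldEnd, int newEnd) {
        return newEnd > oldEnd ? new int[] {oldEnd + 1, newEnd} : null;
    }

    public static void main(String[] args) {
        // Old range ended at U+3D2D, patched range ends at U+4DB5.
        int[] g = gained(0x3D2D, 0x4DB5);
        System.out.printf("U+%04X - U+%04X%n", g[0], g[1]); // prints "U+3D2E - U+4DB5"
    }
}
```

So the single changed line in the patch covers the range from 5.b without that range ever appearing literally in the diff.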
[jira] Resolved: (LUCENE-478) CJK char list
Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-478?page=all ]
Otis Gospodnetic resolved LUCENE-478.
-------------------------------------
Resolution: Fixed
Thanks, I committed Steven Rowe's patch, although it doesn't seem to fully match what he said in comments above (e.g. in his patch, I don't see the range he mentioned in 5.b).
[jira] Commented: (LUCENE-478) CJK char list
Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-478?page=comments#action_12361804 ]
Steven Rowe commented on LUCENE-478:
------------------------------------
There are six classes of issues:
1. A character range in StandardTokenizer.jj that is missing in
   John's list, and should be left as-is in StandardTokenizer.jj
   (in the <CJ> section):
   1.a. [ U+3100 - U+312F ]
        BoPoMoFo (a.k.a. ZhuYin): Phonetic transcription symbols
        used in Taiwan; not used on mainland China.
2. A character range in StandardTokenizer.jj that is also in
   John's list, but in the <LETTER> section rather than in the <CJ>
   section, and should be left as-is:
   2.a. [ U+1100 - U+11FF ]
        Korean Jamo (phonetic symbols)
3. A character range in StandardTokenizer.jj that is not present in
   John's list, and that should be removed from the <KOREAN> section
   in StandardTokenizer.jj:
   3.a. [ U+D7A4 - U+D7AF ]
        Non-character range at the end of the pre-composed Hangul
        (Korean) block
4. A character range in John's list that is missing in
   StandardTokenizer.jj, but which was not present in Unicode 3.0, and
   so strictly should not be included when running on Java 1.4; since
   this is a non-character range in Unicode 3.0, however, I think it
   should be included in StandardTokenizer.jj (in the <CJ> section)
   for future compatibility with Java 1.5 and Unicode 4.0:
   4.a. [ U+31F0 - U+31FF ]
        Japanese Katakana phonetic extensions; these were introduced
        in Unicode version 3.2 (see
        http://www.unicode.org/reports/tr28/tr28-3.html#10_3_katakana )
5. Character ranges in John's list that are missing in
   StandardTokenizer.jj, and that should be added to the newly
   re-labeled <CJ> section:
   5.a. [ U+FF65 - U+FF9F ]
        Half-width Japanese Katakana (phonetic symbols)
   5.b. [ U+3d2e - U+4DB5 ] (non-chars [ U+4DB6 - U+4DBF ] excluded)
        CJK Ideograph Extension A.
        This range was introduced in Unicode 3.0.
6. A character range in John's list that is missing in
   StandardTokenizer.jj, and that should be added to the <LETTER>
   section, since it, like the [ U+1100 - U+11FF ] range already
   included there, is a range of Korean Jamo (phonetic symbols):
   6.a. [ U+FFA0 - U+FFDC ]
        Half-width Korean Jamo (phonetic symbols)
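The raw differences behind the six classes above can be reproduced mechanically by treating each list as a set of code points and subtracting one from the other. The following Java sketch does that with java.util.BitSet. It is a rough aid under stated assumptions: the class RangeDiff and its members are hypothetical, CURRENT holds only the <CJ> and <KOREAN> ranges quoted earlier in this issue, and <LETTER> is not modeled, so [ U+1100 - U+11FF ] shows up as a difference even though item 2 notes it is already handled there.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch, not Lucene code: set differences between John's
// proposed list and the < CJ > + < KOREAN > ranges quoted in this issue.
final class RangeDiff {
    // Inclusive {start, end} pairs from John's proposed < CJK > list.
    static final int[][] JOHN = {
        {0x1100, 0x11FF}, {0x3040, 0x30FF}, {0x3130, 0x318F}, {0x31F0, 0x31FF},
        {0x3300, 0x337F}, {0x3400, 0x4DBF}, {0x4E00, 0x9FFF}, {0xAC00, 0xD7A3},
        {0xF900, 0xFAFF}, {0xFF65, 0xFFDC},
    };

    // Inclusive {start, end} pairs from the quoted < CJ > and < KOREAN > lists.
    static final int[][] CURRENT = {
        {0x3040, 0x318F}, {0x3300, 0x337F}, {0x3400, 0x3D2D}, {0x4E00, 0x9FFF},
        {0xF900, 0xFAFF}, {0xAC00, 0xD7AF},
    };

    static BitSet toSet(int[][] ranges) {
        BitSet set = new BitSet(0x10000);
        for (int[] range : ranges) {
            set.set(range[0], range[1] + 1); // BitSet.set takes an exclusive end
        }
        return set;
    }

    /** Code points in a but not in b, formatted as maximal inclusive runs. */
    static List<String> minus(BitSet a, BitSet b) {
        BitSet diff = (BitSet) a.clone();
        diff.andNot(b);
        List<String> runs = new ArrayList<>();
        int start = diff.nextSetBit(0);
        while (start >= 0) {
            int end = diff.nextClearBit(start) - 1;
            runs.add(String.format("U+%04X-U+%04X", start, end));
            start = diff.nextSetBit(end + 1);
        }
        return runs;
    }

    public static void main(String[] args) {
        BitSet john = toSet(JOHN);
        BitSet current = toSet(CURRENT);
        // In John's list only: [U+1100-U+11FF, U+31F0-U+31FF, U+3D2E-U+4DBF, U+FF65-U+FFDC]
        System.out.println(minus(john, current));
        // In the current ranges only: [U+3100-U+312F, U+D7A4-U+D7AF]
        System.out.println(minus(current, john));
    }
}
```

The first line of output corresponds to items 2, 4, 5 and 6 (with 5.a and 6.a merged into one run, and the non-characters U+4DB6 - U+4DBF still included), and the second line corresponds to items 1 and 3.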
[jira] Updated: (LUCENE-478) CJK char list
Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/LUCENE-478?page=all ]
Steven Rowe updated LUCENE-478:
-------------------------------
Attachment: StandardTokenizer.jj.diff
Removed stray comma - obsoletes previous patch