You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Scott Smith <ss...@mainstreamdata.com> on 2008/07/18 23:03:37 UTC
Bug in CJKTokenizer
org.apache.lucene.analysis.cjk.CJKTokenizer is in the "contrib" portion of lucene, so I'm not sure if this is the right place to mention this or not. I was doing some detailed analysis of how this tokenizer worked and noticed the following behavior (which I would classify as a bug).
If you pass the word "construccion" to the tokenizer, it returns a single token: "construccion". That seems correct. If you pass the word "construcción" to this tokenizer, it will generate three tokens: "construcci", "ó", and "n". This is happens because the accented "o" is not treated as a Latin-1 character. Splitting the word seems like a bug and violates the "does a decent job for most European languages" statement.
The fix seems straight forward. I replaced the following 2 lines (in the CJKTokenizer class):
if ((ub == Character.UnicodeBlock.BASIC_LATIN)
|| (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))
With
if ((ub == Character.UnicodeBlock.BASIC_LATIN) // chars 0x00-0x7f
|| (ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT) // char 0x80-0xff
|| (ub == Character.UnicodeBlock.LATIN_EXTENDED_A) // char 0x100-0x17f
|| (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))
Am I missing something or does this seem like a reasonable thing to want to do?
RE: Bug in CJKTokenizer
Posted by Scott Smith <ss...@mainstreamdata.com>.
I'm certainly not a language expert and so you may be correct. I do see references to some of the eastern European languages in the descriptions of these two and so maybe they should be added as well.
-----Original Message-----
From: Steven A Rowe [mailto:sarowe@syr.edu]
Sent: Friday, July 18, 2008 3:29 PM
To: java-user@lucene.apache.org
Subject: RE: Bug in CJKTokenizer
Hi Scott,
I think this sounds reasonable, but why not also add LATIN_EXTENDED_B and LATIN_EXTENDED_ADDITIONAL? AFAICT, among other things, these cover some eastern European languages and Vietnamese, respectively.
Steve
On 07/18/2008 at 5:03 PM, Scott Smith wrote:
> org.apache.lucene.analysis.cjk.CJKTokenizer is in the
> "contrib" portion of lucene, so I'm not sure if this is the
> right place to mention this or not. I was doing some
> detailed analysis of how this tokenizer worked and noticed
> the following behavior (which I would classify as a bug).
>
> If you pass the word "construccion" to the tokenizer, it returns a
> single token: "construccion". That seems correct. If you pass the word
> "construcción" to this tokenizer, it will generate three tokens:
> "construcci", "ó", and "n". This is happens because the accented "o" is
> not treated as a Latin-1 character. Splitting the word seems like a bug
> and violates the "does a decent job for most European languages"
> statement.
>
> The fix seems straight forward. I replaced the following 2
> lines (in the CJKTokenizer class):
>
> if ((ub == Character.UnicodeBlock.BASIC_LATIN)
> || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))
>
> With
>
> if ((ub == Character.UnicodeBlock.BASIC_LATIN) // chars 0x00-0x7f
> || (ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT) // char 0x80-0xff
> || (ub == Character.UnicodeBlock.LATIN_EXTENDED_A) // char 0x100-0x17f
> || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))
>
> Am I missing something or does this seem like a reasonable
> thing to want to do?
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
RE: Bug in CJKTokenizer
Posted by Steven A Rowe <sa...@syr.edu>.
Hi Scott,
I think this sounds reasonable, but why not also add LATIN_EXTENDED_B and LATIN_EXTENDED_ADDITIONAL? AFAICT, among other things, these cover some eastern European languages and Vietnamese, respectively.
Steve
On 07/18/2008 at 5:03 PM, Scott Smith wrote:
> org.apache.lucene.analysis.cjk.CJKTokenizer is in the
> "contrib" portion of lucene, so I'm not sure if this is the
> right place to mention this or not. I was doing some
> detailed analysis of how this tokenizer worked and noticed
> the following behavior (which I would classify as a bug).
>
> If you pass the word "construccion" to the tokenizer, it returns a
> single token: "construccion". That seems correct. If you pass the word
> "construcción" to this tokenizer, it will generate three tokens:
> "construcci", "ó", and "n". This is happens because the accented "o" is
> not treated as a Latin-1 character. Splitting the word seems like a bug
> and violates the "does a decent job for most European languages"
> statement.
>
> The fix seems straight forward. I replaced the following 2
> lines (in the CJKTokenizer class):
>
> if ((ub == Character.UnicodeBlock.BASIC_LATIN)
> || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))
>
> With
>
> if ((ub == Character.UnicodeBlock.BASIC_LATIN) // chars 0x00-0x7f
> || (ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT) // char 0x80-0xff
> || (ub == Character.UnicodeBlock.LATIN_EXTENDED_A) // char 0x100-0x17f
> || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))
>
> Am I missing something or does this seem like a reasonable
> thing to want to do?
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org