Posted to dev@lucene.apache.org by Doug Cutting <cu...@lucene.com> on 2002/08/09 20:06:33 UTC

Re: Choice of indexed Character set

Manish, in the future, please send questions to lucene-dev, not to me 
directly.  Thanks.

Manish Shukla wrote:
> Just wanted to ask you: what logic did we use to choose
> which characters to index when creating the
> StandardTokenizer.jj file?
> 
> We currently use the following ranges to index, and
> break tokens on the rest.
> 
>        "\u0041"-"\u005a",
>        "\u0061"-"\u007a",
>        "\u00c0"-"\u00d6",
>        "\u00d8"-"\u00f6",
>        "\u00f8"-"\u00ff",
>        "\u0100"-"\u1fff",
>        "\u3040"-"\u318f",
>        "\u3300"-"\u337f",
>        "\u3400"-"\u3d2d",
>        "\u4e00"-"\u9fff",
>        "\uf900"-"\ufaff"
> 
> Looking at the list, it seems a little arbitrary in
> some respects: we are indexing Katakana, Hiragana,
> Bopomofo, and Hangul Compatibility Jamo, but we are
> skipping some of the characters in the Latin-1
> Supplement and Latin Extended ranges.
> 
> I am a little confused. I want to index only the 8859
> character set, hence I want to find out the logic. Am I
> missing something?

I don't remember where that came from.  I think it may have been copied 
from the Java 1.0 implementation of Character.isLetter().  It could 
probably stand to be updated.  Please feel free to make a proposal.
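For what it's worth, the gaps in the Latin ranges above are less arbitrary than they look: the two excluded code points, U+00D7 and U+00F7, are the multiplication and division signs, which Character.isLetter() also rejects. A small check (the class name `RangeCheck` and helper are made up here for illustration, not part of Lucene) confirms this:

```java
// Verifies that the holes in StandardTokenizer.jj's Latin ranges line
// up with characters that Character.isLetter() also considers
// non-letters. RangeCheck is an illustrative name, not Lucene code.
public class RangeCheck {

    // True if ch falls inside the Latin ranges quoted above.
    static boolean inTokenizerRanges(char ch) {
        return (ch >= '\u0041' && ch <= '\u005a')
            || (ch >= '\u0061' && ch <= '\u007a')
            || (ch >= '\u00c0' && ch <= '\u00d6')
            || (ch >= '\u00d8' && ch <= '\u00f6')
            || (ch >= '\u00f8' && ch <= '\u00ff');
    }

    public static void main(String[] args) {
        // U+00D7 is the multiplication sign: excluded by the ranges,
        // and not a letter according to the JDK either.
        System.out.println(inTokenizerRanges('\u00d7'));     // false
        System.out.println(Character.isLetter('\u00d7'));    // false
        // U+00E9 (e with acute) is a letter and is inside the ranges.
        System.out.println(inTokenizerRanges('\u00e9'));     // true
    }
}
```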

If you only want 8859, then you're probably best off writing your own 
tokenizer, perhaps modelling it after StandardTokenizer.
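As a starting point for such a tokenizer, one simple approach is to accept only letters at or below U+00FF and break tokens on everything else. This is just a sketch of that idea; the class `Latin1Tokenizer` and its methods are invented names for illustration, not a Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a tokenizer restricted to Latin-1 (ISO 8859-1) letters.
// Illustrative only; a real Lucene tokenizer would implement the
// TokenStream machinery rather than return a List.
public class Latin1Tokenizer {

    // A character counts as indexable if it is in the Latin-1 range
    // and the JDK classifies it as a letter.
    static boolean isLatin1Letter(char ch) {
        return ch <= '\u00ff' && Character.isLetter(ch);
    }

    // Split text into maximal runs of Latin-1 letters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char ch : text.toCharArray()) {
            if (isLatin1Letter(ch)) {
                current.append(ch);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("café au lait"));
    }
}
```

Any character outside ISO 8859-1 (e.g. CJK) simply acts as a token break here, which matches the questioner's goal of indexing only the 8859 set.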

Doug



--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>