Posted to java-user@lucene.apache.org by Steven Rowe <sa...@syr.edu> on 2007/05/23 19:06:04 UTC

WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]

Hi Mohammad,

WhitespaceAnalyzer uses Java's Character.isWhitespace(char) method to
determine whether or not a character should be part of a token.  As far
as I know, this method is problematic only for characters outside of the
Basic Multilingual Plane (BMP).  I think Lucene should switch to using
the int-based (code point) character methods, such as
Character.isWhitespace(int), to support characters outside of the BMP.
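
As a rough illustration of what the int-based methods look like in use,
here is a minimal sketch that splits a string on whitespace one code
point at a time.  The class and method names are made up for this
example and are not part of Lucene; only String.codePointAt,
Character.isWhitespace(int), Character.charCount and
StringBuilder.appendCodePoint are real JDK calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not Lucene code): whitespace splitting by code point.
class CodePointWhitespaceDemo {

    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // Read a whole code point; a surrogate pair is treated as one character.
            int cp = text.codePointAt(i);
            if (Character.isWhitespace(cp)) {      // int overload of isWhitespace
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
            i += Character.charCount(cp);          // advance by 1 or 2 chars
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // U+103A0 lies outside the BMP, so it occupies two chars in a String.
        String supplementary = new String(Character.toChars(0x103A0));
        String text = "\u0633\u0644\u0627\u0645 " + supplementary + "\t\u062F\u0646\u06CC\u0627";
        for (String token : split(text)) {
            System.out.println(token + " (" + token.length() + " chars)");
        }
    }
}
```

The loop advances by Character.charCount(cp) rather than by one, which is
the main difference from a char-based loop.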

Arabic characters are all in the BMP[1][2][3][4], so this method should
work properly.  The only Arabic or Persian characters I can find outside
of the BMP are in the Old Persian block[5], [ U+103A0 - U+103DF ], but
this is ancient cuneiform - are you really dealing with digital
cuneiform documents?
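
One way to check this for yourself is sketched below.  It assumes the
block ranges given in the Unicode charts cited in this message, and the
class and method names are invented for the example.  A range is inside
the BMP when its last code point is below
Character.MIN_SUPPLEMENTARY_CODE_POINT (U+10000).

```java
// Hypothetical helper: checks whether Unicode block ranges fit in the BMP.
class BmpCheckDemo {

    // True when no code point in [first, last] needs a surrogate pair.
    static boolean rangeIsInBmp(int first, int last) {
        return first >= 0
            && first <= last
            && last < Character.MIN_SUPPLEMENTARY_CODE_POINT;
    }

    public static void main(String[] args) {
        // Ranges as listed in the Unicode code charts (assumed here, not computed).
        int[][] ranges = {
            {0x0600, 0x06FF},    // Arabic
            {0x0750, 0x077F},    // Arabic Supplement
            {0xFB50, 0xFDFF},    // Arabic Presentation Forms-A
            {0xFE70, 0xFEFF},    // Arabic Presentation Forms-B
            {0x103A0, 0x103DF},  // Old Persian
        };
        for (int[] r : ranges) {
            System.out.printf("U+%04X..U+%04X inBmp=%b charsPerCodePoint=%d%n",
                    r[0], r[1], rangeIsInBmp(r[0], r[1]), Character.charCount(r[0]));
        }
    }
}
```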

I suspect that you are using WhitespaceAnalyzer as the basis for a more
sophisticated tokenizer - if this is true, you may want to check out the
tokenizer in the AraMorph project[6][7].  (The AraMorph stemmer probably
will not serve your needs, though, since Persian and Arabic have
different lexicons and grammars.)

Hope it helps,
Steve

[1] Arabic: http://www.unicode.org/charts/PDF/U0600.pdf
[2] Arabic Supplement: http://www.unicode.org/charts/PDF/U0750.pdf
[3] Arabic Presentation Forms A: http://www.unicode.org/charts/PDF/UFB50.pdf
[4] Arabic Presentation Forms B: http://www.unicode.org/charts/PDF/UFE70.pdf
[5] Old Persian: http://www.unicode.org/charts/PDF/U103A0.pdf
[6] AraMorph: http://www.nongnu.org/aramorph/
[7] ArabicTokenizer for Lucene:
http://www.nongnu.org/aramorph/javadoc/gpl/pierrick/brihaye/aramorph/lucene/ArabicTokenizer.html


Mohammad Norouzi wrote:
> Hi Steve,
> No, I didn't make any changes to WhitespaceAnalyzer itself. I just extended
> the original classes and overrode what I needed to change, so I don't think
> I should contribute my classes.
>
> My language is Persian, and the only change I've made is to stop ignoring
> Unicode characters in Persian and Arabic, because the original
> WhitespaceAnalyzer didn't work properly for them. Whether it ignores them
> or does something else, I don't know, but I extended the classes and now
> use my own analyzer to index.
> 
> On 5/22/07, Steven Rowe <sa...@syr.edu> wrote:
>>
>> Hi Mohammad,
>>
>> May I ask what your language is?  And what kind of changes to
>> WhitespaceAnalyzer were required to make it work with your language?
>>
>> If you have made modifications to WhitespaceAnalyzer that are generally
>> useful, please consider contributing your changes back to the Lucene
>> project.  There is some info here on how to get started:
>>
>>    http://wiki.apache.org/jakarta-lucene/HowToContribute
>>
>> Thanks,
>> Steve
>>
>> Mohammad Norouzi wrote:
>> > Walter,
>> > Yes, I am using a customized WhitespaceAnalyzer while indexing.
>> > I say customized because I realized that the standard WhitespaceAnalyzer
>> > doesn't accept Unicode terms in my language, so I made some changes to
>> > support that.
>> >
>> > But for reading, no Analyzer is used.
>> >
>> > If I want to get that result, which analyzer should I use?
>> >
>> > In my case, I don't need any boost factor or any other feature of
>> > Lucene; I just need to search through the index.
>> >
>> >
>> > On 5/22/07, Walter Ferrara <wa...@ecomware.it> wrote:
>> >>
>> >> If Reader.terms() gives you:
>> >> text3
>> >> text4
>> >> while you expect
>> >> text3 text4
>> >>
>> >> you should change, I presume, the Analyzer, maybe writing your own.
>> >>
>> >> Mohammad Norouzi wrote:
>> >> > Hi all
>> >> >
>> >> > Consider the following index:
>> >> >
>> >> > field1    field2         field3
>> >> > text1     text1 text2    text3 text4
>> >> > text4     text2          text2 text3 text5
>> >> >
>> >> > I want to get all terms in field3.
>> >> > If I use Reader.terms() it returns the following (although I have to
>> >> > add an if statement to keep only the results for field3):
>> >> > text3
>> >> > text4
>> >> > text2
>> >> > text5
>> >> >
>> >> > but I need the following result:
>> >> > "text3 text4"
>> >> > "text2 text3 text5"
>> >> >
>> >> >
>> >> > Is this possible? If yes, how? And if not, is there any trick to get
>> >> > this result?
>> >> >
>> >> > thank you so much.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]

Posted by Steven Rowe <sa...@syr.edu>.

Hi Mohammad,

Mohammad Norouzi wrote:
> [Hoss wrote:]
>> ...are there Persian characters with a category type of SPACE_SEPARATOR,
>> LINE_SEPARATOR, or PARAGRAPH_SEPARATOR ?
> 
> How can I know that?

The Unicode standard's codes[1] for these are:

   SPACE SEPARATOR: Zs
   LINE SEPARATOR: Zl
   PARAGRAPH SEPARATOR: Zp

From <http://www.unicode.org/Public/4.0-Update/PropList-4.0.0.txt>, the
only characters with these properties are:

   0020       ; White_Space # Zs    SPACE
   00A0       ; White_Space # Zs    NO-BREAK SPACE
   1680       ; White_Space # Zs    OGHAM SPACE MARK
   180E       ; White_Space # Zs    MONGOLIAN VOWEL SEPARATOR
   2000..200A ; White_Space # Zs    EN QUAD..HAIR SPACE
   200B       ; Other_Default_Ignorable_Code_Point # Zs ZERO WIDTH SPACE
   2028       ; White_Space # Zl    LINE SEPARATOR
   2029       ; White_Space # Zp    PARAGRAPH SEPARATOR
   202F       ; White_Space # Zs    NARROW NO-BREAK SPACE
   205F       ; White_Space # Zs    MEDIUM MATHEMATICAL SPACE
   3000       ; White_Space # Zs    IDEOGRAPHIC SPACE
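
To answer "How can I know that?" from inside Java, a sketch along the
following lines may help.  It asks the JDK for the general category of
every BMP code point via Character.getType(int) and keeps those that are
Zs, Zl or Zp.  The class and method names are invented for this example,
and the exact list printed depends on the Unicode version your JDK
implements, so it need not match the 4.0.0 table above line for line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists BMP characters in the separator categories.
class SeparatorCategoryDemo {

    // Returns "Zs", "Zl" or "Zp" for separator characters, otherwise null.
    static String separatorCode(int cp) {
        switch (Character.getType(cp)) {
            case Character.SPACE_SEPARATOR:     return "Zs";
            case Character.LINE_SEPARATOR:      return "Zl";
            case Character.PARAGRAPH_SEPARATOR: return "Zp";
            default:                            return null;
        }
    }

    // Collects every BMP code point whose category is Zs, Zl or Zp.
    static List<Integer> separatorsInBmp() {
        List<Integer> found = new ArrayList<>();
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (separatorCode(cp) != null) {
                found.add(cp);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        for (int cp : separatorsInBmp()) {
            System.out.printf("U+%04X %s%n", cp, separatorCode(cp));
        }
    }
}
```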

Modern Persian uses Arabic orthography with four additional letters[2]
-- peh, tcheh, jeh, and gaf -- all of which are included in the Unicode
basic Arabic character set.

The Arabic Unicode character ranges are:

   [U+0600 - U+06FF] <http://www.unicode.org/charts/PDF/U0600.pdf>
   [U+0750 - U+077F] <http://www.unicode.org/charts/PDF/U0750.pdf>
   [U+FB50 - U+FDFF] <http://www.unicode.org/charts/PDF/UFB50.pdf>
   [U+FE70 - U+FEFF] <http://www.unicode.org/charts/PDF/UFE70.pdf>

The intersection of the sets { all Arabic characters } and { all Unicode
whitespace characters } is the null set.  Thus, it appears, there are no
Arabic-specific (and hence Persian-specific) whitespace characters in
the Unicode standard.
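
The empty intersection can also be checked mechanically.  The sketch
below walks the Arabic block ranges (taken from the Unicode code charts;
an assumption of this example rather than something the code derives)
and counts code points that Java treats as whitespace or as a Unicode
space character.  It also checks that peh, tcheh, jeh and gaf are
classified as letters.  All class and method names are hypothetical.

```java
// Hypothetical helper: looks for whitespace inside the Arabic blocks.
class ArabicWhitespaceScan {

    // Block ranges per the Unicode code charts.
    static final int[][] ARABIC_RANGES = {
        {0x0600, 0x06FF},  // Arabic
        {0x0750, 0x077F},  // Arabic Supplement
        {0xFB50, 0xFDFF},  // Arabic Presentation Forms-A
        {0xFE70, 0xFEFF},  // Arabic Presentation Forms-B
    };

    // Counts code points in the ranges that Java considers whitespace or space.
    static int countSpaces() {
        int count = 0;
        for (int[] r : ARABIC_RANGES) {
            for (int cp = r[0]; cp <= r[1]; cp++) {
                if (Character.isWhitespace(cp) || Character.isSpaceChar(cp)) {
                    count++;
                }
            }
        }
        return count;
    }

    // Peh, tcheh, jeh and gaf: the four letters Persian adds to Arabic.
    static boolean persianExtrasAreLetters() {
        int[] extras = {0x067E, 0x0686, 0x0698, 0x06AF};
        for (int cp : extras) {
            if (!Character.isLetter(cp)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("whitespace characters found: " + countSpaces());
        System.out.println("Persian extras are letters: " + persianExtrasAreLetters());
    }
}
```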

Steve

[1] Unicode 4.0.0 Character Database - Property value codes:
<http://www.unicode.org/Public/4.0-Update/UCD-4.0.0.html#Property_Values>
[2] http://en.wikipedia.org/wiki/Persian_alphabet

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp


Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]

Posted by Mohammad Norouzi <mn...@gmail.com>.

Hi Chris,

>
>     * It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or
>       PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0',
>       '\u2007', '\u202F').
>     * It is '\u0009', HORIZONTAL TABULATION.
>     * It is '\u000A', LINE FEED.
>     * It is '\u000B', VERTICAL TABULATION.
>     * It is '\u000C', FORM FEED.
>     * It is '\u000D', CARRIAGE RETURN.
>     * It is '\u001C', FILE SEPARATOR.
>     * It is '\u001D', GROUP SEPARATOR.
>     * It is '\u001E', RECORD SEPARATOR.
>     * It is '\u001F', UNIT SEPARATOR.
>
> ...are there Persian characters with a category type of SPACE_SEPARATOR,
> LINE_SEPARATOR, or PARAGRAPH_SEPARATOR ?

How can I know that?

-- 
Regards,
Mohammad
--------------------------
see my blog: http://brainable.blogspot.com/

Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]

Posted by Chris Hostetter <ho...@fucit.org>.

:     return !Character.isWhitespace(c);

: And my class overrides that method like this:

:     return !((int)c==32);

in my opinion that's a pretty naive change ... it won't split on tab
characters or newlines ... even for trivial ASCII text that's probably not
what you want.
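
to make that concrete, here is a small self-contained sketch.  the
tokenize() method is a simplified stand-in written for this example (it
is NOT Lucene's tokenizer code, and the class name is made up); it just
groups consecutive characters for which the supplied isTokenChar test
returns true.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical stand-in for a character-based tokenizer; not Lucene code.
class TokenCharDemo {

    // Groups consecutive characters accepted by isTokenChar into tokens.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar.test(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "one two\tthree\nfour";
        // Original rule: anything that is not whitespace belongs to a token.
        List<String> original = tokenize(text, c -> !Character.isWhitespace(c));
        // Overridden rule: only the space character (code 32) ends a token.
        List<String> overridden = tokenize(text, c -> c != 32);
        System.out.println(original.size() + " tokens with isWhitespace");
        System.out.println(overridden.size() + " tokens with c != 32");
    }
}
```

with the original rule the sample text yields four tokens; with the
c != 32 rule it yields two, because the tab and the newline stay inside
the second token.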

: I think Character.isWhitespace treats the Unicode characters as spaces :))
: so everything gets messed up.

every character in java is a unicode character, so your comment doesn't
really make sense to me ... the javadocs are very clear about the
definition of "whitespace" in java...

    * It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or
      PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0',
      '\u2007', '\u202F').
    * It is '\u0009', HORIZONTAL TABULATION.
    * It is '\u000A', LINE FEED.
    * It is '\u000B', VERTICAL TABULATION.
    * It is '\u000C', FORM FEED.
    * It is '\u000D', CARRIAGE RETURN.
    * It is '\u001C', FILE SEPARATOR.
    * It is '\u001D', GROUP SEPARATOR.
    * It is '\u001E', RECORD SEPARATOR.
    * It is '\u001F', UNIT SEPARATOR.

...are there Persian characters with a category type of SPACE_SEPARATOR,
LINE_SEPARATOR, or PARAGRAPH_SEPARATOR ?
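
if you want to see those rules applied to individual characters, a
quick sketch like this prints what Character.isWhitespace(int) and
Character.isSpaceChar(int) report for a few sample code points.  the
class and method names are made up for the example, and the sample list
is just a selection.

```java
// Hypothetical helper: shows how Java classifies a few sample characters.
class WhitespaceDefinitionDemo {

    static String describe(int cp) {
        return String.format("U+%04X isWhitespace=%b isSpaceChar=%b",
                cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        int[] samples = {
            0x0020,  // SPACE
            0x0009,  // HORIZONTAL TABULATION
            0x001F,  // UNIT SEPARATOR
            0x00A0,  // NO-BREAK SPACE
            0x202F,  // NARROW NO-BREAK SPACE
            0x2028,  // LINE SEPARATOR
            0x067E,  // ARABIC LETTER PEH
        };
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

note that the no-break spaces are space characters by category but are
deliberately excluded from isWhitespace, while tab is whitespace without
being a Unicode space character.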


-Hoss



Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]

Posted by Mohammad Norouzi <mn...@gmail.com>.

Sorry Steven,
that change is in WhitespaceTokenizer, not WhitespaceAnalyzer, but in the
Analyzer I had to call the tokenizer.



-- 
Regards,
Mohammad
--------------------------
see my blog: http://brainable.blogspot.com/

Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]

Posted by Mohammad Norouzi <mn...@gmail.com>.

Hi Steven,
Thank you so much for your thorough comments about the Analyzer.

I wrote that class a couple of months ago, and I have now taken another
look at my customized Analyzer.

The only change I made is as follows.

The original class has this method:
protected boolean isTokenChar(char c) {
    return !Character.isWhitespace(c);
}

And my class overrides that method like this:

protected boolean isTokenChar(char c) {
    return !((int)c==32);
}


I think Character.isWhitespace treats the Unicode characters as spaces :))
so everything gets messed up.

What do you think?

-- 
Regards,
Mohammad
--------------------------
see my blog: http://brainable.blogspot.com/