Posted to java-user@lucene.apache.org by Steven Rowe <sa...@syr.edu> on 2007/05/23 19:06:04 UTC
WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]
Hi Mohammad,
WhitespaceAnalyzer uses Java's Character.isWhitespace(char) method to
determine whether or not a character should be part of a token. As far
as I know, this method is problematic only for characters outside of the
Basic Multilingual Plane (BMP). I think Lucene should switch to using
the int-based character methods, to support characters outside of the BMP.
Arabic characters are all in the BMP[1][2][3][4], so this method should
work properly. The only Arabic or Persian characters I can find outside
of the BMP are in the Old Persian block[5], [ U+103A0 - U+103DF ], but
this is ancient cuneiform - are you really dealing with digital
cuneiform documents?
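A minimal sketch of the BMP point above, using only java.lang.Character and assuming Java 8 or later for String.codePoints(). The class name BmpCheck, the helper names, and the sample letters are illustrative assumptions, not code from Lucene.

```java
class BmpCheck {
    // True if every code point in s lies in the Basic Multilingual Plane.
    static boolean allInBmp(String s) {
        return s.codePoints().allMatch(Character::isBmpCodePoint);
    }

    // True if any code point in s is whitespace per the int-based method.
    static boolean hasWhitespace(String s) {
        return s.codePoints().anyMatch(Character::isWhitespace);
    }

    public static void main(String[] args) {
        // The four extra Persian letters: peh, tcheh, jeh, gaf.
        String persian = "\u067E\u0686\u0698\u06AF";
        // U+103A0 OLD PERSIAN SIGN A, which needs a surrogate pair in UTF-16.
        String oldPersian = new String(Character.toChars(0x103A0));

        System.out.println(allInBmp(persian));      // true
        System.out.println(allInBmp(oldPersian));   // false
        System.out.println(oldPersian.length());    // 2 (two chars, one code point)
        System.out.println(hasWhitespace(persian)); // false
    }
}
```

Because the Old Persian sign occupies two char values, a char-based method sees each surrogate separately, which is where the int-based methods would matter.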
I suspect that you are using WhitespaceAnalyzer as the basis for a more
sophisticated tokenizer - if this is true, you may want to check out the
tokenizer in the AraMorph project[6][7]. (The AraMorph stemmer probably
will not serve your needs, though, since Persian and Arabic have
different lexicons and grammars.)
Hope it helps,
Steve
[1] Arabic: http://www.unicode.org/charts/PDF/U0600.pdf
[2] Arabic Supplement: http://www.unicode.org/charts/PDF/U0750.pdf
[3] Arabic Presentation Forms A: http://www.unicode.org/charts/PDF/UFB50.pdf
[4] Arabic Presentation Forms B: http://www.unicode.org/charts/PDF/UFE70.pdf
[5] Old Persian: http://www.unicode.org/charts/PDF/U103A0.pdf
[6] AraMorph: http://www.nongnu.org/aramorph/
[7] ArabicTokenizer for Lucene:
http://www.nongnu.org/aramorph/javadoc/gpl/pierrick/brihaye/aramorph/lucene/ArabicTokenizer.html
Mohammad Norouzi wrote:
> Hi Steve,
> No, I didn't make any changes to WhitespaceAnalyzer itself. I just extended
> my classes from the original classes and overrode what I needed to change,
> so I don't think I should contribute my classes.
>
> My language is Persian, and the only change I made is to stop ignoring
> Unicode characters in Persian and Arabic, because the original
> WhitespaceAnalyzer didn't work well. Whether it ignores them or does
> something else, I don't know, but I extended my classes and now I am using
> my analyzer to index.
>
> On 5/22/07, Steven Rowe <sa...@syr.edu> wrote:
>>
>> Hi Mohammad,
>>
>> May I ask what your language is? And what kind of changes to
>> WhitespaceAnalyzer were required to make it work with your language?
>>
>> If you have made modifications to WhitespaceAnalyzer that are generally
>> useful, please consider contributing your changes back to the Lucene
>> project. There is some info here on how to get started:
>>
>> http://wiki.apache.org/jakarta-lucene/HowToContribute
>>
>> Thanks,
>> Steve
>>
>> Mohammad Norouzi wrote:
>> > Walter,
>> > Yes, I am using a customized WhitespaceAnalyzer while indexing.
>> > I say customized because I realized that the standard WhitespaceAnalyzer
>> > doesn't accept Unicode terms in my language, so I made some changes to
>> > support that.
>> >
>> > But for reading, no Analyzer is used.
>> >
>> > If I want to get that result, which analyzer should I use?
>> >
>> > In my case, I don't need any boost factor or any other feature of Lucene;
>> > I just need to search through the index.
>> >
>> >
>> > On 5/22/07, Walter Ferrara <wa...@ecomware.it> wrote:
>> >>
>> >> If Reader.terms() gives you:
>> >> text3
>> >> text4
>> >> while you expect
>> >> text3 text4
>> >>
>> >> you should change, I presume, the Analyzer, maybe writing your own one.
>> >>
>> >> Mohammad Norouzi wrote:
>> >> > Hi all
>> >> >
>> >> > consider following index
>> >> >
>> >> > field1 field2 field3
>> >> > text1 text1 text2 text3 text4
>> >> > text4 text2 text2 text3 text5
>> >> >
>> >> > I want to get all terms in field3.
>> >> > If I use Reader.terms() it returns the following (although I have to
>> >> > add an if statement to filter the results to field3 only):
>> >> > text3
>> >> > text4
>> >> > text2
>> >> > text5
>> >> >
>> >> > but I need following result:
>> >> > "text3 text4"
>> >> > "text2 text3 text5"
>> >> >
>> >> >
>> >> > Is this possible? If yes, how? And if not, is there any tricky way to
>> >> > get this result?
>> >> >
>> >> > thank you so much.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]
Posted by Steven Rowe <sa...@syr.edu>.
Hi Mohammad,
Mohammad Norouzi wrote:
> [Hoss wrote:]
>> ...are there Persian characters with a category type of SPACE_SEPARATOR,
>> LINE_SEPARATOR, or PARAGRAPH_SEPARATOR ?
>
> How can I know that?
The Unicode standard's codes[1] for these are:
SPACE SEPARATOR: Zs
LINE SEPARATOR: Zl
PARAGRAPH SEPARATOR: Zp
From <http://www.unicode.org/Public/4.0-Update/PropList-4.0.0.txt>, the
only characters with these properties are:
0020 ; White_Space # Zs SPACE
00A0 ; White_Space # Zs NO-BREAK SPACE
1680 ; White_Space # Zs OGHAM SPACE MARK
180E ; White_Space # Zs MONGOLIAN VOWEL SEPARATOR
2000..200A ; White_Space # Zs EN QUAD..HAIR SPACE
200B ; Other_Default_Ignorable_Code_Point # Zs ZERO WIDTH SPACE
2028 ; White_Space # Zl LINE SEPARATOR
2029 ; White_Space # Zp PARAGRAPH SEPARATOR
202F ; White_Space # Zs NARROW NO-BREAK SPACE
205F ; White_Space # Zs MEDIUM MATHEMATICAL SPACE
3000 ; White_Space # Zs IDEOGRAPHIC SPACE
Modern Persian uses Arabic orthography with four additional letters[2]
-- peh, tcheh, jeh, and gaf -- all of which are included in the Unicode
basic Arabic character set.
The Arabic Unicode character ranges are:
[U+0600 - U+06FF] <http://www.unicode.org/charts/PDF/U0600.pdf>
[U+0750 - U+077F] <http://www.unicode.org/charts/PDF/U0750.pdf>
[U+FB50 - U+FDFF] <http://www.unicode.org/charts/PDF/UFB50.pdf>
[U+FE70 - U+FEFF] <http://www.unicode.org/charts/PDF/UFE70.pdf>
The intersection of the sets { all Arabic characters } and { all Unicode
whitespace characters } is the null set. Thus, it appears, there are no
Arabic-specific (and hence Persian-specific) whitespace characters in
the Unicode standard.
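The empty intersection can also be checked mechanically. The following is a rough sketch, not part of Lucene: the class name ArabicWhitespaceScan and its helpers are made up for illustration, and the Arabic Presentation Forms-A block is taken at its full extent, U+FB50 - U+FDFF. The result reflects whatever Unicode version the running JDK ships with.

```java
class ArabicWhitespaceScan {
    // The four Arabic blocks as inclusive {start, end} code point pairs.
    static final int[][] ARABIC_RANGES = {
        {0x0600, 0x06FF}, {0x0750, 0x077F}, {0xFB50, 0xFDFF}, {0xFE70, 0xFEFF}
    };

    // True if cp has general category Zs, Zl, or Zp.
    static boolean isSeparator(int cp) {
        int type = Character.getType(cp);
        return type == Character.SPACE_SEPARATOR
            || type == Character.LINE_SEPARATOR
            || type == Character.PARAGRAPH_SEPARATOR;
    }

    // Counts code points in the Arabic ranges that are either Unicode
    // separators or whitespace according to Character.isWhitespace(int).
    static int countWhitespace() {
        int count = 0;
        for (int[] range : ARABIC_RANGES) {
            for (int cp = range[0]; cp <= range[1]; cp++) {
                if (isSeparator(cp) || Character.isWhitespace(cp)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWhitespace()); // 0
    }
}
```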
Steve
[1] Unicode 4.0.0 Character Database - Property value codes:
<http://www.unicode.org/Public/4.0-Update/UCD-4.0.0.html#Property_Values>
[2] http://en.wikipedia.org/wiki/Persian_alphabet
--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]
Posted by Mohammad Norouzi <mn...@gmail.com>.
Hi Chris,
>
> * It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0',
> '\u2007', '\u202F').
> * It is '\u0009', HORIZONTAL TABULATION.
> * It is '\u000A', LINE FEED.
> * It is '\u000B', VERTICAL TABULATION.
> * It is '\u000C', FORM FEED.
> * It is '\u000D', CARRIAGE RETURN.
> * It is '\u001C', FILE SEPARATOR.
> * It is '\u001D', GROUP SEPARATOR.
> * It is '\u001E', RECORD SEPARATOR.
> * It is '\u001F', UNIT SEPARATOR.
> ...are there Persian characters with a category type of SPACE_SEPARATOR,
> LINE_SEPARATOR, or PARAGRAPH_SEPARATOR ?
>
How can I know that?
--
Regards,
Mohammad
--------------------------
see my blog: http://brainable.blogspot.com/
Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]
Posted by Chris Hostetter <ho...@fucit.org>.
: return !Character.isWhitespace(c);
: And my class overrides that method like this:
: return !((int)c==32);
in my opinion that's a pretty naive change ... it won't split on tab
characters or newlines ... even for trivial ASCII text that's probably not
what you want.
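To make the difference concrete, here is a small sketch that runs both predicates through a simplified stand-in for a char-based tokenizer. The class TokenCharDemo and its tokenize method are hypothetical helpers written for this illustration, not Lucene's WhitespaceTokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class TokenCharDemo {
    // Splits text into maximal runs of chars accepted by isTokenChar.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar.test(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-token char ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "one\ttwo\nthree four";
        // Original predicate: splits on tab, newline, and space.
        System.out.println(tokenize(text, c -> !Character.isWhitespace(c)).size()); // 4
        // Overridden predicate: splits only on U+0020.
        System.out.println(tokenize(text, c -> c != 32).size()); // 2
    }
}
```

With the overridden predicate, "one", "two", and "three" stay glued together as a single token that contains the tab and the newline.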
: I think Character.isWhitespace treats the Unicode characters as spaces :))
: so everything gets messed up.
every character in java is a unicode character, so your comment doesn't
really make sense to me ... the javadocs are very clear about the
definition of "whitespace" in java...
* It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or
PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0',
'\u2007', '\u202F').
* It is '\u0009', HORIZONTAL TABULATION.
* It is '\u000A', LINE FEED.
* It is '\u000B', VERTICAL TABULATION.
* It is '\u000C', FORM FEED.
* It is '\u000D', CARRIAGE RETURN.
* It is '\u001C', FILE SEPARATOR.
* It is '\u001D', GROUP SEPARATOR.
* It is '\u001E', RECORD SEPARATOR.
* It is '\u001F', UNIT SEPARATOR.
...are there Persian characters with a category type of SPACE_SEPARATOR,
LINE_SEPARATOR, or PARAGRAPH_SEPARATOR ?
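The rule quoted above can be written out by hand and compared with the built-in method on a few sample characters. This is only a sketch: the class WhitespaceDefinition and the method byDefinition are names invented here, and the sample characters are arbitrary choices.

```java
class WhitespaceDefinition {
    // Hand-written version of the javadoc rule for Character.isWhitespace.
    static boolean byDefinition(int cp) {
        int type = Character.getType(cp);
        boolean separator = type == Character.SPACE_SEPARATOR
            || type == Character.LINE_SEPARATOR
            || type == Character.PARAGRAPH_SEPARATOR;
        boolean nonBreaking = cp == 0x00A0 || cp == 0x2007 || cp == 0x202F;
        if (separator && !nonBreaking) {
            return true;
        }
        // U+0009..U+000D (tab through carriage return) and
        // U+001C..U+001F (file, group, record, unit separators).
        return (cp >= 0x0009 && cp <= 0x000D) || (cp >= 0x001C && cp <= 0x001F);
    }

    public static void main(String[] args) {
        // SPACE, TAB, NO-BREAK SPACE, LINE SEPARATOR, ARABIC LETTER ALEF.
        int[] samples = {0x0020, 0x0009, 0x00A0, 0x2028, 0x0627};
        for (int cp : samples) {
            System.out.printf("U+%04X definition=%b builtin=%b%n",
                cp, byDefinition(cp), Character.isWhitespace(cp));
        }
    }
}
```

For these samples, SPACE, TAB, and LINE SEPARATOR count as whitespace, while NO-BREAK SPACE and the Arabic letter do not.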
-Hoss
Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]
Posted by Mohammad Norouzi <mn...@gmail.com>.
Sorry Steven,
that change is in WhitespaceTokenizer, not WhitespaceAnalyzer; in the
Analyzer I only had to call the tokenizer.
On 5/24/07, Mohammad Norouzi <mn...@gmail.com> wrote:
>
> Hi Steven
> Thank you so much for your thorough comments about the Analyzer.
>
> I wrote that class a couple of months ago, and I have now taken another
> look at my customized Analyzer.
>
> The only change I made is as follows.
>
> The original class has this method:
> protected boolean isTokenChar(char c) {
>     return !Character.isWhitespace(c);
> }
>
> And my class overrides that method like this:
>
> protected boolean isTokenChar(char c) {
>     return !((int)c==32);
> }
>
>
> I think Character.isWhitespace treats the Unicode characters as spaces :))
> so everything gets messed up.
>
> What do you think?
>
> --
> Regards,
> Mohammad
> --------------------------
> see my blog: http://brainable.blogspot.com/
--
Regards,
Mohammad
--------------------------
see my blog: http://brainable.blogspot.com/
Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]
Posted by Mohammad Norouzi <mn...@gmail.com>.
Hi Steven
Thank you so much for your thorough comments about the Analyzer.
I wrote that class a couple of months ago, and I have now taken another
look at my customized Analyzer.
The only change I made is as follows.
The original class has this method:
protected boolean isTokenChar(char c) {
    return !Character.isWhitespace(c);
}
And my class overrides that method like this:
protected boolean isTokenChar(char c) {
    return !((int)c==32);
}
I think Character.isWhitespace treats the Unicode characters as spaces :))
so everything gets messed up.
What do you think?
--
Regards,
Mohammad
--------------------------
see my blog: http://brainable.blogspot.com/