Posted to java-user@lucene.apache.org by OBender <os...@hotmail.com> on 2009/07/20 16:40:29 UTC

question on custom filter

Hi All!

Let's say I have a filter that produces new tokens based on the original ones.

How bad will it be if my filter sets the start of each token to 0 and the end to
the length of the token?

An example (based on the phrase "How are you?"):

Original token:

[you?] (8,12)

New tokens:

[you] (0,3)

[?] (0,1)

It wouldn't be so hard to calculate the right numbers for left-to-right
languages, and it is a bit more challenging to do it for right-to-left ones,
but for mixed text it is quite hard.

Thanks.
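
For reference, here is a minimal sketch of the alternative to resetting offsets to 0: adding the parent token's start offset to each piece's position inside the parent. It assumes offsets are meant to point back into the original text (as they are when used for highlighting). It is plain Java with no Lucene dependency; SubTokenOffsets, Piece and split are hypothetical names made up for this illustration, and the splitting rule (word characters versus everything else) is only an assumption about what the filter does.

```java
import java.util.ArrayList;
import java.util.List;

public class SubTokenOffsets {

    /** Hypothetical value class for this sketch; not a Lucene type. */
    public static final class Piece {
        public final String text;
        public final int start;
        public final int end;

        Piece(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + text + "] (" + start + "," + end + ")";
        }
    }

    /** Letters, digits and combining marks (e.g. Hebrew points) count as word characters. */
    private static boolean isWordChar(char c) {
        int type = Character.getType(c);
        return Character.isLetterOrDigit(c)
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /**
     * Splits a term wherever it switches between word and non-word characters.
     * Each piece's offsets are its position inside the term plus parentStart,
     * so they still refer to the original text.
     */
    public static List<Piece> split(String term, int parentStart) {
        List<Piece> pieces = new ArrayList<>();
        int runStart = 0;
        for (int i = 1; i <= term.length(); i++) {
            boolean boundary = i == term.length()
                    || isWordChar(term.charAt(i)) != isWordChar(term.charAt(i - 1));
            if (boundary) {
                pieces.add(new Piece(term.substring(runStart, i),
                        parentStart + runStart, parentStart + i));
                runStart = i;
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        // The token from the example above: [you?] (8,12)
        for (Piece p : split("you?", 8)) {
            System.out.println(p);
        }
    }
}
```

For the token above this gives [you] (8,11) and [?] (11,12). The arithmetic only uses positions within the stored string, so it does not look at text direction at all.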

RE: question on custom filter

Posted by OBender <os...@hotmail.com>.

Well, the only thing I can say is that the order of tokens I've presented is what I see in the debugger.
It is what input.next(reusableToken) gives me, in that exact order and with those exact indexes.

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Monday, July 20, 2009 2:07 PM
To: java-user@lucene.apache.org
Subject: Re: question on custom filter

Obender, this is not true.
The text you pasted is the following in Unicode:

\N{HEBREW LETTER TET}
\N{HEBREW LETTER VAV}
\N{HEBREW POINT HOLAM}
\N{HEBREW LETTER BET}
\N{SPACE}
\N{HEBREW LETTER AYIN}
\N{HEBREW POINT SEGOL}
\N{HEBREW LETTER RESH}
\N{HEBREW POINT SEGOL}
\N{HEBREW LETTER BET}

You can use this utility to see how your text is encoded:
http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91

For more information on directionality in Unicode, see
http://unicode.org/reports/tr9/
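
The same listing can be produced locally without relying on how anything is displayed. A minimal sketch, assuming Java 8 or later (for String.codePoints(); Character.getName has been available since Java 7). CodePointDump and names are hypothetical names for this illustration, and the phrase is written with \u escapes so that the source file's own encoding and rendering cannot get in the way.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDump {

    /** Returns the Unicode character name of every code point, in stored (logical) order. */
    public static List<String> names(String text) {
        List<String> names = new ArrayList<>();
        text.codePoints().forEach(cp -> names.add(Character.getName(cp)));
        return names;
    }

    public static void main(String[] args) {
        // The pasted phrase, code point by code point: TET VAV HOLAM BET, space, AYIN SEGOL RESH SEGOL BET
        String phrase = "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";
        List<String> names = names(phrase);
        for (int i = 0; i < names.size(); i++) {
            System.out.println(i + ": " + names.get(i));
        }
    }
}
```

Index 0 is whatever comes first in the stored string, regardless of which side of the screen it is drawn on.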

On Mon, Jul 20, 2009 at 1:59 PM, OBender<os...@hotmail.com> wrote:
> Robert,
>
> I'm not sure you are correct on this one.
>
> If I have a Hebrew phrase:
> [טוֹב עֶרֶב]
> Then the first token that the filter receives is:
> [עֶרֶב] (0,5)
> and the second is:
> [טוֹב] (6,10)
> Which means that it counts from right to left (words and indexes).
>
> Am I missing something?
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Monday, July 20, 2009 1:43 PM
> To: java-user@lucene.apache.org
> Subject: Re: question on custom filter
>
> Obender, I don't think it's as difficult as you think. Your filter does
> not need to be aware of this issue at all.
>
> In Unicode, right-to-left languages are encoded in the data in logical order.
> The rendering system is what converts it for display in right-to-left
> order for RTL languages.
>
> For example in Arabic, "Robert 1234" displays as روبرت 1234
> To your computer monitor, this looks like 1, 2, 3, 4, space, teh, reh,
> beh, waw, reh
>
> But the unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
>
> --
> Robert Muir
> rcmuir@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



-- 
Robert Muir
rcmuir@gmail.com


RE: question on custom filter

Posted by OBender <os...@hotmail.com>.

Ok, it makes a lot of sense (the input being incorrect).
Let's just verify that :)

At the end of the line:
"but the text you sent as an example was" what I see is the word TOV [טוֹב] on the left and EREV [עֶרֶב] on the right.
So it reads (for me) EREV TOV, which is correct.

At the end of the line:
"Shouldn't the adjective follow the noun like this" what I see is the word EREV [עֶרֶב] on the left and TOV [טוֹב] on the right.
So it reads (for me) TOV EREV, which is not correct.

Is the above the way you see the Hebrew text, or is it the other way around for you :) ?

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Monday, July 20, 2009 3:34 PM
To: java-user@lucene.apache.org
Subject: Re: question on custom filter

Obender, I think your input is incorrect. The Hebrew text you pasted
in your example appears incorrect. It's gonna be hard for me to
communicate this since I think your computer is not displaying Hebrew
correctly :)

but the text you sent as an example was [טוֹב עֶרֶב]

Shouldn't the adjective follow the noun like this: עֶרֶב טוֹב

This makes me think your input is incorrect because it's being rendered
incorrectly; as I mentioned, RTL support isn't enabled by default in Windows.
But your input appears correct to you :)

On Mon, Jul 20, 2009 at 3:29 PM, OBender<os...@hotmail.com> wrote:
> Interesting, the question now is why am I seeing (even in println) what I'm seeing :)
> I'm reading a string from the file which is in UTF-8 encoding. Could this somehow be related...?
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Monday, July 20, 2009 3:03 PM
> To: java-user@lucene.apache.org
> Subject: Re: question on custom filter
>
> Obender, I ran your code and it did what I expected (but not what you pasted):
>
> First token is: (טוֹב,0,4)
> Second token is: (עֶרֶב,5,10)
>
> I also loaded up your SimpleWhitespaceAnalyzer in Luke, with the same results.
>
> On Mon, Jul 20, 2009 at 2:53 PM, OBender<os...@hotmail.com> wrote:
>> Here is the simple code. If you run it with English and with Hebrew, you will see that with English the tokens are returned from the left of the phrase to the right, and with Hebrew from the right to the left.
>>
>> Again, I'm talking about tokens here, not the individual letters.
>>
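>> // Imports assume the Lucene 2.x package layout; each public class goes in its own file.
>> import java.io.IOException;
>> import java.io.Reader;
>>
>> import org.apache.lucene.analysis.Analyzer;
>> import org.apache.lucene.analysis.Token;
>> import org.apache.lucene.analysis.TokenFilter;
>> import org.apache.lucene.analysis.TokenStream;
>> import org.apache.lucene.analysis.WhitespaceTokenizer;
>>
>> // Pass-through filter: prints every token it receives and returns it unchanged.
>> public class XFilter extends TokenFilter
>> {
>>        protected XFilter( TokenStream tokenStream ) {
>>                super( tokenStream );
>>        }
>>
>>        @Override
>>        public Token next( final Token reusableToken ) throws IOException
>>        {
>>                Token nextToken = input.next( reusableToken );
>>                // null marks the end of the stream; print an empty line for it
>>                System.out.println( nextToken != null ? nextToken : "" );
>>                return nextToken;
>>        }
>> }
>>
>> // Whitespace tokenizer followed by the printing filter above.
>> public class SimpleWhitespaceAnalyzer extends Analyzer
>> {
>>        @Override
>>        public TokenStream tokenStream( final String fieldName, final Reader reader )
>>        {
>>                TokenStream ts = new WhitespaceTokenizer( reader );
>>                ts = new XFilter( ts );
>>
>>                return ts;
>>        }
>> }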
>>
>> -----Original Message-----
>> From: Robert Muir [mailto:rcmuir@gmail.com]
>> Sent: Monday, July 20, 2009 2:26 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: question on custom filter
>>
>> Obender, I think something in your environment / display environment
>> might be causing some confusion.
>>
>> Are you using Microsoft Windows? If so, please verify that support for
>> right-to-left languages is enabled [Control Panel / Regional and
>> Language Options]. It is possible you are "seeing something different"
>> because your rendering system is not actually rendering right-to-left
>> text in right-to-left direction!!!!
>>
>> Second, instead of using a debugger, I would recommend using Luke to
>> look at the resulting tokens from your analyzer.
>>
>> On Mon, Jul 20, 2009 at 2:21 PM, OBender<os...@hotmail.com> wrote:
>>> This is how it should be written:
>>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%A2%D6%B6%D7%A8%D6%B6%D7%91+%D7%98%D7%95%D6%B9%D7%91

Re: question on custom filter

Posted by Robert Muir <rc...@gmail.com>.

Obender, I think your input is incorrect. The Hebrew text you pasted
in your example appears incorrect. It's gonna be hard for me to
communicate this since I think your computer is not displaying Hebrew
correctly :)

but the text you sent as an example was [טוֹב עֶרֶב]

Shouldn't the adjective follow the noun like this: עֶרֶב טוֹב

This makes me think your input is incorrect because it's being rendered
incorrectly; as I mentioned, RTL support isn't enabled by default in Windows.
But your input appears correct to you :)
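
Since display direction is decided by the renderer and not by how the string is stored, one rough check that does not depend on the screen is to ask java.text.Bidi how the bidirectional algorithm would resolve a string. This is only a sketch under assumptions: DirectionCheck and describe are hypothetical names, and DIRECTION_DEFAULT_LEFT_TO_RIGHT is just one reasonable choice (take the base direction from the first strong character, falling back to left-to-right).

```java
import java.text.Bidi;

public class DirectionCheck {

    /** Classifies a string as left-to-right, right-to-left, or mixed, using the JDK's bidi analysis. */
    public static String describe(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        if (bidi.isMixed()) {
            return "mixed";
        }
        return bidi.isRightToLeft() ? "right-to-left" : "left-to-right";
    }

    public static void main(String[] args) {
        // Plain English
        System.out.println(describe("How are you?"));
        // The Hebrew phrase from this thread, in the stored order TET..BET, space, AYIN..BET
        System.out.println(describe("\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1"));
        // Arabic letters reh, waw, beh, reh, teh followed by a space and the digits 1234
        System.out.println(describe("\u0631\u0648\u0628\u0631\u062A 1234"));
    }
}
```

None of this changes the stored order of the characters; it only reports how a conforming renderer would lay them out, which is what differs between a correctly and an incorrectly configured display.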

RE: question on custom filter

Posted by OBender <os...@hotmail.com>.

Interesting, the question now is why am I seeing (even in println) what I'm seeing :)
I'm reading a string from the file which is in UTF-8 encoding. Could this somehow be related...?

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Monday, July 20, 2009 3:03 PM
To: java-user@lucene.apache.org
Subject: Re: question on custom filter

Obender, I ran your code and it did what I expected (but not what you pasted):

First token is: (טוֹב,0,4)
Second token is: (עֶרֶב,5,10)

I also loaded up your SimpleWhitespaceAnalyzer in Luke, with the same results.
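
The offsets above, (0,4) and (5,10), are what a plain left-to-right scan over the stored string produces. A minimal sketch in plain Java that mimics whitespace tokenization without Lucene; LogicalOrderTokens and tokenize are hypothetical names, and the output format is only modelled on the lines above.

```java
import java.util.ArrayList;
import java.util.List;

public class LogicalOrderTokens {

    /** Splits on whitespace, scanning index 0 upward, and records (term,start,end) for each token. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= text.length(); i++) {
            boolean atEnd = i == text.length();
            boolean whitespace = !atEnd && Character.isWhitespace(text.charAt(i));
            if (!atEnd && !whitespace) {
                if (start < 0) {
                    start = i; // first character of a new token
                }
            } else if (start >= 0) {
                tokens.add("(" + text.substring(start, i) + "," + start + "," + i + ")");
                start = -1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Stored order: TET VAV HOLAM BET, space, AYIN SEGOL RESH SEGOL BET
        String phrase = "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";
        for (String token : tokenize(phrase)) {
            System.out.println(token);
        }
    }
}
```

The first token is the one whose characters come first in the stored string, even though a right-to-left renderer draws it on the right.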

RE: question on custom filter

Posted by OBender <os...@hotmail.com>.

Never mind, I think I got it.

-----Original Message-----
From: OBender [mailto:osya_bender@hotmail.com] 
Sent: Monday, July 20, 2009 4:42 PM
To: java-user@lucene.apache.org
Subject: RE: question on custom filter

No, it is reversed in the e-mail. Funny though, when I insert it into Excel it turns into the right order of words.
Thanks for all the help.

Maybe you have an idea of what the problem could be.
Here is how my data gets read and indexed.

I have a UTF-8 CSV file that is produced from Excel.
I read it in with Java (preserving the UTF-8 encoding). At this point the strings in the debugger look correct.
I insert it into the DB (MySQL), which is also UTF-8.
Then I read it back and put it into the index.

It looks like in the UTF-8 CSV file the words are in "reverse" order from a grammatical standpoint (left to right, e.g., EREV leftmost, then TOV). Should a UTF-8 CSV file preserve the natural (language-specific) order of words?

 
-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Monday, July 20, 2009 3:49 PM
To: java-user@lucene.apache.org
Subject: Re: question on custom filter

Obender, does the following text appear like the image in the link, or not?

שומר אחי

http://farm1.static.flickr.com/3/10445435_75b4546703.jpg?v=0


On Mon, Jul 20, 2009 at 3:34 PM, OBender<os...@hotmail.com> wrote:
> I've checked, and it appears to be enabled.
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Monday, July 20, 2009 3:18 PM
> To: java-user@lucene.apache.org
> Subject: Re: question on custom filter
>
> Obender, based on your previous comments (that you see text displayed
> in the wrong order), I again recommend that you enable support for RTL
> languages in your operating system, as I mentioned earlier. If you are
> using a Windows-based OS, this is not enabled by default!
>
> I think you are seeing things in the incorrect order, and this is
> causing confusion for you!
>
> On Mon, Jul 20, 2009 at 3:02 PM, Robert Muir<rc...@gmail.com> wrote:
>> Obender, i ran your code and it did what I expected (but not what you pasted):
>>
>> First token is: (טוֹב,0,4)
>> Second token is: (עֶרֶב,5,10)
>>
>> I also loaded up your SimpleWhitespaceAnalyzer in Luke, with the same results.
>>
>> On Mon, Jul 20, 2009 at 2:53 PM, OBender<os...@hotmail.com> wrote:
>>> Here is the simple code. If you run it with English and with Hebrew, you will see that with English the tokens are returned from the left of the phrase to the right, and with Hebrew from the right to the left.
>>>
>>> Again, I'm talking about tokens here, not the individual letters.
>>>
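>>> // XFilter.java
>>> import java.io.IOException;
>>>
>>> import org.apache.lucene.analysis.Token;
>>> import org.apache.lucene.analysis.TokenFilter;
>>> import org.apache.lucene.analysis.TokenStream;
>>>
>>> // Pass-through filter that prints each token it receives from upstream.
>>> public class XFilter extends TokenFilter
>>> {
>>>        protected XFilter( TokenStream tokenStream ) {
>>>                super( tokenStream );
>>>        }
>>>
>>>        @Override
>>>        public Token next( final Token reusableToken ) throws IOException
>>>        {
>>>                Token nextToken = input.next( reusableToken );
>>>                // null marks the end of the stream; print an empty line then
>>>                System.out.println( nextToken != null ? nextToken : "" );
>>>                return nextToken;
>>>        }
>>> }
>>>
>>> // SimpleWhitespaceAnalyzer.java
>>> import java.io.Reader;
>>>
>>> import org.apache.lucene.analysis.Analyzer;
>>> import org.apache.lucene.analysis.TokenStream;
>>> import org.apache.lucene.analysis.WhitespaceTokenizer;
>>>
>>> // Splits on whitespace only, then routes the tokens through XFilter.
>>> public class SimpleWhitespaceAnalyzer extends Analyzer
>>> {
>>>        @Override
>>>        public TokenStream tokenStream( final String fieldName, final Reader reader )
>>>        {
>>>                TokenStream ts = new WhitespaceTokenizer( reader );
>>>                ts = new XFilter( ts );
>>>
>>>                return ts;
>>>        }
>>> }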
>>>
>>> -----Original Message-----
>>> From: Robert Muir [mailto:rcmuir@gmail.com]
>>> Sent: Monday, July 20, 2009 2:26 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: question on custom filter
>>>
>>> Obender, I think something in your environment / display environment
>>> might be causing some confusion.
>>>
>>> Are you using Microsoft Windows? If so, please verify that support for
>>> right-to-left languages is enabled [Control Panel / Regional and
>>> Language Options]. It is possible you are "seeing something different"
>>> because your rendering system is not actually rendering right-to-left
>>> text in the right-to-left direction!
>>>
>>> Second, instead of using a debugger, I would recommend using Luke to
>>> look at the resulting tokens from your analyzer.
>>>
>>> On Mon, Jul 20, 2009 at 2:21 PM, OBender<os...@hotmail.com> wrote:
>>>> This is how it should be written:
>>>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%A2%D6%B6%D7%A8%D6%B6%D7%91+%D7%98%D7%95%D6%B9%D7%91
>>>>
>>>> -----Original Message-----
>>>> From: Robert Muir [mailto:rcmuir@gmail.com]
>>>> Sent: Monday, July 20, 2009 2:07 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: question on custom filter
>>>>
>>>> Obender, this is not true.
>>>> The text you pasted is the following in Unicode:
>>>>
>>>> \N{HEBREW LETTER TET}
>>>> \N{HEBREW LETTER VAV}
>>>> \N{HEBREW POINT HOLAM}
>>>> \N{HEBREW LETTER BET}
>>>> \N{SPACE}
>>>> \N{HEBREW LETTER AYIN}
>>>> \N{HEBREW POINT SEGOL}
>>>> \N{HEBREW LETTER RESH}
>>>> \N{HEBREW POINT SEGOL}
>>>> \N{HEBREW LETTER BET}
>>>>
>>>> You can use this utility to see how your text is encoded:
>>>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91
>>>>
>>>> For more information on directionality in Unicode, see
>>>> http://unicode.org/reports/tr9/
>>>>
>>>> On Mon, Jul 20, 2009 at 1:59 PM, OBender<os...@hotmail.com> wrote:
>>>>> Robert,
>>>>>
>>>>> I'm not sure you are correct on this one.
>>>>>
>>>>> If I have a Hebrew phrase:
>>>>> [טוֹב עֶרֶב]
>>>>> Then the first token that the filter receives is:
>>>>> [עֶרֶב] (0,5)
>>>>> and the second is:
>>>>> [טוֹב] (6,10)
>>>>> Which means that it counts from right to left (words and indexes).
>>>>>
>>>>> Am I missing something?
>>>>>
>>>>> -----Original Message-----
>>>>> From: Robert Muir [mailto:rcmuir@gmail.com]
>>>>> Sent: Monday, July 20, 2009 1:43 PM
>>>>> To: java-user@lucene.apache.org
>>>>> Subject: Re: question on custom filter
>>>>>
>>>>> Obender, I don't think it's as difficult as you think. Your filter does
>>>>> not need to be aware of this issue at all.
>>>>>
>>>>> In Unicode, right-to-left languages are encoded in the data in logical order.
>>>>> The rendering system is what reorders the text for right-to-left display
>>>>> of RTL languages.
>>>>>
>>>>> For example, in Arabic, "Robert 1234" displays as روبرت 1234
>>>>> On your computer monitor, read left to right, this looks like 1, 2, 3, 4,
>>>>> space, teh, reh, beh, waw, reh.
>>>>>
>>>>> But the Unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
>>>>>
>>>>> 2009/7/20 OBender <os...@hotmail.com>:
>>>>>> Hi All!
>>>>>>
>>>>>> Let's say I have a filter that produces new tokens based on the original ones.
>>>>>>
>>>>>> How bad will it be if my filter sets the start offset of each new token to 0
>>>>>> and the end offset to the length of the token?
>>>>>>
>>>>>> An example (based on the phrase "How are you?"):
>>>>>>
>>>>>> Original token:
>>>>>> [you?] (8,12)
>>>>>>
>>>>>> New tokens:
>>>>>> [you] (0,3)
>>>>>> [?] (0,1)
>>>>>>
>>>>>> It wouldn't be so hard to calculate the right numbers for left-to-right
>>>>>> languages, and it is a bit more challenging to do it for right-to-left ones,
>>>>>> but for mixed text it is quite hard.
>>>>>>
>>>>>> Thanks.

-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




Re: question on custom filter

Posted by Robert Muir <rc...@gmail.com>.

Obender, does the following text appear like the image in the link, or not?

שומר אחי

http://farm1.static.flickr.com/3/10445435_75b4546703.jpg?v=0
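
The same question can also be put to code instead of to a screen. This is a hedged sketch rather than a replacement for the visual check: it uses the JDK's `java.text.Bidi`, the class name `DirectionCheck` is invented for illustration, and the Hebrew phrase above is written with escapes so that nothing can reorder it on the way in.

```java
import java.text.Bidi;

public class DirectionCheck {

    // Describes how the JDK's bidi algorithm would lay out the stored text.
    static String direction(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        if (bidi.isMixed()) {
            return "mixed, " + bidi.getRunCount() + " runs";
        }
        return bidi.isRightToLeft() ? "right-to-left" : "left-to-right";
    }

    public static void main(String[] args) {
        // SHIN VAV MEM RESH, SPACE, ALEF HET YOD
        String hebrew = "\u05E9\u05D5\u05DE\u05E8 \u05D0\u05D7\u05D9";
        String english = "How are you?";

        System.out.println(direction(hebrew));
        System.out.println(direction(english));

        // In memory the first character is still the first letter of the
        // first word; only the renderer moves it to the right-hand edge.
        System.out.println(hebrew.charAt(0) == '\u05E9');
    }
}
```

If the stored string starts with SHIN but the screen shows that letter at the left edge, the display layer is not applying right-to-left layout.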


On Mon, Jul 20, 2009 at 3:34 PM, OBender<os...@hotmail.com> wrote:
> I've checked, and it appears to be enabled.

-- 
Robert Muir
rcmuir@gmail.com


RE: question on custom filter

Posted by OBender <os...@hotmail.com>.

I've checked, and it appears to be enabled.

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Monday, July 20, 2009 3:18 PM
To: java-user@lucene.apache.org
Subject: Re: question on custom filter

Obender, based on your previous comments (that you see text displayed
in the wrong order), I again recommend that you enable support for RTL
languages in your operating system, as I mentioned earlier. If you are
using a Windows-based OS, this is not enabled by default!

I think you are seeing things in the incorrect order, and this is
causing confusion for you!

-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: question on custom filter

Posted by Robert Muir <rc...@gmail.com>.

Obender, based on your previous comments (that you see text displayed
in the wrong order), I again recommend that you enable support for RTL
languages in your operating system, as I mentioned earlier. If you are
using a Windows-based OS, this is not enabled by default!

I think you are seeing things in the incorrect order, and this is
causing the confusion.

-- 
Robert Muir
rcmuir@gmail.com

Re: question on custom filter

Posted by Robert Muir <rc...@gmail.com>.

Obender, I ran your code and it did what I expected (but not what you pasted):

First token is: (טוֹב,0,4)
Second token is: (עֶרֶב,5,10)

I also loaded up your SimpleWhitespaceAnalyzer in Luke, with the same results.
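
For anyone who wants to check these offsets without Lucene in the picture, here is a minimal sketch using only the JDK. It is not the WhitespaceTokenizer implementation; the class name LogicalOffsets and the tokenize helper are made up for illustration. It splits the same phrase on whitespace and reports start/end offsets in storage (logical) order, which is the order a tokenizer reads the characters in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: whitespace tokens with offsets.
public class LogicalOffsets {

    /** Returns tokens formatted as (text,start,end); offsets are UTF-16 indexes, end exclusive. */
    public static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        int start = -1; // -1 means "not inside a token"
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!boundary && start < 0) {
                start = i; // first character of a new token
            } else if (boundary && start >= 0) {
                out.add("(" + text.substring(start, i) + "," + start + "," + i + ")");
                start = -1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The Hebrew phrase from this thread, written as escapes so that the
        // source file encoding and the editor's rendering cannot interfere.
        String phrase = "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";
        for (String token : tokenize(phrase)) {
            System.out.println(token);
        }
    }
}
```

Run on the phrase from this thread, it should report the word starting with TET at (0,4) and the word starting with AYIN at (5,10), whatever the console or debugger does to the display order.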

-- 
Robert Muir
rcmuir@gmail.com

RE: question on custom filter

Posted by OBender <os...@hotmail.com>.

Here is the simple code. If you run it with English and with Hebrew, you will see that for English the tokens are returned from the left of the phrase to the right, and for Hebrew from the right to the left.

Again, I'm talking about tokens here, not the individual letters.

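// XFilter.java
import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Pass-through filter: prints every token it receives from the wrapped stream.
public class XFilter extends TokenFilter
{
	protected XFilter( TokenStream tokenStream ) {
		super( tokenStream );
	}

	@Override
	public Token next( final Token reusableToken ) throws IOException
	{
		Token nextToken = input.next( reusableToken );
		// Print the token as received; an empty line marks the end of the stream.
		System.out.println( nextToken != null ? nextToken : "" );
		return nextToken;
	}
}

// SimpleWhitespaceAnalyzer.java (same package as XFilter, whose constructor is protected)
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Splits the input on whitespace, then routes the tokens through XFilter.
public class SimpleWhitespaceAnalyzer extends Analyzer
{
	@Override
	public TokenStream tokenStream( final String fieldName, final Reader reader )
	{
		TokenStream ts = new WhitespaceTokenizer( reader );
		ts = new XFilter( ts );

		return ts;
	}
}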

Re: question on custom filter

Posted by Robert Muir <rc...@gmail.com>.

Obender, I think something in your display environment might be
causing some confusion.

Are you using Microsoft Windows? If so, please verify that support for
right-to-left languages is enabled [Control Panel / Regional and
Language Options]. It is possible you are "seeing something different"
because your rendering system is not actually rendering right-to-left
text in right-to-left direction!

Second, instead of using a debugger, I would recommend using Luke to
look at the resulting tokens from your analyzer.
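
To illustrate the difference between what is stored and what is displayed, here is a small JDK-only sketch (an illustration under my own assumptions, not something Lucene does internally; the class and method names are hypothetical). It dumps the UTF-16 code units as hex, which no rendering system can reorder, and asks java.text.Bidi for the base direction a renderer would use when drawing the text.

```java
import java.text.Bidi;

// Hypothetical helper for inspecting a string independently of how it is rendered.
public class StoredVsDisplayed {

    /** Hex dump of the UTF-16 code units in storage (logical) order. */
    public static String hexDump(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    /** True when the bidi algorithm picks a right-to-left base direction for the text. */
    public static boolean baseIsRightToLeft(String text) {
        // Base direction is taken from the first strong character, defaulting to LTR.
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return !bidi.baseIsLeftToRight();
    }

    public static void main(String[] args) {
        // The Hebrew phrase from this thread, written as escapes.
        String phrase = "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";
        System.out.println(hexDump(phrase));
        System.out.println("base direction RTL: " + baseIsRightToLeft(phrase));
    }
}
```

The hex dump always starts with the first stored character, even though a correctly configured display draws that character at the right-hand edge of the line.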

-- 
Robert Muir
rcmuir@gmail.com

RE: question on custom filter

Posted by OBender <os...@hotmail.com>.

This is how it should be written:
http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%A2%D6%B6%D7%A8%D6%B6%D7%91+%D7%98%D7%95%D6%B9%D7%91

RE: question on custom filter

Posted by OBender <os...@hotmail.com>.

Hold on a second, the phrase in the link you included does not have the words in the correct order!

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Monday, July 20, 2009 2:07 PM
To: java-user@lucene.apache.org
Subject: Re: question on custom filter

Obender, This is not true.
the text you pasted is the following in unicode:

\N{HEBREW LETTER TET}
\N{HEBREW LETTER VAV}
\N{HEBREW POINT HOLAM}
\N{HEBREW LETTER BET}
\N{SPACE}
\N{HEBREW LETTER AYIN}
\N{HEBREW POINT SEGOL}
\N{HEBREW LETTER RESH}
\N{HEBREW POINT SEGOL}
\N{HEBREW LETTER BET}

you can use this utility to see how your text is encoded:
http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91

For more information on directionality in unicode, see
http://unicode.org/reports/tr9/

On Mon, Jul 20, 2009 at 1:59 PM, OBender<os...@hotmail.com> wrote:
> Robert,
>
> I'm not sure you are correct on this one.
>
> If I have a Hebrew phrase:
> [טוֹב עֶרֶב]
> Then first token that filter receives is:
> [עֶרֶב] (0,5)
> and the second is:
> [טוֹב] (6,10)
> Which means that it counts from right to left (words and indexes).
>
> Am I missing something?
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Monday, July 20, 2009 1:43 PM
> To: java-user@lucene.apache.org
> Subject: Re: question on custom filter
>
> Obender, I don't think its as difficult as you think. Your filter does
> not need to be aware of this issue at all.
>
> In unicode, right-to-left languages are encoded in the data in logical order.
> The rendering system is what converts it to display in right-to-left
> for RTL languages.
>
> For example in Arabic, "Robert 1234" displays as روبرت 1234
> To your computer monitor, this looks like 1, 2, 3, 4, space, teh, reh,
> beh, waw, reh
>
> But the unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
>
> 2009/7/20 OBender <os...@hotmail.com>:
>> Hi All!
>>
>>
>>
>> Let's say I have a filter that produces new tokens based on the original ones.
>>
>> How bad will it be if my filter sets the start of each token to 0 and end to
>> the length of a token?
>>
>> An example (based on the phrase "How are you?"):
>>
>>
>>
>> Original token:
>>
>> [you?] (8,12)
>>
>>
>>
>> New tokens:
>>
>> [you] (0,3)
>>
>> [?] (0,1)
>>
>>
>>
>> It wouldn't be so hard to calculate the right numbers for left to right
>> languages and it is a bit more challenging to do it for right to left ones
>> but for mixed text it is quite hard.
>>
>>
>>
>> Thanks.
>>
>>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: question on custom filter

Posted by Robert Muir <rc...@gmail.com>.

Obender, this is not true.
The text you pasted is the following in Unicode:

\N{HEBREW LETTER TET}
\N{HEBREW LETTER VAV}
\N{HEBREW POINT HOLAM}
\N{HEBREW LETTER BET}
\N{SPACE}
\N{HEBREW LETTER AYIN}
\N{HEBREW POINT SEGOL}
\N{HEBREW LETTER RESH}
\N{HEBREW POINT SEGOL}
\N{HEBREW LETTER BET}

You can use this utility to see how your text is encoded:
http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91

For more information on directionality in Unicode, see
http://unicode.org/reports/tr9/
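
The same check can be done from Java instead of the web utility. What follows is only a minimal sketch and not code from this thread: it assumes Java 8 or later, the class and method names are made up for illustration, and the whitespace splitter is a plain-Java stand-in rather than a Lucene tokenizer. The phrase is written with \u escapes so that no editor or debugger gets a chance to reorder it for display.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only (not part of Lucene).
final class HebrewLogicalOrder {
    // The Hebrew phrase from the thread, code point by code point.
    static final String PHRASE =
        "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";

    /** Unicode names of the code points of s, in stored (logical) order. */
    static List<String> names(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp -> out.add(Character.getName(cp)));
        return out;
    }

    /** Whitespace-delimited words of s as {start, end} offsets into s. */
    static List<int[]> whitespaceOffsets(String s) {
        List<int[]> out = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a word"
        for (int i = 0; i <= s.length(); i++) {
            boolean boundary =
                i == s.length() || Character.isWhitespace(s.charAt(i));
            if (!boundary && start < 0) {
                start = i;                       // a word begins here
            } else if (boundary && start >= 0) {
                out.add(new int[] {start, i});   // the word ends before i
                start = -1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        names(PHRASE).forEach(System.out::println);
        for (int[] t : whitespaceOffsets(PHRASE)) {
            System.out.println("(" + t[0] + "," + t[1] + ")");
        }
    }
}
```

In this sketch the word beginning with TET gets offsets (0,4) and the word beginning with AYIN gets (5,10), because offsets count positions in the stored string and not positions on the screen.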

On Mon, Jul 20, 2009 at 1:59 PM, OBender<os...@hotmail.com> wrote:
> Robert,
>
> I'm not sure you are correct on this one.
>
> If I have a Hebrew phrase:
> [טוֹב עֶרֶב]
> Then the first token that the filter receives is:
> [עֶרֶב] (0,5)
> and the second is:
> [טוֹב] (6,10)
> Which means that it counts from right to left (words and indexes).
>
> Am I missing something?
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Monday, July 20, 2009 1:43 PM
> To: java-user@lucene.apache.org
> Subject: Re: question on custom filter
>
> Obender, I don't think it's as difficult as you think. Your filter does
> not need to be aware of this issue at all.
>
> In Unicode, right-to-left languages are encoded in the data in logical order.
> The rendering system is what reorders the text for right-to-left display
> of RTL languages.
>
> For example, in Arabic, "Robert 1234" displays as روبرت 1234
> On your computer monitor, read left to right, this looks like 1, 2, 3, 4,
> space, teh, reh, beh, waw, reh.
>
> But the Unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
>
> 2009/7/20 OBender <os...@hotmail.com>:
>> Hi All!
>>
>>
>>
>> Let's say I have a filter that produces new tokens based on the original ones.
>>
>> How bad will it be if my filter sets the start of each token to 0 and end to
>> the length of a token?
>>
>> An example (based on the phrase "How are you?"):
>>
>>
>>
>> Original token:
>>
>> [you?] (8,12)
>>
>>
>>
>> New tokens:
>>
>> [you] (0,3)
>>
>> [?] (0,1)
>>
>>
>>
>> It wouldn't be so hard to calculate the right numbers for left to right
>> languages and it is a bit more challenging to do it for right to left ones
>> but for mixed text it is quite hard.
>>
>>
>>
>> Thanks.
>>
>>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com



-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: question on custom filter

Posted by OBender <os...@hotmail.com>.

Robert,

I'm not sure you are correct on this one.

If I have a Hebrew phrase:
[טוֹב עֶרֶב]
Then the first token that the filter receives is:
[עֶרֶב] (0,5)
and the second is:
[טוֹב] (6,10)
Which means that it counts from right to left (words and indexes).

Am I missing something?

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Monday, July 20, 2009 1:43 PM
To: java-user@lucene.apache.org
Subject: Re: question on custom filter

Obender, I don't think it's as difficult as you think. Your filter does
not need to be aware of this issue at all.

In Unicode, right-to-left languages are encoded in the data in logical order.
The rendering system is what reorders the text for right-to-left display
of RTL languages.

For example, in Arabic, "Robert 1234" displays as روبرت 1234
On your computer monitor, read left to right, this looks like 1, 2, 3, 4,
space, teh, reh, beh, waw, reh.

But the Unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.

2009/7/20 OBender <os...@hotmail.com>:
> Hi All!
>
>
>
> Let's say I have a filter that produces new tokens based on the original ones.
>
> How bad will it be if my filter sets the start of each token to 0 and end to
> the length of a token?
>
> An example (based on the phrase "How are you?"):
>
>
>
> Original token:
>
> [you?] (8,12)
>
>
>
> New tokens:
>
> [you] (0,3)
>
> [?] (0,1)
>
>
>
> It wouldn't be so hard to calculate the right numbers for left to right
> languages and it is a bit more challenging to do it for right to left ones
> but for mixed text it is quite hard.
>
>
>
> Thanks.
>
>



-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: question on custom filter

Posted by Robert Muir <rc...@gmail.com>.

Obender, I don't think it's as difficult as you think. Your filter does
not need to be aware of this issue at all.

In Unicode, right-to-left languages are encoded in the data in logical order.
The rendering system is what reorders the text for right-to-left display
of RTL languages.

For example, in Arabic, "Robert 1234" displays as روبرت 1234
On your computer monitor, read left to right, this looks like 1, 2, 3, 4,
space, teh, reh, beh, waw, reh.

But the Unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
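
To make the logical-versus-visual distinction concrete, here is a small sketch using the JDK's own java.text.Bidi class. It is an editorial illustration and not code from this thread: the class and method names are hypothetical, and it ignores glyph shaping and mirroring, which a real renderer also handles. It takes the stored string, asks Bidi for the embedding level of each character, and lets Bidi.reorderVisually produce the left-to-right display order.

```java
import java.text.Bidi;

// Hypothetical demo class, for illustration only.
final class LogicalVsVisual {
    // "Robert 1234" in Arabic, in stored (logical) order:
    // reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
    static final String TEXT = "\u0631\u0648\u0628\u0631\u062A 1234";

    /** Returns the chars of s rearranged into left-to-right display order. */
    static String visualOrder(String s) {
        // Base direction is taken from the first strong character in s.
        Bidi bidi = new Bidi(s, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        byte[] levels = new byte[s.length()];
        Character[] chars = new Character[s.length()];
        for (int i = 0; i < s.length(); i++) {
            levels[i] = (byte) bidi.getLevelAt(i);
            chars[i] = s.charAt(i);
        }
        // Reorders the chars array in place according to the levels.
        Bidi.reorderVisually(levels, 0, chars, 0, chars.length);
        StringBuilder sb = new StringBuilder();
        for (Character c : chars) {
            sb.append(c.charValue());
        }
        return sb.toString();
    }

    /** One Unicode character name per line, in the order given. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.getName(s.charAt(i))).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("Logical (stored) order:");
        System.out.print(describe(TEXT));
        System.out.println("Visual (display) order, left to right:");
        System.out.print(describe(visualOrder(TEXT)));
    }
}
```

The point for a filter is that it only ever sees the first ordering. The second one exists only at display time, which is also why a debugger window can make the tokens look as if they arrive in the opposite order.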

2009/7/20 OBender <os...@hotmail.com>:
> Hi All!
>
>
>
> Let's say I have a filter that produces new tokens based on the original ones.
>
> How bad will it be if my filter sets the start of each token to 0 and end to
> the length of a token?
>
> An example (based on the phrase "How are you?"):
>
>
>
> Original token:
>
> [you?] (8,12)
>
>
>
> New tokens:
>
> [you] (0,3)
>
> [?] (0,1)
>
>
>
> It wouldn't be so hard to calculate the right numbers for left to right
> languages and it is a bit more challenging to do it for right to left ones
> but for mixed text it is quite hard.
>
>
>
> Thanks.
>
>
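
Since the text is stored in logical order, the offset arithmetic for the new tokens is the same for left-to-right, right-to-left and mixed input: the original token's start plus the position inside the token. The sketch below is only an illustration under stated assumptions. Tok is a hypothetical stand-in and not Lucene's Token class, the splitting rule (runs of letters, digits and combining marks versus single other characters) is made up for the example, and it assumes the token text is an unmodified slice of the input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene API.
final class SubTokenOffsets {
    /** Stand-in for a token: text plus [start, end) offsets into the input. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + text + "] (" + start + "," + end + ")";
        }
    }

    /** Letters, digits and non-spacing marks (e.g. Hebrew points) stay together. */
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c)
            || Character.getType(c) == Character.NON_SPACING_MARK;
    }

    /** Splits a token into word runs and single other characters. */
    static List<Tok> split(Tok original) {
        List<Tok> out = new ArrayList<>();
        String s = original.text;
        int i = 0;
        while (i < s.length()) {
            int j = i + 1;
            if (isWordChar(s.charAt(i))) {
                while (j < s.length() && isWordChar(s.charAt(j))) {
                    j++;
                }
            }
            // New offsets are the original start shifted by the position
            // inside the token; no direction handling is needed.
            out.add(new Tok(s.substring(i, j),
                            original.start + i, original.start + j));
            i = j;
        }
        return out;
    }

    public static void main(String[] args) {
        // "you?" sits at (8,12) in "How are you?".
        for (Tok t : split(new Tok("you?", 8, 12))) {
            System.out.println(t);
        }
    }
}
```

For the example in the question this gives [you] (8,11) and [?] (11,12) rather than (0,3) and (0,1), so the new tokens still point back into the original text.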



-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org