Posted to dev@lucene.apache.org by "Steven Rowe (JIRA)" <ji...@apache.org> on 2008/05/02 19:16:55 UTC

[jira] Created: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

RangeQuery and RangeFilter should use collation to check for range inclusion
----------------------------------------------------------------------------

                 Key: LUCENE-1279
                 URL: https://issues.apache.org/jira/browse/LUCENE-1279
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Search
    Affects Versions: 2.3.1
            Reporter: Steven Rowe
            Priority: Minor
             Fix For: 2.4


See [this java-user discussion|http://www.nabble.com/lucene-farsi-problem-td16977096.html] of problems caused by Unicode code-point comparison, instead of collation, in RangeQuery.

RangeQuery could take in a Locale via a setter, which could be used with a java.text.Collator and/or CollationKeys, to handle ranges for languages whose alphabet orderings differ from Unicode code-point order.
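
As a rough illustration of the difference (a sketch that uses only java.text; the class and method names are hypothetical and none of this is taken from a patch), the same range check can give different answers under String.compareTo and under a Collator. English is used here only because its ordering is easy to verify by eye:

```java
import java.text.Collator;
import java.util.Locale;

// Illustrative only: class and method names are hypothetical, not from any patch.
class CodePointVsCollation {

    // Range inclusion using String.compareTo, i.e. char-by-char (UTF-16 code unit) order.
    static boolean inRangeByCodePoint(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    // Range inclusion using the ordering defined by a java.text.Collator.
    static boolean inRangeByCollator(Collator collator, String term, String lower, String upper) {
        return collator.compare(term, lower) >= 0 && collator.compare(term, upper) <= 0;
    }

    public static void main(String[] args) {
        Collator english = Collator.getInstance(Locale.ENGLISH);
        // 'b' (U+0062) is greater than 'C' (U+0043), so it falls outside [A, C] here.
        System.out.println(inRangeByCodePoint("b", "A", "C"));
        // The English collator orders "b" between "A" and "C".
        System.out.println(inRangeByCollator(english, "b", "A", "C"));
    }
}
```

The same mismatch, in a less obvious form, is what the java-user thread reports for Farsi terms.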

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1279:
--------------------------------

    Attachment: LUCENE-1279.patch

Rewritten patch, with the collating range functionality now added to existing classes RangeQuery and RangeFilter.

All tests pass.



[jira] Issue Comment Edited: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595467#action_12595467 ] 

steve_rowe edited comment on LUCENE-1279 at 5/8/08 10:07 PM:
--------------------------------------------------------------

bq. Hmmm... excellent point. you convinced me.

Okay. :)  At your (previous) suggestion, I have redone the patch (will attach shortly), moving the collating stuff into RangeQuery and RangeFilter, with enabling bits in QueryParser and ConstantScoreRangeQuery.  I put WARNING text in the javadoc for each method that invokes the expensive index Term iteration, so hopefully that will give pause to those who might otherwise unwittingly slow things down.

bq. BTW: if hooks for CollatingRangeQuery are added to QueryParser, it shouldn't use this class just because a Locale is specified - that would cause some unexpected results for people who have been specifying a Locale for date reasons. a new "setter" would need to indicate when to pay attention to Collation.

I added a new setter to QueryParser for this purpose: {{setRangeCollator(Collator)}}.



[jira] Assigned: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned LUCENE-1279:
---------------------------------------

    Assignee: Grant Ingersoll



[jira] Commented: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593819#action_12593819 ] 

Steven Rowe commented on LUCENE-1279:
-------------------------------------

RangeFilter should also take in a Locale, to perform the same sort of comparisons.

QueryParser already takes in a Locale, though it was originally intended to be used for date comparisons.  It could forward this Locale, through ConstantScoreRangeQuery, to RangeFilter.



[jira] Commented: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629131#action_12629131 ] 

Michael McCandless commented on LUCENE-1279:
--------------------------------------------

Grant, what's the game plan on this one?



[jira] Commented: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630806#action_12630806 ] 

Steven Rowe commented on LUCENE-1279:
-------------------------------------

{quote}
from the Collator javadocs:
bq. When sorting a list of Strings however, it is generally necessary to compare each String multiple times. In this case, CollationKeys provide better performance. The CollationKey class converts a String to a series of bits that can be compared bitwise against other CollationKeys. A CollationKey is created by a Collator object for a given String. 

I don't think we need to implement this now, but I wonder if there is a performance difference if we created the CollationKey for comparison. The big question is whether the construction of that for each term outweighs the savings by repeated comparisons to lower and upper.
{quote}
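
The trade-off in the quoted question can be sketched roughly as follows. This is only an illustration with hypothetical names and an in-memory list standing in for index terms; it does not measure anything, it just shows that both paths select the same terms while the key-based path builds one CollationKey per term:

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: names are hypothetical and no Lucene classes are involved.
class CollationKeyRange {

    // Compares each term to both bounds with Collator.compare.
    static List<String> filterWithCompare(Collator c, List<String> terms, String lower, String upper) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (c.compare(term, lower) >= 0 && c.compare(term, upper) <= 0) {
                hits.add(term);
            }
        }
        return hits;
    }

    // Builds keys for the two bounds once, then one key per term.
    static List<String> filterWithKeys(Collator c, List<String> terms, String lower, String upper) {
        CollationKey lowerKey = c.getCollationKey(lower);
        CollationKey upperKey = c.getCollationKey(upper);
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            CollationKey key = c.getCollationKey(term); // the per-term construction cost
            if (key.compareTo(lowerKey) >= 0 && key.compareTo(upperKey) <= 0) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

Since each term is compared only twice per search, the key construction is unlikely to be amortized the way it is when sorting; that would need to be measured rather than assumed.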

I think the problem is that every single index term has to be converted to a CollationKey for every single (range) search.  In an earlier comment on this issue, Hoss said:

bq. 4) when i first saw the thread that spawned this issue, my first reaction was to wonder if it would make sense to start allowing a Collator to be specified when indexing, and to use the raw bytes from the CollationKey as the indexed value - I haven't thought it through very hard, but i wonder if that would be feasible (it seems like it would certainly be faster at query time, since it would allow more traditional term skipping).

I'm working on a utility class to store arbitrary binary in sortable, indexable Strings, so that CollationKeys can be stored in the index.  IMHO, though, this issue should still go forward.
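
To make the idea concrete, here is a minimal sketch of one way to turn CollationKey bytes into Strings whose String.compareTo order matches the byte order. It uses plain hex encoding as a stand-in (an assumption for illustration, not the utility class mentioned above), and the class and method names are hypothetical:

```java
import java.text.Collator;

// Illustrative only: hex encoding is a simple stand-in, not the utility class described above.
class SortableKeyEncoder {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // Two lowercase hex digits per byte; '0'-'9' sort before 'a'-'f', so the
    // encoded Strings compare the same way the unsigned bytes do.
    static String encode(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
        }
        return sb.toString();
    }

    // Encodes the CollationKey bytes of 'text' as a String that could be indexed as a term.
    static String indexableKey(Collator collator, String text) {
        return encode(collator.getCollationKey(text).toByteArray());
    }
}
```

Hex doubles the size of every key, so a denser encoding would be preferable in practice; the point is only that an ordinary code-point range scan over such terms would follow the Collator's order.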

bq. One more question, and it probably shows my lack of knowledge here, but would it be possible to enumerate the various codepoints where there are problems and just handle these separately, somehow? Basically, how pervasive is the problem? Would we perhaps be better off having a check to see if one of these bad codepoints falls in the range of lower/upper and then handle it separately?

Languages, in some cases using the same character repertoire, define different orderings.  Also, I believe some orderings are context dependent - you can't always compare character by character.  So handling these cases separately in Lucene would duplicate much of what the Collator already does.
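
A small sketch of what "context dependent" can mean, using java.text.RuleBasedCollator with a made-up rule string (the rules and names below are hypothetical toy values, not any real locale's tailoring). The pair "ch" is declared as a single letter between "c" and "d", so no per-character check can reproduce the ordering:

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Illustrative only: the rule string is a toy example, not a real locale's rules.
class ContractionOrdering {

    // "ch" is treated as one collation element that sorts after "c" and before "d".
    static final String RULES = "< a < b < c < ch < d < h < z";

    static RuleBasedCollator toyCollator() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = toyCollator();
        // Char-by-char comparison: 'h' < 'z', so "ch" sorts before "cz".
        System.out.println("ch".compareTo("cz") < 0);
        // With the contraction, "cz" is [c][z] and "ch" is the single element [ch],
        // and c sorts before ch, so "cz" sorts before "ch".
        System.out.println(collator.compare("cz", "ch") < 0);
    }
}
```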



[jira] Commented: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594574#action_12594574 ] 

Steven Rowe commented on LUCENE-1279:
-------------------------------------

bq. 1) you should be able to at least start the enumerator by skipping to a term consisting of the lowerTermField and the termText of "" ... even if the Collation of the term text is random, you still know which field you want.

I thought I did that - from the patch:

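{code:java}
    // Position the enumerator at the first term of the target field ("" sorts before any term text).
    TermEnum enumerator = reader.terms(new Term(getField(), ""));
    ...
  // The field is taken from whichever range endpoint is non-null.
  public String getField() {
    return (lowerTerm != null ? lowerTerm.field() : upperTerm.field());
  }
{code}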

bq. 2) why can a collator only be specified by a Locale, why not just let people specify the Collator they want directly?

In the java-user thread that spawned this issue, I mentioned that this would be necessary for custom Collators.  I used Locale because it's simpler to specify, but you're right, directly specifying a Collator makes more sense.

bq. 3) instead of adding a new public CollatingRangeQuery, would it make more sense to add an optional Collator to RangeQuery (and RangeFilter) which triggers a different code path when non null? (from a performance standpoint it would basically be one conditional check at the beginning of the rewrite method.)

This was my original thought, but since the performance impact could be large compared to a standard RangeQuery, I thought it made more sense to put it where it couldn't be used accidentally :).  I can redo it to integrate with the existing classes, though.

bq. 4) when i first saw the thread that spawned this issue, my first reaction was to wonder if it would make sense to start allowing a Collator to be specified when indexing, and to use the raw bytes from the CollationKey as the indexed value - I haven't thought it through very hard, but i wonder if that would be feasible (it seems like it would certainly be faster at query time, since it would allow more traditional term skipping).

I thought of something similar, but wow, this would be large.  It would require that the exact Collator used to generate the index terms also be used to generate CollationKeys for RangeQuery and RangeFilter -- the Collator's rules would have to be stored in the index.  Also, how would binary CollationKey (de-)serialization fit into the String (de-)serialization currently in place for index terms?

My guess is that the functionality provided here is most useful for fields with a small number of terms -- especially in the case of RangeQuery, where the BooleanQuery clause limit is not guarded against.  Given this IMHO most likely scenario, the performance optimization you're talking about (and the attendant code complexity) probably isn't warranted.




[jira] Commented: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595412#action_12595412 ] 

Hoss Man commented on LUCENE-1279:
----------------------------------

bq. I thought I did that 

My bad.  I misread.

bq. since the performance impact could be large compared to a standard RangeQuery, I thought it made more sense to put it where it couldn't be used accidentally

Hmmm...  excellent point.  You convinced me.

BTW: if hooks for CollatingRangeQuery are added to QueryParser, it shouldn't use this class just because a Locale is specified -- that would cause some unexpected results for people who have been specifying a Locale for date reasons. A new "setter" would need to indicate when to pay attention to Collation.



[jira] Commented: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594464#action_12594464 ] 

Hoss Man commented on LUCENE-1279:
----------------------------------

A few random thoughts:

1) You should be able to at least start the enumerator by skipping to a term consisting of the lowerTermField and the termText of "" ... even if the Collation of the term text is random, you still know which field you want.

2) Why can a collator only be specified by a Locale? Why not just let people specify the Collator they want directly?

3) Instead of adding a new public CollatingRangeQuery, would it make more sense to add an optional Collator to RangeQuery (and RangeFilter) which triggers a different code path when non-null?  (From a performance standpoint it would basically be one conditional check at the beginning of the rewrite method.)

4) When I first saw the thread that spawned this issue, my first reaction was to wonder if it would make sense to start allowing a Collator to be specified when indexing, and to use the raw bytes from the CollationKey as the indexed value -- I haven't thought it through very hard, but I wonder if that would be feasible (it seems like it would certainly be faster at query time, since it would allow more traditional term skipping).



[jira] Updated: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1279:
--------------------------------

    Attachment: LUCENE-1279.patch

Updated to current trunk revision (694771).  Mostly this consisted of switching away from deprecated Hits in tests.

Also, I used JavaCC 4.1 to regenerate QueryParser.java et al., and it looks like all of the files in the o.a.l.queryParser package have been changed - apparently the last time they were generated, JavaCC 4.0 was used.

All tests pass for me (except TestIndexReaderReopen.testThreadSafety(), which I just posted to java-dev about, and which should be completely unrelated to this issue).



[jira] Commented: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595467#action_12595467 ] 

Steven Rowe commented on LUCENE-1279:
-------------------------------------

bq. Hmmm... excellent point. you convinced me.

Okay. :)  At your (previous) suggestion, I have redone the patch (will attach shortly), moving the collating stuff into RangeQuery and RangeFilter, with enabling bits in QueryParser and ConstantScoreRangeQuery.  I put WARNING text in the javadoc for each method that invokes the expensive index Term iteration, so hopefully that will give pause to those who might otherwise unwittingly slow things down.

bq. BTW: if hooks for CollatingRangeQuery are added to QueryParser, it shouldn't use this class just because a Locale is specified - that would cause some unexpected results for people who have been specifying a Locale for date reasons. a new "setter" would need to indicate when to pay attention to Collation.

I added a new setter to QueryParser for this purpose: {{ setRangeCollator(Collator) }}.



[jira] Commented: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593867#action_12593867 ] 

Steven Rowe commented on LUCENE-1279:
-------------------------------------

(Wild guess): iterate over all terms instead of iterating over terms between the lower and upper term.  Hmm, this is going to be slow.

The implementation could default to the current behavior if no/null Locale is supplied.
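
A rough sketch of why the collated path has to look at every term (hypothetical names, with a sorted in-memory list standing in for a field's terms; this is not the patch's code). Terms are stored in code-point order, so the default path can stop as soon as it passes the upper bound, while the matches under a Collator need not be contiguous in that order:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a sorted list stands in for a field's terms in the index.
class TermScan {

    // Matched terms plus the number of terms actually examined.
    static final class Result {
        final List<String> hits = new ArrayList<>();
        int visited;
    }

    // 'sortedTerms' must be in String.compareTo order. A null collator selects the default path.
    static Result scan(List<String> sortedTerms, String lower, String upper, Collator collator) {
        Result result = new Result();
        for (String term : sortedTerms) {
            if (collator == null) {
                if (term.compareTo(lower) < 0) {
                    continue; // a real index would seek directly to 'lower' instead of skipping
                }
                result.visited++;
                if (term.compareTo(upper) > 0) {
                    break; // past the upper bound: nothing later can match
                }
                result.hits.add(term);
            } else {
                result.visited++; // no early exit: every term must be checked
                if (collator.compare(term, lower) >= 0 && collator.compare(term, upper) <= 0) {
                    result.hits.add(term);
                }
            }
        }
        return result;
    }
}
```

With the terms "Apple", "Cherry", "apple", "banana", "cherry" and the range [a, c], the default path examines three terms, whereas an English collator examines all five and also matches "Apple", which sits at the far end of the code-point order.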



[jira] Updated: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1279:
--------------------------------

    Attachment: LUCENE-1279.patch

bq. Seems like the new tests in TestRangeFilter still use Hits.

Thanks, I missed those.  This new patch removes the remaining Hits usages and also switches a few deprecated Field.Index.UN_TOKENIZED usages to NOT_ANALYZED.

All tests pass.

> RangeQuery and RangeFilter should use collation to check for range inclusion
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-1279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1279
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.1
>            Reporter: Steven Rowe
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1279.patch, LUCENE-1279.patch, LUCENE-1279.patch, LUCENE-1279.patch
>
>
> See [this java-user discussion|http://www.nabble.com/lucene-farsi-problem-td16977096.html] of problems caused by Unicode code-point comparison, instead of collation, in RangeQuery.
> RangeQuery could take in a Locale via a setter, which could be used with a java.text.Collator and/or CollationKey's, to handle ranges for languages which have alphabet orderings different from those in Unicode.



[jira] Commented: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593858#action_12593858 ] 

Yonik Seeley commented on LUCENE-1279:
--------------------------------------

How do you suggest actually retrieving all of the documents between two endpoints based on non-index ordering?



[jira] Resolved: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved LUCENE-1279.
-------------------------------------

       Resolution: Fixed
    Lucene Fields:   (was: [New])

Committed revision 696056.



[jira] Commented: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630563#action_12630563 ] 

Grant Ingersoll commented on LUCENE-1279:
-----------------------------------------

Steve, can you update this for trunk (assuming SVN is working)?

> RangeQuery and RangeFilter should use collation to check for range inclusion
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-1279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1279
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.1
>            Reporter: Steven Rowe
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1279.patch, LUCENE-1279.patch
>
>
> See [this java-user discussion|http://www.nabble.com/lucene-farsi-problem-td16977096.html] of problems caused by Unicode code-point comparison, instead of collation, in RangeQuery.
> RangeQuery could take in a Locale via a setter, which could be used with a java.text.Collator and/or CollationKey's, to handle ranges for languages which have alphabet orderings different from those in Unicode.



[jira] Commented: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644492#action_12644492 ] 

Steven Rowe commented on LUCENE-1279:
-------------------------------------

Hoss wrote:

bq. 4) when i first saw the thread that spawned this issue, my first reaction was to wonder if it would make sense to start allowing a Collator to be specified when indexing, and to use the raw bytes from the CollationKey as the indexed value - I haven't thought it through very hard, but i wonder if that would be feasible (it seems like it would certainly be faster at query time, since it would allow more traditional term skipping).

See LUCENE-1435, which is an implementation of this idea.
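To illustrate why indexing the CollationKey bytes would help, here is a small sketch using only the JDK. It is not the LUCENE-1435 code; the class and method names are hypothetical. It relies on the documented property of CollationKey.toByteArray() that the byte arrays, organized most significant byte first, compare the same way as the keys themselves:

```java
import java.text.Collator;
import java.util.Locale;

/** Hypothetical helper for illustration; not the LUCENE-1435 implementation. */
final class CollationKeyBytesSketch {

    /** Raw collation key bytes for a term, as produced by the given collator. */
    static byte[] keyBytes(Collator collator, String term) {
        return collator.getCollationKey(term).toByteArray();
    }

    /** Lexicographic comparison that treats each byte as an unsigned value. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // A key that is a strict prefix of another sorts first.
        return a.length - b.length;
    }
}
```

If terms were stored under such keys, byte order in the index would already be collation order, so a range scan could seek to the lower key and stop at the upper key instead of visiting every term.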



[jira] Commented: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630894#action_12630894 ] 

Grant Ingersoll commented on LUCENE-1279:
-----------------------------------------

{quote}
I think the problem is that every single index term has to be converted to a CollationKey for every single (range) search. 
{quote}

Yes, agreed.  The main question is whether that would be faster than the String comparisons.  Basically, is a construction plus a bitwise compare faster than a string compare?


{quote}
Languages, in some cases using the same character repertoire, define different orderings. Also, I believe some orderings are context dependent - you can't always compare character by character. So adding this stuff to Lucene would be to duplicate a lot of the stuff that's already done in the Collator.
{quote}

Makes sense, was just wondering if there were some shortcuts to be had since we have a very particular case and I was thinking maybe it would allow us to narrow down the range to search.

For instance, hypothetically speaking, say your field had a full range of words starting with A up to Z, but you knew the ordering problem only occurred between L and P, and your lower and upper terms were K and Q. Then you could feel confident that you could skip to K and stop at Q without any ramifications.  I realize this is repeating what is in the Collator, but it would be nice if the Collator exposed the info.  However, perhaps, if using a RuleBasedCollator, the getRules() method could be used to optimize.  Again, just thinking out loud, I haven't explored it.

I agree, this should still go forward, even as is.




[jira] Updated: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1279:
--------------------------------

    Attachment: LUCENE-1279.patch

Attaching a patch containing class CollatingRangeQuery, which extends RangeQuery, overriding the rewrite() method.  A test class is also supplied.  This is targeted at contrib/.

Because index term ordering (Unicode code point order) is not necessarily the same as the collation order for the given Locale, *all* index terms in the field of the lower and upper range terms have to be examined.  As a result, CollatingRangeQuery will be significantly slower than RangeQuery.
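The ordering mismatch can be seen with the JDK alone. The sketch below sorts the same terms twice, once with String.compareTo (UTF-16 code unit order, which matches code point order for the characters used here) and once with a Collator. The class and method names are hypothetical and exist only for this illustration:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical demo class; not part of the attached patch. */
final class OrderMismatchSketch {

    /** Sorts a copy of the terms by String.compareTo, standing in for index order. */
    static List<String> sortedByCodeUnit(List<String> terms) {
        List<String> copy = new ArrayList<>(terms);
        Collections.sort(copy);
        return copy;
    }

    /** Sorts a copy of the terms with the collator for the given locale. */
    static List<String> sortedByCollation(List<String> terms, Locale locale) {
        List<String> copy = new ArrayList<>(terms);
        Collections.sort(copy, Collator.getInstance(locale));
        return copy;
    }
}
```

For the terms "a", "B", "c", code unit order puts "B" first, while the English collator puts it between "a" and "c". Terms that are adjacent under collation can therefore be scattered in the index, which is why a collated range cannot be answered by scanning one contiguous run of terms.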

One of the tests uses some of the Farsi information Esra supplied in the original post.  Note that neither Java 1.4.2 nor 1.5.0 contains collation information for Farsi.  Instead, the test uses the Arabic Locale, which appears to contain the proper letter ordering for the non-Arabic Farsi letters.

> RangeQuery and RangeFilter should use collation to check for range inclusion
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-1279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1279
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.1
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1279.patch
>
>
> See [this java-user discussion|http://www.nabble.com/lucene-farsi-problem-td16977096.html] of problems caused by Unicode code-point comparison, instead of collation, in RangeQuery.
> RangeQuery could take in a Locale via a setter, which could be used with a java.text.Collator and/or CollationKey's, to handle ranges for languages which have alphabet orderings different from those in Unicode.



[jira] Commented: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630766#action_12630766 ] 

Grant Ingersoll commented on LUCENE-1279:
-----------------------------------------

{quote}
Mostly this consisted of switching away from deprecated Hits in tests. 
{quote}

Seems like the new tests in TestRangeFilter still use Hits.

Also, from the Collator javadocs:
{quote}
When sorting a list of Strings however, it is generally necessary to compare each String multiple times. In this case, CollationKeys provide better performance. The CollationKey class converts a String to a series of bits that can be compared bitwise against other CollationKeys. A CollationKey is created by a Collator object for a given String. 
{quote}

I don't think we need to implement this now, but I wonder if there is a performance difference if we created the CollationKey for comparison.  The big question is whether the cost of constructing a CollationKey for each term outweighs the savings from the repeated comparisons against lower and upper.
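For reference, a CollationKey-based range check might look roughly like the sketch below. The keys for lower and upper are built once, and each candidate term costs one key construction plus at most two key comparisons. The class name and its shape are assumptions made for illustration, and the sketch says nothing about which approach is actually faster:

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

/** Hypothetical range checker for illustration; not taken from the patch. */
final class KeyRangeCheckSketch {
    private final Collator collator;
    private final CollationKey lowerKey;
    private final CollationKey upperKey;

    KeyRangeCheckSketch(Locale locale, String lower, String upper) {
        this.collator = Collator.getInstance(locale);
        // The bounds are converted once and reused for every term.
        this.lowerKey = collator.getCollationKey(lower);
        this.upperKey = collator.getCollationKey(upper);
    }

    /** True if lower <= term <= upper under the collator's ordering. */
    boolean contains(String term) {
        CollationKey key = collator.getCollationKey(term); // per-term construction cost
        return key.compareTo(lowerKey) >= 0 && key.compareTo(upperKey) <= 0;
    }
}
```

Whether this beats calling Collator.compare(term, lower) and Collator.compare(term, upper) directly would have to be measured, since each term is only compared twice.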

One more question, and it probably shows my lack of knowledge here, but would it be possible to enumerate the various codepoints where there are problems and just handle these separately, somehow?  Basically, how pervasive is the problem?  Would we perhaps be better off having a check to see if one of these bad codepoints falls in the range of lower/upper and then handle it separately?  Or, perhaps, some reasoning would allow us to better narrow in on the lowerTerm/upperTerm instead of having to check the whole field.  Just thinking out loud...

Otherwise, looks pretty good.

> RangeQuery and RangeFilter should use collation to check for range inclusion
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-1279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1279
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.1
>            Reporter: Steven Rowe
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1279.patch, LUCENE-1279.patch, LUCENE-1279.patch
>
>
> See [this java-user discussion|http://www.nabble.com/lucene-farsi-problem-td16977096.html] of problems caused by Unicode code-point comparison, instead of collation, in RangeQuery.
> RangeQuery could take in a Locale via a setter, which could be used with a java.text.Collator and/or CollationKey's, to handle ranges for languages which have alphabet orderings different from those in Unicode.
