
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/09/15 03:09:08 UTC

[GitHub] [lucene] LuXugang opened a new issue, #11773: Could `PointRangeQuery`'s boundary values used for `NumericComparator` to calculate `estimatedNumberOfMatches`

LuXugang opened a new issue, #11773:
URL: https://github.com/apache/lucene/issues/11773

   ### Description
   
   Since [LUCENE-9280](https://issues.apache.org/jira/browse/LUCENE-9280), a top-K search can rebuild the `DocIdSetIterator` to reduce the number of candidate docs.
   
   One condition for rebuilding the `DocIdSetIterator` is that it must reduce the number of docs by at least 8x. But when we do a top-K search with a `PointRangeQuery`, the `estimatedNumberOfMatches` includes some docs that are outside the query's boundaries. Could we take advantage of the range query's boundary values to make this condition much easier to meet?
   
   Since [LUCENE-10620](https://issues.apache.org/jira/browse/LUCENE-10620) we pass the `Weight` to the `Collector`, so it might be possible to do this optimization?
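
The idea above can be illustrated with a minimal, self-contained sketch. It does not use Lucene's API: the class `RebuildEstimateSketch` and all of its methods are hypothetical, an exact count over a sorted array stands in for the point-count estimate, and the 8x condition is assumed to mean "the estimate must be below one eighth of the query iterator's cost". Under those assumptions it compares an estimate that ignores the range query's bounds with one that is clipped to them, for an ascending sort on the same field that is filtered.

```java
public class RebuildEstimateSketch {

    /** Index of the first element >= key in a sorted array. */
    static int firstAtLeast(long[] sorted, long key) {
        int lo = 0, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Index of the first element > key in a sorted array. */
    static int firstGreater(long[] sorted, long key) {
        int lo = 0, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] <= key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Number of values v with lo <= v <= hi. */
    static long countInRange(long[] sorted, long lo, long hi) {
        if (lo > hi) return 0;
        return firstGreater(sorted, hi) - firstAtLeast(sorted, lo);
    }

    /** Assumed form of the 8x condition: rebuild only below cost / 8. */
    static boolean shouldRebuild(long estimatedMatches, long iteratorCost) {
        return estimatedMatches < (iteratorCost >>> 3);
    }

    /** Estimate that ignores the query: every value <= bottom is counted. */
    static long unclippedEstimate(long[] sorted, long bottom) {
        return countInRange(sorted, Long.MIN_VALUE, bottom);
    }

    /** Estimate clipped to the range query's [queryLower, queryUpper]. */
    static long clippedEstimate(long[] sorted, long queryLower, long queryUpper, long bottom) {
        return countInRange(sorted, queryLower, Math.min(queryUpper, bottom));
    }

    /** Sorted values: outOfRange docs with value 0, then inRange docs with values 1..inRange. */
    static long[] buildIndex(int outOfRange, int inRange) {
        long[] values = new long[outOfRange + inRange]; // leading entries default to 0
        for (int i = 0; i < inRange; i++) values[outOfRange + i] = 1 + i;
        return values;
    }

    public static void main(String[] args) {
        int inRange = 80_003;              // docs matching the range query [1, inRange]
        long bottom = 1_000;               // hypothetical worst competitive value so far
        for (int outOfRange : new int[] {1, 1_000, 20_000}) {
            long[] values = buildIndex(outOfRange, inRange);
            long unclipped = unclippedEstimate(values, bottom);
            long clipped = clippedEstimate(values, 1, inRange, bottom);
            System.out.printf("outOfRange=%d unclipped=%d (rebuild=%b) clipped=%d (rebuild=%b)%n",
                outOfRange, unclipped, shouldRebuild(unclipped, inRange),
                clipped, shouldRebuild(clipped, inRange));
        }
    }
}
```

In this toy setup the clipped estimate does not depend on how many documents lie below the query's lower bound, while the unclipped estimate grows with them and can eventually fail the 8x check. Whether Lucene's estimate behaves the same way depends on its actual implementation.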


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on issue #11773: Could `PointRangeQuery`'s boundary values used for `NumericComparator` to calculate `estimatedNumberOfMatches`

Posted by GitBox <gi...@apache.org>.

jpountz commented on issue #11773:
URL: https://github.com/apache/lucene/issues/11773#issuecomment-1251190154

   The `estimatedNumberOfMatches` should still be very close to the actual number, so I'm not expecting that a more precise value would change when we rebuild the `DocIdSet` of top-k candidates, would it?



[GitHub] [lucene] jpountz commented on issue #11773: Could `PointRangeQuery`'s boundary values used for `NumericComparator` to calculate `estimatedNumberOfMatches`

Posted by GitBox <gi...@apache.org>.

jpountz commented on issue #11773:
URL: https://github.com/apache/lucene/issues/11773#issuecomment-1254622883

   Thanks, I had not fully understood that you were after the case where both the filter and the sort are on the same field. You are right that the collector could do better by being aware of the query. I suspect that the main challenge with this optimization is going to be implementing it in a clean way. If you have ideas for how we could do this, I'd be happy to take a look.



[GitHub] [lucene] LuXugang commented on issue #11773: Could `PointRangeQuery`'s boundary values used for `NumericComparator` to calculate `estimatedNumberOfMatches`

Posted by GitBox <gi...@apache.org>.

LuXugang commented on issue #11773:
URL: https://github.com/apache/lucene/issues/11773#issuecomment-1253245181

   > The estimatedNumberOfMatches should still be very close to the actual number
   
   Actually, `estimatedNumberOfMatches` may be far from the actual number.
   
   I wrote a [test](https://github.com/LuXugang/Lucene-7.5.0/blob/master/LuceneDemo9.2.0/src/main/java/NumericDocValuesTopNOptimization2.java) showing that documents outside the query boundary participate in the calculation of `estimatedNumberOfMatches`, which is not what we would expect.
   
   In that [test](https://github.com/LuXugang/Lucene-7.5.0/blob/master/LuceneDemo9.2.0/src/main/java/NumericDocValuesTopNOptimization2.java), `80003` of the indexed documents match the `PointRangeQuery`, and `TopFieldCollector` collects a different number of docs depending on how many documents are outside the query boundary.
   
   
   
   
   Number of documents outside the query boundary | Number of hits in the collector
   -- | --
   1 | 1001
   1000 | 1001
   10000 | 1001
   20000 | 80003
   100000 | 80003
   10000+ | 80003
   
   

