Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/09/20 00:08:31 UTC

[GitHub] [lucene] patelprateek opened a new issue, #11791: cardinality estimation for query filters

patelprateek opened a new issue, #11791:
URL: https://github.com/apache/lucene/issues/11791

   ### Description
   
   For large-scale data, query filters can take a long time to execute and can return large result sets, on the order of millions of documents. Is there any functionality for getting a quick approximate estimate of how many documents a query filter will match, which could be used to decide whether to run the query at all?
   If not, I would like to hear any recommendations or ideas on how we could implement that functionality.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] patelprateek commented on issue #11791: cardinality estimation for query filters

Posted by GitBox <gi...@apache.org>.
patelprateek commented on issue #11791:
URL: https://github.com/apache/lucene/issues/11791#issuecomment-1252641368

   Thanks, that makes sense. I will do some benchmarking on my end as well to get a better understanding of whether it fits our SLA requirements.
   Can you give me some more details on the KD tree? IIRC, KD trees are used for n-dimensional data points such as geospatial data, right?


[GitHub] [lucene] jpountz commented on issue #11791: cardinality estimation for query filters

Posted by GitBox <gi...@apache.org>.
jpountz commented on issue #11791:
URL: https://github.com/apache/lucene/issues/11791#issuecomment-1252033446

   The cheapest way to get an estimate of the "cost" of a filter is to call `Weight#scorerSupplier` to retrieve a `ScorerSupplier`, and then `ScorerSupplier#cost` to get an estimate of the number of matches of this scorer on the given segment.
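
A minimal sketch of the approach described above, assuming Lucene 9.x is on the classpath. The class name `FilterCostEstimate`, the helpers `estimateCost` and `demo`, and the `color` field of the demo index are hypothetical names chosen for illustration; only the `IndexSearcher`, `Weight`, and `ScorerSupplier` calls come from Lucene itself. Since the cost is reported per segment, the sketch sums it over all segments of the reader.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.ScorerSupplier;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.Weight;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class FilterCostEstimate {

  /** Sums ScorerSupplier#cost over all segments; no matching documents are iterated. */
  static long estimateCost(IndexSearcher searcher, Query query) throws IOException {
    // Rewrite to primitive queries, then build a Weight that does not need scores.
    Query rewritten = searcher.rewrite(query);
    Weight weight = searcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f);
    long total = 0;
    for (LeafReaderContext leaf : searcher.getIndexReader().leaves()) {
      ScorerSupplier supplier = weight.scorerSupplier(leaf);
      if (supplier != null) { // null: no document in this segment can match
        total += supplier.cost();
      }
    }
    return total;
  }

  /** Builds a tiny in-memory index (hypothetical data) and estimates one term filter. */
  static long demo() {
    try (Directory dir = new ByteBuffersDirectory();
         IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      for (String color : new String[] {"red", "blue", "red"}) {
        Document doc = new Document();
        doc.add(new StringField("color", color, Field.Store.NO));
        writer.addDocument(doc);
      }
      writer.commit();
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        Query filter = new TermQuery(new Term("color", "red"));
        return estimateCost(new IndexSearcher(reader), filter);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("estimated matches: " + demo());
  }
}
```

A caller could compare the returned value against a threshold of its own choosing before deciding whether to execute the query; the value is an estimate, not an exact count.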


[GitHub] [lucene] patelprateek commented on issue #11791: cardinality estimation for query filters

Posted by GitBox <gi...@apache.org>.
patelprateek commented on issue #11791:
URL: https://github.com/apache/lucene/issues/11791#issuecomment-1252655054

   @jpountz: I do have some questions regarding estimation accuracy. What I have usually seen is sketch data structures being used, which have different error bounds depending on the cardinality of the posting lists, and precision then degrades further when set operations (queries/filters) are applied on top. Can you give me some pointers on whether this estimation procedure is also probabilistic with varying error bounds, or whether it is more accurate (assuming we use non-probabilistic structures like roaring bitmaps)?


[GitHub] [lucene] jpountz commented on issue #11791: cardinality estimation for query filters

Posted by GitBox <gi...@apache.org>.
jpountz commented on issue #11791:
URL: https://github.com/apache/lucene/issues/11791#issuecomment-1252597988

   Yes, Lucene makes no distinction between queries and filters.
   
   It's cheap in the sense that it only performs terms dictionary lookups and quick checks in KD tree indexes, but doesn't do more expensive operations like iterating postings or checking KD tree leaves.


[GitHub] [lucene] jpountz commented on issue #11791: cardinality estimation for query filters

Posted by GitBox <gi...@apache.org>.
jpountz commented on issue #11791:
URL: https://github.com/apache/lucene/issues/11791#issuecomment-1252680394

   > Can you give me some more details on the KD tree? IIRC, KD trees are used for n-dimensional data points such as geospatial data, right?
   
   This is correct; we also use them for the 1D case, i.e. numeric data.
   
   Cardinality estimation is very inaccurate. For instance, for conjunctions (AND), Lucene takes the minimum cost across clauses, and for disjunctions (OR), Lucene takes the sum. The only goal it serves in Lucene is figuring out a sensible order in which to evaluate the various clauses of a query. However, it's always cheap.
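
The combination rule described above can be illustrated with a small self-contained sketch. It does not use any Lucene classes: the class `CostCombination`, its two methods, and the per-clause costs are hypothetical and only mirror the min/sum rule stated in the comment.

```java
import java.util.List;

public class CostCombination {

  /** Conjunction (AND): the estimate is the smallest clause cost. */
  static long conjunctionCost(List<Long> clauseCosts) {
    return clauseCosts.stream().mapToLong(Long::longValue).min().orElse(0L);
  }

  /** Disjunction (OR): the estimate is the sum of the clause costs. */
  static long disjunctionCost(List<Long> clauseCosts) {
    return clauseCosts.stream().mapToLong(Long::longValue).sum();
  }

  public static void main(String[] args) {
    // Hypothetical per-clause costs for a two-clause query.
    List<Long> costs = List.of(1_000_000L, 50_000L);
    System.out.println("AND estimate: " + conjunctionCost(costs));
    System.out.println("OR estimate: " + disjunctionCost(costs));
  }
}
```

With these numbers the AND estimate equals the cost of the cheaper clause even if the two clauses share no documents at all, and the OR estimate counts any document matched by both clauses twice. That is one way to see why the figure is better suited to ordering clauses than to estimating cardinality.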


[GitHub] [lucene] jpountz closed issue #11791: cardinality estimation for query filters

Posted by GitBox <gi...@apache.org>.
jpountz closed issue #11791: cardinality estimation for query filters
URL: https://github.com/apache/lucene/issues/11791


[GitHub] [lucene] patelprateek commented on issue #11791: cardinality estimation for query filters

Posted by GitBox <gi...@apache.org>.
patelprateek commented on issue #11791:
URL: https://github.com/apache/lucene/issues/11791#issuecomment-1252575078

   @jpountz: will this work for any query and not just filters?
   Since I am new to Lucene, can you please elaborate a bit on how much cheaper this would be relative to actually running the query or filter on the data? Also, what is the accuracy of this estimation? Any documentation or pointers on this estimation strategy would be helpful.

