Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2010/04/03 11:47:27 UTC
[jira] Commented: (LUCENE-2362) Add support for slow filters with batch processing
[ https://issues.apache.org/jira/browse/LUCENE-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853111#action_12853111 ]
Michael McCandless commented on LUCENE-2362:
--------------------------------------------
I think in general Lucene should do a better job managing whether the filter is cheap or expensive, random access or not (LUCENE-1536), and tune the matching/scoring appropriately.
But one issue with this patch: how is scoring done? It looks like the first pass gathers a bit set, the batch filter is then applied to it, and a second pass iterates again to collect the docs. But that second pass won't, in general, have enough information to do scoring?
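To make the scoring concern concrete, here is a minimal sketch in plain Java (no Lucene classes) of one possible way around it: buffer each candidate's score during the first pass, so the second pass can emit it without a scorer. The names ScoreBufferSketch, Hit and search are hypothetical, and the batch filter is modelled as a UnaryOperator<BitSet>; none of this is taken from the attached patch. Buffering costs memory proportional to the number of candidates.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.UnaryOperator;

final class ScoreBufferSketch {
    /** Hypothetical holder: a doc id plus the score computed in the first pass. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /**
     * docs/scores stand in for what a scorer would produce while iterating.
     * batchFilter is an assumed stand-in for the expensive filter: it receives
     * all candidates at once and returns the accepted subset.
     */
    static List<Hit> search(int[] docs, float[] scores, UnaryOperator<BitSet> batchFilter) {
        // Pass 1: remember every candidate together with its score.
        List<Hit> buffered = new ArrayList<>();
        BitSet candidates = new BitSet();
        for (int i = 0; i < docs.length; i++) {
            buffered.add(new Hit(docs[i], scores[i]));
            candidates.set(docs[i]);
        }

        // One batch call instead of one filter check per document.
        BitSet accepted = batchFilter.apply(candidates);

        // Pass 2: collect accepted docs; the score is already available.
        List<Hit> result = new ArrayList<>();
        for (Hit h : buffered) {
            if (accepted.get(h.doc)) {
                result.add(h);
            }
        }
        return result;
    }
}
```

Without the buffered scores, the second loop would only know which doc ids survived, which is the gap the question above points at.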
> Add support for slow filters with batch processing
> --------------------------------------------------
>
> Key: LUCENE-2362
> URL: https://issues.apache.org/jira/browse/LUCENE-2362
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 3.0.1
> Reporter: Sergey Vladimirov
> Attachments: BatchFilter.java, IndexSearcherImpl.java
>
>
> The internal implementation of IndexSearcher assumes that the Filter and the scorer have roughly equal performance. But in our environment we have a Filter implementation that is very expensive (compared to the scorer).
> If we have, let's say, 2k termdocs selected by the scorer (each ~250 docs) and 2k selected by the filter, then 250k docs will be checked (and filtered out) quickly by the scorer, and 250k docs will be checked slowly by our filter.
> A straightforward implementation pushes search past the 60 seconds per query boundary, because each next() or advance() requires N queries to the database PER CHECKED DOC. Using a read-ahead technique allows us to optimize it to 35 seconds per query. Still too slow.
> The solution to the problem is to first select all documents with the scorer and then filter them in a batch with our filter. An example implementation (with BitSet) is in the attachment. Currently it takes only ~300 milliseconds per query.
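The approach described in the issue can be sketched roughly as follows. This is a minimal illustration in plain Java, not the attached BatchFilter.java: the BatchFilter interface shown here, the search method, and the even-doc-id filter are all assumptions made for the example. The point is only that the expensive filter is called once with the full candidate set rather than once per document.

```java
import java.util.BitSet;

final class BatchFilterSketch {
    /** Hypothetical interface: checks many candidates in a single call. */
    interface BatchFilter {
        BitSet accept(BitSet candidates);
    }

    /** scorerDocs stands in for the doc ids a scorer would produce. */
    static BitSet search(int[] scorerDocs, BatchFilter filter) {
        // Step 1: gather everything the (cheap) scorer matches.
        BitSet candidates = new BitSet();
        for (int doc : scorerDocs) {
            candidates.set(doc);
        }

        // Step 2: a single batch call to the (expensive) filter.
        BitSet accepted = filter.accept(candidates);

        // Step 3: keep only docs that are both candidates and accepted.
        BitSet result = (BitSet) candidates.clone();
        result.and(accepted);
        return result;
    }

    public static void main(String[] args) {
        // Toy filter for illustration: accepts even doc ids only.
        BatchFilter evenOnly = candidates -> {
            BitSet out = new BitSet();
            for (int d = candidates.nextSetBit(0); d >= 0; d = candidates.nextSetBit(d + 1)) {
                if (d % 2 == 0) {
                    out.set(d);
                }
            }
            return out;
        };
        System.out.println(search(new int[] {1, 2, 4, 7, 10}, evenOnly)); // prints {2, 4, 10}
    }
}
```

A real filter backed by a database could use the single accept call to issue one bulk query for all candidates, which is where the savings over per-document next()/advance() checks would come from.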