Posted to java-user@lucene.apache.org by Andreas Brandl <ml...@3.141592654.de> on 2013/11/25 21:02:48 UTC

expensive post filtering of a query's result

Hi,

I have a Query that is fast and cheap to answer, compared to a Filter implementation that is quite expensive (* see below for background).

I was under the impression that when combining Query and Filter, lucene is able to calculate matches based on the query and *for these matches* applies the Filter. Actually, the Filter touches every single document in the index.

My question is: how do I apply an efficient 'post-filtering' step to a query's result without touching every document again?

I have tried like so:

<snip>
Filter expensiveFilter = ...; // my filter implementation
Query booleanQuery = ...;     // a BooleanQuery that is comparably cheap to answer
Query query = new FilteredQuery(booleanQuery, expensiveFilter, FilteredQuery.QUERY_FIRST_FILTER_STRATEGY);
isearcher.search(query, null, Integer.MAX_VALUE);
</snip>

I have also verified that the FilteredQuery.QUERY_FIRST_FILTER_STRATEGY strategy is actually used (i.e. no fallback to LEAP_FROG), as mentioned in the docs:

  /**
   * A filter strategy that advances the Query or rather its {@link Scorer} first and consults the
   * filter {@link DocIdSet} for each matched document.
   * <p>
   * Note: this strategy requires a {@link DocIdSet#bits()} to return a non-null value. Otherwise
   * this strategy falls back to {@link FilteredQuery#LEAP_FROG_QUERY_FIRST_STRATEGY}
   * </p>
   * <p>
   * Use this strategy if the filter computation is more expensive than document
   * scoring or if the filter has a linear running time to compute the next
   * matching doc like exact geo distances.
   * </p>
   */

To me, it reads as though this strategy fits my use case exactly, but there is clearly something I'm missing here, so any help/comments are much appreciated.

Please see my Filter implementation attached, if that is of interest (is something terribly wrong there?). I'm not getting any acceptDocs; acceptDocs is always null (which I thought was how the query result gets propagated to the Filter).

* The background is:

I'm implementing a trigram index with Lucene for regex search, based on [1], which I'm going to evaluate against other regex search solutions (including Lucene's AutomatonQuery/RegexpQuery).

The Lucene part boils down to a BooleanQuery (containing trigrams, e.g. "OR(AND(hel,ell,llo), AND(wor,orl,rld))") which produces a candidate set, i.e. a superset of all actually matching documents. The last step is to verify each candidate, i.e. really match the document's content against the regex pattern. Obviously the goal is to reduce the candidate set as much as possible (via the BooleanQuery) and do the more expensive regex matching on as few documents as possible.
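For illustration, the trigram decomposition described above can be sketched in plain Java (class and method names here are illustrative, not from [1] or from Lucene): each literal alternative is broken into its overlapping trigrams, the trigrams of one alternative are AND-ed, and the alternatives are OR-ed, yielding a query like OR(AND(hel,ell,llo), AND(wor,orl,rld)) for "hello|world".

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch only: decompose a literal string into its
// overlapping trigrams, preserving order of first occurrence.
public final class Trigrams {
    public static Set<String> trigrams(String s) {
        Set<String> result = new LinkedHashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            result.add(s.substring(i, i + 3));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(trigrams("hello")); // [hel, ell, llo]
        System.out.println(trigrams("world")); // [wor, orl, rld]
    }
}
```

Each of these sets would then become a BooleanQuery of MUST term clauses over the trigram field, with the per-alternative queries combined as SHOULD clauses.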

I'm on lucene 4.6.

Thank you,

Regards
Andreas

[1] http://swtch.com/~rsc/regexp/regexp4.html

Re: expensive post filtering of a query's result

Posted by Andreas Brandl <ml...@3.141592654.de>.
Uwe,

> Lucene Filters are always executed up front on the full index. This
> is done inside getDocIdSet(), which is similar to scorer() in
> Queries. Most filters return a bitset from this method, so they
> calculate the whole bitset on the full index; this is what your
> filter is doing. The strategy only applies to the consuming of the
> DocIdSetIterator and how it is interleaved with the query's scorer
> (which is also a DocIdSetIterator). Bitset-based filters are already
> built, so there is no real speed difference. For leap-frog there is
> room to improve, as that approach advances both iterators until they
> stop on the same document. The only chance to do some type of
> "post-filtering" is to use a leap-frog approach and implement
> DocIdSetIterator.advance() in an efficient way. For
> FixedBitSet/OpenBitSet there is nothing you can do, as the matching
> bits were calculated before. FYI, when calling getDocIdSet(), Lucene
> only provides acceptDocs if there are deleted documents in the index
> or another filter was executed before.

thank you for the insights! I was clearly heading down the wrong road.

> To do post-filtering (e.g., Solr allows this), you have to do it in
> the result Collector implementation. The collector is used for
> result collection and gets every document id that matches the query
> + filters. In the collector's collect(int docId) you can do the
> expensive post-filtering, as collect() is only called for matching
> documents. A way to do this is to write a wrapper Collector around
> the final collector, delegating only those collect() calls that
> match your post-filter.

Thanks again, that worked very well.

Can you recommend a good resource that covers this level of detail of Lucene? I know of 'Lucene in Action', but I'm afraid it's a bit outdated by now, isn't it?

Regards,
Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: expensive post filtering of a query's result

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

Lucene Filters are always executed up front on the full index. This is done inside getDocIdSet(), which is similar to scorer() in Queries. Most filters return a bitset from this method, so they calculate the whole bitset on the full index; this is what your filter is doing. The strategy only applies to the consuming of the DocIdSetIterator and how it is interleaved with the query's scorer (which is also a DocIdSetIterator). Bitset-based filters are already built, so there is no real speed difference.

For leap-frog there is room to improve, as that approach advances both iterators until they stop on the same document. The only chance to do some type of "post-filtering" is to use a leap-frog approach and implement DocIdSetIterator.advance() in an efficient way. For FixedBitSet/OpenBitSet there is nothing you can do, as the matching bits were calculated before. FYI, when calling getDocIdSet(), Lucene only provides acceptDocs if there are deleted documents in the index or another filter was executed before.
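The point about bitset-backed filters can be sketched with a toy model (plain java.util.BitSet; the names below are illustrative, not Lucene API): all of the expensive work happens when the bitset is materialized in the getDocIdSet() stand-in, before any filter strategy is consulted, so query-first consultation of bits() cannot skip it.

```java
import java.util.BitSet;

// Toy illustration: a bitset-backed filter does its expensive work up
// front for every document; the strategy only changes how the finished
// bits are consumed afterwards.
public class BitsetFilterSketch {
    static int expensiveChecks = 0; // counts how often the costly test runs

    // Stand-in for Filter.getDocIdSet(): evaluates the costly predicate
    // on every doc id and materializes the answers as a bitset.
    static BitSet getDocIdSet(int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            expensiveChecks++;               // runs regardless of the query
            if (doc % 2 == 0) bits.set(doc); // pretend even docs match
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet bits = getDocIdSet(100);
        // "Query-first" consumption: only query matches consult the bits,
        // but by now the expensive work has already been done for all 100.
        int hits = 0;
        for (int doc : new int[] {4, 17, 42}) {
            if (bits.get(doc)) hits++;       // cheap lookup
        }
        System.out.println("expensive checks: " + expensiveChecks); // 100, not 3
        System.out.println("hits: " + hits);                        // 2
    }
}
```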

To do post-filtering (e.g., Solr allows this), you have to do it in the result Collector implementation. The collector is used for result collection and gets every document id that matches the query + filters. In the collector's collect(int docId) you can do the expensive post-filtering, as collect() is only called for matching documents. A way to do this is to write a wrapper Collector around the final collector, delegating only those collect() calls that match your post-filter.
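A minimal sketch of this wrapping idea, using a simplified stand-in interface rather than Lucene's actual Collector class (the real 4.6 API also has setScorer, setNextReader, and acceptsDocsOutOfOrder, omitted here):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Simplified stand-in for Lucene's Collector; only collect() is modeled.
interface SimpleCollector {
    void collect(int docId);
}

// Wraps a delegate collector and forwards only the doc ids that pass an
// expensive post-filter predicate. Because collect() is called only for
// documents matching the query, the predicate runs on the candidate set,
// not on the whole index.
class PostFilteringCollector implements SimpleCollector {
    private final SimpleCollector delegate;
    private final IntPredicate expensiveCheck;

    PostFilteringCollector(SimpleCollector delegate, IntPredicate expensiveCheck) {
        this.delegate = delegate;
        this.expensiveCheck = expensiveCheck;
    }

    @Override
    public void collect(int docId) {
        if (expensiveCheck.test(docId)) { // e.g. full regex match of the doc body
            delegate.collect(docId);
        }
    }
}

public class PostFilterDemo {
    public static void main(String[] args) {
        List<Integer> hits = new ArrayList<>();
        // Pretend docs 4 and 42 survive the expensive verification:
        SimpleCollector wrapped =
            new PostFilteringCollector(hits::add, doc -> doc == 4 || doc == 42);
        for (int candidate : new int[] {4, 17, 42}) {
            wrapped.collect(candidate); // candidates from the cheap BooleanQuery
        }
        System.out.println(hits); // [4, 42]
    }
}
```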

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



