Posted to java-user@lucene.apache.org by James Clarke <jc...@basistech.com> on 2013/10/10 20:01:39 UTC

Optimizing Filters

Are there any best practices for constructing Filters to search efficiently?
From my non-exhaustive experiments I cannot intuit how to construct my filters
to achieve best performance.

I have an index (Lucene 4.3) of about 1.8M documents which contain a field
acting as a flag (evidence:true). Initially all the documents I am interested in
searching have this field. Later as the index grows some documents will not have
this field.

In the simplest case I want to filter on documents with evidence:true. I ran a
couple of hundred queries sequentially and recorded how long they took to
complete:

 * No filter: ~40s
 * QueryWrapperFilter(TermQuery(evidence:true)): ~80s
 * FieldValueFilter(evidence): ~43s
 * TermsFilter(evidence:true): ~50s

This suggests QueryWrapperFilter (QWF) is a bad idea.
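
The timing procedure described above can be sketched roughly as follows. This
is a plain-Java harness, not the code used for the numbers in this message:
the class name FilterTiming, the Consumer standing in for a search call, and
the warm-up and median steps are all assumptions added for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class FilterTiming {
    /** Runs every query once, in order, and returns the total elapsed nanoseconds. */
    static <Q> long timeSequential(List<Q> queries, Consumer<Q> search) {
        long start = System.nanoTime();
        for (Q q : queries) {
            search.accept(q);
        }
        return System.nanoTime() - start;
    }

    /** Discards some warm-up passes, then returns the median of the timed passes. */
    static <Q> long medianNanos(List<Q> queries, Consumer<Q> search, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            timeSequential(queries, search);
        }
        long[] times = new long[runs];
        for (int i = 0; i < runs; i++) {
            times[i] = timeSequential(queries, search);
        }
        Arrays.sort(times);
        return times[runs / 2];
    }

    public static void main(String[] args) {
        // Hypothetical workload: in a real test the Consumer would execute
        // one search with the filter variant being measured.
        List<String> queries = Arrays.asList("q1", "q2", "q3");
        long nanos = medianNanos(queries, q -> q.hashCode(), 2, 5);
        System.out.println("median ms: " + nanos / 1_000_000.0);
    }
}
```

Running each filter variant through the same harness, with the same query
list, keeps the comparison like for like.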

A more complex filter is:

  (evidence:true AND (cid:x OR cid:y ...) AND language:eng)

Here 1.8M documents match evidence:true, each cid clause matches 2-4 documents,
there are 1-60 cid clauses, and 1.4M documents match language:eng.
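
One way to reason about these cardinalities is to model each clause as a
java.util.BitSet over document IDs. The OR of the cid clauses matches at most
60 x 4 = 240 documents, so a conjunction driven from that sparse set only has
to test a few hundred document IDs against the two large sets. The sketch
below is only a model of that idea; it does not describe how QueryWrapperFilter
or BooleanFilter are implemented, and the class and method names are
hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

class SparseFirstConjunction {
    /** Union of the (small) per-cid document sets. */
    static BitSet union(List<BitSet> sets) {
        BitSet out = new BitSet();
        for (BitSet s : sets) {
            out.or(s);
        }
        return out;
    }

    /** Walks only the sparse set and keeps IDs that are present in every dense set. */
    static BitSet intersectSparseFirst(BitSet sparse, BitSet... dense) {
        BitSet out = new BitSet();
        for (int doc = sparse.nextSetBit(0); doc >= 0; doc = sparse.nextSetBit(doc + 1)) {
            boolean inAll = true;
            for (BitSet d : dense) {
                if (!d.get(doc)) {
                    inAll = false;
                    break;
                }
            }
            if (inAll) {
                out.set(doc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int maxDoc = 1_800_000;
        BitSet evidence = new BitSet(maxDoc);
        evidence.set(0, maxDoc);            // every document has evidence:true
        BitSet english = new BitSet(maxDoc);
        english.set(0, 1_400_000);          // first 1.4M documents are language:eng

        BitSet cidX = new BitSet();
        cidX.set(10);
        cidX.set(1_500_000);                // not English, so filtered out
        BitSet cidY = new BitSet();
        cidY.set(42);
        cidY.set(99);
        cidY.set(1_700_000);                // not English, so filtered out

        BitSet hits = intersectSparseFirst(union(Arrays.asList(cidX, cidY)), evidence, english);
        System.out.println(hits);           // prints {10, 42, 99}
    }
}
```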

Our initial implementation uses a QueryWrapperFilter around a
BooleanQuery(TermQuery AND BooleanQuery(OR) AND TermQuery), which takes ~210s.

Adjusting this to be a BooleanFilter(TermsFilter AND TermsFilter AND
TermsFilter) sees things slow down to ~239s!

Any advice on optimizing these filters would be appreciated!

James


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Optimizing Filters

Posted by Ian Lea <ia...@gmail.com>.
Yes, I think you should have a play. But on an index that is as
realistic as you can make it - there may be variations in performance
of the different queries and filters depending on term frequencies and
loads of other stuff I don't understand.  General point being simply
that YMMV.


--
Ian.


On Wed, Oct 16, 2013 at 3:07 PM, James Clarke <jc...@basistech.com> wrote:
> Filters are created programmatically per request (and customized for the
> request) thus in order to benefit from CachingWrapperFilter we require a
> mechanism for looking up CachingWrapperFilters based on the request. But this is
> certainly an area worth trying (we could probably reuse each filter 10 times,
> because of the variation in requests and NRT search).
>
> I was hoping to improve query latency by reformulating the filters and
> queries. However, my intuition of the best practice for filter and query
> construction is lacking. For example, is it better to use a TermsFilter with a
> MatchAllDocsQuery, a BooleanQuery of TermQuery clauses, or a BooleanQuery of
> ConstantScoreQuery-wrapped TermQuery clauses?
>
> Maybe I should just hunker down and create a synthetic index and try many
> different combinations of filter/query construction.


Re: Optimizing Filters

Posted by James Clarke <jc...@basistech.com>.
Filters are created programmatically per request (and customized for the
request) thus in order to benefit from CachingWrapperFilter we require a
mechanism for looking up CachingWrapperFilters based on the request. But this is
certainly an area worth trying (we could probably reuse each filter 10 times,
because of the variation in requests and NRT search).
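
The lookup mechanism mentioned above could be as simple as a small bounded map
keyed by a canonical form of the request's filter parameters. The following is
a minimal plain-Java sketch under that assumption; FilterCache, its key format,
and the LRU eviction policy are all hypothetical, and the value type V stands
for whatever reusable filter object the application builds (for example a
CachingWrapperFilter).

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;
import java.util.function.Function;

class FilterCache<V> {
    private final Map<String, V> lru;

    FilterCache(final int maxEntries) {
        // An access-ordered LinkedHashMap evicts the least recently used entry.
        this.lru = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Canonical key: cid values are de-duplicated and sorted so order does not matter. */
    static String key(Collection<String> cids, String language) {
        return language + "|" + String.join(",", new TreeSet<String>(cids));
    }

    /** Returns the cached value for the key, building and storing it on a miss. */
    synchronized V get(String key, Function<String, V> builder) {
        V value = lru.get(key);
        if (value == null) {
            value = builder.apply(key);
            lru.put(key, value);
        }
        return value;
    }

    synchronized int size() {
        return lru.size();
    }

    public static void main(String[] args) {
        FilterCache<String> cache = new FilterCache<String>(100);
        String k = key(Arrays.asList("y", "x"), "eng");
        // The builder runs only on the first request for this key.
        System.out.println(cache.get(k, key -> "filter for " + key));
        System.out.println(cache.get(k, key -> "never built"));
    }
}
```

Whether a cache like this pays off depends on how often the same key recurs
before an entry is evicted or the index changes under NRT search.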

I was hoping to improve query latency by reformulating the filters and
queries. However, my intuition of the best practice for filter and query
construction is lacking. For example, is it better to use a TermsFilter with a
MatchAllDocsQuery, a BooleanQuery of TermQuery clauses, or a BooleanQuery of
ConstantScoreQuery-wrapped TermQuery clauses?

Maybe I should just hunker down and create a synthetic index and try many
different combinations of filter/query construction.

On Oct 11, 2013, at 7:33 AM, Ian Lea <ia...@gmail.com> wrote:

> Are you going to be caching and reusing the filters e.g. by
> CachingWrapperFilter?  The main benefit of filters is in reuse.  It
> takes time to build them in the first place, likely roughly equivalent
> to running the underlying query although with variations as you
> describe.  Or are you saying that querying with filters is slow?
> 
> 
> --
> Ian.


Re: Optimizing Filters

Posted by Ian Lea <ia...@gmail.com>.
Are you going to be caching and reusing the filters e.g. by
CachingWrapperFilter?  The main benefit of filters is in reuse.  It
takes time to build them in the first place, likely roughly equivalent
to running the underlying query although with variations as you
describe.  Or are you saying that querying with filters is slow?


--
Ian.

