Posted to dev@lucene.apache.org by Khash Sajadi <kh...@sajadi.co.uk> on 2010/10/24 00:18:48 UTC

Using filters to speed up queries

My index contains documents for different users. Each document has the user
id as a field on it.

There are about 500 different users with 3 million documents.

Currently I'm calling Search with the query (parsed from user)
and FieldCacheTermsFilter for the user id.
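Roughly like this (a minimal sketch, assuming Lucene 3.x, an open IndexSearcher
called "searcher", and illustrative field names "body" and "userId"):

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    Query userQuery = new QueryParser(Version.LUCENE_30, "body", analyzer)
        .parse(userInput);                                  // text typed by the user
    Filter userIdFilter = new FieldCacheTermsFilter("userId", userId);
    TopDocs hits = searcher.search(userQuery, userIdFilter, 20);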

It works but the performance is not great.

Ideally, I would like to perform the search only on the documents that are
relevant, this should make it much faster. However, it seems Search(Query,
Filter) runs the query first and then applies the filter.

Is there a way to improve this? (i.e. run the query only on a subset of
documents)

Thanks

Re: Using filters to speed up queries

Posted by Michael McCandless <lu...@mikemccandless.com>.
Unfortunately, Lucene's performance with filters isn't great.

This is because we now always apply filters "up high", using a
leapfrog approach, where we alternate asking the filter and then the
scorer to skip to each other's docID.

But if the filter accepts "enough" (~1% in my testing) of the
documents in the index, it's often better to apply the filter "down
low" like we do deleted docs (which really is its own filter), ie
where we quickly eliminate docs as we enumerate them from the
postings.
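In code, the "up high" leapfrog is roughly this (a simplified paraphrase of
what IndexSearcher.searchWithFilter does in 3.x, not the actual source):

    DocIdSetIterator filterIter = filter.getDocIdSet(reader).iterator();
    int filterDoc = filterIter.nextDoc();
    int scorerDoc = scorer.advance(filterDoc);
    collector.setScorer(scorer);
    while (true) {
      if (scorerDoc == filterDoc) {
        if (scorerDoc == DocIdSetIterator.NO_MORE_DOCS) {
          break;                                     // both exhausted
        }
        collector.collect(scorerDoc);                // filter and query agree: a hit
        filterDoc = filterIter.nextDoc();
        scorerDoc = scorer.advance(filterDoc);
      } else if (scorerDoc > filterDoc) {
        filterDoc = filterIter.advance(scorerDoc);   // filter catches up to the query
      } else {
        scorerDoc = scorer.advance(filterDoc);       // query catches up to the filter
      }
    }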

I did a blog post about this too:

  http://chbits.blogspot.com/2010/09/fast-search-filters-using-flex.html

That post shows some of the perf gains we could get by switching
filters to apply down low, though this was for a filter that randomly
accepts 50% of the index.  And this is using the flex APIs (for 4.0);
you may be able to do something similar using FilterIndexReader
pre-4.0.

Of course you shouldn't have to do such tricks --
https://issues.apache.org/jira/browse/LUCENE-1536 is open for Lucene
to do this itself when you pass a filter.

You should test, but, I suspect a MUST clause on an AND query may not
perform that much better in general for filters that accept a biggish
part of the index, since it's still using skipping, especially if your
query wasn't already a BooleanQuery.  For restrictive filters it
should be a decent gain, but those queries are already fast to begin
with.

Do you have some perf numbers to share?  What kind of queries are you
running with the filters?  Are there certain users that have a highish
percentage of the documents, with a long tail of the other users?  If so you
could consider making dedicated indices for those high doc count
users...

Also note that static index partitioning like this does not result in
the same scoring as you'd get if each user had their own index, since
the term stats (IDF) are aggregated across all users.  So for queries
with more than one term, users can see docs sorted differently, and
this is actually a known security risk in that users can glean some
details about the documents they aren't allowed to see due to the
shared term stats... there is a paper somewhere (Robert?) that delves
into it.

Mike

On Sat, Oct 23, 2010 at 6:18 PM, Khash Sajadi <kh...@sajadi.co.uk> wrote:
> My index contains documents for different users. Each document has the user
> id as a field on it.
> There are about 500 different users with 3 million documents.
> Currently I'm calling Search with the query (parsed from user)
> and FieldCacheTermsFilter for the user id.
> It works but the performance is not great.
> Ideally, I would like to perform the search only on the documents that are
> relevant, this should make it much faster. However, it seems Search(Query,
> Filter) runs the query first and then applies the filter.
> Is there a way to improve this? (i.e. run the query only on a subset of
> documents)
> Thanks



RE: Using filters to speed up queries

Posted by Uwe Schindler <uw...@thetaphi.de>.
Yes, it has some heuristics on which query "drives" the execution: the
query whose first hit has the larger docID is the driving one; the other
one then only gets seeked to.

 

With filters this is not the case (this may change in the future, when filters
also use ConjunctionScorer). In general, queries are now mostly faster than
filters; you only get improvements when you cache filters.

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: Khash Sajadi [mailto:khash@sajadi.co.uk] 
Sent: Sunday, October 24, 2010 12:52 AM
To: dev@lucene.apache.org
Subject: Re: Using filters to speed up queries

 

On the topic of BooleanQuery: would the order in which the queries are added
matter? Is it clever enough to skip the second query when the first one
returns nothing and is a MUST?

On 23 October 2010 23:47, Khash Sajadi <kh...@sajadi.co.uk> wrote:

Thanks. Will try it. Been thinking about separate indexes but have one
worry: memory and file handle issues.

 

I'm worried that in some scenarios I might end up with thousands of
IndexReaders/IndexWriters open in the process (it is Windows). How is that
going to play out with memory?

 

On 23 October 2010 23:44, Mark Harwood <ma...@yahoo.co.uk> wrote:

Look at BooleanQuery with 2 "must" clauses - one for the query, one for a
ConstantScoreQuery wrapping the filter.
BooleanQuery should then automatically use skips when reading matching
docs from the main query and skip to the next docs identified by the filter.
Give it a try, otherwise you may be looking at using separate indexes



On 23 Oct 2010, at 23:18, Khash Sajadi wrote:

> My index contains documents for different users. Each document has the
user id as a field on it.
>
> There are about 500 different users with 3 million documents.
>
> Currently I'm calling Search with the query (parsed from user) and
FieldCacheTermsFilter for the user id.
>
> It works but the performance is not great.
>
> Ideally, I would like to perform the search only on the documents that are
relevant, this should make it much faster. However, it seems Search(Query,
Filter) runs the query first and then applies the filter.
>
> Is there a way to improve this? (i.e. run the query only on a subset of
documents)
>
> Thanks




 

 


Re: Using filters to speed up queries

Posted by Khash Sajadi <kh...@sajadi.co.uk>.
On the topic of BooleanQuery: would the order in which the queries are added
matter? Is it clever enough to skip the second query when the first one
returns nothing and is a MUST?

On 23 October 2010 23:47, Khash Sajadi <kh...@sajadi.co.uk> wrote:

> Thanks. Will try it. Been thinking about separate indexes but have one
> worry: memory and file handle issues.
>
> I'm worried that in some scenarios I might end up with thousands of
> IndexReaders/IndexWriters open in the process (it is Windows). How is that
> going to play out with memory?
>
>
> On 23 October 2010 23:44, Mark Harwood <ma...@yahoo.co.uk> wrote:
>
>> Look at BooleanQuery with 2 "must" clauses - one for the query, one for a
>> ConstantScoreQuery wrapping the filter.
>> BooleanQuery should then automatically use skips when reading matching
>> docs from the main query and skip to the next docs identified by the filter.
>> Give it a try, otherwise you may be looking at using separate indexes
>>
>>
>> On 23 Oct 2010, at 23:18, Khash Sajadi wrote:
>>
>> > My index contains documents for different users. Each document has the
>> user id as a field on it.
>> >
>> > There are about 500 different users with 3 million documents.
>> >
>> > Currently I'm calling Search with the query (parsed from user) and
>> FieldCacheTermsFilter for the user id.
>> >
>> > It works but the performance is not great.
>> >
>> > Ideally, I would like to perform the search only on the documents that
>> are relevant, this should make it much faster. However, it seems
>> Search(Query, Filter) runs the query first and then applies the filter.
>> >
>> > Is there a way to improve this? (i.e. run the query only on a subset of
>> documents)
>> >
>> > Thanks
>>
>>
>>
>>
>

Re: Using filters to speed up queries

Posted by Khash Sajadi <kh...@sajadi.co.uk>.
Thanks. Will try it. Been thinking about separate indexes but have one
worry: memory and file handle issues.

I'm worried that in some scenarios I might end up with thousands of
IndexReaders/IndexWriters open in the process (it is Windows). How is that
going to play out with memory?

On 23 October 2010 23:44, Mark Harwood <ma...@yahoo.co.uk> wrote:

> Look at BooleanQuery with 2 "must" clauses - one for the query, one for a
> ConstantScoreQuery wrapping the filter.
> BooleanQuery should then automatically use skips when reading matching
> docs from the main query and skip to the next docs identified by the filter.
> Give it a try, otherwise you may be looking at using separate indexes
>
>
> On 23 Oct 2010, at 23:18, Khash Sajadi wrote:
>
> > My index contains documents for different users. Each document has the
> user id as a field on it.
> >
> > There are about 500 different users with 3 million documents.
> >
> > Currently I'm calling Search with the query (parsed from user) and
> FieldCacheTermsFilter for the user id.
> >
> > It works but the performance is not great.
> >
> > Ideally, I would like to perform the search only on the documents that
> are relevant, this should make it much faster. However, it seems
> Search(Query, Filter) runs the query first and then applies the filter.
> >
> > Is there a way to improve this? (i.e. run the query only on a subset of
> documents)
> >
> > Thanks
>
>
>
>

Re: Using filters to speed up queries

Posted by Mark Harwood <ma...@yahoo.co.uk>.
Look at BooleanQuery with 2 "must" clauses - one for the query, one for a ConstantScoreQuery wrapping the filter.
BooleanQuery should then automatically use skips when reading matching docs from the main query and skip to the next docs identified by the filter.
Give it a try; otherwise you may be looking at using separate indexes.
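Something like this (a sketch; the "userId" field and the variable names are
just placeholders):

    BooleanQuery bq = new BooleanQuery();
    bq.add(userQuery, BooleanClause.Occur.MUST);        // the parsed user query, scored as usual
    Filter userFilter = new FieldCacheTermsFilter("userId", userId);
    bq.add(new ConstantScoreQuery(userFilter),          // the filter as a scoreless MUST clause
           BooleanClause.Occur.MUST);
    TopDocs hits = searcher.search(bq, 20);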


On 23 Oct 2010, at 23:18, Khash Sajadi wrote:

> My index contains documents for different users. Each document has the user id as a field on it.
> 
> There are about 500 different users with 3 million documents.
> 
> Currently I'm calling Search with the query (parsed from user) and FieldCacheTermsFilter for the user id.
> 
> It works but the performance is not great.
> 
> Ideally, I would like to perform the search only on the documents that are relevant, this should make it much faster. However, it seems Search(Query, Filter) runs the query first and then applies the filter.
> 
> Is there a way to improve this? (i.e. run the query only on a subset of documents)
> 
> Thanks




Re: Using filters to speed up queries

Posted by Paul Elschot <pa...@xs4all.nl>.
Some more speed up may be possible when the same combination of
filters (user account and date range here) is reused for another query.
The combined filter can then be made as an OpenBitSetDISI
(in the util package) and kept around for reuse.
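For example (a sketch; assumes a single top-level IndexReader "reader", an
IndexSearcher "searcher" over it, and two existing Filters userFilter and
dateFilter; note that since 2.9 getDocIdSet is called per segment, so a bit
set cached like this is only valid against the reader it was built from):

    final OpenBitSetDISI combined = new OpenBitSetDISI(
        userFilter.getDocIdSet(reader).iterator(), reader.maxDoc());
    combined.inPlaceAnd(dateFilter.getDocIdSet(reader).iterator());

    // Keep "combined" around and reuse it for later queries via a trivial Filter:
    Filter cachedFilter = new Filter() {
      @Override
      public DocIdSet getDocIdSet(IndexReader r) {
        return combined;        // OpenBitSet extends DocIdSet
      }
    };
    TopDocs hits = searcher.search(someOtherQuery, cachedFilter, 20);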

Regards,
Paul Elschot

On Sunday 24 October 2010 12:34:07, Khash Sajadi wrote:
> Here is what I've found so far:
> 
> I have three main sets to use in a query:
> Account MUST be xxx
> User query
> DateRange on the query MUST be in (a,b) it is a NumericField
> 
> I tried the following combinations (all using a BooleanQuery with the user
> query added to it)
> 
> 1. One:
> - Add ACCOUNT as a TermQuery
> - Add DATE RANGE as Filter
> 
> 2. Two
> - Add ACCOUNT as a Filter
> - Add DATE RANGE as NumericRangeQuery
> 
> I tried caching the filters on both scenarios.
> I also tried both scenarios by passing the query as a ConstantScoreQuery as
> well.
> 
> I got the best result (about 4x faster) by using a cached filter for the
> DATE RANGE and leaving the ACCOUNT as a TermQuery.
> 
> I think I'm happy with this approach. However, the security risk Uwe
> mentioned when using ACCOUNT as a Query makes me nervous. Any suggestions?
> 
> As for document distribution, the ACCOUNTS have a similar distribution of
> documents.
> 
> Also, I still would like to try the multi index approach, but not sure about
> the memory, file handle burden of it (having potentially thousands of
readers/writers/searchers open at the same time). I use two processes, one as
> indexer and one for search with the same underlying FSDirectory. As for
> search, I use writer.getReader().reopen within a SearchManager as suggested
> by Lucene in Action.
> 
> 
> 
> 
> On 24 October 2010 10:27, Paul Elschot <pa...@xs4all.nl> wrote:
> 
> > On Sunday 24 October 2010 00:18:48, Khash Sajadi wrote:
> > > My index contains documents for different users. Each document has the
> > user
> > > id as a field on it.
> > >
> > > There are about 500 different users with 3 million documents.
> > >
> > > Currently I'm calling Search with the query (parsed from user)
> > > and FieldCacheTermsFilter for the user id.
> > >
> > > It works but the performance is not great.
> > >
> > > Ideally, I would like to perform the search only on the documents that
> > are
> > > relevant, this should make it much faster. However, it seems
> > Search(Query,
> > > Filter) runs the query first and then applies the filter.
> > >
> > > Is there a way to improve this? (i.e. run the query only on a subset of
> > > documents)
> > >
> > > Thanks
> > >
> >
> > When running the query with the filter, the query is run at the same time
> > as the filter. Initially and after each matching document, the filter is
> > assumed to
> > be cheaper to execute and its first or next matching document is
> > determined.
> > Then the query and the filter are repeatedly advanced to each other's next
> > matching
> > document until they are at the same document (ie. there is a match),
> > similar to
> > a boolean query with two required clauses.
> > The java code doing this is in the private method
> > IndexSearcher.searchWithFilter().
> >
> > It could be that filling the field cache is the performance problem.
> > How is the performance when this search call with the FieldCacheTermsFilter
> > is repeated?
> >
> > Also, for a single indexed term to be used as a filter (the user id in this
> > case)
> > there may be no need for a cache, a QueryWrapperFilter around the TermQuery
> > might suffice.
> >
> > Regards,
> > Paul Elschot
> >
> >
> >
> 



RE: Using filters to speed up queries

Posted by Uwe Schindler <uw...@thetaphi.de>.
The trick is to wrap the TermQuery in a ConstantScoreQuery(new
QueryWrapperFilter(new TermQuery(...))), because for filtering, the TermQuery
used instead of a filter should not contribute to the score. This pattern is
used quite often in Lucene (e.g. in MultiTermQuery), so don't worry about the
strange-looking code.
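In other words, roughly (illustrative field and variable names):

    TermQuery accountQuery = new TermQuery(new Term("account", accountId));
    Query accountClause =
        new ConstantScoreQuery(new QueryWrapperFilter(accountQuery));

    BooleanQuery bq = new BooleanQuery();
    bq.add(userQuery, BooleanClause.Occur.MUST);        // scored as usual
    bq.add(accountClause, BooleanClause.Occur.MUST);    // constant score: does not skew ranking
    TopDocs hits = searcher.search(bq, 20);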

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: Khash Sajadi [mailto:khash@sajadi.co.uk] 
Sent: Sunday, October 24, 2010 12:50 PM
To: dev@lucene.apache.org
Subject: Re: Using filters to speed up queries

 

Terribly sorry. I meant Mike:

 

> 

Also note that static index partitioning like this does not result in
the same scoring as you'd get if each user had their own index, since
the term stats (IDF) are aggregated across all users.  So for queries
with more than one term, users can see docs sorted differently, and
this is actually a known security risk in that users can glean some
details about the documents they aren't allowed to see due to the
shared term stats... there is a paper somewhere (Robert?) that delves
into it.

 

On 24 October 2010 11:46, Uwe Schindler <uw...@thetaphi.de> wrote:

Security risk? I did not say anything about that!

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: Khash Sajadi [mailto:khash@sajadi.co.uk] 
Sent: Sunday, October 24, 2010 12:34 PM


To: dev@lucene.apache.org
Subject: Re: Using filters to speed up queries

 

Here is what I've found so far:

I have three main sets to use in a query:

Account MUST be xxx

User query

DateRange on the query MUST be in (a,b) it is a NumericField

 

I tried the following combinations (all using a BooleanQuery with the user
query added to it)

 

1. One:

- Add ACCOUNT as a TermQuery

- Add DATE RANGE as Filter

 

2. Two 

- Add ACCOUNT as a Filter

- Add DATE RANGE as NumericRangeQuery

 

I tried caching the filters on both scenarios.

I also tried both scenarios by passing the query as a ConstantScoreQuery as
well.

 

I got the best result (about 4x faster) by using a cached filter for the
DATE RANGE and leaving the ACCOUNT as a TermQuery.

 

I think I'm happy with this approach. However, the security risk Uwe
mentioned when using ACCOUNT as a Query makes me nervous. Any suggestions?

 

As for document distribution, the ACCOUNTS have a similar distribution of
documents.

 

Also, I still would like to try the multi index approach, but not sure about
the memory, file handle burden of it (having potentially thousands of
readers/writers/searchers open at the same time). I use two processes, one as
indexer and one for search with the same underlying FSDirectory. As for
search, I use writer.getReader().reopen within a SearchManager as suggested
by Lucene in Action.

 

 

 

On 24 October 2010 10:27, Paul Elschot <pa...@xs4all.nl> wrote:

On Sunday 24 October 2010 00:18:48, Khash Sajadi wrote:

> My index contains documents for different users. Each document has the
user
> id as a field on it.
>
> There are about 500 different users with 3 million documents.
>
> Currently I'm calling Search with the query (parsed from user)
> and FieldCacheTermsFilter for the user id.
>
> It works but the performance is not great.
>
> Ideally, I would like to perform the search only on the documents that are
> relevant, this should make it much faster. However, it seems Search(Query,
> Filter) runs the query first and then applies the filter.
>
> Is there a way to improve this? (i.e. run the query only on a subset of
> documents)
>
> Thanks
>

When running the query with the filter, the query is run at the same time
as the filter. Initially and after each matching document, the filter is
assumed to
be cheaper to execute and its first or next matching document is determined.
Then the query and the filter are repeatedly advanced to each other's next
matching
document until they are at the same document (ie. there is a match), similar
to
a boolean query with two required clauses.
The java code doing this is in the private method
IndexSearcher.searchWithFilter().

It could be that filling the field cache is the performance problem.
How is the performance when this search call with the FieldCacheTermsFilter
is repeated?

Also, for a single indexed term to be used as a filter (the user id in this
case)
there may be no need for a cache, a QueryWrapperFilter around the TermQuery
might suffice.

Regards,
Paul Elschot



 

 


Re: Using filters to speed up queries

Posted by Khash Sajadi <kh...@sajadi.co.uk>.
Terribly sorry. I meant Mike:

>
Also note that static index partitioning like this does not result in
the same scoring as you'd get if each user had their own index, since
the term stats (IDF) are aggregated across all users.  So for queries
with more than one term, users can see docs sorted differently, and
this is actually a known security risk in that users can glean some
details about the documents they aren't allowed to see due to the
shared term stats... there is a paper somewhere (Robert?) that delves
into it.


On 24 October 2010 11:46, Uwe Schindler <uw...@thetaphi.de> wrote:

> Security risk? I did not say anything about that!
>
>
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: uwe@thetaphi.de
>
>
>
> *From:* Khash Sajadi [mailto:khash@sajadi.co.uk]
> *Sent:* Sunday, October 24, 2010 12:34 PM
>
> *To:* dev@lucene.apache.org
> *Subject:* Re: Using filters to speed up queries
>
>
>
> Here is what I've found so far:
>
> I have three main sets to use in a query:
>
> Account MUST be xxx
>
> User query
>
> DateRange on the query MUST be in (a,b) it is a NumericField
>
>
>
> I tried the following combinations (all using a BooleanQuery with the user
> query added to it)
>
>
>
> 1. One:
>
> - Add ACCOUNT as a TermQuery
>
> - Add DATE RANGE as Filter
>
>
>
> 2. Two
>
> - Add ACCOUNT as a Filter
>
> - Add DATE RANGE as NumericRangeQuery
>
>
>
> I tried caching the filters on both scenarios.
>
> I also tried both scenarios by passing the query as a ConstantScoreQuery as
> well.
>
>
>
> I got the best result (about 4x faster) by using a cached filter for the
> DATE RANGE and leaving the ACCOUNT as a TermQuery.
>
>
>
> I think I'm happy with this approach. However, the security risk Uwe
> mentioned when using ACCOUNT as a Query makes me nervous. Any suggestions?
>
>
>
> As for document distribution, the ACCOUNTS have a similar distribution of
> documents.
>
>
>
> Also, I still would like to try the multi index approach, but not sure
> about the memory, file handle burden of it (having potentially thousands of
> readers/writers/searchers open at the same time). I use two processes, one as
> indexer and one for search with the same underlying FSDirectory. As for
> search, I use writer.getReader().reopen within a SearchManager as suggested
> by Lucene in Action.
>
>
>
>
>
>
>
> On 24 October 2010 10:27, Paul Elschot <pa...@xs4all.nl> wrote:
>
> On Sunday 24 October 2010 00:18:48, Khash Sajadi wrote:
>
> > My index contains documents for different users. Each document has the
> user
> > id as a field on it.
> >
> > There are about 500 different users with 3 million documents.
> >
> > Currently I'm calling Search with the query (parsed from user)
> > and FieldCacheTermsFilter for the user id.
> >
> > It works but the performance is not great.
> >
> > Ideally, I would like to perform the search only on the documents that
> are
> > relevant, this should make it much faster. However, it seems
> Search(Query,
> > Filter) runs the query first and then applies the filter.
> >
> > Is there a way to improve this? (i.e. run the query only on a subset of
> > documents)
> >
> > Thanks
> >
>
> When running the query with the filter, the query is run at the same time
> as the filter. Initially and after each matching document, the filter is
> assumed to
> be cheaper to execute and its first or next matching document is
> determined.
> Then the query and the filter are repeatedly advanced to each other's next
> matching
> document until they are at the same document (ie. there is a match),
> similar to
> a boolean query with two required clauses.
> The java code doing this is in the private method
> IndexSearcher.searchWithFilter().
>
> It could be that filling the field cache is the performance problem.
> How is the performance when this search call with the FieldCacheTermsFilter
> is repeated?
>
> Also, for a single indexed term to be used as a filter (the user id in this
> case)
> there may be no need for a cache, a QueryWrapperFilter around the TermQuery
> might suffice.
>
> Regards,
> Paul Elschot
>
>
>
>
>

RE: Using filters to speed up queries

Posted by Uwe Schindler <uw...@thetaphi.de>.
Security risk? I did not say anything about that!

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: Khash Sajadi [mailto:khash@sajadi.co.uk] 
Sent: Sunday, October 24, 2010 12:34 PM
To: dev@lucene.apache.org
Subject: Re: Using filters to speed up queries

 

Here is what I've found so far:

I have three main sets to use in a query:

Account MUST be xxx

User query

DateRange on the query MUST be in (a,b) it is a NumericField

 

I tried the following combinations (all using a BooleanQuery with the user
query added to it)

 

1. One:

- Add ACCOUNT as a TermQuery

- Add DATE RANGE as Filter

 

2. Two 

- Add ACCOUNT as a Filter

- Add DATE RANGE as NumericRangeQuery

 

I tried caching the filters on both scenarios.

I also tried both scenarios by passing the query as a ConstantScoreQuery as
well.

 

I got the best result (about 4x faster) by using a cached filter for the
DATE RANGE and leaving the ACCOUNT as a TermQuery.

 

I think I'm happy with this approach. However, the security risk Uwe
mentioned when using ACCOUNT as a Query makes me nervous. Any suggestions?

 

As for document distribution, the ACCOUNTS have a similar distribution of
documents.

 

Also, I still would like to try the multi index approach, but not sure about
the memory, file handle burden of it (having potentially thousands of
readers/writers/searchers open at the same time). I use two processes, one as
indexer and one for search with the same underlying FSDirectory. As for
search, I use writer.getReader().reopen within a SearchManager as suggested
by Lucene in Action.

 

 

 

On 24 October 2010 10:27, Paul Elschot <pa...@xs4all.nl> wrote:

On Sunday 24 October 2010 00:18:48, Khash Sajadi wrote:

> My index contains documents for different users. Each document has the
user
> id as a field on it.
>
> There are about 500 different users with 3 million documents.
>
> Currently I'm calling Search with the query (parsed from user)
> and FieldCacheTermsFilter for the user id.
>
> It works but the performance is not great.
>
> Ideally, I would like to perform the search only on the documents that are
> relevant, this should make it much faster. However, it seems Search(Query,
> Filter) runs the query first and then applies the filter.
>
> Is there a way to improve this? (i.e. run the query only on a subset of
> documents)
>
> Thanks
>

When running the query with the filter, the query is run at the same time
as the filter. Initially and after each matching document, the filter is
assumed to
be cheaper to execute and its first or next matching document is determined.
Then the query and the filter are repeatedly advanced to each other's next
matching
document until they are at the same document (ie. there is a match), similar
to
a boolean query with two required clauses.
The java code doing this is in the private method
IndexSearcher.searchWithFilter().

It could be that filling the field cache is the performance problem.
How is the performance when this search call with the FieldCacheTermsFilter
is repeated?

Also, for a single indexed term to be used as a filter (the user id in this
case)
there may be no need for a cache, a QueryWrapperFilter around the TermQuery
might suffice.

Regards,
Paul Elschot



 


Re: Using filters to speed up queries

Posted by Khash Sajadi <kh...@sajadi.co.uk>.
Thanks everyone for your help.

At the end, I settled on using a ConstantScoreQuery for the ACCOUNT and a
cached filter for the date range. The performance on a 20 million document
index with 500 accounts is awesome!
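Roughly what it ends up looking like (a sketch; field names and the from/to
bounds are illustrative, and the CachingWrapperFilter is created once and
reused across searches so its bit sets get cached per reader):

    BooleanQuery q = new BooleanQuery();
    q.add(userQuery, BooleanClause.Occur.MUST);
    q.add(new ConstantScoreQuery(new QueryWrapperFilter(
        new TermQuery(new Term("account", accountId)))), BooleanClause.Occur.MUST);

    // built once, kept around, and passed to every search for this date range
    Filter dateFilter = new CachingWrapperFilter(
        NumericRangeFilter.newLongRange("date", from, to, true, true));

    TopDocs hits = searcher.search(q, dateFilter, 20);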



On 25 October 2010 11:28, Michael McCandless <lu...@mikemccandless.com> wrote:

> Here's the paper I was thinking of (Robert found this):
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.159.9682 ...
> eg note this sentence from the abstract:
>
>    We show that the first implementation, based on a postprocessing
> approach, allows an arbitrary user to obtain information about the
> content of files for which he does not have read permission.
>
> Note that one simple way to "gauge" performance of filtering-down-low
> would be to open an IndexReader, delete all documents except those
> matching your filter (eg the ACCOUNT filter), then run your searches
> against that IndexReader without the ACCOUNT clause.  If you don't
> close that reader then these deletes are never committed.  This is a
> simple way to compile in a filter to an open IR, but, you'd still then
> have one reader open per user class, so the risks of too many files,
> etc still stands.
>
> Hmm though you could open an initial reader, then clone it, then do
> all your deletes on that clone for user class 1, then clone it again,
> do all deletes on that clone for user class 2.  This way you only have
> one set of open files, but you've "compiled" your filter into the
> delete docs for each reader.
>
> But, in order to do this, you'd have to disable locking (use
> NoLockFactory) in your Directory impl, just for these readers, since
> you know you'll never commit the readers with pending deletions.  Just
> be sure you never close those readers!
>
> This should give sizable speedups if the filter is non-sparse.
>
> Mike
>
> On Sun, Oct 24, 2010 at 6:34 AM, Khash Sajadi <kh...@sajadi.co.uk> wrote:
> > Here is what I've found so far:
> >
> > I have three main sets to use in a query:
> > Account MUST be xxx
> > User query
> > DateRange on the query MUST be in (a,b) it is a NumericField
> > I tried the following combinations (all using a BooleanQuery with the
> user
> > query added to it)
> > 1. One:
> > - Add ACCOUNT as a TermQuery
> > - Add DATE RANGE as Filter
> > 2. Two
> > - Add ACCOUNT as a Filter
> > - Add DATE RANGE as NumericRangeQuery
> > I tried caching the filters on both scenarios.
> > I also tried both scenarios by passing the query as a ConstantScoreQuery
> as
> > well.
> > I got the best result (about 4x faster) by using a cached filter for the
> > DATE RANGE and leaving the ACCOUNT as a TermQuery.
> > I think I'm happy with this approach. However, the security risk Uwe
> > mentioned when using ACCOUNT as a Query makes me nervous. Any
> suggestions?
> > As for document distribution, the ACCOUNTS have a similar distribution of
> > documents.
> > Also, I still would like to try the multi index approach, but not sure
> about
> > the memory, file handle burden of it (having potentially thousands of
> > readers/writers/searchers open at the same time). I use two processes, one
> as
> > indexer and one for search with the same underlying FSDirectory. As for
> > search, I use writer.getReader().reopen within a SearchManager as
> suggested
> > by Lucene in Action.
> >
> >
> >
> > On 24 October 2010 10:27, Paul Elschot <pa...@xs4all.nl> wrote:
> >>
> >> On Sunday 24 October 2010 00:18:48, Khash Sajadi wrote:
> >> > My index contains documents for different users. Each document has the
> >> > user
> >> > id as a field on it.
> >> >
> >> > There are about 500 different users with 3 million documents.
> >> >
> >> > Currently I'm calling Search with the query (parsed from user)
> >> > and FieldCacheTermsFilter for the user id.
> >> >
> >> > It works but the performance is not great.
> >> >
> >> > Ideally, I would like to perform the search only on the documents that
> >> > are
> >> > relevant, this should make it much faster. However, it seems
> >> > Search(Query,
> >> > Filter) runs the query first and then applies the filter.
> >> >
> >> > Is there a way to improve this? (i.e. run the query only on a subset
> of
> >> > documents)
> >> >
> >> > Thanks
> >> >
> >>
> >> When running the query with the filter, the query is run at the same
> time
> >> as the filter. Initially and after each matching document, the filter is
> >> assumed to
> >> be cheaper to execute and its first or next matching document is
> >> determined.
> >> Then the query and the filter are repeatedly advanced to each other's
> next
> >> matching
> >> document until they are at the same document (ie. there is a match),
> >> similar to
> >> a boolean query with two required clauses.
> >> The java code doing this is in the private method
> >> IndexSearcher.searchWithFilter().
> >>
> >> It could be that filling the field cache is the performance problem.
> >> How is the performance when this search call with the
> >> FieldCacheTermsFilter
> >> is repeated?
> >>
> >> Also, for a single indexed term to be used as a filter (the user id in
> >> this case)
> >> there may be no need for a cache, a QueryWrapperFilter around the
> >> TermQuery
> >> might suffice.
> >>
> >> Regards,
> >> Paul Elschot
> >>
> >>
> >
> >
>
>
>

Re: Using filters to speed up queries

Posted by Michael McCandless <lu...@mikemccandless.com>.
Here's the paper I was thinking of (Robert found this):
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.159.9682 ...
eg note this sentence from the abstract:

    We show that the first implementation, based on a postprocessing
approach, allows an arbitrary user to obtain information about the
content of files for which he does not have read permission.

Note that one simple way to "gauge" performance of filtering-down-low
would be to open an IndexReader, delete all documents except those
matching your filter (eg the ACCOUNT filter), then run your searches
against that IndexReader without the ACCOUNT clause.  If you don't
close that reader then these deletes are never committed.  This is a
simple way to compile a filter into an open IR, but you'd still then
have one reader open per user class, so the risks of too many files,
etc. still stand.

Hmm though you could open an initial reader, then clone it, then do
all your deletes on that clone for user class 1, then clone it again,
do all deletes on that clone for user class 2.  This way you only have
one set of open files, but you've "compiled" your filter into the
delete docs for each reader.

But, in order to do this, you'd have to disable locking (use
NoLockFactory) in your Directory impl, just for these readers, since
you know you'll never commit the readers with pending deletions.  Just
be sure you never close those readers!

This should give sizable speedups if the filter is non-sparse.
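A sketch of that idea (assumes 3.x; acceptedForUserClass() is a hypothetical
stand-in for whatever your ACCOUNT filter would accept):

    Directory dir = FSDirectory.open(new File("/path/to/index"),
                                     NoLockFactory.getNoLockFactory());
    IndexReader base = IndexReader.open(dir, true);    // ordinary read-only reader

    IndexReader clone = base.clone(false);             // writable clone, same underlying files
    for (int docID = 0; docID < clone.maxDoc(); docID++) {
      if (!clone.isDeleted(docID) && !acceptedForUserClass(clone, docID)) {
        clone.deleteDocument(docID);                   // "compile" the filter into deletions
      }
    }
    IndexSearcher userClassSearcher = new IndexSearcher(clone);
    // never commit or close these clones, so the deletions are never persisted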

Mike

On Sun, Oct 24, 2010 at 6:34 AM, Khash Sajadi <kh...@sajadi.co.uk> wrote:
> Here is what I've found so far:
>
> I have three main sets to use in a query:
> Account MUST be xxx
> User query
> DateRange on the query MUST be in (a,b) it is a NumericField
> I tried the following combinations (all using a BooleanQuery with the user
> query added to it)
> 1. One:
> - Add ACCOUNT as a TermQuery
> - Add DATE RANGE as Filter
> 2. Two
> - Add ACCOUNT as a Filter
> - Add DATE RANGE as NumericRangeQuery
> I tried caching the filters on both scenarios.
> I also tried both scenarios by passing the query as a ConstantScoreQuery as
> well.
> I got the best result (about 4x faster) by using a cached filter for the
> DATE RANGE and leaving the ACCOUNT as a TermQuery.
> I think I'm happy with this approach. However, the security risk Uwe
> mentioned when using ACCOUNT as a Query makes me nervous. Any suggestions?
> As for document distribution, the ACCOUNTS have a similar distribution of
> documents.
> Also, I still would like to try the multi index approach, but not sure about
> the memory, file handle burden of it (having potentially thousands of
> readers/writers/searchers open at the same time). I use two processes, one as
> indexer and one for search with the same underlying FSDirectory. As for
> search, I use writer.getReader().reopen within a SearchManager as suggested
> by Lucene in Action.
>
>
>
> On 24 October 2010 10:27, Paul Elschot <pa...@xs4all.nl> wrote:
>>
>> On Sunday 24 October 2010 00:18:48, Khash Sajadi wrote:
>> > My index contains documents for different users. Each document has the
>> > user
>> > id as a field on it.
>> >
>> > There are about 500 different users with 3 million documents.
>> >
>> > Currently I'm calling Search with the query (parsed from user)
>> > and FieldCacheTermsFilter for the user id.
>> >
>> > It works but the performance is not great.
>> >
>> > Ideally, I would like to perform the search only on the documents that
>> > are
>> > relevant, this should make it much faster. However, it seems
>> > Search(Query,
>> > Filter) runs the query first and then applies the filter.
>> >
>> > Is there a way to improve this? (i.e. run the query only on a subset of
>> > documents)
>> >
>> > Thanks
>> >
>>
>> When running the query with the filter, the query is run at the same time
>> as the filter. Initially and after each matching document, the filter is
>> assumed to
>> be cheaper to execute and its first or next matching document is
>> determined.
>> Then the query and the filter are repeatedly advanced to each other's next
>> matching
>> document until they are at the same document (ie. there is a match),
>> similar to
>> a boolean query with two required clauses.
>> The java code doing this is in the private method
>> IndexSearcher.searchWithFilter().
>>
>> It could be that filling the field cache is the performance problem.
>> How is the performance when this search call with the
>> FieldCacheTermsFilter
>> is repeated?
>>
>> Also, for a single indexed term to be used as a filter (the user id in
>> this case)
>> there may be no need for a cache, a QueryWrapperFilter around the
>> TermQuery
>> might suffice.
>>
>> Regards,
>> Paul Elschot
>>
>>
>
>



Re: Using filters to speed up queries

Posted by Khash Sajadi <kh...@sajadi.co.uk>.
Here is what I've found so far:

I have three main sets to use in a query:
Account MUST be xxx
User query
DateRange on the query MUST be in (a,b); it is a NumericField

I tried the following combinations (all using a BooleanQuery with the user
query added to it)

1. One:
- Add ACCOUNT as a TermQuery
- Add DATE RANGE as Filter

2. Two
- Add ACCOUNT as a Filter
- Add DATE RANGE as NumericRangeQuery

I tried caching the filters on both scenarios.
I also tried both scenarios by passing the query as a ConstantScoreQuery as
well.

I got the best result (about 4x faster) by using a cached filter for the
DATE RANGE and leaving the ACCOUNT as a TermQuery.

I think I'm happy with this approach. However, the security risk Uwe
mentioned when using ACCOUNT as a Query makes me nervous. Any suggestions?

As for document distribution, the ACCOUNTS have a similar distribution of
documents.

Also, I still would like to try the multi index approach, but not sure about
the memory, file handle burden of it (having potentially thousands of
readers/writers/searchers open at the same time). I use two processes, one as
indexer and one for search with the same underlying FSDirectory. As for
search, I use writer.getReader().reopen within a SearchManager as suggested
by Lucene in Action.




On 24 October 2010 10:27, Paul Elschot <pa...@xs4all.nl> wrote:

> On Sunday 24 October 2010 00:18:48, Khash Sajadi wrote:
> > My index contains documents for different users. Each document has the
> user
> > id as a field on it.
> >
> > There are about 500 different users with 3 million documents.
> >
> > Currently I'm calling Search with the query (parsed from user)
> > and FieldCacheTermsFilter for the user id.
> >
> > It works but the performance is not great.
> >
> > Ideally, I would like to perform the search only on the documents that
> are
> > relevant, this should make it much faster. However, it seems
> Search(Query,
> > Filter) runs the query first and then applies the filter.
> >
> > Is there a way to improve this? (i.e. run the query only on a subset of
> > documents)
> >
> > Thanks
> >
>
> When running the query with the filter, the query is run at the same time
> as the filter. Initially and after each matching document, the filter is
> assumed to
> be cheaper to execute and its first or next matching document is
> determined.
> Then the query and the filter are repeatedly advanced to each other's next
> matching
> document until they are at the same document (ie. there is a match),
> similar to
> a boolean query with two required clauses.
> The java code doing this is in the private method
> IndexSearcher.searchWithFilter().
>
> It could be that filling the field cache is the performance problem.
> How is the performance when this search call with the FieldCacheTermsFilter
> is repeated?
>
> Also, for a single indexed term to be used as a filter (the user id in this
> case)
> there may be no need for a cache, a QueryWrapperFilter around the TermQuery
> might suffice.
>
> Regards,
> Paul Elschot
>
>
>

Re: Using filters to speed up queries

Posted by Paul Elschot <pa...@xs4all.nl>.
On Sunday 24 October 2010 00:18:48, Khash Sajadi wrote:
> My index contains documents for different users. Each document has the user
> id as a field on it.
> 
> There are about 500 different users with 3 million documents.
> 
> Currently I'm calling Search with the query (parsed from user)
> and FieldCacheTermsFilter for the user id.
> 
> It works but the performance is not great.
> 
> Ideally, I would like to perform the search only on the documents that are
> relevant, this should make it much faster. However, it seems Search(Query,
> Filter) runs the query first and then applies the filter.
> 
> Is there a way to improve this? (i.e. run the query only on a subset of
> documents)
> 
> Thanks
> 

When running the query with the filter, the query is run at the same time
as the filter. Initially, and after each matching document, the filter is assumed to
be cheaper to execute, and its first or next matching document is determined.
Then the query and the filter are repeatedly advanced to each other's next matching
document until they are at the same document (i.e. there is a match), similar to
a boolean query with two required clauses.
The Java code doing this is in the private method IndexSearcher.searchWithFilter().

It could be that filling the field cache is the performance problem.
How is the performance when this search call with the FieldCacheTermsFilter
is repeated?

Also, for a single indexed term to be used as a filter (the user id in this case)
there may be no need for a cache; a QueryWrapperFilter around the TermQuery
might suffice.
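That is, something like (a sketch; "userId" is a placeholder field name):

    Filter userFilter =
        new QueryWrapperFilter(new TermQuery(new Term("userId", userId)));
    TopDocs hits = searcher.search(userQuery, userFilter, 20);   // no FieldCache involved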

Regards,
Paul Elschot
