Posted to solr-user@lucene.apache.org by Scott Schneider <Sc...@symantec.com> on 2013/11/05 23:36:11 UTC

fq efficiency

Hi all,

I'm wondering if filter queries are efficient enough for my use cases.  I have lots and lots of users in a big, multi-tenant, sharded index.  To run a search, I can use an fq on the user id and pass in the search terms.  Does this scale well with the # users?  I suppose that, since user id is indexed, generating the filter data (which is cached) will be fast.  And looking up search terms is fast, of course.  But if the search term is a common one that many users have in their documents, then Solr may have to perform an intersection between two large sets:  docs from all users with the search term and all of the current user's docs.

Also, how about auto-complete and searching with a trailing wildcard?  As I understand it, these work well in a single-tenant index because keywords are sorted in the index, so it's easy to get all the search terms that match "foo*".  In a multi-tenant index, all users' keywords are stored together.  So if Lucene were to look at all the keywords from "foo" to "foozzzzz" (I'm not sure if it actually does this), it would skip over a large majority of keywords that don't belong to this user.
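To make the concern concrete, here is a toy sketch (plain Java, not Lucene internals) of why sorted terms make trailing-wildcard queries cheap: "foo*" is just a range scan over the sorted term dictionary, and in a shared multi-tenant dictionary much of a popular range may belong to other tenants.

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixScanSketch {
    public static void main(String[] args) {
        // A toy term dictionary; real Lucene terms are likewise stored
        // sorted, which is what makes prefix scans cheap.
        TreeSet<String> terms = new TreeSet<>();
        terms.add("bar");
        terms.add("food");
        terms.add("foot");
        terms.add("football");
        terms.add("fox");

        // "foo*" becomes a range scan from "foo" (inclusive) up to the
        // first string that can no longer start with "foo".
        SortedSet<String> matches = terms.subSet("foo", "foo\uffff");
        System.out.println(matches);  // [food, foot, football]
    }
}
```

With per-user prefixed terms (say `user42_foo`), the same kind of scan over the range starting at `user42_foo` would touch only that user's terms, which is the intuition behind the prepend-the-user-id idea discussed later in the thread.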

Thanks,
Scott


Re: fq efficiency

Posted by Joel Bernstein <jo...@gmail.com>.
Hi Scott,

When intersecting the two sets, Lucene has the advantage that both sets are
sorted. So Lucene can perform a merge join on the two sets in a single
pass. This turns out to be a very fast operation.
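As a rough illustration (not Lucene's actual code), a single-pass merge join over two ascending doc-ID lists looks like this; the cost is O(n + m) regardless of how the matches are distributed:

```java
import java.util.ArrayList;
import java.util.List;

public class MergeJoinSketch {
    // Intersect two ascending doc-ID lists in one pass,
    // advancing whichever pointer is behind.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] termDocs = {2, 5, 8, 13, 21, 34};  // docs matching "hello"
        int[] userDocs = {3, 5, 13, 40};         // this user's docs
        System.out.println(intersect(termDocs, userDocs));  // [5, 13]
    }
}
```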

Where you'll run into performance issues is if a huge number of
documents match the entire query and need to be scored/ranked. I
wouldn't start worrying about that, though, until you're dealing with
result sets in the millions of documents.

Joel




-- 
Joel Bernstein
Search Engineer at Heliosearch

Re: fq efficiency

Posted by Erick Erickson <er...@gmail.com>.
Have you tried this and measured or is this theoretical? Because
filter queries are _designed_ for this kind of use case.

bq: If the user has 100 documents, then finding the intersection
requires checking each list ~100 times

The cached fq is a bitset. Before checking each document,
all that has to happen to "check the list" is index into the bitset and
see if the bit is turned on. If it isn't, the document is bypassed.
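A minimal sketch of what that per-document check amounts to (hypothetical names, not Solr internals): the cached fq is a bitset indexed by internal doc ID, so each candidate from the main query costs one constant-time bit test.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class CachedFilterSketch {
    // "Checking the list" against a cached fq is one bit test per
    // candidate doc: index into the bitset, keep the doc if set.
    static List<Integer> applyFilter(BitSet userFilter, int[] candidates) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : candidates) {
            if (userFilter.get(doc)) kept.add(doc);
        }
        return kept;
    }

    public static void main(String[] args) {
        // The cached fq: one bit per document, set for this user's docs.
        BitSet userFilter = new BitSet();
        userFilter.set(3);
        userFilter.set(7);
        userFilter.set(12);

        // Candidate docs produced by the main query ("hello" postings).
        int[] candidates = {1, 3, 7, 9, 12, 14};
        System.out.println(applyFilter(userFilter, candidates));  // [3, 7, 12]
    }
}
```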


The lots-of-cores solution has this drawback: the first time a query
comes in for a particular core, the core may need to be loaded, which
will be noticeably slow, so your users have to be able to tolerate
first-time searches that take a bit of time. That said, test to see if
it's "fast enough" before settling on that solution.

But really, I'd bypass this and just try the filter query solution and
measure. Because I'd be surprised if you really had performance
issues here, assuming your filter queries are indeed cached and
re-used.

Best,
Erick



RE: fq efficiency

Posted by Scott Schneider <Sc...@symantec.com>.
Digging a bit more, I think I have answered my own questions.  Can someone please say if this sounds right?

http://wiki.apache.org/solr/LotsOfCores looks like a pretty good solution.  If I give each user his own shard, each query can be run in only one shard.  The effect of the filter query will basically be to find that shard.  The requirements listed on the wiki suggest that performance will be good.  But in Solr 3.x, this won't scale with the # users/shards.

Prepending a user id to indexed keywords using an analyzer will break wildcard search.  If there is a wildcard, the query analyzer doesn't run filters, so it won't prepend the user id.  I could prepend the user id myself before calling Solr, but that seems... bad.
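For what it's worth, the client-side workaround would look something like this (a purely hypothetical helper, not an endorsement): prepend the user id to every query term, wildcard or not, since the prepend only touches the front of the term. An underscore separator is assumed here to avoid colliding with Solr's `field:term` colon syntax.

```java
public class PrefixQueryTerms {
    // Hypothetical client-side rewrite: "foo ba*" for user 42
    // becomes "u42_foo u42_ba*". A trailing wildcard survives
    // because only the front of each term changes.
    static String prefixTerms(String query, String userId) {
        StringBuilder sb = new StringBuilder();
        for (String term : query.trim().split("\\s+")) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(userId).append('_').append(term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(prefixTerms("foo ba*", "u42"));  // u42_foo u42_ba*
    }
}
```

The indexed terms would need the same `userId_` prefix applied at index time for this to line up, which is exactly the analyzer step the wildcard path skips.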

Scott





RE: fq efficiency

Posted by Scott Schneider <Sc...@symantec.com>.
Thanks, that link is very helpful, especially the section, "Leapfrog, anyone?"  This actually seems quite slow for my use case.  Suppose we have 10,000 users and 1,000,000 documents.  We search for "hello" for a particular user and let's assume that the fq set for the user is cached.  "hello" is a common word and perhaps 10,000 documents will match.  If the user has 100 documents, then finding the intersection requires checking each list ~100 times.  If the user has 1,000 documents, we check each list ~1,000 times.  That doesn't scale well.
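For reference, the leapfrog (zig-zag) join described in that article does not rescan a list per posting: each side advances directly to the other side's current candidate, skipping everything in between. A rough sketch, using binary search as a stand-in for the skip lists real postings use:

```java
import java.util.ArrayList;
import java.util.List;

public class LeapfrogSketch {
    // Index of the first element >= target, searching from 'from'.
    // Real posting lists use skip lists for this advance operation.
    static int advance(int[] list, int from, int target) {
        int lo = from, hi = list.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (list[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // Zig-zag intersection: whichever side is behind leaps forward
    // to the other side's candidate instead of stepping one by one.
    static List<Integer> leapfrog(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i = advance(a, i, b[j]);
            else j = advance(b, j, a[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] helloDocs = {2, 5, 8, 13, 21, 34, 55, 89};
        int[] userDocs  = {13, 89};
        System.out.println(leapfrog(helloDocs, userDocs));  // [13, 89]
    }
}
```

With skip-based advances, the cost is driven by the smaller list's length times a cheap skip, not by repeatedly walking the larger list.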

My searches are usually in one user's data.  How can I take advantage of that?  I could have a separate index for each user, but loading so many indexes at once seems infeasible; and dynamically loading & unloading indexes is a pain.

Or I could create a filter that takes tokens and prepends them with the user id.  That seems like a good solution, since my keyword searches always include a user id (and usually just 1 user id).  Though I wonder if there is a downside I haven't thought of.

Thanks,
Scott




Re: fq efficiency

Posted by Shawn Heisey <so...@elyograg.org>.

From what I understand, there's not really a whole lot of difference
between queries and filter queries when they are NOT cached, except
that the main query and the filter queries are executed in parallel,
which can save time.

When filter queries are found in the filterCache, it's a different 
story.  They get applied *before* the main query, which means that the 
main query won't have to work as hard.  The filterCache stores 
information about which documents in the entire index match the filter.  
By storing it as a bitset, the amount of space required is relatively 
low.  Applying filterCache results is very efficient.
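A toy model of that cache (hypothetical names, nothing like Solr's actual implementation): the filter string keys a map whose value is the bitset of matching docs over the whole index, computed once and then reused and intersected cheaply.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class FilterCacheSketch {
    // Toy filterCache: filter string -> bitset over the whole index.
    private final Map<String, BitSet> cache = new HashMap<>();

    // Computed once per distinct filter; later queries reuse the bitset.
    BitSet getFilter(String fq, Supplier<BitSet> compute) {
        return cache.computeIfAbsent(fq, k -> compute.get());
    }

    public static void main(String[] args) {
        FilterCacheSketch fc = new FilterCacheSketch();
        BitSet filter = fc.getFilter("user_id:42", () -> {
            BitSet bits = new BitSet();
            bits.set(3);
            bits.set(7);  // docs owned by user 42
            return bits;
        });

        // Applying the cached filter to main-query matches is a bitwise AND.
        BitSet queryMatches = new BitSet();
        queryMatches.set(1);
        queryMatches.set(7);
        queryMatches.set(9);
        queryMatches.and(filter);
        System.out.println(queryMatches);  // {7}
    }
}
```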

There are also advanced techniques, like assigning a cost to each filter 
and creating postfilters:

http://yonik.com/posts/advanced-filter-caching-in-solr/

Thanks,
Shawn