Posted to solr-user@lucene.apache.org by Daniel Tyreus <da...@webshots.com> on 2013/04/24 23:10:13 UTC

filter before facet

We're testing SolrCloud 4.1 for NRT search over hundreds of millions of
documents. I've been really impressed. The query performance is so much
better than we were getting out of our database.

With filter queries, we're able to get query times of less than 100ms under
moderate load. That's amazing.

My question today is on faceting. Let me give some examples to help make my
point.

*fq=state:California*
numFound = 92193
QTime = *80*

*fq=state:Calforni*
numFound = 0
QTime = *8*

*fq=state:California&facet=true&facet.field=city*
numFound = 92193
QTime = *1316*

*fq=city:"San Francisco"&facet=true&facet.field=city*
numFound = 1961
QTime = *1477*

*fq=state:Californi&facet=true&facet.field=city*
numFound = 0
QTime = *1380*

So filtering is fast and faceting is slow, which is understandable.

But why is it slow to generate facets on a result set of 0? Furthermore,
why does it take the same amount of time to generate facets on a result set
of 2000 as 100,000 documents?

This leads me to believe that the FQ is being applied AFTER the facets are
calculated on the whole data set. For my use case it would make a ton of
sense to apply the FQ first and then facet. Is it possible to specify this
behavior or do I need to get into the code and get my hands dirty?

Best Regards,
Daniel

Re: filter before facet

Posted by Daniel Tyreus <da...@webshots.com>.
I'm actually using one not listed in that doc (I suspect it's new). At
least with 3 or more facet fields, the FCS method is by far the best.

Here are some representative numbers with everything the same except for
the facet.method.

facet.method = fc
QTime = 3168

facet.method = enum
QTime = 309

facet.method = fcs
QTime = 19
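
For anyone who wants to repeat the comparison, here is a minimal sketch of how the three requests could be built. The q=*:* and rows=0 parameters and the single facet field "city" are assumptions for illustration; the test above used three or more facet fields that are not named in this thread.

```python
from urllib.parse import urlencode


def facet_query(fq, facet_fields, method):
    """Build a Solr query string with one filter query and an explicit facet.method."""
    params = [
        ("q", "*:*"),            # assumed: match everything, let fq do the filtering
        ("fq", fq),
        ("rows", "0"),           # assumed: only the facet counts are of interest
        ("facet", "true"),
        ("facet.method", method),
    ]
    # facet.field may be repeated, once per field to facet on
    params += [("facet.field", field) for field in facet_fields]
    return urlencode(params)


for method in ("fc", "enum", "fcs"):
    print(facet_query("state:California", ["city"], method))
```

Each printed string can be appended to the select handler URL of the collection under test; QTime is then read from the response header.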






On Wed, Apr 24, 2013 at 2:19 PM, Alexandre Rafalovitch
<ar...@gmail.com> wrote:

> What's your facet.method? Have you tried setting it both ways?
> http://wiki.apache.org/solr/SimpleFacetParameters#facet.method
>
> Regards,
>    Alex.
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)

Re: filter before facet

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
What's your facet.method? Have you tried setting it both ways?
http://wiki.apache.org/solr/SimpleFacetParameters#facet.method

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)



Re: filter before facet

Posted by Daniel Tyreus <da...@webshots.com>.
On Thu, Apr 25, 2013 at 12:35 AM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:

> > This leads me to believe that the FQ is being applied AFTER the facets are
> > calculated on the whole data set. For my use case it would make a ton of
> > sense to apply the FQ first and then facet. Is it possible to specify this
> > behavior or do I need to get into the code and get my hands dirty?
>
> As for creating a new faceting implementation that avoids the startup
> penalty by using only the found documents, then it is technically quite
> simple: Use stored fields, iterate the hits and request the values.
> Unfortunately this scales poorly with the number of hits, so unless you
> can guarantee that you will always have small result sets, this is
> probably not a viable option.
>
>
Thank you, Toke, for your detailed reply. I have a perhaps unusual use case
where we may have hundreds of thousands of users, each with a few thousand
documents. On some queries I can guarantee the result size will be small
compared to the entire corpus, since I'm filtering on one user's documents.
I may give this alternative faceting implementation a try.

Best regards,
Daniel

Re: filter before facet

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Wed, 2013-04-24 at 23:10 +0200, Daniel Tyreus wrote:
> But why is it slow to generate facets on a result set of 0? Furthermore,
> why does it take the same amount of time to generate facets on a result set
> of 2000 as 100,000 documents?

The default faceting method for your query is field cache. Field cache
faceting works by generating a structure for all the values for the
field in the whole corpus. It is exactly the same work whether you hit
0, 2K or 100M documents with your query.

After the structure has been built, the actual counting of values in the
facet is fast. There is not much difference between 2K and 100K hits.
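
As a toy illustration of the two phases described above (not Solr's actual implementation), the sketch below builds a per-document ordinal array over the whole corpus and then counts only the hits. The in-memory list of dicts, the single-valued "city" field and the function names are all assumptions made for the example.

```python
def build_field_cache(corpus, field):
    """Visit every document in the corpus once; the cost does not depend on any query."""
    terms = sorted({doc[field] for doc in corpus})
    ordinal = {term: i for i, term in enumerate(terms)}
    # One entry per document in the corpus: the ordinal of its (single) value
    doc_to_ord = [ordinal[doc[field]] for doc in corpus]
    return terms, doc_to_ord


def count_facets(terms, doc_to_ord, hit_doc_ids):
    """Cheap phase: one array lookup and one increment per hit."""
    counts = [0] * len(terms)
    for doc_id in hit_doc_ids:
        counts[doc_to_ord[doc_id]] += 1
    return {terms[i]: c for i, c in enumerate(counts) if c}


corpus = [
    {"state": "California", "city": "San Francisco"},
    {"state": "California", "city": "Oakland"},
    {"state": "Oregon", "city": "Portland"},
    {"state": "California", "city": "San Francisco"},
]

# Paid once, even if the query that follows matches nothing
terms, doc_to_ord = build_field_cache(corpus, "city")

hits = [i for i, doc in enumerate(corpus) if doc["state"] == "California"]
print(count_facets(terms, doc_to_ord, hits))
print(count_facets(terms, doc_to_ord, []))
```

In this model a query with zero hits still pays for build_field_cache the first time, which matches the timings reported for the misspelled filter.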

> This leads me to believe that the FQ is being applied AFTER the facets are
> calculated on the whole data set. For my use case it would make a ton of
> sense to apply the FQ first and then facet. Is it possible to specify this
> behavior or do I need to get into the code and get my hands dirty?

As you write later, you have tried fc, enum and fcs, with fcs having the
fastest first-request time. That is understandable, as it is
segment-oriented and (nearly) just a matter of loading the values
sequentially from storage. However, the general observation is that it
is about 10 times as slow as the fc method for subsequent queries. Since
you are doing NRT, that might still leave fcs as the best method for you.

Creating a new faceting implementation that avoids the startup penalty
by using only the found documents is technically quite simple: use
stored fields, iterate the hits and request the values. Unfortunately
this scales poorly with the number of hits, so unless you can guarantee
that you will always have small result sets, this is probably not a
viable option.
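
A rough sketch of that idea, done on the client side rather than inside Solr: it assumes the hits have already been fetched with the facet field stored and returned, and that each hit is a plain dict. The function name and sample data are hypothetical.

```python
from collections import Counter


def facet_from_hits(hits, field):
    """Count field values by reading the stored value of every hit.

    Work grows linearly with the number of hits, so this only pays off
    when result sets are known to be small.
    """
    counts = Counter()
    for doc in hits:
        value = doc.get(field)
        if value is None:
            continue  # the document has no stored value for this field
        # Treat multi-valued fields (lists) and single values alike
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return counts


hits = [
    {"id": "1", "city": "San Francisco"},
    {"id": "2", "city": "Oakland"},
    {"id": "3", "city": "San Francisco"},
    {"id": "4"},
]
print(facet_from_hits(hits, "city").most_common())
```

There is no up-front structure to build here, so an empty result set costs nothing, but every hit has to be retrieved and inspected.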

- Toke Eskildsen, State and University Library, Denmark