Posted to solr-user@lucene.apache.org by Andrew Laird <an...@gettyimages.com> on 2012/06/07 10:01:37 UTC

How to cap facet counts beyond a specified limit

We have an index with ~100M documents and I am looking for a simple way to speed up faceted searches.  Is there a relatively straightforward way to stop counting the number of matching documents beyond some specifiable value?  For our needs we don't really need to know that a particular facet has exactly 14,203,527 matches - just knowing that there are "more than a million" is enough.  If I could somehow limit the hit counts to a million (say) it seems like that could decrease the work required to compute the values (just stop counting after the limit is reached) and potentially improve faceted search time - especially when we have 20-30 fields to facet on.  Has anyone else tried to do something like this?
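[Editor's note: the capping idea described above can be pictured as a saturating counter that stops incrementing once it reaches a limit and reports "1000000+" instead of an exact total. This is a hypothetical, self-contained sketch of the concept, not Solr's actual facet implementation; the class name and API are invented for illustration.]

```java
/** Hypothetical saturating counter illustrating capped facet counts.
 *  Not Solr's facet component -- just the concept from the question. */
public class CappedCounter {
    private final long cap;
    private long count;

    public CappedCounter(long cap) { this.cap = cap; }

    /** Increment, but never beyond the cap. Returns false once saturated,
     *  so a caller could stop feeding it hits entirely. */
    public boolean increment() {
        if (count >= cap) return false;
        count++;
        return true;
    }

    /** Render "1000000+" when saturated, the exact count otherwise. */
    public String display() {
        return count >= cap ? cap + "+" : Long.toString(count);
    }
}
```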

Many thanks for comments and info,

Sincerely,


andy laird | gettyimages | 206.925.6728






Re: How to cap facet counts beyond a specified limit

Posted by Jack Krupansky <ja...@basetechnology.com>.
Sounds like an interesting improvement to propose.

It will also depend on various factors, such as number of unique terms in a 
field, field type, etc.

Which field types are giving you the most trouble and how many unique values 
do they have? And do you specify a facet.method or just let it default?

What release of Solr are you on? Are you using "trie" for numeric fields? 
Are these mostly string fields? Any boolean fields?

-- Jack Krupansky







Re: How to cap facet counts beyond a specified limit

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Thu, 2012-06-07 at 10:01 +0200, Andrew Laird wrote:
> For our needs we don't really need to know that a particular facet has
> exactly 14,203,527 matches - just knowing that there are "more than a
> million" is enough.  If I could somehow limit the hit counts to a
> million (say) [...]

It should be feasible to stop the collector after 1M documents have been
processed, if nothing else then just by ignoring subsequent IDs.
However, the IDs received would be in index order, which normally means
old-to-new. If the nature of the corpus, and thereby the facet values,
changes over time, that change would not be reflected in the facets with
many hits, as the collector would never reach the newer documents.
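[Editor's note: the "ignoring subsequent IDs" variant can be sketched as a wrapper around the hit collector. The interface below is a simplified stand-in, not Lucene's real org.apache.lucene.search.Collector API (which also has setScorer, setNextReader, etc.); the class names are invented for illustration.]

```java
/** Simplified stand-in for a Lucene-style hit collector. */
interface SimpleCollector {
    void collect(int docId);
}

/** Wraps another collector and silently drops hits once a limit is
 *  reached. Because doc IDs arrive in index order (normally
 *  old-to-new), everything after the limit -- typically the newest
 *  documents -- is never seen by the wrapped collector. */
class LimitingCollector implements SimpleCollector {
    private final SimpleCollector delegate;
    private final int limit;
    private int collected;

    LimitingCollector(SimpleCollector delegate, int limit) {
        this.delegate = delegate;
        this.limit = limit;
    }

    @Override
    public void collect(int docId) {
        if (collected >= limit) return;  // cap reached: ignore the rest
        collected++;
        delegate.collect(docId);
    }

    int collectedCount() { return collected; }
}
```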

> it seems like that could decrease the work required to
> compute the values (just stop counting after the limit is reached) and
> potentially improve faceted search time - especially when we have 20-30
> fields to facet on.  Has anyone else tried to do something like this?

The current Solr facet implementation treats every facet structure
individually. That works fine in a lot of areas, but it also means that
the list of IDs for matching documents is iterated once for every facet:
in the sample case, 14M+ hits * 25 fields = 350M+ hits processed.

I have been experimenting with an alternative approach (SOLR-2412) that
packs the terms of all the facets into a single structure under the
hood, which means only 14M+ hits processed in the current case.
Unfortunately it is not yet mature and only works for text fields.
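[Editor's note: the per-field vs. single-pass difference can be sketched with a toy model. This is not SOLR-2412's actual data structure; for simplicity each hit's field ordinals are pre-resolved into an array, where a real implementation would read them from index structures. The point is that the per-hit iteration cost is paid once instead of once per facet field.]

```java
import java.util.List;

/** Toy model comparing per-field facet passes to a single combined pass.
 *  Each element of docsFieldOrdinals is one matching document, holding
 *  its value ordinal for each facet field. */
class FacetSketch {
    /** One pass over the hit list per field: hits * fields iterations. */
    static long perFieldPasses(List<int[]> docsFieldOrdinals, int numFields,
                               long[][] counts) {
        long hitsProcessed = 0;
        for (int f = 0; f < numFields; f++) {
            for (int[] ords : docsFieldOrdinals) {   // re-iterate all hits
                counts[f][ords[f]]++;
                hitsProcessed++;
            }
        }
        return hitsProcessed;
    }

    /** Single pass: each hit is visited once, updating all fields. */
    static long singlePass(List<int[]> docsFieldOrdinals, int numFields,
                           long[][] counts) {
        long hitsProcessed = 0;
        for (int[] ords : docsFieldOrdinals) {       // iterate hits once
            for (int f = 0; f < numFields; f++) {
                counts[f][ords[f]]++;
            }
            hitsProcessed++;
        }
        return hitsProcessed;
    }
}
```

Both variants produce identical counts; only the number of times the hit list is traversed differs (350M+ vs. 14M+ in the example above).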

- Toke Eskildsen, State and University Library, Denmark