Posted to solr-user@lucene.apache.org by Nicolae Mihalache <xp...@gmail.com> on 2009/08/03 10:45:30 UTC

faceted search cache and optimisations

Hello,

I'm using faceted search (perhaps in a dumb way) to collect some statistics
for my index. I have documents in various languages, one of the fields is
"language", and I simply want to see how many documents I have for each
language. I have noticed that the search builds an int[maxDoc] array and then
traverses the array to count. If facet.method=enum (discovered later) is
used, the documents are still counted, just in a different way. But in this
case, where all the documents are retrieved, the information is already
available in the Lucene index.
So, I think it would be a good optimization to detect these cases (i.e. no
filtering) and just return the number from the index instead of counting the
docs again.
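
To make the proposal concrete, here is a minimal Python sketch (a toy model,
not Solr's actual code). The names terms, ords and doc_freq are hypothetical
stand-ins for the term list, the int[maxDoc] array and the per-term document
counts kept in the index. It only illustrates that, when every document
matches, scanning the array and reading the stored counts give the same
answer.

```python
from collections import Counter

# Toy stand-in for the per-document array: ords[doc_id] is the ordinal of
# that document's "language" term (hypothetical data, not Solr internals).
terms = ["de", "en", "fr"]
ords = [1, 1, 0, 2, 1, 0]

# Per-term document counts, as an index would already store them.
doc_freq = {"de": 2, "en": 3, "fr": 1}


def count_by_scanning(ords, matching_docs):
    """Count one facet value per matching document by walking the array."""
    counts = Counter()
    for doc_id in matching_docs:
        counts[terms[ords[doc_id]]] += 1
    return dict(counts)


def count_from_doc_freq(doc_freq):
    """Shortcut for the unfiltered case: reuse the stored per-term counts."""
    return dict(doc_freq)


all_docs = range(len(ords))
scanned = count_by_scanning(ords, all_docs)
shortcut = count_from_doc_freq(doc_freq)
```

As soon as only a subset of documents matches, for example
count_by_scanning(ords, [0, 3]), the stored counts no longer apply and the
scan is needed.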

Another issue: there is currently no way to disable the caching of the
int[maxDoc], is there? If there are many fields to be faceted, this can
quickly lead to out-of-memory situations. I think it would be good to give
the option (as part of the query) to disable the caching. Even if it is
slow, at least it works, and it is useful for non-interactive processing.

And another possible optimization for the int[maxDoc], inspired by
column-oriented databases: they find the minimum number of bits needed to
represent a value. If, for example, my language field has 30 possible
values (i.e. I have docs in 30 languages), I only need 5 bits for each doc
(instead of int=32 bits). Then I can represent the whole int[maxDoc] in less
than 1/6 of the space required now.
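
The arithmetic can be checked with a small Python sketch of a fixed-width
packed array. This is only an illustration under assumptions: PackedArray
and bits_needed are hypothetical names, and the bit-by-bit loops favour
clarity over speed.

```python
def bits_needed(num_values):
    """Smallest bit width that can hold the ordinals 0..num_values-1."""
    return max(1, (num_values - 1).bit_length())


class PackedArray:
    """Fixed-width unsigned integers stored back to back in a bytearray."""

    def __init__(self, size, bits):
        self.size = size
        self.bits = bits
        self.mask = (1 << bits) - 1
        self.data = bytearray((size * bits + 7) // 8)

    def set(self, index, value):
        if not 0 <= value <= self.mask:
            raise ValueError("value does not fit in %d bits" % self.bits)
        pos = index * self.bits
        for b in range(self.bits):
            byte, offset = divmod(pos + b, 8)
            if (value >> b) & 1:
                self.data[byte] |= 1 << offset
            else:
                self.data[byte] &= ~(1 << offset) & 0xFF

    def get(self, index):
        pos = index * self.bits
        value = 0
        for b in range(self.bits):
            byte, offset = divmod(pos + b, 8)
            value |= ((self.data[byte] >> offset) & 1) << b
        return value


max_doc = 1000
packed = PackedArray(max_doc, bits_needed(30))  # 30 languages -> 5 bits
for doc in range(max_doc):
    packed.set(doc, doc % 30)

packed_bytes = len(packed.data)  # 5 bits per document
int_bytes = max_doc * 4          # 32 bits per document
```

For 1000 documents this gives 625 bytes instead of 4000, i.e. 5/32 of the
original size, which is indeed a little under 1/6.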
What's even better, sometimes the documents can be partitioned such that not
all the values of a field are represented in the same partition.
For example, let's assume that I have a field called doc_generation_date. If
I harvest the documents every three days, and I consider a partition as
holding the same three days of data, each partition will basically have
only three possible values for doc_generation_date. That means that I
only need 2 bits for each document, plus a table for each partition
that maps from the partition value id (one of the three values represented
on two bits) to the index value id (that is, the id stored in the Lucene
index).
Of course, for the language field above, the partitioning would not help
unless I indexed only English docs, then only French, and so on.
It also wouldn't work as-is for multi-valued fields.
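
The per-partition table could look roughly like the following Python sketch.
The function names and the example ids are hypothetical, and the sketch
assumes a single-valued field, in line with the caveat above.

```python
def encode_partition(global_ids):
    """Remap a partition's index value ids to small partition-local ids."""
    table = sorted(set(global_ids))  # local id -> index value id
    local_of = {g: i for i, g in enumerate(table)}
    local_ids = [local_of[g] for g in global_ids]
    bits = max(1, (len(table) - 1).bit_length())  # width per document
    return table, local_ids, bits


def decode(table, local_ids, doc):
    """Recover the index value id of one document in the partition."""
    return table[local_ids[doc]]


# One partition holding three days of data: only three distinct
# doc_generation_date ids occur (the ids themselves are made up).
partition = [4017, 4018, 4017, 4019, 4018]
table, local_ids, bits = encode_partition(partition)
```

Here table is [4017, 4018, 4019], local_ids is [0, 1, 0, 2, 1] and bits is 2,
so each document costs two bits plus its share of the three-entry table.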

nicolae

Re: faceted search cache and optimisations

Posted by Yonik Seeley <ys...@gmail.com>.
On Mon, Aug 3, 2009 at 4:45 AM, Nicolae Mihalache<xp...@gmail.com> wrote:
> Hello,
>
> I'm using faceted search (perhaps in a dumb way) to collect some statistics
> for my index. I have documents in various languages, one of the fields is
> "language", and I simply want to see how many documents I have for each
> language. I have noticed that the search builds an int[maxDoc] array and then
> traverses the array to count. If facet.method=enum (discovered later) is
> used, the documents are still counted, just in a different way. But in this
> case, where all the documents are retrieved, the information is already
> available in the Lucene index.

> So, I think it would be a good optimization to detect these cases (i.e. no
> filtering) and just return the number from the index instead of counting the
> docs again.

That would require
 - a base query that matched the entire index
 - no filters
 - no deletions in the index

If you want those numbers, see the terms component.
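
A request to the terms component might be built like this in Python. This is
a sketch under assumptions: the host, port and the /terms handler path are
guesses that depend on how solrconfig.xml is set up, while terms, terms.fl
and terms.limit are the component's own parameters.

```python
from urllib.parse import urlencode

# Assumed location of a request handler with the terms component enabled.
terms_base_url = "http://localhost:8983/solr/terms"

terms_params = {
    "terms": "true",         # switch the terms component on
    "terms.fl": "language",  # field whose terms should be listed
    "terms.limit": -1,       # negative value: no upper limit on terms
}

terms_url = terms_base_url + "?" + urlencode(terms_params)
```

The counts returned this way are raw per-term document frequencies, so they
match the facet counts only under the three conditions listed above.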

> Another issue: there is no way currently to disable the caching of the
> int[maxDoc], is there?

Use facet.method=enum. The number of filters cached is then controlled by
the filterCache settings.
You can also prevent the filterCache from being used via the
facet.enum.cache.minDf param.
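
As an illustration, a facet request combining both parameters could be
assembled like this in Python. The base URL and the minDf value are
assumptions chosen for the example; facet.enum.cache.minDf sets the minimum
document frequency a term needs before the filterCache is used for it, so a
very large value keeps most terms out of the cache.

```python
from urllib.parse import urlencode

# Assumed location of the default select handler.
select_base_url = "http://localhost:8983/solr/select"

facet_params = [
    ("q", "*:*"),                  # match every document
    ("rows", 0),                   # only the facet counts are wanted
    ("facet", "true"),
    ("facet.field", "language"),
    ("facet.method", "enum"),      # enumerate terms instead of int[maxDoc]
    ("facet.enum.cache.minDf", 1000000),  # hypothetical threshold
]

facet_url = select_base_url + "?" + urlencode(facet_params)
```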

-Yonik