Posted to solr-user@lucene.apache.org by Fuad Efendi <fu...@efendi.ca> on 2008/02/06 20:07:03 UTC
Performance of filterCache for Faceting
Ciao-Ciao Everyone!
I did something strange and the website (www.tokenizer.org) now performs 1000
times faster (it is still in my basement, on an ADSL line with 600 kbps upload).
Thank you for supporting SOLR!
filterCache:
Size: 1051311
What I did: I made the Category and ItemName fields single-valued. The Category
field is tokenized (with a custom analyzer), and I updated only 30% of the Lucene
index, but that was more than enough for a huge performance improvement. Before
that, due to some mistakes in parsing HTML, etc., I had a *multivalued* field for
facets; I tried to minimize the number of tokens (sorry for possibly wrong
terminology). The facet field is still tokenized, but it is single-valued.
Before: after Commit/Optimize and a server restart, the first query took 5-7
minutes to execute
After: only 2 seconds!
I was looking at Lucene's FieldCache; unfortunately, it is not applicable to
tokenized fields...
filterCache size is almost the same as before, but it works much faster.
Thanks,
Fuad Efendi
(416)993-2060(cell)
(416)761-1940(home)
Tokenizer Inc.
http://www.tokenizer.org
RE: Performance of filterCache for Faceting
Posted by Fuad Efendi <fu...@efendi.ca>.
> On 6-Feb-08, at 4:32 PM, Fuad Efendi wrote:
>
> >> Indeed the field cache method works much better when the values are
> >> single-valued. Unfortunately, there is no way for solr to know that
> >> the analyzer is only outputting a single token per document, else we
> >> could apply this optimization automatically.
> >
> > Thanks Mike,
> >
> > Some clarification:
> > *single-valued* in my previous Email means *field-with-single-only-value*
> > (in SOLR terms, multiValued="false"), and not a *single-token*. This
> > *single-valued* field is analyzed/tokenized and it is *multi-valued-token*
> > so that fieldCache can't work. And I have extremely good performance
> > improvements, *without* Lucene's FieldCache optimization!
>
> That seems extremely odd. Sure you aren't just sending fewer unique
> tokens?
>
> -Mike
>
Yes, that is true: I have probably 1,000,000 unique tokens (at least, the
filterCache size is about 1,000,000). Tokens include different forms of words
such as Telescope and Telescoping; I am not using EnglishPorterFilter yet...
Each single-valued field contains about 3-7 tokens; the database holds
6,000,000 documents, and I reindexed 30% of it (in SOLR), changing the
multi-valued field to a single-valued one (with some filters...).
I did this reindexing hoping to reduce the total number of distinct tokens.
I'll finish reindexing in a few (maybe 24) hours :)
If you browse the website you may notice some long product names that even
contain the product price as a separate field value; the same goes for Category,
where I use the product name(s) but with a different tokenizer. I am now
filtering product names, including those used for the category.
A sample of 'bad' multi-valued product (and category) data that has not been
reindexed yet: http://www.tokenizer.org/large/price.htm
I can't even say that the index became smaller after reindexing; it is 1.6 GB,
almost the same as before.
-Fuad
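
One way to check Mike's question is to count distinct tokens in the facet field before and after reindexing. The following is only a rough sketch: the sample values are made up, and lowercasing plus whitespace splitting is an assumed stand-in for the custom analyzer, which is not shown in this thread.

```python
def unique_tokens(field_values):
    """Count distinct lowercase, whitespace-separated tokens across all values."""
    seen = set()
    for value in field_values:
        seen.update(value.lower().split())
    return len(seen)


# Hypothetical facet-field values, not taken from the real index.
# "before": product names polluted with sizes and prices.
before = ["Telescope 60mm refractor $49.99", "Telescoping ladder 12ft $89.00"]
# "after": the same names with the noisy tokens filtered out.
after = ["Telescope refractor", "Telescoping ladder"]

print(unique_tokens(before), unique_tokens(after))
```

Without stemming, Telescope and Telescoping still count as two separate tokens in both lists, so the drop in this toy example comes only from removing the noisy values.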
Re: Performance of filterCache for Faceting
Posted by Mike Klaas <mi...@gmail.com>.
On 6-Feb-08, at 4:32 PM, Fuad Efendi wrote:
>> Indeed the field cache method works much better when the values are
>> single-valued. Unfortunately, there is no way for solr to know that
>> the analyzer is only outputting a single token per document, else we
>> could apply this optimization automatically.
>
> Thanks Mike,
>
> Some clarification:
> *single-valued* in my previous Email means *field-with-single-only-value*
> (in SOLR terms, multiValued="false"), and not a *single-token*. This
> *single-valued* field is analyzed/tokenized and it is *multi-valued-token*
> so that fieldCache can't work. And I have extremely good performance
> improvements, *without* Lucene's FieldCache optimization!
That seems extremely odd. Sure you aren't just sending fewer unique
tokens?
-Mike
RE: Performance of filterCache for Faceting
Posted by Fuad Efendi <fu...@efendi.ca>.
>Indeed the field cache method works much better when the values are
>single-valued. Unfortunately, there is no way for solr to know that
>the analyzer is only outputting a single token per document, else we
>could apply this optimization automatically.
Thanks Mike,
Some clarification:
*single-valued* in my previous Email means *field-with-single-only-value*
(in SOLR terms, multiValued="false"), and not a *single-token*. This
*single-valued* field is analyzed/tokenized and it is *multi-valued-token*
so that fieldCache can't work. And I have extremely good performance
improvements, *without* Lucene's FieldCache optimization!
Re: Performance of filterCache for Faceting
Posted by Mike Klaas <mi...@gmail.com>.
On 6-Feb-08, at 11:07 AM, Fuad Efendi wrote:
>
> What I did: single-valued fields for Category and ItemName. Category field
> is tokenized (with custom analyzer), and I updated only 30% of Lucene index,
> but it was more than enough for huge performance improvements. Before that,
> due to some mistakes in parsing HTML and etc., I had *multivalued* field for
> facets; I tried to minimize amount of tokens (sorry for possibly wrong
> terminology). Field for facets is still tokenized, but it is single-value.
>
> Before: after Commit/Optimize and Server Restart first query took 5-7
> minutes to execute
> After: only 2 seconds!
>
> I was browsing Lucene fieldCache, unfortunately it's not applicable for
> tokenized fields...
>
> filterCache size is almost the same as before, but it works much faster.
Indeed the field cache method works much better when the values are
single-valued. Unfortunately, there is no way for solr to know that
the analyzer is only outputting a single token per document, else we
could apply this optimization automatically.
-Mike
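
For readers following the thread, the two faceting strategies being contrasted can be illustrated with a toy model. This is a simplified sketch, not Solr's implementation: the function names, the tiny DOCS index and the SINGLE_VALUES array are all hypothetical. The first function keeps one document-id set per unique term (the kind of entry a filter cache would hold) and intersects each with the query result; the second assumes exactly one value per document and counts in a single pass over the matching documents.

```python
from collections import Counter

# Hypothetical toy index: document id -> tokens of the facet field.
DOCS = {
    0: ["telescope", "optics"],
    1: ["telescope"],
    2: ["camera", "optics"],
    3: ["camera"],
}

# Hypothetical single-valued variant: position = document id, one value each.
SINGLE_VALUES = ["telescope", "telescope", "camera", "camera"]


def facet_by_term_filters(docs, base_ids):
    """Filter-style faceting: build one doc-id set per unique term, then
    intersect every set with the base query's result set."""
    filters = {}
    for doc_id, tokens in docs.items():
        for token in tokens:
            filters.setdefault(token, set()).add(doc_id)
    counts = {}
    for term, ids in filters.items():  # cost grows with the number of unique terms
        matched = len(ids & base_ids)
        if matched:
            counts[term] = matched
    return counts


def facet_by_field_cache(values, base_ids):
    """Field-cache-style faceting: one value per document, so counting is a
    single pass over the matching documents."""
    return dict(Counter(values[doc_id] for doc_id in base_ids))


base = {0, 1, 2}  # pretend these documents matched the main query
print(facet_by_term_filters(DOCS, base))
print(facet_by_field_cache(SINGLE_VALUES, base))
```

In this model the first approach does work proportional to the number of unique terms, which is why a field with around a million distinct tokens is expensive, while the second only works when each document contributes exactly one value.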