Posted to solr-user@lucene.apache.org by Fuad Efendi <fu...@efendi.ca> on 2008/02/06 20:07:03 UTC

Performance of filterCache for Faceting

Ciao-Ciao Everyone!

I did something strange and the website (www.tokenizer.org) now performs 1000
times faster (it is still hosted in my basement over ADSL: asymmetric, 600 kbps
upload). Thank you for supporting SOLR!

filterCache:
Size: 1051311

What I did: I made the Category and ItemName fields single-valued. The Category
field is still tokenized (with a custom analyzer). I have updated only 30% of
the Lucene index so far, but that was more than enough for a huge performance
improvement. Before that, due to some mistakes in parsing HTML, I had a
*multivalued* field for facets; I tried to minimize the number of tokens (sorry
for possibly wrong terminology). The field for facets is still tokenized, but
it is single-valued.

Before: after a Commit/Optimize and Server Restart, the first query took 5-7
minutes to execute
After: only 2 seconds!

I looked into Lucene's fieldCache; unfortunately, it is not applicable to
tokenized fields...

filterCache size is almost the same as before, but it works much faster.


Thanks,

Fuad Efendi
(416)993-2060(cell)
(416)761-1940(home)
Tokenizer Inc.
http://www.tokenizer.org

RE: Performance of filterCache for Faceting

Posted by Fuad Efendi <fu...@efendi.ca>.

> On 6-Feb-08, at 4:32 PM, Fuad Efendi wrote:
> 
> >> Indeed the field cache method works much better when the values are
> >> single-valued.  Unfortunately, there is no way for solr to know that
> >> the analyzer is only outputting a single token per document, else we
> >> could apply this optimization automatically.
> >
> > Thanks Mike,
> >
> > Some clarification:
> > *single-valued* in my previous Email means *field-with-single-only-value*
> > (in SOLR terms, multiValued="false"), and not a *single-token*. This
> > *single-valued* field is analyzed/tokenized and it is *multi-valued-token*
> > so that fieldCache can't work. And I have extremely good performance
> > improvements, *without* Lucene's FieldCache optimization!
> 
> That seems extremely odd.  Sure you aren't just sending fewer unique  
> tokens?
> 
> -Mike
> 

Yes, that is true: I probably have about 1,000,000 unique tokens (at least, the
filterCache size is about 1,000,000). Tokens include different forms of the
same word, such as Telescope and Telescoping; I am not using
EnglishPorterFilter yet.
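To illustrate how word forms inflate the number of unique tokens, here is a minimal sketch in Python. It is not Solr code: `crude_stem` is a hypothetical stand-in for a real stemmer (it is not the Porter algorithm), and the sample documents are invented for the example.

```python
def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "es", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def unique_terms(docs, normalize=lambda t: t):
    """Collect the set of distinct terms over all documents."""
    terms = set()
    for text in docs:
        for token in text.lower().split():
            terms.add(normalize(token))
    return terms


# Invented sample data, for illustration only.
docs = ["Telescope", "Telescoping tripod", "telescopes tripods"]

raw = unique_terms(docs)                   # every word form is its own term
stemmed = unique_terms(docs, crude_stem)   # word forms collapse to one term

print(len(raw), sorted(raw))
print(len(stemmed), sorted(stemmed))
```

With per-term filters, each distinct term can become its own cache entry, so collapsing word forms would be expected to shrink the number of entries.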

Each single-valued field contains about 3-7 tokens; the database holds
6,000,000 documents, and I have reindexed 30% of it (in SOLR), changing the
multi-valued field to a single-valued one (plus some filters...)

I did this reindexing hoping to reduce the total number of distinct tokens.
I'll finish reindexing in a few (maybe 24) hours :)

If you browse the website you may notice some long product names that even
contain the product price as a separate field value. The same applies to
Category, where I use the product name(s) but with a different tokenizer; I am
now filtering product names, including the category.
A sample of 'bad' multi-valued product (and category) data which has not been
reindexed yet: http://www.tokenizer.org/large/price.htm

I can't even say that the index became smaller after reindexing; it is 1.6 GB,
almost the same as before.

-Fuad

Re: Performance of filterCache for Faceting

Posted by Mike Klaas <mi...@gmail.com>.

On 6-Feb-08, at 4:32 PM, Fuad Efendi wrote:

>> Indeed the field cache method works much better when the values are
>> single-valued.  Unfortunately, there is no way for solr to know that
>> the analyzer is only outputting a single token per document, else we
>> could apply this optimization automatically.
>
> Thanks Mike,
>
> Some clarification:
> *single-valued* in my previous Email means *field-with-single-only-value*
> (in SOLR terms, multiValued="false"), and not a *single-token*. This
> *single-valued* field is analyzed/tokenized and it is *multi-valued-token*
> so that fieldCache can't work. And I have extremely good performance
> improvements, *without* Lucene's FieldCache optimization!

That seems extremely odd.  Sure you aren't just sending fewer unique  
tokens?

-Mike

RE: Performance of filterCache for Faceting

Posted by Fuad Efendi <fu...@efendi.ca>.

>Indeed the field cache method works much better when the values are  
>single-valued.  Unfortunately, there is no way for solr to know that  
>the analyzer is only outputting a single token per document, else we  
>could apply this optimization automatically.

Thanks Mike,

Some clarification:
*single-valued* in my previous email means a field that holds only a single
value (in SOLR terms, multiValued="false"), and not a *single token*. This
*single-valued* field is analyzed/tokenized, so it still produces multiple
tokens per document and fieldCache can't work. And I have extremely good
performance improvements, *without* Lucene's FieldCache optimization!
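The distinction between a single value and a single token can be shown with a small sketch. The `analyze` function below is a hypothetical analyzer written only for this example (it is not the custom analyzer used on the site), and the field values are invented.

```python
import re


def analyze(value):
    """Hypothetical analyzer: lowercase, then split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", value.lower()) if tok]


# Single-valued field (multiValued="false"): one value per document,
# but the analyzer still emits several tokens for it.
single_valued = "Telescopes & Astronomy Accessories"
single_tokens = analyze(single_valued)

# Multi-valued field: several values per document, each analyzed separately.
multi_valued = ["Telescopes", "Astronomy Accessories", "Price 199.99"]
multi_tokens = [tok for value in multi_valued for tok in analyze(value)]

print(single_tokens)  # ['telescopes', 'astronomy', 'accessories']
print(multi_tokens)
```

In both cases the document ends up with more than one indexed token, which is why a one-value-per-document cache cannot represent the field.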

Re: Performance of filterCache for Faceting

Posted by Mike Klaas <mi...@gmail.com>.

On 6-Feb-08, at 11:07 AM, Fuad Efendi wrote:
>
> What I did: single-valued fields for Category and ItemName. Category field
> is tokenized (with custom analyzer), and I updated only 30% of Lucene index,
> but it was more than enough for huge performance improvements. Before that,
> due to some mistakes in parsing HTML and etc., I had *multivalued* field for
> facets; I tried to minimize amount of tokens (sorry for possibly wrong
> terminology). Field for facets is still tokenized, but it is single-value.
>
> Before: after Commit/Optimize and Server Restart first query took 5-7
> minutes to execute
> After: only 2 seconds!
>
> I was browsing Lucene fieldCache, unfortunately it's not applicable for
> tokenized fields...
>
> filterCache size is almost the same as before, but it works much faster.

Indeed the field cache method works much better when the values are  
single-valued.  Unfortunately, there is no way for solr to know that  
the analyzer is only outputting a single token per document, else we  
could apply this optimization automatically.

-Mike
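
The two facet-counting strategies discussed in this thread can be contrasted with a toy model. This is a simplified sketch, not Solr's implementation: the function names, the in-memory index and the sample data are all assumptions made for illustration. The filter-based method keeps one document set per term and intersects each with the query result; the field-cache method needs exactly one untokenized value per document and counts in a single pass.

```python
from collections import Counter

# Toy index (invented data): doc id -> tokens of the facet field.
docs = {
    0: ["telescope", "refractor"],
    1: ["telescope"],
    2: ["tripod"],
    3: ["telescope", "tripod"],
}
matching = {0, 1, 3}  # ids of documents matching the user's query


def facet_by_filters(docs, matching):
    """One doc-id set per term, each intersected with the query result."""
    filters = {}
    for doc_id, tokens in docs.items():
        for tok in tokens:
            filters.setdefault(tok, set()).add(doc_id)
    # Work grows with the number of unique terms.
    return {tok: len(ids & matching) for tok, ids in filters.items()}


def facet_by_field_cache(values, matching):
    """values[doc_id] is the single, untokenized value of that document."""
    # Work grows with the number of matching documents.
    return Counter(values[doc_id] for doc_id in matching)


# Only possible when every document has exactly one value.
values = ["telescope", "telescope", "tripod", "telescope"]

print(facet_by_filters(docs, matching))
print(facet_by_field_cache(values, matching))
```

In this model the filter-based cost depends on how many distinct terms exist, which is consistent with Mike's question about whether fewer unique tokens are being sent.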