Posted to solr-user@lucene.apache.org by climbingrose <cl...@gmail.com> on 2007/07/16 17:04:38 UTC

Slow facet with custom Analyser

Hi all,

My facet browsing performance had been decent on my system until I added my
custom Analyser. Initially, I facetted the "title" field, which is of the default
string type (no analysers, tokenisers...), and got quick responses (the first
query is just under 1s, subsequent queries are < 0.1s). I created a custom
analyser which is not much different from the DefaultAnalyzer in the FieldType
class. Essentially, this analyser does not do any tokenisation; it only
converts the value to lower case and removes spaces, unwanted chars and words.
After I applied the analyser to the "title" field, facet performance degraded
considerably. Every query is now > 1.2s and the filterCache hit ratio is
extremely small:

lookups : 918485
hits : 23
hitratio : 0.00
inserts : 918487
evictions : 917971
size : 512
cumulative_lookups : 918485
cumulative_hits : 23
cumulative_hitratio : 0.00
cumulative_inserts : 918487
cumulative_evictions : 917971

Any ideas? Here is my analyser code:

import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.solr.analysis.SolrAnalyzer;

public class FacetTextAnalyser extends SolrAnalyzer {
    final int maxChars;
    final Set<Character> ignoredChars;
    final Set<String> ignoredWords;

    public final static char[] IGNORED_CHARS = {'/', '\\', '\'', '\"',
            '#', '&', '!', '?', '*', '>', '<', ','};
    public static final String[] IGNORED_WORDS = {
            "a", "an", "and", "are", "as", "at", "be", "but", "by",
            "for", "if", "in", "into", "is",
            "no", "not", "of", "on", "or", "such",
            "that", "the", "their", "then", "there", "these",
            "they", "this", "to", "was", "will", "with"
    };

    public FacetTextAnalyser() {
        maxChars = 255;
        ignoredChars = new HashSet<Character>();
        for (char c : IGNORED_CHARS) {
            ignoredChars.add(c);
        }
        ignoredWords = new HashSet<String>();
        for (String w : IGNORED_WORDS) {
            ignoredWords.add(w);
        }
    }

    public FacetTextAnalyser(int maxChars, Set<Character> ignoredChars,
            Set<String> ignoredWords) {
        this.maxChars = maxChars;
        this.ignoredChars = ignoredChars;
        this.ignoredWords = ignoredWords;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new Tokenizer(reader) {
            char[] cbuf = new char[maxChars];

            public Token next() throws IOException {
                int n = input.read(cbuf, 0, maxChars);
                if (n <= 0)
                    return null;
                char[] temp = new char[n];
                int index = 0;
                boolean space = true;
                for (int i = 0; i < n; i++) {
                    char c = cbuf[i];
                    if (ignoredChars.contains(c)) {
                        c = ' ';
                    }
                    if (Character.isWhitespace(c)) {
                        if (space)
                            continue;
                        // Word boundary: append a space, then drop the word
                        // just completed if it is in the ignored-word list.
                        temp[index] = ' ';
                        if (index > 0) {
                            int j = index - 1;
                            while (temp[j] != ' ' && j > 0) {
                                j--;
                            }
                            String str = (j == 0) ? new String(temp, 0, index)
                                    : new String(temp, j + 1, index - j - 1);
                            if (ignoredWords.contains(str))
                                index = j;
                        }
                        index++;
                        space = true;
                    } else {
                        temp[index] = Character.toLowerCase(c);
                        index++;
                        space = false;
                    }
                }
                if (index > 0)
                    temp[0] = Character.toUpperCase(temp[0]);
                String s = new String(temp, 0, index);
                return new Token(s, 0, n);
            }
        };
    }
}
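For illustration, the normalisation this analyser performs (replace ignored chars with spaces, lower-case, drop ignored words, collapse whitespace, capitalise the first letter) can be sketched without the Lucene plumbing. The class name and the abbreviated stopword list below are mine, not part of the original code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class FacetNormalizer {
    // Abbreviated stopword list, for illustration only.
    static final Set<String> IGNORED = new HashSet<String>(Arrays.asList(
            "a", "an", "and", "the", "of", "to", "in", "is"));

    /** Rough equivalent of the analyser's single-token output: split on
     *  whitespace and ignored chars, lower-case, drop stopwords, rejoin
     *  with single spaces, capitalise the first letter. */
    static String normalize(String value) {
        StringBuilder out = new StringBuilder();
        for (String word : value.toLowerCase().split("[\\s/\\\\'\"#&!?*><,]+")) {
            if (word.isEmpty() || IGNORED.contains(word)) continue;
            if (out.length() > 0) out.append(' ');
            out.append(word);
        }
        if (out.length() > 0)
            out.setCharAt(0, Character.toUpperCase(out.charAt(0)));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("The Lord of the Rings!")); // Lord rings
    }
}
```

The key property is that however the value is cleaned up, it remains a single token per document, which is what makes faceting on it equivalent to faceting on a plain string field.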

Here is how I declare the analyser:

  <fieldType name="text_em" class="solr.TextField" positionIncrementGap="100">
    <analyzer class="net.jseeker.lucene.FacetTextAnalyser"/>
  </fieldType>


-- 
Regards,

Cuong Hoang

Re: Slow facet with custom Analyser

Posted by Yonik Seeley <yo...@apache.org>.
On 7/16/07, Chris Hostetter <ho...@fucit.org> wrote:
>
> : > ...but i don't understand why we also check isTokenized() ... shouldn't
> : > multiValued() be enough?
> :
> : A field could return "false" for multiValued() and still have multiple
> : tokens per document for that field.
>
> ah .. right ... sorry: multiValued() indicates whether multiple discrete
> values can be added to the field (and stored if the field is stored) but
> says nothing about what the Analyzer may do with any single value.
>
> perhaps we should really have an [f.foo.]facet.field.type=(single|multi)
> param to let clients indicate when they know exactly which method they
> want used (getFacetTermEnumCounts vs getFieldCacheCounts) ... if the
> property is not set, the default can be determined using the
> "sf.multiValued() || ft.isTokenized() || ft instanceof BoolField" logic.

Or a method FieldType.multiToken(), and a new method
TokenizerFactory/TokenFilterFactory.multiToken() that can be used to
determine this when the FieldType was created (grrr, too bad they
weren't abstract classes)

Or a new attribute in the schema (but I don't like that solution much)

But allowing the user to select the strategy has some merit, esp since
there will be an additional way to find the top "n" when I get around
to finishing my facet-tree-index code.

-Yonik

Re: Slow facet with custom Analyser

Posted by climbingrose <cl...@gmail.com>.
Thanks for the suggestion, Chris. I modified SimpleFacets to check for
[f.foo.]facet.field.type=(single|multi)
and performance has improved significantly.
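A minimal sketch of the kind of check involved (class, method, and parameter names here are illustrative, not the actual SimpleFacets internals): an explicit override picks the strategy, otherwise the heuristic from SimpleFacets quoted elsewhere in the thread decides.

```java
public class FacetStrategy {
    /** Decide whether to facet via per-term filters ("multi") or the
     *  FieldCache ("single"). An explicit override wins; otherwise fall
     *  back to the existing heuristic. */
    static boolean useMultiToken(String override, boolean multiValued,
                                 boolean tokenized, boolean isBoolField) {
        if ("single".equals(override)) return false; // force FieldCache method
        if ("multi".equals(override)) return true;   // force filter method
        return multiValued || tokenized || isBoolField;
    }

    public static void main(String[] args) {
        // Tokenized text field, but the client knows it yields one token
        // per document, so it forces the FieldCache method:
        System.out.println(useMultiToken("single", false, true, false)); // false
        // Without the override, the heuristic picks the filter method:
        System.out.println(useMultiToken(null, false, true, false));     // true
    }
}
```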

On 7/17/07, Chris Hostetter <ho...@fucit.org> wrote:
>
>
> : > ...but i don't understand why we also check isTokenized() ...
> : > shouldn't multiValued() be enough?
> :
> : A field could return "false" for multiValued() and still have multiple
> : tokens per document for that field.
>
> ah .. right ... sorry: multiValued() indicates whether multiple discrete
> values can be added to the field (and stored if the field is stored) but
> says nothing about what the Analyzer may do with any single value.
>
> perhaps we should really have an [f.foo.]facet.field.type=(single|multi)
> param to let clients indicate when they know exactly which method they
> want used (getFacetTermEnumCounts vs getFieldCacheCounts) ... if the
> property is not set, the default can be determined using the
> "sf.multiValued() || ft.isTokenized() || ft instanceof BoolField" logic.
>
>
> -Hoss
>
>


-- 
Regards,

Cuong Hoang

Re: Slow facet with custom Analyser

Posted by Chris Hostetter <ho...@fucit.org>.
: > ...but i don't understand why we also check isTokenized() ... shouldn't
: > multiValued() be enough?
:
: A field could return "false" for multiValued() and still have multiple
: tokens per document for that field.

ah .. right ... sorry: multiValued() indicates whether multiple discrete
values can be added to the field (and stored if the field is stored) but
says nothing about what the Analyzer may do with any single value.

perhaps we should really have an [f.foo.]facet.field.type=(single|multi)
param to let clients indicate when they know exactly which method they
want used (getFacetTermEnumCounts vs getFieldCacheCounts) ... if the
property is not set, the default can be determined using the
"sf.multiValued() || ft.isTokenized() || ft instanceof BoolField" logic.


-Hoss


Re: Slow facet with custom Analyser

Posted by Yonik Seeley <yo...@apache.org>.
On 7/16/07, Chris Hostetter <ho...@fucit.org> wrote:
> : There is currently no way to force Solr to use the FieldCache method.
>
> I'm having a hard time remembering why this is ... i see the line in
> SimpleFacets that says...
>     if (sf.multiValued() || ft.isTokenized() || ft instanceof BoolField) {
>       // Always use filters for booleans... we know the number of values is very small.
>
> ...but i don't understand why we also check isTokenized() ... shouldn't
> multiValued() be enough?

A field could return "false" for multiValued() and still have multiple
tokens per document for that field.

-Yonik

Re: Slow facet with custom Analyser

Posted by Chris Hostetter <ho...@fucit.org>.
: There is currently no way to force Solr to use the FieldCache method.

I'm having a hard time remembering why this is ... i see the line in
SimpleFacets that says...
    if (sf.multiValued() || ft.isTokenized() || ft instanceof BoolField) {
      // Always use filters for booleans... we know the number of values is very small.

...but i don't understand why we also check isTokenized() ... shouldn't
multiValued() be enough?

(that way people can use custom analyzers but as long as they only produce
one term per doc FieldCache will work fine)




-Hoss


Re: Slow facet with custom Analyser

Posted by climbingrose <cl...@gmail.com>.
I've tried both of your recommendations (facet.enum.cache.minDf=1000 and
optimising the index). The query time is around 0.4-0.5s now, but it's still
slow compared to the old "string" type. I haven't tried increasing the
filterCache, but 1,000,000 cached items looks a bit too much for my server
atm. It's a pity that we can't force Solr to use the FieldCache. I think I
might pre-process the "title" field and index it as "string" instead of using
the analyser. However, that defeats the purpose of having pluggable analysers,
tokenisers...

On 7/17/07, Yonik Seeley <yo...@apache.org> wrote:
>
> On 7/16/07, climbingrose <cl...@gmail.com> wrote:
> > Thanks Yonik. In my case, there is only one "title" field per document,
> > so is there a way to force Solr to work the old way? My analyser
> > doesn't break up the "title" field into multiple tokens. It only tries
> > to format the field value (to lower case, remove unwanted chars and
> > words). Therefore, it's no different from using the "string"
> > single-valued type.
>
> There is currently no way to force Solr to use the FieldCache method.
>
> Oh, and in
> "2) expand the size of the fieldcache to 1000000 if you have the memory"
> should have been filterCache, not fieldcache.
>
> -Yonik
>
> > I'll try your first recommendation to see how it goes.
>
> faceting typically proceeds much faster on an optimized index too.
>
> -Yonik
>



-- 
Regards,

Cuong Hoang

Re: Slow facet with custom Analyser

Posted by Yonik Seeley <yo...@apache.org>.
On 7/16/07, climbingrose <cl...@gmail.com> wrote:
> Thanks Yonik. In my case, there is only one "title" field per document so is
> there a way to force Solr to work the old way? My analyser doesn't break up
> the "title" field into multiple tokens. It only tries to format the field
> value (to lower case, remove unwanted chars and words). Therefore, it's no
> different from using the "string" single-valued type.

There is currently no way to force Solr to use the FieldCache method.

Oh, and in
 "2) expand the size of the fieldcache to 1000000 if you have the memory"
should have been filterCache, not fieldcache.

-Yonik

> I'll try your first recommendation to see how it goes.

faceting typically proceeds much faster on an optimized index too.

-Yonik

Re: Slow facet with custom Analyser

Posted by climbingrose <cl...@gmail.com>.
Thanks Yonik. In my case, there is only one "title" field per document, so is
there a way to force Solr to work the old way? My analyser doesn't break up
the "title" field into multiple tokens. It only tries to format the field
value (to lower case, remove unwanted chars and words). Therefore, it's no
different from using the "string" single-valued type.

I'll try your first recommendation to see how it goes.

Thanks again.

On 7/17/07, Yonik Seeley <yo...@apache.org> wrote:
>
> Since you went from a non multi-valued "string" type (which Solr knows
> has at most one value per document) to a custom analyzer type (which
> could produce multiple tokens per document), Solr switched tactics
> from using the FieldCache for faceting to using the filterCache.
>
> Right now, you could try to
> 1) use facet.enum.cache.minDf=1000 (don't use the fieldCache except
> for large facets)
> 2) expand the size of the fieldcache to 1000000 if you have the memory
>
> Optimizing your index should also speed up faceting (but that is a lot
> of facets).
>
> -Yonik
>
> On 7/16/07, climbingrose <cl...@gmail.com> wrote:
> > Hi all,
> >
> > My facet browsing performance has been decent on my system until I
> > added my custom Analyser. Initially, I facetted the "title" field,
> > which is of the default string type (no analysers, tokenisers...), and
> > got quick responses (the first query is just under 1s, subsequent
> > queries are < 0.1s). I created a custom analyser which is not much
> > different from the DefaultAnalyzer in the FieldType class. Essentially,
> > this analyser does not do any tokenisation; it only converts the value
> > to lower case and removes spaces, unwanted chars and words. After I
> > applied the analyser to the "title" field, facet performance degraded
> > considerably. Every query is now > 1.2s and the filterCache hit ratio
> > is extremely small:
> >
> > lookups : 918485
> > hits : 23
> > hitratio : 0.00
> > inserts : 918487
> > evictions : 917971
> > size : 512
> > cumulative_lookups : 918485
> > cumulative_hits : 23
> > cumulative_hitratio : 0.00
> > cumulative_inserts : 918487
> > cumulative_evictions : 917971
>



-- 
Regards,

Cuong Hoang

Re: Slow facet with custom Analyser

Posted by Yonik Seeley <yo...@apache.org>.
Since you went from a non multi-valued "string" type (which Solr knows
has at most one value per document) to a custom analyzer type (which
could produce multiple tokens per document), Solr switched tactics
from using the FieldCache for faceting to using the filterCache.

Right now, you could try to
1) use facet.enum.cache.minDf=1000 (don't use the fieldCache except
for large facets)
2) expand the size of the fieldcache to 1000000 if you have the memory

Optimizing your index should also speed up faceting (but that is a lot
of facets).
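As a rough illustration of suggestion 2 (the cache in question is the filterCache, as corrected later in the thread; the sizes here are examples only, not recommendations), the solrconfig.xml entry might look like:

```xml
<!-- solrconfig.xml: enlarge the filterCache so one filter per facet
     term can stay cached (sizes are illustrative only) -->
<filterCache
    class="solr.LRUCache"
    size="1000000"
    initialSize="4096"
    autowarmCount="0"/>
```

Suggestion 1 is just a request parameter, e.g. appending &facet.enum.cache.minDf=1000 to the facet query.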

-Yonik

On 7/16/07, climbingrose <cl...@gmail.com> wrote:
> Hi all,
>
> My facet browsing performance has been decent on my system until I added my
> custom Analyser. Initially, I facetted the "title" field, which is of the
> default string type (no analysers, tokenisers...), and got quick responses
> (the first query is just under 1s, subsequent queries are < 0.1s). I created
> a custom analyser which is not much different from the DefaultAnalyzer in
> the FieldType class. Essentially, this analyser does not do any
> tokenisation; it only converts the value to lower case and removes spaces,
> unwanted chars and words. After I applied the analyser to the "title" field,
> facet performance degraded considerably. Every query is now > 1.2s and the
> filterCache hit ratio is extremely small:
>
> lookups : 918485
> hits : 23
> hitratio : 0.00
> inserts : 918487
> evictions : 917971
> size : 512
> cumulative_lookups : 918485
> cumulative_hits : 23
> cumulative_hitratio : 0.00
> cumulative_inserts : 918487
> cumulative_evictions : 917971