Posted to solr-user@lucene.apache.org by Demian Katz <de...@villanova.edu> on 2011/06/06 22:04:56 UTC

SpellCheckComponent performance

I'm continuing to work on tuning my Solr server, and now I'm noticing that my biggest bottleneck is the SpellCheckComponent.  This is eating multiple seconds on most first-time searches, and still taking around 500ms even on cached searches.  Here is my configuration:

  <searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">basicSpell</str>
      <str name="field">spelling</str>
      <str name="accuracy">0.75</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
      <str name="queryAnalyzerFieldType">textSpell</str>
      <str name="buildOnOptimize">true</str>
    </lst>
  </searchComponent>

I've done a bit of searching, but the best advice I could find for making the search component go faster involved reducing spellcheck.maxCollationTries, which doesn't even seem to apply to my settings.

Does anyone have any advice on tuning this aspect of my configuration?  Are there any extra debug settings that might give deeper insight into how the component is spending its time?

thanks,
Demian

RE: SpellCheckComponent performance

Posted by "Dyer, James" <Ja...@ingrambook.com>.
Demian,

If you omit "spellcheckIndexDir" from the configuration, it will create an in-memory spelling dictionary.  
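As a rough sketch, this is the configuration from the original post with only the spellcheckIndexDir line removed. Everything else is unchanged, and I have not measured the effect on this particular index:

```xml
<searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">basicSpell</str>
    <str name="field">spelling</str>
    <str name="accuracy">0.75</str>
    <!-- spellcheckIndexDir omitted: the spelling dictionary is then held in memory -->
    <str name="queryAnalyzerFieldType">textSpell</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>
```

An in-memory dictionary is not written to disk, so it would presumably need to be rebuilt after a restart, for example by an optimize (given buildOnOptimize above) or by a request with spellcheck.build=true.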

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311



RE: SpellCheckComponent performance

Posted by Demian Katz <de...@villanova.edu>.
As I may have mentioned before, VuFind is actually doing two Solr queries for every search -- a base query that gets basic spelling suggestions, and a supplemental spelling-only query that gets shingled spelling suggestions.  If there's a way to get two different spelling responses in a single query, I'd love to hear about it...  but the double-querying doesn't seem to be a huge problem -- the delays I'm talking about are in the spelling portion of the initial query.  Just for the sake of completeness, here are both of my spelling field types:

    <!-- Basic Text Field for use with Spell Correction -->
    <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
    <!-- More advanced spell checking field. -->
    <fieldType name="textSpellShingle" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

...and here are the fields:

   <field name="spelling" type="textSpell" indexed="true" stored="true"/>
   <field name="spellingShingle" type="textSpellShingle" indexed="true" stored="true" multiValued="true"/>

As you can probably guess, I'm using spelling in my main query and spellingShingle in my supplemental query.

Here are stats on the spelling field:

{field=spelling,memSize=107830314,tindexSize=249184,time=25747,phase1=25150,nTerms=1343061,bigTerms=231,termInstances=40960454,uses=1}

(I obtained these numbers by temporarily adding the spelling field as a facet to my warming query -- probably not a very smart way to do it, but it was the only way I could figure out!  If there's a more elegant and accurate approach, I'd be interested to know what it is.)
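For reference, a temporary warming-query entry along these lines is one way to set that up. This is only a sketch: the firstSearcher event, the match-all query, and rows=0 are assumptions for illustration rather than the exact settings used here. It goes inside the query section of solrconfig.xml:

```xml
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <!-- match-all query; only the facet computation matters here -->
      <str name="q">*:*</str>
      <str name="rows">0</str>
      <!-- faceting on the spelling field is what produces the field statistics -->
      <str name="facet">true</str>
      <str name="facet.field">spelling</str>
    </lst>
  </arr>
</listener>
```

The facet entry should be removed again once the numbers have been collected, since faceting on a field this large costs a noticeable amount of memory and warm-up time (about 108MB and 25 seconds in the stats above).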

I should also note that my basic spelling index is 114MB and my shingled spelling index is 931MB -- not outrageously large.  Is there a way to persuade Solr to load these into memory for faster performance?

thanks,
Demian


Re: SpellCheckComponent performance

Posted by Erick Erickson <er...@gmail.com>.
Hmmm, how are you configuring your spell checker? The first-time slowdown
is probably due to cache warming, but subsequent 500 ms slowdowns
seem odd. How many unique terms are there in your spellcheck index?

It'd probably be best if you showed us your fieldtype and field definition...

Best
Erick
