Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2010/08/20 23:14:16 UTC

spellcheck index blown away during rebuild

I am just delving into the SpellCheckComponent on a test server 
running a 3.1 build from June 29th.  I have noticed that when you ask 
for a rebuild of the spell check index, it deletes it before starting 
the rebuild.  It takes about 39 minutes to build one (3GB), which is a 
long time to do without autosuggest.  I expect it to take less time on 
my production servers, but don't yet know how much less.

Given that it seems to be using the same segment capabilities as the 
rest of Solr, would it not be possible for it to keep the old one around 
while it builds a new one, then switch before deleting the old one?  I 
could not see an existing Jira issue on this.  Does anyone know of one, 
or should I create it?

Thanks,
Shawn

Re: spellcheck index blown away during rebuild

Posted by Lance Norskog <go...@gmail.com>.

To make a dictionary with a 'minimum document count' you need to make
the dictionary from the facets. Facets will create this for you; but
will allocate memory for every last term. The last N facets will have
the smallest # of terms.

To get term counts for hundreds of millions of terms, I think you need
a separate program that walks the terms. It would be very easy to pull
the term and counts, and print terms with count > N. Lucene's
CheckIndex program gives a nice base for this kind of thing.

On Thu, Aug 26, 2010 at 3:09 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : What you're talking about is effectively promoting the spellcheck
> : index to a first-class Solr index, instead of an appendage bolted on
> : the side of an existing core. Given sharding and distributed search,
> : this may be a better design.
>
> even w/o promoting the spell index to be a "main" index, it still seems
> like the "rebuild" aspect of SpellCheckComponent could be improved to
> take advantage of regular Lucene IndexReader semantics: don't reopen the
> reader used to serve SpellCheckComponent requests until the "new" index
> is completely built.
>
> I'm actually really surprised that it doesn't work that way right now --
> but I imagine this has to do with the way the SpellCheckComponent deals
> with the SpellChecker abstraction that hides the index -- still, it seems
> like there's room for improvement there.
>
>
> -Hoss
>
> --
> http://lucenerevolution.org/  ...  October 7-8, Boston
> http://bit.ly/stump-hoss      ...  Stump The Chump!
>
>

-- 
Lance Norskog
goksron@gmail.com

Re: spellcheck index blown away during rebuild

Posted by Chris Hostetter <ho...@fucit.org>.

: What you're talking about is effectively promoting the spellcheck
: index to a first-class Solr index, instead of an appendage bolted on
: the side of an existing core. Given sharding and distributed search,
: this may be a better design.

even w/o promoting the spell index to be a "main" index, it still seems 
like the "rebuild" aspect of SpellCheckComponent could be improved to 
take advantage of regular Lucene IndexReader semantics: don't reopen the 
reader used to serve SpellCheckComponent requests until the "new" index 
is completely built.

I'm actually really surprised that it doesn't work that way right now -- 
but I imagine this has to do with the way the SpellCheckComponent deals 
with the SpellChecker abstraction that hides the index -- still, it seems 
like there's room for improvement there.
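
The build-then-swap behaviour described above can be sketched in a few lines. This is only an illustration of the idea, not how SolrSpellCheckComponent or Lucene's SpellChecker is implemented: the class name SwappableIndex and its methods are hypothetical, and a frozenset stands in for the on-disk spellcheck index.

```python
import threading


class SwappableIndex:
    """Hypothetical stand-in for a spellcheck index that stays live during a rebuild."""

    def __init__(self, initial_terms):
        # The "live" index that lookups are served from.
        self._live = frozenset(initial_terms)
        # Only one rebuild may run at a time; lookups never take this lock.
        self._rebuild_lock = threading.Lock()

    def contains(self, word):
        # Lookups read whichever index is currently live.
        return word in self._live

    def rebuild(self, term_source):
        """Build a complete replacement first, then switch to it.

        Returns the number of terms in the index that was replaced.
        """
        with self._rebuild_lock:
            # Slow part: the old index keeps serving lookups while this runs.
            fresh = frozenset(term_source)
            # Switch only after the new index is completely built.
            old, self._live = self._live, fresh
        # The old index can be discarded once nothing references it.
        return len(old)
```

The point of the design is ordering: the old index is dropped only after the replacement is complete, so lookups never see an empty or half-built index.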


-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss      ...  Stump The Chump!

Re: spellcheck index blown away during rebuild

Posted by Shawn Heisey <so...@elyograg.org>.

  On 8/20/2010 8:56 PM, Lance Norskog wrote:
> The first question is about your use cases. How many words are in the
> eventual 3GB spelling index? Do you really need that many?
> Spell-checking is a more controllable UI if you make it from a
> dictionary.

It's built from an index-only field that combines four other fields.  
The data we are indexing is metadata from photos, text articles, and 
videos, with most of it being photos.  On a single shard, the schema 
browser shows 23612208 distinct terms in the catchall field, from 
7305684 documents.  If it's a one-to-one relationship, there you go.

Perhaps I need to make another catchall field that leaves out the "full" 
text field.  I'll have to experiment, because my index is already bigger 
than I want it to be.  I have no budget for throwing more hardware at 
the problem.  We are in the process of rewriting our application so that 
we can reduce our index size, but that is still a few months out.

Aside from the index itself, I'm not sure where I'd get an appropriate 
dictionary for photo metadata that would not require major manual work.  
Is there any easy way to get the full list of distinct terms and their 
counts? I'd imagine that if I could filter out those with only a handful 
of occurrences, the list would be dramatically smaller.  Other filters 
might be useful as well, such as removing those above say 15 or 20 
characters.  Normally I'd go to the facet feature for this sort of 
information, but I'm not sure my servers could handle that.
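
The filtering described above might look roughly like the following once the (term, count) pairs have been pulled out of the index by some other means. This is a sketch under assumptions: the function name filter_terms, the default thresholds, and the sample data are made up for illustration, and nothing here talks to Solr or Lucene.

```python
from typing import Iterable, Iterator, Tuple


def filter_terms(
    term_counts: Iterable[Tuple[str, int]],
    min_count: int = 5,      # assumed threshold for "a handful of occurrences"
    max_length: int = 20,    # assumed cutoff for overly long terms
) -> Iterator[str]:
    """Yield terms that occur often enough and are not too long.

    Works lazily, so the full term list never has to be held in memory.
    """
    for term, count in term_counts:
        if count >= min_count and len(term) <= max_length:
            yield term


if __name__ == "__main__":
    # Made-up sample data; real input would come from walking the index terms.
    sample = [("photo", 1200), ("phto", 2), ("x" * 25, 900), ("sunset", 5)]
    for term in filter_terms(sample):
        print(term)
```

Writing the surviving terms one per line would give a plain word list that could serve as a much smaller dictionary source.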

> What you're talking about is effectively promoting the spellcheck
> index to a first-class Solr index, instead of an appendage bolted on
> the side of an existing core. Given sharding and distributed search,
> this may be a better design.

Can you elaborate on what "this" refers to above?  Are you saying that 
you think promoting it to a full Solr index is a good idea?  I saw a 
Jira issue with the idea of building the spellcheck index at the same 
time as the rest of the index, and storing it in the same directory.  
This sounds like a very good way to go, especially if the filtering I 
mentioned above were a part of the configuration.

Thanks,
Shawn

Re: spellcheck index blown away during rebuild

Posted by Lance Norskog <go...@gmail.com>.

The first question is about your use cases. How many words are in the
eventual 3GB spelling index? Do you really need that many?
Spell-checking is a more controllable UI if you make it from a
dictionary.

What you're talking about is effectively promoting the spellcheck
index to a first-class Solr index, instead of an appendage bolted on
the side of an existing core. Given sharding and distributed search,
this may be a better design.

Lance

On Fri, Aug 20, 2010 at 2:14 PM, Shawn Heisey <so...@elyograg.org> wrote:
>  I am just delving into the SpellCheckComponent on a test server running a
> 3.1 build from June 29th.  I have noticed that when you ask for a rebuild of
> the spell check index, it deletes it before starting the rebuild.  It takes
> about 39 minutes to build one (3GB), which is a long time to do without
> autosuggest.  I expect it to take less time on my production servers, but
> don't yet know how much less.
>
> Given that it seems to be using the same segment capabilities as the rest of
> Solr, would it not be possible for it to keep the old one around while it
> builds a new one, then switch before deleting the old one?  I could not see
> an existing Jira issue on this.  Does anyone know of one, or should I create
> it?
>
> Thanks,
> Shawn
>
>

-- 
Lance Norskog
goksron@gmail.com