Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2010/08/20 23:14:16 UTC
spellcheck index blown away during rebuild
I am just delving into the SpellCheckComponent on a test server
running a 3.1 build from June 29th. I have noticed that when you ask
for a rebuild of the spell check index, it deletes it before starting
the rebuild. It takes about 39 minutes to build one (3GB), which is a
long time to do without autosuggest. I expect it to take less time on
my production servers, but don't yet know how much less.
Given that it seems to be using the same segment capabilities as the
rest of Solr, would it not be possible for it to keep the old one around
while it builds a new one, then switch before deleting the old one? I
could not see an existing Jira issue on this. Does anyone know of one,
or should I create it?
Thanks,
Shawn
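
A minimal sketch of the "build the new one, switch, then delete the old one" sequence proposed above. It is plain Python over ordinary directories, not Solr or Lucene code; the SwappableIndex class, the write_words builder, and the words.txt file name are all hypothetical names chosen for illustration.

```python
import shutil
import tempfile
import threading
from pathlib import Path


class SwappableIndex:
    """Keeps serving the current index directory while a rebuild runs elsewhere."""

    def __init__(self, root: Path):
        self._root = root
        self._lock = threading.Lock()
        self._current = None  # directory currently used to answer requests

    def current(self):
        with self._lock:
            return self._current

    def rebuild(self, build_fn):
        # Build into a fresh sibling directory; the live one is left untouched.
        new_dir = Path(tempfile.mkdtemp(prefix="spell-", dir=str(self._root)))
        try:
            build_fn(new_dir)
        except Exception:
            # A failed build is discarded and the old index stays live.
            shutil.rmtree(new_dir, ignore_errors=True)
            raise
        # Switch first, and only then delete the old index.
        with self._lock:
            old_dir, self._current = self._current, new_dir
        if old_dir is not None:
            shutil.rmtree(old_dir, ignore_errors=True)
        return new_dir


def write_words(words):
    """Hypothetical builder: writes one word per line into words.txt."""
    def build(target: Path):
        (target / "words.txt").write_text("\n".join(words))
    return build
```

With this ordering, current() keeps returning the old directory for the whole duration of a rebuild, so lookups are never left without an index. The sketch deletes the old directory immediately after the switch; a real implementation would presumably also have to wait for in-flight requests that still hold the old directory.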
Re: spellcheck index blown away during rebuild
Posted by Lance Norskog <go...@gmail.com>.
To make a dictionary with a 'minimum document count' you need to make
the dictionary from the facets. Facets will create this for you, but
will allocate memory for every last term. The last N facets will have
the smallest # of terms.
To get term counts for hundreds of millions of terms, I think you need
a separate program that walks the terms. It would be very easy to pull
the term and counts, and print terms with count > N. Lucene's
CheckIndex program gives a nice base for this kind of thing.
On Thu, Aug 26, 2010 at 3:09 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : What you're talking about is effectively promoting the spellcheck
> : index to a first-class Solr index, instead of an appendage bolted on
> : the side of an existing core. Given sharding and distributed search,
> : this may be a better design.
>
> even w/o promoting the spell index to be a "main" index, it still seems
> like the "rebuild" aspect of SpellCheck component could be improved to
> take advantage of regular Lucene IndexReader semantics: don't reopen the
> reader used to serve SpellComponent requests until the "new" index is
> completely built.
>
> I'm actually really surprised that it doesn't work that way right now --
> but I imagine this has to do with the way the SpellCheckComponent deals
> with the SpellChecker abstraction that hides the index -- still, it seems
> like there's room for improvement there.
>
>
> -Hoss
>
> --
> http://lucenerevolution.org/ ... October 7-8, Boston
> http://bit.ly/stump-hoss ... Stump The Chump!
>
>
--
Lance Norskog
goksron@gmail.com
Re: spellcheck index blown away during rebuild
Posted by Chris Hostetter <ho...@fucit.org>.
: What you're talking about is effectively promoting the spellcheck
: index to a first-class Solr index, instead of an appendage bolted on
: the side of an existing core. Given sharding and distributed search,
: this may be a better design.
even w/o promoting the spell index to be a "main" index, it still seems
like the "rebuild" aspect of SpellCheck component could be improved to
take advantage of regular Lucene IndexReader semantics: don't reopen the
reader used to serve SpellComponent requests until the "new" index is
completely built.
I'm actually really surprised that it doesn't work that way right now --
but I imagine this has to do with the way the SpellCheckComponent deals
with the SpellChecker abstraction that hides the index -- still, it seems
like there's room for improvement there.
-Hoss
--
http://lucenerevolution.org/ ... October 7-8, Boston
http://bit.ly/stump-hoss ... Stump The Chump!
Re: spellcheck index blown away during rebuild
Posted by Shawn Heisey <so...@elyograg.org>.
On 8/20/2010 8:56 PM, Lance Norskog wrote:
> The first question is about your use cases. How many words are in the
> eventual 3GB spelling index? Do you really need that many?
> Spell-checking is a more controllable UI if you make it from a
> dictionary.
It's built from an index-only field that combines four other fields.
The data we are indexing is metadata from photos, text articles, and
videos, with most of it being photos. On a single shard, the schema
browser shows 23612208 distinct terms in the catchall field, from
7305684 documents. If it's a one-to-one relationship, there you go.
Perhaps I need to make another catchall field that leaves out the "full"
text field. I'll have to experiment, because my index is already bigger
than I want it to be. I have no budget for throwing more hardware at
the problem. We are in the process of rewriting our application so that
we can reduce our index size, but that is still a few months out.
Aside from the index itself, I'm not sure where I'd get an appropriate
dictionary for photo metadata that would not require major manual work.
Is there any easy way to get the full list of distinct terms and their
counts? I'd imagine that if I could filter out those with only a handful
of occurrences, the list would be dramatically smaller. Other filters
might be useful as well, such as removing those above say 15 or 20
characters. Normally I'd go to the facet feature for this sort of
information, but I'm not sure my servers could handle that.
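
A rough sketch of the filtering described above, assuming the terms and their counts are already available as (term, count) pairs from some export. How to get that list out of the index is the open question here and is not shown. The function name and the default thresholds are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def filter_terms(
    term_counts: Iterable[Tuple[str, int]],
    min_count: int = 5,
    max_length: int = 20,
) -> Iterator[Tuple[str, int]]:
    """Yield (term, count) pairs that are frequent enough and short enough."""
    for term, count in term_counts:
        # Drop rare terms (likely typos or noise) and overly long terms.
        if count >= min_count and len(term) <= max_length:
            yield term, count


if __name__ == "__main__":
    # Made-up sample data, only to show the call shape.
    sample = [("photo", 120), ("phtoo", 1), ("sunset", 42), ("x" * 25, 9)]
    for term, count in filter_terms(sample, min_count=5, max_length=20):
        print(term, count)
```

Because it is a generator, the filter looks at one pair at a time and never holds the full term list in memory, which matters with tens of millions of distinct terms.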
> What you're talking about is effectively promoting the spellcheck
> index to a first-class Solr index, instead of an appendage bolted on
> the side of an existing core. Given sharding and distributed search,
> this may be a better design.
Can you elaborate on what "this" refers to above? Are you saying that
you think promoting it to a full Solr index is a good idea? I saw a
Jira issue with the idea of building the spellcheck index at the same
time as the rest of the index, and storing it in the same directory.
This sounds like a very good way to go, especially if the filtering I
mentioned above were a part of the configuration.
Thanks,
Shawn
Re: spellcheck index blown away during rebuild
Posted by Lance Norskog <go...@gmail.com>.
The first question is about your use cases. How many words are in the
eventual 3GB spelling index? Do you really need that many?
Spell-checking is a more controllable UI if you make it from a
dictionary.
What you're talking about is effectively promoting the spellcheck
index to a first-class Solr index, instead of an appendage bolted on
the side of an existing core. Given sharding and distributed search,
this may be a better design.
Lance
On Fri, Aug 20, 2010 at 2:14 PM, Shawn Heisey <so...@elyograg.org> wrote:
> I am just delving into the SpellCheckComponent on a test server running a
> 3.1 build from June 29th. I have noticed that when you ask for a rebuild of
> the spell check index, it deletes it before starting the rebuild. It takes
> about 39 minutes to build one (3GB), which is a long time to do without
> autosuggest. I expect it to take less time on my production servers, but
> don't yet know how much less.
>
> Given that it seems to be using the same segment capabilities as the rest of
> Solr, would it not be possible for it to keep the old one around while it
> builds a new one, then switch before deleting the old one? I could not see
> an existing Jira issue on this. Does anyone know of one, or should I create
> it?
>
> Thanks,
> Shawn
>
>
--
Lance Norskog
goksron@gmail.com