Posted to solr-user@lucene.apache.org by Nalini Kartha <na...@gmail.com> on 2012/03/07 01:30:35 UTC

Re: Using multiple DirectSolrSpellcheckers for a query

Hi James,

Thanks for the detailed reply and sorry for the delay getting back.

One issue for us with using the collate functionality is that some of our
query types are default OR (implemented using the mm param value). Since
the collate functionality reruns the query using all param values specified
in the original query, it'll effectively be issuing an OR query again,
right? That means we could again end up with corrections which aren't
the best for the current query.

Another issue we're running into is that we're using unstemmed fields as
the source for our spell correction field and so we could end up
unnecessarily correcting queries containing stemmed versions of words.

So, for example, if I have a document containing "running", my fields look
like this:

docUnstemmed: running
docStemmed: run, ...
spellcheck: running

If a user searches for "run OR jump", there are matching results (since we
search against both the stemmed and unstemmed fields) but the spellcheck
results will contain corrections for "run", let's say "sun". We don't want
to overcorrect queries which are returning valid results like this one. Any
suggestions for how to deal with this?

I was thinking that there might be value in having another dictionary which
is used for vetting words but not for finding corrections - the stemmed
fields could be used as a source for this dictionary. So before finding
corrections for a term if it doesn't exist in the primary dictionary, check
the secondary dictionary and make sure the term does not exist in it as
well. But then, this would require an extra copyfield (we could have
multiple unstemmed fields as a source for this secondary dictionary) and
bloat the index even more so I'm not sure if it's feasible.
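
The vetting idea above can be sketched in a few lines of Python. This is only
an illustration, not an existing Solr feature: plain sets stand in for the two
dictionaries, and the function name and the sample terms are hypothetical.

```python
def terms_needing_correction(query_terms, primary_dictionary, vetting_dictionary):
    """Return the query terms that should be sent for spelling correction.

    A term is skipped if it appears in the primary (unstemmed) dictionary,
    or if it appears in the secondary vetting dictionary built from the
    stemmed fields. Only terms missing from both are candidates.
    """
    return [
        term
        for term in query_terms
        if term not in primary_dictionary and term not in vetting_dictionary
    ]


# Hypothetical dictionaries: unstemmed terms are the source of corrections,
# stemmed terms are used only for vetting.
primary = {"running", "jump", "sun"}
vetting = {"run", "jump"}

valid_query = terms_needing_correction(["run", "jump"], primary, vetting)
misspelled_query = terms_needing_correction(["eun", "jump"], primary, vetting)
```

With these sets, "run" is left alone because the vetting dictionary knows it,
while "eun" is still passed on for correction.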

Thanks,
Nalini

On Thu, Jan 26, 2012 at 10:23 AM, Dyer, James <Ja...@ingrambook.com> wrote:

> Nalini,
>
> Right now the best you can do is to use <copyField> to combine everything
> into a catch-all for spellchecking purposes.  While this seems wasteful,
> this often has to be done anyhow because typically you'll need
> less/different analysis for spellchecking than for searching.  But rather
> than having separate <copyField>s to create multiple dictionaries, put
> everything into one field to create a single "master" dictionary.
>
> From there, you need to set "spellcheck.collate" to true and also
> "spellcheck.maxCollationTries" greater than zero (5-10 usually works).  The
> first parameter tells it to generate re-written queries with spelling
> suggestions (collations).  The second parameter tells it to weed out any
> collations that won't generate hits if you re-query them.  This is
> important because having unrelated keywords in your master dictionary will
> increase the chances the spellchecker will pick the wrong words as
> corrections.
>
> There is a significant caveat to this:  The spellchecker typically only
> suggests for words in the dictionary.  So by creating a huge, master
> dictionary you might find that many misspelled words won't generate
> suggestions.  See this thread for some workarounds:
> http://lucene.472066.n3.nabble.com/Improving-Solr-Spell-Checker-Results-td3658411.html
>
> I think having multiple, per-field dictionaries as you suggest might be a
> good way to go.  While this is not supported, I don't think it's because of
> performance concerns.  (There would be an overhead cost to this but I think
> it would still be practical).  It just hasn't been implemented yet.  But we
> might be getting to a possible start to this type of functionality.  In
> https://issues.apache.org/jira/browse/SOLR-2585 a separate spellchecker
> is added that just corrects wordbreak (or is it "word break"?) problems,
> then a "ConjunctionSolrSpellChecker" combines the results from the main
> spellchecker and the wordbreak spellchecker.  I could see a next step beyond
> this being to support per-field dictionaries, checking them separately,
> then combining the results.
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
> -----Original Message-----
> From: Nalini Kartha [mailto:nalinikartha@gmail.com]
> Sent: Wednesday, January 25, 2012 11:56 AM
> To: solr-user@lucene.apache.org
> Subject: Using multiple DirectSolrSpellcheckers for a query
>
> Hi,
>
> We are trying to use the DirectSolrSpellChecker to get corrections for
> mis-spelled query terms directly from fields in the Solr index.
>
> However, we need to use multiple fields for spellchecking a query. It
> looks like you can only use one spellchecker for a request and so the
> workaround for this is to create a copy field from the fields required for
> spell correction?
>
> We'd like to avoid this because we allow users to perform different kinds
> of queries on different sets of fields and so to provide meaningful
> corrections we'd have to create multiple copy fields - one for each query
> type.
>
> Is there any reason why Solr doesn't support using multiple spellcheckers
> for a query? Is it because of performance overhead?
>
> Thanks,
> Nalini
>

RE: Using multiple DirectSolrSpellcheckers for a query

Posted by "Dyer, James" <Ja...@ingrambook.com>.

Nalini,

You're at least the second person to mention a need to override "mm" in conjunction with "maxCollationTries".  I opened https://issues.apache.org/jira/browse/SOLR-3211 to see about getting this addressed.  (I'm not sure it will be done soon, though.)  The only workaround I can think of is to use the "spellcheck.q" parameter and insert "AND" between all your keywords.
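
A minimal client-side sketch of that workaround, assuming the application builds the "spellcheck.q" value itself before sending the request. The function name is hypothetical, and the sketch only handles simple whitespace-separated keywords (no phrases, field prefixes, or parentheses).

```python
def build_spellcheck_q(user_query):
    """Join the user's keywords with AND for use as the spellcheck.q value.

    Explicit AND/OR operators typed by the user are dropped so that every
    remaining keyword is required when collations are tested.
    """
    keywords = [
        token for token in user_query.split()
        if token.upper() not in {"AND", "OR"}
    ]
    return " AND ".join(keywords)


spellcheck_q = build_spellcheck_q("run OR jump")
```

The main query keeps its original "mm" setting; only the string passed as "spellcheck.q" is rewritten.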

I'm not sure I can think of an easy solution to your other problem.  The fact is, if a user enters "run", how do you know he meant "running" and not "sun"?  I mean, if either substitution results in hits, then the user could have meant either, right?  (This is a fake example though, because if "run" is in your dictionary, the spellchecker will not even try to correct it.  Maybe a better example is if the user entered "eun", which could correct to either "run" or "sun".)

If you really hate this behavior, maybe you could also solve this using "spellcheck.q".  What if you had something like this:

?q=eun jump
&mm=0
&defType=dismax
&qf=docStemmed
&spellcheck=true 
{lotsa spellcheck params here} 
&spellcheck.q=docUnstemmed:(eun AND jump)

...now it won't both correct and stem.  The corrections would need to match the raw keyword.  Is this closer to what you want?
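
For reference, here is a hedged Python sketch of how a client might assemble the request above. The function name is hypothetical. The elided spellcheck parameters stay unspecified; the only two added here are "spellcheck.collate" and "spellcheck.maxCollationTries", which come from earlier in this thread, and the value 5 is just one point in the 5-10 range mentioned there.

```python
from urllib.parse import parse_qs, urlencode


def build_spellcheck_params(user_query, search_field="docStemmed",
                            spell_field="docUnstemmed"):
    """Build request parameters that search one field but spellcheck another.

    The main query runs against the stemmed field, while spellcheck.q
    restricts correction to the unstemmed field with every keyword required.
    """
    keywords = user_query.split()
    spellcheck_q = f"{spell_field}:({' AND '.join(keywords)})"
    return {
        "q": user_query,
        "mm": "0",
        "defType": "dismax",
        "qf": search_field,
        "spellcheck": "true",
        "spellcheck.collate": "true",
        "spellcheck.maxCollationTries": "5",  # assumed value, see lead-in
        "spellcheck.q": spellcheck_q,
    }


params = build_spellcheck_params("eun jump")
# URL-encoded query string, ready to append to the select handler URL.
query_string = urlencode(params)
```

Encoding with urlencode takes care of the spaces, colon, and parentheses in "spellcheck.q", which are easy to get wrong when the URL is written by hand.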

One other note here: it looks like your "docUnstemmed" and "spellcheck" fields have the same or very similar analysis, so you might not need both of them.  Dropping one could be a way to save some index bloat.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
