Posted to solr-user@lucene.apache.org by David Radunz <da...@boxen.net> on 2012/01/14 06:42:11 UTC

Improving Solr Spell Checker Results

Hey,

     Firstly I would like to thank you all for creating such a great 
searching platform. What I was wondering is whether it is possible to:

1. Have the spell checker take into account multiple words. For example 
if I search for "Sigourney Wever" it doesn't flag as a spelling issue as 
'wever' is a correctly spelled word. And if I searched for "Sigourney 
Wevr" the suggestion is "Sigourney Wever". Of course the correct 
spelling is: Sigourney Weaver
2. Have the spell checker return corrections only for dictionary items 
added on the field being searched. i.e. Searching for an actor would 
only use the dictionary fields from the actor. This makes sense on many 
levels, as when you are field searching it's useless to get a correction
from another field as no values would match in any case.

Hopefully someone can help!

Thanks in advance,

David

RE: Improving Solr Spell Checker Results

Posted by "Dyer, James" <Ja...@ingrambook.com>.
After a quick look at DirectSolrSpellChecker, I agree that using it with the "thresholdTokenFrequency" parameter may provide an additional workaround for David's situation.  One caveat is that terms like "wever" need to always be low-frequency.  Also, DirectSolrSpellChecker is available only for 4.x/Trunk, where it is the default spellcheck impl.  But if using 4.x/Trunk, you can possibly do even better by applying the SOLR-2585 patch:  even if the misspelled word is high-frequency yet wrong in context, this patch would still allow you to get suggestions.  (The downside being that SOLR-2585 is brand-new and hasn't seen much scrutiny yet.)

This is different behavior than IndexBasedSpellChecker, which will never give suggestions for a term in the index (unless of course you use "onlyMorePopular").  With IndexBasedSpellChecker, "thresholdTokenFrequency" only removes low-frequency terms from possibly being suggested.  It does not control which terms will generate suggestions.  IndexBasedSpellChecker is the default spellcheck impl for 3.x and earlier versions.

Thank you for clarifying this important difference between the two spellcheck impls.
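
To make the difference concrete, here is a minimal solrconfig.xml sketch of the two checkers side by side. The checker names "direct" and "indexed" and the 0.001 value are illustrative assumptions; the field name "spell" and the index directory are borrowed from the config quoted later in this thread. The comments restate the behavior described above and are not checked against the source.

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">

  <!-- 4.x/Trunk: works directly against the main index -->
  <lst name="spellchecker">
    <str name="name">direct</str> <!-- hypothetical name -->
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="field">spell</str>
    <!-- suggestions must occur in at least 0.1% of the documents -->
    <float name="thresholdTokenFrequency">0.001</float>
  </lst>

  <!-- 3.x and earlier: builds a separate spelling index -->
  <lst name="spellchecker">
    <str name="name">indexed</str> <!-- hypothetical name -->
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">spellchecker</str>
    <!-- only keeps low-frequency terms out of the suggestions; it does not
         decide which query terms get suggestions in the first place -->
    <float name="thresholdTokenFrequency">0.001</float>
    <str name="buildOnCommit">true</str>
  </lst>

</searchComponent>
```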

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: O. Klein [mailto:klein@octoweb.nl] 
Sent: Wednesday, January 18, 2012 7:22 AM
To: solr-user@lucene.apache.org
Subject: RE: Improving Solr Spell Checker Results


Dyer, James wrote
> 
> David,
> 
> The spellchecker normally won't give suggestions for any term in your
> index.  So even if "wever" is misspelled in context, if it exists in the
> index the spell checker will not try correcting it.  There are 3
> workarounds:
> 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only). 
> See https://issues.apache.org/jira/browse/SOLR-2585
> 

When using trunk and DirectSolrSpellChecker I do get suggestions for terms
that are in the index. Lowering the thresholdTokenFrequency to 0.001 in my
case is giving me very good suggestions even if documents with the
misspelled word in them were found.

This combined with maxCollationTries (with all terms required) is giving
some sort of context sensitive suggestions.

Is this correct or is there something I'm missing?


--
View this message in context: http://lucene.472066.n3.nabble.com/Improving-Solr-Spell-Checker-Results-tp3658411p3669186.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Improving Solr Spell Checker Results

Posted by David Radunz <da...@boxen.net>.
On 19/01/2012 12:21 AM, O. Klein wrote:
> Dyer, James wrote
>> David,
>>
>> The spellchecker normally won't give suggestions for any term in your
>> index.  So even if "wever" is misspelled in context, if it exists in the
>> index the spell checker will not try correcting it.  There are 3
>> workarounds:
>> 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).
>> See https://issues.apache.org/jira/browse/SOLR-2585
>>
> When using trunk and DirectSolrSpellChecker I do get suggestions for terms
> that are in the index. Lowering the thresholdTokenFrequency to 0.001 in my
> case is giving me very good suggestions even if documents with the
> misspelled word in them were found.
>
> This combined with maxCollationTries (with all terms required) is giving
> some sort of context sensitive suggestions.
>
> Is this correct or is there something I'm missing?

Hey,

     Thanks for the input, but setting the thresholdTokenFrequency to 
0.001 has now excluded spell check suggestions that were working 
correctly. E.g. 'Matrx' now gets no suggestion, but when I remove the 
threshold again it suggests 'Matrix'. So I guess to use this I would have 
to constantly reconfigure this property as the product database grows, 
which isn't really what I wanted.
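
A rough worked example of why one fixed threshold behaves differently as the catalogue grows. It assumes the value is read as a fraction of documents, as the comment in the example config suggests ("require suggestions to occur in 0.1% of the documents"). The 30,000 figure is the product count mentioned elsewhere in this thread; the 300,000 figure is purely hypothetical.

```xml
<!-- cutoff = thresholdTokenFrequency x number of documents
     0.001 x  30,000 docs =  30 docs
     0.001 x 300,000 docs = 300 docs (hypothetical later catalogue size)
     A term found in fewer documents than the cutoff is never suggested. -->
<float name="thresholdTokenFrequency">0.001</float>
```

On that reading, 'Matrix' would drop out of the suggestions whenever it occurs in fewer documents than the cutoff, which would be consistent with the behaviour described above.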

Thanks for your input though,

David


RE: Improving Solr Spell Checker Results

Posted by "O. Klein" <kl...@octoweb.nl>.
Dyer, James wrote
> 
> David,
> 
> The spellchecker normally won't give suggestions for any term in your
> index.  So even if "wever" is misspelled in context, if it exists in the
> index the spell checker will not try correcting it.  There are 3
> workarounds:
> 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only). 
> See https://issues.apache.org/jira/browse/SOLR-2585
> 

When using trunk and DirectSolrSpellChecker I do get suggestions for terms
that are in the index. Lowering the thresholdTokenFrequency to 0.001 in my
case is giving me very good suggestions even if documents with the
misspelled word in them were found.

This combined with maxCollationTries (with all terms required) is giving
some sort of context sensitive suggestions.

Is this correct or is there something I'm missing?
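
For reference, a sketch of request-handler defaults along the lines described above. The values are illustrative assumptions, and q.op=AND is only one guess at what "all terms required" means here (with dismax/edismax, mm=100% would be the counterpart).

```xml
<lst name="defaults">
  <str name="spellcheck">true</str>
  <!-- build whole-query collations rather than only per-term suggestions -->
  <str name="spellcheck.collate">true</str>
  <!-- test up to 10 candidate collations against the index and keep only
       those that actually return hits -->
  <int name="spellcheck.maxCollationTries">10</int>
  <int name="spellcheck.maxCollations">1</int>
  <!-- assumption: require every query term to match -->
  <str name="q.op">AND</str>
</lst>
```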



Re: Improving Solr Spell Checker Results

Posted by David Radunz <da...@boxen.net>.
Hey,

     Thanks for that, I have uploaded a new patch as advised.

Cheers,

David

On 23/01/2012 1:01 PM, Erick Erickson wrote:
> David:
>
> There's some good info here:
> http://wiki.apache.org/solr/HowToContribute#Working_With_Patches
>
> But the short form is to go into solr_home and issue this command:
> 'svn diff > SOLR-2585.patch'. IDEs may also have a "create patch"
> feature, but I find the straight SVN command more reliable.
>
> Note I'm not saying that your patch will necessarily be picked up, but
> it's a thoughtful gesture to upload a more current patch. In your
> comments please identify what code line you're working on (4.x? 3.x?).
>
> And when you upload, down near the bottom of the dialog box there'll be
> a radio button about "grant ASF license" which is fairly important to
> click for legal reasons....
>
> Thanks
> Erick
>
> On Sun, Jan 22, 2012 at 5:54 PM, David Radunz<da...@boxen.net>  wrote:
>> Hey Erick,
>>
>>     Sure, can you explain the process to create the patch and upload it and
>> I'll do it first thing tomorrow.
>>
>> Thanks again for your help,
>>
>> David
>>
>>
>> On 23/01/2012 12:51 PM, Erick Erickson wrote:
>>> I can't help with your *real* problem, but when looking at patches,
>>> if the "resolution" field isn't set to something like "fixed" it means
>>> that the patch has NOT  been applied to any code lines. There
>>> also should be commit revisions specified in the comments.
>>> If "Fix Versions" has values, that doesn't mean the patch has
>>> been applied either, that's often just a statement of where
>>> the patch *should* go.
>>>
>>> And, between the time someone uploads a patch and it actually
>>> gets *committed*, the underlying code line can, indeed,  change
>>> and the patch doesn't apply cleanly. Since you've already had
>>> to do this, could you upload your version that *does* apply
>>> cleanly?
>>>
>>> Best
>>> Erick
>>>
>>> On Sun, Jan 22, 2012 at 2:56 AM, David Radunz<da...@boxen.net>    wrote:
>>>> James,
>>>>
>>>>     I worked out that I actually needed to 'apply' patch SOLR-2585, whoops.
>>>> So I have done that now and it seems to return 'correctlySpelled=true' for
>>>> 'Sigorney Wever' (when Sigorney isn't even in the dictionary). Could
>>>> something have changed in the trunk to make your patch no longer work? I had
>>>> to manually merge the setup for the test case due to a new 'hyphens' test
>>>> case. The settings I am using are:
>>>>
>>>> <lst name="defaults">
>>>> <str name="echoParams">explicit</str>
>>>> <int name="rows">10</int>
>>>>
>>>> <str name="spellcheck.onlyMorePopular">false</str>
>>>> <int name="spellcheck.count">10</int>
>>>> <str name="spellcheck.extendedResults">true</str>
>>>> <str name="spellcheck.collate">true</str>
>>>> <str name="spellcheck.collateExtendedResults">true</str>
>>>> <int name="spellcheck.maxCollationTries">10</int>
>>>> <int name="spellcheck.maxCollations">1</int>
>>>>
>>>> <int name="spellcheck.alternativeTermCount">5</int>
>>>> <int name="spellcheck.maxResultsForSuggest">1</int>
>>>> </lst>
>>>>
>>>>
>>>> <lst name="spellchecker">
>>>> <str name="name">default</str>
>>>> <str name="field">spell</str>
>>>> <str name="classname">solr.DirectSolrSpellChecker</str>
>>>>
>>>> <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
>>>> <str name="distanceMeasure">internal</str>
>>>> <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
>>>> <float name="accuracy">0.5</float>
>>>> <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
>>>> <int name="maxEdits">2</int>
>>>> <!-- the minimum shared prefix when enumerating terms -->
>>>> <int name="minPrefix">1</int>
>>>> <!-- maximum number of inspections per result. -->
>>>> <int name="maxInspections">5</int>
>>>> <!-- minimum length of a query term to be considered for correction -->
>>>> <int name="minQueryLength">4</int>
>>>> <!-- maximum threshold of documents a query term can appear in to be considered for correction -->
>>>> <float name="maxQueryFrequency">0.01</float>
>>>> <!-- require suggestions to occur in 0.1% of the documents -->
>>>> <!-- <float name="thresholdTokenFrequency">0.001</float> -->
>>>>
>>>> <str name="spellcheckIndexDir">spellchecker</str>
>>>> <str name="buildOnCommit">true</str>
>>>> </lst>
>>>>
>>>> With the query:
>>>>
>>>>
>>>> spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5
>>>>
>>>> Cheers,
>>>>
>>>> David
>>>>
>>>>
>>>>
>>>> On 22/01/2012 2:03 AM, David Radunz wrote:
>>>>> James,
>>>>>
>>>>>     Thanks again for your lengthy and informative response. I updated from
>>>>> SVN trunk again today and was successfully able to run 'ant test'. So I
>>>>> proceeded with trying your suggestions (for question 1 so far):
>>>>>
>>>>> On 17/01/2012 5:32 AM, Dyer, James wrote:
>>>>>> David,
>>>>>>
>>>>>> The spellchecker normally won't give suggestions for any term in your
>>>>>> index.  So even if "wever" is misspelled in context, if it exists in the
>>>>>> index the spell checker will not try correcting it.  There are 3
>>>>>> workarounds:
>>>>>> 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).
>>>>>>   See https://issues.apache.org/jira/browse/SOLR-2585
>>>>> I have tried using this with the original test case of 'Signorney Wever'.
>>>>> I didn't notice any difference, although I am a little unclear as to what
>>>>> exactly this patch does. Nor am I really clear what to set either of the
>>>>> options to, so I set them both to '5'. I tried to find the test case it
>>>>> mentions, but it's not present in SpellCheckCollatorTest.java .. Any
>>>>> suggestions?
>>>>>
>>>>>> 2. Try "onlyMorePopular=true" in your request
>>>>>>   (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).
>>>>>>   But see the September 2, 2011 comment in SOLR-2585 about why this might
>>>>>>   not do what you'd hope it would.
>>>>>
>>>>> Trying this did produce 'Signourney Weaver' as you would hope, but I am a
>>>>> little afraid of the downside. I would much rather have a context-sensitive
>>>>> spell check that involves the terms around the correction.
>>>>>>
>>>>>> 3. If you're building your index on a <copyField />, you can add a
>>>>>> stopword filter that filters out all of the misspelt or rare words from
>>>>>> the field on which the dictionary is based.  This could be an arduous
>>>>>> task, and it may or may not work well for your data.
>>>>> I am currently using a copyField for all terms that are relevant, which is
>>>>> quite a lot, and the dictionary would encompass a huge amount of data.
>>>>> Adding stopword filters would be out of the question as we presently have
>>>>> more than 30,000 products, and this is just for the initial launch; we
>>>>> intend to have many, many more.
>>>>>>
>>>>>> As for your second question, I take it you're using (e)dismax with
>>>>>> multiple fields in "qf", right?  The only way I know to handle this is to
>>>>>> create a <copyField> that combines all of the fields you search across.
>>>>>> Use this combined field to base your dictionary.  Also, specifying
>>>>>> "spellcheck.maxCollationTries" with a non-zero value will weed out the
>>>>>> nonsense word combinations that are likely to occur when doing this,
>>>>>> ensuring that any collations provided will indeed yield hits.  The
>>>>>> downside to doing this, of course, is it will make your first problem more
>>>>>> acute in that there will be even more terms in your index that the
>>>>>> spellchecker will ignore entirely, even if they're misspelled in context.
>>>>>> Once again, SOLR-2585 is designed to tackle this problem but it is still
>>>>>> in its early stages, and thus far it is Trunk-only.
>>>>> I tried setting spellcheck.maxCollationTries to 5 to see if it would help
>>>>> with the above problem, but it did not.
>>>>>
>>>>> I have now tried using it in the context of question 2. I tried searching
>>>>> for 'Sigorney Wever' in the series name (which it's not present in, as
>>>>> it's an actor):
>>>>>
>>>>>
>>>>>
>>>>> spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,series_name_attr_opt_combo&sort=score+desc,release_date+desc&start=0&q=*+series_name:"signourney+wever"^100&spellcheck.q=signourney+wever&fq=store_id:"1"+AND+series_name_attr_opt_search:*signourney*wever*&rows=5&spellcheck.maxCollationTries=5
>>>>>
>>>>> Suggestions for 'Sigourney Wever' were returned, but either no spelling
>>>>> suggestions at all, or only ones drawn from series names (which I doubt
>>>>> there would be), should have been returned.
>>>>>
>>>>>> You might also be interested in
>>>>>> https://issues.apache.org/jira/browse/SOLR-2993 .  Although this is
>>>>>> unrelated to your two questions, the patch on this issue introduces a new
>>>>>> "ConjunctionSolrSpellChecker" which theoretically could be enhanced to do
>>>>>> exactly what you want.  That is, you could (theoretically) create separate
>>>>>> dictionaries for each of the fields you're searching and let the CSSC
>>>>>> combine the results & generate collations, etc.
>>>>>
>>>>> During the upgrade I switched to solr.DirectSolrSpellChecker, which I
>>>>> presume will help with this? I am a senior developer (in
>>>>> Java/Perl/Python/PHP) but I have not as yet looked at any of the Solr
>>>>> source code. So I am in the dark when you say it could be tailored for my
>>>>> needs. Also, how would it work, query-wise? Would it be like
>>>>> spellcheck.series_name.q= and spellcheck.actor.q= and so on? If so, that
>>>>> sounds tempting to try and achieve. But if you could provide any pointers
>>>>> on what exactly would be required, that would really help.
>>>>>
>>>>> Thanks again for your time,
>>>>>
>>>>> David
>>>>>>
>>>>>> James Dyer
>>>>>> E-Commerce Systems
>>>>>> Ingram Content Group
>>>>>> (615) 213-4311
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: David Radunz [mailto:david@boxen.net]
>>>>>> Sent: Friday, January 13, 2012 11:42 PM
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Subject: Improving Solr Spell Checker Results
>>>>>>
>>>>>> Hey,
>>>>>>
>>>>>>       Firstly I would like to thank you all for creating such a great
>>>>>> searching platform. What I was wondering is whether it is possible to:
>>>>>>
>>>>>> 1. Have the spell checker take into account multiple words. For example
>>>>>> if I search for "Sigourney Wever" it doesn't flag as a spelling issue as
>>>>>> 'wever' is a correctly spelled word. And if I searched for "Sigourney
>>>>>> Wevr" the suggestion is "Sigourney Wever". Of course the correct
>>>>>> spelling is: Sigourney Weaver
>>>>>> 2. Have the spell checker return corrections only for dictionary items
>>>>>> added on the field being searched. i.e. Searching for an actor would
>>>>>> only use the dictionary fields from the actor. This makes sense on many
>>>>>> levels, as when you are field searching it's useless to get a correction
>>>>>> from another field as no values would match in any case.
>>>>>>
>>>>>> Hopefully someone can help!
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> David
>>>>>
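
A sketch of what workaround 3 and the combined <copyField> suggestion quoted above might look like in schema.xml. The type name "textSpell", the stopword file "spell_stopwords.txt" and the "actor" source field are assumptions for illustration; "spell" comes from the config in this thread, and "title" and "series_name" appear in the quoted queries.

```xml
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- hand-maintained list of misspelt or rare words to keep out of the
         dictionary (hypothetical file name) -->
    <filter class="solr.StopFilterFactory" words="spell_stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>

<!-- one dictionary field fed by every field that is searched -->
<copyField source="title" dest="spell"/>
<copyField source="series_name" dest="spell"/>
<copyField source="actor" dest="spell"/> <!-- hypothetical field name -->
```

As noted in the thread, the stopword list has to be maintained by hand, which is why it may not be practical for a large and growing catalogue.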


Re: Improving Solr Spell Checker Results

Posted by Erick Erickson <er...@gmail.com>.
David:

There's some good info here:
http://wiki.apache.org/solr/HowToContribute#Working_With_Patches

But the short form is to go into solr_home and issue this command:
'svn diff > SOLR-2585.patch'. IDEs may also have a "create patch"
feature, but I find the straight SVN command more reliable.

Note I'm not saying that your patch will necessarily be picked up, but
it's a thoughtful gesture to upload a more current patch. In your
comments please identify what code line you're working on (4.x? 3.x?).

And when you upload, down near the bottom of the dialog box there'll be
a radio button about "grant ASF license" which is fairly important to
click for legal reasons....

Thanks
Erick

On Sun, Jan 22, 2012 at 5:54 PM, David Radunz <da...@boxen.net> wrote:
> Hey Erick,
>
>    Sure, can you explain the process to create the patch and upload it and
> I'll do it first thing tomorrow.
>
> Thanks again for your help,
>
> David
>
>
> On 23/01/2012 12:51 PM, Erick Erickson wrote:
>>
>> I can't help with your *real* problem, but when looking at patches,
>> if the "resolution" field isn't set to something like "fixed" it means
>> that the patch has NOT  been applied to any code lines. There
>> also should be commit revisions specified in the comments.
>> If "Fix Versions" has values, that doesn't mean the patch has
>> been applied either, that's often just a statement of where
>> the patch *should* go.
>>
>> And, between the time someone uploads a patch and it actually
>> gets *committed*, the underlying code line can, indeed,  change
>> and the patch doesn't apply cleanly. Since you've already had
>> to do this, could you upload your version that *does* apply
>> cleanly?
>>
>> Best
>> Erick
>>
>> On Sun, Jan 22, 2012 at 2:56 AM, David Radunz<da...@boxen.net>  wrote:
>>>
>>> James,
>>>
>>>    I worked out that I actually needed to 'apply' patch SOLR-2585, whoops.
>>> So I have done that now and it seems to return 'correctlySpelled=true' for
>>> 'Sigorney Wever' (when Sigorney isn't even in the dictionary). Could
>>> something have changed in the trunk to make your patch no longer work? I had
>>> to manually merge the setup for the test case due to a new 'hyphens' test
>>> case. The settings I am using are:
>>>
>>> <lst name="defaults">
>>> <str name="echoParams">explicit</str>
>>> <int name="rows">10</int>
>>>
>>> <str name="spellcheck.onlyMorePopular">false</str>
>>> <int name="spellcheck.count">10</int>
>>> <str name="spellcheck.extendedResults">true</str>
>>> <str name="spellcheck.collate">true</str>
>>> <str name="spellcheck.collateExtendedResults">true</str>
>>> <int name="spellcheck.maxCollationTries">10</int>
>>> <int name="spellcheck.maxCollations">1</int>
>>>
>>> <int name="spellcheck.alternativeTermCount">5</int>
>>> <int name="spellcheck.maxResultsForSuggest">1</int>
>>> </lst>
>>>
>>>
>>> <lst name="spellchecker">
>>> <str name="name">default</str>
>>> <str name="field">spell</str>
>>> <str name="classname">solr.DirectSolrSpellChecker</str>
>>>
>>> <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
>>> <str name="distanceMeasure">internal</str>
>>> <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
>>> <float name="accuracy">0.5</float>
>>> <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
>>> <int name="maxEdits">2</int>
>>> <!-- the minimum shared prefix when enumerating terms -->
>>> <int name="minPrefix">1</int>
>>> <!-- maximum number of inspections per result. -->
>>> <int name="maxInspections">5</int>
>>> <!-- minimum length of a query term to be considered for correction -->
>>> <int name="minQueryLength">4</int>
>>> <!-- maximum threshold of documents a query term can appear in to be considered for correction -->
>>> <float name="maxQueryFrequency">0.01</float>
>>> <!-- require suggestions to occur in 0.1% of the documents -->
>>> <!-- <float name="thresholdTokenFrequency">0.001</float> -->
>>>
>>> <str name="spellcheckIndexDir">spellchecker</str>
>>> <str name="buildOnCommit">true</str>
>>> </lst>
>>>
>>> With the query:
>>>
>>>
>>> spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5
>>>
>>> Cheers,
>>>
>>> David
>>>
>>>
>>>
>>> On 22/01/2012 2:03 AM, David Radunz wrote:
>>>>
>>>> James,
>>>>
>>>>    Thanks again for your lengthy and informative response. I updated from
>>>> SVN trunk again today and was successfully able to run 'ant test'. So I
>>>> proceeded with trying your suggestions (for question 1 so far):
>>>>
>>>> On 17/01/2012 5:32 AM, Dyer, James wrote:
>>>>>
>>>>> David,
>>>>>
>>>>> The spellchecker normally won't give suggestions for any term in your
>>>>> index.  So even if "wever" is misspelled in context, if it exists in the
>>>>> index the spell checker will not try correcting it.  There are 3
>>>>> workarounds:
>>>>> 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).
>>>>>  See https://issues.apache.org/jira/browse/SOLR-2585
>>>>
>>>> I have tried using this with the original test case of 'Signorney Wever'.
>>>> I didn't notice any difference, although I am a little unclear as to what
>>>> exactly this patch does. Nor am I really clear what to set either of the
>>>> options to, so I set them both to '5'. I tried to find the test case it
>>>> mentions, but it's not present in SpellCheckCollatorTest.java .. Any
>>>> suggestions?
>>>>
>>>>> 2. Try "onlyMorePopular=true" in your request
>>>>>  (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).
>>>>>  But see the September 2, 2011 comment in SOLR-2585 about why this might
>>>>>  not do what you'd hope it would.
>>>>
>>>> Trying this did produce 'Signourney Weaver' as you would hope, but I am a
>>>> little afraid of the downside. I would much rather have a context-sensitive
>>>> spell check that involves the terms around the correction.
>>>>>
>>>>>
>>>>> 3. If you're building your index on a <copyField />, you can add a
>>>>> stopword filter that filters out all of the misspelt or rare words from
>>>>> the field on which the dictionary is based.  This could be an arduous
>>>>> task, and it may or may not work well for your data.
>>>>
>>>> I am currently using a copyField for all terms that are relevant, which is
>>>> quite a lot, and the dictionary would encompass a huge amount of data.
>>>> Adding stopword filters would be out of the question as we presently have
>>>> more than 30,000 products, and this is just for the initial launch; we
>>>> intend to have many, many more.
>>>>>
>>>>>
>>>>> As for your second question, I take it you're using (e)dismax with
>>>>> multiple fields in "qf", right?  The only way I know to handle this is to
>>>>> create a <copyField> that combines all of the fields you search across.
>>>>> Use this combined field to base your dictionary.  Also, specifying
>>>>> "spellcheck.maxCollationTries" with a non-zero value will weed out the
>>>>> nonsense word combinations that are likely to occur when doing this,
>>>>> ensuring that any collations provided will indeed yield hits.  The
>>>>> downside to doing this, of course, is it will make your first problem more
>>>>> acute in that there will be even more terms in your index that the
>>>>> spellchecker will ignore entirely, even if they're misspelled in context.
>>>>> Once again, SOLR-2585 is designed to tackle this problem but it is still
>>>>> in its early stages, and thus far it is Trunk-only.
>>>>
>>>> I tried setting spellcheck.maxCollationTries to 5 to see if it would help
>>>> with the above problem, but it did not.
>>>>
>>>> I have now tried using it in the context of question 2. I tried searching
>>>> for 'Sigorney Wever' in the series name (which it's not present in, as
>>>> it's an actor):
>>>>
>>>>
>>>>
>>>> spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,series_name_attr_opt_combo&sort=score+desc,release_date+desc&start=0&q=*+series_name:"signourney+wever"^100&spellcheck.q=signourney+wever&fq=store_id:"1"+AND+series_name_attr_opt_search:*signourney*wever*&rows=5&spellcheck.maxCollationTries=5
>>>>
>>>> Suggestions for 'Sigourney Wever' were returned, but either no spelling
>>>> suggestions at all, or only ones drawn from series names (which I doubt
>>>> there would be), should have been returned.
>>>>
>>>>> You might also be interested in
>>>>> https://issues.apache.org/jira/browse/SOLR-2993 .  Although this is
>>>>> unrelated to your two questions, the patch on this issue introduces a new
>>>>> "ConjunctionSolrSpellChecker" which theoretically could be enhanced to do
>>>>> exactly what you want.  That is, you could (theoretically) create separate
>>>>> dictionaries for each of the fields you're searching and let the CSSC
>>>>> combine the results & generate collations, etc.
>>>>
>>>> During the upgrade I switched to solr.DirectSolrSpellChecker, which I
>>>> presume will help with this? I am a senior developer (in
>>>> Java/Perl/Python/PHP) but I have not as yet looked at any of the Solr
>>>> source code. So I am in the dark when you say it could be tailored for my
>>>> needs. Also, how would it work, query-wise? Would it be like
>>>> spellcheck.series_name.q= and spellcheck.actor.q= and so on? If so, that
>>>> sounds tempting to try and achieve. But if you could provide any pointers
>>>> on what exactly would be required, that would really help.
>>>>
>>>> Thanks again for your time,
>>>>
>>>> David
>>>>>
>>>>>
>>>>> James Dyer
>>>>> E-Commerce Systems
>>>>> Ingram Content Group
>>>>> (615) 213-4311
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: David Radunz [mailto:david@boxen.net]
>>>>> Sent: Friday, January 13, 2012 11:42 PM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Improving Solr Spell Checker Results
>>>>>
>>>>> Hey,
>>>>>
>>>>>      Firstly I would like to thank you all for creating such a great
>>>>> searching platform. What I was wondering is whether it is possible to:
>>>>>
>>>>> 1. Have the spell checker take into account multiple words. For example
>>>>> if I search for "Sigourney Wever" it doesn't flag as a spelling issue as
>>>>> 'wever' is a correctly spelled word. And if I searched for "Sigourney
>>>>> Wevr" the suggestion is "Sigourney Wever". Of course the correct
>>>>> spelling is: Sigourney Weaver
>>>>> 2. Have the spell checker return corrections only for dictionary items
>>>>> added on the field being searched. i.e. Searching for an actor would
>>>>> only use the dictionary fields from the actor. This makes sense on many
>>>>> levels, as when you are field searching it's useless to get a correction
>>>>> from another field as no values would match in any case.
>>>>>
>>>>> Hopefully someone can help!
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> David
>>>>
>>>>
>
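
On the quoted question of what to set the two SOLR-2585 options to, here is a hedged reading based on how these parameters were later described in the Solr SpellCheckComponent documentation. The patch as it stood at the time of this thread may have behaved differently, and the values are only examples.

```xml
<lst name="defaults">
  <!-- number of suggestions to return for a query term that DOES exist in
       the index; setting it is what turns on suggestions for such terms -->
  <int name="spellcheck.alternativeTermCount">5</int>
  <!-- only generate suggestions (and report correctlySpelled=false) when
       the query itself returns at most this many hits -->
  <int name="spellcheck.maxResultsForSuggest">5</int>
  <str name="spellcheck.collate">true</str>
  <int name="spellcheck.maxCollationTries">10</int>
</lst>
```

On this reading, a query that already returns more hits than maxResultsForSuggest would be reported as correctly spelled, so the value is worth checking against the number of hits the test query actually returns.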

Re: Improving Solr Spell Checker Results

Posted by David Radunz <da...@boxen.net>.
Hey Erick,

     Sure, can you explain the process to create the patch and upload it 
and I'll do it first thing tomorrow.

Thanks again for your help,

David

On 23/01/2012 12:51 PM, Erick Erickson wrote:
> I can't help with your *real* problem, but when looking at patches,
> if the "resolution" field isn't set to something like "fixed" it means
> that the patch has NOT  been applied to any code lines. There
> also should be commit revisions specified in the comments.
> If "Fix Versions" has values, that doesn't mean the patch has
> been applied either, that's often just a statement of where
> the patch *should* go.
>
> And, between the time someone uploads a patch and it actually
> gets *committed*, the underlying code line can, indeed,  change
> and the patch doesn't apply cleanly. Since you've already had
> to do this, could you upload your version that *does* apply
> cleanly?
>
> Best
> Erick
>
> On Sun, Jan 22, 2012 at 2:56 AM, David Radunz<da...@boxen.net>  wrote:
>> James,
>>
>>     I worked out that I actually needed to 'apply' patch SOLR-2585, whoops.
>> So I have done that now and it seems to return 'correctlySpelled=true' for
>> 'Sigorney Wever' (when Sigorney isn't even in the dictionary). Could
>> something have changed in the trunk to make your patch no longer work? I had
>> to manually merge the setup for the test case due to a new 'hyphens' test
>> case. The settings I am using are:
>>
>> <lst name="defaults">
>> <str name="echoParams">explicit</str>
>> <int name="rows">10</int>
>>
>> <str name="spellcheck.onlyMorePopular">false</str>
>> <int name="spellcheck.count">10</int>
>> <str name="spellcheck.extendedResults">true</str>
>> <str name="spellcheck.collate">true</str>
>> <str name="spellcheck.collateExtendedResults">true</str>
>> <int name="spellcheck.maxCollationTries">10</int>
>> <int name="spellcheck.maxCollations">1</int>
>>
>> <int name="spellcheck.alternativeTermCount">5</int>
>> <int name="spellcheck.maxResultsForSuggest">1</int>
>> </lst>
>>
>>
>> <lst name="spellchecker">
>> <str name="name">default</str>
>> <str name="field">spell</str>
>> <str name="classname">solr.DirectSolrSpellChecker</str>
>>
>> <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
>> <str name="distanceMeasure">internal</str>
>> <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
>> <float name="accuracy">0.5</float>
>> <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
>> <int name="maxEdits">2</int>
>> <!-- the minimum shared prefix when enumerating terms -->
>> <int name="minPrefix">1</int>
>> <!-- maximum number of inspections per result. -->
>> <int name="maxInspections">5</int>
>> <!-- minimum length of a query term to be considered for correction -->
>> <int name="minQueryLength">4</int>
>> <!-- maximum threshold of documents a query term can appear in to be considered for correction -->
>> <float name="maxQueryFrequency">0.01</float>
>> <!-- require suggestions to occur in 0.1% of the documents -->
>> <!-- <float name="thresholdTokenFrequency">0.001</float> -->
>>
>> <str name="spellcheckIndexDir">spellchecker</str>
>> <str name="buildOnCommit">true</str>
>> </lst>
>>
>> With the query:
>>
>> spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5
>>
>> Cheers,
>>
>> David
>>
>>
>>
>> On 22/01/2012 2:03 AM, David Radunz wrote:
>>> James,
>>>
>>>     Thanks again for your lengthy and informative response. I updated from
>>> SVN trunk again today and was successfully able to run 'ant test'. So I
>>> proceeded with trying your suggestions (for question 1 so far):
>>>
>>> On 17/01/2012 5:32 AM, Dyer, James wrote:
>>>> David,
>>>>
>>>> The spellchecker normally won't give suggestions for any term in your
>>>> index.  So even if "wever" is misspelled in context, if it exists in the
>>>> index the spell checker will not try correcting it.  There are 3
>>>> workarounds:
>>>> 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).
>>>>   See https://issues.apache.org/jira/browse/SOLR-2585
>>> I have tried using this with the original test case of 'Signorney Wever'.
>>> I didn't notice any difference, although I am a little unclear as to what
>>> exactly this patch does. Nor am I really clear what to set either of the
>>> options to, so I set them both to '5'. I tried to find the test case it
>>> mentions, but it's not present in SpellCheckCollatorTest.java .. Any
>>> suggestions?
>>>
>>>> 2. try "onlyMorePopular=true" in your request.
>>>>   (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).
>>>>   But see the September 2, 2011 comment in SOLR-2585 about why this might not
>>>> do what you'd hope it would.
>>>
>>> Trying this did produce 'Sigourney Weaver' as you would hope, but I am a
>>> little afraid of the downside. I would much prefer a context-sensitive
>>> spell check that takes the terms around the correction into account.
>>>>
>>>> 3. If you're building your index on a <copyField />, you can add a
>>>> stopword filter that filters out all of the misspelt or rare words from the
>>>> field that the dictionary is based on.  This could be an arduous task, and it
>>>> may or may not work well for your data.
>>> I am currently using a copyField for all terms that are relevant, which is
>>> quite a lot and the dictionary would encompass a huge amount of data. Adding
>>> stopword filters would be out of the question as we presently have more than
>>> 30,000 products and this is for the initial launch, we intend to have many
>>> many more.
>>>>
>>>> As for your second question, I take it you're using (e)dismax with
>>>> multiple fields in "qf", right?  The only way I know to handle this is to
>>>> create a <copyField> that combines all of the fields you search across.  Use
>>>> this combined field to base your dictionary.  Also, specifying
>>>> "spellcheck.maxCollationTries" with a non-zero value will weed out the
>>>> nonsense word combinations that are likely to occur when doing this,
>>>> ensuring that any collations provided will indeed yield hits.  The downside
>>>> to doing this, of course, is it will make your first problem more acute in
>>>> that there will be even more terms in your index that the spellchecker will
>>>> ignore entirely, even if they're misspelled in context.  Once again,
>>>> SOLR-2585 is designed to tackle this problem but it is still in its early
>>>> stages, and thus far it is Trunk-only.
>>> I tried setting spellcheck.maxCollationTries to 5 to see if it would help
>>> with the above problem, but it did not.
>>>
>>> I have now tried using it in the context of question 2. I tried searching
>>> for 'Sigorney Wever' in the series name (which it's not present in, as it's
>>> an actor):
>>>
>>>
>>> spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,series_name_attr_opt_combo&sort=score+desc,release_date+desc&start=0&q=*+series_name:"signourney+wever"^100&spellcheck.q=signourney+wever&fq=store_id:"1"+AND+series_name_attr_opt_search:*signourney*wever*&rows=5&spellcheck.maxCollationTries=5
>>>
>>> Suggestions for 'Sigourney Weaver' were returned, but either no spelling
>>> suggestions at all, or only ones drawn from series names (which I doubt
>>> exist), should have been returned.
>>>
>>>> You might also be interested in
>>>> https://issues.apache.org/jira/browse/SOLR-2993 .  Although this is
>>>> unrelated to your two questions, the patch on this issue introduces a new
>>>> "ConjunctionSolrSpellChecker" which theoretically could be enhanced to do
>>>> exactly what you want.  That is, you could (theoretically) create separate
>>>> dictionaries for each of the fields you're searching and let the CSSC
>>>> combine the results and generate collations, etc.
>>>
>>> During the upgrade I switched to solr.DirectSolrSpellChecker, which I
>>> presume will help with this? I am a senior developer (in
>>> Java/Perl/Python/PHP) but I have not as yet looked at any of the Solr source
>>> code. So I am in the dark when you say it could be tailored for my needs.
>>> Also, how would it work? Query wise.. Would it be like..
>>> spellcheck.series_name.q= and spellcheck.actor.q= and so on? If so that
>>> sounds tempting to try and achieve. But if you could provide any pointers in
>>> what exactly would be required that would really help.
>>>
>>> Thanks again for your time,
>>>
>>> David
>>>>
>>>> James Dyer
>>>> E-Commerce Systems
>>>> Ingram Content Group
>>>> (615) 213-4311
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: David Radunz [mailto:david@boxen.net]
>>>> Sent: Friday, January 13, 2012 11:42 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Improving Solr Spell Checker Results
>>>>
>>>> Hey,
>>>>
>>>>       Firstly I would like to thank you all for creating such a great
>>>> searching platform. What I was wondering is whether it is possible to:
>>>>
>>>> 1. Have the spell checker take into account multiple words. For example
>>>> if I search for "Sigourney Wever" it doesn't flag as a spelling issue as
>>>> 'wever' is a correctly spelled word. And if I searched for "Sigourney
>>>> Wevr" the suggestion is "Sigourney Wever". Of course the correct
>>>> spelling is: Sigourney Weaver
>>>> 2. Have the spell checker return corrections only for dictionary items
>>>> added on the field being searched. i.e. Searching for an actor would
>>>> only use the dictionary fields from the actor. This makes sense on many
>>>> levels, as when you are field searching it's useless to get a correction
>>>> from another field as no values would match in any case.
>>>>
>>>> Hopefully someone can help!
>>>>
>>>> Thanks in advance,
>>>>
>>>> David
>>>


Re: Improving Solr Spell Checker Results

Posted by Erick Erickson <er...@gmail.com>.
I can't help with your *real* problem, but when looking at patches,
if the "resolution" field isn't set to something like "fixed" it means
that the patch has NOT been applied to any code lines. There
also should be commit revisions specified in the comments.
If "Fix Versions" has values, that doesn't mean the patch has
been applied either; that's often just a statement of where
the patch *should* go.

And, between the time someone uploads a patch and it actually
gets *committed*, the underlying code line can, indeed, change
so that the patch no longer applies cleanly. Since you've already
had to merge it by hand, could you upload your version that *does*
apply cleanly?

Best
Erick


RE: Improving Solr Spell Checker Results

Posted by "Dyer, James" <Ja...@ingrambook.com>.
David,

Thank you for taking the time to evaluate SOLR-2585.  Perhaps the title of the issue advertises more than it delivers?  (The name is borrowed from a section in the first book listed here: http://wiki.apache.org/lucene-java/InformationRetrieval)  In any case, I think SOLR-2585 is a step forward.  The idea is that some words are "correctly spelled" in that they exist in the dictionary, yet are incorrect in the context of the user's query.  The patch that is out there just tries to find the user something that works.  It sounds like you want it to find the _best_ something that works, and it's not doing a good job at that.

The solution so far is crude:  it just takes the most promising words (based on low edit-distance and higher doc frequency) and tries re-querying different combinations until it finds some that give you hits.  There are no doubt a ton of ways to make this more efficient (and effective).  The book I mention says to look at 2-word shingle combinations or possibly check the query log for combinations that have worked in the past.  I would imagine as time goes on someone would implement things like this for Solr.
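
The re-querying loop described above can be sketched roughly as follows. This is only an illustration in Python, not the SOLR-2585 code: the function names, the toy DOCS list and the plain left-to-right ordering of combinations are all assumptions made for the example (the real patch ranks combinations by edit distance and doc frequency).

```python
from itertools import product

# Toy "index": each string stands in for one document (hypothetical data).
DOCS = ["sigourney weaver alien", "dream weaver", "wever street"]


def count_hits(words):
    """Count documents containing every word (stands in for a re-query)."""
    return sum(all(w in doc.split() for w in words) for doc in DOCS)


def find_collations(suggestions, count_hits, max_tries=10, max_collations=1):
    """Try word combinations until some of them return hits.

    suggestions: one list of candidate words per query term, best first.
    count_hits:  callable taking a list of words and returning a hit count.
    """
    collations = []
    tries = 0
    for candidate in product(*suggestions):
        if tries >= max_tries or len(collations) >= max_collations:
            break
        tries += 1
        hits = count_hits(list(candidate))
        if hits > 0:
            # Keep only combinations that actually match documents.
            collations.append((" ".join(candidate), hits))
    return collations


suggestions = [["sigourney"], ["wever", "weaver"]]
print(find_collations(suggestions, count_hits))
```

Note that this stops at the first combinations that work, which is why it finds "something that works" rather than the best one.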

Your idea to have it consider term proximity is interesting.  Perhaps we can hack this with the current code by changing your "spellcheck.q" to a phrase query?  Or, if the user entered more than 2 words, by adding slop as well, so that it would consider words 1 or 2 positions apart but not further?  Of course this will *eliminate* collations that don't meet the phrase requirements, and you probably would rather just have it rank them lower, right? (This would require better code!)
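
As a rough sketch of that hack, the client could build the "spellcheck.q" value like this. The helper name and the default slop of 2 are assumptions for illustration; the "..."~N proximity form is standard Lucene query syntax, but whether the collation check honors it the way one would hope is untested here.

```python
def phrase_spellcheck_q(user_query, max_slop=2):
    """Wrap the user's words in a phrase query for use as spellcheck.q.

    Two words become an exact phrase; three or more get a slop value so
    that words up to max_slop positions apart still match.
    Assumes the input contains no double quotes of its own.
    """
    terms = user_query.split()
    if len(terms) < 2:
        # A single term cannot form a phrase; pass it through unchanged.
        return user_query
    phrase = '"' + " ".join(terms) + '"'
    if len(terms) > 2:
        phrase += "~" + str(max_slop)
    return phrase


print(phrase_spellcheck_q("sigorney wever"))
print(phrase_spellcheck_q("pork and chups"))
```

The returned string would still need URL-encoding before being sent as a request parameter.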

In my use-cases we usually require 100% terms (mm=100%), so at the time "spellcheck.maxResultsForSuggest" seemed to make sense.  If all the terms are required, then by default the spellchecker returns nothing if even only 1 result is returned.  So from my perspective this parameter makes it more flexible:  set it to 5, for instance, and now you get spelling suggestions if the query returns 0-5 results.  But I can see how this might not be what you want in cases where mm<100%.  It might be awful for mm=0.  Can you think of something better?
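
To make the gating concrete, here is a minimal model of the behavior as described in this thread. It is a simplification, not the patch's actual logic: the function name is hypothetical, and the "no parameter" branch only models the mm=100% case discussed above.

```python
def should_suggest(num_hits, max_results_for_suggest=None):
    """Decide whether spelling suggestions are returned for a query.

    Models only the description in this thread: without the parameter
    (and with all terms required) any hit suppresses suggestions; with
    it, suggestions appear while hits stay at or below the threshold.
    """
    if max_results_for_suggest is None:
        return num_hits == 0
    return num_hits <= max_results_for_suggest


print(should_suggest(3, max_results_for_suggest=5))     # within 0-5 results
print(should_suggest(100, max_results_for_suggest=10))  # the "all or nothing" case
```

The second call shows the objection raised below: once the hit count passes the threshold, nothing is suggested, however wrong an individual term may be.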

One more thing:  Would you be ok if I took some of our comments here and added them to the JIRA issue?  As this is code that is not even in trunk, it would be helpful to track our comments in JIRA and get visibility on the dev-list also, where discussions about unincorporated patches usually occur.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: David Radunz [mailto:david@boxen.net] 
Sent: Sunday, January 22, 2012 6:42 AM
To: solr-user@lucene.apache.org
Subject: Re: Improving Solr Spell Checker Results

Hey James,

     I have played around a bit more with the settings and tried setting
spellcheck.maxResultsForSuggest=100 and spellcheck.maxCollations=3. This
yields 'Sigourney Weaver' as ONE of the corrections, but it's the second
one and not the first. That is wrong if this is a patch for
'context-sensitive' spell checking, because it doesn't really seem to
honor any context at all. Unless I am misunderstanding this? Also, I
don't really like maxResultsForSuggest as it means 'all or nothing'. If
you set it to 10 and there are 100 results, then you offer no corrections
at all even if the term is missing in the dictionary entirely.

     If I set spellcheck.maxResultsForSuggest=100 and
spellcheck.maxCollations=3 and choose the collation with the largest
'hits' I get Sigourney Weaver and other 'popular' terms. But say I
searched for 'pork and chups', the 'popular' correction is 'park and
chips' whereas the first correction was correct: 'pork and chips'.

     So really, none of the solutions either in this patch or Solr offer
what I would truly call context-sensitive spell checking. That is,
in a full text search engine you find documents based on terms and how
close they are together in the document. It makes more than perfect
sense to treat the dictionary like this, so that when there are multiple
terms it offers suggestions for the terms that match closely to what's
entered surrounding the term.

Example:

     "Sigourney Wever" would never appear in a document ever.
     "Sigourney Weaver" however has many 'hits' in exactly that order of 
words.

So there needs to be a way to boost suggestions based on adjacency...  
Much like the full text search operates.
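
A minimal sketch of what boosting by adjacency could look like, assuming a toy in-memory list of documents and hypothetical function names (nothing here is an existing Solr feature): each candidate collation is scored by how many documents contain its words side by side, in order.

```python
def phrase_hits(docs, words):
    """Count documents in which the words appear adjacent and in order."""
    n = len(words)
    count = 0
    for doc in docs:
        tokens = doc.lower().split()
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            count += 1
    return count


def rank_by_adjacency(docs, candidates):
    """Order candidate collations by exact-phrase hits, highest first.

    sorted() is stable, so ties keep the spell checker's original order.
    """
    scored = [(phrase_hits(docs, c.lower().split()), c) for c in candidates]
    return [c for hits, c in sorted(scored, key=lambda s: -s[0])]


# Hypothetical documents, for illustration only.
docs = [
    "sigourney weaver stars in alien",
    "aliens with sigourney weaver",
    "the weaver of dreams",
    "wever lane",
    "sigourney interview",
]
print(rank_by_adjacency(docs, ["sigourney wever", "sigourney weaver"]))
```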

Thoughts?

David



Re: Improving Solr Spell Checker Results

Posted by David Radunz <da...@boxen.net>.
Hey,

     I am trying to send this again as 'plain-text' to see if it 
delivers ok this time. All of the previous messages I sent should be below..

Cheers,

David

On 22/01/2012 11:42 PM, David Radunz wrote:
> Hey James,
>
>     I have played around a bit more with the settings and tried
> setting spellcheck.maxResultsForSuggest=100 and
> spellcheck.maxCollations=3. This yields 'Sigourney Weaver' as ONE of
> the corrections, but it's the second one and not the first. That is
> wrong if this is a patch for 'context-sensitive' spell checking,
> because it doesn't really seem to honor any context at all. Unless I
> am misunderstanding this? Also, I don't really like
> maxResultsForSuggest as it means 'all or nothing'. If you set it to 10
> and there are 100 results, then you offer no corrections at all even
> if the term is missing in the dictionary entirely.
>
>     If I set spellcheck.maxResultsForSuggest=100 and
> spellcheck.maxCollations=3 and choose the collation with the largest
> 'hits' I get Sigourney Weaver and other 'popular' terms. But say I
> searched for 'pork and chups', the 'popular' correction is 'park and
> chips' whereas the first correction was correct: 'pork and chips'.
>
>     So really, none of the solutions either in this patch or Solr
> offer what I would truly call context-sensitive spell checking. That
> is, in a full text search engine you find documents based on terms
> and how close they are together in the document. It makes more than
> perfect sense to treat the dictionary like this, so that when there
> are multiple terms it offers suggestions for the terms that match
> closely to what's entered surrounding the term.
>
> Example:
>
>     "Sigourney Wever" would never appear in a document ever.
>     "Sigourney Weaver" however has many 'hits' in exactly that order 
> of words.
>
> So there needs to be a way to boost suggestions based on adjacency...  
> Much like the full text search operates.
>
> Thoughts?
>
> David
>
> On 22/01/2012 9:56 PM, David Radunz wrote:
>> James,
>>
>>     I worked out that I actually needed to 'apply' patch SOLR-2585, 
>> whoops. So I have done that now and it seems to return 
>> 'correctlySpelled=true' for 'Sigorney Wever' (when Sigorney isn't 
>> even in the dictionary). Could something have changed in the trunk to 
>> make your patch no longer work? I had to manually merge the setup for 
>> the test case due to a new 'hyphens' test case. The settings I am use 
>> are:
>>
>> <lst name="defaults">
>> <str name="echoParams">explicit</str>
>> <int name="rows">10</int>
>>
>> <str name="spellcheck.onlyMorePopular">false</str>
>> <int name="spellcheck.count">10</int>
>> <str name="spellcheck.extendedResults">true</str>
>> <str name="spellcheck.collate">true</str>
>> <str name="spellcheck.collateExtendedResults">true</str>
>> <int name="spellcheck.maxCollationTries">10</int>
>> <int name="spellcheck.maxCollations">1</int>
>>
>> <int name="spellcheck.alternativeTermCount">5</int>
>> <int name="spellcheck.maxResultsForSuggest">1</int>
>> </lst>
>>
>>
>> <lst name="spellchecker">
>> <str name="name">default</str>
>> <str name="field">spell</str>
>> <str name="classname">solr.DirectSolrSpellChecker</str>
>>
>> <!-- the spellcheck distance measure used, the default is the 
>> internal levenshtein -->
>> <str name="distanceMeasure">internal</str>
>> <!-- minimum accuracy needed to be considered a valid spellcheck 
>> suggestion -->
>> <float name="accuracy">0.5</float>
>> <!-- the maximum #edits we consider when enumerating terms: can be 1 
>> or 2 -->
>> <int name="maxEdits">2</int>
>> <!-- the minimum shared prefix when enumerating terms -->
>> <int name="minPrefix">1</int>
>> <!-- maximum number of inspections per result. -->
>> <int name="maxInspections">5</int>
>> <!-- minimum length of a query term to be considered for correction -->
>> <int name="minQueryLength">4</int>
>> <!-- maximum threshold of documents a query term can appear to be 
>> considered for correction -->
>> <float name="maxQueryFrequency">0.01</float>
>> <!-- require suggestions to occur in 0.1% of the documents -->
>> <!--
>> <float name="thresholdTokenFrequency">0.001</float>
>>       -->
>>
>> <str name="spellcheckIndexDir">spellchecker</str>
>> <str name="buildOnCommit">true</str>
>> </lst>
>>
>> With the query:
>>
>> spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5 
>>
>>
>> Cheers,
>>
>> David
>>
>>
>> On 22/01/2012 2:03 AM, David Radunz wrote:
>>> James,
>>>
>>>     Thanks again for your lengthy and informative response. I 
>>> updated from SVN trunk again today and was successfully able to run 
>>> 'ant test'. So I proceeded with trying your suggestions (for 
>>> question 1 so far):
>>>
>>> On 17/01/2012 5:32 AM, Dyer, James wrote:
>>>> David,
>>>>
>>>> The spellchecker normally won't give suggestions for any term in 
>>>> your index.  So even if "wever" is misspelled in context, if it 
>>>> exists in the index the spell checker will not try correcting it.  
>>>> There are 3 workarounds:
>>>> 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x 
>>>> only).  See https://issues.apache.org/jira/browse/SOLR-2585
>>> I have tried using this with the original test case of 'Signorney 
>>> Wever'. I didn't notice any difference, although I am a little 
>>> unclear as to what exactly this patch does. Nor am I really clear 
>>> what to set either of the options to, so I set them both to '5'. I 
>>> tried to find the test case it mentions, but it's not present in 
>>> SpellCheckCollatorTest.java .. Any suggestions?
>>>
>>>> 2. try "onlyMorePopular=true" in your request.  
>>>> (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).  
>>>> But see the September 2, 2011 comment in SOLR-2585 about why this 
>>>> might not do what you'd hope it would.
>>>
>>> Trying this did produce 'Signourney Weaver' as you would hope, but I 
>>> am a little afraid of the downside. I would much more like a context 
>>> sensative spell check that involves the terms around the correction.
>>>>
>>>> 3. If you're building your index on a<copyField />, you can add a 
>>>> stopword filter that filters out all of the misspelt or rare words 
>>>> from the field that the dictionary is based.  This could be an 
>>>> arduous task, and it may or may not work well for your data.
>>> I am currently using a copyField for all terms that are relevant, 
>>> which is quite a lot and the dictionary would encompass a huge 
>>> amount of data. Adding stopword filters would be out of the question 
>>> as we presently have more than 30,000 products and this is for the 
>>> initial launch, we intend to have many many more.
>>>>
>>>> As for your second question, I take it you're using (e)dismax with 
>>>> multiple fields in "qf", right?  The only way I know to handle this 
>>>> is to create a<copyfield>  that combines all of the fields you 
>>>> search across.  Use this combined field to base your dictionary.  
>>>> Also, specifying "spellcheck.maxCollationTries" with a non-zero 
>>>> value will weed out the nonsense word combinations that are likely 
>>>> to occur when doing this, ensuring that any collations provided 
>>>> will indeed yield hits.  The downside to doing this, of course, is 
>>>> it will make your first problem more acute in that there will be 
>>>> even more terms in your index that the spellchecker will ignore 
>>>> entirely, even if they're mispelled in context.  Once again, 
>>>> SOLR-2585 is designed to tackle this problem but it is still in its 
>>>> early stages, and thus far it is Trunk-only.
>>> I tried setting spellcheck.maxCollationTries to 5 to see if it would 
>>> help with the above problem, but it did not.
>>>
>>> I have now tried using it in the context of question 2. I tried 
>>> searching for 'Sigorney Wever' in the series name (which it's not 
>>> present in, as its an actor):
>>>
>>> spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,series_name_attr_opt_combo&sort=score+desc,release_date+desc&start=0&q=*+series_name:"signourney+wever"^100&spellcheck.q=signourney+wever&fq=store_id:"1"+AND+series_name_attr_opt_search:*signourney*wever*&rows=5&spellcheck.maxCollationTries=5 
>>>
>>>
>>> Suggestions for 'Sigourney' Wever were returned, but no spelling 
>>> suggestions or ones for series names (which i doubt there would be) 
>>> should have been returned.
>>>
>>>>
>>>> You might also be interested in 
>>>> https://issues.apache.org/jira/browse/SOLR-2993 .  Although this is 
>>>> unrelated to your two questions, the patch on this issue introduces 
>>>> a new "ConjunctionSolrSpellChecker" which theoretically could be 
>>>> enhanced to do exactly what you want.  That is, you could 
>>>> (theoretically) create separate dictionaries for each of the fields 
>>>> you're searching and let the CSSC combine the results&  generate 
>>>> collations, etc.
>>>
>>> During the upgrade I switched to solr.DirectSolrSpellChecker, which 
>>> I presume will help with this? I am a senior developer (in 
>>> Java/Perl/Python/PHP) but I have not as yet looked at any of the 
>>> Solr source code. So I am in the dark when you say it could be 
>>> tailored for my needs. Also, how would it work? Query wise.. Would 
>>> it be like.. spellcheck.series_name.q= and spellcheck.actor.q= and 
>>> so on? If so that sounds tempting to try and achieve. But if you 
>>> could provide any pointers in what exactly would be required that 
>>> would really help.
>>>
>>> Thanks again for your time,
>>>
>>> David
>>>>
>>>> James Dyer
>>>> E-Commerce Systems
>>>> Ingram Content Group
>>>> (615) 213-4311
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: David Radunz [mailto:david@boxen.net]
>>>> Sent: Friday, January 13, 2012 11:42 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Improving Solr Spell Checker Results
>>>>
>>>> Hey,
>>>>
>>>>       Firstly I would like to thank you all for creating such a great
>>>> searching platform. What I was wondering is whether it is possible to:
>>>>
>>>> 1. Have the spell checker take into account multiple words. For 
>>>> example
>>>> if I search for "Sigourney Wever" it doesn't flag as a spelling 
>>>> issue as
>>>> 'wever' is a correctly spelled word. And if I searched for "Sigourney
>>>> Wevr" the suggestion is "Sigourney Wever". Of course the correct
>>>> spelling is: Sigourney Weaver
>>>> 2. Have the spell checker return corrections only for dictionary items
>>>> added on the field being searched. i.e. Searching for an actor would
>>>> only use the dictionary fields from the actor. This makes sense on 
>>>> many
>>>> levels, as when you are field searching its useless to get a 
>>>> correction
>>>> from another field as no values would match in any case.
>>>>
>>>> Hopefully someone can help!
>>>>
>>>> Thanks in advance,
>>>>
>>>> David
>>>
>>
>


Re: Improving Solr Spell Checker Results

Posted by David Radunz <da...@boxen.net>.
Hey James,

     I have played around a bit more with the settings and tried setting 
spellcheck.maxResultsForSuggest=100 and spellcheck.maxCollations=3. This 
yields 'Sigourney Weaver' as ONE of the corrections, but it's the second 
one and not the first. That seems wrong if this is a patch for 
'context-sensitive' spell checking, because it doesn't really seem to 
honor any context at all. Unless I am misunderstanding this? Also, I 
don't really like maxResultsForSuggest as it means 'all or nothing'. If 
you set it to 10 and there are 100 results, then you offer no 
corrections at all, even if the term is missing from the dictionary 
entirely.

     If I set spellcheck.maxResultsForSuggest=100 and 
spellcheck.maxCollations=3 and choose the collation with the largest 
'hits' I get Sigourney Weaver and other 'popular' terms. But say I 
searched for 'pork and chups': the 'popular' correction is 'park and 
chips', whereas the first correction was correct: 'pork and chips'.
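
The trade-off between "first collation" and "most hits" can be pictured with a small client-side sketch. This is only an illustration under stated assumptions: the collation strings and hit counts below are made up, and pick_collation is a hypothetical helper, not part of Solr or of the SOLR-2585 patch. It prefers the collation with the fewest character edits from what the user typed and uses the hit count only to break ties.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def pick_collation(query, collations):
    """Hypothetical helper: fewest edits from the query wins; ties go to more hits.

    collations is a list of (collation_text, hits) pairs.
    """
    return min(collations, key=lambda c: (edit_distance(query, c[0]), -c[1]))[0]


# Made-up collations and hit counts, for illustration only.
chups = [("park and chips", 120), ("pork and chips", 45)]
wever = [("sigourney never", 3), ("sigourney weaver", 80)]

print(pick_collation("pork and chups", chups))
print(pick_collation("sigourney wever", wever))
```

With this ordering 'pork and chips' beats the more popular 'park and chips' because it is one edit away rather than two, while 'sigourney weaver' wins its tie on hits. It is still not context-aware in the sense asked for above; it only combines two signals that are present in the extended collation results.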

     So really, none of the solutions either in this patch or in Solr 
offer what I would truly call context-sensitive spell checking. By that 
I mean: in a full-text search engine you find documents based on terms 
and how close together they are in the document. It makes perfect sense 
to treat the dictionary the same way, so that when there are multiple 
terms it offers suggestions that closely match what was entered around 
the term being corrected.

Example:

     "Sigourney Wever" would never appear in a document ever.
     "Sigourney Weaver" however has many 'hits' in exactly that order of 
words.

So there needs to be a way to boost suggestions based on adjacency...  
Much like the full text search operates.
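
As a rough sketch of what "boosting by adjacency" could mean, the snippet below counts word pairs in a few sample strings and ranks candidate corrections by how often each one directly follows the preceding query word. Everything here is an assumption for illustration: the sample documents are invented, and bigram_counts and rank_by_adjacency are hypothetical helpers that do not correspond to any existing Solr feature.

```python
from collections import Counter


def bigram_counts(documents):
    """Count adjacent word pairs across a collection of text strings."""
    counts = Counter()
    for text in documents:
        words = text.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


def rank_by_adjacency(previous_word, candidates, counts):
    """Order candidates by how often they directly follow previous_word."""
    prev = previous_word.lower()
    return sorted(candidates, key=lambda w: counts[(prev, w.lower())], reverse=True)


# Invented sample data; a real dictionary would be built from the index.
docs = [
    "Alien starring Sigourney Weaver",
    "Aliens starring Sigourney Weaver",
    "The Wever collection",
]
counts = bigram_counts(docs)
ranked = rank_by_adjacency("sigourney", ["wever", "weaver"], counts)
print(ranked)
```

Here 'weaver' ranks first because it follows 'sigourney' in the sample data, while 'wever' never does, even though 'wever' exists as a term on its own.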

Thoughts?

David



Re: Improving Solr Spell Checker Results

Posted by David Radunz <da...@boxen.net>.
James,

     I worked out that I actually needed to 'apply' patch SOLR-2585, 
whoops. So I have done that now and it seems to return 
'correctlySpelled=true' for 'Sigorney Wever' (when Sigorney isn't even 
in the dictionary). Could something have changed in the trunk to make 
your patch no longer work? I had to manually merge the setup for the 
test case due to a new 'hyphens' test case. The settings I am using are:

<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>

<str name="spellcheck.onlyMorePopular">false</str>
<int name="spellcheck.count">10</int>
<str name="spellcheck.extendedResults">true</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.collateExtendedResults">true</str>
<int name="spellcheck.maxCollationTries">10</int>
<int name="spellcheck.maxCollations">1</int>

<int name="spellcheck.alternativeTermCount">5</int>
<int name="spellcheck.maxResultsForSuggest">1</int>
</lst>


<lst name="spellchecker">
<str name="name">default</str>
<str name="field">spell</str>
<str name="classname">solr.DirectSolrSpellChecker</str>

<!-- the spellcheck distance measure used, the default is the internal levenshtein -->
<str name="distanceMeasure">internal</str>
<!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
<float name="accuracy">0.5</float>
<!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
<int name="maxEdits">2</int>
<!-- the minimum shared prefix when enumerating terms -->
<int name="minPrefix">1</int>
<!-- maximum number of inspections per result. -->
<int name="maxInspections">5</int>
<!-- minimum length of a query term to be considered for correction -->
<int name="minQueryLength">4</int>
<!-- maximum threshold of documents a query term can appear to be considered for correction -->
<float name="maxQueryFrequency">0.01</float>
<!-- require suggestions to occur in 0.1% of the documents -->
<!--
<float name="thresholdTokenFrequency">0.001</float>
-->

<str name="spellcheckIndexDir">spellchecker</str>
<str name="buildOnCommit">true</str>
</lst>

With the query:

spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,primary_cat_id&sort=score+desc,name+asc,year_made+desc&start=0&q=sigorney+wever+title:"sigorney+wever"^100+series_name:"sigorney+wever"^50&spellcheck.q=sigorney+wever&fq=store_id:"1"&rows=5
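
The query above is shown with raw quotes and carets. As a hedged sketch, a client could build the same kind of request with Python's standard urllib.parse so that those characters are percent-encoded. The parameter values are copied from the query above; the fl and facet parameters are left out only to keep the example short.

```python
from urllib.parse import urlencode, parse_qs

# Values taken from the query in this message; fl and facet omitted for brevity.
params = {
    "spellcheck": "true",
    "spellcheck.q": "sigorney wever",
    "q": 'sigorney wever title:"sigorney wever"^100 series_name:"sigorney wever"^50',
    "fq": 'store_id:"1"',
    "sort": "score desc,name asc,year_made desc",
    "start": 0,
    "rows": 5,
}

# urlencode percent-encodes quotes, carets and colons and turns spaces into '+'.
query_string = urlencode(params)
print(query_string)
```

Decoding the result with parse_qs gives back the original q value, which is a quick way to check that nothing was mangled on the way to the server.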

Cheers,

David




Re: Improving Solr Spell Checker Results

Posted by David Radunz <da...@boxen.net>.
James,

     Thanks again for your lengthy and informative response. I updated 
from SVN trunk again today and was successfully able to run 'ant test'. 
So I proceeded with trying your suggestions (for question 1 so far):

On 17/01/2012 5:32 AM, Dyer, James wrote:
> David,
>
> The spellchecker normally won't give suggestions for any term in your index.  So even if "wever" is misspelled in context, if it exists in the index the spell checker will not try correcting it.  There are 3 workarounds:
> 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).  See https://issues.apache.org/jira/browse/SOLR-2585
I have tried using this with the original test case of 'Signorney 
Wever'. I didn't notice any difference, although I am a little unclear 
as to what exactly this patch does. Nor am I really clear what to set 
either of the options to, so I set them both to '5'. I tried to find the 
test case it mentions, but it's not present in 
SpellCheckCollatorTest.java .. Any suggestions?

> 2. try "onlyMorePopular=true" in your request.  (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).  But see the September 2, 2011 comment in SOLR-2585 about why this might not do what you'd hope it would.

Trying this did produce 'Signourney Weaver' as you would hope, but I am 
a little afraid of the downside. I would much prefer a context-sensitive 
spell check that takes into account the terms around the correction.
>
> 3. If you're building your index on a <copyField />, you can add a stopword filter that filters out all of the misspelt or rare words from the field that the dictionary is based on.  This could be an arduous task, and it may or may not work well for your data.
I am currently using a copyField for all terms that are relevant, which 
is quite a lot, and the dictionary would encompass a huge amount of data. 
Adding stopword filters would be out of the question as we presently 
have more than 30,000 products and this is just for the initial launch; 
we intend to have many, many more.
>
> As for your second question, I take it you're using (e)dismax with multiple fields in "qf", right?  The only way I know to handle this is to create a <copyField> that combines all of the fields you search across.  Use this combined field to base your dictionary.  Also, specifying "spellcheck.maxCollationTries" with a non-zero value will weed out the nonsense word combinations that are likely to occur when doing this, ensuring that any collations provided will indeed yield hits.  The downside to doing this, of course, is it will make your first problem more acute in that there will be even more terms in your index that the spellchecker will ignore entirely, even if they're misspelled in context.  Once again, SOLR-2585 is designed to tackle this problem but it is still in its early stages, and thus far it is Trunk-only.
I tried setting spellcheck.maxCollationTries to 5 to see if it would 
help with the above problem, but it did not.

I have now tried using it in the context of question 2. I searched for 
'Signourney Wever' in the series name (where it is not present, as it's 
an actor):

spellcheck=true&facet=on&fl=id,sku,name,format,thumbnail,release_date,url_path,price,special_price,year_made_attr_opt_combo,series_name_attr_opt_combo&sort=score+desc,release_date+desc&start=0&q=*+series_name:"signourney+wever"^100&spellcheck.q=signourney+wever&fq=store_id:"1"+AND+series_name_attr_opt_search:*signourney*wever*&rows=5&spellcheck.maxCollationTries=5

Suggestions for 'Sigourney Wever' were returned, but either no spelling 
suggestions at all, or only ones drawn from series names (which I doubt 
exist), should have been returned.

>
> You might also be interested in https://issues.apache.org/jira/browse/SOLR-2993 .  Although this is unrelated to your two questions, the patch on this issue introduces a new "ConjunctionSolrSpellChecker" which theoretically could be enhanced to do exactly what you want.  That is, you could (theoretically) create separate dictionaries for each of the fields you're searching and let the CSSC combine the results & generate collations, etc.

During the upgrade I switched to solr.DirectSolrSpellChecker, which I 
presume will help with this? I am a senior developer (in 
Java/Perl/Python/PHP), but I have not yet looked at any of the Solr 
source code, so I am in the dark when you say it could be tailored to 
my needs. Also, how would it work query-wise? Would it be something like 
spellcheck.series_name.q= and spellcheck.actor.q= and so on? If so, that 
sounds tempting to try to achieve. Any pointers on what exactly would 
be required would really help.

Thanks again for your time,

David
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
>


Re: Improving Solr Spell Checker Results

Posted by David Radunz <da...@boxen.net>.
Hey,

     Thanks so much for your outstanding response. I have been busy for 
a few days so have not had a chance to try it out. I have now tried to 
install the trunk of Solr, and when I run 'ant test' I encounter the following:

     [junit] Testsuite: org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader
     [junit] Testcase: testRefreshReadRecreatedTaxonomy(org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader):      FAILED
     [junit] Expected InconsistentTaxonomyException
     [junit] junit.framework.AssertionFailedError: Expected InconsistentTaxonomyException
     [junit]     at org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader.doTestReadRecreatedTaxono(TestDirectoryTaxonomyReader.java:168)
     [junit]     at org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader.testRefreshReadRecreatedTaxonomy(TestDirectoryTaxonomyReader.java:130)
     [junit]     at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:529)
     [junit]     at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165)
     [junit]     at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)


     Should I ignore this (and other failed tests) and continue anyway?

Cheers,

David

On 17/01/2012 5:32 AM, Dyer, James wrote:
> David,
>
> The spellchecker normally won't give suggestions for any term in your index.  So even if "wever" is misspelled in context, if it exists in the index the spell checker will not try correcting it.  There are 3 workarounds:
> 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).  See https://issues.apache.org/jira/browse/SOLR-2585
>
> 2. Try "onlyMorePopular=true" in your request.  (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).  But see the September 2, 2011 comment in SOLR-2585 about why this might not do what you'd hope it would.
>
> 3. If you're building your index on a <copyField />, you can add a stopword filter that filters out all of the misspelt or rare words from the field that the dictionary is based on.  This could be an arduous task, and it may or may not work well for your data.
>
> As for your second question, I take it you're using (e)dismax with multiple fields in "qf", right?  The only way I know to handle this is to create a <copyField> that combines all of the fields you search across.  Base your dictionary on this combined field.  Also, specifying "spellcheck.maxCollationTries" with a non-zero value will weed out the nonsense word combinations that are likely to occur when doing this, ensuring that any collations provided will indeed yield hits.  The downside to doing this, of course, is that it will make your first problem more acute: there will be even more terms in your index that the spellchecker will ignore entirely, even if they're misspelled in context.  Once again, SOLR-2585 is designed to tackle this problem, but it is still in its early stages, and thus far it is Trunk-only.
>
> You might also be interested in https://issues.apache.org/jira/browse/SOLR-2993 .  Although this is unrelated to your two questions, the patch on this issue introduces a new "ConjunctionSolrSpellChecker" which theoretically could be enhanced to do exactly what you want.  That is, you could (theoretically) create separate dictionaries for each of the fields you're searching and let the CSSC combine the results & generate collations, etc.
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
>


RE: Improving Solr Spell Checker Results

Posted by "Dyer, James" <Ja...@ingrambook.com>.
David,

The spellchecker normally won't give suggestions for any term in your index.  So even if "wever" is misspelled in context, if it exists in the index the spell checker will not try correcting it.  There are 3 workarounds:
1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only).  See https://issues.apache.org/jira/browse/SOLR-2585

2. Try "onlyMorePopular=true" in your request.  (http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular).  But see the September 2, 2011 comment in SOLR-2585 about why this might not do what you'd hope it would.

3. If you're building your index on a <copyField />, you can add a stopword filter that filters out all of the misspelt or rare words from the field that the dictionary is based on.  This could be an arduous task, and it may or may not work well for your data.
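
To make the behaviour described at the top of this message concrete, here is a toy model in Python. It is not Solr's code: the function name, the sample index, and the use of difflib for similarity are all assumptions made purely for illustration. It only looks for a correction when a query term is absent from the index, which is why "wever" is left alone while "wevr" is corrected.

```python
import difflib

def suggest(query, index_terms):
    """Toy model: only terms missing from the index get a suggestion."""
    suggestions = {}
    for term in query.lower().split():
        if term in index_terms:
            # The term exists in the index, so it is treated as correctly spelled.
            continue
        # Hypothetical similarity measure; Solr uses its own distance metrics.
        close = difflib.get_close_matches(term, sorted(index_terms), n=1, cutoff=0.6)
        if close:
            suggestions[term] = close[0]
    return suggestions

# Hypothetical index holding both the correct name and a rare variant.
INDEX_TERMS = {"sigourney", "weaver", "wever"}

print(suggest("Sigourney Wever", INDEX_TERMS))  # no suggestions: both terms are indexed
print(suggest("Sigourney Wevr", INDEX_TERMS))   # "wevr" maps to the nearest indexed term
```

In this sketch the rare term "wever" both blocks its own correction and wins as the suggestion for "wevr", which mirrors the two symptoms reported in the original question.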

As for your second question, I take it you're using (e)dismax with multiple fields in "qf", right?  The only way I know to handle this is to create a <copyField> that combines all of the fields you search across.  Base your dictionary on this combined field.  Also, specifying "spellcheck.maxCollationTries" with a non-zero value will weed out the nonsense word combinations that are likely to occur when doing this, ensuring that any collations provided will indeed yield hits.  The downside to doing this, of course, is that it will make your first problem more acute: there will be even more terms in your index that the spellchecker will ignore entirely, even if they're misspelled in context.  Once again, SOLR-2585 is designed to tackle this problem, but it is still in its early stages, and thus far it is Trunk-only.
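
As a rough sketch of the request side, the following Python builds a query string that turns on collation checking. The helper name and its defaults are hypothetical; only the parameter names (spellcheck, spellcheck.q, spellcheck.collate, spellcheck.maxCollationTries, spellcheck.onlyMorePopular) are the ones documented for the SpellCheckComponent. The host, handler path, and any "qf" settings are deliberately left out because they depend on your setup.

```python
from urllib.parse import urlencode

def spellcheck_params(user_query, max_collation_tries=5, only_more_popular=False):
    """Build the query-string part of a Solr request with spellcheck options."""
    params = {
        "q": user_query,
        "spellcheck": "true",
        "spellcheck.q": user_query,
        # Collation must be enabled for maxCollationTries to have any effect.
        "spellcheck.collate": "true",
        "spellcheck.maxCollationTries": str(max_collation_tries),
    }
    if only_more_popular:
        params["spellcheck.onlyMorePopular"] = "true"
    return urlencode(params)

print(spellcheck_params("sigourney wever"))
```

Note that this only affects which collations are returned; it does not change whether an indexed term such as "wever" is considered for correction in the first place.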

You might also be interested in https://issues.apache.org/jira/browse/SOLR-2993 .  Although this is unrelated to your two questions, the patch on this issue introduces a new "ConjunctionSolrSpellChecker" which theoretically could be enhanced to do exactly what you want.  That is, you could (theoretically) create separate dictionaries for each of the fields you're searching and let the CSSC combine the results & generate collations, etc. 
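
The per-field idea can be pictured with a small Python toy. This is not the SOLR-2993 patch and says nothing about how ConjunctionSolrSpellChecker is actually implemented; the field names, sample terms, and difflib-based matching are assumptions for illustration only. Each field keeps its own dictionary, and a term is checked only against the fields being searched.

```python
import difflib

# Hypothetical per-field dictionaries; in Solr these would be built from separate fields.
FIELD_TERMS = {
    "actor": ["sigourney", "weaver"],
    "series_name": ["alien", "avatar"],
}

def suggest_for_fields(term, fields):
    """Return {field: suggestion} using only the dictionaries of the given fields."""
    results = {}
    for field in fields:
        close = difflib.get_close_matches(term.lower(), FIELD_TERMS[field], n=1, cutoff=0.6)
        if close:
            results[field] = close[0]
    return results

print(suggest_for_fields("wever", ["actor"]))        # correction drawn from actor names
print(suggest_for_fields("wever", ["series_name"]))  # nothing similar among series names
```

A real implementation would still need to merge the per-field results and build collations from them, which is the part the CSSC would have to be extended to handle.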

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

