You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-user@lucene.apache.org by "Husain, Yavar" <yh...@firstam.com> on 2012/03/01 12:43:13 UTC

Spelling Corrector Algorithm

Hi

For spell checking component I set extendedResults to get the frequencies and then select the word with the best frequency. I understand the spell check algorithm based on Edit Distance. For an example:

Query to Solr: Marien

Spell Check Text Returned: Marine (Freq: 120), Market (Freq: 900) and others. My dictionary here is based on indexed words.

So I chose Market (more frequency) however which is wrong as my intent was marine. Both have Edit Distance of 2.

Now how can I improve this Algorithm to select marine instead of market (based on something more than edit distance and frequency stuff)?

Do I have to incorporate some "soundex" algorithms too?

I am looking for simple stuff which I can quickly implement.

I even tried using Peter Norvig's spell corrector Algorithm (which is great) but again I ran in same problems.
</PRE>
<BR>
******************************************************************************************<BR>This message may contain confidential or proprietary information intended only for the use of the<BR>addressee(s) named above or may contain information that is legally privileged. If you are<BR>not the intended addressee, or the person responsible for delivering it to the intended addressee,<BR>you are hereby notified that reading, disseminating, distributing or copying this message is strictly<BR>prohibited. If you have received this message by mistake, please immediately notify us by<BR>replying to the message and delete the original message and any copies immediately thereafter.<BR>
<BR>
Thank you.~<BR>
******************************************************************************************<BR>
FAFLD<BR>
<PRE>

RE: Spelling Corrector Algorithm

Posted by "Husain, Yavar" <yh...@firstam.com>.

Thanks Robert. Yes thats right I can get some more accuracy if I use transposition in addition to substitution, insert and deletion.
________________________________________
From: Robert Muir [rcmuir@gmail.com]
Sent: Thursday, March 01, 2012 9:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Spelling Corrector Algorithm

On Thu, Mar 1, 2012 at 6:43 AM, Husain, Yavar <yh...@firstam.com> wrote:
> Hi
>
> For spell checking component I set extendedResults to get the frequencies and then select the word with the best frequency. I understand the spell check algorithm based on Edit Distance. For an example:
>
> Query to Solr: Marien
>
> Spell Check Text Returned: Marine (Freq: 120), Market (Freq: 900) and others. My dictionary here is based on indexed words.
>
> So I chose Market (more frequency) however which is wrong as my intent was marine. Both have Edit Distance of 2.

If you don't mind using trunk, just use directspellchecker, since it
counts marine as an edit distance of 1 from marien (a transposition:
https://issues.apache.org/jira/browse/LUCENE-3662)

--
lucidimagination.com
****************************************************************************************** 
This message may contain confidential or proprietary information intended only for the use of the 
addressee(s) named above or may contain information that is legally privileged. If you are 
not the intended addressee, or the person responsible for delivering it to the intended addressee, 
you are hereby notified that reading, disseminating, distributing or copying this message is strictly 
prohibited. If you have received this message by mistake, please immediately notify us by 
replying to the message and delete the original message and any copies immediately thereafter. 

Thank you.- 
******************************************************************************************
FAFLD

Re: Spelling Corrector Algorithm

Posted by Robert Muir <rc...@gmail.com>.

On Thu, Mar 1, 2012 at 6:43 AM, Husain, Yavar <yh...@firstam.com> wrote:
> Hi
>
> For spell checking component I set extendedResults to get the frequencies and then select the word with the best frequency. I understand the spell check algorithm based on Edit Distance. For an example:
>
> Query to Solr: Marien
>
> Spell Check Text Returned: Marine (Freq: 120), Market (Freq: 900) and others. My dictionary here is based on indexed words.
>
> So I chose Market (more frequency) however which is wrong as my intent was marine. Both have Edit Distance of 2.

If you don't mind using trunk, just use directspellchecker, since it
counts marine as an edit distance of 1 from marien (a transposition:
https://issues.apache.org/jira/browse/LUCENE-3662)

-- 
lucidimagination.com

RE: Spelling Corrector Algorithm

Posted by "Husain, Yavar" <yh...@firstam.com>.

Thanks James. I loved the last line in your mail  "But in the end, especially with 1-word queries, I doubt even the best algorithms are going to always accurately guess what the user wanted." Absolutely I agree to this; if it is a phrase (instead of single word) then probably we can apply some NLP stuff.
________________________________________
From: Dyer, James [James.Dyer@ingrambook.com]
Sent: Thursday, March 01, 2012 9:29 PM
To: solr-user@lucene.apache.org
Subject: RE: Spelling Corrector Algorithm

Yavar,

When you listed what the spell checker returns you put them in this order:

> Marine (Freq: 120), Market (Freq: 900) and others

Was "Marine" listed first, and then did you pick "Market" because you thought higher frequency is better?  If so, you probably have the right settings already but need to trust it and go with the first result.

If, on the other hand, the wrong suggestions truly are coming up first, you have 2 extension points:

1. You can change the comparator class.  The default one sorts by "score" (distance) first and then "frequency" to break ties.  There is also a pre-packed comparator that sorts just on frequency, or you can write your own (implementing Comparator<org.apache.lucene.search.spell.SuggestWord>) . But I doubt you'd want to change this one.  (see http://wiki.apache.org/solr/SpellCheckComponent#Custom_Comparators_and_the_Lucene_Spell_Checkers_.28IndexBasedSpellChecker.2C_FileBasedSpellChecker.2C_DirectSolrSpellChecker.29 for more info)

2. You can change the distance metric.  The default uses Levenshtein distance, but there is also an implementation for Jaro-Winkler distance.  (see the wikipedia articles for these 2 if you want to know the subtle differences).  It almost seems to me that Jaro-Winkler might give you better results but you'd have to test.  See the example under http://wiki.apache.org/solr/SpellCheckComponent?highlight=%28distanceMeasure%29#Configuration for more information on how to configure this.

If neither distance measure works for you, you could try implementing your own by creating a class implementing "org.apache.lucene.search.spell.StringDistance", then specify your class for the "distanceMeasure" parameter.

Finally, there are some other possibly easy solutions to your problem you should test before going through the trouble of writing custom code:

1. Try a higher "spellcheck.count".  Even if you only want a couple of results the algorithm works better with this set >5 (10-20 might be optimial in some cases).
2. Use DirectSolrSpellChecker, if on 4.x.  This one is not influenced by "spellcheck.count", so you can truly set it to 1 if all you want is 1 result.
3. Use "spellcheck.collate=true" and set "spellcheck.maxCollationTries" to maybe 5 or 10.  This will try the various suggestions by querying the index along with any other query parameters (other keywords, filters, etc), letting you know which suggestions are going to truly return hits in context (and how many).
4. Try Jaro-Winkler (as mentioned above).

Hope this helps.  But in the end, especially with 1-word queries, I doubt even the best algorithms are going to always accurately guess what the user wanted.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Husain, Yavar [mailto:yhusain@firstam.com]
Sent: Thursday, March 01, 2012 5:43 AM
To: solr-user@lucene.apache.org
Subject: Spelling Corrector Algorithm

Hi

For spell checking component I set extendedResults to get the frequencies and then select the word with the best frequency. I understand the spell check algorithm based on Edit Distance. For an example:

Query to Solr: Marien

Spell Check Text Returned: Marine (Freq: 120), Market (Freq: 900) and others. My dictionary here is based on indexed words.

So I chose Market (more frequency) however which is wrong as my intent was marine. Both have Edit Distance of 2.

Now how can I improve this Algorithm to select marine instead of market (based on something more than edit distance and frequency stuff)?

Do I have to incorporate some "soundex" algorithms too?

I am looking for simple stuff which I can quickly implement.

I even tried using Peter Norvig's spell corrector Algorithm (which is great) but again I ran in same problems.
</PRE>
<BR>
******************************************************************************************<BR>This message may contain confidential or proprietary information intended only for the use of the<BR>addressee(s) named above or may contain information that is legally privileged. If you are<BR>not the intended addressee, or the person responsible for delivering it to the intended addressee,<BR>you are hereby notified that reading, disseminating, distributing or copying this message is strictly<BR>prohibited. If you have received this message by mistake, please immediately notify us by<BR>replying to the message and delete the original message and any copies immediately thereafter.<BR>
<BR>
Thank you.~<BR>
******************************************************************************************<BR>
FAFLD<BR>
<PRE>

RE: Spelling Corrector Algorithm

Posted by "Dyer, James" <Ja...@ingrambook.com>.

Yavar,

When you listed what the spell checker returns you put them in this order:

> Marine (Freq: 120), Market (Freq: 900) and others

Was "Marine" listed first, and then did you pick "Market" because you thought higher frequency is better?  If so, you probably have the right settings already but need to trust it and go with the first result.

If, on the other hand, the wrong suggestions truly are coming up first, you have 2 extension points:

1. You can change the comparator class.  The default one sorts by "score" (distance) first and then "frequency" to break ties.  There is also a pre-packed comparator that sorts just on frequency, or you can write your own (implementing Comparator<org.apache.lucene.search.spell.SuggestWord>) . But I doubt you'd want to change this one.  (see http://wiki.apache.org/solr/SpellCheckComponent#Custom_Comparators_and_the_Lucene_Spell_Checkers_.28IndexBasedSpellChecker.2C_FileBasedSpellChecker.2C_DirectSolrSpellChecker.29 for more info)

2. You can change the distance metric.  The default uses Levenshtein distance, but there is also an implementation for Jaro-Winkler distance.  (see the wikipedia articles for these 2 if you want to know the subtle differences).  It almost seems to me that Jaro-Winkler might give you better results but you'd have to test.  See the example under http://wiki.apache.org/solr/SpellCheckComponent?highlight=%28distanceMeasure%29#Configuration for more information on how to configure this.

If neither distance measure works for you, you could try implementing your own by creating a class implementing "org.apache.lucene.search.spell.StringDistance", then specify your class for the "distanceMeasure" parameter.

Finally, there are some other possibly easy solutions to your problem you should test before going through the trouble of writing custom code:

1. Try a higher "spellcheck.count".  Even if you only want a couple of results the algorithm works better with this set >5 (10-20 might be optimial in some cases).
2. Use DirectSolrSpellChecker, if on 4.x.  This one is not influenced by "spellcheck.count", so you can truly set it to 1 if all you want is 1 result.
3. Use "spellcheck.collate=true" and set "spellcheck.maxCollationTries" to maybe 5 or 10.  This will try the various suggestions by querying the index along with any other query parameters (other keywords, filters, etc), letting you know which suggestions are going to truly return hits in context (and how many).
4. Try Jaro-Winkler (as mentioned above).

Hope this helps.  But in the end, especially with 1-word queries, I doubt even the best algorithms are going to always accurately guess what the user wanted.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Husain, Yavar [mailto:yhusain@firstam.com] 
Sent: Thursday, March 01, 2012 5:43 AM
To: solr-user@lucene.apache.org
Subject: Spelling Corrector Algorithm

Hi

For spell checking component I set extendedResults to get the frequencies and then select the word with the best frequency. I understand the spell check algorithm based on Edit Distance. For an example:

Query to Solr: Marien

Spell Check Text Returned: Marine (Freq: 120), Market (Freq: 900) and others. My dictionary here is based on indexed words.

So I chose Market (more frequency) however which is wrong as my intent was marine. Both have Edit Distance of 2.

Now how can I improve this Algorithm to select marine instead of market (based on something more than edit distance and frequency stuff)?

Do I have to incorporate some "soundex" algorithms too?

I am looking for simple stuff which I can quickly implement.

I even tried using Peter Norvig's spell corrector Algorithm (which is great) but again I ran in same problems.
</PRE>
<BR>
******************************************************************************************<BR>This message may contain confidential or proprietary information intended only for the use of the<BR>addressee(s) named above or may contain information that is legally privileged. If you are<BR>not the intended addressee, or the person responsible for delivering it to the intended addressee,<BR>you are hereby notified that reading, disseminating, distributing or copying this message is strictly<BR>prohibited. If you have received this message by mistake, please immediately notify us by<BR>replying to the message and delete the original message and any copies immediately thereafter.<BR>
<BR>
Thank you.~<BR>
******************************************************************************************<BR>
FAFLD<BR>
<PRE>