Posted to solr-user@lucene.apache.org by "Dyer, James" <Ja...@ingrambook.com> on 2011/06/01 22:02:28 UTC

RE: Spellcheck Phrases

Tanner,

I just entered SOLR-2571 to fix the float-parsing bug that breaks "thresholdTokenFrequency".  It's just a one-line code fix, so I also included a patch that should apply cleanly to Solr 3.1.  See https://issues.apache.org/jira/browse/SOLR-2571 for info and patches.

This parameter appears to be absent from the wiki, and as it has always been broken for me, I haven't tested it.  However, my understanding is that it should be set to the minimum fraction of documents in which a term has to occur in order for it to appear in the spelling dictionary.  For instance, in the config below, a term would have to occur in at least 1% of the documents to be part of the spelling dictionary.  This might be a good setting for long fields, but for the short fields in my application I was thinking of setting this to something like 1/1000 of 1% ...

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
 <str name="queryAnalyzerFieldType">text</str>
 <lst name="spellchecker">
  <str name="name">spellchecker</str>
  <str name="field">Spelling_Dictionary</str>
  <str name="fieldType">text</str>
  <str name="spellcheckIndexDir">./spellchecker</str>
  <str name="thresholdTokenFrequency">.01</str> 
 </lst>
</searchComponent>
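
To make the arithmetic concrete: on the reading above, the value is a fraction of the documents in the index, so 1% is written as .01 and 1/1000 of 1% as .00001. The fragment below is only a sketch of those two settings. The document counts in the comments assume a hypothetical index of 1,000,000 documents, and the value is declared as a float, as the follow-up message in this thread recommends.

```xml
<!-- Sketch only: use one of the two alternatives, not both. -->

<!-- 1% of documents: in a 1,000,000-document index (an assumed size),
     a term would need to occur in roughly 10,000 documents. -->
<float name="thresholdTokenFrequency">.01</float>

<!-- 1/1000 of 1% of documents: in the same index, roughly 10 documents. -->
<float name="thresholdTokenFrequency">.00001</float>
```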

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Tanner Postert [mailto:tanner.postert@gmail.com] 
Sent: Friday, May 27, 2011 6:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Spellcheck Phrases

are there any updates on this? any third party apps that can make this work
as expected?

On Wed, Feb 23, 2011 at 12:38 PM, Dyer, James <Ja...@ingrambook.com> wrote:

> Tanner,
>
> Currently Solr will only make suggestions for words that are not in the
> dictionary, unless you specify "spellcheck.onlyMorePopular=true".  However,
> if you do that, then it will try to "improve" every word in your query, even
> the ones that are spelled correctly (so while it might change "brake" to
> "break" it might also change "leg" to "log".)
>
> You might be able to alleviate some of the pain by setting the
> "thresholdTokenFrequency" so as to remove misspelled and rarely-used words
> from your dictionary, although I personally haven't been able to get this
> parameter to work.  It also doesn't seem to be documented on the wiki but it
> is in the 1.4.1 source code, in class IndexBasedSpellChecker.  It's also
> mentioned in Smiley & Pugh's book.  I tried setting it like this, but got a
> ClassCastException on the float value:
>
> <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
>  <str name="queryAnalyzerFieldType">text_spelling</str>
>  <lst name="spellchecker">
>  <str name="name">spellchecker</str>
>  <str name="field">Spelling_Dictionary</str>
>  <str name="fieldType">text_spelling</str>
>  <str name="buildOnOptimize">true</str>
>  <str name="thresholdTokenFrequency">.0000001</str>
>  </lst>
> </searchComponent>
>
> I have it on my to-do list to look into this further but haven't yet.  If
> you decide to try it and can get it to work, please let me know how you do
> it.
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
> -----Original Message-----
> From: Tanner Postert [mailto:tanner.postert@gmail.com]
> Sent: Wednesday, February 23, 2011 12:53 PM
> To: solr-user@lucene.apache.org
> Subject: Spellcheck Phrases
>
> right now when I search for 'brake a leg', solr returns valid results with
> no indication of misspelling, which is understandable since all of those
> terms are valid words and are probably found in a few pieces of our
> content.
> My question is:
>
> is there any way for it to recognize that the phrase should be "break a leg"
> and not "brake a leg" and suggest the proper phrase?
>
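
The "spellcheck.onlyMorePopular" behaviour described in the quoted February message is controlled per request, so it can also be set as a handler default in solrconfig.xml. The fragment below is a minimal sketch of that idea rather than a recommended setup: the handler name "/spell" is made up for illustration, and the dictionary name "spellchecker" is taken from the configs quoted in this thread.

```xml
<!-- Sketch only: the handler name "/spell" is hypothetical. -->
<requestHandler name="/spell" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- turn the spellcheck component on for every request to this handler -->
    <str name="spellcheck">true</str>
    <!-- must match the "name" of the spellchecker defined in the searchComponent -->
    <str name="spellcheck.dictionary">spellchecker</str>
    <!-- also suggest for correctly spelled words, with the side effects
         described above ("leg" may become "log") -->
    <str name="spellcheck.onlyMorePopular">true</str>
    <!-- return a rewritten version of the whole query, not just per-word suggestions -->
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

The same parameters can be passed on the query string instead, which makes it easier to compare results with and without "spellcheck.onlyMorePopular" before committing to a default.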

Re: Spellcheck Phrases

Posted by Erick Erickson <er...@gmail.com>.
Please start a new thread for this question, see:

http://people.apache.org/~hossman/#threadhijack
<<<
When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.
>>>

Best
Erick

On Tue, Aug 23, 2011 at 11:47 AM, Herman Kiefus <he...@angieslist.com> wrote:
> The angle that I am trying here is to create a dictionary, built from indexed terms, that contains only correctly spelled words.  We are doing this by having the field from which the dictionary is created use a type that employs solr.KeepWordFilterFactory, which in turn uses a text file of known correctly spelled words (including their respective derivations, for example: lead, leads, leading, etc.).
>
> This is working great for us, with the exception of those fields in our schema that contain proper names.  I can't seem to get (unfiltered) terms from those fields along with (correctly spelled) terms from other fields into the single field upon which the dictionary is built.

RE: Spellcheck Phrases

Posted by Herman Kiefus <he...@angieslist.com>.
The angle that I am trying here is to create a dictionary, built from indexed terms, that contains only correctly spelled words.  We are doing this by having the field from which the dictionary is created use a type that employs solr.KeepWordFilterFactory, which in turn uses a text file of known correctly spelled words (including their respective derivations, for example: lead, leads, leading, etc.).

This is working great for us, with the exception of those fields in our schema that contain proper names.  I can't seem to get (unfiltered) terms from those fields along with (correctly spelled) terms from other fields into the single field upon which the dictionary is built.
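
A setup like the one described might look roughly like the schema.xml sketch below. Every type, field, and file name in it is hypothetical, since the original message does not give them. Only solr.KeepWordFilterFactory and the dictionary-field idea come from the message itself.

```xml
<!-- Sketch only: all names below are assumptions made for illustration. -->
<fieldType name="text_spelling_keep" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- drop every token that is not listed in the word file -->
    <filter class="solr.KeepWordFilterFactory"
            words="correctly_spelled_words.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

<!-- the single field the spelling dictionary is built from -->
<field name="Spelling_Dictionary" type="text_spelling_keep"
       indexed="true" stored="false" multiValued="true"/>

<!-- ordinary text: filtered against the word file, as intended -->
<copyField source="description" dest="Spelling_Dictionary"/>
<!-- proper names: these pass through the same filter as well -->
<copyField source="provider_name" dest="Spelling_Dictionary"/>
```

As far as I understand it, copyField hands the raw source text to the destination field, and the destination field's own analysis chain is then applied to everything copied in. That would explain why the proper names cannot arrive unfiltered in a field whose type includes the keep-word filter.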


RE: Spellcheck Phrases

Posted by "Dyer, James" <Ja...@ingrambook.com>.
Actually, someone just pointed out to me that a patch like this is unnecessary.  The code works as-is if configured like this:

<float name="thresholdTokenFrequency">.01</float>  (correct)

instead of this:

<str name="thresholdTokenFrequency">.01</str> (incorrect)

I tested this and it seems to work.  I'm still trying to figure out if using this parameter actually improves the quality of our spell suggestions, now that I know how to use it properly.

Sorry about the misinformation earlier.
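
For reference, the component from the earlier message with only the element type of the threshold changed would look something like this. It is a sketch assembled from the two configs in this thread, not a tested configuration, and the field and type names are the ones used in that earlier message.

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
 <str name="queryAnalyzerFieldType">text</str>
 <lst name="spellchecker">
  <str name="name">spellchecker</str>
  <str name="field">Spelling_Dictionary</str>
  <str name="fieldType">text</str>
  <str name="spellcheckIndexDir">./spellchecker</str>
  <!-- declared as float rather than str, which avoids the ClassCastException -->
  <float name="thresholdTokenFrequency">.01</float>
 </lst>
</searchComponent>
```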

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

