Posted to solr-user@lucene.apache.org by anuvenk <an...@hotmail.com> on 2008/01/25 19:08:09 UTC
Re: Spell Check Handler

I followed your instructions exactly, but I still have trouble with multiword
queries. For example, q=grapics returns 'graphics',
but q=grapics card returns nothing.
I even tried the latest nightly build, but that didn't solve the problem. Is
there a solution available?

scott.tabar wrote:
> 
> Matthew,
> 
> Thanks for the question.  The answer is that they come from your own
> indexes so the dictionary is based upon the actual words that are already
> stored in Solr.  This makes sense; if the spell checker is suggesting a
> word that is not in the Solr index, then it will not help the user find
> what they are looking for.
> 
> You can control which fields in Solr feed the spell checker.  You can also
> have more than one spell checker, each focused on a specific subject.
> 
> The following example of a SpellCheckerRequestHandler is based upon the
> one I created for the test case.  You need to add this to your
> solrconfig.xml file.  You can view the whole thing within the Solr source
> code once it is committed to the main stream.  The path is:
> /src/test/test-files/solr/conf/solrconfig-spellchecker.xml, with
> schema-spellchecker.xml in the same directory.
> 
>   <!-- SpellCheckerRequestHandler takes in a word (or several words) as
>        the value of the "q" parameter and returns a list of alternative
>        spelling suggestions.  If invoked with a ...&cmd=rebuild, it will
>        rebuild the spellchecker index.
>   -->
>   <requestHandler name="spellchecker"
>                   class="solr.SpellCheckerRequestHandler" startup="lazy">
>     <!-- default values for query parameters -->
>      <lst name="defaults">
>        <int name="suggestionCount">20</int>
>        <float name="accuracy">0.60</float>
>      </lst>
>      
>      <!-- Main init params for handler -->
>      
>      <!-- The directory where your SpellChecker Index should live. -->
>      <!-- May be absolute, or relative to the Solr "dataDir" directory. -->
>      <!-- If this option is not specified, a RAM directory will be used. -->
>      <str name="spellcheckerIndexDir">spell</str>
>      
>      <!-- the field in your schema that you want to be able to build -->
>      <!-- your spell index on. This should be a field that uses a very -->
>      <!-- simple FieldType without a lot of Analysis (ie: string) -->
>      <str name="termSourceField">spell</str>
>      
>    </requestHandler>
> 
> Some comments:
>   - The termSourceField should be a field you have defined within your
> solr schema file.  See notes below about the use of this field.
>   - The spellcheckerIndexDir is the name of the directory that contains
> the spellchecker indexes.  In my example, I used spell, and it will be at
> the same level as data and conf.  You can name it whatever you would like.
>   - If you use the name "/spellchecker", the url will be more RESTful.
>   - If you need to have more than one spell checker in use at a time, then
> you will need to change the name, spellcheckerIndexDir, and
> termSourceField for each one.
>   - If you have more than one spell checker hitting the same index
> directory, then when you rebuild the index through one of the handlers the
> other handlers will not know it has been reindexed.  To resolve this
> issue, you may have to restart Solr.
> 
> 
> The following components are from the schema-spellchecker.xml file:
> 
> 	<fieldType name="spellText" class="solr.TextField"
> 	           positionIncrementGap="100">
> 	  <analyzer type="index">
> 	    <tokenizer class="solr.StandardTokenizerFactory"/>
> 	    <filter class="solr.StopFilterFactory" ignoreCase="true"
> 	            words="stopwords.txt"/>
> 	    <filter class="solr.StandardFilterFactory"/>
> 	    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> 	  </analyzer>
> 	  <analyzer type="query">
> 	    <tokenizer class="solr.StandardTokenizerFactory"/>
> 	    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> 	            ignoreCase="true" expand="true"/>
> 	    <filter class="solr.StopFilterFactory" ignoreCase="true"
> 	            words="stopwords.txt"/>
> 	    <filter class="solr.StandardFilterFactory"/>
> 	    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> 	  </analyzer>
> 	</fieldType>
> 
> 
>    <field name="spell" type="spellText" indexed="true" stored="true" />
> 
> 
> 
> Some comments on Schema items above:
>   - The fieldType must be contained within the "types" group
>   - The spellText fieldType can be named whatever you want
>   - The spellText fieldType should not be too aggressive on stemming or
> modifying the contents of the field
>   - You could use string instead of the defined fieldType of spellText, but
> it does not have to be that restrictive
> 
>   - The field spell (of type spellText) needs to be within the "fields"
> group with your other defined fields
>   - You can always use copyField to copy another field's content into your
> "spell" field:
>       <copyField source="misc" dest="spell"/>
> 
> 
> Some notes on the name of the handler:
>   - If you precede the name with "/", you can use the first url below
> instead of the second one:
>   - using the name of "/spellchecker"
>      http://yourSolrSite/solr/spellchecker?q=sialophosphoprotein 
>   - using the name of "spellchecker"
>     http://yourSolrSite/solr/select?qt=spellchecker&q=sialophosphoprotein
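
The two URL forms above can be sketched in client code. This is only an illustration: the class and method names are hypothetical, and the base URL is the placeholder from the message, not a real server.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Solr: builds spell checker request URLs.
class SpellcheckUrlSketch {

    // Joins parameters as key=value pairs, URL-encoding keys and values.
    static String toQueryString(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joiner.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    // A handler name starting with "/" is addressed by path; any other name
    // goes through /select with the qt parameter.
    static String buildUrl(String solrBase, String handlerName, Map<String, String> params) {
        if (handlerName.startsWith("/")) {
            return solrBase + handlerName + "?" + toQueryString(params);
        }
        Map<String, String> all = new LinkedHashMap<>();
        all.put("qt", handlerName);
        all.putAll(params);
        return solrBase + "/select?" + toQueryString(all);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "sialophosphoprotein");
        String base = "http://yourSolrSite/solr"; // placeholder host
        System.out.println(buildUrl(base, "/spellchecker", params));
        System.out.println(buildUrl(base, "spellchecker", params));
    }
}
```

Other request parameters named in this thread, such as suggestionCount or cmd=rebuild, could be added to the same map.
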
> 
> 
> Matthew, I hope you find this somewhat helpful.
> 
>    Scott Tabar
> 
> ---- Matthew Runo <mr...@zappos.com> wrote: 
> Where does the index come from in the first place? Do we have to  
> enter the words, or are they entered as documents enter the SOLR index?
> 
> I'd love to be able to use my own documents as the spell check index  
> of "correctly spelled words".
> 
> +--------------------------------------------------------+
>   | Matthew Runo
>   | Zappos Development
>   | mruno@zappos.com
>   | 702-943-7833
> +--------------------------------------------------------+
> 
> 
> On Oct 11, 2007, at 7:08 AM, <sc...@fuse.net>  
> <sc...@fuse.net> wrote:
> 
>> Climbingrose,
>>
>> I think you make a valid point.  Each person may have a different  
>> concept of how something should work with their application.
>>
>> My thoughts on spell checking multiple words:
>>   - the parameter "multiWords" enables spell checking on each word
>> in the "q" parameter instead of on the whole field
>>   - each word is then represented by its own entry in a list of all
>> words that are checked
>>   - within that entry, the word being checked is identified by the
>> key "words"
>>   - the "exist" key indicates whether the word was found exactly as
>> it is within the spell checker's index
>>   - since there can be suggestions for both misspelled words and
>> words that are spelled correctly, the list of suggestions is
>> included for both, even if the suggestion list is empty.
>>
>>   - My vision is that if a user has a search query of multiple
>> words and wants to check them, the use
>> of "multiWords" will check all words at one time, independently
>> of each other, and return the list.  The presenting web app can
>> then identify visually to the user which words are misspelled and  
>> which ones have suggestions too.  The user can then work with the  
>> various lists of suggestions without having to re-hit Solr.   
>> Naturally, if the user manually changes a word, then Solr will have  
>> to be re-hit, but providing a single list of all words, including  
>> suggestions for correct words along with incorrect words, will help  
>> simplify applications (by reducing iterating over each word) and  
>> will help reduce the number of hits to the Solr server.
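
A minimal sketch of how a presenting web app might consume such a per-word list. The WordCheck class and its field names are hypothetical (they only mirror the "words" and "exist" keys described above), and the sample data in main is made up for illustration; parsing the actual Solr response is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical client-side model; not the handler's actual response classes.
class MultiWordResultSketch {

    // One entry per checked word: the word, whether it exists in the
    // spell checker index, and its (possibly empty) suggestion list.
    static final class WordCheck {
        final String word;
        final boolean exist;
        final List<String> suggestions;

        WordCheck(String word, boolean exist, List<String> suggestions) {
            this.word = word;
            this.exist = exist;
            this.suggestions = suggestions;
        }
    }

    // Keeps words found in the index; replaces the others with their first
    // suggestion when one is available, otherwise leaves them unchanged.
    static String didYouMean(List<WordCheck> checks) {
        List<String> out = new ArrayList<>();
        for (WordCheck c : checks) {
            boolean replace = !c.exist && !c.suggestions.isEmpty();
            out.add(replace ? c.suggestions.get(0) : c.word);
        }
        return String.join(" ", out);
    }

    // Words a UI might highlight: those not found in the index.
    static List<String> misspelled(List<WordCheck> checks) {
        List<String> out = new ArrayList<>();
        for (WordCheck c : checks) {
            if (!c.exist) {
                out.add(c.word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data, standing in for a parsed response.
        List<WordCheck> checks = Arrays.asList(
                new WordCheck("grapics", false, Arrays.asList("graphics")),
                new WordCheck("card", true, Collections.<String>emptyList()));
        System.out.println(didYouMean(checks));
        System.out.println(misspelled(checks));
    }
}
```

Because every word arrives in one list, the client can build the corrected query and the highlight list from a single response.
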
>>
>>
>>> 1) I assume that when a user enters a misspelled multiword query, we
>>> should only check the words that are actually misspelled. For example,
>>> if a user enters "life expectancy calculatar", which has "calculator"
>>> misspelled, we should only spellcheck "calculatar".
>>
>> I think I understand what you mean in the above statement, but you  
>> must admit, it does sound funny.  After all, how do you identify  
>> that a word is misspelled by NOT using the spelling checker?   
>> Correct me if I am wrong, but I think you intended to say that when  
>> a word is identified as being misspelled, then you should only  
>> include the suggestions for misspelled words.  If this is the case,  
>> then I would have to disagree with you.  The user may be interested  
>> in finding words that might mean the same, but are more popular
>> (appear in more indexed documents within the Lucene index).  Hence
>> the reason why I added the result field "exist" to identify that a
>> word is spelled correctly even if there is a list of suggestions.
>> Please note, the situation can also exist where a word is misspelled
>> and there are no suggestions, so one cannot use the suggestion list
>> as an indicator of the correctness of the individual word(s).
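
To make the point concrete: "exist" and the suggestion list vary independently, so a client has four cases to handle. The enum and method below are hypothetical names used only for illustration.

```java
// Hypothetical classification of one checked word; not part of the handler.
class WordStatusSketch {

    enum Status {
        CORRECT,                      // in the index, no suggestions
        CORRECT_WITH_ALTERNATIVES,    // in the index, suggestions offered anyway
        MISSPELLED_WITH_SUGGESTIONS,  // not in the index, suggestions offered
        MISSPELLED_NO_SUGGESTIONS     // not in the index, nothing to offer
    }

    // Correctness comes from "exist" alone; the suggestion count only
    // decides whether there is anything to show the user.
    static Status classify(boolean exist, int suggestionCount) {
        if (exist) {
            return suggestionCount > 0 ? Status.CORRECT_WITH_ALTERNATIVES : Status.CORRECT;
        }
        return suggestionCount > 0
                ? Status.MISSPELLED_WITH_SUGGESTIONS
                : Status.MISSPELLED_NO_SUGGESTIONS;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, 0));
        System.out.println(classify(true, 3));
        System.out.println(classify(false, 3));
        System.out.println(classify(false, 0));
    }
}
```
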
>>
>>
>>> 2) I only return the best string for a misspelled query.
>>
>> You can also use the parameter "suggestionCount=1" to control how  
>> many words are returned.  In this case, it will do what your code  
>> is doing, but still allow the client to dynamically change this  
>> value without the need to hard code it within the main source code.
>>
>>
>> As far as only including terms that are more popular than the word  
>> that is being checked, there is already a parameter  
>> "onlyMorePopular" that you can use to dynamically control this  
>> feature from the client side so it does not have to be hard coded  
>> within the spelling checker.
>>
>> Review these parameter options on the wiki, but keep in mind I have  
>> not updated the wiki with my changes or the new parameter and  
>> result fields:
>> http://wiki.apache.org/solr/SpellCheckerRequestHandler
>>
>>    Thanks Climbingrose,
>>
>>      Scott Tabar
>>
>>
>>
>>
>> ---- climbingrose <cl...@gmail.com> wrote:
>> Just to clarify this line of code:
>>
>> String[] suggestions = spellChecker.suggestSimilar(termText, numSug,
>> req.getSearcher().getReader(), restrictToField, true);
>>
>> I only return suggestions if they are more popular than termText. You
>> probably need to use code in Scott's patch to make this behaviour
>> configurable.
>>
>> On 10/11/07, climbingrose <cl...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I've been so busy the last few days that I haven't replied to this
>>> email. I modified SpellCheckerHandler a while ago to include support for
>>> multiword queries. To be honest, I didn't have time to write unit tests
>>> for the code. However, I deployed it in a production environment and it
>>> has been working for me so far. My version, however, makes two
>>> assumptions:
>>>
>>> 1) I assume that when a user enters a misspelled multiword query, we
>>> should only check the words that are actually misspelled. For example,
>>> if a user enters "life expectancy calculatar", which has "calculator"
>>> misspelled, we should only spellcheck "calculatar".
>>> 2) I only return the best string for a misspelled query.
>>>
>>> I guess I can just paste the code here so that others can adapt it
>>> for their own purposes. If you have any questions, just send me an
>>> email. I'll be happy to help you.
>>>
>>>         StringBuffer buf = null;
>>>         if (null != words && !"".equals(words.trim())) {
>>>             // Tokenize the query with the analyzer of the target field.
>>>             Analyzer analyzer =
>>>                 req.getSchema().getField(field).getType().getAnalyzer();
>>>             TokenStream source =
>>>                 analyzer.tokenStream(field, new StringReader(words));
>>>             Token t;
>>>             boolean hasSuggestion = false;
>>>             boolean termExists = false;
>>>             while (true) {
>>>                 try {
>>>                     t = source.next();
>>>                 } catch (IOException e) {
>>>                     t = null;
>>>                 }
>>>                 if (t == null)
>>>                     break;
>>>
>>>                 String termText = t.termText();
>>>                 // The last argument (true) keeps only suggestions that
>>>                 // are more popular than termText.
>>>                 String[] suggestions = spellChecker.suggestSimilar(
>>>                     termText, numSug, req.getSearcher().getReader(),
>>>                     restrictToField, true);
>>>                 if (suggestions != null && suggestions.length > 0) {
>>>                     // Use the best suggestion for this term.
>>>                     if (!suggestions[0].equals(termText)) {
>>>                         hasSuggestion = true;
>>>                     }
>>>                     if (buf == null) {
>>>                         buf = new StringBuffer(suggestions[0]);
>>>                     } else
>>>                         buf.append(" ").append(suggestions[0]);
>>>                 } else if (spellChecker.exist(termText)) {
>>>                     // No suggestion, but the term is in the index: keep it.
>>>                     termExists = true;
>>>                     if (buf == null) {
>>>                         buf = new StringBuffer(termText);
>>>                     } else
>>>                         buf.append(" ").append(termText);
>>>                 } else {
>>>                     // Unknown term with no suggestion: give up.
>>>                     hasSuggestion = false;
>>>                     termExists = false;
>>>                     break;
>>>                 }
>>>             }
>>>             try {
>>>                 source.close();
>>>             } catch (IOException e) {
>>>                 // ignore
>>>             }
>>>             // String[] suggestions = spellChecker.suggestSimilar(words,
>>>             //     numSug, nullReader, restrictToField, onlyMorePopular);
>>>             if (hasSuggestion || termExists)
>>>                 rsp.add("suggestions", buf.toString());
>>>             else
>>>                 rsp.add("suggestions", null);
>>>         }
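
The per-term logic of the snippet above can be tried without Solr or Lucene. The sketch below is a simplification under stated assumptions: the Speller interface and toySpeller are hypothetical stand-ins for the spell checker, whitespace splitting replaces the field's Analyzer, and the dictionary contents are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class MultiWordCorrectionSketch {

    // Hypothetical stand-in for the two spell checker calls used above.
    interface Speller {
        String[] suggestSimilar(String word, int numSug);
        boolean exist(String word);
    }

    // Returns the corrected query, or null when some word has neither a
    // suggestion nor an entry in the index.
    static String correct(String words, Speller speller, int numSug) {
        if (words == null || words.trim().isEmpty()) {
            return null;
        }
        StringBuilder buf = new StringBuilder();
        // Assumption: plain whitespace splitting instead of an Analyzer.
        for (String term : words.trim().split("\\s+")) {
            String[] suggestions = speller.suggestSimilar(term, numSug);
            String chosen;
            if (suggestions != null && suggestions.length > 0) {
                chosen = suggestions[0];   // best suggestion wins
            } else if (speller.exist(term)) {
                chosen = term;             // known word, keep as typed
            } else {
                return null;               // unknown word, nothing to offer
            }
            if (buf.length() > 0) {
                buf.append(' ');
            }
            buf.append(chosen);
        }
        return buf.toString();
    }

    // Toy in-memory speller: a set of known words plus fixed corrections.
    static Speller toySpeller(final Set<String> dictionary, final Map<String, String> fixes) {
        return new Speller() {
            public String[] suggestSimilar(String word, int numSug) {
                String fix = fixes.get(word);
                return (fix == null || numSug < 1) ? new String[0] : new String[] { fix };
            }

            public boolean exist(String word) {
                return dictionary.contains(word);
            }
        };
    }

    public static void main(String[] args) {
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("life", "expectancy", "calculator"));
        Map<String, String> fixes = new HashMap<>();
        fixes.put("calculatar", "calculator");
        Speller speller = toySpeller(dictionary, fixes);
        System.out.println(correct("life expectancy calculatar", speller, 1));
        System.out.println(correct("life xyzzy", speller, 1));
    }
}
```

One difference from the pasted code: it does not track whether the best suggestion equals the original term, so it should be read as an outline of the control flow rather than a drop-in replacement.
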
>>>
>>>
>>>
>>> On 10/11/07, scott.tabar@fuse.net <sc...@fuse.net> wrote:
>>>>
>>>> Hoss,
>>>>
>>>> I had a feeling someone would be quoting Yonik's Law of Patches!  ;-)
>>>>
>>>> For now, this is done.
>>>>
>>>> I made the changes, wrote JavaDoc comments on the various settings
>>>> and their expected output, created a JUnit test for the
>>>> SpellCheckerRequestHandler which tests various components of the
>>>> handler, and also created the supporting configuration files for the
>>>> JUnit tests (schema and solrconfig files).
>>>>
>>>> I attached the patch to the JIRA issue, so now we just have to wait
>>>> until it gets added back in to the main code stream.
>>>>
>>>> For anyone who is interested, here is a link to the JIRA:
>>>> https://issues.apache.org/jira/browse/SOLR-375
>>>>
>>>> Could someone please drop me a hint on how to update the wiki or any
>>>> other documentation that could benefit from being updated?  I'd like to
>>>> help out as much as possible, but first I need to know "how". ;-)
>>>>
>>>> When these changes do get committed back in to the daily build, please
>>>> review the generated JavaDoc for information on how to utilize these
>>>> new features.
>>>> If anyone has any questions or comments, please do not hesitate to ask.
>>>>
>>>>
>>>> As a general note of self-critique on these changes, I am not 100%
>>>> sure of the way I implemented the "nested" structure when the
>>>> "multiWords" parameter is used.  My interest is that it should work
>>>> smoothly with some other technology such as Prototype using the JSON
>>>> output type.  Unfortunately, I will not get a chance to start on that
>>>> coding until next week, so it is up in the air whether this structure
>>>> will be conducive or not.  I am planning on providing more details in
>>>> the documentation on how to utilize these modifications in Prototype
>>>> and Ajax when I get a chance (and even provide links to a production
>>>> site so you can see it in action and view the source if interested).
>>>> So stay tuned...
>>>>
>>>>    Thanks for everyone's time,
>>>>       Scott Tabar
>>>>
>>>> ---- Chris Hostetter <ho...@fucit.org> wrote:
>>>>
>>>> : If you like, I can post the source code changes that I made to the
>>>> : SpellCheckerRequestHandler, but at this time I am not ready to open a
>>>> : JIRA issue and submit the changes back through the subversion.  I will
>>>> : need to do a little more testing, documentation, and create some unit
>>>> : tests to cover all of these changes, but what I have been able to
>>>> : perform, it is working very well.
>>>>
>>>> Keep in mind "Yonik's Law Of Patches" ...
>>>>
>>>>         "A half-baked patch in Jira, with no documentation, no tests
>>>>         and no backwards compatibility is better than no patch at all."
>>>>         http://wiki.apache.org/solr/HowToContribute
>>>>
>>>> ...even if you don't think the code is "solid" yet, if you want to
>>>> eventually make it available to people, making a "rough" version
>>>> available to people early gives other people the opportunity to help
>>>> you make it solid (by writing unit tests, fixing bugs, and adding
>>>> documentation).
>>>>
>>>>
>>>> -Hoss
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Cuong Hoang
>>
>>
>>
>>
>> -- 
>> Regards,
>>
>> Cuong Hoang
>>
> 
> 
> 
> 
