Posted to solr-user@lucene.apache.org by anuvenk <an...@hotmail.com> on 2008/01/25 19:08:09 UTC
Re: Spell Check Handler
I followed your instructions exactly, but I still have trouble with multiword
queries.
For example: q=grapics returns 'graphics',
but q=grapics card returns nothing.
I even tried the latest nightly build, but that didn't solve the problem. Is
there any solution available?
scott.tabar wrote:
>
> Matthew,
>
> Thanks for the question. The answer is that they come from your own
> indexes so the dictionary is based upon the actual words that are already
> stored in Solr. This makes sense; if the spell checker is suggesting a
> word that is not in the Solr index, then it will not help the user find
> what they are looking for.
>
> You can control which fields in Solr feed the spell checker. You can also
> have more than one spell checker, each focused on a specific subject.
>
> The following example of a SpellCheckerRequestHandler is based upon the
> one I created for the test case. You need to add this to your
> solrconfig.xml file. You can view the whole thing within the Solr source
> code once it is committed to the main code stream. The paths are
> /src/test/test-files/solr/conf/solrconfig-spellchecker.xml and
> schema-spellchecker.xml in the same directory.
>
> <!-- SpellCheckerRequestHandler takes in a word (or several words) as the
>      value of the "q" parameter and returns a list of alternative spelling
>      suggestions. If invoked with a ...&cmd=rebuild, it will rebuild the
>      spellchecker index.
> -->
> <requestHandler name="spellchecker"
>                 class="solr.SpellCheckerRequestHandler" startup="lazy">
>   <!-- default values for query parameters -->
>   <lst name="defaults">
>     <int name="suggestionCount">20</int>
>     <float name="accuracy">0.60</float>
>   </lst>
>
>   <!-- Main init params for handler -->
>
>   <!-- The directory where your SpellChecker Index should live. -->
>   <!-- May be absolute, or relative to the Solr "dataDir" directory. -->
>   <!-- If this option is not specified, a RAM directory will be used -->
>   <str name="spellcheckerIndexDir">spell</str>
>
>   <!-- the field in your schema that you want to be able to build -->
>   <!-- your spell index on. This should be a field that uses a very -->
>   <!-- simple FieldType without a lot of Analysis (ie: string) -->
>   <str name="termSourceField">spell</str>
>
> </requestHandler>
>
> Some comments:
> - The termSourceField should be a field you have defined within your
> solr schema file. See notes below about the use of this field.
> - The spellcheckerIndexDir is the name of the directory that contains
> the spellchecker indexes. In my example, I used spell, and it will be at
> the same level as data and conf. You can name it whatever you would like.
> - If you use the name "/spellchecker", the URL will be more RESTful.
> - If you need to have more than one spell checker in use at a time, then
> you will need to change the name, spellcheckerIndexDir, and
> termSourceField for each one.
> - If you have more than one spell checker handler pointing at the same
> index directory, then when you rebuild the index through one of the
> handlers the other handlers will not know it has been reindexed. To
> resolve this issue, you may have to restart Solr.
>
>
> The following components are from the schema-spellchecker.xml file:
>
> <fieldType name="spellText" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"/>
>     <filter class="solr.StandardFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"/>
>     <filter class="solr.StandardFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
>
>
> <field name="spell" type="spellText" indexed="true" stored="true" />
>
>
>
> Some comments on the schema items above:
> - The fieldType must be contained within the "types" group
> - The spellText fieldType can be named whatever you want
> - The spellText fieldType should not be too aggressive on stemming or
> modifying the contents of the field
> - You could use string instead of the defined fieldType of spellText, but it
> does not have to be that restrictive
>
> - The field "spell" needs to be within the "fields" group with your
> other defined fields
> - You could always use copyField to copy another field's
> content into your "spell" field:
> <copyField source="misc" dest="spell"/>
>
>
> Some notes on the name of the handler:
> - If you precede the name with "/", you can use the first URL below instead
> of the second one:
> - using the name of "/spellchecker"
> http://yourSolrSite/solr/spellchecker?q=sialophosphoprotein
> - using the name of "spellchecker"
> http://yourSolrSite/solr/select?qt=spellchecker&q=sialophosphoprotein
>
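The two URL forms above can be sketched as a small helper. This is only an illustration of the naming rule described in the message: the class and method names are made up for the example, and the base URL is the placeholder used in the message.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SpellcheckUrls {
    /**
     * Builds a request URL for a handler registered under handlerName.
     * A name starting with "/" is addressed by path; any other name is
     * reached through /select with the qt parameter.
     */
    static String handlerUrl(String solrBase, String handlerName, String query) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        if (handlerName.startsWith("/")) {
            return solrBase + handlerName + "?q=" + q;
        }
        return solrBase + "/select?qt="
                + URLEncoder.encode(handlerName, StandardCharsets.UTF_8)
                + "&q=" + q;
    }

    public static void main(String[] args) {
        String base = "http://yourSolrSite/solr"; // placeholder host from the message
        System.out.println(handlerUrl(base, "/spellchecker", "sialophosphoprotein"));
        System.out.println(handlerUrl(base, "spellchecker", "sialophosphoprotein"));
    }
}
```

Running main prints the two URLs listed above, one per handler name.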
>
> Matthew, I hope you find this somewhat helpful.
>
> Scott Tabar
>
> ---- Matthew Runo <mr...@zappos.com> wrote:
> Where does the index come from in the first place? Do we have to
> enter the words, or are they entered as documents enter the SOLR index?
>
> I'd love to be able to use my own documents as the spell check index
> of "correctly spelled words".
>
> +--------------------------------------------------------+
> | Matthew Runo
> | Zappos Development
> | mruno@zappos.com
> | 702-943-7833
> +--------------------------------------------------------+
>
>
> On Oct 11, 2007, at 7:08 AM, <sc...@fuse.net>
> <sc...@fuse.net> wrote:
>
>> Climbingrose,
>>
>> I think you make a valid point. Each person may have a different
>> concept of how something should work with their application.
>>
>> My thought on the subject of spell checking multiple words:
>> - the parameter "multiWords" enables spell checking on each word
>> in "q" parameter instead of on the whole field
>> - each word is then represented in its own entry in a list of all
>> words that are checked
>> - to identify each word that is being checked within that entry,
>> it is identified by the key "words"
>> - to identify if the word was found exactly as it is within the
>> spell checker's index, the "exist" key contains this information
>> - Since there can be suggestions for both misspelled words and
>> words that are spelled correctly, the list of suggestions is also
>> included for both correctly spelled and misspelled words, even if
>> the suggestion list is empty.
>>
>> - My vision is that if a user has a search query of multiple
>> words and wants to check all of them, the use
>> of "multiWords" will check all words at one time, independently
>> of each other, and return the list. The presenting web app can
>> then identify visually to the user which words are misspelled and
>> which ones have suggestions too. The user can then work with the
>> various lists of suggestions without having to re-hit Solr.
>> Naturally, if the user manually changes a word, then Solr will have
>> to be re-hit, but providing a single list of all words, including
>> suggestions for correct words along with incorrect words, will help
>> simplify applications (by reducing iterating over each word) and
>> will help reduce the number of hits to the Solr server.
>>
>>
>>> 1) I assume that when a user enters a misspelled multiword query, we
>>> should only check the words that are actually misspelled. For example, if
>>> a user enters "life expectancy calculatar", which has "calculator"
>>> misspelled, we should only spellcheck "calculatar".
>>
>> I think I understand what you mean in the above statement, but you
>> must admit, it does sound funny. After all, how do you identify
>> that a word is misspelled by NOT using the spelling checker?
>> Correct me if I am wrong, but I think you intended to say that when
>> a word is identified as being misspelled, then you should only
>> include the suggestions for misspelled words. If this is the case,
>> then I would have to disagree with you. The user may be interested
>> in finding words that might mean the same but are more popular
>> (appear in more indexed documents within the Lucene index). This is
>> why I added the result field "exist" to identify that a
>> word is spelled correctly even if there is a list of suggestions.
>> Please note, the situation can exist too where a word is misspelled
>> and there are no suggestions so one cannot use the suggestion list
>> as an indicator to the correctness of the individual word(s).
>>
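A client could combine the "exist" flag and the suggestion list as in the sketch below. The WordCheck class and its field names are hypothetical stand-ins for one per-word entry of a "multiWords" response; the actual response layout produced by the patch is not shown in this thread.

```java
import java.util.List;

final class MultiWordResult {
    enum Status {
        CORRECT,
        CORRECT_WITH_ALTERNATIVES,
        MISSPELLED_WITH_SUGGESTIONS,
        MISSPELLED_NO_SUGGESTIONS
    }

    /** Hypothetical client-side view of one per-word entry. */
    static final class WordCheck {
        final String word;
        final boolean exist;            // word found as-is in the spell index
        final List<String> suggestions; // may be empty in either case

        WordCheck(String word, boolean exist, List<String> suggestions) {
            this.word = word;
            this.exist = exist;
            this.suggestions = suggestions;
        }
    }

    /** Correctness comes from "exist" alone; suggestions only add detail. */
    static Status classify(WordCheck w) {
        boolean hasSuggestions = !w.suggestions.isEmpty();
        if (w.exist) {
            return hasSuggestions ? Status.CORRECT_WITH_ALTERNATIVES : Status.CORRECT;
        }
        return hasSuggestions
                ? Status.MISSPELLED_WITH_SUGGESTIONS
                : Status.MISSPELLED_NO_SUGGESTIONS;
    }
}
```

The point of the four-way split is the one made above: an empty suggestion list does not mean the word is correct, and a non-empty one does not mean it is wrong.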
>>
>>> 2) I only return the best string for a misspelled query.
>>
>> You can also use the parameter "suggestionCount=1" to control how
>> many words are returned. In this case, it will do what your code
>> is doing, but still allow the client to dynamically change this
>> value without the need to hard code it within the main source code.
>>
>>
>> As far as only including terms that are more popular than the word
>> that is being checked, there is already a parameter
>> "onlyMorePopular" that you can use to dynamically control this
>> feature from the client side so it does not have to be hard coded
>> within the spelling checker.
>>
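A client can set these parameters per request. The sketch below only shows how the query string might be assembled and URL-encoded; the parameter names come from this thread, while the value "true" for "multiWords" and "onlyMorePopular", the class name, and the host are assumptions.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class SpellcheckParams {
    /** Joins parameters as key=value pairs, URL-encoding both sides. */
    static String queryString(Map<String, String> params) {
        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joined.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "grapics card");     // the space is encoded as "+"
        params.put("multiWords", "true");    // assumed value; check each word separately
        params.put("suggestionCount", "1");  // return only the best suggestion
        params.put("onlyMorePopular", "true");
        System.out.println("http://yourSolrSite/solr/spellchecker?" + queryString(params));
    }
}
```

Whether "multiWords" is honored depends on the build containing the patch discussed in this thread.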
>> Review these parameter options on the wiki, but keep in mind I have
>> not updated the wiki with my changes or the new parameter and
>> result fields:
>> http://wiki.apache.org/solr/SpellCheckerRequestHandler
>>
>> Thanks Climbingrose,
>>
>> Scott Tabar
>>
>>
>>
>>
>> ---- climbingrose <cl...@gmail.com> wrote:
>> Just to clarify this line of code:
>>
>> String[] suggestions = spellChecker.suggestSimilar(termText, numSug,
>> req.getSearcher().getReader(), restrictToField, true);
>>
>> I only return suggestions if they are more popular than termText. You
>> probably need to use code in Scott's patch to make this behaviour
>> configurable.
>>
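The idea behind the "more popular" flag can be illustrated without Lucene. The sketch below is not Lucene's implementation: it filters candidates against a plain document-frequency map, and the strict "greater than" comparison is an assumption (the library may treat ties differently).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class MorePopularFilter {
    /**
     * Keeps only the candidates that occur in more documents than the
     * original term. docFreq maps a word to its document count; words
     * missing from the map count as zero.
     */
    static List<String> onlyMorePopular(String term, List<String> candidates,
                                        Map<String, Integer> docFreq) {
        int termFreq = docFreq.getOrDefault(term, 0);
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (docFreq.getOrDefault(candidate, 0) > termFreq) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

With made-up frequencies such as graphic=5, graphics=40, grapics=1, checking "graphic" against the candidates "graphics" and "grapics" keeps only "graphics".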
>> On 10/11/07, climbingrose <cl...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I've been so busy the last few days that I haven't replied to this
>>> email. I modified SpellCheckerHandler a while ago to include support for
>>> multiword queries. To be honest, I didn't have time to write unit tests
>>> for the code. However, I deployed it in a production environment and it
>>> has been working for me so far. My version, however, makes two
>>> assumptions:
>>>
>>> 1) I assume that when a user enters a misspelled multiword query, we
>>> should only check the words that are actually misspelled. For example, if
>>> a user enters "life expectancy calculatar", which has "calculator"
>>> misspelled, we should only spellcheck "calculatar".
>>> 2) I only return the best string for a misspelled query.
>>>
>>> I guess I can just paste the code here so that others can adapt it
>>> for their own purposes. If you have any questions, just send me an
>>> email. I'll be happy to help you.
>>>
>>> // Needs java.io.IOException, java.io.StringReader and Lucene's
>>> // Analyzer, TokenStream and Token.
>>> StringBuffer buf = null;
>>> if (null != words && !"".equals(words.trim())) {
>>>   // Tokenize the query with the analyzer of the target field.
>>>   Analyzer analyzer =
>>>       req.getSchema().getField(field).getType().getAnalyzer();
>>>   TokenStream source = analyzer.tokenStream(field, new StringReader(words));
>>>   Token t;
>>>   boolean hasSuggestion = false;
>>>   boolean termExists = false;
>>>   while (true) {
>>>     try {
>>>       t = source.next();
>>>     } catch (IOException e) {
>>>       t = null;
>>>     }
>>>     if (t == null)
>>>       break;
>>>
>>>     String termText = t.termText();
>>>     // The last argument (true) keeps only suggestions that are more
>>>     // popular than termText.
>>>     String[] suggestions = spellChecker.suggestSimilar(termText,
>>>         numSug, req.getSearcher().getReader(), restrictToField, true);
>>>     if (suggestions != null && suggestions.length > 0) {
>>>       if (!suggestions[0].equals(termText)) {
>>>         hasSuggestion = true;
>>>       }
>>>       if (buf == null) {
>>>         buf = new StringBuffer(suggestions[0]);
>>>       } else
>>>         buf.append(" ").append(suggestions[0]);
>>>     } else if (spellChecker.exist(termText)) {
>>>       termExists = true;
>>>       if (buf == null) {
>>>         buf = new StringBuffer(termText);
>>>       } else
>>>         buf.append(" ").append(termText);
>>>     } else {
>>>       // Neither a suggestion nor an exact match: give up on the whole query.
>>>       hasSuggestion = false;
>>>       termExists = false;
>>>       break;
>>>     }
>>>   }
>>>   try {
>>>     source.close();
>>>   } catch (IOException e) {
>>>     // ignore
>>>   }
>>>   // String[] suggestions = spellChecker.suggestSimilar(words, numSug,
>>>   //     nullReader, restrictToField, onlyMorePopular);
>>>   if (hasSuggestion || termExists)
>>>     rsp.add("suggestions", buf.toString());
>>>   else
>>>     rsp.add("suggestions", null);
>>> }
>>>
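For readers who want to try the word-by-word approach outside Solr, here is a simplified, self-contained variant. It is not the code above: the Speller interface is a hypothetical stand-in for the two spell checker calls, the query is split on whitespace instead of being run through the field's analyzer, and a phrase is returned whenever every word can be either corrected or confirmed.

```java
import java.util.List;

final class MultiWordCorrector {
    /** Hypothetical stand-in for the spell checker used in the post. */
    interface Speller {
        List<String> suggest(String term); // best suggestion first, may be empty
        boolean exist(String term);        // true if term is in the dictionary
    }

    /**
     * Returns the corrected phrase, or null when the input is blank or
     * some word has neither a suggestion nor an exact dictionary match.
     */
    static String correct(String words, Speller speller) {
        if (words == null || words.trim().isEmpty()) {
            return null;
        }
        StringBuilder buf = new StringBuilder();
        for (String term : words.trim().split("\\s+")) {
            List<String> suggestions = speller.suggest(term);
            String replacement;
            if (!suggestions.isEmpty()) {
                replacement = suggestions.get(0); // take the best suggestion
            } else if (speller.exist(term)) {
                replacement = term;               // keep a known word as-is
            } else {
                return null;                      // unknown word, no suggestion
            }
            if (buf.length() > 0) {
                buf.append(' ');
            }
            buf.append(replacement);
        }
        return buf.toString();
    }
}
```

With a Speller that knows "life" and "expectancy" and suggests "calculator" for "calculatar", correct("life expectancy calculatar", speller) returns "life expectancy calculator".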
>>>
>>>
>>> On 10/11/07, scott.tabar@fuse.net <sc...@fuse.net> wrote:
>>>>
>>>> Hoss,
>>>>
>>>> I had a feeling someone would be quoting Yonik's Law of
>>>> Patches! ;-)
>>>>
>>>> For now, this is done.
>>>>
>>>> I created the changes, created JavaDoc comments on the various settings
>>>> and their expected output, created a JUnit test for the
>>>> SpellCheckerRequestHandler which tests various components of the handler,
>>>> and I also created the supporting configuration files for the JUnit tests
>>>> (schema and solrconfig files).
>>>>
>>>> I attached the patch to the JIRA issue, so now we just have to wait until
>>>> it gets added back in to the main code stream.
>>>>
>>>> For anyone who is interested, here is a link to the JIRA:
>>>> https://issues.apache.org/jira/browse/SOLR-375
>>>>
>>>> Could someone please drop me a hint on how to update the wiki or any other
>>>> documentation that could benefit from being updated? I'd like to help out
>>>> as much as possible, but first I need to know "how". ;-)
>>>>
>>>> When these changes do get committed back in to the daily build, please
>>>> review the generated JavaDoc for information on how to utilize these new
>>>> features. If anyone has any questions or comments, please do not hesitate
>>>> to ask.
>>>>
>>>>
>>>> As a general note of self-critique on these changes, I am not 100% sure
>>>> of the way I implemented the "nested" structure when the "multiWords"
>>>> parameter is used. My interest is that it should work smoothly with some
>>>> other technology such as Prototype using the JSON output type.
>>>> Unfortunately, I will not get a chance to start on that coding until next
>>>> week, so it is up in the air whether this structure will be conducive or
>>>> not. I am planning on providing more details in the documentation on how
>>>> to utilize these modifications in Prototype and Ajax when I get a chance
>>>> (and even provide links to a production site so you can see it in action
>>>> and view the source if interested). So stay tuned...
>>>>
>>>> Thanks for everyone's time,
>>>> Scott Tabar
>>>>
>>>> ---- Chris Hostetter <ho...@fucit.org> wrote:
>>>>
>>>> : If you like, I can post the source code changes that I made to the
>>>> : SpellCheckerRequestHandler, but at this time I am not ready to open a
>>>> : JIRA issue and submit the changes back through the subversion. I will
>>>> : need to do a little more testing, documentation, and create some unit
>>>> : tests to cover all of these changes, but what I have been able to
>>>> : perform, it is working very well.
>>>>
>>>> Keep in mind "Yonik's Law Of Patches" ...
>>>>
>>>> "A half-baked patch in Jira, with no documentation, no tests
>>>> and no backwards compatibility is better than no patch at
>>>> all."
>>>> http://wiki.apache.org/solr/HowToContribute
>>>>
>>>> ...even if you don't think the code is "solid" yet, if you want to
>>>> eventually make it available to people, making a "rough" version available
>>>> to people early gives other people the opportunity to help you make it
>>>> solid (by writing unit tests, fixing bugs, and adding documentation).
>>>>
>>>>
>>>> -Hoss
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Cuong Hoang
>>
>>
>>
>>
>> --
>> Regards,
>>
>> Cuong Hoang
>>
>
>
>
>
--
View this message in context: http://www.nabble.com/Re%3A-Spell-Check-Handler-tp13090498p15093599.html
Sent from the Solr - User mailing list archive at Nabble.com.