Posted to dev@lucene.apache.org by Safat Siddiqui <sa...@gmail.com> on 2015/07/07 05:06:16 UTC

Solr Spell checker for non-english language

Hello,

I am using Solr version 4.10.3 and trying to customize it for the Bangla
language. I have already built a Bangla stemmer for Solr indexing, and it
works fine.

Now I would like to use Solr's spell checker and suggestion functionality
for Bangla. Which section of "DirectSolrSpellChecker" should I modify?
I cannot find which section causes the difference between English and
non-English languages. Any direction would be very helpful. Thanks
in advance.

Regards,
Safat

-- 
Thanks,
Safat Siddiqui
Student
Department of CSE
Shahjalal University of Science and Technology
Sylhet, Bangladesh.

RE: Solr Spell checker for non-english language

Posted by "Dyer, James" <Ja...@ingramcontent.com>.
Safat,

DirectSolrSpellChecker defaults to Levenshtein distance to measure how close the query terms are to the actual terms in the index (see https://en.wikipedia.org/wiki/Levenshtein_distance). This metric is not English-specific, and it works for many languages.

Assuming this is not appropriate for the Bangla language (sorry for my ignorance!), you might need to write your own distance metric by implementing the StringDistance interface. You can specify your custom class using the "distanceMeasure" parameter under the SpellCheckComponent entry in solrconfig.xml:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">fully.qualified.classname.here</str>
    .. etc ..
  </lst>
</searchComponent>

For more information, see:  http://lucene.apache.org/core/5_2_1/suggest/org/apache/lucene/search/spell/DirectSpellChecker.html#setDistance%28org.apache.lucene.search.spell.StringDistance%29
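
As a rough illustration of the custom-metric route described above, here is a minimal sketch of a distance class. Several things in it are assumptions rather than anything Solr prescribes: the class name NormalizedEditDistance is made up, the choice to apply Unicode NFC normalization before comparing is only one idea that might suit Bangla text, and the small StringDistance interface declared here is a stand-in so the example compiles on its own. In a real plugin you would drop the stand-in and import org.apache.lucene.search.spell.StringDistance, whose getDistance method is expected to return a value between 0 and 1, with 1 meaning identical. The sketch also assumes Java 8 or later (for String.codePoints()).

```java
import java.text.Normalizer;

// Stand-in for org.apache.lucene.search.spell.StringDistance so this sketch
// compiles without Lucene on the classpath. Remove it and import the real
// interface when building an actual plugin.
interface StringDistance {
    float getDistance(String s1, String s2);
}

// Hypothetical example class: edit-distance similarity computed over
// NFC-normalized Unicode code points instead of raw UTF-16 chars.
class NormalizedEditDistance implements StringDistance {

    @Override
    public float getDistance(String s1, String s2) {
        // Normalize so that equivalent composed/decomposed forms compare equal.
        int[] a = Normalizer.normalize(s1, Normalizer.Form.NFC).codePoints().toArray();
        int[] b = Normalizer.normalize(s2, Normalizer.Form.NFC).codePoints().toArray();

        if (a.length == 0 && b.length == 0) {
            return 1.0f; // two empty strings are identical
        }

        // Classic two-row Levenshtein dynamic program.
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }

        // Scale the edit count into [0, 1], where 1 means identical.
        int edits = prev[b.length];
        return 1.0f - (float) edits / Math.max(a.length, b.length);
    }
}
```

If a class along these lines were packaged in a jar visible to Solr, its fully qualified name would go in the "distanceMeasure" entry shown above. Whether normalization alone is enough for Bangla is something you would need to judge against real query data.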

Finally, if misplaced whitespace in queries is a problem in Bangla, you may wish to use WordBreakSolrSpellChecker in conjunction with DirectSolrSpellChecker to correct those problems as well. See the main Solr example solrconfig.xml for more information (https://github.com/apache/lucene-solr/blob/branch_5x/solr/example/files/conf/solrconfig.xml).

James Dyer
Ingram Content Group
