Posted to solr-user@lucene.apache.org by David Philip <da...@gmail.com> on 2014/10/20 16:06:45 UTC

Word Break Spell Checker Implementation algorithm

Hi,

    Could you please point me to a link where I can learn about the
theory behind the implementation of the word break spell checker?
We know that Solr's DirectSolrSpellChecker component uses the Levenshtein
distance algorithm; what algorithm does the word break spell checker
component use? How does it detect where a space is needed if it
doesn't use shingles?


Thanks - David

Re: Word Break Spell Checker Implementation algorithm

Posted by Ramzi Alqrainy <ra...@gmail.com>.

WordBreakSolrSpellChecker offers suggestions by combining adjacent query
terms and/or breaking terms into multiple words. It is a SpellCheckComponent
enhancement, leveraging Lucene's WordBreakSpellChecker. It can detect
spelling errors resulting from misplaced whitespace without the use of
shingle-based dictionaries and provides collation support for word-break
errors, including cases where the user has a mix of single-word spelling
errors and word-break errors in the same query. It also provides shard
support.

Here is how it might be configured in solrconfig.xml:

<http://lucene.472066.n3.nabble.com/file/n4164997/Screen_Shot_2014-10-20_at_9.png>

Some of the parameters will be familiar from the discussion of the other
spell checkers, such as name, classname, and field. New for this spell
checker is combineWords, which defines whether words should be combined in a
dictionary search (default is true); breakWords, which defines if words
should be broken during a dictionary search (default is true); and
maxChanges, an integer which defines how many times the spell checker should
check collation possibilities against the index (default is 10).
The spellchecker can be configured alongside a traditional checker (e.g.,
DirectSolrSpellChecker). The results are combined, and collations can contain
a mix of corrections from both spellcheckers.

Add It to a Request Handler

Queries will be sent to a RequestHandler. If every request should generate a
suggestion, then you would add the following to the requestHandler that you
are using:

<http://lucene.472066.n3.nabble.com/file/n4164997/2.png>

For more details, see the tutorial below:

https://cwiki.apache.org/confluence/display/solr/Spell+Checking


RE: Word Break Spell Checker Implementation algorithm

Posted by "Dyer, James" <Ja...@ingramcontent.com>.

David,

I do not know of a published algorithm for this.  For terms with 0 frequency, it simply checks the document frequency of the various parts that can be made by breaking those terms and/or by combining adjacent terms. There are tuning parameters available that let you limit how much work it will do to try to find a suitable replacement.  See http://lucene.apache.org/core/4_10_0/suggest/org/apache/lucene/search/spell/WordBreakSpellChecker.html .

This of course is slower than indexing shingles as the work is done at query time vs index time.  But it saves the added index size and indexing time required to index the shingles separately.
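
The frequency check described above can be roughly sketched in Python. This is only an illustration and not Lucene's actual code: the toy DOC_FREQ index and the function names break_term, combine_terms, and suggest are hypothetical, the ranking is a simplified assumption, and the tuning parameters that limit the work are left out.

```python
from typing import Dict, List, Tuple

# Hypothetical toy index: term -> document frequency.
DOC_FREQ: Dict[str, int] = {"word": 20, "break": 5, "spell": 12, "spellchecker": 3}


def break_term(term: str, doc_freq: Dict[str, int]) -> List[Tuple[str, str]]:
    """Return two-way splits of `term` whose parts both occur in the index."""
    splits = []
    for i in range(1, len(term)):
        left, right = term[:i], term[i:]
        if doc_freq.get(left, 0) > 0 and doc_freq.get(right, 0) > 0:
            splits.append((left, right))
    # Rank by the rarer part's frequency (an assumed, simplified ordering).
    splits.sort(key=lambda p: min(doc_freq[p[0]], doc_freq[p[1]]), reverse=True)
    return splits


def combine_terms(terms: List[str], doc_freq: Dict[str, int]) -> List[Tuple[int, str]]:
    """Return (position, joined) for adjacent pairs that join into an indexed term."""
    combos = []
    for i in range(len(terms) - 1):
        # Skip pairs where both terms already occur in the index.
        if doc_freq.get(terms[i], 0) > 0 and doc_freq.get(terms[i + 1], 0) > 0:
            continue
        joined = terms[i] + terms[i + 1]
        if doc_freq.get(joined, 0) > 0:
            combos.append((i, joined))
    return combos


def suggest(terms: List[str], doc_freq: Dict[str, int]):
    """Collect break suggestions for zero-frequency terms, plus combine suggestions."""
    breaks = {t: break_term(t, doc_freq) for t in terms if doc_freq.get(t, 0) == 0}
    return breaks, combine_terms(terms, doc_freq)
```

With this toy index, suggest(["wordbreak"], DOC_FREQ) proposes breaking the term into "word" and "break", and suggest(["spell", "checker"], DOC_FREQ) proposes combining the two terms into "spellchecker". All of the lookups happen at query time against term frequencies, which is why no shingle field is needed.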

James Dyer
Ingram Content Group
(615) 213-4311

