Posted to java-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2007/05/15 21:48:20 UTC
[Lucene-java Wiki] Update of "SpellChecker" by DanielNaber
The following page has been changed by DanielNaber:
http://wiki.apache.org/jakarta-lucene/SpellChecker
The comment on the change is:
typo and grammar fixes
------------------------------------------------------------------------------
=== SpellChecker ===
- A Spell Checker allows to suggest a list of words closed from a misspelled word. This implementation is based on the David Spencer's code using the n-gram method and the Levensthein distance.
+ A spell checker suggests a list of words similar to a misspelled word. This implementation is based on David Spencer's code and uses the n-gram method and the Levenshtein distance.
== Structure of a dictionary index ==
- A Index (the dictionary) with all the possible words (a lucene index) must be created. The structure of this index is (for a 3-4 gram).
+ An index (the dictionary) containing all the possible words must be created as a Lucene index. For 3-grams and 4-grams, the structure of this index is:
|| Index Structure || Example ||
|| word || kings ||
||gram3|| kin, ing, ngs ||
@@ -15, +15 @@
||end3|| ngs||
||end4|| ings||
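
To make the table concrete, here is a minimal sketch in plain Java (no Lucene dependency) that derives the visible rows for a word. The class and method names are hypothetical, and only the fields shown above (word, gramN, endN) are produced; rows elided by the diff are not reconstructed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: this class is not part of Lucene.
class NGramSketch {

    // All contiguous substrings of length n, in order of appearance.
    static List<String> grams(String word, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            result.add(word.substring(i, i + n));
        }
        return result;
    }

    // Field name -> value for one gram size, mirroring the table rows
    // "word", "gramN" and "endN". Words shorter than n get only "word".
    static Map<String, String> fields(String word, int n) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("word", word);
        List<String> g = grams(word, n);
        if (!g.isEmpty()) {
            doc.put("gram" + n, String.join(", ", g));
            doc.put("end" + n, g.get(g.size() - 1));
        }
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(fields("kings", 3));
        System.out.println(fields("kings", 4));
    }
}
```

For "kings" this yields the 3-grams kin, ing, ngs with end3 = ngs, and end4 = ings, matching the example column.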
- == Importation: add words to the dictionary ==
+ == Import: Adding Words to the Dictionary ==
- we can add the words coming from a Lucene Index (more precisely a set of Lucene fields), why not, from a file with a list of words.
+ Words can be added from a Lucene index (more precisely, from a set of Lucene fields) or from a text file containing a list of words.
* Example: we can add all the keywords of a given Lucene field of my index.
{{{
@@ -24, +24 @@
spell.indexDictionary(new LuceneDictionary(my_luceneReader,my_fieldname));
}}}
- == get a list of suggested words ==
+ == Getting a List of Suggested Words ==
- The suggestSimilar method return a list of suggested words sorted by:
+ The suggestSimilar method returns a list of suggested words sorted by:
- 1. the Levenshtein distance (the closest words of the misspelled word is the first of the list).
+ 1. the Levenshtein distance (the word most similar to the misspelled word comes first in the list).
- 2. (optionaly) the popularity of the word in a given Lucene Field.
+ 2. (optionally) the popularity of the word in a given Lucene Field.
- furthermore, that list can be restricted only to the words present in a given Lucene Field.
+ Furthermore, that list can be restricted to the words present in a given Lucene Field.
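
The first sort criterion can be sketched in plain Java as follows. This is only an illustration of ordering candidates by Levenshtein distance, not the SpellChecker's own code; the class name, the candidate list and the input word "sevanty" are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: this class is not part of Lucene.
class DistanceSortSketch {

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns at most numList candidates, closest to the misspelled word first.
    static List<String> closest(String misspelled, List<String> candidates, int numList) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt((String c) -> levenshtein(misspelled, c)));
        return new ArrayList<>(sorted.subList(0, Math.min(numList, sorted.size())));
    }

    public static void main(String[] args) {
        // Hypothetical candidate words; a real dictionary index would supply these.
        List<String> candidates = List.of("seven", "seventy", "twenty");
        System.out.println(closest("sevanty", candidates, 2));
    }
}
```

Here "seventy" is one substitution away from "sevanty", so it is returned first.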
* First example: the suggestSimilar(misspelled_word, num_list) method.
The ''num_list'' is the maximum number of words returned.
@@ -39, +39 @@
//l[0] = "seventy"
}}}
- * Second example: the suggestSimilar(misspelled_word, num_list, myIndex_Redear,myField, morePopular)
+ * Second example: the suggestSimilar(misspelled_word, num_list, myIndexReader, myField, morePopular) method.
- ''''Note'''': if myIndex_reader and myField are null this method is the same as the first method
+ ''''Note'''': if myIndexReader and myField are null, this method behaves the same as the first method.
- 1. The returned words are restricted only to the words presents in the field ''myField'' of the Lucene Index "myIndex_Reader"
+ 1. The returned words are restricted to the words present in the field ''myField'' of the Lucene Index "myIndexReader"
- 2. the list is also sorted with a second criterium: the popularity (the frequence) of the word in the user field
+ 2. The list is also sorted by a second criterion: the popularity (the frequency) of the word in the user field
- 3. If ''morePopular'' is true and the mispelled word exist in the user field , return only the words more frequent than this.
+ 3. If ''morePopular'' is true and the misspelled word exists in the user field, return only the words that are more frequent than it.
- See the test case code for example
+ See the test case code for an example.
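
The three rules of the second example can be approximated in plain Java as below. This is a sketch under stated assumptions, not the library's implementation: edit distances are passed in precomputed, a plain map of word frequencies stands in for the field ''myField'' of myIndexReader, and all class, method and parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative only: this class is not part of Lucene.
class PopularitySketch {

    // distances: candidate word -> edit distance to the misspelled word.
    // fieldFrequency: word -> frequency in the user field (stand-in for an index field).
    static List<String> suggest(String misspelled,
                                Map<String, Integer> distances,
                                Map<String, Integer> fieldFrequency,
                                boolean morePopular,
                                int numList) {
        // Rule 3 only applies when the misspelled word itself exists in the field;
        // otherwise the threshold stays at 0 and filters nothing extra.
        int threshold = morePopular ? fieldFrequency.getOrDefault(misspelled, 0) : 0;

        List<String> kept = new ArrayList<>();
        for (String candidate : distances.keySet()) {
            int frequency = fieldFrequency.getOrDefault(candidate, 0);
            if (frequency == 0) {
                continue; // rule 1: keep only words present in the field
            }
            if (morePopular && frequency <= threshold) {
                continue; // rule 3: keep only words more frequent than the misspelled one
            }
            kept.add(candidate);
        }

        // Rule 2: smallest distance first, then highest frequency first.
        kept.sort(Comparator.comparingInt((String c) -> distances.get(c))
                .thenComparing(Comparator.comparingInt((String c) -> fieldFrequency.get(c)).reversed()));
        return new ArrayList<>(kept.subList(0, Math.min(numList, kept.size())));
    }

    public static void main(String[] args) {
        // Hypothetical data for illustration.
        Map<String, Integer> distances = Map.of("seventy", 1, "seventh", 2, "serenity", 3);
        Map<String, Integer> frequency = Map.of("seventy", 10, "seventh", 4, "sevanty", 5);
        System.out.println(suggest("sevanty", distances, frequency, true, 5));
        System.out.println(suggest("sevanty", distances, frequency, false, 5));
    }
}
```

With ''morePopular'' set to true, "seventh" is dropped because it is less frequent than the misspelled word, and "serenity" is dropped in both calls because it does not occur in the field.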
-
== Changes ==
Version 1.1:
 * sort fixed (the sort was inverted!)
- * set gram dynamicaly (depending of the length of the word)
+ * set gram dynamically (depending on the length of the word)
* use the FuzzyQuery score: ((edit distance)/(length of word))
- * new Dictionary interface + LuceneDictionary and PlaintextDictionary implementation
+ * new Dictionary interface + LuceneDictionary and PlaintextDictionary implementation
 * replace the addWords method with indexDictionary(Dictionary dic)
- * add a new public method: boolean exist(word)
+ * add a new public method: boolean exist(word)
* add a build.xml
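
Two of the changes above, the dynamic gram size and the distance-based score, can be sketched in plain Java. The length cut-offs below are placeholders chosen for illustration, since the page does not state the actual values, and the class and method names are hypothetical. The score follows the formula given in the list, (edit distance)/(length of word).

```java
// Illustrative only: this class is not part of Lucene.
class GramSizeSketch {

    // Smallest gram size for a word of the given length.
    // The cut-off values are assumptions, not taken from the page.
    static int minGram(int wordLength) {
        if (wordLength > 5) {
            return 3;
        }
        if (wordLength == 5) {
            return 2;
        }
        return 1;
    }

    // Largest gram size: one more than the smallest (also an assumption).
    static int maxGram(int wordLength) {
        return minGram(wordLength) + 1;
    }

    // Edit distance divided by word length, as in the change list.
    // Lower values mean a closer match.
    static float normalizedDistance(int editDistance, int wordLength) {
        if (wordLength <= 0) {
            throw new IllegalArgumentException("wordLength must be positive");
        }
        return (float) editDistance / wordLength;
    }

    public static void main(String[] args) {
        String word = "seventy";
        System.out.println(minGram(word.length()) + "-" + maxGram(word.length()) + " grams");
        System.out.println(normalizedDistance(1, word.length()));
    }
}
```

Dividing by the word length keeps a single typo in a long word from weighing as much as a single typo in a short one.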
== Credits ==