Posted to dev@lucene.apache.org by Mathieu Lecarme <ma...@garambrogne.net> on 2007/07/06 20:37:45 UTC

for a better spellchecker

Currently, SpellChecker uses the trigram algorithm to find similar words. It
works well for keyboard fumbles, but not well enough for short words
or for languages like French, where the same sound can be written
in several different ways.
Spellchecking is a classic computing task, and aspell provides some
nice and free (it's GNU) sound dictionaries. Lots of dictionaries are
available.
I wrote a Python parser that generates translation code in several
languages: Python, PHP and Java. It is a bit like the Snowball approach.
A little work remains to generate Lucene-compliant code. But is the
Python generator good enough for Lucene, or must it be translated
to Java before it can go into the Lucene source?
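
To make the short-word problem concrete, here is a minimal sketch of plain trigram overlap. It is only an illustration, not the contrib/spellchecker implementation; the class name TrigramSketch and the Jaccard-style score are assumptions made for the example.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative only: plain trigram overlap, not Lucene's SpellChecker code.
class TrigramSketch {

    // Collect the distinct 3-character substrings of a word.
    static Set<String> trigrams(String word) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 3 <= word.length(); i++) {
            grams.add(word.substring(i, i + 3));
        }
        return grams;
    }

    // Shared trigrams divided by all trigrams (Jaccard overlap).
    // 0.0 means the two words have no trigram in common.
    static double similarity(String a, String b) {
        Set<String> ga = trigrams(a);
        Set<String> gb = trigrams(b);
        if (ga.isEmpty() && gb.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new LinkedHashSet<>(ga);
        shared.retainAll(gb);
        Set<String> all = new LinkedHashSet<>(ga);
        all.addAll(gb);
        return (double) shared.size() / all.size();
    }

    public static void main(String[] args) {
        // A long word with one dropped letter still shares most trigrams.
        System.out.println(similarity("spellchecker", "spelchecker"));
        // A three-letter word has a single trigram, so one typo removes it.
        System.out.println(similarity("cat", "cot"));   // 0.0
        // French homophones can share no trigram at all.
        System.out.println(similarity("seau", "saut")); // 0.0
    }
}
```

With this scoring, "cat"/"cot" and "seau"/"saut" both come out at 0.0, which is the weakness described above: neither a short word nor a same-sounding spelling leaves any trigram to match on.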

I'll soon start a PhonemeSpellChecker which overrides the trigram
SpellChecker.
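
As a rough idea of what a phoneme-based lookup could key on, here is a toy sketch. The rewrite rules are hypothetical and hand-picked for the example; they are NOT taken from aspell's phonetic tables, and the class name PhoneticKeySketch is made up.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Toy phonetic key: words that sound alike should map to the same key.
class PhoneticKeySketch {

    // Hypothetical rules, longest spellings first so they win over prefixes.
    // These are assumptions for illustration, not aspell's French rules.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("eaux", "o");
        RULES.put("eau", "o");
        RULES.put("aux", "o");
        RULES.put("aut", "o");
        RULES.put("au", "o");
    }

    // Scan left to right, replacing the first matching spelling at each position.
    static String key(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < w.length()) {
            boolean matched = false;
            for (Map.Entry<String, String> rule : RULES.entrySet()) {
                if (w.startsWith(rule.getKey(), i)) {
                    out.append(rule.getValue());
                    i += rule.getKey().length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(w.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(key("seau"));  // so
        System.out.println(key("saut"));  // so
        System.out.println(key("seaux")); // so
    }
}
```

Indexing each word under its key, rather than under its trigrams, would let "seau" and "saut" find each other even though they share no trigram.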

The next step is to implement a word splitter, just like Google's suggestions.

Any suggestions?

M.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: for a better spellchecker

Posted by Mathieu Lecarme <ma...@garambrogne.net>.

The SpellChecker code mixes indexing functions, n-gram handling, and
querying functions. Extending it will not produce clean code.
Would it make sense to first refactor the SpellChecker code to extract the
dictionary-reading functions and the indexing/searching functions?
SpellChecker would get a method to add a SpellEngine interface, which
looks like

interface SpellEngine {
	public void addWord(String word);
	public String[] suggestSimilar(String word, int numSug);
}

and something to sort the suggestions, such as a "distance" from the suggested word.
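
A minimal sketch of what one engine behind that interface might look like, assuming an in-memory word list and Levenshtein edit distance as the sort key. The class name EditDistanceEngine is hypothetical, and nothing here touches a Lucene index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// The interface proposed above, repeated so the sketch compiles on its own.
interface SpellEngine {
    void addWord(String word);
    String[] suggestSimilar(String word, int numSug);
}

// Hypothetical engine: keeps words in memory and ranks them by edit distance.
class EditDistanceEngine implements SpellEngine {
    private final List<String> words = new ArrayList<>();

    @Override
    public void addWord(String word) {
        words.add(word);
    }

    @Override
    public String[] suggestSimilar(String word, int numSug) {
        List<String> ranked = new ArrayList<>(words);
        // Closest words first; the sort is stable, so ties keep insertion order.
        ranked.sort(Comparator.comparingInt((String w) -> distance(word, w)));
        int n = Math.min(Math.max(numSug, 0), ranked.size());
        return ranked.subList(0, n).toArray(new String[0]);
    }

    // Classic two-row Levenshtein distance (insert, delete, substitute cost 1).
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }
}
```

The sketch recomputes the distance during sorting and scans every word, so it only shows the shape of the contract; a real engine would narrow the candidates first (by trigram or phonetic key) and then sort that short list.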

M.

On 9 Jul 2007, at 02:38, Chris Hostetter wrote:

>
> : Currently, SpellChecker uses the trigram algorithm to find similar words. It
> : works well for keyboard fumbles, but not well enough for short words
> : or for languages like French, where the same sound can be written
> : in several different ways.
> : Spellchecking is a classic computing task, and aspell provides some
> : nice and free (it's GNU) sound dictionaries. Lots of dictionaries are
> : available.
>
> The topic of "spell correction" as it pertains to Lucene users can really
> have two meanings:
>   a) an attempt to suggest potential spell correction of query strings
> provided by a user as a form of input pre-processing
>   b) to use Lucene as a tool to suggest spell corrections based on a known
> corpus.
>
> The contrib/spellchecker code is an application of "B" -- it may in fact
> be useful for "A" but that doesn't mean there aren't other non-Lucene
> tools for achieving "A" as well.
>
> : I wrote a Python parser that generates translation code in several
> : languages: Python, PHP and Java. It is a bit like the Snowball approach.
> : A little work remains to generate Lucene-compliant code. But is the
> : Python generator good enough for Lucene, or must it be translated
> : to Java before it can go into the Lucene source?
>
> the Lucene-Java repository tends to be about Java code, but
> contrib/javascript is an example of non-Java code that may be of general
> use to Lucene-Java users.
>
>
> -Hoss

Re: for a better spellchecker

Posted by Chris Hostetter <ho...@fucit.org>.

: Currently, SpellChecker uses the trigram algorithm to find similar words. It
: works well for keyboard fumbles, but not well enough for short words
: or for languages like French, where the same sound can be written
: in several different ways.
: Spellchecking is a classic computing task, and aspell provides some
: nice and free (it's GNU) sound dictionaries. Lots of dictionaries are
: available.

The topic of "spell correction" as it pertains to Lucene users can really
have two meanings:
  a) an attempt to suggest potential spell correction of query strings
provided by a user as a form of input pre-processing
  b) to use Lucene as a tool to suggest spell corrections based on a known
corpus.

The contrib/spellchecker code is an application of "B" -- it may in fact
be useful for "A" but that doesn't mean there aren't other non-Lucene
tools for achieving "A" as well.

: I wrote a Python parser that generates translation code in several
: languages: Python, PHP and Java. It is a bit like the Snowball approach.
: A little work remains to generate Lucene-compliant code. But is the
: Python generator good enough for Lucene, or must it be translated
: to Java before it can go into the Lucene source?

the Lucene-Java repository tends to be about Java code, but
contrib/javascript is an example of non-Java code that may be of general
use to Lucene-Java users.


-Hoss


Re: for a better spellchecker

Posted by "J. Delgado" <jd...@lendingclub.com>.

Instead of "overriding" the trigram approach, you may want to combine the
two. That is, create trigrams from the dictionary's word list and weight
those matches much higher than the ones coming from the index, or even do
an exact dictionary lookup first and fall back to a trigram/index-based
lookup if it fails.
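
The second variant (exact lookup first, fuzzy lookup only on a miss) could be sketched roughly as follows. The class name DictionaryFirstLookup and the Function-based fallback are assumptions for the example; the fallback stands in for whatever trigram/index search is actually used.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical two-stage lookup: exact dictionary hit first, fuzzy search second.
class DictionaryFirstLookup {
    private final Set<String> dictionary;
    private final Function<String, List<String>> fallback;

    DictionaryFirstLookup(Set<String> dictionary, Function<String, List<String>> fallback) {
        // Copy the set so later changes by the caller do not affect lookups.
        this.dictionary = new HashSet<>(dictionary);
        this.fallback = fallback;
    }

    List<String> suggest(String word) {
        // Stage 1: the word is spelled correctly, nothing to suggest.
        if (dictionary.contains(word)) {
            return Collections.singletonList(word);
        }
        // Stage 2: defer to the trigram/index-based lookup (supplied by the caller).
        return fallback.apply(word);
    }
}
```

The first variant, weighting dictionary matches above index matches, would keep both sources in a single ranked list instead of short-circuiting on an exact hit.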

J.D.

2007/7/6, Mathieu Lecarme <ma...@garambrogne.net>:
>
> Currently, SpellChecker uses the trigram algorithm to find similar words. It
> works well for keyboard fumbles, but not well enough for short words
> or for languages like French, where the same sound can be written
> in several different ways.
> Spellchecking is a classic computing task, and aspell provides some
> nice and free (it's GNU) sound dictionaries. Lots of dictionaries are
> available.
> I wrote a Python parser that generates translation code in several
> languages: Python, PHP and Java. It is a bit like the Snowball approach.
> A little work remains to generate Lucene-compliant code. But is the
> Python generator good enough for Lucene, or must it be translated
> to Java before it can go into the Lucene source?
>
> I'll soon start a PhonemeSpellChecker which overrides the trigram
> SpellChecker.
>
> The next step is to implement a word splitter, just like Google's suggestions.
>
> Any suggestions?
>
> M.