You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Markus Fischer <ma...@fischer.name> on 2008/02/29 19:29:29 UTC

Re: Alternate spelling suggestion (was [Resent] Document boosting based on .. semantics? )

Hi

Mathieu Lecarme wrote:
>> On a related topic, I'm also searching for a way to suggest alternate 
>> spelling of words to the user, when we found a word which is very less 
>> frequent used in the index or not in the index at all. I'm Austrian 
>> based, when I e.g. search for "retthich" (wrong spelled "rettich" 
>> which is radish), Google suggests me the proper spelled word. I'm 
>> searching for a way to figure how to accomplish this, but again this 
>> may be lucene off-topic and/or I should properly start a separate 
>> thread ...
> you can use the ngram pattern and levestein distance to find near words.
> I try with  phonetic and aspell dictionnary.

Thanks for the dictionary hint. So obvious, but still I haven't thought about 
it until you mentioned it!

I've had really great success with the following pattern:

* after the Lucene index was created, I generate a myspell compatible 
dictionary from it

* when the search returns no result, every term is ran through myspell suggestion

* the top five myspell suggestion of each term are re-sorted by their 
frequency from the Lucene index

Additionally: when the user only entered a single term, I'm also providing the 
second best results as alternative (did you mean x or y?).

The results have been very good so far, thanks again!

- Markus

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Alternate spelling suggestion (was [Resent] Document boosting based on .. semantics? )

Posted by Mathieu Lecarme <ma...@garambrogne.net>.
> Hi
>
> Mathieu Lecarme wrote:
>>> On a related topic, I'm also searching for a way to suggest  
>>> alternate spelling of words to the user, when we found a word  
>>> which is very less frequent used in the index or not in the index  
>>> at all. I'm Austrian based, when I e.g. search for  
>>> "retthich" (wrong spelled "rettich" which is radish), Google  
>>> suggests me the proper spelled word. I'm searching for a way to  
>>> figure how to accomplish this, but again this may be lucene off- 
>>> topic and/or I should properly start a separate thread ...
>> you can use the ngram pattern and levestein distance to find near  
>> words.
>> I try with  phonetic and aspell dictionnary.
>
> Thanks for the dictionary hint. So obvious, but still I haven't  
> thought about it until you mentioned it!
>
> I've had really great success with the following pattern:
>
> * after the Lucene index was created, I generate a myspell  
> compatible dictionary from it
>
> * when the search returns no result, every term is ran through  
> myspell suggestion
>
> * the top five myspell suggestion of each term are re-sorted by  
> their frequency from the Lucene index
>
> Additionally: when the user only entered a single term, I'm also  
> providing the second best results as alternative (did you mean x or  
> y?).
>
> The results have been very good so far, thanks again!
>
> - Markus

I submit a patch for doing that nicely :
https://issues.apache.org/jira/browse/LUCENE-1190

Here is an example code :

LexiconReader lexiconReader = new DirectoryReader(directory);
Lexicon lexicon = new Lexicon(new RAMDirectory());
lexicon.addAnalyser(new NGramAnalyzer());
lexicon.read(lexiconReader);
QueryParser parser = new QueryParser("txt", new WhitespaceAnalyzer());
Query query = parser.parse("bio:brawn");
SuggestiveSearcher searcher = new SuggestiveSearcher(new  
IndexSearcher(directory), lexicon);
SuggestiveHits hits = searcher.searchWithSuggestions(query);
System.out.println(hits.getSuggestedQuery());

In this example, a lexicon is build from a directory, with a Ngram  
analyzer.
The directory contains the term "brown" in field "bio".
Hits is final, so, you can't heritate from it, that's why there is a  
SuggestiveHits.
SuggestiveHits has a thresold, if there's to few answer, similar word  
search is triggered.

Using myspell dictionnary is a good idea, i'll implement this.

M.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org