Posted to java-user@lucene.apache.org by Brian Mila <bm...@iastate.edu> on 2003/06/26 21:53:40 UTC

misspelled queries

Hi,

I've been thinking about trying to implement a misspelling or similarity
match, a la Google's "did you mean this ....".  I was thinking of using
SoundEx or one of the newer algorithms to find appropriate suggestions.  To
do this, though, I think I would need to enumerate every term in the index,
which is not a practical solution, I suppose.  Has anyone else attempted this
or had any success with this idea?

My only other idea would be to generate the SoundEx codes for every term as
it's indexed and then add those codes to the index in a different field.
(FYI, here's a link that explains SoundEx with example code:
http://www.codeproject.com/csharp/soundex.asp?target=soundex).
Then the query would search the regular fields and then form a second
soundex'd query and run it on the soundex field.  Does this sound plausible?
I'd be really interested to hear results if anyone has tried this before.
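
For readers who want to see what such a code looks like, here is a minimal
sketch of the standard American Soundex rules in Java. It is not the code
from the linked article, and the class and method names are made up for
illustration. The resulting four-character code is what would be stored in
the separate soundex field.

```java
import java.util.Locale;

final class SoundexSketch {
    // Digit for each letter A..Z; '0' marks letters that are not coded
    // (the vowels plus H, W and Y).
    private static final String CODES = "01230120022455012623010202";

    /** Returns the 4-character Soundex code, or "" if the word has no ASCII letters. */
    static String soundex(String word) {
        StringBuilder out = new StringBuilder();
        char previous = '0';
        for (char raw : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // skip digits, punctuation, non-ASCII letters
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
            } else if (code != '0' && code != previous) {
                out.append(code); // new consonant group
                if (out.length() == 4) {
                    break;
                }
            }
            // H and W do not separate letters with the same code; vowels do.
            if (raw != 'H' && raw != 'W') {
                previous = code;
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"spelling", "speling", "spellling"}) {
            System.out.println(w + " -> " + soundex(w));
        }
    }
}
```

Under these rules "spelling", "speling" and "spellling" all map to S145, so
a query on the soundex field would match all three spellings.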

Regards,
Brian






---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: misspelled queries

Posted by Eric Jain <Er...@isb-sib.ch>.
> I've been thinking about trying to implement a misspelling or
> similarity match, a la Google's "did you mean this ....".

This is what I do: if a query yields few results and one of its terms
does not occur in the index (or occurs only rarely), I suggest as a
correction the term that occurs most often in the index among all terms
that are similar to the original term.

Works pretty well most of the time, and when not, it's usually funny :-)

Counting the number of occurrences of a term in an index can be done
efficiently using indexReader.docFreq(term).
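
The heuristic above can be sketched without Lucene. In this illustration a
plain Map stands in for indexReader.docFreq(term), Levenshtein edit distance
is an assumed stand-in for the similarity test, and the names
DidYouMeanSketch, suggest, maxEdits and rareThreshold are hypothetical. The
"query yields few results" check is left to the caller.

```java
import java.util.Map;

final class DidYouMeanSketch {
    /** Levenshtein distance: insertions, deletions and substitutions each cost 1. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns the most frequent term within maxEdits of the given term,
     * or null if the term is already common or nothing better is found.
     * docFreq maps each indexed term to its document frequency.
     */
    static String suggest(String term, Map<String, Integer> docFreq,
                          int maxEdits, int rareThreshold) {
        int ownFreq = docFreq.getOrDefault(term, 0);
        if (ownFreq > rareThreshold) {
            return null; // the term occurs often enough; do not second-guess it
        }
        String best = null;
        int bestFreq = ownFreq;
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            // Check the cheap frequency test first, the edit distance second.
            if (e.getValue() > bestFreq
                    && editDistance(term, e.getKey()) <= maxEdits) {
                best = e.getKey();
                bestFreq = e.getValue();
            }
        }
        return best;
    }
}
```

This version scans every term, which is the cost the original question
worried about; it only shows the selection rule.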

See FuzzyTermEnum for how to list all similar terms. Depending on the size
of your index, you will probably have to create your own version. The most
effective optimization: include only terms that start with the same two
or three characters in the enumeration with
super.setEnum(indexReader.terms) in the constructor of your TermEnum.

Runs within milliseconds on a half-gigabyte index.
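
The prefix optimization works because terms are enumerated in sorted order,
so all terms sharing a prefix sit next to each other. The sketch below
illustrates that idea with a java.util.TreeSet standing in for the index's
term dictionary. It is not Lucene code, and PrefixCandidatesSketch and
candidates are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

final class PrefixCandidatesSketch {
    /**
     * Lists the terms that share the first prefixLength characters with the
     * given term. sortedTerms must use natural String ordering.
     */
    static List<String> candidates(NavigableSet<String> sortedTerms,
                                   String term, int prefixLength) {
        String prefix = term.substring(0, Math.min(prefixLength, term.length()));
        List<String> result = new ArrayList<>();
        // tailSet seeks directly to the first term >= prefix.
        for (String t : sortedTerms.tailSet(prefix, true)) {
            if (!t.startsWith(prefix)) {
                break; // sorted order: no later term can share the prefix
            }
            result.add(t);
        }
        return result;
    }
}
```

Only the returned candidates then need the more expensive similarity test.
The trade-off is that a typo inside the prefix itself falls outside the
enumerated range and will not be corrected.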

--
Eric Jain




RE: misspelled queries

Posted by Jon Crowell <jc...@dsg.harvard.edu>.
ASpell is an open source spell checking tool with an API for
C. (I'm afraid I don't know of a C# spell checking API).
ASpell uses a very sophisticated algorithm that begins by
translating the offending word into its soundslike equivalent,
and yields the best results of any spell checking tool I am
aware of.  It is not completely dependent on soundex, however,
so even misspellings that are not close enough to yield the
same soundex code will get good results with ASpell.

http://aspell.sourceforge.net/

Upon coming across a misspelled word you could automatically
run the search using the top (or the top three) spelling
suggestions. Or you could just provide a couple of alternative
queries based on the spelling suggestions. If your dictionary
is equal to your index then the suggestions will definitely
yield hits with Lucene and will also very likely be what the
user had in mind (because ASpell is amazingly good at finding
the right word).

Now you say that you don't want to use a dictionary but you
do want to deal with misspellings. That seems difficult to me.
Also, relying only on the SoundEx code will leave you high and
dry every time someone makes a minor typo that messes up the
SoundEx code -- "nifraction" instead of "infraction", for
instance.

Jon


> > GSpell is an open source java spell checking API.  It can
> > be found at 
> > http://umlslex.nlm.nih.gov/nlsRepository/gspell/doc/userDoc/
> >
> > It incorporates both metaphone (which is similar to 
> > SoundEx, I think) and ngram algorithms and it is easy to use.
> 
> That might be an option, but I'm using NLucene and C# so 
> porting a full java app is more solution than I'm looking for.
> 
> > I currently have an application in which a user submits a query
> > to Lucene and along the way I use GSpell to check all the terms
> > in the query.  If any are misspelled I underline with a squiggly
> > red line and provide spelling suggestions from GSpell if the
> > user right-clicks.
> >
> > If your spelling correction dictionary is exactly equal to 
> > the terms in your index then any misspelled word is also
> > guaranteed not to yield any hits, and any indexed term is
> > guaranteed not to turn up incorrectly spelled.
> 
> That's not quite what I wanted, actually.  I don't intend to 
> use a dictionary at all.  My hope is that the misspelling 
> should be close enough to the correct spelling that the 
> soundex code would be the same (i.e., spelling and speling 
> and spellling would all have the same soundex code).
> 
> > Jon
> 




Re: misspelled queries

Posted by Brian Mila <bm...@iastate.edu>.
> GSpell is an open source java spell checking API.  It can be found at
> http://umlslex.nlm.nih.gov/nlsRepository/gspell/doc/userDoc/
>
> It incorporates both metaphone (which is similar to SoundEx, I think) and
> ngram algorithms and it is easy to use.
>

That might be an option, but I'm using NLucene and C# so porting a full java
app is more solution than I'm looking for.


> I currently have an application in which a user submits a query to Lucene
> and along the way I use GSpell to check all the terms in the query.  If any
> are misspelled I underline with a squiggly red line and provide spelling
> suggestions from GSpell if the user right-clicks.
>
> If your spelling correction dictionary is exactly equal to the terms in your
> index then any misspelled word is also guaranteed not to yield any hits, and
> any indexed term is guaranteed not to turn up incorrectly spelled.
>

That's not quite what I wanted, actually.  I don't intend to use a
dictionary at all.  My hope is that the misspelling should be close enough
to the correct spelling that the soundex code would be the same (i.e.,
spelling and speling and spellling would all have the same soundex code).

> Jon







RE: misspelled queries

Posted by Jon Crowell <jc...@dsg.harvard.edu>.
GSpell is an open source java spell checking API.  It can be found at
http://umlslex.nlm.nih.gov/nlsRepository/gspell/doc/userDoc/

It incorporates both metaphone (which is similar to SoundEx, I think) and
ngram algorithms and it is easy to use.

I currently have an application in which a user submits a query to Lucene
and along the way I use GSpell to check all the terms in the query.  If any
are misspelled I underline with a squiggly red line and provide spelling
suggestions from GSpell if the user right-clicks.

If your spelling correction dictionary is exactly equal to the terms in your
index then any misspelled word is also guaranteed not to yield any hits, and
any indexed term is guaranteed not to turn up incorrectly spelled.
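
As a rough illustration of using the index's own terms as the dictionary
(this is not GSpell's API), the sketch below flags every query term that is
absent from a set of indexed terms. Whitespace splitting and lowercasing are
assumptions standing in for whatever analyzer the index uses, and
QueryTermCheckSketch and unknownTerms are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class QueryTermCheckSketch {
    /** Returns the query terms that do not occur in the set of indexed terms. */
    static List<String> unknownTerms(String query, Set<String> indexTerms) {
        List<String> unknown = new ArrayList<>();
        for (String token : query.toLowerCase(Locale.ROOT).split("\\s+")) {
            // A term missing from the index cannot produce hits, so it is
            // treated as a likely misspelling.
            if (!token.isEmpty() && !indexTerms.contains(token)) {
                unknown.add(token);
            }
        }
        return unknown;
    }
}
```

The flagged terms are the ones to underline or to pass to the spell checker
for suggestions.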

Jon



> -----Original Message-----
> From: Brian Mila [mailto:bmila@iastate.edu] 
> Sent: Thursday, June 26, 2003 3:54 PM
> To: lucene-user@jakarta.apache.org
> Subject: misspelled queries
> 
> 
> Hi,
> 
> I've been thinking about trying to implement a misspelling or
> similarity match, a la Google's "did you mean this ....".  I
> was thinking of using SoundEx or one of the newer algorithms
> to find appropriate suggestions.  To do this, though, I think I
> would need to enumerate every term in the index, which is not a
> practical solution, I suppose.  Has anyone else attempted this
> or had any success with this idea?
>
> My only other idea would be to generate the SoundEx codes
> for every term as it's indexed and then add those codes to the
> index in a different field. (FYI, here's a link that explains
> SoundEx with example code:
> http://www.codeproject.com/csharp/soundex.asp?target=soundex).
>
> Then the query would search the regular fields and then form a second
> soundex'd query and run it on the soundex field.  Does this sound plausible?
> I'd be really interested to hear results if anyone has tried this before.
>
> Regards,
> Brian





