Posted to java-user@lucene.apache.org by David Spencer <da...@tropo.com> on 2004/09/14 21:22:52 UTC

NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Andrzej Bialecki wrote:

> David Spencer wrote:
> 
>>
>> I can/should send the code out. The logic is that for any terms in a 
>> query that have zero matches, go thru all the terms(!) and calculate 
>> the Levenshtein string distance, and return the best matches. A more 
>> intelligent way of doing this is to instead look for terms that also 
>> match on the 1st "n" (prob 3) chars.
> 
> 
> ...or prepare in advance a fast lookup index - split all existing terms 
> to bi- or trigrams, create a separate lookup index, and then simply for 
> each term ask a phrase query (phrase = all n-grams from an input term), 
> with a slop > 0, to get similar existing terms. This should be fast, and 
> you could provide a "did you mean" function too...
> 

Based on this mail I wrote an "ngram speller" for Lucene. It runs in 2 
phases. First you build a "fast lookup index" as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.

[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 
and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like "recursixe" 
or whatnot to see what suggestions it returns. Note this is not a normal 
search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp

[3] Here's the javadoc:

http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html

[4] Here's source in HTML:

http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152

[5] A few more details:

Based on a subsequent mail in this thread I set boosts for the words in 
the ngram index. The background is that each word (er..term for a given 
field) in the orig index is a separate Document in the ngram index. This 
Doc contains all ngrams (in my test case, like #2 above, of length 3 and 
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so 
that more frequent words will tend to be suggested more often.

I think in "plain" English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher
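
For illustration only, here is a minimal sketch of that indexing phase 
(this is not the attached NGramSpeller source; the field names and the 
helper signature are made up, and it assumes the Lucene 1.4-era 
Field.Keyword API):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // One Document per term of the original index: the word itself plus
    // its 3- and 4-grams, boosted by log(word_freq)/log(num_docs).
    Document spellDoc(String word, int wordFreq, int numDocs) {
        Document d = new Document();
        d.add(Field.Keyword("word", word)); // stored, untokenized
        for (int n = 3; n <= 4; n++) {
            for (int i = 0; i + n <= word.length(); i++) {
                d.add(Field.Keyword("gram" + n, word.substring(i, i + n)));
            }
        }
        // per the formula above: more frequent words get a larger boost
        // and so tend to be suggested more often
        d.setBoost((float) (Math.log(wordFreq) / Math.log(numDocs)));
        return d;
    }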

[6]

If people want to vote me in as a committer to the sandbox then I can 
check this code in - though again, I'd appreciate feedback.

thx,
  Dave




Hits.doc(x) and range queries

Posted by ro...@xemaps.com.
Hi guys!

I've posted previously that Hits.doc(x) was taking a long time.  Turns out it
has to do with a date range in our query.  We usually do date ranges like this:
Date:[(lucene date field) - (lucene date field)]

Sometimes the begin date is "000000000" which is what we get from
DateField.dateToString( new Date( 0 ) ).

This is when getting our search results from the Hits object takes an absurd
amount of time.  It's usually each time the Hits object attempts to get more
results from an IndexSearcher ( aka, every 100? ).

It also takes up more memory...

I was wondering why it affects the search so much even though we're only
returning 350 or so results.  Does the QueryParser do something similar to the
DateFilter on range queries?  Would it be better to use a DateFilter?
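
For reference, the filter approach looks roughly like this -- a sketch
against the 1.x-era API, with a placeholder index path and field names.
A DateFilter walks the date terms once into a bit set and never scores
them, so it avoids expanding the range into one query clause per term:

    import java.util.Date;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.DateFilter;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class DateFilterDemo {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            Query q = QueryParser.parse("some terms", "contents",
                                        new StandardAnalyzer());
            // restrict by date outside the query, without scoring
            DateFilter f = new DateFilter("Date", new Date(0), new Date());
            Hits hits = searcher.search(q, f);
            System.out.println(hits.length() + " hits");
        }
    }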

We're using Lucene 1.2 (with plans to upgrade).  Do newer versions of Lucene
have this problem?

Roy.



Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Doug Cutting <cu...@apache.org>.
Andrzej Bialecki wrote:
> I was wondering about the way you build the n-gram queries. You 
> basically don't care about their position in the input term. Originally 
> I thought about using PhraseQuery with a slop - however, after checking 
> the source of PhraseQuery I realized that this probably wouldn't be that 
> fast... You use BooleanQuery and start/end boosts instead, which may 
> give similar results in the end but much cheaper.

Sloppy PhraseQueries are slower than BooleanQueries, but not horribly 
slower.  The problem is that they don't handle the case where phrase 
elements are missing altogether, while a BooleanQuery does.  So what you 
really need is maybe a variation of a sloppy PhraseQuery that scores 
matches that do not contain all of the terms...

Doug



Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Andrzej Bialecki wrote:

> David Spencer wrote:
> 
>>> ...or prepare in advance a fast lookup index - split all existing
>>> terms to bi- or trigrams, create a separate lookup index, and then
>>> simply for each term ask a phrase query (phrase = all n-grams from
>>> an input term), with a slop > 0, to get similar existing terms.
>>> This should be fast, and you could provide a "did you mean"
>>> function too...
>>>
>>
>> Based on this mail I wrote an "ngram speller" for Lucene. It runs in 2
>> phases. First you build a "fast lookup index" as mentioned above.
>> Then to correct a word you do a query in this index based on the
>> ngrams in the misspelled word.
>>
> 
> The background for this suggestion was that I was playing some time ago 
> with a Luke plugin that builds various sorts of ancillary indexes, but 
> then I never finished it... Kudos for actually making it work ;-)

Sure, it was a fun little edge project. For the most part the code was 
done last week right after this thread appeared, but it always takes a 
while to get it from 95 to 100%.

> 
>> [1] Source is attached and I'd like to contribute it to the sandbox,
>> esp if someone can validate that what it's doing is reasonable and
>> useful.
> 
> 
> There have been many requests for this or similar functionality in the 
> past, I believe it should go into sandbox.
> 
> I was wondering about the way you build the n-gram queries. You 
> basically don't care about their position in the input term. Originally 
> I thought about using PhraseQuery with a slop - however, after checking 
> the source of PhraseQuery I realized that this probably wouldn't be that 
> fast... You use BooleanQuery and start/end boosts instead, which may 
> give similar results in the end but much cheaper.
> 
> I also wonder how this algorithm would behave for smaller values of 

Sure, I'll try to rebuild the demo w/ lengths 2-5 (and then the query 
page can test any contiguous combo).

> start/end lengths (e.g. 2,3,4). In a sense, the smaller the n-gram 
> length, the more "fuzziness" you introduce, which may or may not be 
> desirable (increased recall at the cost of precision - for small indexes 
> this may be useful from the user's perspective because you will always 
> get a plausible hit, for huge indexes it's a loss).
> 
>>
>> [2] Here's a demo page. I built an ngram index for ngrams of length 3
>>  and 4 based on the existing index I have of approx 100k 
>> javadoc-generated pages. You type in a misspelled word like
>> "recursixe" or whatnot to see what suggestions it returns. Note this
>> is not a normal search index query -- rather this is a test page for
>> spelling corrections.
>>
>> http://www.searchmorph.com/kat/spell.jsp
> 
> 
> Very nice demo! 

Thanks, kinda designed for ngram-nerds if you know what I mean :)

> I bet it's running way faster than the linear search 

Indeed, this is almost zero time, whereas the simple and dumb linear 
search was taking me 10 seconds. I will have to redo the site's main search 
page so it uses this new code, TBD, prob tomorrow.

> over terms :-), even though you have to build the index in advance. But 
> if you work with static or mostly static indexes this doesn't matter.
> 
>> Based on a subsequent mail in this thread I set boosts for the words
>> in the ngram index. The background is each word (er..term for a given
>>  field) in the orig index is a separate Document in the ngram index.
>> This Doc contains all ngrams (in my test case, like #2 above, of
>> length 3 and 4) of the word. I also set a boost of
>> log(word_freq)/log(num_docs) so that more frequent words will tend
>> to be suggested more often.
> 
> 
> You may want to experiment with 2 <= n <= 5. Some n-gram based 

Yep, will do prob tomorrow.

> techniques use all lengths together, some others use just single length, 
> results also vary depending on the language...
> 
>>
>> I think in "plain" English then the way a word is suggested as a 
>> spelling correction is: - frequently occurring words score higher -
>> words that share more ngrams with the orig word score higher - words
>> that share rare ngrams with the orig word score higher
> 
> 
> I think this is a reasonable heuristic. Reading the code I would 
> present it this way:

ok, thx, will update

> 
> - words that share more ngrams with the orig word score higher, and
>   words that share rare ngrams with the orig word score higher
>   (as a natural consequence of using BooleanQuery),
> 
> - and, frequently occurring words score higher (as a consequence of using
>   per-Document boosts),
> 
> - from reading the source code I see that you use Levenshtein distance
>   to prune the resultset of too long/too short results,
> 
> I think also that because you don't use the positional information about 
>  the input n-grams you may be getting some really weird hits.

Good point, though I haven't seen this yet. Might be due to the prefix 
boost and maybe some Markov chain magic tending to only show reasonable 
words.

> You could 
> prune them by simply checking if you find a (threshold) of input ngrams 
> in the right sequence in the found terms. This shouldn't be too costly 

Good point, I'll try to add that in as an optional parameter.

> because you operate on a small result set.
> 
>>
>> [6]
>>
>> If people want to vote me in as a committer to the sandbox then I can
> 
> 
> Well, someone needs to maintain the code after all... ;-)
> 




Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.
David Spencer wrote:

>> ...or prepare in advance a fast lookup index - split all existing
>> terms to bi- or trigrams, create a separate lookup index, and then
>> simply for each term ask a phrase query (phrase = all n-grams from
>> an input term), with a slop > 0, to get similar existing terms.
>> This should be fast, and you could provide a "did you mean"
>> function too...
>> 
> 
> Based on this mail I wrote an "ngram speller" for Lucene. It runs in 2
> phases. First you build a "fast lookup index" as mentioned above.
> Then to correct a word you do a query in this index based on the
> ngrams in the misspelled word.
> 

The background for this suggestion was that I was playing some time ago 
with a Luke plugin that builds various sorts of ancillary indexes, but 
then I never finished it... Kudos for actually making it work ;-)

> [1] Source is attached and I'd like to contribute it to the sandbox,
> esp if someone can validate that what it's doing is reasonable and
> useful.

There have been many requests for this or similar functionality in the 
past, I believe it should go into sandbox.

I was wondering about the way you build the n-gram queries. You 
basically don't care about their position in the input term. Originally 
I thought about using PhraseQuery with a slop - however, after checking 
the source of PhraseQuery I realized that this probably wouldn't be that 
fast... You use BooleanQuery and start/end boosts instead, which may 
give similar results in the end but much cheaper.
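
For illustration, the lookup query might be built roughly like this (a
sketch, not David's actual code; the "gram3"-style field name and the
bstart/bend boost parameters are assumed from the demo page, and the
three-argument add() is the 1.4-era BooleanQuery signature):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // OR together the misspelled word's ngrams; boost the first and
    // last grams so shared prefixes/suffixes count for more.
    BooleanQuery gramQuery(String word, int n, float bstart, float bend) {
        BooleanQuery q = new BooleanQuery();
        for (int i = 0; i + n <= word.length(); i++) {
            TermQuery tq = new TermQuery(
                new Term("gram" + n, word.substring(i, i + n)));
            if (i == 0) tq.setBoost(bstart);                    // first gram
            else if (i + n == word.length()) tq.setBoost(bend); // last gram
            q.add(tq, false, false); // optional, not required/prohibited
        }
        return q;
    }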

I also wonder how this algorithm would behave for smaller values of 
start/end lengths (e.g. 2,3,4). In a sense, the smaller the n-gram 
length, the more "fuzziness" you introduce, which may or may not be 
desirable (increased recall at the cost of precision - for small indexes 
this may be useful from the user's perspective because you will always 
get a plausible hit, for huge indexes it's a loss).

> 
> [2] Here's a demo page. I built an ngram index for ngrams of length 3
>  and 4 based on the existing index I have of approx 100k 
> javadoc-generated pages. You type in a misspelled word like
> "recursixe" or whatnot to see what suggestions it returns. Note this
> is not a normal search index query -- rather this is a test page for
> spelling corrections.
> 
> http://www.searchmorph.com/kat/spell.jsp

Very nice demo! I bet it's running way faster than the linear search 
over terms :-), even though you have to build the index in advance. But 
if you work with static or mostly static indexes this doesn't matter.

> Based on a subsequent mail in this thread I set boosts for the words
> in the ngram index. The background is each word (er..term for a given
>  field) in the orig index is a separate Document in the ngram index.
> This Doc contains all ngrams (in my test case, like #2 above, of
> length 3 and 4) of the word. I also set a boost of
> log(word_freq)/log(num_docs) so that more frequent words will tend
> to be suggested more often.

You may want to experiment with 2 <= n <= 5. Some n-gram based 
techniques use all lengths together, some others use just single length, 
results also vary depending on the language...

> 
> I think in "plain" English then the way a word is suggested as a 
> spelling correction is: - frequently occurring words score higher -
> words that share more ngrams with the orig word score higher - words
> that share rare ngrams with the orig word score higher

I think this is a reasonable heuristic. Reading the code I would 
present it this way:

- words that share more ngrams with the orig word score higher, and
   words that share rare ngrams with the orig word score higher
   (as a natural consequence of using BooleanQuery),

- and, frequently occurring words score higher (as a consequence of using
   per-Document boosts),

- from reading the source code I see that you use Levenshtein distance
   to prune the resultset of too long/too short results,

I think also that because you don't use the positional information about 
the input n-grams you may be getting some really weird hits. You could 
prune them by simply checking if you find a (threshold) of input ngrams 
in the right sequence in the found terms. This shouldn't be too costly 
because you operate on a small result set.

> 
> [6]
> 
> If people want to vote me in as a committer to the sandbox then I can

Well, someone needs to maintain the code after all... ;-)

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Morus Walter <mo...@tanto.de>.
David Spencer writes:
> > 
> > could you put the current version of your code on that website as a java
> 
> Weblog entry updated:
> 
> http://searchmorph.com/weblog/index.php?id=23
> 
thanks
> 
> Great suggestion and thanks for that idiom - I should know such things 
> by now. To clarify the "issue", it's just a performance issue, not a 
> functional one...anyway I put in the code - and to be scientific I 
> benchmarked it two times before the change and two times after - and the 
> results were surprisingly the same both times (1:45 to 1:50 with an index 
> that takes up > 200MB). Probably there are cases where this will run 
> faster, and the code seems more "correct" now so it's in.
> 
Ahh, I see, you check the field later.
The logging made me think you index all fields you loop over, in which
case one might get unwanted words into the ngram index.
> 
> 
> > 
> > 
> > An interesting application of this might be an ngram-index-enhanced version
> > of FuzzyQuery. While this introduces more complexity on the indexing
> > side, it might be a large speedup for fuzzy searches.
> 
> I was also thinking of reviewing the list to see if anyone had done a "Jaro 
> Winkler" fuzzy query yet and doing that....
> 
I went in another direction, and changed the ngram index and search
to use a similarity that computes

   m * m / (n1 * n2)

where m is the number of matching ngrams, n1 is the number of ngrams in
the query, and n2 is the number of ngrams in the word.
(At least if I got that right; I'm not sure I understand all parts
of the similarity class correctly.)

After removing the document boost in the ngram index based on the 
word frequency in the original index I find the results pretty good.
My data is a number of encyclopedias and dictionaries and I only use the
headwords for the ngram index. Term frequency doesn't seem relevant
in this case.

I still use the Levenshtein distance to modify the score and sort according
to score / distance, but in most cases this does not make a difference.
So I'll probably drop the distance calculation completely.

I also see little difference between using 2- and 3-grams on the one hand
and only using 2-grams on the other. So I'll presumably drop the 3-grams.

I'm not sure if the similarity I use is useful in general, but I 
attached it to this message in case someone is interested.
Note that you need to set the similarity for the index writer and searcher
and thus have to reindex in case you want to give it a try.

Morus



Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Morus Walter wrote:

> Hi David,
> 
>>Based on this mail I wrote an "ngram speller" for Lucene. It runs in 2 
>>phases. First you build a "fast lookup index" as mentioned above. Then 
>>to correct a word you do a query in this index based on the ngrams in 
>>the misspelled word.
>>
>>Let's see.
>>
>>[1] Source is attached and I'd like to contribute it to the sandbox, esp 
>>if someone can validate that what it's doing is reasonable and useful.
>>
> 
> great :-)
> 
>>[4] Here's source in HTML:
>>
>>http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
>>
> 
> could you put the current version of your code on that website as a java

Weblog entry updated:

http://searchmorph.com/weblog/index.php?id=23

To link to source code:

http://www.searchmorph.com/pub/ngramspeller/NGramSpeller.java

> source also? At least until it's in the lucene sandbox.
> 
> 
> I created an ngram index on one of my indexes and think I found an issue
> in the indexing code:
> 
> There is an option -f to specify the field on which the ngram index will
> be created. 
> However there is no code to restrict the term enumeration to this field.
> 
> So instead of 
>     final TermEnum te = r.terms();
> I'd suggest
>     final TermEnum te = r.terms(new Term(field, ""));
> and a check within the loop over the terms that the enumerated term
> still belongs to that field, e.g.
>     Term t = te.term();
>     if (!t.field().equals(field)) {
>         break;
>     }
> 
> otherwise you loop over all terms in all fields.

Great suggestion and thanks for that idiom - I should know such things 
by now. To clarify the "issue", it's just a performance issue, not a 
functional one...anyway I put in the code - and to be scientific I 
benchmarked it two times before the change and two times after - and the 
results were surprisingly the same both times (1:45 to 1:50 with an index 
that takes up > 200MB). Probably there are cases where this will run 
faster, and the code seems more "correct" now so it's in.



> 
> 
> An interesting application of this might be an ngram-index-enhanced version
> of FuzzyQuery. While this introduces more complexity on the indexing
> side, it might be a large speedup for fuzzy searches.

I was also thinking of reviewing the list to see if anyone had done a "Jaro 
Winkler" fuzzy query yet and doing that....

> 


Thanks,
  Dave

> Morus
> 




Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Morus Walter <mo...@tanto.de>.
Hi David,
> 
> Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
> phases. First you build a "fast lookup index" as mentioned above. Then 
> to correct a word you do a query in this index based on the ngrams in 
> the misspelled word.
> 
> Let's see.
> 
> [1] Source is attached and I'd like to contribute it to the sandbox, esp 
> if someone can validate that what it's doing is reasonable and useful.
> 
great :-)
> 
> [4] Here's source in HTML:
> 
> http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
> 
could you put the current version of your code on that website as a Java
source also? At least until it's in the Lucene sandbox.


I created an ngram index on one of my indexes and think I found an issue
in the indexing code:

There is an option -f to specify the field on which the ngram index will
be created. 
However there is no code to restrict the term enumeration to this field.

So instead of 

    final TermEnum te = r.terms();

I'd suggest

    final TermEnum te = r.terms(new Term(field, ""));

and a check within the loop over the terms that the enumerated term
still belongs to that field, e.g.

    Term t = te.term();
    if (!t.field().equals(field)) {
        break;
    }

otherwise you loop over all terms in all fields.
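
For context, the complete enumeration loop with that restriction might
look like the following sketch (r is an open IndexReader; addWord() is a
made-up stand-in for whatever feeds the ngram index):

    final TermEnum te = r.terms(new Term(field, ""));
    try {
        do {
            Term t = te.term();
            if (t == null || !t.field().equals(field)) {
                break; // ran past the last term of this field
            }
            addWord(t.text()); // hypothetical helper
        } while (te.next());
    } finally {
        te.close();
    }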


An interesting application of this might be an ngram-index-enhanced version
of FuzzyQuery. While this introduces more complexity on the indexing
side, it might be a large speedup for fuzzy searches.

Morus



Re: [VOTE] David Spencer as Lucene Sandbox committer

Posted by Daniel Naber <da...@t-online.de>.
On Wednesday 15 September 2004 01:44, Otis Gospodnetic wrote:

> And +1 from me.  Dave has submitted several nice pieces of code over
> the years.

+1




Re: [VOTE] David Spencer as Lucene Sandbox committer

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
If Doug and Otis give a +1 then that is enough for me!  :)

	+1


On Sep 14, 2004, at 7:44 PM, Otis Gospodnetic wrote:

> And +1 from me.  Dave has submitted several nice pieces of code over
> the years.
>
> Otis
>
> --- Doug Cutting <cu...@apache.org> wrote:
>
>> David Spencer wrote:
>>> If people want to vote me in as a committer to the sandbox then I
>>> can check this code in.
>>
>> +1
>>
>> Doug
>
>




Re: [VOTE] David Spencer as Lucene Sandbox committer

Posted by Jack Park <ja...@thinkalong.com>.
+1 if my vote counts for anything.
Jack

Otis Gospodnetic wrote:

> And +1 from me.  Dave has submitted several nice pieces of code over
> the years.
> 
> Otis
> 
> --- Doug Cutting <cu...@apache.org> wrote:
> 
> 
>>David Spencer wrote:
>>
>>>If people want to vote me in as a committer to the sandbox then I
>>>can check this code in.
>>
>>+1
>>
>>Doug
> 
> 
> 
> 
> 





Re: [VOTE] David Spencer as Lucene Sandbox committer

Posted by Christoph Goller <go...@detego-software.de>.
Otis Gospodnetic wrote:
> And +1 from me.  Dave has submitted several nice pieces of code over
> the years.

+1




[VOTE] David Spencer as Lucene Sandbox committer

Posted by Otis Gospodnetic <ot...@yahoo.com>.
And +1 from me.  Dave has submitted several nice pieces of code over
the years.

Otis

--- Doug Cutting <cu...@apache.org> wrote:

> David Spencer wrote:
> > If people want to vote me in as a committer to the sandbox then I
> > can check this code in.
> 
> +1
> 
> Doug




Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Doug Cutting <cu...@apache.org>.
David Spencer wrote:
> If people want to vote me in as a committer to the sandbox then I can 
> check this code in.

+1

Doug



Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Andrzej Bialecki wrote:

> Aad Nales wrote:
> 
>> David,
>>
>> Perhaps I misunderstand something so please correct me if I do. I used
>> http://www.searchmorph.com/kat/spell.jsp to look for conts without
>> changing any of the default values. What I got as results did not
>> include 'const' which has quite a high frequency in your index and
> 
> 
> ??? how do you know that? Remember, this is an index of _Java_docs, and 
> "const" is not a Java keyword.

I added a line of output to the right column under the 'details' box. 
"const" appears 216 times in the index (out of 96k docs), thus it is 
indeed kinda rare.

http://www.searchmorph.com/kat/spell.jsp?s=const

> 
>> should have a pretty low levenshtein distance. Any idea what causes this
>> behavior?
> 
> 
> 
> 




Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Aad Nales wrote:

> By trying: if you type const you will find that it returns 216 hits. The
> third sports 'const' as a term (space separated and all). I would expect
> 'conts' to return with const as well. But again I might be mistaken. I
> am now trying to figure out what the problem might be: 
> 
> 1. my expectations (most likely ;-)
> 2. something in the code..



I enhanced the code to store simple transpositions also and I 
regenerated my site w/ ngrams from 2 to 5 chars. If you set the 
transposition boost up to 10 then "const" is returned 2nd...

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1

> 
> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@getopt.org] 
> Sent: Wednesday, 15 September, 2004 12:23
> To: Lucene Users List
> Subject: Re: NGramSpeller contribution -- Re: combining open office
> spellchecker with Lucene
> 
> 
> Aad Nales wrote:
> 
> 
>>David,
>>
>>Perhaps I misunderstand something so please correct me if I do. I used
> 
> 
>>http://www.searchmorph.com/kat/spell.jsp to look for conts without 
>>changing any of the default values. What I got as results did not 
>>include 'const' which has quite a high frequency in your index and
> 
> 
> ??? how do you know that? Remember, this is an index of _Java_docs, and 
> "const" is not a Java keyword.
> 
> 
>>should have a pretty low levenshtein distance. Any idea what causes 
>>this behavior?
> 
> 
> 
> 




Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Andrzej Bialecki wrote:

> David Spencer wrote:
> 
>> To restate the question for a second.
>>
>> The misspelled word is: "conts".
>> The suggestion expected is "const", which seems reasonable enough as 
>> it's just a transposition away, thus the string distance is low.
>>
>> But - I guess the problem w/ the algorithm is that for short words 
>> like this, with transpositions, the two words won't share many ngrams.
>>
>> Just looking at 3grams...
>>
>> conts -> con ont nts
>> const -> con ons nst
>>
>> Thus they share just one 3-gram, which is why it scores so low. This 
>> is an interesting issue: how to tune the algorithm so that it ranks 
>> words this close higher.
> 
> 
> If you added 2-grams, then it would look like this (constructing also 
> special start/end grams):

Oh, cute trick to indicate prefixes and suffixes.
Anyway, as per my prev post I rebuilt the index w/ ngrams of length 2 to 5, 
and also stored transpositions, and w/ appropriate boosts :) then "const" 
is returned 2nd.

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1

> 
> conts -> _c co on nt ts s_
> const -> _c co on ns st t_
> 
> which gives 50% overlap.
> 
> In another system that I designed we were using a combination of 2-4 
> grams, albeit for a slightly different purpose, so in this case it would 
> be:
> 
> conts:
>     _c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_
> 
> const:
>     _c co on ns st t_, _co con ons nst st_, _con cons onst nst_
> 
> and the overlap is 40%.
> 
> 
>>
>> I guess one way is to add all simple transpositions to the lookup 
>> table (the "ngram index") so that these could easily be found, with 
>> the heuristic that "a frequent way of misspelling words is to 
>> transpose two adjacent letters".
> 
> 
> Yes, sounds like a good idea. Even though it increases the size of the 
> lookup index, it still beats using the linear search...
> 




Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.
David Spencer wrote:
> To restate the question for a second.
> 
> The misspelled word is: "conts".
> The suggestion expected is "const", which seems reasonable enough as 
> it's just a transposition away, thus the string distance is low.
> 
> But - I guess the problem w/ the algorithm is that for short words like 
> this, with transpositions, the two words won't share many ngrams.
> 
> Just looking at 3grams...
> 
> conts -> con ont nts
> const -> con ons nst
> 
> Thus they share just one 3-gram, which is why it scores so low. This is 
> an interesting issue: how to tune the algorithm so that it ranks 
> words this close higher.

If you added 2-grams, then it would look like this (constructing also 
special start/end grams):

conts -> _c co on nt ts s_
const -> _c co on ns st t_

which gives 50% overlap.

In another system that I designed we were using a combination of 2-4 
grams, albeit for a slightly different purpose, so in this case it would be:

conts:
	_c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_

const:
	_c co on ns st t_, _co con ons nst st_, _con cons onst nst_

and the overlap is 40%.
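
Generating the boundary grams is just a matter of padding the word with
'_' before slicing; a sketch (pre-generics Java, as in 2004; the method
name is illustrative):

    // ngrams("conts", 2) -> _c co on nt ts s_
    java.util.List ngrams(String word, int n) {
        java.util.List grams = new java.util.ArrayList();
        String padded = "_" + word + "_";
        for (int i = 0; i + n <= padded.length(); i++) {
            grams.add(padded.substring(i, i + n));
        }
        return grams;
    }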


> 
> I guess one way is to add all simple transpositions to the lookup table 
> (the "ngram index") so that these could easily be found, with the 
> heuristic that "a frequent way of misspelling words is to transpose two 
> adjacent letters".

Yes, sounds like a good idea. Even though it increases the size of the 
lookup index, it still beats using the linear search...

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Aad Nales wrote:

> By trying: if you type const you will find that it returns 216 hits. The
> third sports 'const' as a term (space separated and all). I would expect
> 'conts' to return with const as well. But again I might be mistaken. I
> am now trying to figure out what the problem might be: 
> 
> 1. my expectations (most likely ;-)
> 2. something in the code..


Good question.

If I use the form at the bottom of the page and ask for more results, 
the suggestion of "const" does eventually show up - 99th however(!).

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=2.0&bend=1.0

Even boosting the prefix match from 2.0 to 10.0 only changes the ranking 
a few slots.
http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=10.0&bend=1.0

To restate the question for a second.

The misspelled word is: "conts".
The suggestion expected is "const", which seems reasonable enough as 
it's just a transposition away, thus the string distance is low.

But - I guess the problem w/ the algorithm is that for short words like 
this, with transpositions, the two words won't share many ngrams.

Just looking at 3grams...

conts -> con ont nts
const -> con ons nst

Thus they share just one 3-gram, which is why it scores so low. This is 
an interesting issue: how to tune the algorithm so that it ranks 
words this close higher.

I guess one way is to add all simple transpositions to the lookup table 
(the "ngram index") so that these could easily be found, with the 
heuristic that "a frequent way of misspelling words is to transpose two 
adjacent letters".
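
A sketch of generating those transpositions, where each adjacent-letter
swap becomes one extra lookup key (the method name is made up):

    // transpositions("conts") -> ocnts, cnots, cotns, const
    java.util.List transpositions(String word) {
        java.util.List out = new java.util.ArrayList();
        for (int i = 0; i < word.length() - 1; i++) {
            StringBuffer sb = new StringBuffer(word);
            sb.setCharAt(i, word.charAt(i + 1));
            sb.setCharAt(i + 1, word.charAt(i));
            out.add(sb.toString());
        }
        return out;
    }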

Based on other mails I'll make some additions to the code and will 
report back if anything of interest changes here.




> 
> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@getopt.org] 
> Sent: Wednesday, 15 September, 2004 12:23
> To: Lucene Users List
> Subject: Re: NGramSpeller contribution -- Re: combining open office
> spellchecker with Lucene
> 
> 
> Aad Nales wrote:
> 
> 
>>David,
>>
>>Perhaps I misunderstand something so please correct me if I do. I used
> 
> 
>>http://www.searchmorph.com/kat/spell.jsp to look for conts without 
>>changing any of the default values. What I got as results did not 
>>include 'const' which has quite a high frequency in your index and
> 
> 
> ??? how do you know that? Remember, this is an index of _Java_docs, and 
> "const" is not a Java keyword.
> 
> 
>>should have a pretty low levenshtein distance. Any idea what causes 
>>this behavior?
> 
> 
> 
> 




RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Aad Nales <aa...@rotterdam-cs.com>.
By trying: if you type const you will find that it returns 216 hits. The
third sports 'const' as a term (space separated and all). I would expect
'conts' to return with const as well. But again I might be mistaken. I
am now trying to figure out what the problem might be: 

1. my expectations (most likely ;-)
2. something in the code..

-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: Wednesday, 15 September, 2004 12:23
To: Lucene Users List
Subject: Re: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene


Aad Nales wrote:

> David,
> 
> Perhaps I misunderstand something so please correct me if I do. I used

> http://www.searchmorph.com/kat/spell.jsp to look for conts without 
> changing any of the default values. What I got as results did not 
> include 'const' which has quite a high frequency in your index and

??? how do you know that? Remember, this is an index of _Java_docs, and 
"const" is not a Java keyword.

> should have a pretty low levenshtein distance. Any idea what causes 
> this behavior?



-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.
Aad Nales wrote:

> David,
> 
> Perhaps I misunderstand something so please correct me if I do. I used
> http://www.searchmorph.com/kat/spell.jsp to look for conts without
> changing any of the default values. What I got as results did not
> include 'const' which has quite a high frequency in your index and

??? how do you know that? Remember, this is an index of _Java_docs, and 
"const" is not a Java keyword.

> should have a pretty low levenshtein distance. Any idea what causes this
> behavior?



-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Aad Nales <aa...@rotterdam-cs.com>.
David,

Perhaps I misunderstand something so please correct me if I do. I used
http://www.searchmorph.com/kat/spell.jsp to look for conts without
changing any of the default values. What I got as results did not
include 'const', which has quite a high frequency in your index and
should have a pretty low Levenshtein distance. Any idea what causes this
behavior?

cheers,
Aad
