Posted to java-user@lucene.apache.org by Aad Nales <aa...@rotterdam-cs.com> on 2004/09/08 09:07:00 UTC
combining open office spellchecker with Lucene
Hi All,
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: "has anybody
tried this before?"
Cheers,
Aad
--
Aad Nales
aad.nales@rotterdam-cs.com, +31-(0)6 54 207 340
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
RE: frequent terms - Re: combining open office spellchecker with Lucene
Posted by Aad Nales <aa...@rotterdam-cs.com>.
Also, you can use an alternative spellchecker for the 'checking' part and
use the Ngram algorithm for the 'suggestion' part. Only if the spell
'check' declares a word illegal would the 'suggestion' part perform its
magic.
cheers,
Aad
Re: frequent terms - Re: combining open office spellchecker with Lucene
Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:
> David Spencer wrote:
>
>> [1] The user enters a query like:
>> recursize descent parser
>>
>> [2] The search code parses this and sees that the 1st word is not a
>> term in the index, but the next 2 are. So it ignores the last 2 terms
>> ("recursive" and "descent") and suggests alternatives to
>> "recursize"...thus if any term is in the index, regardless of
>> frequency, it is left as-is.
>>
>> I guess you're saying that, if the user enters a term that appears in
>> the index and thus is sort of spelled correctly (as it exists in some
>> doc), then we use the heuristic that any sufficiently large doc
>> collection will have tons of misspellings, so we assume that rare
>> terms in the query might be misspelled (i.e. not what the user
>> intended) and we suggest alternatives to these words too (in addition
>> to the words in the query that are not in the index at all).
>
>
> Almost.
>
> If the user enters "a recursize purser", then: "a", which is in, say,
> >50% of the documents, is probably spelled correctly and "recursize",
> which is in zero documents, is probably misspelled. But what about
> "purser"? If we run the spell check algorithm on "purser" and generate
> "parser", should we show it to the user? If "purser" occurs in 1% of
> documents and "parser" occurs in 5%, then we probably should, since
> "parser" is a more common word than "purser". But if "parser" only
> occurs in 1% of the documents and purser occurs in 5%, then we probably
> shouldn't bother suggesting "parser".
>
> If you wanted to get really fancy then you could check how frequently
> combinations of query terms occur, i.e., does "purser" or "parser" occur
> more frequently near "descent". But that gets expensive.
I updated the code to have an optional popularity filter - if true then
it only returns matches more popular (frequent) than the word that is
passed in for spelling correction.
If true (default) then for common words like "remove", no results are
returned now, as expected:
http://www.searchmorph.com/kat/spell.jsp?s=remove
But if you set it to false (bottom slot in the form at the bottom of the
page) then the algorithm happily looks for alternatives:
http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0
TBD: I need to update the javadoc & repost the code, I guess. Also, as per
an earlier post, I store simple transpositions for words in the
ngram index.
-- Dave
>
> Doug
>
Re: frequent terms - Re: combining open office spellchecker with Lucene
Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:
> David Spencer wrote:
>
>> [1] The user enters a query like:
>> recursize descent parser
>>
>> [2] The search code parses this and sees that the 1st word is not a
>> term in the index, but the next 2 are. So it ignores the last 2 terms
>> ("recursive" and "descent") and suggests alternatives to
>> "recursize"...thus if any term is in the index, regardless of
>> frequency, it is left as-is.
>>
>> I guess you're saying that, if the user enters a term that appears in
>> the index and thus is sort of spelled correctly (as it exists in some
>> doc), then we use the heuristic that any sufficiently large doc
>> collection will have tons of misspellings, so we assume that rare
>> terms in the query might be misspelled (i.e. not what the user
>> intended) and we suggest alternatives to these words too (in addition
>> to the words in the query that are not in the index at all).
>
>
> Almost.
>
> If the user enters "a recursize purser", then: "a", which is in, say,
> >50% of the documents, is probably spelled correctly and "recursize",
> which is in zero documents, is probably misspelled. But what about
> "purser"? If we run the spell check algorithm on "purser" and generate
> "parser", should we show it to the user? If "purser" occurs in 1% of
> documents and "parser" occurs in 5%, then we probably should, since
> "parser" is a more common word than "purser". But if "parser" only
> occurs in 1% of the documents and purser occurs in 5%, then we probably
> shouldn't bother suggesting "parser".
OK, sure, got it.
I'll give it a think and try to add this option to my just-submitted
spelling code.
>
> If you wanted to get really fancy then you could check how frequently
> combinations of query terms occur, i.e., does "purser" or "parser" occur
> more frequently near "descent". But that gets expensive.
Yeah, expensive for a large scale search engine, but probably
appropriate for a desktop engine.
>
> Doug
>
Re: frequent terms - Re: combining open office spellchecker with Lucene
Posted by Doug Cutting <cu...@apache.org>.
David Spencer wrote:
> [1] The user enters a query like:
> recursize descent parser
>
> [2] The search code parses this and sees that the 1st word is not a term
> in the index, but the next 2 are. So it ignores the last 2 terms
> ("recursive" and "descent") and suggests alternatives to
> "recursize"...thus if any term is in the index, regardless of frequency,
> it is left as-is.
>
> I guess you're saying that, if the user enters a term that appears in
> the index and thus is sort of spelled correctly (as it exists in some
> doc), then we use the heuristic that any sufficiently large doc
> collection will have tons of misspellings, so we assume that rare terms
> in the query might be misspelled (i.e. not what the user intended) and
> we suggest alternatives to these words too (in addition to the words in
> the query that are not in the index at all).
Almost.
If the user enters "a recursize purser", then: "a", which is in, say,
>50% of the documents, is probably spelled correctly and "recursize",
which is in zero documents, is probably misspelled. But what about
"purser"? If we run the spell check algorithm on "purser" and generate
"parser", should we show it to the user? If "purser" occurs in 1% of
documents and "parser" occurs in 5%, then we probably should, since
"parser" is a more common word than "purser". But if "parser" only
occurs in 1% of the documents and purser occurs in 5%, then we probably
shouldn't bother suggesting "parser".
If you wanted to get really fancy then you could check how frequently
combinations of query terms occur, i.e., does "purser" or "parser" occur
more frequently near "descent". But that gets expensive.
Doug
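Doug's filtering rule can be sketched as a short function, assuming a docFreq-style lookup; the class and method names here are illustrative, not Lucene's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the heuristic above: keep a suggested correction only if it
// is more frequent in the collection than the word the user entered.
// "docFreq" stands in for an index lookup (hypothetical, not Lucene's API).
public class FrequencyFilter {
    public static List<String> filterSuggestions(
            String entered, List<String> candidates, Map<String, Integer> docFreq) {
        int enteredFreq = docFreq.getOrDefault(entered, 0);
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            // "parser" at 5% beats "purser" at 1%; the reverse is dropped
            if (docFreq.getOrDefault(candidate, 0) > enteredFreq) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

So if "purser" is rarer than "parser", the suggestion survives; if the user's word is already the more common of the two, nothing is suggested.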
RE: frequent terms - Re: combining open office spellchecker with Lucene
Posted by Aad Nales <aa...@rotterdam-cs.com>.
Doug Cutting wrote:
> David Spencer wrote:
>
>> Doug Cutting wrote:
>>
>>> And one should not try correction at all for terms which occur in a
>>> large proportion of the collection.
>>
>>
>>
>> I keep thinking over this one and I don't understand it. If a user
>> misspells a word and the "did you mean" spelling correction algorithm
>> determines that a frequent term is a good suggestion, why not suggest
>> it? The very fact that it's common could mean that it's more likely
>> that the user wanted this word (well, the heuristic here is that users
>> frequently search for frequent terms, which is probably wrong, but
>> anyway..).
>
>
> I think you misunderstood me. What I meant to say was that if the term
> the user enters is very common then spell correction may be skipped.
> Very common words which are similar to the term the user entered should
> of course be shown. But if the user's term is very common one need not
> even attempt to find similarly-spelled words. Is that any better?
Yes, sure, thx, I understand now - but maybe not - the context I was
thinking of was something like this:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a term
in the index, but the next 2 are. So it ignores the last 2 terms
("recursive" and "descent") and suggests alternatives to
"recursize"...thus if any term is in the index, regardless of frequency,
it is left as-is.
>>>>
My idea is to first execute the query and only run the 'spell check'
if the number of results is lower than a certain threshold.
Secondly, I would like to use the 'stemming' functionality that MySpell
offers for everything that is written to the index, together
with the POS appearance.
Thirdly, I want to regularly scan the index for often-used words to be
added to the list of 'approved' terms. This would serve another purpose
of the customer, which is building a synonym index for Dutch words used
in an educational context.
But having read all the input I think that using the index itself for a
first spell check is probably not a bad start.
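The first point (run the spell check only when the result count falls below a threshold) can be sketched like this; the interface and names are hypothetical:

```java
// Sketch of the threshold-triggered flow described above: only invoke the
// spell checker when a query returns fewer hits than a configured threshold.
// Suggester is a hypothetical stand-in for whatever checker is plugged in.
public class ThresholdSpellCheck {
    public interface Suggester {
        String suggest(String query);
    }

    // Returns a suggestion, or null when enough results came back already.
    public static String maybeSuggest(int hitCount, int threshold,
                                      String query, Suggester suggester) {
        if (hitCount >= threshold) {
            return null; // plenty of results: skip the spell check entirely
        }
        return suggester.suggest(query);
    }
}
```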
>>>>
I guess you're saying that, if the user enters a term that appears in
the index and thus is sort of spelled correctly (as it exists in some
doc), then we use the heuristic that any sufficiently large doc
collection will have tons of misspellings, so we assume that rare terms
in the query might be misspelled (i.e. not what the user intended) and
we suggest alternatives to these words too (in addition to the words in
the query that are not in the index at all).
>
> Doug
>
Re: frequent terms - Re: combining open office spellchecker with Lucene
Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:
> David Spencer wrote:
>
>> Doug Cutting wrote:
>>
>>> And one should not try correction at all for terms which occur in a
>>> large proportion of the collection.
>>
>>
>>
>> I keep thinking over this one and I don't understand it. If a user
>> misspells a word and the "did you mean" spelling correction algorithm
>> determines that a frequent term is a good suggestion, why not suggest
>> it? The very fact that it's common could mean that it's more likely
>> that the user wanted this word (well, the heuristic here is that users
>> frequently search for frequent terms, which is probably wrong, but
>> anyway..).
>
>
> I think you misunderstood me. What I meant to say was that if the term
> the user enters is very common then spell correction may be skipped.
> Very common words which are similar to the term the user entered should
> of course be shown. But if the user's term is very common one need not
> even attempt to find similarly-spelled words. Is that any better?
Yes, sure, thx, I understand now - but maybe not - the context I was
thinking of was something like this:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a term
in the index, but the next 2 are. So it ignores the last 2 terms
("recursive" and "descent") and suggests alternatives to
"recursize"...thus if any term is in the index, regardless of frequency,
it is left as-is.
I guess you're saying that, if the user enters a term that appears in
the index and thus is sort of spelled correctly (as it exists in some
doc), then we use the heuristic that any sufficiently large doc
collection will have tons of misspellings, so we assume that rare terms
in the query might be misspelled (i.e. not what the user intended) and
we suggest alternatives to these words too (in addition to the words in
the query that are not in the index at all).
>
> Doug
>
Re: frequent terms - Re: combining open office spellchecker with Lucene
Posted by Doug Cutting <cu...@apache.org>.
David Spencer wrote:
> Doug Cutting wrote:
>
>> And one should not try correction at all for terms which occur in a
>> large proportion of the collection.
>
>
> I keep thinking over this one and I don't understand it. If a user
> misspells a word and the "did you mean" spelling correction algorithm
> determines that a frequent term is a good suggestion, why not suggest
> it? The very fact that it's common could mean that it's more likely that
> the user wanted this word (well, the heuristic here is that users
> frequently search for frequent terms, which is probably wrong, but
> anyway..).
I think you misunderstood me. What I meant to say was that if the term
the user enters is very common then spell correction may be skipped.
Very common words which are similar to the term the user entered should
of course be shown. But if the user's term is very common one need not
even attempt to find similarly-spelled words. Is that any better?
Doug
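As a sketch, the skip rule Doug describes reduces to a one-line guard over the term's document frequency (names here are illustrative):

```java
// Sketch of the optimization above: if the user's term already occurs in a
// large fraction of the documents, skip spell correction for it entirely.
public class CommonTermGuard {
    public static boolean shouldAttemptCorrection(int docFreq, int numDocs,
                                                  double maxFraction) {
        // e.g. "a", appearing in >50% of documents, is assumed correct
        return (double) docFreq / numDocs <= maxFraction;
    }
}
```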
frequent terms - Re: combining open office spellchecker with Lucene
Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:
> Aad Nales wrote:
>
>> Before I start reinventing wheels I would like to do a short check to
>> see if anybody else has already tried this. A customer has requested us
>> to look into the possibility to perform a spell check on queries. So far
>> the most promising way of doing this seems to be to create an Analyzer
>> based on the spellchecker of OpenOffice. My question is: "has anybody
>> tried this before?"
>
>
> Note that a spell checker used with a search engine should use
> collection frequency information. That's to say, only "corrections"
> which are more frequent in the collection than what the user entered
> should be displayed. Frequency information can also be used when
> constructing the checker. For example, one need never consider
> proposing terms that occur in very few documents.
> And one should not
> try correction at all for terms which occur in a large proportion of the
> collection.
I keep thinking over this one and I don't understand it. If a user
misspells a word and the "did you mean" spelling correction algorithm
determines that a frequent term is a good suggestion, why not suggest
it? The very fact that it's common could mean that it's more likely that
the user wanted this word (well, the heuristic here is that users
frequently search for frequent terms, which is probably wrong, but
anyway..).
I know in other contexts of IR frequent terms are penalized but in this
context it seems that frequent terms should be fine...
-- Dave
>
> Doug
>
Re: combining open office spellchecker with Lucene
Posted by David Spencer <da...@tropo.com>.
eks dev wrote:
> Hi Doug,
>
>
>>Perhaps. Are folks really better at spelling the
>>beginning of words?
>
>
> Yes they are. There were some comprehensive empirical
> studies on this topic. Winkler modification on Jaro
> string distance is based on this assumption (boosting
> similarity if first n, I think 4, chars match).
> Jaro-Winkler is well documented and some folks think
> that it is much more efficient and precise than plain
> Edit distance (of course for normal language, not
> numbers or so).
> I will try to dig out some references from my disk on
Good ole Citeseer finds 2 docs that seem relevant:
http://citeseer.ist.psu.edu/cs?cs=1&q=Winkler+Jaro&submit=Documents&co=Citations&cm=50&cf=Any&ao=Citations&am=20&af=Any
I have some of the ngram spelling suggestion stuff, based on earlier
msgs in this thread, working in my dev tree. I'll try to get a test site
up later today for people to fool around with.
> this topic, if you are interested.
>
> On another note,
> I would even suggest using Jaro-Winkler distance as
> default for fuzzy query. (one could configure max
> prefix required => prefix query to reduce number of
> distance calculations). This could speed up fuzzy
> search dramatically.
>
> Hope this was helpful,
> Eks
Re: combining open office spellchecker with Lucene
Posted by eks dev <ek...@yahoo.co.uk>.
Hi Doug,
> Perhaps. Are folks really better at spelling the
> beginning of words?
Yes they are. There were some comprehensive empirical
studies on this topic. Winkler modification on Jaro
string distance is based on this assumption (boosting
similarity if first n, I think 4, chars match).
Jaro-Winkler is well documented and some folks think
that it is much more efficient and precise than plain
Edit distance (of course for normal language, not
numbers or so).
I will try to dig out some references from my disk on
this topic, if you are interested.
On another note,
I would even suggest using Jaro-Winkler distance as
default for fuzzy query. (one could configure max
prefix required => prefix query to reduce number of
distance calculations). This could speed up fuzzy
search dramatically.
Hope this was helpful,
Eks
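For reference, a minimal sketch of the Jaro-Winkler measure as usually formulated (the 4-char prefix cap and 0.1 scaling factor are the standard constants; this is not tuned against the studies mentioned):

```java
// Sketch of Jaro-Winkler similarity: the Jaro score counts matching
// characters within a sliding window and penalizes transpositions; the
// Winkler modification boosts pairs sharing a prefix of up to 4 chars.
public class JaroWinkler {
    public static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;
        int window = Math.max(len1, len2) / 2 - 1;
        boolean[] matched1 = new boolean[len1], matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window), hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // count transpositions among the matched characters, in order
        int transpositions = 0, k = 0;
        for (int i = 0; i < len1; i++) {
            if (matched1[i]) {
                while (!matched2[k]) k++;
                if (s1.charAt(i) != s2.charAt(k)) transpositions++;
                k++;
            }
        }
        double m = matches;
        return (m / len1 + m / len2 + (m - transpositions / 2.0) / m) / 3.0;
    }

    public static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int prefix = 0;
        int max = Math.min(4, Math.min(s1.length(), s2.length()));
        for (int i = 0; i < max; i++) {
            if (s1.charAt(i) == s2.charAt(i)) prefix++;
            else break;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }
}
```

The prefix boost is what encodes the claim above that people misspell word beginnings less often than the rest of the word.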
Re: combining open office spellchecker with Lucene
Posted by Doug Cutting <cu...@apache.org>.
David Spencer wrote:
> Good heuristics but are there any more precise, standard guidelines as
> to how to balance or combine what I think are the following possible
> criteria in suggesting a better choice:
Not that I know of.
> - ignore(penalize?) terms that are rare
I think this one is easy to threshold: ignore matching terms that are
rarer than the term entered.
> - ignore(penalize?) terms that are common
This, in effect, falls out of the previous criterion. A term that is
very common will not have any matching terms that are more common. As
an optimization, you could avoid even looking for matching terms when a
term is very common.
> - terms that are closer (string distance) to the term entered are better
This is the meaty one.
> - terms that start w/ the same 'n' chars as the user's term are better
Perhaps. Are folks really better at spelling the beginning of words?
Doug
Re: combining open office spellchecker with Lucene
Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:
> Aad Nales wrote:
>
>> Before I start reinventing wheels I would like to do a short check to
>> see if anybody else has already tried this. A customer has requested us
>> to look into the possibility to perform a spell check on queries. So far
>> the most promising way of doing this seems to be to create an Analyzer
>> based on the spellchecker of OpenOffice. My question is: "has anybody
>> tried this before?"
>
>
> Note that a spell checker used with a search engine should use
> collection frequency information. That's to say, only "corrections"
> which are more frequent in the collection than what the user entered
> should be displayed. Frequency information can also be used when
> constructing the checker. For example, one need never consider
> proposing terms that occur in very few documents. And one should not
> try correction at all for terms which occur in a large proportion of the
> collection.
Good heuristics but are there any more precise, standard guidelines as
to how to balance or combine what I think are the following possible
criteria in suggesting a better choice:
- ignore(penalize?) terms that are rare
- ignore(penalize?) terms that are common
- terms that are closer (string distance) to the term entered are better
- terms that start w/ the same 'n' chars as the user's term are better
>
> Doug
>
Re: combining open office spellchecker with Lucene
Posted by Doug Cutting <cu...@apache.org>.
Aad Nales wrote:
> Before I start reinventing wheels I would like to do a short check to
> see if anybody else has already tried this. A customer has requested us
> to look into the possibility to perform a spell check on queries. So far
> the most promising way of doing this seems to be to create an Analyzer
> based on the spellchecker of OpenOffice. My question is: "has anybody
> tried this before?"
Note that a spell checker used with a search engine should use
collection frequency information. That's to say, only "corrections"
which are more frequent in the collection than what the user entered
should be displayed. Frequency information can also be used when
constructing the checker. For example, one need never consider
proposing terms that occur in very few documents. And one should not
try correction at all for terms which occur in a large proportion of the
collection.
Doug
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Posted by Morus Walter <mo...@tanto.de>.
David Spencer writes:
> >
> > could you put the current version of your code on that website as a java
>
> Weblog entry updated:
>
> http://searchmorph.com/weblog/index.php?id=23
>
thanks
>
> Great suggestion and thanks for that idiom - I should know such things
> by now. To clarify the "issue", it's just a performance one, not other
> functionality...anyway I put in the code - and to be scientific I
> benchmarked it two times before the change and two times after - and the
> results were surprisingly the same both times (1:45 to 1:50 with an index
> that takes up > 200MB). Probably there are cases where this will run
> faster, and the code seems more "correct" now so it's in.
>
Ahh, I see, you check the field later.
The logging made me think you were indexing all fields you loop over, in
which case one might get unwanted words into the ngram index.
>
>
> >
> >
> > An interesting application of this might be an ngram-Index enhanced version
> > of the FuzzyQuery. While this introduces more complexity on the indexing
> > side, it might be a large speedup for fuzzy searches.
>
> I was also thinking of reviewing the list to see if anyone had done a "Jaro
> Winkler" fuzzy query yet and doing that....
>
I went in another direction, and changed the ngram index and search
to use a similarity that computes
m * m / (n1 * n2)
where m is the number of matching ngrams, n1 is the number of ngrams in
the query, and n2 is the number of ngrams in the word.
(At least if I got that right; I'm not sure if I understand all parts
of the similarity class correctly)
After removing the document boost in the ngram index based on the
word frequency in the original index I find the results pretty good.
My data is a number of encyclopedias and dictionaries and I only use the
headwords for the ngram index. Term frequency doesn't seem relevant
in this case.
I still use the Levenshtein distance to modify the score and sort according
to score / distance but in most cases this does not make a difference.
So I'll probably drop the distance calculation completely.
I also see little difference between using 2- and 3-grams on the one hand
and only using 2-grams on the other. So I'll presumably drop the 3-grams.
I'm not sure, if the similarity I use, is useful in general, but I
attached it to this message in case someone is interested.
Note that you need to set the similarity for the index writer and searcher
and thus have to reindex in case you want to give it a try.
Morus
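Morus's measure can be sketched as a set-based ngram overlap; note this simplification counts a duplicate ngram within a word only once, which his Similarity-based version may treat differently:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the similarity described above: m * m / (n1 * n2), where m is
// the number of ngrams the two words share, and n1, n2 are the ngram
// counts of the query word and the indexed word.
public class NGramSim {
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    public static double similarity(String a, String b, int n) {
        Set<String> ga = ngrams(a, n), gb = ngrams(b, n);
        if (ga.isEmpty() || gb.isEmpty()) return 0.0;
        int m = 0;
        for (String g : ga) {
            if (gb.contains(g)) m++;
        }
        return (double) (m * m) / (ga.size() * gb.size());
    }
}
```

With bigrams, "parser" and "purser" share rs, se, and er out of five ngrams each, giving 9/25 = 0.36.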
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Posted by David Spencer <da...@tropo.com>.
Morus Walter wrote:
> Hi David,
>
>>Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2
>>phases. First you build a "fast lookup index" as mentioned above. Then
>>to correct a word you do a query in this index based on the ngrams in
>>the misspelled word.
>>
>>Let's see.
>>
>>[1] Source is attached and I'd like to contribute it to the sandbox, esp
>>if someone can validate that what it's doing is reasonable and useful.
>>
>
> great :-)
>
>>[4] Here's source in HTML:
>>
>>http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
>>
>
> could you put the current version of your code on that website as a java
> source also? At least until it's in the lucene sandbox.
Weblog entry updated:
http://searchmorph.com/weblog/index.php?id=23
To link to source code:
http://www.searchmorph.com/pub/ngramspeller/NGramSpeller.java
>
> I created an ngram index on one of my indexes and think I found an issue
> in the indexing code:
>
> There is an option -f to specify the field on which the ngram index will
> be created.
> However there is no code to restrict the term enumeration on this field.
>
> So instead of
> final TermEnum te = r.terms();
> i'd suggest
> final TermEnum te = r.terms(new Term(field, ""));
> and a check within the loop over the terms if the enumerated term
> still has fieldname field, e.g.
> Term t = te.term();
> if ( !t.field().equals(field) ) {
> break;
> }
>
> otherwise you loop over all terms in all fields.
Great suggestion, and thanks for that idiom - I should know such things
by now. To clarify, the "issue" is purely a performance one, not a
functional one...anyway I put in the code - and to be scientific I
benchmarked it twice before the change and twice after - and the
results were surprisingly the same both times (1:45 to 1:50 with an index
that takes up > 200MB). Probably there are cases where this will run
faster, and the code seems more "correct" now, so it's in.
>
>
> An interesting application of this might be an ngram-Index enhanced version
> of the FuzzyQuery. While this introduces more complexity on the indexing
> side, it might be a large speedup for fuzzy searches.
I was also thinking of reviewing the list to see if anyone had done a
"Jaro Winkler" fuzzy query yet, and doing that....
>
Thanks,
Dave
> Morus
>
Re: NGramSpeller contribution -- Re: combining open office spellchecker
with Lucene
Posted by Morus Walter <mo...@tanto.de>.
Hi David,
>
> Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2
> phases. First you build a "fast lookup index" as mentioned above. Then
> to correct a word you do a query in this index based on the ngrams in
> the misspelled word.
>
> Let's see.
>
> [1] Source is attached and I'd like to contribute it to the sandbox, esp
> if someone can validate that what it's doing is reasonable and useful.
>
great :-)
>
> [4] Here's source in HTML:
>
> http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
>
could you put the current version of your code on that website as a java
source also? At least until it's in the lucene sandbox.
I created an ngram index on one of my indexes and think I found an issue
in the indexing code:
There is an option -f to specify the field on which the ngram index will
be created.
However there is no code to restrict the term enumeration on this field.
So instead of
final TermEnum te = r.terms();
i'd suggest
final TermEnum te = r.terms(new Term(field, ""));
and a check within the loop over the terms if the enumerated term
still has fieldname field, e.g.
Term t = te.term();
if ( !t.field().equals(field) ) {
break;
}
otherwise you loop over all terms in all fields.
An interesting application of this might be an ngram-Index enhanced version
of the FuzzyQuery. While this introduces more complexity on the indexing
side, it might be a large speedup for fuzzy searches.
Morus
Hits.doc(x) and range queries
Posted by ro...@xemaps.com.
Hi guys!
I've posted previously that Hits.doc(x) was taking a long time. Turns out it
has to do with a date range in our query. We usually do date ranges like this:
Date:[(lucene date field) - (lucene date field)]
Sometimes the begin date is "000000000", which is what we get from
DateField.dateToString( new Date( 0 ) ).
This is when getting our search results from the Hits object takes an absurd
amount of time. It's usually each time the Hits object attempts to get more
results from an IndexSearcher (i.e., every 100?).
It also takes up more memory...
I was wondering why it affects the search so much even though we're only
returning 350 or so results. Does the QueryParser do something similar to the
DateFilter on range queries? Would it be better to use a DateFilter?
We're using Lucene 1.2 (with plans to upgrade). Do newer versions of Lucene
have this problem?
Roy.
Re: NGramSpeller contribution -- Re: combining open office spellchecker
with Lucene
Posted by Doug Cutting <cu...@apache.org>.
Andrzej Bialecki wrote:
> I was wondering about the way you build the n-gram queries. You
> basically don't care about their position in the input term. Originally
> I thought about using PhraseQuery with a slop - however, after checking
> the source of PhraseQuery I realized that this probably wouldn't be that
> fast... You use BooleanQuery and start/end boosts instead, which may
> give similar results in the end but much cheaper.
Sloppy PhraseQuery's are slower than BooleanQueries, but not horribly
slower. The problem is that they don't handle the case where phrase
elements are missing altogether, while a BooleanQuery does. So what you
really need is maybe a variation of a sloppy PhraseQuery that scores
matches that do not contain all of the terms...
Doug
Re: NGramSpeller contribution -- Re: combining open office spellchecker
with Lucene
Posted by David Spencer <da...@tropo.com>.
Andrzej Bialecki wrote:
> David Spencer wrote:
>
>>> ...or prepare in advance a fast lookup index - split all existing
>>> terms to bi- or trigrams, create a separate lookup index, and then
>>> simply for each term ask a phrase query (phrase = all n-grams from
>>> an input term), with a slop > 0, to get similar existing terms.
>>> This should be fast, and you could provide a "did you mean"
>>> function too...
>>>
>>
>> Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2
>> phases. First you build a "fast lookup index" as mentioned above.
>> Then to correct a word you do a query in this index based on the
>> ngrams in the misspelled word.
>>
>
> The background for this suggestion was that I was playing some time ago
> with a Luke plugin that builds various sorts of ancillary indexes, but
> then I never finished it... Kudos for actually making it work ;-)
Sure, it was a fun little edge project. For the most part the code was
done last week right after this thread appeared, but it always takes a
while to get it from 95 to 100%.
>
>> [1] Source is attached and I'd like to contribute it to the sandbox,
>> esp if someone can validate that what it's doing is reasonable and
>> useful.
>
>
> There have been many requests for this or similar functionality in the
> past, I believe it should go into sandbox.
>
> I was wondering about the way you build the n-gram queries. You
> basically don't care about their position in the input term. Originally
> I thought about using PhraseQuery with a slop - however, after checking
> the source of PhraseQuery I realized that this probably wouldn't be that
> fast... You use BooleanQuery and start/end boosts instead, which may
> give similar results in the end but much cheaper.
>
> I also wonder how this algorithm would behave for smaller values of
Sure, I'll try to rebuild the demo w/ lengths 2-5 (and then the query
page can test any contiguous combo).
> start/end lengths (e.g. 2,3,4). In a sense, the smaller the n-gram
> length, the more "fuzziness" you introduce, which may or may not be
> desirable (increased recall at the cost of precision - for small indexes
> this may be useful from the user's perspective because you will always
> get a plausible hit, for huge indexes it's a loss).
>
>>
>> [2] Here's a demo page. I built an ngram index for ngrams of length 3
>> and 4 based on the existing index I have of approx 100k
>> javadoc-generated pages. You type in a misspelled word like
>> "recursixe" or whatnot to see what suggestions it returns. Note this
>> is not a normal search index query -- rather this is a test page for
>> spelling corrections.
>>
>> http://www.searchmorph.com/kat/spell.jsp
>
>
> Very nice demo!
Thanks, kinda designed for ngram-nerds if you know what I mean :)
> I bet it's running way faster than the linear search
Indeed, this is almost zero time, whereas the simple and dumb linear
search was taking me 10sec. I will have to redo the site's main search
page so it uses this new code, TBD, prob tomorrow.
> over terms :-), even though you have to build the index in advance. But
> if you work with static or mostly static indexes this doesn't matter.
>
>> Based on a subsequent mail in this thread I set boosts for the words
>> in the ngram index. The background is each word (er..term for a given
>> field) in the orig index is a separate Document in the ngram index.
>> This Doc contains all ngrams (in my test case, like #2 above, of
>> length 3 and 4) of the word. I also set a boost of
>> log(word_freq)/log(num_docs) so that more frequent words will tend
>> to be suggested more often.
>
>
> You may want to experiment with 2 <= n <= 5. Some n-gram based
Yep, will do prob tomorrow.
> techniques use all lengths together, some others use just single length,
> results also vary depending on the language...
>
>>
>> I think in "plain" English then the way a word is suggested as a
>> spelling correction is:
>> - frequently occurring words score higher
>> - words that share more ngrams with the orig word score higher
>> - words that share rare ngrams with the orig word score higher
>
>
> I think this is a reasonable heuristic. Reading the code I would
> present it this way:
ok, thx, will update
>
> - words that share more ngrams with the orig word score higher, and
> words that share rare ngrams with the orig word score higher
> (as a natural consequence of using BooleanQuery),
>
> - and, frequently occurring words score higher (as a consequence of using
> per-Document boosts),
>
> - from reading the source code I see that you use Levenshtein distance
> to prune the resultset of too long/too short results,
>
> I think also that because you don't use the positional information about
> the input n-grams you may be getting some really weird hits.
Good point, though I haven't seen this yet. Might be due to the prefix
boost and maybe some Markov chain magic tending to only show reasonable
words.
> You could
> prune them by simply checking if you find a (threshold) of input ngrams
> in the right sequence in the found terms. This shouldn't be too costly
Good point, I'll try to add that in as an optional parameter.
> because you operate on a small result set.
>
>>
>> [6]
>>
>> If people want to vote me in as a committer to the sandbox then I can
>
>
> Well, someone needs to maintain the code after all... ;-)
Re: NGramSpeller contribution -- Re: combining open office spellchecker
with Lucene
Posted by Andrzej Bialecki <ab...@getopt.org>.
David Spencer wrote:
>> ...or prepare in advance a fast lookup index - split all existing
>> terms to bi- or trigrams, create a separate lookup index, and then
>> simply for each term ask a phrase query (phrase = all n-grams from
>> an input term), with a slop > 0, to get similar existing terms.
>> This should be fast, and you could provide a "did you mean"
>> function too...
>>
>
> Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2
> phases. First you build a "fast lookup index" as mentioned above.
> Then to correct a word you do a query in this index based on the
> ngrams in the misspelled word.
>
The background for this suggestion was that I was playing some time ago
with a Luke plugin that builds various sorts of ancillary indexes, but
then I never finished it... Kudos for actually making it work ;-)
> [1] Source is attached and I'd like to contribute it to the sandbox,
> esp if someone can validate that what it's doing is reasonable and
> useful.
There have been many requests for this or similar functionality in the
past, I believe it should go into sandbox.
I was wondering about the way you build the n-gram queries. You
basically don't care about their position in the input term. Originally
I thought about using PhraseQuery with a slop - however, after checking
the source of PhraseQuery I realized that this probably wouldn't be that
fast... You use BooleanQuery and start/end boosts instead, which may
give similar results in the end but much cheaper.
I also wonder how this algorithm would behave for smaller values of
start/end lengths (e.g. 2,3,4). In a sense, the smaller the n-gram
length, the more "fuzziness" you introduce, which may or may not be
desirable (increased recall at the cost of precision - for small indexes
this may be useful from the user's perspective because you will always
get a plausible hit, for huge indexes it's a loss).
>
> [2] Here's a demo page. I built an ngram index for ngrams of length 3
> and 4 based on the existing index I have of approx 100k
> javadoc-generated pages. You type in a misspelled word like
> "recursixe" or whatnot to see what suggestions it returns. Note this
> is not a normal search index query -- rather this is a test page for
> spelling corrections.
>
> http://www.searchmorph.com/kat/spell.jsp
Very nice demo! I bet it's running way faster than the linear search
over terms :-), even though you have to build the index in advance. But
if you work with static or mostly static indexes this doesn't matter.
> Based on a subsequent mail in this thread I set boosts for the words
> in the ngram index. The background is each word (er..term for a given
> field) in the orig index is a separate Document in the ngram index.
> This Doc contains all ngrams (in my test case, like #2 above, of
> length 3 and 4) of the word. I also set a boost of
> log(word_freq)/log(num_docs) so that more frequent words will tend
> to be suggested more often.
You may want to experiment with 2 <= n <= 5. Some n-gram based
techniques use all lengths together, some others use just single length,
results also vary depending on the language...
>
> I think in "plain" English then the way a word is suggested as a
> spelling correction is:
> - frequently occurring words score higher
> - words that share more ngrams with the orig word score higher
> - words that share rare ngrams with the orig word score higher
I think this is a reasonable heuristic. Reading the code I would
present it this way:
- words that share more ngrams with the orig word score higher, and
words that share rare ngrams with the orig word score higher
(as a natural consequence of using BooleanQuery),
- and, frequently occurring words score higher (as a consequence of using
per-Document boosts),
- from reading the source code I see that you use Levenshtein distance
to prune the resultset of too long/too short results,
I think also that because you don't use the positional information about
the input n-grams you may be getting some really weird hits. You could
prune them by simply checking if you find a (threshold) of input ngrams
in the right sequence in the found terms. This shouldn't be too costly
because you operate on a small result set.
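That in-order pruning check could be as simple as the following sketch (`inOrderOverlap` is a hypothetical helper of my own naming, not part of NGramSpeller): compute the fraction of the input word's n-grams that appear in the candidate in the same relative order, and drop candidates below a threshold.

```java
import java.util.ArrayList;
import java.util.List;

public class NgramOrderCheck {

    // Plain letter n-grams, e.g. 3-grams of "conts" -> [con, ont, nts].
    public static List<String> ngrams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Fraction of the input word's n-grams found in the candidate in the
    // same relative order (greedy in-order matching). A suggestion whose
    // overlap falls below some threshold would be pruned.
    public static double inOrderOverlap(String input, String candidate, int n) {
        List<String> in = ngrams(input, n);
        List<String> cand = ngrams(candidate, n);
        if (in.isEmpty()) return 0.0;
        int matched = 0, pos = 0;
        for (String g : in) {
            int idx = cand.subList(pos, cand.size()).indexOf(g);
            if (idx >= 0) {
                matched++;
                pos += idx + 1;
            }
        }
        return matched / (double) in.size();
    }

    public static void main(String[] args) {
        System.out.println(inOrderOverlap("conts", "const", 3)); // only "con" matches
    }
}
```

Since this runs only over the small result set of the ngram query, the extra cost is negligible, as noted above.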
>
> [6]
>
> If people want to vote me in as a committer to the sandbox then I can
Well, someone needs to maintain the code after all... ;-)
--
Best regards,
Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
Re: [VOTE] David Spencer as Lucene Sandbox committer
Posted by Daniel Naber <da...@t-online.de>.
On Wednesday 15 September 2004 01:44, Otis Gospodnetic wrote:
> And +1 from me. Dave has submitted several nice pieces of code over
> the years.
+1
Re: [VOTE] David Spencer as Lucene Sandbox committer
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
If Doug and Otis give a +1 then that is enough for me! :)
+1
On Sep 14, 2004, at 7:44 PM, Otis Gospodnetic wrote:
> And +1 from me. Dave has submitted several nice pieces of code over
> the years.
>
> Otis
>
> --- Doug Cutting <cu...@apache.org> wrote:
>
>> David Spencer wrote:
>>> If people want to vote me in as a committer to the sandbox then I
>> can
>>> check this code in.
>>
>> +1
>>
>> Doug
Re: [VOTE] David Spencer as Lucene Sandbox committer
Posted by Jack Park <ja...@thinkalong.com>.
+1 if my vote counts for anything.
Jack
Otis Gospodnetic wrote:
> And +1 from me. Dave has submitted several nice pieces of code over
> the years.
>
> Otis
>
> --- Doug Cutting <cu...@apache.org> wrote:
>
>
>>David Spencer wrote:
>>
>>>If people want to vote me in as a committer to the sandbox then I
>>
>>can
>>
>>>check this code in.
>>
>>+1
>>
>>Doug
Re: [VOTE] David Spencer as Lucene Sandbox committer
Posted by Christoph Goller <go...@detego-software.de>.
Otis Gospodnetic wrote:
> And +1 from me. Dave has submitted several nice pieces of code over
> the years.
+1
[VOTE] David Spencer as Lucene Sandbox committer
Posted by Otis Gospodnetic <ot...@yahoo.com>.
And +1 from me. Dave has submitted several nice pieces of code over
the years.
Otis
--- Doug Cutting <cu...@apache.org> wrote:
> David Spencer wrote:
> > If people want to vote me in as a committer to the sandbox then I
> can
> > check this code in.
>
> +1
>
> Doug
Re: NGramSpeller contribution -- Re: combining open office spellchecker
with Lucene
Posted by Doug Cutting <cu...@apache.org>.
David Spencer wrote:
> If people want to vote me in as a committer to the sandbox then I can
> check this code in.
+1
Doug
Re: NGramSpeller contribution -- Re: combining open office spellchecker
with Lucene
Posted by David Spencer <da...@tropo.com>.
Andrzej Bialecki wrote:
> Aad Nales wrote:
>
>> David,
>>
>> Perhaps I misunderstand somehting so please correct me if I do. I used
>> http://www.searchmorph.com/kat/spell.jsp to look for conts without
>> changing any of the default values. What I got as results did not
>> include 'const' which has quite a high frequency in your index and
>
>
> ??? how do you know that? Remember, this is an index of _Java_docs, and
> "const" is not a Java keyword.
I added a line of output to the right column under the 'details' box.
"const" appears 216 times in the index (out of 96k docs), thus it is
indeed kinda rare.
http://www.searchmorph.com/kat/spell.jsp?s=const
>
>> should have a pretty low levenshtein distance. Any idea what causes this
>> behavior?
Re: NGramSpeller contribution -- Re: combining open office spellchecker
with Lucene
Posted by David Spencer <da...@tropo.com>.
Aad Nales wrote:
> By trying: if you type const you will find that it returns 216 hits. The
> third sports 'const' as a term (space separated and all). I would expect
> 'conts' to return const as well. But again I might be mistaken. I
> am now trying to figure what the problem might be:
>
> 1. my expectations (most likely ;-)
> 2. something in the code..
I enhanced the code to store simple transpositions also and I
regenerated my site w/ ngrams from 2 to 5 chars. If you set the
transposition boost up to 10 then "const" is returned 2nd...
http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1
>
> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@getopt.org]
> Sent: Wednesday, 15 September, 2004 12:23
> To: Lucene Users List
> Subject: Re: NGramSpeller contribution -- Re: combining open office
> spellchecker with Lucene
>
>
> Aad Nales wrote:
>
>
>>David,
>>
>>Perhaps I misunderstand something so please correct me if I do. I used
>
>
>>http://www.searchmorph.com/kat/spell.jsp to look for conts without
>>changing any of the default values. What I got as results did not
>>include 'const' which has quite a high frequency in your index and
>
>
> ??? how do you know that? Remember, this is an index of _Java_docs, and
> "const" is not a Java keyword.
>
>
>>should have a pretty low levenshtein distance. Any idea what causes
>>this behavior?
Re: NGramSpeller contribution -- Re: combining open office spellchecker
with Lucene
Posted by David Spencer <da...@tropo.com>.
Andrzej Bialecki wrote:
> David Spencer wrote:
>
>> To restate the question for a second.
>>
>> The misspelled word is: "conts".
>> The suggestion expected is "const", which seems reasonable enough as
>> it's just a transposition away, thus the string distance is low.
>>
>> But - I guess the problem w/ the algorithm is that for short words
>> like this, with transpositions, the two words won't share many ngrams.
>>
>> Just looking at 3grams...
>>
>> conts -> con ont nts
>> const -> con ons nst
>>
>> Thus they just share 1 3gram, thus this is why it scores so low. This
>> is an interesting issue, how to tune the algorithm so that it might
>> return words this close higher.
>
>
> If you added 2-grams, then it would look like this (constructing also
> special start/end grams):
Oh cute trick to indicate prefixes and suffixes.
Anyway, as per prev post I reformed index w/ ngrams from length 2 to 5,
and also store transpositions, and w/ appropriate boosts :) then "const"
is returned 2nd.
http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1
>
> conts -> _c co on nt ts s_
> const -> _c co on ns st t_
>
> which gives 50% of overlap.
>
> In another system that I designed we were using a combination of 2-4
> grams, albeit for a slightly different purpose, so in this case it would
> be:
>
> conts:
> _c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_
>
> const:
> _c co on ns st t_, _co con ons nst st_, _con cons onst nst_
>
> and the overlap is 40%.
>
>
>>
>> I guess one way is to add all simple transpositions to the lookup
>> table (the "ngram index") so that these could easily be found, with
>> the heuristic that "a frequent way of misspelling words is to
>> transpose two adjacent letters".
>
>
> Yes, sounds like a good idea. Even though it increases the size of the
> lookup index, it still beats using the linear search...
Re: NGramSpeller contribution -- Re: combining open office spellchecker
with Lucene
Posted by Andrzej Bialecki <ab...@getopt.org>.
David Spencer wrote:
> To restate the question for a second.
>
> The misspelled word is: "conts".
> The suggestion expected is "const", which seems reasonable enough as
> it's just a transposition away, thus the string distance is low.
>
> But - I guess the problem w/ the algorithm is that for short words like
> this, with transpositions, the two words won't share many ngrams.
>
> Just looking at 3grams...
>
> conts -> con ont nts
> const -> con ons nst
>
> Thus they just share 1 3gram, thus this is why it scores so low. This is
> an interesting issue, how to tune the algorithm so that it might return
> words this close higher.
If you added 2-grams, then it would look like this (constructing also
special start/end grams):
conts -> _c co on nt ts s_
const -> _c co on ns st t_
which gives 50% of overlap.
In another system that I designed we were using a combination of 2-4
grams, albeit for a slightly different purpose, so in this case it would be:
conts:
_c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_
const:
_c co on ns st t_, _co con ons nst st_, _con cons onst nst_
and the overlap is 40%.
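The boundary-marked grams above are mechanical to generate; this small sketch (helper names are mine) reproduces the 2-gram lists and the 50% overlap figure for conts/const:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class BoundaryNgrams {

    // N-grams over the word padded with "_" start/end markers,
    // e.g. 2-grams of "conts" -> _c co on nt ts s_
    public static List<String> grams(String word, int n) {
        String padded = "_" + word + "_";
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            out.add(padded.substring(i, i + n));
        }
        return out;
    }

    // Overlap = shared grams / gram count of the larger set.
    public static double overlap(String a, String b, int n) {
        Set<String> ga = new LinkedHashSet<>(grams(a, n));
        Set<String> gb = new LinkedHashSet<>(grams(b, n));
        Set<String> shared = new LinkedHashSet<>(ga);
        shared.retainAll(gb);
        return shared.size() / (double) Math.max(ga.size(), gb.size());
    }

    public static void main(String[] args) {
        System.out.println(grams("conts", 2));            // [_c, co, on, nt, ts, s_]
        System.out.println(overlap("conts", "const", 2)); // 0.5
    }
}
```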
>
> I guess one way is to add all simple transpositions to the lookup table
> (the "ngram index") so that these could easily be found, with the
> heuristic that "a frequent way of misspelling words is to transpose two
> adjacent letters".
Yes, sounds like a good idea. Even though it increases the size of the
lookup index, it still beats using the linear search...
--
Best regards,
Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
Re: NGramSpeller contribution -- Re: combining open office spellchecker
with Lucene
Posted by David Spencer <da...@tropo.com>.
Aad Nales wrote:
> By trying: if you type const you will find that it returns 216 hits. The
> third sports 'const' as a term (space separated and all). I would expect
> 'conts' to return const as well. But again I might be mistaken. I
> am now trying to figure what the problem might be:
>
> 1. my expectations (most likely ;-)
> 2. something in the code..
Good question.
If I use the form at the bottom of the page and ask for more results,
the suggestion of "const" does eventually show up - 99th however(!).
http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=2.0&bend=1.0
Even boosting the prefix match from 2.0 to 10.0 only changes the ranking
a few slots.
http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=10.0&bend=1.0
To restate the question for a second.
The misspelled word is: "conts".
The suggestion expected is "const", which seems reasonable enough as
it's just a transposition away, thus the string distance is low.
But - I guess the problem w/ the algorithm is that for short words like
this, with transpositions, the two words won't share many ngrams.
Just looking at 3grams...
conts -> con ont nts
const -> con ons nst
Thus they just share 1 3gram, thus this is why it scores so low. This is
an interesting issue, how to tune the algorithm so that it might return
words this close higher.
I guess one way is to add all simple transpositions to the lookup table
(the "ngram index") so that these could easily be found, with the
heuristic that "a frequent way of misspelling words is to transpose two
adjacent letters".
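One way to realize that idea is to precompute, at index time, every adjacent-letter swap of each word and store those alongside the word's ngrams (a sketch under that assumption; how NGramSpeller actually lays out its fields may differ):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class Transpositions {

    // All words reachable from the input by swapping one pair of adjacent
    // letters. Indexing these alongside a word's ngrams would let "conts"
    // find "const" directly.
    public static Set<String> adjacentSwaps(String word) {
        Set<String> out = new LinkedHashSet<>();
        char[] c = word.toCharArray();
        for (int i = 0; i + 1 < c.length; i++) {
            if (c[i] == c[i + 1]) continue; // swapping equal letters is a no-op
            char t = c[i]; c[i] = c[i + 1]; c[i + 1] = t; // swap
            out.add(new String(c));
            t = c[i]; c[i] = c[i + 1]; c[i + 1] = t;      // swap back
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(adjacentSwaps("const")); // includes "conts"
    }
}
```

A word of length n adds at most n-1 extra entries, so the lookup index grows only modestly.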
Based on other mails I'll make some additions to the code and will
report back if anything of interest changes here.
>
> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@getopt.org]
> Sent: Wednesday, 15 September, 2004 12:23
> To: Lucene Users List
> Subject: Re: NGramSpeller contribution -- Re: combining open office
> spellchecker with Lucene
>
>
> Aad Nales wrote:
>
>
>>David,
>>
>>Perhaps I misunderstand something so please correct me if I do. I used
>
>
>>http://www.searchmorph.com/kat/spell.jsp to look for conts without
>>changing any of the default values. What I got as results did not
>>include 'const' which has quite a high frequency in your index and
>
>
> ??? how do you know that? Remember, this is an index of _Java_docs, and
> "const" is not a Java keyword.
>
>
>>should have a pretty low levenshtein distance. Any idea what causes
>>this behavior?
RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Posted by Aad Nales <aa...@rotterdam-cs.com>.
By trying: if you type const you will find that it returns 216 hits. The
third sports 'const' as a term (space separated and all). I would expect
'conts' to return const as well. But again I might be mistaken. I
am now trying to figure what the problem might be:
1. my expectations (most likely ;-)
2. something in the code..
-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org]
Sent: Wednesday, 15 September, 2004 12:23
To: Lucene Users List
Subject: Re: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene
Aad Nales wrote:
> David,
>
> Perhaps I misunderstand something so please correct me if I do. I used
> http://www.searchmorph.com/kat/spell.jsp to look for conts without
> changing any of the default values. What I got as results did not
> include 'const' which has quite a high frequency in your index and
??? how do you know that? Remember, this is an index of _Java_docs, and
"const" is not a Java keyword.
> should have a pretty low levenshtein distance. Any idea what causes
> this behavior?
--
Best regards,
Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
Re: NGramSpeller contribution -- Re: combining open office spellchecker
with Lucene
Posted by Andrzej Bialecki <ab...@getopt.org>.
Aad Nales wrote:
> David,
>
> Perhaps I misunderstand something so please correct me if I do. I used
> http://www.searchmorph.com/kat/spell.jsp to look for conts without
> changing any of the default values. What I got as results did not
> include 'const' which has quite a high frequency in your index and
??? how do you know that? Remember, this is an index of _Java_docs, and
"const" is not a Java keyword.
> should have a pretty low levenshtein distance. Any idea what causes this
> behavior?
--
Best regards,
Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Posted by Aad Nales <aa...@rotterdam-cs.com>.
David,
Perhaps I misunderstand something so please correct me if I do. I used
http://www.searchmorph.com/kat/spell.jsp to look for conts without
changing any of the default values. What I got as results did not
include 'const' which has quite a high frequency in your index and
should have a pretty low levenshtein distance. Any idea what causes this
behavior?
cheers,
Aad
-----Original Message-----
From: David Spencer [mailto:dave-lucene-user@tropo.com]
Sent: Tuesday, 14 September, 2004 21:23
To: Lucene Users List
Subject: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene
Andrzej Bialecki wrote:
> David Spencer wrote:
>
>>
>> I can/should send the code out. The logic is that for any terms in a
>> query that have zero matches, go thru all the terms(!) and calculate
>> the Levenshtein string distance, and return the best matches. A more
>> intelligent way of doing this is to instead look for terms that also
>> match on the 1st "n" (prob 3) chars.
>
>
> ...or prepare in advance a fast lookup index - split all existing terms
> to bi- or trigrams, create a separate lookup index, and then simply for
> each term ask a phrase query (phrase = all n-grams from an input term),
> with a slop > 0, to get similar existing terms. This should be fast, and
> you could provide a "did you mean" function too...
>
Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2
phases. First you build a "fast lookup index" as mentioned above. Then
to correct a word you do a query in this index based on the ngrams in
the misspelled word.
Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, esp
if someone can validate that what it's doing is reasonable and useful.
[2] Here's a demo page. I built an ngram index for ngrams of length 3
and 4 based on the existing index I have of approx 100k
javadoc-generated pages. You type in a misspelled word like "recursixe"
or whatnot to see what suggestions it returns. Note this is not a normal
search index query -- rather this is a test page for spelling
corrections.
http://www.searchmorph.com/kat/spell.jsp
[3] Here's the javadoc:
http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html
[4] Here's source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
[5] A few more details:
Based on a subsequent mail in this thread I set boosts for the words in
the ngram index. The background is each word (er..term for a given
field) in the orig index is a separate Document in the ngram index. This
Doc contains all ngrams (in my test case, like #2 above, of length 3 and
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so
that more frequent words will tend to be suggested more often.
I think in "plain" English then the way a word is suggested as a
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher
[6]
If people want to vote me in as a committer to the sandbox then I can
check this code in - though again, I'd appreciate feedback.
thx,
Dave
NGramSpeller contribution -- Re: combining open office spellchecker
with Lucene
Posted by David Spencer <da...@tropo.com>.
Andrzej Bialecki wrote:
> David Spencer wrote:
>
>>
>> I can/should send the code out. The logic is that for any terms in a
>> query that have zero matches, go thru all the terms(!) and calculate
>> the Levenshtein string distance, and return the best matches. A more
>> intelligent way of doing this is to instead look for terms that also
>> match on the 1st "n" (prob 3) chars.
>
>
> ...or prepare in advance a fast lookup index - split all existing terms
> to bi- or trigrams, create a separate lookup index, and then simply for
> each term ask a phrase query (phrase = all n-grams from an input term),
> with a slop > 0, to get similar existing terms. This should be fast, and
> you could provide a "did you mean" function too...
>
Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2
phases. First you build a "fast lookup index" as mentioned above. Then
to correct a word you do a query in this index based on the ngrams in
the misspelled word.
Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, esp
if someone can validate that what it's doing is reasonable and useful.
[2] Here's a demo page. I built an ngram index for ngrams of length 3
and 4 based on the existing index I have of approx 100k
javadoc-generated pages. You type in a misspelled word like "recursixe"
or whatnot to see what suggestions it returns. Note this is not a normal
search index query -- rather this is a test page for spelling corrections.
http://www.searchmorph.com/kat/spell.jsp
[3] Here's the javadoc:
http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html
[4] Here's source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
[5] A few more details:
Based on a subsequent mail in this thread I set boosts for the words in
the ngram index. The background is each word (er..term for a given
field) in the orig index is a separate Document in the ngram index. This
Doc contains all ngrams (in my test case, like #2 above, of length 3 and
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so
that more frequent words will tend to be suggested more often.
I think in "plain" English then the way a word is suggested as a
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher
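The two ingredients above - splitting a word into ngrams and the log(word_freq)/log(num_docs) boost - can be sketched in a few lines. This is a minimal, self-contained illustration (no Lucene calls; the class and method names are mine, not from NGramSpeller):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the actual NGramSpeller source) of ngram
// extraction and the frequency-based boost described above.
public class NGramSketch {

    // All substrings of length n, e.g. ngrams("recursive", 3)
    // -> [rec, ecu, cur, urs, rsi, siv, ive]
    public static List<String> ngrams(String word, int n) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Per-word boost of log(word_freq)/log(num_docs), so more
    // frequent words tend to be suggested more often.
    public static double boost(int wordFreq, int numDocs) {
        return Math.log(wordFreq) / Math.log(numDocs);
    }

    public static void main(String[] args) {
        System.out.println(ngrams("recursive", 3));
        System.out.println(boost(1000, 100000));
    }
}
```

In the real speller each word becomes a Document in the ngram index, with these ngrams as fields and this value as its boost.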
[6]
If people want to vote me in as a committer to the sandbox then I can
check this code in - though again, I'd appreciate feedback.
thx,
Dave
Re: combining open office spellchecker with Lucene
Posted by David Spencer <da...@tropo.com>.
Andrzej Bialecki wrote:
> David Spencer wrote:
>
>>
>> I can/should send the code out. The logic is that for any terms in a
>> query that have zero matches, go thru all the terms(!) and calculate
>> the Levenshtein string distance, and return the best matches. A more
>> intelligent way of doing this is to instead look for terms that also
>> match on the 1st "n" (prob 3) chars.
>
>
> ...or prepare in advance a fast lookup index - split all existing terms
> to bi- or trigrams, create a separate lookup index, and then simply for
> each term ask a phrase query (phrase = all n-grams from an input term),
> with a slop > 0, to get similar existing terms. This should be fast, and
> you could provide a "did you mean" function too...
Sounds interesting/fun but I'm not sure if I'm following exactly.
Let's talk thru the trigram index case.
Are you saying that for every trigram in every word there will be a
mapping of trigram -> term?
Thus if "recursive" is in the (orig) index then we'd create entries like:
rec -> recursive
ecu -> ...
cur -> ...
urs -> ...
rsi -> ...
siv -> ...
ive -> ...
And so on for all terms in the orig index.
OK fine.
But now the user types in a query like "recursivz".
What's the algorithm - obviously I guess take all trigrams in the bad
term and go thru the trigram-index, but there will be lots of
suggestions. Now what - use string distance to score them? I guess that
makes sense - plz confirm if I understand.... And so I guess the point
here is we precalculate the trigram->term mappings to avoid an expensive
traversal of all terms in an index, but we still use string distance as
a 2nd pass (and prob should force the matches to always match on the 1st
n (3) chars, using the heuristic that people usually start spelling a
word correctly).
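That two-pass scheme - trigram lookup for candidate generation, string distance for ranking - can be sketched without Lucene at all. A rough, self-contained illustration (class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the two-pass correction discussed above.
public class TrigramSpeller {
    private final Map<String, Set<String>> trigramIndex = new HashMap<>();

    // Index time: map each trigram of a term back to the term,
    // e.g. "rec" -> {recursive, ...}.
    public void addTerm(String term) {
        for (int i = 0; i + 3 <= term.length(); i++) {
            trigramIndex.computeIfAbsent(term.substring(i, i + 3),
                    k -> new HashSet<>()).add(term);
        }
    }

    // Query time. Pass 1: gather terms sharing any trigram with the
    // misspelled word. Pass 2: rank candidates by edit distance.
    public List<String> suggest(String bad, int max) {
        Set<String> candidates = new HashSet<>();
        for (int i = 0; i + 3 <= bad.length(); i++) {
            Set<String> hits = trigramIndex.get(bad.substring(i, i + 3));
            if (hits != null) candidates.addAll(hits);
        }
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingInt(t -> distance(bad, t)));
        return ranked.subList(0, Math.min(max, ranked.size()));
    }

    // Standard dynamic-programming Levenshtein distance.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + sub);
            }
        }
        return d[a.length()][b.length()];
    }
}
```

The prefix heuristic mentioned above would just filter candidates to those sharing the first 3 chars before ranking.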
>
Re: combining open office spellchecker with Lucene
Posted by Andrzej Bialecki <ab...@getopt.org>.
David Spencer wrote:
>
> I can/should send the code out. The logic is that for any terms in a
> query that have zero matches, go thru all the terms(!) and calculate the
> Levenshtein string distance, and return the best matches. A more
> intelligent way of doing this is to instead look for terms that also
> match on the 1st "n" (prob 3) chars.
...or prepare in advance a fast lookup index - split all existing terms
to bi- or trigrams, create a separate lookup index, and then simply for
each term ask a phrase query (phrase = all n-grams from an input term),
with a slop > 0, to get similar existing terms. This should be fast, and
you could provide a "did you mean" function too...
--
Best regards,
Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
Re: combining open office spellchecker with Lucene
Posted by David Spencer <da...@tropo.com>.
Aad Nales wrote:
> Hi All,
>
> Before I start reinventing wheels I would like to do a short check to
> see if anybody else has already tried this. A customer has requested us
> to look into the possibility to perform a spell check on queries. So far
> the most promising way of doing this seems to be to create an Analyzer
> based on the spellchecker of OpenOffice. My question is: "has anybody
> tried this before?"
I did a WordNet/synonym query expander. Search for "WordNet" on this
page. Of interest: it stores the WordNet info in a separate Lucene
index, since at its essence an index is just a database.
http://jakarta.apache.org/lucene/docs/lucene-sandbox/
Also, another variation, is to instead spell based on what terms are in
the index, not what an external dictionary says. I've done this on my
experimental site searchmorph.com in a dumb/inefficient way. Here's an
example:
http://www.searchmorph.com/kat/search.jsp?s=recursivz
After you click above it takes ~10sec as it produces terms close to
"recursivz". Oops - looking at the output, it looks like the same word
is suggested multiple times - ouch - I must be considering all fields, not
just the contents field. Fixing this is TBD (no wonder it's so slow :)).
I can/should send the code out. The logic is that for any terms in a
query that have zero matches, go thru all the terms(!) and calculate the
Levenshtein string distance, and return the best matches. A more
intelligent way of doing this is to instead look for terms that also
match on the 1st "n" (prob 3) chars.
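The brute-force logic - compare the bad word against every term and keep the best - can be sketched as follows (self-contained, hypothetical names; the real code additionally restricts itself to query terms with zero matches):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Sketch of the "dumb/inefficient" approach: score every term in the
// index by Levenshtein distance. O(number of terms) per query, which
// is why it takes ~10sec on a large index.
public class BruteForceSpeller {

    // Rank all terms by edit distance to the misspelled word,
    // returning the k closest.
    public static List<String> closest(Collection<String> allTerms,
                                       String bad, int k) {
        List<String> ranked = new ArrayList<>(allTerms);
        ranked.sort(Comparator.comparingInt(t -> levenshtein(bad, t)));
        return ranked.subList(0, Math.min(k, ranked.size()));
    }

    // Standard dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + sub);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("recursive", "descent", "parser");
        System.out.println(closest(terms, "recursivz", 2));
    }
}
```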
>
> Cheers,
> Aad
>
>
> --
> Aad Nales
> aad.nales@rotterdam-cs.com, +31-(0)6 54 207 340
>
>
>
>