You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Martin Haye <ma...@gmail.com> on 2007/03/20 23:24:43 UTC

Spelt, for better spelling correction

As part of XTF, an open source publishing engine that uses Lucene, I
developed a new spelling correction engine specifically to provide "Did you
mean..." links for misspelled queries. I and a small group are preparing
this for submission as a contrib module to Lucene. And we're inviting
interested people to join the discussion about it.

The new engine is being called "Spelt" and differs from the one currently in
Lucene contrib in the following ways:

- More accurate: Much better performance on single-word queries (90% correct
in #1 slot in my tests). On general list including multi-word queries, gets
80%+ correct.
- Multi-word: Handles and corrects multi-word queries such as "harrypotter"
-> "harry potter".
- Fast: In my tests, builds the dictionary more than 30 times faster.
- Small: Dictionary size is roughly a third of that built by the existing
engine.
- Other bells and whistles...

There is already a standalone test program that people can try out, and
we're interested in feedback. If you're interested in discussing, testing,
or previewing, consider joining the Google group:
http://groups.google.com/group/spelt/

--Martin

Re: Spelt, for better spelling correction

Posted by Martin Haye <m1...@snyder-haye.com>.
The dictionary is generated from the corpus, with the result that a larger
corpus gives better results.

Words are queued up during an index run, and at the end are munged to create
an optimized dictionary. It also supports incremental building, though the
overhead would be too much for those applications that are continuously
adding things to an index. Happily, it's not as important to keep the
spelling dictionary absolutely up to date, so it would be fine to queue
words over several index runs, and refresh the dictionary less often.

--Martin

On 3/20/07, Yonik Seeley <yo...@apache.org> wrote:
>
> Sounds interesting Martin!
> Is the dictionary static, or is it generated from the corpus or from
> user queries?
>
> -Yonik
>
> On 3/20/07, Martin Haye <ma...@gmail.com> wrote:
> > As part of XTF, an open source publishing engine that uses Lucene, I
> > developed a new spelling correction engine specifically to provide "Did
> you
> > mean..." links for misspelled queries. I and a small group are preparing
> > this for submission as a contrib module to Lucene. And we're inviting
> > interested people to join the discussion about it.
> >
> > The new engine is being called "Spelt" and differs from the one
> currently in
> > Lucene contrib in the following ways:
> >
> > - More accurate: Much better performance on single-word queries (90%
> correct
> > in #1 slot in my tests). On general list including multi-word queries,
> gets
> > 80%+ correct.
> > - Multi-word: Handles and corrects multi-word queries such as
> "harrypotter"
> > -> "harry potter".
> > - Fast: In my tests, builds the dictionary more than 30 times faster.
> > - Small: Dictionary size is roughly a third of that built by the
> existing
> > engine.
> > - Other bells and whistles...
> >
> > There is already a standalone test program that people can try out, and
> > we're interested in feedback. If you're interested in discussing,
> testing,
> > or previewing, consider joining the Google group:
> > http://groups.google.com/group/spelt/
> >
> > --Martin
> >
>

Re: Spelt, for better spelling correction

Posted by Yonik Seeley <yo...@apache.org>.
Sounds interesting Martin!
Is the dictionary static, or is it generated from the corpus or from
user queries?

-Yonik

On 3/20/07, Martin Haye <ma...@gmail.com> wrote:
> As part of XTF, an open source publishing engine that uses Lucene, I
> developed a new spelling correction engine specifically to provide "Did you
> mean..." links for misspelled queries. I and a small group are preparing
> this for submission as a contrib module to Lucene. And we're inviting
> interested people to join the discussion about it.
>
> The new engine is being called "Spelt" and differs from the one currently in
> Lucene contrib in the following ways:
>
> - More accurate: Much better performance on single-word queries (90% correct
> in #1 slot in my tests). On general list including multi-word queries, gets
> 80%+ correct.
> - Multi-word: Handles and corrects multi-word queries such as "harrypotter"
> -> "harry potter".
> - Fast: In my tests, builds the dictionary more than 30 times faster.
> - Small: Dictionary size is roughly a third of that built by the existing
> engine.
> - Other bells and whistles...
>
> There is already a standalone test program that people can try out, and
> we're interested in feedback. If you're interested in discussing, testing,
> or previewing, consider joining the Google group:
> http://groups.google.com/group/spelt/
>
> --Martin
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org