Posted to java-user@lucene.apache.org by Alex Murzaku <mu...@yahoo.com> on 2002/09/21 17:15:38 UTC

analyzers using snowball

Since questions come up from time to time about whether Lucene supports
specific natural languages, I adapted a set of analyzers and filters to
use the Java stemmers generated by Snowball
(http://snowball.tartarus.org). This could be a good starting point for
anybody who needs to go into more detail for a particular language (as
the existing Russian and German analyzers do). It uses the
StandardTokenizer, which works fine for all of these languages except
Russian.
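
To make the shape of such an analyzer concrete, here is a minimal,
self-contained sketch of the tokenize, lowercase, stem pipeline. It is
an illustration only and deliberately does not use the Lucene or
Snowball classes: the class name StemmingPipelineSketch, the analyze
method and the toyEnglishStem function are all hypothetical, the
regex split is far cruder than StandardTokenizer, and the toy suffix
stripper is not the Porter2 algorithm. In the real package a
Snowball-generated stemmer would take the place of toyEnglishStem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class StemmingPipelineSketch {

    /**
     * Hypothetical stand-in for a Snowball-generated stemmer.
     * It only strips a few suffixes and is NOT a real stemming algorithm.
     */
    static String toyEnglishStem(String word) {
        String[] suffixes = {"ing", "es", "s"};
        for (String suffix : suffixes) {
            // Keep at least three characters of stem so short words survive.
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    /** Tokenize, lowercase, then stem each token with the supplied stemmer. */
    static List<String> analyze(String text, UnaryOperator<String> stemmer) {
        List<String> terms = new ArrayList<>();
        // Split on runs of characters that are neither letters nor digits.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // a leading separator yields an empty first token
            }
            terms.add(stemmer.apply(token.toLowerCase(Locale.ROOT)));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Swap in a different stemmer function to support another language.
        System.out.println(
            analyze("Indexing the analyzers", StemmingPipelineSketch::toyEnglishStem));
    }
}
```

Because the stemmer is passed in as a function, supporting another
language only means supplying a different stemmer. The tokenizing and
lowercasing steps stay the same.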

The whole package is located at http://download.lissus.com/snowball.zip
and it is about 2.3MB. It is this large because it also contains all
the test dictionaries for the 12 supported languages: Danish, Dutch,
English (Porter2), Finnish, French, German, Italian, Norwegian,
Portuguese, Russian, Spanish and Swedish. Finnish has some minor
problems, and I wasn't able to test Russian properly since I am not
familiar with its character encodings. But I wouldn't bother with
Russian (or German), since analyzers for both are already included in
the Lucene package. As for Finnish, I am already in contact with the
Snowball team, and hopefully it will work in Java as well as it does in
the other environments.

Best regards,

Alex


=====
__________________________________
alex@lissus.com -- http://www.lissus.com
