Posted to java-user@lucene.apache.org by Ulrich Mayring <ul...@denic.de> on 2003/06/06 17:20:10 UTC

Where to get stopword lists?

Hello,

does anyone know of good stopword lists for use with Lucene? I'm 
interested in English and German lists.

The default lists aren't very complete, for example the English list 
doesn't contain words like "every", "because" or "until" and the German 
list misses "dem" and "des" (definite articles).

Kind regards,

Ulrich



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Where to get stopword lists?

Posted by pr...@gmx.de.
There are also useful stopword lists at

http://www.unine.ch/Info/clef/

best regards
René



Re: Where to get stopword lists?

Posted by Bryan LaPlante <bl...@netwebapps.com>.
I found some handy tools in the org.apache.lucene.analysis.de package.
Using the WordlistLoader class, you can load your stop words in a variety
of ways, including from a line-delimited text file, thanks to Gerhard Schwarz.
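
As a rough sketch of the line-delimited case: the code below is plain Java,
not WordlistLoader itself, whose method signatures differ between Lucene
versions. The class name StopWordFileSketch and the trimming and
lower-casing steps are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: reads one stop word per line.
class StopWordFileSketch {

    // Blank lines are skipped; each word is trimmed and lower-cased
    // (an assumption: the analyzer is expected to lower-case tokens too).
    static Set<String> load(Reader in) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a stop word file on disk.
        Set<String> stop = load(new StringReader("every\nbecause\n\n  Until \n"));
        System.out.println(stop);
    }
}
```

For a real file, pass a FileReader (or an InputStreamReader with an explicit
charset, which matters for German umlauts) instead of the StringReader.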

Bryan LaPlante

----- Original Message -----
From: "Ulrich Mayring" <ul...@denic.de>
To: <lu...@jakarta.apache.org>
Sent: Friday, June 06, 2003 11:36 AM
Subject: Re: Where to get stopword lists?


> Doug Cutting wrote:
> >
> > Snowball stemmers are pre-packaged for use with Lucene at:
> >
> >   http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/
>
> These look interesting. Am I right in assuming that in order to use
> these stemmers, I have to write an Analyzer and in its tokenStream
> method I return a SnowballFilter?
>
> I'm a bit new to Lucene, as you might gather :)
>
> Kind regards,
>
> Ulrich


Re: Where to get stopword lists?

Posted by Anthony Eden <me...@anthonyeden.com>.
There is already an analyzer available in the sandbox.  Take a look 
here: http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/

Sincerely,
Anthony Eden

Ulrich Mayring wrote:
> Doug Cutting wrote:
> 
>>
>> Snowball stemmers are pre-packaged for use with Lucene at:
>>
>>   http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/
> 
> 
> These look interesting. Am I right in assuming that in order to use 
> these stemmers, I have to write an Analyzer and in its tokenStream 
> method I return a SnowballFilter?
> 
> I'm a bit new to Lucene, as you might gather :)
> 
> Kind regards,
> 
> Ulrich


Re: Where to get stopword lists?

Posted by Ulrich Mayring <ul...@denic.de>.
Doug Cutting wrote:
> 
> Snowball stemmers are pre-packaged for use with Lucene at:
> 
>   http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/

These look interesting. Am I right in assuming that in order to use 
these stemmers, I have to write an Analyzer and in its tokenStream 
method I return a SnowballFilter?

I'm a bit new to Lucene, as you might gather :)

Kind regards,

Ulrich





Re: Where to get stopword lists?

Posted by Doug Cutting <cu...@lucene.com>.
Ulrich Mayring wrote:
> does anyone know of good stopword lists for use with Lucene? I'm 
> interested in English and German lists.

The Snowball project has good stop lists.

See:

   http://snowball.tartarus.org/
   http://snowball.tartarus.org/english/stop.txt
   http://snowball.tartarus.org/german/stop.txt

Snowball stemmers are pre-packaged for use with Lucene at:

   http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/

This project should be updated to include the Snowball stop lists too. 
I have not had the time to do this.  This would be a great contribution 
if someone who is qualified has the time.
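
The Snowball stop.txt files are not plain word lists: a vertical bar (|)
starts a comment that runs to the end of the line, and a line may hold
more than one word. A minimal sketch of a parser for that format, in plain
Java with no Lucene dependency, might look like this. The class name
SnowballStopListSketch is hypothetical, and the sample input is made up
rather than copied from the real files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Snowball.
class SnowballStopListSketch {

    // Strips '|' comments, then collects every whitespace-separated word.
    static Set<String> parse(Reader in) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int bar = line.indexOf('|');
                if (bar >= 0) {
                    line = line.substring(0, bar); // drop the trailing comment
                }
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up sample in the Snowball layout, not the real german/stop.txt.
        String sample = " | German stop words\nder   | the\ndem des\n\n";
        System.out.println(parse(new StringReader(sample)));
    }
}
```

The resulting set can then be handed to whatever stop filter or analyzer
constructor your Lucene version provides.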

Doug




Re: Where to get stopword lists?

Posted by Leo Galambos <Le...@seznam.cz>.
Ulrich Mayring wrote:

> Hello,
>
> does anyone know of good stopword lists for use with Lucene? I'm 
> interested in English and German lists.

What does ``good'' mean? It depends on your corpus, IMHO. The best way 
to get a ``good'' stop list is an analysis based on idf (inverse 
document frequency): index your documents, list all the terms with low 
idf, save them in a file, and use them in the next indexing round.

Just a thought...
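
A minimal sketch of that idea in plain Java, working on already-tokenized
documents rather than on a Lucene index. The class name IdfStopListSketch,
the textbook formula idf = ln(N / df), and the cut-off value are all
assumptions; Lucene's own scoring uses a slightly different idf formula.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene.
class IdfStopListSketch {

    // Returns the terms whose idf = ln(N / df) is at most maxIdf,
    // i.e. the terms that occur in (nearly) every document.
    static Set<String> lowIdfTerms(List<List<String>> docs, double maxIdf) {
        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) { // count once per document
                df.merge(term, 1, Integer::sum);
            }
        }
        int n = docs.size();
        Set<String> candidates = new TreeSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            double idf = Math.log((double) n / e.getValue());
            if (idf <= maxIdf) {
                candidates.add(e.getKey());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "cat"),
                Arrays.asList("the", "dog"),
                Arrays.asList("the", "cat", "sat"));
        // 0.1 is an arbitrary cut-off chosen for this toy corpus.
        System.out.println(lowIdfTerms(docs, 0.1));
    }
}
```

The cut-off has to be tuned per corpus, and the resulting candidates are
worth reviewing by hand, since a frequent term can still be meaningful.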

-g-





Re: Where to get stopword lists?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
There is a much more complete list of English stop words included in
the introductory Lucene article on ONJava.com.
I can't help you with German stop words.

Otis

--- Ulrich Mayring <ul...@denic.de> wrote:
> Hello,
> 
> does anyone know of good stopword lists for use with Lucene? I'm 
> interested in English and German lists.
> 
> The default lists aren't very complete, for example the English list 
> doesn't contain words like "every", "because" or "until" and the German 
> list misses "dem" and "des" (definite articles).
> 
> Kind regards,
> 
> Ulrich