You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Behrang Saeedzadeh <be...@gmail.com> on 2009/03/22 08:25:17 UTC

PorterStemFilter bug?

(( A complete Mavenized project illustration this problem is available at:
https://dl.getdropbox.com/u/34024/lucene-porter-stemmer-bug.zip ))

I am using Lucene to tokenzize, filter, and stem an input text using the
following chain:

String text =
IOUtils.toString(getClass().getClassLoader().getSystemResourceAsStream("J01-1001.txt"));
StringReader reader = new StringReader(text);
StandardTokenizer tokenizer = new StandardTokenizer(reader);
LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
StopFilter stopFilter = new StopFilter(lcFilter,
CustomStopWords.STOP_WORDS);
PorterStemFilter stemmer = new PorterStemFilter(stopFilter);

Set<String> stemmedStopWords = buildStemmedStopWords();

Token t = new Token();
while (stemmer.next(t) != null) {
    if (stemmedStopWords.contains(t.term())) {
        throw new RuntimeException("\"" + t.term() + "\" must have been
removed from the output token stream but is not");
    }
}

However I see some of the stemmed stop words still appearing in the output
tokens. For example, "describe" is one of my custom stop words. The stemmed
version of describe is "describ". However "describ" appears in the output
token stream instead of being filtered out.

I thought "describe" will be filtered out by the stopFilter even before
reaching the stemmer. However, somehow it is leaking into the stemmer and it
is turned into "describ" by it and goes into the output token stream.

The funny thing is, when I remove the stemmer from the chain, then
"describe" does not go into the output token stream:

String text =
IOUtils.toString(getClass().getClassLoader().getSystemResourceAsStream("J01-1001.txt"));
StringReader reader = new StringReader(text);
StandardTokenizer tokenizer = new StandardTokenizer(reader);
LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
StopFilter stopFilter = new StopFilter(lcFilter,
CustomStopWords.STOP_WORDS);

Token t = new Token();
while (stopFilter.next(t) != null) {
    assertFalse(CustomStopWords.STOP_WORDS.contains(t.term()));
}

Notice that in the first case I look for "describ" in the output token
stream but in the second case I look for "describe".

Thanks in advance,
Behrang Saeedzadeh
http://my.opera.com/behrangsa

Re: PorterStemFilter bug?

Posted by Behrang Saeedzadeh <be...@gmail.com>.
Wow! Too many grammar errors and typos! Sorry about that :D
Behrang Saeedzadeh
http://my.opera.com/behrangsa


On Sun, Mar 22, 2009 at 6:25 PM, Behrang Saeedzadeh <be...@gmail.com>wrote:

> (( A complete Mavenized project illustration this problem is available at:
> https://dl.getdropbox.com/u/34024/lucene-porter-stemmer-bug.zip ))
>
> I am using Lucene to tokenzize, filter, and stem an input text using the
> following chain:
>
> String text =
> IOUtils.toString(getClass().getClassLoader().getSystemResourceAsStream("J01-1001.txt"));
>  StringReader reader = new StringReader(text);
> StandardTokenizer tokenizer = new StandardTokenizer(reader);
>  LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
> StopFilter stopFilter = new StopFilter(lcFilter,
> CustomStopWords.STOP_WORDS);
>  PorterStemFilter stemmer = new PorterStemFilter(stopFilter);
>
> Set<String> stemmedStopWords = buildStemmedStopWords();
>
> Token t = new Token();
> while (stemmer.next(t) != null) {
>     if (stemmedStopWords.contains(t.term())) {
>         throw new RuntimeException("\"" + t.term() + "\" must have been
> removed from the output token stream but is not");
>     }
> }
>
> However I see some of the stemmed stop words still appearing in the output
> tokens. For example, "describe" is one of my custom stop words. The stemmed
> version of describe is "describ". However "describ" appears in the output
> token stream instead of being filtered out.
>
> I thought "describe" will be filtered out by the stopFilter even before
> reaching the stemmer. However, somehow it is leaking into the stemmer and it
> is turned into "describ" by it and goes into the output token stream.
>
> The funny thing is, when I remove the stemmer from the chain, then
> "describe" does not go into the output token stream:
>
> String text =
> IOUtils.toString(getClass().getClassLoader().getSystemResourceAsStream("J01-1001.txt"));
>  StringReader reader = new StringReader(text);
> StandardTokenizer tokenizer = new StandardTokenizer(reader);
>  LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
> StopFilter stopFilter = new StopFilter(lcFilter,
> CustomStopWords.STOP_WORDS);
>
> Token t = new Token();
> while (stopFilter.next(t) != null) {
>     assertFalse(CustomStopWords.STOP_WORDS.contains(t.term()));
> }
>
> Notice that in the first case I look for "describ" in the output token
> stream but in the second case I look for "describe".
>
> Thanks in advance,
> Behrang Saeedzadeh
> http://my.opera.com/behrangsa
>