Posted to dev@lucene.apache.org by Dmitry Serebrennikov <dm...@earthlink.net> on 2001/10/18 23:03:15 UTC

Solution for "unstemming" terms

I've found a pretty good solution for retrieving the unstemmed versions of 
index terms, in case anyone is interested. It uses only features 
already in the 1.2-rc1 release.

The trick is to create an additional field on each document (say "dict" 
for dictionary) and set it to contain a list of space-separated strings 
like this:

    cat:cats likeli:likeley

And so on. Each term is composed of the stem, ':' and the unstemmed 
token. I had to create a custom Tokenizer that splits this string 
on spaces alone and does not split the words at the ':' position. But there 
may be a different character that would work fine with one of the 
standard tokenizers.
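
As a rough illustration of how the value of such a field could be assembled, here is a minimal sketch in plain Java. The class name DictFieldBuilder and the token-to-stem map are hypothetical; the stems are assumed to come from whatever stemmer the analyzer already uses, and no Lucene classes are involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: builds the text of the "dict" field.
final class DictFieldBuilder {

    /** Joins each (token, stem) pair into "stem:token", separated by single spaces. */
    static String build(Map<String, String> tokenToStem) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : tokenToStem.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(e.getValue()).append(':').append(e.getKey());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed input: unstemmed token -> stem, as produced by your stemmer.
        Map<String, String> tokenToStem = new LinkedHashMap<>();
        tokenToStem.put("cats", "cat");
        tokenToStem.put("likeley", "likeli");
        System.out.println(build(tokenToStem)); // prints: cat:cats likeli:likeley
    }
}
```

The resulting string would then be indexed in the dictionary field with a tokenizer that splits on spaces only.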

When you need to retrieve all unstemmed forms for a particular stem, you 
simply open up a TermEnum for a term <dict:stem:> like this:
    TermEnum te = reader.terms(new Term("dict", stem + ':'));

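Then you just read the first one, or all of the ones that start with your 
stem followed by ':'. This is very fast: the enumeration starts at the 
first term at or after the one you pass in, and terms are kept in sorted 
order, so the matching entries sit next to each other. You even get the 
unstemmed forms in sorted order for free!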
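
The prefix scan can be tried out without an index. The sketch below is only a stand-in: a java.util.TreeSet plays the role of the sorted term list of the "dict" field, and the class and method names are hypothetical. It seeks to the first entry at or after "stem:" and reads on while entries still start with that prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Stand-in for the sorted term dictionary; no Lucene classes are used here.
final class PrefixScanSketch {

    /** Returns every unstemmed form stored as "stem:form", in sorted order. */
    static List<String> unstemmedForms(TreeSet<String> sortedTerms, String stem) {
        String prefix = stem + ':';
        List<String> forms = new ArrayList<>();
        // tailSet(prefix, true) starts at the first entry >= prefix.
        for (String term : sortedTerms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // left the block of entries for this stem
            }
            forms.add(term.substring(prefix.length()));
        }
        return forms;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "cat:cats", "cat:cat", "catalog:catalogs", "likeli:likeley"));
        System.out.println(unstemmedForms(terms, "cat")); // prints: [cat, cats]
    }
}
```

Because the ':' is part of the prefix, entries for a longer stem such as "catalog" are not picked up when looking for "cat".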

- Dmitry



Re: Solution for "unstemming" terms

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
I don't have an example that is factored out from the rest of my code, 
and it would take time to put one together, which I don't have either :). 
The code's pretty simple though. The only tricky thing is the space 
tokenizer, but you could do that very easily with 
java.util.StringTokenizer if you are willing to load the document text 
into memory as a string.
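
A minimal sketch of that space-only split, assuming the whole field text is already in memory as a String. The class name SpaceSplitSketch is hypothetical, and wiring the result into a Lucene Tokenizer is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper: splits on spaces only, so "stem:token" stays in one piece.
final class SpaceSplitSketch {

    static List<String> splitOnSpaces(String text) {
        List<String> tokens = new ArrayList<>();
        // Only ' ' is a delimiter; ':' is kept inside each token.
        StringTokenizer st = new StringTokenizer(text, " ");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitOnSpaces("cat:cats likeli:likeley"));
        // prints: [cat:cats, likeli:likeley]
    }
}
```

StringTokenizer skips runs of consecutive delimiters, so repeated spaces do not produce empty tokens.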

Here's the function that does the unstemming:

The TermEnum dict can be simply the result of calling reader.terms(). 
Term t is the term you are trying to unstem.
Constant DICTIONARY contains the name of the dictionary field that had 
the "stem:primary ..." entries when the doc was indexed.

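    // Needs java.io.IOException, org.apache.lucene.index.Term and
    // org.apache.lucene.index.TermEnum. DICTIONARY and logWriter are
    // assumed to be fields of the enclosing class.
    String unstem(TermEnum dict, Term t)
    {
        final String stem = t.text();
        final String prefix = stem + ':';

        // Sorts at or before every dictionary entry for this stem.
        final Term lookupTerm = new Term(DICTIONARY, prefix);

        try {
            Term dictTerm = dict.term();
            // Advance only if the enumeration is not yet positioned or is
            // not yet past the lookup term; otherwise the current term is
            // already the candidate.
            if (dictTerm == null || dictTerm.compareTo(lookupTerm) <= 0) {
                if (! dict.skipTo(lookupTerm)) return stem;  // no more terms
                dictTerm = dict.term();
            }
            // A match must be in the dictionary field and start with "stem:".
            final String found = dictTerm.text();
            if (DICTIONARY.equals(dictTerm.field()) && found.startsWith(prefix)) {
                return found.substring(prefix.length());
            }
            return stem;  // no entry for this stem: fall back to the stem

        } catch (IOException e) {
            if (logWriter != null) e.printStackTrace(logWriter);
            return stem;
        }
    }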




Maurits van Wijland wrote:

>Dmitry,
>
>Can you send us the code? This is very useful!
>I would like to experiment with this...
>
>Maurits.


Re: Solution for "unstemming" terms

Posted by Maurits van Wijland <m....@quicknet.nl>.
Dmitry,

Can you send us the code? This is very useful!
I would like to experiment with this...

Maurits.