You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Jong Kim <jk...@sitescape.com> on 2006/10/14 21:57:59 UTC

Looking for a stemmer that can return all inflected forms

Hi,
 
I'm looking for a stemmer that is capable of returning all morphological 
variants  of a query term (to be used for high-recall search). For example, 
given a query term of 'cares', I would like to be able to generate 'cares', 
'care', 'cared', and 'caring'.
 
I looked at the Porter stemmer, Snowball stemmer, and the K-stem. 
All of them provide a method that takes a surface string ('cares') as an 
input and returns its base form/stem, which is 'care' in this example. 
But it appears that I can not use the stemmer to generate all of the 
inflected forms of a given query term. 
 
Does anyone know of such tool for Lucene? 
Any help or pointer would be greatly appreciated. 
 
Thanks
/Jong

Re: Looking for a stemmer that can return all inflected forms

Posted by Bill Taylor <wa...@as-st.com>.
On Oct 14, 2006, at 3:57 PM, Jong Kim wrote:

> Hi,
>
> I'm looking for a stemmer that is capable of returning all 
> morphological
> variants  of a query term (to be used for high-recall search). For 
> example,
> given a query term of 'cares', I would like to be able to generate 
> 'cares',
> 'care', 'cared', and 'caring'.
>
> I looked at the Porter stemmer, Snowball stemmer, and the K-stem.
> All of them provide a method that takes a surface string ('cares') as 
> an
> input and returns its base form/stem, which is 'care' in this example.

First of all, I would GREATLY appreciate it if you would tell me which 
of these is easiest to incorporate into Lucene.  I have the same 
problem you do.  I have solved the other end of it but do not knot how 
to fit a stemmer into Lucene.

> But it appears that I can not use the stemmer to generate all of the
> inflected forms of a given query term.
>
> Does anyone know of such tool for Lucene?

I am writing one which is VERY SPECIAL PURPOSE and therefore my code 
not likely to be of much use to you.  HOWEVER, the basic idea is quite 
simple:

Idea 1:

1) Since you have to use the stemmer against something, you are reading 
words out of the index and extracting their stems.

2) Having done that for a word, find all "nearby" words which have the 
same stem.  The simplest definition of "nearby" that I can think of is 
that the word starts with the stem, but you might want to drop the last 
character of the stem and look for all words that start with that.  
Thus, if the stem is "care" you would look at all words that start with 
"car" and if they have "care" as the stem, they are in the same family.

The advantage of this approach is that you do not ever offer any words 
that are not in your index.  If you found cares and cared but not 
caring in your index, you would not want to suggest that someone search 
for caring because they won't find it.  So you use the index as the 
source of words to stem.

Idea 2:

Another way to do it is to build a hash map of tree sets keyed to the 
stem.  Each stem has a tree set of all words which have it as a stem.  
The code would look something like

HashMap<String, TreeSet> stemmedWords = new HashMap<String, TreeSet>();
TreeSet<String> wordsForStem;

for (String word : all words in the index)  {
	stem = MagicStemmer(word);  // I left out code for words that do not 
have stems
           if ( (wordsForStem = stemmedWords.get(stem)) == null) {
               wordsForStem = new TreeSet<String>();  // Tree set for 
the new stem
               stemmedWords.put(stem, wordsForStem); // Now this stem 
has a set for its words
          }
         wordsForStem.add(word); // Put the word into the tree set for 
its stem
}

For each stem from all the words in your index, you get a tree set 
which contains all the words which have it as a stem;  The tree set 
keeps its words in alphabetical order.

If you want the stems to be displayed in alphabetical order, use a 
TreeMap instead of a HashMap.

> Any help or pointer would be greatly appreciated.

I would appreciate your telling me which stemmer for English words is 
easiest to incorporate into Lucene and where to find it.  Thanks.

Bill Taylor


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Looking for a stemmer that can return all inflected forms

Posted by Steven Rowe <sa...@syr.edu>.
Hi Jong,

Jong Kim wrote:
> I'm looking for a stemmer that is capable of returning all morphological 
> variants  of a query term (to be used for high-recall search). For example, 
> given a query term of 'cares', I would like to be able to generate 'cares', 
> 'care', 'cared', and 'caring'.

To achieve high recall, can't you just stem terms both when indexing and
when querying?

The only reasons I can think of not to do so are to support both
high-precision and high-recall queries with the same index, or to give
greater weight to exact-match documents than to stemmed-match documents
within a single query.  If either of these is the case, you could
maintain two fields (one stemmed and one non-stemmed) or two indexes,
and choose which field/index to use (reason #1) or combine the two in a
single query (reason #2).

Actually, now that I think about it more, two indexes for reason #2
(i.e., stemmed match as fallback if exact match fails) would be tricky,
due to issues with fusion of search results -- better to use two fields
in a single index for this case.

Steve


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Looking for a stemmer that can return all inflected forms

Posted by Mike Klaas <mi...@gmail.com>.
On 10/14/06, Jong Kim <jk...@sitescape.com> wrote:
> Hi,
>
> I'm looking for a stemmer that is capable of returning all morphological
> variants  of a query term (to be used for high-recall search). For example,
> given a query term of 'cares', I would like to be able to generate 'cares',
> 'care', 'cared', and 'caring'.
>
> I looked at the Porter stemmer, Snowball stemmer, and the K-stem.
> All of them provide a method that takes a surface string ('cares') as an
> input and returns its base form/stem, which is 'care' in this example.
> But it appears that I can not use the stemmer to generate all of the
> inflected forms of a given query term.

Stemming is a multi-step, lossy, one-way operation and  it does not
suprised me that none of these packages attempts the reverse
operation.

My suggestion is to create a reverse stemmer yourself by taking the
lexicon of your corpus, stemming all the terms, and inverting the map.
 At query time, a lookup can be performed using a trie or hashtable.

best,
-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org