Posted to java-user@lucene.apache.org by Luca Cavanna <ca...@gmail.com> on 2013/06/07 15:16:24 UTC

Hunspell stemmer generates multiple tokens

Hi,
I just noticed that the HunspellStemmer outputs more than one token: the
original word plus the stems, as far as I understand.

This is not quite what I would expect, and it becomes tricky especially at
query time. For instance, when using elasticsearch to query a stemmed field,
a boolean query is generated that contains multiple clauses (one for each
token generated by the stemmer) instead of a single clause with the stem that
we expect to find in the index (if we indexed using stemming, of course).
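
To make the concern concrete, here is a minimal sketch in plain Java. It does
not use the Lucene or elasticsearch APIs: the STEMS table is a hypothetical
stand-in for a Hunspell dictionary, and toQuery is an invented helper that
only mimics how one clause per token ends up in the query, assuming the
stemmer behaves as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiTokenQuerySketch {
    // Hypothetical stand-in for a Hunspell dictionary lookup: word -> stems.
    static final Map<String, List<String>> STEMS = new HashMap<>();
    static {
        STEMS.put("walking", Arrays.asList("walk"));
    }

    // Mimics a stemmer that emits the original word plus its stems.
    static List<String> multiTokenStem(String word) {
        List<String> tokens = new ArrayList<>();
        tokens.add(word);
        List<String> stems = STEMS.getOrDefault(word, Collections.<String>emptyList());
        for (String stem : stems) {
            if (!tokens.contains(stem)) {
                tokens.add(stem);
            }
        }
        return tokens;
    }

    // Mimics a stemmer that emits exactly one token per input word.
    static List<String> singleTokenStem(String word) {
        List<String> stems = STEMS.get(word);
        String token = (stems == null || stems.isEmpty()) ? word : stems.get(0);
        return Collections.singletonList(token);
    }

    // Builds a query string with one clause per token, joined by OR.
    static String toQuery(String field, List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (sb.length() > 0) {
                sb.append(" OR ");
            }
            sb.append(field).append(':').append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "body:walking OR body:walk"
        System.out.println(toQuery("body", multiTokenStem("walking")));
        // prints "body:walk"
        System.out.println(toQuery("body", singleTokenStem("walking")));
    }
}
```

The first query has two clauses where I would expect only the second form.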

I would like to know if you think this is the correct behaviour and if this
is something you are aware of. If I look at snowball for example, I see
that only one token is generated.


Thanks,
Luca

Re: Hunspell stemmer generates multiple tokens

Posted by oren bochman <or...@gmail.com>.

Multiple tokens seem to be a more flexible contract.

You might want to be able to match just the stem, both the exact token and the stemmed token, or just the exact term. So putting both in the index may be expedient, depending on the language.

Also, there are a number of common situations where document text can be stemmed more accurately than query text. In such cases you might want to boost the stemmed token adaptively.
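
A rough sketch of that idea in plain Java, not using Lucene itself. The toy stem function (it only strips a trailing "ing"), the two in-memory maps, and the boost values are all assumptions made up for illustration. Setting one boost to zero gives stem-only or exact-only matching; setting both lets exact hits rank above stem-only hits.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ExactAndStemSketch {
    // Hypothetical toy stemmer, not Hunspell: strips a trailing "ing".
    static String stem(String word) {
        boolean strip = word.endsWith("ing") && word.length() > 4;
        return strip ? word.substring(0, word.length() - 3) : word;
    }

    // Two toy inverted indexes: exact term -> doc ids, stem -> doc ids.
    final Map<String, Set<Integer>> exact = new HashMap<>();
    final Map<String, Set<Integer>> stemmed = new HashMap<>();

    // Indexes both the exact token and its stem for every word.
    void add(int docId, String text) {
        for (String word : text.toLowerCase().split("\\s+")) {
            exact.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            stemmed.computeIfAbsent(stem(word), k -> new TreeSet<>()).add(docId);
        }
    }

    // Score per doc: stemBoost if the stem matches, plus exactBoost if the
    // exact term matches. A boost of 0 disables that kind of match.
    Map<Integer, Double> search(String word, double exactBoost, double stemBoost) {
        Map<Integer, Double> scores = new HashMap<>();
        if (stemBoost > 0) {
            Set<Integer> hits = stemmed.getOrDefault(stem(word), Collections.<Integer>emptySet());
            for (int docId : hits) {
                scores.merge(docId, stemBoost, Double::sum);
            }
        }
        if (exactBoost > 0) {
            Set<Integer> hits = exact.getOrDefault(word, Collections.<Integer>emptySet());
            for (int docId : hits) {
                scores.merge(docId, exactBoost, Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        ExactAndStemSketch index = new ExactAndStemSketch();
        index.add(1, "walking home");
        index.add(2, "walk home");
        // Doc 1 matches exactly and by stem; doc 2 matches by stem only.
        System.out.println(index.search("walking", 2.0, 1.0));
    }
}
```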
