Posted to java-user@lucene.apache.org by julien Blaize <ju...@gmail.com> on 2017/10/24 15:04:33 UTC

How to use Hunspell dictionary to do the reverse of stemming ?

Hello,

I am looking for a way to efficiently do the reverse of stemming.
Example: if I give the program the verb "drug", it will give me
"drugged", "drugging", "drugs", "drugstore", etc.

I have used the wordforms program from Hunspell to generate all possible
combinations of the input word (even the ridiculous ones that do not
match a real word). Then I use the
org.apache.lucene.analysis.hunspell.Dictionary class to check whether each
word exists and to map it back to the original word.
This is very slow and inefficient.

I was looking at the internals of the Dictionary class and saw the use of
patterns and an FST (finite state transducer). This seems to be a very
efficient way to check for the stem of a word, but I was unable to find a
way to do the reverse operation.

I am wondering whether anyone has tried to do something similar. Can someone
who understands FSTs and the usage of patterns in the Dictionary class give
me hints on whether what I am trying to do is possible and would be
efficient?

Kind Regards.

--
Julien Blaize

Re: How to use Hunspell dictionary to do the reverse of stemming ?

Posted by Robert Muir <rc...@gmail.com>.

On Tue, Oct 24, 2017 at 11:04 AM, julien Blaize <ju...@gmail.com> wrote:
> Hello,
>
> I am looking for a way to efficiently do the reverse of stemming.
> Example: if I give the program the verb "drug", it will give me
> "drugged", "drugging", "drugs", "drugstore", etc.

To generate the list up-front (for all words), maybe look at the morph
generation code here and modify it to your needs:
https://github.com/hunspell/hunspell/blob/master/src/tools/analyze.cxx
Then maybe try adding the result to a Lucene SynonymMap, which will store
it in an FST with deduplication etc. and may be reasonably efficient
(it's just a large synonym dictionary at that point).
If you generate WordNet or Solr synonyms format, there are already
parsers for those, so that may be easiest.
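
As a rough illustration of the last suggestion, here is a minimal sketch in
plain Java (no Lucene dependency) that writes generated word forms out in
Solr synonyms format, one explicit mapping line per stem. The class name
SolrSynonymWriter, the method toSolrSynonyms, and the stem-to-forms map are
hypothetical and not part of Lucene or Hunspell; the map is assumed to have
been produced beforehand by a Hunspell-based generation step. Words that
contain commas or "=>" would need escaping, which this sketch does not do.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: not part of Lucene or Hunspell.
final class SolrSynonymWriter {

    // Emits one explicit mapping per stem, e.g. "drug => drug, drugged, drugs".
    // The stem itself is listed first so that it is kept during expansion,
    // and the LinkedHashSet drops duplicate forms while preserving order.
    static String toSolrSynonyms(Map<String, List<String>> formsByStem) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : formsByStem.entrySet()) {
            Set<String> targets = new LinkedHashSet<>();
            targets.add(entry.getKey());
            targets.addAll(entry.getValue());
            out.append(entry.getKey())
               .append(" => ")
               .append(String.join(", ", targets))
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed input: forms generated up-front for each stem.
        Map<String, List<String>> formsByStem = new LinkedHashMap<>();
        formsByStem.put("drug", Arrays.asList("drugged", "drugging", "drugs"));
        System.out.print(toSolrSynonyms(formsByStem));
    }
}
```

The resulting text could then presumably be fed to one of the existing
parsers mentioned above (for example Lucene's SolrSynonymParser) to build
the SynonymMap, instead of checking each candidate word at query time.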