Posted to dev@lucene.apache.org by Lukáš Vlček <lu...@gmail.com> on 2011/06/20 12:55:56 UTC

KStem custom lexicons configuration possible?

Hi,

Is there any API in the KStem filter for configuring lexicons?

As far as I understand, the original code loads its lexicons from files at
startup (see http://lexicalresearch.com/kstem-doc.txt). The author (Robert
Krovetz) lists the ability to modify the lexicons among the advantages of
KStem over other stemmers.

Do people not need it? Would it be a useful addition for the KStem filter to
allow custom lexicon configuration in its API?

Regards,
Lukas

Note: Big kudos to all who participated in bringing KStem into Lucene!

Re: KStem custom lexicons configuration possible?

Posted by Lukáš Vlček <lu...@gmail.com>.

Hi Robert,

I think the difference between KStem and other stemmers (at least those that
I am aware of, like Snowball or Porter) is that KStem is expected to produce
real, valid words, so further filtering (for example synonym expansion) can
be applied to the tokens after stemming more easily. I am not sure whether
this is the case with the other stemmers available in Lucene.

Also, my impression from reading the original paper by Robert Krovetz was
that the ability to fine-tune the lexicons is practical. That is why I
expected the KStem API to support this as well.

Well, maybe a combination of KStem with an override filter (but applied AFTER
stemming) would work too in this case :-)

Regards,
Lukas

On Mon, Jun 20, 2011 at 2:32 PM, Robert Muir <rc...@gmail.com> wrote:

> On Mon, Jun 20, 2011 at 8:23 AM, Lukáš Vlček <lu...@gmail.com> wrote:
> > Hi Robert,
> > this sounds interesting; I will look at it in more detail.
> > However, I do not think this is really a general solution. If I understand
> > StemmerOverrideFilter correctly (from a quick glance), it relies on the
> > fact that you *know* the exact term (the key in the map) in advance. In
> > other words, if I wanted to "fix" some term produced by the KStem filter,
> > I would have to know the output of the stemming in advance. This means
> > that if I switch to Snowball, Porter, or another stemmer instead of KStem,
> > or simply update something else in the filter chain, I am in trouble.
> > Also, if I understand correctly, the original KStem implementation can
> > still receive updates to its lexicons, which means that once these updates
> > are ported to the Java implementation, they can again cause problems with
> > an existing override filter setup.
> > More generally, is there any reason why lexicons are not configurable in
> > the KStem filter?
>
> Because we have StemmerOverrideFilter and KeywordMarkerFilter.
>
> Look at the source code of KStem: it uses maps and sets of exceptions,
> which is what these filters provide in a general way
> (StemmerOverrideFilter being the map, and KeywordMarkerFilter being
> the set).
>
> We added these to work across the board with all Lucene stemmers for
> this reason.
>
> I don't understand your concerns at all, to be honest; they make no
> sense to me. If we "updated" KStem or any other algorithm, it would
> break whatever you are doing either way. A hashmap is a hashmap.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: KStem custom lexicons configuration possible?

Posted by Robert Muir <rc...@gmail.com>.

On Mon, Jun 20, 2011 at 8:23 AM, Lukáš Vlček <lu...@gmail.com> wrote:
> Hi Robert,
> this sounds interesting; I will look at it in more detail.
> However, I do not think this is really a general solution. If I understand
> StemmerOverrideFilter correctly (from a quick glance), it relies on the
> fact that you *know* the exact term (the key in the map) in advance. In
> other words, if I wanted to "fix" some term produced by the KStem filter,
> I would have to know the output of the stemming in advance. This means
> that if I switch to Snowball, Porter, or another stemmer instead of KStem,
> or simply update something else in the filter chain, I am in trouble.
> Also, if I understand correctly, the original KStem implementation can
> still receive updates to its lexicons, which means that once these updates
> are ported to the Java implementation, they can again cause problems with
> an existing override filter setup.
> More generally, is there any reason why lexicons are not configurable in
> the KStem filter?

Because we have StemmerOverrideFilter and KeywordMarkerFilter.

Look at the source code of KStem: it uses maps and sets of exceptions,
which is what these filters provide in a general way
(StemmerOverrideFilter being the map, and KeywordMarkerFilter being
the set).

We added these to work across the board with all Lucene stemmers for
this reason.

I don't understand your concerns at all, to be honest; they make no
sense to me. If we "updated" KStem or any other algorithm, it would
break whatever you are doing either way. A hashmap is a hashmap.
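
To make the map/set analogy concrete, here is a minimal plain-Java sketch.
It deliberately does not use any Lucene classes: ExceptionAwareStemmer and
stripPluralS are hypothetical names invented for illustration, and the toy
stemmer is not KStem. It only shows the idea of consulting a map of overrides
and a set of protected words before handing a token to an arbitrary stemmer.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Toy model (not Lucene code) of exception handling in front of a stemmer. */
final class ExceptionAwareStemmer {
    private final Map<String, String> overrides;  // surface form -> forced stem (the "map")
    private final Set<String> protectedWords;     // words left unstemmed (the "set")
    private final UnaryOperator<String> stemmer;  // any stemming function

    ExceptionAwareStemmer(Map<String, String> overrides,
                          Set<String> protectedWords,
                          UnaryOperator<String> stemmer) {
        this.overrides = overrides;
        this.protectedWords = protectedWords;
        this.stemmer = stemmer;
    }

    String stem(String token) {
        // 1. An explicit override wins and bypasses the stemmer entirely.
        String forced = overrides.get(token);
        if (forced != null) {
            return forced;
        }
        // 2. Protected words pass through unchanged.
        if (protectedWords.contains(token)) {
            return token;
        }
        // 3. Everything else goes to the underlying stemmer.
        return stemmer.apply(token);
    }

    /** Hypothetical stand-in stemmer: strips one trailing "s" from longer words. */
    static String stripPluralS(String token) {
        return token.length() > 3 && token.endsWith("s")
                ? token.substring(0, token.length() - 1)
                : token;
    }
}
```

Because the stemmer is just a function argument here, the same map and set
can sit in front of any stemming algorithm, which is the generality being
described above.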

Re: KStem custom lexicons configuration possible?

Posted by Lukáš Vlček <lu...@gmail.com>.

Hi Robert,

this sounds interesting; I will look at it in more detail.

However, I do not think this is really a general solution. If I understand
StemmerOverrideFilter correctly (from a quick glance), it relies on the fact
that you *know* the exact term (the key in the map) in advance. In other
words, if I wanted to "fix" some term produced by the KStem filter, I would
have to know the output of the stemming in advance. This means that if I
switch to Snowball, Porter, or another stemmer instead of KStem, or simply
update something else in the filter chain, I am in trouble. Also, if I
understand correctly, the original KStem implementation can still receive
updates to its lexicons, which means that once these updates are ported to
the Java implementation, they can again cause problems with an existing
override filter setup.
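
The concern above depends on where the override lookup sits relative to the
stemmer. The following plain-Java sketch (not Lucene code; OverridePlacement
and both stemmers are hypothetical stand-ins invented for illustration)
contrasts a lookup on the original token with a lookup on the stemmer's
output.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

/** Toy comparison (not Lucene code) of override lookup before vs. after stemming. */
final class OverridePlacement {

    /** Lookup on the original token; on a hit the stemmer is skipped. */
    static String overrideBefore(Map<String, String> overrides,
                                 UnaryOperator<String> stemmer,
                                 String token) {
        String forced = overrides.get(token);
        return forced != null ? forced : stemmer.apply(token);
    }

    /** Lookup on whatever the stemmer produced. */
    static String overrideAfter(Map<String, String> overrides,
                                UnaryOperator<String> stemmer,
                                String token) {
        String stemmed = stemmer.apply(token);
        return overrides.getOrDefault(stemmed, stemmed);
    }

    /** Hypothetical stemmer A: leaves every token unchanged. */
    static String identityStemmer(String token) {
        return token;
    }

    /** Hypothetical stemmer B: strips a trailing "ions". */
    static String aggressiveStemmer(String token) {
        return token.endsWith("ions")
                ? token.substring(0, token.length() - 4)
                : token;
    }
}
```

With the single entry "transactions" -> "transaction", overrideBefore gives
"transaction" under both stemmers, because its keys are surface forms.
overrideAfter gives "transaction" under stemmer A but "transact" under
stemmer B, because its keys must match whatever the current stemmer emits.
Assuming StemmerOverrideFilter is placed before the stemmer in the chain, it
corresponds to the first variant.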

More generally, is there any reason why lexicons are not configurable in
the KStem filter?

Regards,
Lukas

On Mon, Jun 20, 2011 at 1:38 PM, Robert Muir <rc...@gmail.com> wrote:

> On Mon, Jun 20, 2011 at 7:19 AM, Lukáš Vlček <lu...@gmail.com> wrote:
> > With an option to modify the internal lexicons, I would be able to adapt
> > KStem to work better for specific text corpora.
> > What do you think?
>
> Please use StemmerOverrideFilter for this! It works with all stemmers,
> including this one.

Re: KStem custom lexicons configuration possible?

Posted by Robert Muir <rc...@gmail.com>.

On Mon, Jun 20, 2011 at 7:19 AM, Lukáš Vlček <lu...@gmail.com> wrote:
> With an option to modify the internal lexicons, I would be able to adapt
> KStem to work better for specific text corpora.
> What do you think?

Please use StemmerOverrideFilter for this! It works with all stemmers,
including this one.

Re: KStem custom lexicons configuration possible?

Posted by Lukáš Vlček <lu...@gmail.com>.

Maybe I should show where I think custom configuration can be useful. Let me
give you two examples:

1) As of now, KStem conflates the words "connector" and "connected" to the
same term, "connect".
2) Conversely, it does not conflate "transaction" and "transactions" to the
same term.

With an option to modify the internal lexicons, I would be able to adapt
KStem to work better for specific text corpora.
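
One way to spot cases like the two above in a corpus is to group words by
the stem they receive. The plain-Java sketch below is an illustration only:
ConflationReport is a hypothetical helper, and reportedBehaviour simply
hard-codes the behaviour described in this message rather than calling KStem.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

/** Toy helper (not Lucene code) that groups words by the stem they map to. */
final class ConflationReport {

    /** Returns stem -> words sharing that stem, with stems in sorted order. */
    static Map<String, List<String>> groupByStem(List<String> words,
                                                 UnaryOperator<String> stemmer) {
        return words.stream()
                .collect(Collectors.groupingBy(stemmer, TreeMap::new, Collectors.toList()));
    }

    /** Hypothetical stand-in that hard-codes the two reported cases; it is not KStem. */
    static String reportedBehaviour(String word) {
        switch (word) {
            case "connector":
            case "connected":
                return "connect";   // both forms collapse to one term
            default:
                return word;        // "transaction" and "transactions" stay apart
        }
    }
}
```

Any group with more than one word is a conflation. A pair of related words
that ends up in separate groups, such as "transaction" and "transactions"
here, is a candidate for a custom lexicon or override entry.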

What do you think?

Regards,
Lukas

On Mon, Jun 20, 2011 at 12:55 PM, Lukáš Vlček <lu...@gmail.com> wrote:

> Hi,
>
> Is there any API in the KStem filter for configuring lexicons?
>
> As far as I understand, the original code loads its lexicons from files at
> startup (see http://lexicalresearch.com/kstem-doc.txt). The author (Robert
> Krovetz) lists the ability to modify the lexicons among the advantages of
> KStem over other stemmers.
>
> Do people not need it? Would it be a useful addition for the KStem filter to
> allow custom lexicon configuration in its API?
>
> Regards,
> Lukas
>
> Note: Big kudos to all who participated in bringing KStem into Lucene!
>