Posted to java-user@lucene.apache.org by Itamar Syn-Hershko <it...@divrei-tora.com> on 2008/03/01 20:07:16 UTC

RE: Rebuilding Document from index?

This is exactly where Hebrew is different from all Latin languages. I did
think about the approach you mentioned, of having 2 fields - one is stemmed
and the other is not - but even with it the search will be performed on the
non-stemmed field by default. The stemmed field will only be searched upon
explicit request, since one stem in Hebrew can be related to many nouns,
adjectives and verbs (too many of them), and the stemming process itself is
not deterministic enough.
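
To make the default-versus-explicit split concrete, here is a minimal sketch
in plain Java. It deliberately does not use the Lucene API: the class name,
the field names "exact" and "stemmed", and the toy stemmer (which only strips
a trailing "s") are all made up for illustration, and a real Hebrew stemmer
would behave very differently.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TwoFieldSketch {
    // Toy in-memory index: field name -> term -> ids of documents containing it.
    static final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();

    // Hypothetical stand-in for a real stemmer: strips one trailing "s".
    static String toyStem(String token) {
        return token.length() > 1 && token.endsWith("s")
                ? token.substring(0, token.length() - 1)
                : token;
    }

    // Index every token twice: as-is in "exact", stemmed in "stemmed".
    static void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            post("exact", token, docId);
            post("stemmed", toyStem(token), docId);
        }
    }

    static void post(String field, String term, int docId) {
        index.computeIfAbsent(field, f -> new HashMap<>())
             .computeIfAbsent(term, t -> new TreeSet<>())
             .add(docId);
    }

    // Search the exact field by default; use the stemmed field only on request.
    static Set<Integer> search(String word, boolean useStemming) {
        String field = useStemming ? "stemmed" : "exact";
        String term = useStemming ? toyStem(word.toLowerCase()) : word.toLowerCase();
        Map<String, Set<Integer>> terms =
                index.getOrDefault(field, Collections.emptyMap());
        return terms.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        add(1, "index books");
        add(2, "one book");
        System.out.println(search("book", false)); // prints [2]
        System.out.println(search("book", true));  // prints [1, 2]
    }
}
```

The default search only matches the exact surface form, while the explicit
stemmed search also pulls in the document that contains "books". In Lucene
terms the two map entries would be two fields of the same document, each
analyzed differently.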

I would rather use the non-stemmed field for MoreLikeThis as well, but as I
said I will need some sort of synonyms engine, so that I can score related
words by their real frequency and not be tricked by prefix letters (as I
said before, "the", "and" and other so-called stop words are single letters
attached to the start of the word in Hebrew, and are tough to omit).

That is mainly why I'm interested in an easy and inexpensive solution.
Mathieu seems to have gone off this topic, unfortunately...

Itamar.

-----Original Message-----
From: Daniel Noll [mailto:daniel@nuix.com] 
Sent: Friday, February 29, 2008 5:35 AM
To: java-user@lucene.apache.org
Subject: Re: Rebuilding Document from index?

On Wednesday 27 February 2008 03:33:53 Itamar Syn-Hershko wrote:
> I'm still trying to engineer the best possible solution for Lucene 
> with Hebrew, right now my path is NOT using a stemmer by default, only 
> by explicit request of the user. MoreLikeThis would only return 
> relevant results if I will use a non-stemmed scoring and lookup.

This appears to be the case for all languages too: stemming will skew
similarity and result in unrelated documents scoring higher than they
should.

Some people seem to be working around this by having two fields where one is
stemmed and the other isn't.  You could then use the stemmed field when
doing queries but use the non-stemmed field for MoreLikeThis.

Daniel

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




