
Posted to java-user@lucene.apache.org by Chun Wei Ho <cw...@gmail.com> on 2006/02/10 08:25:49 UTC

Help: tweaking search - reducing IDF skew and implementing score cutoff

Hi,

I am running a search for something akin to a news site, where each
news document has date, title, keywords/bylines, and summary fields,
plus the actual content. Using Lucene for this database of documents,
it seems that:

1. The relevancy score is skewed drastically by the number of news
documents per day. For example, if I use a RangeQuery on this week, and
there were 100 news items on Monday and two on Sunday, the two from
Sunday get ranked highly due to idf. How do I reduce this skewness
due to the date-posted field? I saw an earlier reference to
ConstantScoreRangeQuery on JIRA - is it the solution?

2. If I choose to sort the results by date, then recent documents with
very low relevancy (say the search terms appear only in the
content, and not in the title/bylines/summary fields that are boosted
higher) are still shown relatively high in the list, and I wish to
omit them in general. What is the best way to implement some sort of
relevancy filter (include only documents with a normalized score of
0.2 or more)? Or is there a better way around it?

Thanks :)

Best Regards,
CW

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Help: tweaking search - reducing IDF skew and implementing score cutoff

Posted by Chris Hostetter <ho...@fucit.org>.

: Sunday gets ranked highly due to idf. How do I reduce this skewness
: due to the date-posted field? I saw a reference earlier to
: ConstantScoreRangeQuery on JIRA - is it the solution?

Yes. RangeQuery expands to a BooleanQuery containing all of the terms in
the range. The number of terms (and the frequency of those terms in the
index) will always affect the scores. This is why I constantly argue that
a RangeQuery never makes sense for dates or numbers: always use
a RangeFilter, and if you must have a "Query" object, use
ConstantScoreRangeQuery.
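
To make the effect concrete, here is a minimal sketch (plain Java, not
Lucene code itself) of the idf formula used by the classic
DefaultSimilarity, idf = ln(numDocs / (docFreq + 1)) + 1, applied to the
example from the question. The class name IdfSkewDemo and the assumption
that the index holds only those 102 stories are hypothetical; real scores
also depend on tf, norms, and boosts.

```java
// Illustrative only: mirrors the classic DefaultSimilarity idf formula.
// The class name and the 102-document index are assumptions for this sketch.
final class IdfSkewDemo {

    /** idf for a term that appears in docFreq of numDocs documents. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 102;                 // assumed: 100 Monday stories + 2 Sunday stories
        double monday = idf(100, numDocs); // date term shared by 100 documents
        double sunday = idf(2, numDocs);   // date term shared by 2 documents
        System.out.printf("idf(Monday)=%.3f idf(Sunday)=%.3f%n", monday, sunday);
    }
}
```

Under these assumptions the Sunday date term gets an idf of about 4.53
versus about 1.01 for Monday, which is the skew described in the
question. A constant-score range clause avoids this because every
document in the range receives the same contribution from the date
clause.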

: 2. If I choose to sort the results by date, then recent documents with
: very low relevancy (say the search terms appear only in the
: content, and not in the title/bylines/summary fields that are boosted
: higher) are still shown relatively high in the list, and I wish to
: omit them in general. What is the best way to implement some sort of
: relevancy filter (include only documents with a normalized score of
: 0.2 or more)? Or is there a better way around it?

There is no safe way to filter by score; this is mentioned in the FAQ:

http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03

An alternate approach is to sort by score, but use something like a
FunctionQuery to inflate the scores of more recent documents...

https://issues.apache.org/jira/browse/LUCENE-446
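
As a rough illustration of the idea (not the FunctionQuery API from the
JIRA issue), the sketch below combines a relevance score with a recency
multiplier so that results can be sorted by a single value. The class
name RecencyBoostDemo, the decay formula, and the halfLifeDays parameter
are all hypothetical choices made for this example. It scales older
documents down rather than inflating newer ones, which gives the same
relative ordering.

```java
// Hypothetical recency weighting, for illustration only.
// boost = halfLifeDays / (halfLifeDays + ageDays): 1.0 when new, 0.5 at the half-life.
final class RecencyBoostDemo {

    /** Multiplier in (0, 1] that shrinks as the document gets older. */
    static double recencyBoost(double ageDays, double halfLifeDays) {
        return halfLifeDays / (halfLifeDays + ageDays);
    }

    /** Relevance score scaled by the recency multiplier. */
    static double boostedScore(double relevance, double ageDays, double halfLifeDays) {
        return relevance * recencyBoost(ageDays, halfLifeDays);
    }

    public static void main(String[] args) {
        double halfLife = 7.0; // assumed: a week-old story keeps half its score
        double relevantButOld = boostedScore(0.8, 7.0, halfLife); // strong match, a week old
        double weakButFresh = boostedScore(0.1, 0.0, halfLife);   // weak match, posted today
        System.out.println("relevant, week old: " + relevantButOld);
        System.out.println("weak, fresh:        " + weakButFresh);
    }
}
```

With this kind of weighting a fresh but barely relevant story does not
automatically outrank an older, strongly relevant one, which is the
problem with sorting purely by date. The decay rate would need tuning
against real data.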

-Hoss


Re: Help: tweaking search - reducing IDF skew and implementing score cutoff

Posted by Chris Lamprecht <cl...@gmail.com>.

> 2. If I choose to sort the results by date, then recent documents with
> very low relevancy (say the search terms appear only in the
> content, and not in the title/bylines/summary fields that are boosted
> higher) are still shown relatively high in the list, and I wish to
> omit them in general. What is the best way to implement some sort of
> relevancy filter (include only documents with a normalized score of
> 0.2 or more)? Or is there a better way around it?

As Chris pointed out, there isn't always an easy way to do this.  Your
suggestion of filtering out normalized scores below 0.2 might work,
assuming the most relevant document scores 1.0.  You'll have to tune this
cutoff point and see how well it works.  One thing to watch out for is
that if the top raw (non-normalized) score is less than 1.0, the scores
are not normalized at all, so your most relevant document can have a
score of less than 1.0.  This may or may not be what you want; it's just
something to consider.  Lucene's Hits.java is where the normalization
happens.
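
The behaviour described above can be sketched in plain Java as follows.
This is a simplified model of the normalization as described in this
message (scale by 1/maxScore only when the top raw score exceeds 1.0),
not a copy of Hits.java; the class name ScoreCutoffDemo and its methods
are hypothetical.

```java
// Simplified model of the normalization described above, plus a score cutoff.
// Names are hypothetical; this does not call into Lucene.
final class ScoreCutoffDemo {

    /** Scales scores so the top one is 1.0, but only if the top raw score exceeds 1.0. */
    static float[] normalize(float[] raw) {
        float max = 0.0f;
        for (float s : raw) {
            max = Math.max(max, s);
        }
        float norm = max > 1.0f ? 1.0f / max : 1.0f; // no scaling when max <= 1.0
        float[] out = new float[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] * norm;
        }
        return out;
    }

    /** Counts hits whose normalized score is at least the cutoff. */
    static int countAtOrAbove(float[] raw, float cutoff) {
        int kept = 0;
        for (float s : normalize(raw)) {
            if (s >= cutoff) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        float[] highScores = {4.0f, 2.0f, 0.4f};  // top score > 1.0: scaled to 1.0, 0.5, 0.1
        float[] lowScores = {0.3f, 0.25f, 0.1f};  // top score < 1.0: left unchanged
        System.out.println(countAtOrAbove(highScores, 0.2f));
        System.out.println(countAtOrAbove(lowScores, 0.2f));
    }
}
```

In the second array the best hit keeps its raw score of 0.3, so a 0.2
cutoff sits much closer to the top result than it would if that hit had
been scaled to 1.0. This is one reason the cutoff needs tuning per
query type.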

-chris
