You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Marcus Falck <ma...@observer.se> on 2006/09/08 09:45:13 UTC

Changing the Scoring api for OR parameters

Hi everyone,

 

I want to override the default scoring when it comes to queries
containing the OR operator.

 

For example if I got the following headlines in my index :

"Sun sues Microsoft"

"Microsoft want to buy Tiscali"

".NU domain sues Microsoft"

"The sun is shining"

"Sun brings antitrust suit against Microsoft"

 

Those documents have been boosted in desc fashion ("Sun sues Microsoft"
has higher calculated norm value then "Sun brings antirust suit against
Microsoft"), 

The similarity class that has been used has made the norm values to be
exactly as the boost value ( I have even modified the norm to be a float
so I won't loose precision ).

 

If I perform a search for: Microsoft OR Sun

 

The topranked results will almost certainly be:

Sun sues Microsoft

Sun Brings antitrust suit against Microsoft

....

 

I just want the documents returned like this:

"Sun sues Microsoft"

"Microsoft want to buy Tiscali"

".NU domain sues Microsoft"

"The sun is shining"

"Sun brings antitrust suit against Microsoft"

 

I have to get this to work since I'm indexing news material and the
customers are only interested in the newest articles ( so the date of
the article is being used as a boost factor).

 

Any ideas? My rank changes to lucene works as expected when it comes to
AND operator and single term queries.

 

/

Regards

 

Marcus Falck

Re: Changing the Scoring api for OR parameters

Posted by Chris Hostetter <ho...@fucit.org>.

if you are already seting the document boost based on the "date" of hte
Document, then the next thing you should familiarize yourself with is
Similarity.coord function.  It's specific purpose is for dealing with
Queries which "aggregate" other queries (like a BooleanQuery does with
it's clauses) to let you determine how much "penalty" documents recieve
for only matching on a subset of the clauses.

(the conventional wisdom being that a document matching more clauses is
"better" then a document matching few clauses, even if the aggregate
score of second document is a little higher thn hte first)

If you don't like that conventional wisdon, you can make the coord method
of your Similarity return a constant value (like "1") to illiminate it's
impact completely, or define some function that has a lower impact then
the overlap/maxOverlap algorithm used in the DefaultSimilarity.

Since illiminating the impact completely is a common need among
"artifically" constructed queries (like Prefix queries and Wild card
ueries) there is acctually a constructor for BooleanQuery that takes in a
boolean to do this.



Lastly: if you want more exact control over the way the dates of your
documents influence the score (without mucking with norms to make them
floats instead of bytes) consider using FunctionQuery ... searching the
archives will pop up several examples of how it can be used, and you can
find it in the Solr code base...

http://incubator.apache.org/solr/
http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/package-summary.html


: Date: Fri, 8 Sep 2006 09:45:13 +0200
: From: Marcus Falck <ma...@observer.se>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Changing the Scoring api for OR parameters
:
: Hi everyone,
:
:
:
: I want to override the default scoring when it comes to queries
: containing the OR operator.
:
:
:
: For example if I got the following headlines in my index :
:
: "Sun sues Microsoft"
:
: "Microsoft want to buy Tiscali"
:
: ".NU domain sues Microsoft"
:
: "The sun is shining"
:
: "Sun brings antitrust suit against Microsoft"
:
:
:
: Those documents have been boosted in desc fashion ("Sun sues Microsoft"
: has higher calculated norm value then "Sun brings antirust suit against
: Microsoft"),
:
: The similarity class that has been used has made the norm values to be
: exactly as the boost value ( I have even modified the norm to be a float
: so I won't loose precision ).
:
:
:
: If I perform a search for: Microsoft OR Sun
:
:
:
: The topranked results will almost certainly be:
:
: Sun sues Microsoft
:
: Sun Brings antitrust suit against Microsoft
:
: ....
:
:
:
: I just want the documents returned like this:
:
: "Sun sues Microsoft"
:
: "Microsoft want to buy Tiscali"
:
: ".NU domain sues Microsoft"
:
: "The sun is shining"
:
: "Sun brings antitrust suit against Microsoft"
:
:
:
: I have to get this to work since I'm indexing news material and the
: customers are only interested in the newest articles ( so the date of
: the article is being used as a boost factor).
:
:
:
: Any ideas? My rank changes to lucene works as expected when it comes to
: AND operator and single term queries.
:
:
:
: /
:
: Regards
:
:
:
: Marcus Falck
:
:
:
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org