Posted to java-user@lucene.apache.org by Joel Halbert <jo...@su3analytics.com> on 2009/10/28 14:29:29 UTC

similarity function

Hi,

Given a query with multiple terms, e.g. fish oil, and searching across
multiple fields e.g. 

query= fieldA:fish fieldA:oil fieldB:fish fieldB:oil  etc...

I don't want to give any more weight to documents that match the same
word multiple times (either in the same, or different fields). I am only
interested in lending additional weight to a match of both words (fish
and oil) in the SAME field.

So for example if I have documents:

Doc1
fieldA=fish is good for you
fieldB=vegetable oil and sunflower oil is good for you 

and Doc2
fieldA=fish oil is good for you
fieldB=bla bla bla

with the default similarity I would have 3 term matches in document 1
(fish, oil, oil) and 2 in document 2 (fish, oil), but I only want to
count 2 term matches in document 1 (fish, oil) and I want to give
increased weight to the two matches in document 2 because they occur in
the same field (fieldA).

Any ideas? Is there a simple way to achieve this? (it goes without
saying I want to match both documents, i.e. don't want to use quotes
"fish oil")

Thanks,
Joel



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: similarity function

Posted by Chris Hostetter <ho...@fucit.org>.
: "how do i set the score of each document result to be the score of that
: of the field that best matches the search terms"?

you'll want something like this pseudo code...

 // classes are in org.apache.lucene.search, plus org.apache.lucene.index.Term
 DisjunctionMaxQuery dq = new DisjunctionMaxQuery(0.0f);  // tie breaker of 0.0
 for (String fieldname : listOfFields) {
    BooleanQuery bq = new BooleanQuery();
    for (String word : listOfWords) {
      bq.add(new TermQuery(new Term(fieldname, word)), BooleanClause.Occur.SHOULD);
    }
    bq.setMinimumNumberShouldMatch(1);
    dq.add(bq);  // each per-field BooleanQuery becomes one disjunct
 }


...the DisjunctionMaxQuery will only take its score from whichever of the 
BooleanQueries scores highest, and the setMinimumNumberShouldMatch(1) will 
ensure that those boolean queries will match as long as at least one of the 
words is found in that field, but the more words that match the higher the 
score.
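
To make the intended ranking concrete, here is a minimal toy model of the "best field wins" idea applied to the two example documents from the original question. It is plain Java and does not use Lucene: the class name BestFieldScore and its methods are hypothetical, tokenization is a simple whitespace split, and idf, length norms and coord are all ignored, so the numbers are only illustrative and are not real Lucene scores.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: no Lucene involved; ignores idf, norms and coord.
final class BestFieldScore {

    // Number of distinct query words found in one field's text.
    // Repeated occurrences of the same word do not add anything.
    static int fieldScore(String fieldText, List<String> words) {
        Set<String> tokens =
            new HashSet<>(Arrays.asList(fieldText.toLowerCase().split("\\s+")));
        int matched = 0;
        for (String word : words) {
            if (tokens.contains(word)) {
                matched++;
            }
        }
        return matched;
    }

    // Document score = score of the single best field (tie breaker of 0.0).
    static int docScore(Map<String, String> doc, List<String> words) {
        int best = 0;
        for (String text : doc.values()) {
            best = Math.max(best, fieldScore(text, words));
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("fish", "oil");

        Map<String, String> doc1 = new LinkedHashMap<>();
        doc1.put("fieldA", "fish is good for you");
        doc1.put("fieldB", "vegetable oil and sunflower oil is good for you");

        Map<String, String> doc2 = new LinkedHashMap<>();
        doc2.put("fieldA", "fish oil is good for you");
        doc2.put("fieldB", "bla bla bla");

        System.out.println("Doc1 = " + docScore(doc1, words)); // prints "Doc1 = 1"
        System.out.println("Doc2 = " + docScore(doc2, words)); // prints "Doc2 = 2"
    }
}
```

Both documents still match, but Doc2 ranks higher because both words occur in the same field, and the repeated "oil" in Doc1's fieldB earns nothing extra.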

then all you need to do is modify your similarity class to change the tf() 
function so that a doc doesn't get a really high score just for matching 
one word many many times.
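
As a rough illustration of what changing tf() means, the sketch below compares the square-root formula that DefaultSimilarity uses for tf(float freq) with a flat alternative that treats any number of occurrences as one. It is plain Java with hypothetical names (TfComparison, defaultTf, flatTf); in a real index the flat version would go into an override of tf(float freq) in a subclass of DefaultSimilarity, and the flat formula itself is just one possible choice, not something the thread prescribes.

```java
// Toy comparison only: plain Java, no Lucene classes involved.
final class TfComparison {

    // Same formula as DefaultSimilarity.tf: square root of the term frequency.
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Flat variant: any number of occurrences counts the same as one.
    static float flatTf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }

    public static void main(String[] args) {
        // "oil" occurs twice in Doc1's fieldB.
        System.out.println(defaultTf(2.0f)); // roughly 1.41, so the repeat is rewarded
        System.out.println(flatTf(2.0f));    // prints 1.0, same as a single occurrence
    }
}
```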


-Hoss



Re: similarity function

Posted by Joel Halbert <jo...@su3analytics.com>.
I suppose this could be summarised as:

"how do i set the score of each document result to be the score of that
of the field that best matches the search terms"?

 