You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by chandrakant k <ch...@gmail.com> on 2009/06/30 20:02:18 UTC

Query which gives high score proportional to 'distinct term matches'

I have a index which has got fields like 

title :
content :

If I search for, lets say  obama fly ,  then the documents having obama and
fly should be given high scores irrespective of the number of times they may
occur. This requirement is for fields -  title and content.

The implementation which I did with a simple OR query will score high the
documents for e.g.
 having more occurrence of 'obama'  even if it has no occurrence  'fly' word
in it. The tf for 'obama' here in this case is more; so even if 'fly' word
is not present the document is scored higher.

Expected behaviour is that - 
(a)  documents having 'obama' and 'fly' both should be scored higher in
order of their tf .
(b)  documents having either of terms should be given scores but less than
those matched in (a)

I tried by overiding the the coord() in a Custom Similarity implementation
and boosting it if multiple terms match, but what I see is that coord() is
gets boosted even if same word matches in multiple fields (say obama is
present in title: and content: ).

Searching for solutions, I have not got any results which talk about similar
requirement... I guess I am not using right keywords....

Thanks
Chandrakant K.




  
-- 
View this message in context: http://www.nabble.com/Query-which-gives-high-score-proportional-to-%27distinct-term-matches%27-tp24276724p24276724.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Query which gives high score proportional to 'distinct term matches'

Posted by Matthew Hall <mh...@informatics.jax.org>.
Well, we have a very similar requirement here, but for us its for every 
single field that we wanted this kind of behavior.

We got this in by eliminating the TF (Term Frequency) contribution to 
score via a custom Similarity. (Which is very easy to do.)

I... think in the newer versions of lucene you can omit TF more 
programatically at query time, but I don't recall if you could do it on 
a per field basis.  Anyone else want to speak on this a bit better?

Matt

chandrakant k wrote:
> I have a index which has got fields like 
>
> title :
> content :
>
> If I search for, lets say  obama fly ,  then the documents having obama and
> fly should be given high scores irrespective of the number of times they may
> occur. This requirement is for fields -  title and content.
>
> The implementation which I did with a simple OR query will score high the
> documents for e.g.
>  having more occurrence of 'obama'  even if it has no occurrence  'fly' word
> in it. The tf for 'obama' here in this case is more; so even if 'fly' word
> is not present the document is scored higher.
>
> Expected behaviour is that - 
> (a)  documents having 'obama' and 'fly' both should be scored higher in
> order of their tf .
> (b)  documents having either of terms should be given scores but less than
> those matched in (a)
>
> I tried by overiding the the coord() in a Custom Similarity implementation
> and boosting it if multiple terms match, but what I see is that coord() is
> gets boosted even if same word matches in multiple fields (say obama is
> present in title: and content: ).
>
> Searching for solutions, I have not got any results which talk about similar
> requirement... I guess I am not using right keywords....
>
> Thanks
> Chandrakant K.
>
>
>
>
>   
>   


-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org