Posted to java-user@lucene.apache.org by Antony Bowesman <ad...@teamware.com> on 2007/03/22 06:53:16 UTC

Combining score from two or more hits

I have indexed objects that contain one or more attachments.  Each attachment is 
indexed as a separate Document along with the object metadata.

When I make a search, I may get hits in more than one Document that refer to the 
same object.  I have a HitCollector which knows if the object has already been 
found, so I want to be able to update the score of an existing hit in a way that 
makes sense.  e.g. If hit H1 has score 1.35 and hit H2 has score 2.9, is it 
possible to re-score it on the basis that the real hit result is (H1 AND H2)?

I can take the highest score of any Document, but just wondered if this is 
possible during the HitCollector.collect method?

Antony





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Combining score from two or more hits

Posted by Antony Bowesman <ad...@teamware.com>.
Chris Hostetter wrote:
> 
> if you are using a HitCollector, then any re-evaluation is going to
> happen in your code using whatever mechanism you want -- once your collect
> method is called on a docid, Lucene is done with that docid and no longer
> cares about it ... it's only whatever storage you may be maintaining of
> high scoring docs that needs to know that you've decided the score has
> changed.
> 
> your big problem is going to be that you basically need to maintain a list
> of *every* doc collected, if you don't know what the score of any of them
> is until you've processed all the rest ... since docs are collected in
> increasing order of docid, you might be able to make some optimizations
> based on how big of a gap you've got between the doc you are currently
> collecting and the last doc you've collected if you know that you're
> always going to add docs that "relate" to each other in sequential bundles
> -- but this would be some very custom code depending on your use case.

I only ever need to return a couple of ID fields per doc hit, so I load them 
with FieldCache when I start a new searcher.  These IDs refer to unique objects 
elsewhere, but there can be one or more instances of the same Id in the index 
due to the way I've structured Documents.  A Document = an attachment in the 
other system attached to the other system's object which can have 1...n 
attachments.  My problem is I need to return only unique external Ids with some 
kind of combined score up to the requested maxHits from the client.

Getting the unique Ids is no problem, but as you say I either have to store all 
hits and then sort them by score at the end once I know all unique docs, or do 
some clever stuff with some type of PriorityQueue that allows me to re-jig 
scores that already exist in the sorted queue.
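
A minimal sketch of the first option (store everything, sort at the end), written in modern Java and deliberately independent of Lucene. The class name GroupedScoreCollector, the externalIds array (standing in for the values loaded via FieldCache) and the pluggable combiner are all hypothetical; the collect method only mirrors the shape of HitCollector.collect(int doc, float score). How scores should be combined is an open question in this thread, so the combining rule is a parameter, not a recommendation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical helper, not part of Lucene: keeps one combined score per external id.
class GroupedScoreCollector {
    private final String[] externalIds;            // externalIds[docId], assumed preloaded
    private final BinaryOperator<Float> combiner;  // assumption: caller picks the rule
    private final Map<String, Float> combined = new HashMap<>();

    GroupedScoreCollector(String[] externalIds, BinaryOperator<Float> combiner) {
        this.externalIds = externalIds;
        this.combiner = combiner;
    }

    // Same shape as HitCollector.collect(int doc, float score).
    void collect(int doc, float score) {
        // First hit for an id stores the score; later hits are folded in by the combiner.
        combined.merge(externalIds[doc], score, combiner);
    }

    // Returns up to maxHits entries, highest combined score first.
    List<Map.Entry<String, Float>> top(int maxHits) {
        List<Map.Entry<String, Float>> all = new ArrayList<>(combined.entrySet());
        all.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        return new ArrayList<>(all.subList(0, Math.min(maxHits, all.size())));
    }
}
```

Passing Float::max keeps the best attachment score per object, while Float::sum rewards objects with several matching attachments. Memory grows with the number of distinct external ids that match, which is the cost of not knowing final scores until every hit has been seen.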

One idea your comments raise is the relationship of docids to the group of 
Documents added for the higher level object.  All the Documents for the external 
object are added with a single writer at index time.  Assuming that the 
Documents for a single external Id either all exist or none do, will the 
doc ids for that external Id always stay sequential, or will they 
'reorganise' themselves?

Thanks
Antony





Re: Combining score from two or more hits

Posted by Chris Hostetter <ho...@fucit.org>.
: Thanks Erick, I've been using TopDocs, but am playing with my own HitCollector
: variant of TopDocHitCollector.  The problem is not adjusting the score, it's
: what to adjust it by, i.e. is it possible to re-evaluate the scores of H1 and H2
: knowing that the original query resulted in hits on H1 AND H2.

if you are using a HitCollector, then any re-evaluation is going to
happen in your code using whatever mechanism you want -- once your collect
method is called on a docid, Lucene is done with that docid and no longer
cares about it ... it's only whatever storage you may be maintaining of
high scoring docs that needs to know that you've decided the score has
changed.

your big problem is going to be that you basically need to maintain a list
of *every* doc collected, if you don't know what the score of any of them
is until you've processed all the rest ... since docs are collected in
increasing order of docid, you might be able to make some optimizations
based on how big of a gap you've got between the doc you are currently
collecting and the last doc you've collected if you know that you're
always going to add docs that "relate" to each other in sequential bundles
-- but this would be some very custom code depending on your use case.
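
One way the "sequential bundles" optimization could look, as a rough sketch only. It assumes that all docs belonging to one external id occupy a contiguous range of docids, which this thread does not confirm; if that assumption fails, the same id would be emitted more than once. The names BundledCollector, Hit and externalIds are hypothetical, the code does not depend on Lucene, and taking the maximum score within a bundle is just a placeholder rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: finalizes a group as soon as the external id changes.
class BundledCollector {
    static final class Hit {
        final String id;
        final float score;
        Hit(String id, float score) { this.id = id; this.score = score; }
    }

    private final String[] externalIds;  // externalIds[docId], assumed preloaded
    private final int maxHits;
    // Min-heap on score, so the weakest retained hit is evicted first.
    private final PriorityQueue<Hit> queue =
            new PriorityQueue<Hit>((a, b) -> Float.compare(a.score, b.score));
    private String currentId;            // id of the bundle being accumulated
    private float currentScore;

    BundledCollector(String[] externalIds, int maxHits) {
        this.externalIds = externalIds;
        this.maxHits = maxHits;
    }

    // Must be called in increasing docid order, as a HitCollector would be.
    void collect(int doc, float score) {
        String id = externalIds[doc];
        if (id.equals(currentId)) {
            currentScore = Math.max(currentScore, score);  // placeholder combining rule
        } else {
            flush();                                       // previous bundle is complete
            currentId = id;
            currentScore = score;
        }
    }

    // Moves the finished bundle into the bounded queue.
    private void flush() {
        if (currentId == null) return;
        queue.add(new Hit(currentId, currentScore));
        if (queue.size() > maxHits) queue.poll();          // drop the lowest score
        currentId = null;
    }

    // Returns the retained hits, highest score first.
    List<Hit> results() {
        flush();
        List<Hit> out = new ArrayList<>(queue);
        out.sort((a, b) -> Float.compare(b.score, a.score));
        return out;
    }
}
```

Because a bundle's final score is known once the next id appears, only maxHits entries plus the current bundle are held in memory, instead of every collected doc.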





-Hoss




Re: Combining score from two or more hits

Posted by Antony Bowesman <ad...@teamware.com>.
Erick Erickson wrote:
> Don't know if it's useful or not, but if you used  TopDocs instead,
> you have access to an array of ScoreDoc which you could modify
> freely. In my app, I used a FieldSortedHitQueue to re-sort things
> when I needed to.

Thanks Erick, I've been using TopDocs, but am playing with my own HitCollector 
variant of TopDocHitCollector.  The problem is not adjusting the score, it's 
what to adjust it by, i.e. is it possible to re-evaluate the scores of H1 and H2 
knowing that the original query resulted in hits on H1 AND H2.

Antony





Re: Combining score from two or more hits

Posted by Erick Erickson <er...@gmail.com>.
Don't know if it's useful or not, but if you used  TopDocs instead,
you have access to an array of ScoreDoc which you could modify
freely. In my app, I used a FieldSortedHitQueue to re-sort things
when I needed to.
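
A small illustration of the "modify, then re-sort" idea, independent of Lucene. ScoredHit is a hypothetical stand-in that mirrors the public doc and score fields of Lucene's ScoreDoc, and a plain Arrays.sort with a comparator is swapped in for the FieldSortedHitQueue mentioned above, so treat it as a sketch of the approach only.

```java
import java.util.Arrays;

// Hypothetical sketch: adjust scores in an array of hits, then restore score order.
class ResortExample {
    // Stand-in for ScoreDoc: mutable doc id and score.
    static final class ScoredHit {
        int doc;
        float score;
        ScoredHit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Sorts by descending score, breaking ties by ascending doc id.
    static void resort(ScoredHit[] hits) {
        Arrays.sort(hits, (a, b) -> {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
        });
    }
}
```

Note that an array taken from a top-N result only contains the N best docs, so any attachment that scored below the cut-off cannot contribute to its object's adjusted score.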

ERick
