Posted to java-user@lucene.apache.org by Antony Bowesman <ad...@teamware.com> on 2007/03/22 06:53:16 UTC
Combining score from two or more hits
I have indexed objects that contain one or more attachments. Each attachment is
indexed as a separate Document along with the object metadata.
When I make a search, I may get hits in more than one Document that refer to the
same object. I have a HitCollector which knows if the object has already been
found, so I want to be able to update the score of an existing hit in a way that
makes sense. For example, if hit H1 has score 1.35 and hit H2 has score 2.9, is it
possible to re-score it on the basis that the real hit result is (H1 AND H2)?
I can take the highest score of any Document, but just wondered if this is
possible during the HitCollector.collect method?
Antony
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Combining score from two or more hits
Posted by Antony Bowesman <ad...@teamware.com>.
Chris Hostetter wrote:
>
> if you are using a HitCollector, then any re-evaluation is going to
> happen in your code using whatever mechanism you want -- once your collect
> method is called on a docid, Lucene is done with that docid and no longer
> cares about it ... it's only whatever storage you may be maintaining of
> high scoring docs that needs to know that you've decided the score has
> changed.
>
> your big problem is going to be that you basically need to maintain a list
> of *every* doc collected, if you don't know what the score of any of them
> is until you've processed all the rest ... since docs are collected in
> increasing order of docid, you might be able to make some optimizations
> based on how big of a gap you've got between the doc you are currently
> collecting and the last doc you've collected if you know that you're
> always going to add docs that "relate" to each other in sequential bundles
> -- but this would be some very custom code depending on your use case.
I only ever need to return a couple of ID fields per doc hit, so I load them
with FieldCache when I start a new searcher. These IDs refer to unique objects
elsewhere, but there can be one or more instances of the same Id in the index
due to the way I've structured Documents. Each Document corresponds to one
attachment in the other system, where an object can have 1...n attachments.
My problem is I need to return only unique external Ids with some
kind of combined score up to the requested maxHits from the client.
Getting the unique Ids is no problem, but as you say I either have to store all
hits and then sort them by score at the end once I know all unique docs, or do
some clever stuff with some type of PriorityQueue that allows me to re-jig
scores that already exist in the sorted queue.
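
A minimal sketch of the first option (store every hit, then rank at the end),
assuming a modern Java runtime rather than the 2007-era JDK. CombinedHits, add
and top are hypothetical names, not Lucene API; the caller is assumed to map
each docid to its external Id (for instance from the FieldCache array) before
calling add. The combining function is left pluggable because the thread does
not settle on one: Float::max keeps the best attachment score, while Float::sum
rewards objects with several matching attachments.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical helper, not part of Lucene: accumulates one combined score
// per external object Id and ranks the Ids once collection has finished.
class CombinedHits {
    private final Map<String, Float> scoreById = new HashMap<>();
    private final BinaryOperator<Float> combine;

    // combine decides how a new score merges with the score already stored,
    // e.g. Float::max or Float::sum (both are assumptions, not Lucene scoring).
    CombinedHits(BinaryOperator<Float> combine) {
        this.combine = combine;
    }

    // Call from collect(doc, score) after mapping the docid to its external Id.
    void add(String externalId, float score) {
        scoreById.merge(externalId, score, combine);
    }

    // Returns up to maxHits entries, highest combined score first.
    List<Map.Entry<String, Float>> top(int maxHits) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(scoreById.entrySet());
        entries.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        return new ArrayList<>(entries.subList(0, Math.min(maxHits, entries.size())));
    }
}
```

With Float::sum, the H1 = 1.35 and H2 = 2.9 example gives the object a combined
score of about 4.25; with Float::max it stays at 2.9. Memory use grows with the
number of unique Ids matched, which is the cost of deferring the sort.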
One idea your comments raise is the relationship of docids to the group of
Documents added for the higher level object. All the Documents for the external
object are added with a single writer at index time. Assuming that either all
of the Documents for a single external Id exist or none do, will the doc ids
for that external Id always stay sequential, or can they 'reorganise'
themselves?
Thanks
Antony
Re: Combining score from two or more hits
Posted by Chris Hostetter <ho...@fucit.org>.
: Thanks Erick, I've been using TopDocs, but am playing with my own HitCollector
: variant of TopDocHitCollector. The problem is not adjusting the score, it's
: what to adjust it by, i.e. is it possible to re-evaluate the scores of H1 and H2
: knowing that the original query resulted in hits on H1 AND H2.
if you are using a HitCollector, then any re-evaluation is going to
happen in your code using whatever mechanism you want -- once your collect
method is called on a docid, Lucene is done with that docid and no longer
cares about it ... it's only whatever storage you may be maintaining of
high scoring docs that needs to know that you've decided the score has
changed.
your big problem is going to be that you basically need to maintain a list
of *every* doc collected, if you don't know what the score of any of them
is until you've processed all the rest ... since docs are collected in
increasing order of docid, you might be able to make some optimizations
based on how big of a gap you've got between the doc you are currently
collecting and the last doc you've collected if you know that you're
always going to add docs that "relate" to each other in sequential bundles
-- but this would be some very custom code depending on your use case.
-Hoss
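
The sequential-bundle optimization described above might look like the
following sketch. It assumes, without confirmation from the thread, that all
Documents of one external object have adjacent docids and that hits arrive in
increasing docid order, so a group is complete as soon as the external Id
changes. BundleCollector and its methods are hypothetical names, not Lucene
API, and summing is only one possible way to combine scores.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not a Lucene class.
class BundleCollector {
    // One finished group: external Id plus its combined score.
    static final class Hit {
        final String id;
        final float score;
        Hit(String id, float score) { this.id = id; this.score = score; }
    }

    private final int maxHits;
    // Min-heap: the weakest retained hit sits at the head and is evicted first.
    private final PriorityQueue<Hit> queue =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
    private String currentId;    // Id of the bundle being accumulated, or null
    private float currentScore;

    BundleCollector(int maxHits) { this.maxHits = maxHits; }

    // Feed hits in increasing docid order; the caller resolves docid -> externalId.
    void collect(String externalId, float score) {
        if (externalId.equals(currentId)) {
            currentScore += score;   // same bundle: combine by summing (an assumption)
            return;
        }
        flush();                     // Id changed, so the previous bundle is complete
        currentId = externalId;
        currentScore = score;
    }

    // Moves the finished bundle into the bounded queue.
    private void flush() {
        if (currentId == null) return;
        queue.add(new Hit(currentId, currentScore));
        if (queue.size() > maxHits) queue.poll();
        currentId = null;
    }

    // Call once after the search completes; returns the best hits first.
    List<Hit> results() {
        flush();
        List<Hit> out = new ArrayList<>(queue);
        out.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return out;
    }
}
```

Because a score is final once its bundle ends, only maxHits groups are held in
memory at any time. If the adjacency assumption does not hold, the same Id
would be emitted more than once, so this only suits indexes where it is known
to be true.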
Re: Combining score from two or more hits
Posted by Antony Bowesman <ad...@teamware.com>.
Erick Erickson wrote:
> Don't know if it's useful or not, but if you used TopDocs instead,
> you have access to an array of ScoreDoc which you could modify
> freely. In my app, I used a FieldSortedHitQueue to re-sort things
> when I needed to.
Thanks Erick, I've been using TopDocs, but am playing with my own HitCollector
variant of TopDocHitCollector. The problem is not adjusting the score, it's
what to adjust it by, i.e. is it possible to re-evaluate the scores of H1 and H2
knowing that the original query resulted in hits on H1 AND H2.
Antony
Re: Combining score from two or more hits
Posted by Erick Erickson <er...@gmail.com>.
Don't know if it's useful or not, but if you used TopDocs instead,
you have access to an array of ScoreDoc which you could modify
freely. In my app, I used a FieldSortedHitQueue to re-sort things
when I needed to.
Erick
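
As a rough illustration of this post-processing approach, the sketch below
adjusts scores in an array and then re-sorts it. Hit is a stand-in for Lucene's
ScoreDoc (which has public doc and score fields), and Rescore.resort is a
hypothetical helper; it uses a plain Arrays.sort rather than
FieldSortedHitQueue.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, not part of Lucene.
class Rescore {
    // Stand-in for ScoreDoc: a docid plus a mutable score.
    static final class Hit {
        final int doc;
        float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Re-sorts in place: highest score first, ties broken by lower docid.
    static void resort(Hit[] hits) {
        Arrays.sort(hits, Comparator.comparingDouble((Hit h) -> h.score).reversed()
                .thenComparingInt(h -> h.doc));
    }
}
```

The caller changes hit.score however it sees fit and then calls
Rescore.resort(hits). This only re-ranks hits that already made it into the
returned array, so a document that missed the cut cannot be promoted afterwards.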