You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2007/03/01 13:31:31 UTC

Re: Sorting by Score

Peter:

About a custom ScoreComparator. The problem I couldn't get past was that I
needed to know the max score of all the docs in order to divide the raw
scores into quintiles since I was dealing with raw scores. I didn't see how
to make that work with ScoreComparator, but I confess that I didn't look
very hard after someone on the list turned me on to FieldSortedHitQueue....

Erick

On 2/28/07, Erick Erickson <er...@gmail.com> wrote:
>
> It may well be, but as I said this is efficient enough for my needs
> so I didn't pursue it. One of my pet peeves is spending time making
> things "more efficient" when there's no need, and my index isn't
> going to grow enough larger to worry about that now <G>...
>
> Erick
>
> On 2/28/07, Peter Keegan < peterlkeegan@gmail.com> wrote:
> >
> > Erich,
> >
> > Yes, this seems to be the simplest way to implement score
> > 'bucketization',
> > but wouldn't it be more efficient to do this with a custom
> > ScoreComparator?
> > That way, you'd do the bucketizing and sorting in one 'step'
> > (compare()).
> > Maybe the savings isn't measurable, though. A comparator might also
> > allow
> > one to do a more sophisticated rounding or bucketizing since you'd be
> > getting 2 scores at a time.
> >
> > Peter
> >
> >
> > On 2/28/07, Erick Erickson <erickerickson@gmail.com > wrote:
> > >
> > > Empirically, when I insert the elements in the FieldSortedHitQueue
> > > they get sorted according to the Sort object. The original query
> > > that gives me a TopDocs applied
> > > no secondary sorting, only relevancy. Since I normalized
> > > all the scores into one of only 5 discrete values, and secondary
> > > sorting was applied to all docs with the same score when I inserted
> > > them in the FieldSortedHitQueue.
> > >
> > > Now popping things of the FieldSortedHitQueue is ordered the
> > > way I want.
> > >
> > > You could just operate on the FieldSortedHitQueue at this point, but
> > > I decided the rest of my code would be simpler if I stuffed them back
> > > into the TopDocs, so there's some explanation below that you can
> > > just skip if I've cleared things up already.....
> > >
> > > *****************
> > > The step I left out is moving the documents from the
> > > FIeldSortedHitQueue back to topDocs.scoreDocs.
> > > So the steps are as follows..
> > >
> > > 1> "bucketize" the scores. That is, go through the
> > > TopDocs.scoreDocs and adjust each raw score into
> > > one of my buckets. This is made easy by the
> > > existence of topDocs.getMaxScore . TopDocs has
> > > had no sorting other than relevancy applied so far.
> > >
> > > 2> assemble the FieldSortedHitQueue by inserting
> > > each element from scoreDocs into it, with a suitable
> > > Sort object, relevance is the first field ( SortField.FIELD_SCORE).
> > >
> > > 3> pop the entries off the FieldSortedHitQueue, overwriting
> > > the elements in topDocs.scoreDocs.
> > >
> > > I left out step <3>, although I suppose you could
> > > operate directly on the FieldSortedHitQueue.
> > >
> > > NOTE: in my case, I just put everything back in the
> > > scoreDocs without attempting any efficiencies. If I
> > > needed more performance, I'd only put as many items
> > > back as I needed to display. But as I wrote yesterday,
> > > performance isn't an issue so there's no point. Although
> > > I know one place to look if we need to squeeze more QPS.
> > >
> > > How efficient this is is an open question. But it's "fast enough"
> > > and relatively simple so I stopped looking for more
> > > efficiencies....
> > >
> > > Erick
> > >
> > > On 2/28/07, Chris Hostetter <hossman_lucene@fucit.org > wrote:
> > > >
> > > >
> > > > : The first part was just to iterate through the TopDocs that's
> > > available
> > > > to
> > > > : my and normalize the scores right in the ScoreDocs. Like this...
> > > >
> > > > Won't that be done after the Lucene does the hitcollecting/sorting?
> > ...
> > > he
> > > > wants the "bucketing" to happen as part of hte scoring so that the
> > > > secondary sort will determine the ordering within the bucket.
> > > >
> > > > (or am i missing something about your description?)
> > > >
> > > >
> > > >
> > > >
> > > > -Hoss
> > > >
> > > >
> > > >
> > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > >
> >
>
>

Re: Sorting by Score

Posted by Peter Keegan <pe...@gmail.com>.

Erick,

I think you're right because you'd wouldn't know the max score before the
comparisons. I'm just thinking about a rounding algorithm that involves
comparing the raw scores to the theoretical maximum score, which I think
could be computed from the Similarity class and knowing the max boost value
used during indexing.

Peter

On 3/1/07, Erick Erickson <er...@gmail.com> wrote:
>
> Peter:
>
> About a custom ScoreComparator. The problem I couldn't get past was that I
> needed to know the max score of all the docs in order to divide the raw
> scores into quintiles since I was dealing with raw scores. I didn't see
> how
> to make that work with ScoreComparator, but I confess that I didn't look
> very hard after someone on the list turned me on to
> FieldSortedHitQueue....
>
> Erick
>
> On 2/28/07, Erick Erickson <er...@gmail.com> wrote:
> >
> > It may well be, but as I said this is efficient enough for my needs
> > so I didn't pursue it. One of my pet peeves is spending time making
> > things "more efficient" when there's no need, and my index isn't
> > going to grow enough larger to worry about that now <G>...
> >
> > Erick
> >
> > On 2/28/07, Peter Keegan < peterlkeegan@gmail.com> wrote:
> > >
> > > Erich,
> > >
> > > Yes, this seems to be the simplest way to implement score
> > > 'bucketization',
> > > but wouldn't it be more efficient to do this with a custom
> > > ScoreComparator?
> > > That way, you'd do the bucketizing and sorting in one 'step'
> > > (compare()).
> > > Maybe the savings isn't measurable, though. A comparator might also
> > > allow
> > > one to do a more sophisticated rounding or bucketizing since you'd be
> > > getting 2 scores at a time.
> > >
> > > Peter
> > >
> > >
> > > On 2/28/07, Erick Erickson <erickerickson@gmail.com > wrote:
> > > >
> > > > Empirically, when I insert the elements in the FieldSortedHitQueue
> > > > they get sorted according to the Sort object. The original query
> > > > that gives me a TopDocs applied
> > > > no secondary sorting, only relevancy. Since I normalized
> > > > all the scores into one of only 5 discrete values, and secondary
> > > > sorting was applied to all docs with the same score when I inserted
> > > > them in the FieldSortedHitQueue.
> > > >
> > > > Now popping things of the FieldSortedHitQueue is ordered the
> > > > way I want.
> > > >
> > > > You could just operate on the FieldSortedHitQueue at this point, but
> > > > I decided the rest of my code would be simpler if I stuffed them
> back
> > > > into the TopDocs, so there's some explanation below that you can
> > > > just skip if I've cleared things up already.....
> > > >
> > > > *****************
> > > > The step I left out is moving the documents from the
> > > > FIeldSortedHitQueue back to topDocs.scoreDocs.
> > > > So the steps are as follows..
> > > >
> > > > 1> "bucketize" the scores. That is, go through the
> > > > TopDocs.scoreDocs and adjust each raw score into
> > > > one of my buckets. This is made easy by the
> > > > existence of topDocs.getMaxScore . TopDocs has
> > > > had no sorting other than relevancy applied so far.
> > > >
> > > > 2> assemble the FieldSortedHitQueue by inserting
> > > > each element from scoreDocs into it, with a suitable
> > > > Sort object, relevance is the first field ( SortField.FIELD_SCORE).
> > > >
> > > > 3> pop the entries off the FieldSortedHitQueue, overwriting
> > > > the elements in topDocs.scoreDocs.
> > > >
> > > > I left out step <3>, although I suppose you could
> > > > operate directly on the FieldSortedHitQueue.
> > > >
> > > > NOTE: in my case, I just put everything back in the
> > > > scoreDocs without attempting any efficiencies. If I
> > > > needed more performance, I'd only put as many items
> > > > back as I needed to display. But as I wrote yesterday,
> > > > performance isn't an issue so there's no point. Although
> > > > I know one place to look if we need to squeeze more QPS.
> > > >
> > > > How efficient this is is an open question. But it's "fast enough"
> > > > and relatively simple so I stopped looking for more
> > > > efficiencies....
> > > >
> > > > Erick
> > > >
> > > > On 2/28/07, Chris Hostetter <hossman_lucene@fucit.org > wrote:
> > > > >
> > > > >
> > > > > : The first part was just to iterate through the TopDocs that's
> > > > available
> > > > > to
> > > > > : my and normalize the scores right in the ScoreDocs. Like this...
> > > > >
> > > > > Won't that be done after the Lucene does the
> hitcollecting/sorting?
> > > ...
> > > > he
> > > > > wants the "bucketing" to happen as part of hte scoring so that the
> > > > > secondary sort will determine the ordering within the bucket.
> > > > >
> > > > > (or am i missing something about your description?)
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -Hoss
> > > > >
> > > > >
> > > > >
> > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > > >
> > > >
> > >
> >
> >
>