You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2009/03/17 12:44:30 UTC

Is TopDocCollector's collect() implementation correct?

Hi

TopDocCollector's (TDC) implementation of collect() seems a bit problematic
to me.

      if (reusableSD == null) {
        reusableSD = new ScoreDoc(doc, score);
      } else if (score >= reusableSD.score) {
        // reusableSD holds the last "rejected" entry, so, if
        // this new score is not better than that, there's no
        // need to try inserting it
        reusableSD.doc = doc;
        reusableSD.score = score;
      } else {
        return;
      }
      reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD);

The first 'if' is safe, as it assumes that if reusableSD is null, then there
is more room in PQ.

The second (else if) is a bit problematic. It assumes that if the given
score is less than the latest item rejected from PQ, then it should not be
added. The problem I see here is if someone extends TDC and provides his own
PQ, which for example, sorts based on the lowest scoring documents, or just
by document IDs. TDC has a protected constructor which allows that. In that
case, comparing the current score to the latest rejected ScoreDoc is wrong.
Also, PQ itself will check that (using lessThan) before attempting to call
the insert itself.

On the other hand, this check saves two method calls (PQ.insert and
PQ.lessThan) in case someone wants to sort by top ranking documents, which
is the default behavior.

Of course we could say to that person who wants to sort by docIDs, to extend
TDC and override collect() itself, but I would like to propose an
alternative, which will allow extending TDC by providing a different PQ,
while still using its collect() method.

Introduce in TDC a private boolean which signals whether the default PQ is
used or not. If it's not used, don't do the 'else if' at all. If it is used,
then the 'else if' is safe. Then code could look like:

      if (reusableSD == null) {
        reusableSD = new ScoreDoc(doc, score);
      } else if (useDefaultPQ){
        if (score >= reusableSD.score) {
          // reusableSD holds the last "rejected" entry, so, if
          // this new score is not better than that, there's no
          // need to try inserting it
          reusableSD.doc = doc;
          reusableSD.score = score;
        } else {
          return;
        }
      } else {
         reusableSD.doc = doc;
         reusableSD.score = score;
      }

      reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD);

I don't think that adding useDefaultPQ will have any performance
implications, but it does add one extra boolean check. Therefore if you
agree that this should be fixed in TDC, rather than forcing anyone who
provides his own PQ to also override collect() (in the cases I've noted
above), then I can submit a patch.

Shai