You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Ignacio Vera (Jira)" <ji...@apache.org> on 2022/02/21 07:28:00 UTC
[jira] [Created] (LUCENE-10429) DocIdsetBuilder implementation implementation is inconsistent with DocIdsetBuilder#grow contract

Ignacio Vera created LUCENE-10429:
-------------------------------------

             Summary: DocIdsetBuilder implementation implementation is inconsistent with DocIdsetBuilder#grow contract 
                 Key: LUCENE-10429
                 URL: https://issues.apache.org/jira/browse/LUCENE-10429
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Ignacio Vera


Currently the contract of DocIdSetBulder#grow says:

{noformat}
/**
 * Reserve space and return a \{@link BulkAdder} object that can be used to add up to \{@code * numDocs}documents.
*/
public BulkAdder grow(int numDocs)
{noformat}

 
I would expect that from the PointValues API I could call this method by using the following:

{noformat}
DocIdSetBulder#grow((int) Math.min(docCount, node.size()));
{noformat}

But it seems it is not true as the implementation expects that the method is grow for all the call to growAdder, counting duplicated documents. Therefore we have this other implementation of addAll instead, to make happy the implementation:

{noformat}
 public void addAll(PointValues.IntersectVisitor visitor, boolean grown) throws IOException {
      if (grown == false) {
        final long size = size();
        if (size <= Integer.MAX_VALUE) {
          visitor.grow((int) size);
          grown = true;
        }
      }
      if (isLeafNode()) {
        // Leaf node
        leafNodes.seek(getLeafBlockFP());
        // How many points are stored in this leaf cell:
        int count = leafNodes.readVInt();
        // No need to call grow(), it has been called up-front
        docIdsWriter.readInts(leafNodes, count, visitor);
      } else {
        pushLeft();
        addAll(visitor, grown);
        pop();
        pushRight();
        addAll(visitor, grown);
        pop();
      }
    }

{noformat}

Therefore we have three options here:

1) Modify the grow API to reflect that it can be called more than Integer#MAX_VALUE, and therefore change the input parameter from int to long. Note that this method is exposed due to the points API so we have tried this implementation in LUCENE-10311 by creating a specific implementation of DocIdSetBuilder for points. This has so far been rejected.

2) Modify the implementation of DocIdSetBuilder. Currently the issue is that we have a counter inside the implementation that it is used to compute the cost for the dense case of the final iterator. Therefore we need to change the way we compute the cost.
The proposal here is to change the way we compute cost from:
{noformat}
 final long cost = Math.round(counter / numValuesPerDoc);
{noformat}
Which might underestimate the cost of the iterator to the following that overestimate the cost:
{noformat}
 final long cost = Math.min(counter, docCount))
{noformat}
I lack of intuition of how this might affect performance down the line. One thing I notice is that for the Terms API (that is when we add docs using a DocIdSetIterator via DocIdSetBuilder#add), we ignore the counter in the dense case, so we are already providing a totally wrong cost on that case! 

3) If none of this proposals is successful we should at least update the java docs to reflect reality.






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org