You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Ignacio Vera (Jira)" <ji...@apache.org> on 2022/02/21 07:29:00 UTC

[jira] [Updated] (LUCENE-10429) DocIdsetBuilder implementation is inconsistent with DocIdSetBuilder#grow contract

     [ https://issues.apache.org/jira/browse/LUCENE-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ignacio Vera updated LUCENE-10429:
----------------------------------
    Summary: DocIdsetBuilder implementation is inconsistent with DocIdSetBuilder#grow contract   (was: DocIdsetBuilder implementation is inconsistent with DocIdsetBuilder#grow contract )

> DocIdsetBuilder implementation is inconsistent with DocIdSetBuilder#grow contract 
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-10429
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10429
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ignacio Vera
>            Priority: Major
>
> Currently the contract of DocIdSetBulder#grow says:
> {noformat}
> /**
>  * Reserve space and return a \{@link BulkAdder} object that can be used to add up to \{@code * numDocs}documents.
> */
> public BulkAdder grow(int numDocs)
> {noformat}
>  
> I would expect that from the PointValues API I could call this method by using the following:
> {noformat}
> DocIdSetBulder#grow((int) Math.min(docCount, node.size()));
> {noformat}
> But it seems it is not true as the implementation expects that the method is grow for all the call to growAdder, counting duplicated documents. Therefore we have this other implementation of addAll instead, to make happy the implementation:
> {noformat}
>  public void addAll(PointValues.IntersectVisitor visitor, boolean grown) throws IOException {
>       if (grown == false) {
>         final long size = size();
>         if (size <= Integer.MAX_VALUE) {
>           visitor.grow((int) size);
>           grown = true;
>         }
>       }
>       if (isLeafNode()) {
>         // Leaf node
>         leafNodes.seek(getLeafBlockFP());
>         // How many points are stored in this leaf cell:
>         int count = leafNodes.readVInt();
>         // No need to call grow(), it has been called up-front
>         docIdsWriter.readInts(leafNodes, count, visitor);
>       } else {
>         pushLeft();
>         addAll(visitor, grown);
>         pop();
>         pushRight();
>         addAll(visitor, grown);
>         pop();
>       }
>     }
> {noformat}
> Therefore we have three options here:
> 1) Modify the grow API to reflect that it can be called more than Integer#MAX_VALUE, and therefore change the input parameter from int to long. Note that this method is exposed due to the points API so we have tried this implementation in LUCENE-10311 by creating a specific implementation of DocIdSetBuilder for points. This has so far been rejected.
> 2) Modify the implementation of DocIdSetBuilder. Currently the issue is that we have a counter inside the implementation that it is used to compute the cost for the dense case of the final iterator. Therefore we need to change the way we compute the cost.
> The proposal here is to change the way we compute cost from:
> {noformat}
>  final long cost = Math.round(counter / numValuesPerDoc);
> {noformat}
> Which might underestimate the cost of the iterator to the following that overestimate the cost:
> {noformat}
>  final long cost = Math.min(counter, docCount))
> {noformat}
> I lack of intuition of how this might affect performance down the line. One thing I notice is that for the Terms API (that is when we add docs using a DocIdSetIterator via DocIdSetBuilder#add), we ignore the counter in the dense case, so we are already providing a totally wrong cost on that case! 
> 3) If none of this proposals is successful we should at least update the java docs to reflect reality.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org