You are viewing a plain text version of this content. The canonical link for it is here.
Posted to pr@cassandra.apache.org by GitBox <gi...@apache.org> on 2022/04/07 13:51:41 UTC

[GitHub] [cassandra] blambov commented on a diff in pull request #1462: CASSANDRA-15510 (4.0): Optimise BTree.{build,update,transform}

blambov commented on code in PR #1462:
URL: https://github.com/apache/cassandra/pull/1462#discussion_r845109933


##########
src/java/org/apache/cassandra/utils/btree/BTree.java:
##########
@@ -3174,23 +3163,23 @@ void copy(Object[] unode, int usz, int offset, int length)
          * {@code from} may be {@code -1}, representing the first child only;
          * all other indices represent the key/child pairs that follow (i.e. a key and its right-hand child).
          */
-        private void copyNoOverflow(Object[] unode, int usz, int[] uszmap, int offset, int length)
+        private void copyPrecedingNoOverflow(Object[] unode, int usz, int[] uszmap, int offset, int length)
         {
             if (length <= 1)
             {
                 if (length == 0)
                     return;
 
+                buffer[MAX_KEYS + count] = unode[usz + offset];
+                sizes[count] = uszmap[offset] - (offset > 0 ? (1 + uszmap[offset - 1]) : 0);
                 buffer[count] = unode[offset];
-                buffer[MAX_KEYS + count + 1] = unode[usz + offset + 1];
-                sizes[count + 1] = uszmap[offset + 1] - (1 + uszmap[offset]);
                 ++count;
             }
             else
             {
+                System.arraycopy(unode, usz + offset, buffer, MAX_KEYS + count, length);

Review Comment:
   At some point I had both `copy` and `copyPreceding` and, because the children are logically before the keys in this case, I moved it for some internal clarity.
   But that really isn't helpful now that there is again only one method, moved back for the potential efficiency gain.



##########
src/java/org/apache/cassandra/utils/btree/BTree.java:
##########
@@ -219,103 +214,92 @@ public static Dir desc(boolean desc)
         if (height == 2)
         {
             // we use the _exact same logic_ as below, only we invoke buildLeaf
-            // denseCount: subtract the minimal (sparse) size of the tree and
-            // divide the remainder by the amount extra needed extra per dense
-            int denseCount = (size - (childCount * MIN_KEYS + keyCount + 1)) / (1+MIN_KEYS);
+            int remaining = size;
+            int threshold = MAX_KEYS + 1 + MIN_KEYS;
             int i = 0;
-            while (i < denseCount)
-            {
-                sizeMap[i] = i * (1+MAX_KEYS) + MAX_KEYS;
-                branch[keyCount + i] = buildLeafWithoutSizeTracking(source, MAX_KEYS, updateF);
-                branch[i++] = isSimple(updateF) ? source.next() : updateF.apply(source.next());
-            }
-            int sparseCount = childCount - (1 + denseCount);
-            int remainderSize = size - (denseCount * (1+MAX_KEYS) + sparseCount * (1+MIN_KEYS));
-            {
-                branch[keyCount + i] = buildLeafWithoutSizeTracking(source, remainderSize, updateF);
-                sizeMap[i] = size = i * (1+MAX_KEYS) + remainderSize;
-            }
-            // finally build "sparse" children to consume the remaining elements
-            while (i < keyCount)
+            while (remaining >= threshold)
             {
-                branch[i++] = isSimple(updateF) ? source.next() : updateF.apply(source.next());
-                branch[keyCount + i] = buildLeafWithoutSizeTracking(source, MIN_KEYS, updateF);
-                sizeMap[i] = size += 1 + MIN_KEYS;
+                branch[keyCount + i] = buildLeaf(source, MAX_KEYS, updateF);
+                branch[i] = isSimple(updateF) ? source.next() : updateF.apply(source.next());
+                remaining -= MAX_KEYS + 1;
+                sizeMap[i++] = size - remaining - 1;
             }
-            if (!isSimple(updateF))
+            if (remaining > MAX_KEYS)
             {
-                updateF.onAllocated(denseCount * PERFECT_DENSE_SIZE_ON_HEAP[0]
-                                    + sparseCount * PERFECT_SPARSE_SIZE_ON_HEAP[0]
-                                    + ObjectSizes.sizeOfReferenceArray(remainderSize | 1));
+                int childSize = remaining / 2;
+                branch[keyCount + i] = buildLeaf(source, childSize, updateF);
+                branch[i] = isSimple(updateF) ? source.next() : updateF.apply(source.next());
+                remaining -= childSize + 1;
+                sizeMap[i++] = size - remaining - 1;
             }
+            branch[keyCount + i] = buildLeaf(source, remaining, updateF);
+            sizeMap[i++] = size;
+            assert i == childCount;
         }
         else
         {
             --height;
-
-            // denseCount: subtract the minimal (sparse) size of the tree and
-            // divide the remainder by the amount extra needed extra per dense
-            int denseCount = (size - (childCount * sparseSize(height) + keyCount + 1)) / (denseSize(height) - sparseSize(height));
+            int denseChildSize = denseSize(height);
+            int denseGrandChildSize = denseSize(height - 1);
+            // The threshold is the point after which we can't add a dense child and still add another child with
+            // at least MIN_KEYS fully dense children plus at least one more key.
+            int threshold = denseChildSize + 1 + MIN_KEYS * (denseGrandChildSize + 1);
+            int remaining = size;
             int i = 0;
-            while (i < denseCount)
-            {
-                sizeMap[i] = (1 + i) * (1 + denseSize(height)) - 1;
-                branch[keyCount + i] = buildPerfectDense(source, height, updateF);
-                branch[i++] = isSimple(updateF) ? source.next() : updateF.apply(source.next());
-            }
-            int sparseCount = childCount - (1 + denseCount);
-            int remainderSize = size - (denseCount * (1+denseSize(height)) + sparseCount * (1+sparseSize(height)));
+            // Add dense children until we reach the threshold.
+            while (remaining >= threshold)
             {
-                // need to decide our number of children either based on denseGrandChildSize,
-                // or the least number of children we can use, if we want to be maximally dense
-                int grandChildCount = remainderSize < denseSize(height)/2
-                                      ? MIN_KEYS + 1
-                                      : remainderSize / (denseSize(height-1)+1) + 1;
-
-                branch[keyCount + i] = buildMaximallyDense(source, grandChildCount, remainderSize, height, updateF);
-                sizeMap[i] = size = (1 + denseSize(height)) * i + remainderSize;
+                branch[keyCount + i] = buildPerfectDenseAndAddUsage(source, height, updateF);
+                branch[i] = isSimple(updateF) ? source.next() : updateF.apply(source.next());
+                remaining -= denseChildSize + 1;
+                sizeMap[i++] = size - remaining - 1;
             }
-            // finally build sparse children to consume the remaining elements
-            while (i < keyCount)
+            // If the remainder does not fit one child, split it roughly in half, where the first child only has dense
+            // grandchildren.
+            if (remaining > denseChildSize)
             {
-                branch[i++] = isSimple(updateF) ? source.next() : updateF.apply(source.next());
-                branch[keyCount + i] = buildPerfectSparse(source, height, updateF);
-                sizeMap[i] = size += 1 + sparseSize(height);
-            }
-            if (!isSimple(updateF))
-            {
-                // accounts only for the perfect allocations; recursive calls of buildMaximallyDense will account for the remainder
-                updateF.onAllocated(denseCount * PERFECT_DENSE_SIZE_ON_HEAP[height - 1]
-                                    + sparseCount * PERFECT_SPARSE_SIZE_ON_HEAP[height - 1]);
+                int grandChildCount = remaining / ((denseGrandChildSize + 1) * 2);
+                assert grandChildCount >= MIN_KEYS + 1;
+                int childSize = grandChildCount * (denseGrandChildSize + 1) - 1;
+                branch[keyCount + i] = buildMaximallyDense(source, grandChildCount, childSize, height, updateF);

Review Comment:
   Do you mean create a separate method for a child with `n` dense grandchildren? That was too much extra code for me; instead, added a special case for dense last child, which should effectively make this the same thing.



##########
src/java/org/apache/cassandra/utils/btree/BTree.java:
##########
@@ -3405,301 +3396,161 @@ private void clearBranchBuffer(Object[] array)
 
         /**
          * Precondition: {@code update} should not be empty.
-         *
+         * <p>
          * Inserts {@code insert} into {@code update}, after applying {@code updateF} to each item, or matched item pairs.
          */
         Object[] update(Object[] update, Object[] insert, Comparator<? super Compare> comparator, UpdateFunction<Insert, Existing> updateF)
         {
-            this.update.init(update);
             this.insert.init(insert);
+            this.updateF = updateF;
+            this.comparator = comparator;
             this.allocated = isSimple(updateF) ? -1 : 0;
-            return update(comparator, updateF);
+            int leafDepth = BTree.depth(update) - 1;
+            LeafOrBranchBuilder builder = leaf();
+            for (int i = 0; i < leafDepth; ++i)
+                builder = builder.ensureParent();
+
+            Insert ik = this.insert.next();
+            ik = updateRecursive(ik, update, null, builder);
+            assert ik == null;
+            Object[] result = completeBuild(builder);
+
+            if (allocated > 0)
+                updateF.onAllocated(allocated);
+
+            return result;
         }
 
         /**
-         * We base our operation on the shape of {@code update}, trying to steal as much of the original tree as
-         * possible for our new tree
+         * Merge a BTree recursively with the contents of {@code insert} up to the given upper bound.
+         *
+         * @param ik The next key from the inserted data.
+         * @param unode The source branch to update.
+         * @param uub The branch's upper bound
+         * @param builder The builder that will receive the data. It needs to be at the same level of the hierarchy
+         *                as the source unode.
+         * @return The next key from the inserted data, >= uub.
          */
-        private Object[] update(Comparator<? super Compare> comparator, UpdateFunction<Insert, Existing> updateF)
-        {
-            Existing uub = null;
-            Object[] unode = update.node(), inode = insert.node();
-            int upos = update.position(), usz = shallowSize(unode);
-            int ipos = insert.position(), isz = shallowSize(inode);
-
-            LeafOrBranchBuilder level = leaf(); // once level == null, we're done
-            BranchBuilder branch = null; // branch == null if level == leaf(); this is unfortunate, but we need to invoke branch-specific methods
-            for (int i = 0; i < update.leafDepth; ++i)
-                level = branch = level.ensureParent();
-
-            // define:      xnode[xsz] == xub (i.e. the key to our right in our parent can be treated below as a key past the end of the node)
-            // invariants:  iub > unode[upos], uub > inode[ipos]
-            // invariants:  ipos >=  0 && ipos  < isz (imposed by SimpleTreeIterator)
-            // invariants:  upos >= -1 && upos <= usz
-            while (true)
-            {
-                if (level == leaf())
-                {
-                    if (isLeaf(inode))
-                    {
-                        // Both are leaves, so can simply linearly merge them until we reach the update upper bound.
-                        // This is most easily achieved by looping until one of the leaves is consumed, then
-                        // continuing to consume inode until we hit uub
+        private Insert updateRecursive(Insert ik, Object[] unode, Existing uub, LeafOrBranchBuilder builder)
+        {
+            return builder == leaf()
+                   ? updateRecursive(ik, unode, uub, (LeafBuilder) builder)
+                   : updateRecursive(ik, unode, uub, (BranchBuilder) builder);
+        }
 
-                        // Note that unode can already be exhausted, because we use the original tree to define the
-                        // shape of our updated tree and so we retain the _bounds_ of each sub-tree, only creating
-                        // new bounds during overflow, so once we have copied a leaf we continue inserting elements
+        private Insert updateRecursive(Insert ik, Object[] unode, Existing uub, BranchBuilder builder)
+        {
+            int upos = 0;
+            int usz = shallowSizeOfBranch(unode);
 
-                        Insert ik =  (Insert) inode[ipos];
-                        if (upos < usz)
-                        {
-                            Existing uk = (Existing) unode[upos];
-                            int c = comparator.compare(uk, ik);
-                            while (true)
-                            {
-                                if (c == 0)
-                                {
-                                    leaf().addKey(updateF.apply(uk, ik));
-                                    if (++upos < usz)
-                                        uk = (Existing) unode[upos];
-                                    if (++ipos < isz)
-                                        ik = (Insert) inode[ipos];
-                                    if (upos == usz || ipos == isz)
-                                        break;
-                                    c = comparator.compare(uk, ik);
-                                }
-                                else if (c < 0)
-                                {
-                                    int ulim = exponentialSearch(comparator, unode, upos + 1, usz, ik);
-                                    c = ulim < 0 ? 1 : 0; // positive or zero
-                                    if (ulim < 0) ulim = -(1 + ulim);
-                                    leaf().copy(unode, upos, ulim - upos);
-                                    if ((upos = ulim) == usz)
-                                        break;
-                                    uk = (Existing) unode[upos];
-                                }
-                                else
-                                {
-                                    int ilim = exponentialSearch(comparator, inode, ipos + 1, isz, uk);
-                                    c = ilim & 0x80000000; // negative or zero
-                                    if (ilim < 0) ilim = -(1 + ilim);
-                                    leaf().copy(inode, ipos, ilim - ipos, updateF);
-                                    if ((ipos = ilim) == isz)
-                                        break;
-                                    ik = (Insert) inode[ipos];
-                                }
-                            }
-                        }
+            while (ik != null)
+            {
+                int find = exponentialSearchWithUpperBound(comparator, unode, upos, usz, uub, ik);
+                int c = searchResultToComparison(find);
+                if (find < 0)
+                    find = -1 - find;
 
-                        // once here we have either exhausted inode, or inode[ipos] > unode[usz-1]
-                        // if inode is not exhausted, we want to insert its contents up to unode's upper bound (uub)
-                        if (ipos < isz)
-                        {
-                            int ilim = exponentialSearchForMaybeInfinity(comparator, inode, ipos, isz, uub);
-                            if (ilim != -(1 + ipos))
-                            {
-                                if (ilim < 0) ilim = -(1 + ilim);
-                                leaf().copy(inode, ipos, ilim - ipos, updateF);
-                                ipos = ilim;
-                            }
-                        }
+                if (find > usz)
+                    break;  // nothing else needs to be inserted in this branch
+                if (find > upos)
+                    builder.copyPreceding(unode, usz, upos, find - upos);
 
-                        // if inode is _still_ not exhausted, we've reached uub
-                        // so finish unode, and move to its parent
-                        if (ipos < isz)
-                        {
-                            if (null == (level = branch = finish(level, unode, usz)))
-                                break;
-                            unode = update.node(); upos = update.position(); uub = (Existing) update.upperBound(); usz = shallowSizeOfBranch(unode);
-                        }
-                        else if (insert.ascendToUnfinishedParent())
-                        {
-                            // if inode has been exhausted, move it forwards
-                            inode = insert.node(); ipos = insert.position(); isz = shallowSizeOfBranch(inode);
-                        }
-                        else break;
-                    }
-                    else
-                    {
-                        // {@code insert} is a branch: perform a linear merge between the leaf and this one branch key
-                        Insert ik = (Insert) inode[ipos];
-                        int find = exponentialSearchWithUpperBound(comparator, unode, upos, usz, uub, ik);
-                        boolean isInRange = find != -(2 + usz);
-                        boolean found = find >= 0;
-                        if (find < 0) find = isInRange ? -(1 + find) : usz;
-
-                        if (upos < find)
-                        {
-                            copy(unode, upos, find - upos);
-                            upos = find;
-                        }
+                final Existing nextUKey = find < usz ? (Existing) unode[find] : uub;
+                final Object[] childUNode = (Object[]) unode[find + usz];
 
-                        if (isInRange)
-                        {
-                            leaf().addKey(found ? updateF.apply((Existing) unode[upos++], ik)
-                                                : isSimple(updateF)
-                                                  ? ik
-                                                  : updateF.apply(ik));
-
-                            // We are now immediately after ik; since inode is a branch, the next is in a leaf node
-                            insert.descendIntoNextLeaf(inode, ipos, isz);
-                            inode = insert.node(); ipos = insert.position(); isz = sizeOfLeaf(inode);
-                        }
-                        else
-                        {
-                            if (null == (level = branch = finish(level, unode, usz)))
-                                break;
-                            unode = update.node(); upos = update.position(); uub = (Existing) update.upperBound(); usz = shallowSizeOfBranch(unode);
-                        }
-                    }
+                // process next child
+                if (c < 0)
+                {
+                    // ik fall inside it -- recursively merge the child with the update, using next key as an upper bound
+                    LeafOrBranchBuilder childBuilder = builder.child;
+                    ik = updateRecursive(ik, childUNode, nextUKey, childBuilder);
+                    childBuilder.drainAndPropagate(childUNode, shallowSizeOfBranch(childUNode), builder);
+                    if (find == usz)    // this was the right-most child, branch is complete and we can return immediately
+                        return ik;
+                    c = ik != null ? comparator.compare(nextUKey, ik) : -1;
                 }
                 else
+                    builder.addChild(childUNode);
+
+                // process next key
+                if (c == 0)
                 {
-                    // we don't care if the new node is a leaf or branch we just want to decide where we go in the old tree
-                    Insert ik = (Insert) inode[ipos];
-
-                    // invariants:
-                    //      upos >= -1, upos < usz
-                    //      branchBuffers[>bufferIndex] must always end on a key (i.e. have as many children as keys)
-                    //                                  because on ascent they will have a child to add
-                    //      branchBuffers[bufferIndex]  must end _before_ a key, since we enter this loop on a key to handle
-                    //      ukey[0..upos) and uchild[0..upos] have been merged into branchBuffers[branchIndex]
-                    // i.e. we have handled all data in unode prior to unode[upos], the current key
-
-                    int find = exponentialSearchWithUpperBound(comparator, unode, max(0, upos), usz, uub, ik);
-                    boolean found = find >= 0;
-                    if (find < 0) find = -(2 + find);
-                    boolean isInRange = find < usz;
-
-                    // post-conditions:
-                    //           find  >= -1
-                    //      ukey[find] == ik or ukey[find+1] > uk
-                    // ->   ukey[find] <= ik
-                    //    uchild[find]  < ik
-
-                    if (upos < find)
-                    {
-                        branch.copy(unode, usz, upos, find - upos);
-                        upos = find;
-                    }
+                    // ik matches next key
+                    builder.addKey(updateF.apply(nextUKey, ik));
+                    ik = insert.next();
+                }
+                else
+                    builder.addKey(nextUKey);
 
-                    // given: ukey[find] <= ik
-                    // if find is a valid key, we copy it, possibly merging with ik
-                    if (found)
-                    {
-                        // we have an exact match, so merge the key, and we now have as many children as keys
-                        branch.addKey(updateF.apply((Existing) unode[upos], ik));
-                    }
-                    else if (upos >= 0 && upos < usz)
-                    {
-                        // otherwise ik occurs _after_ unode[upos] so we copy it unadulterated
-                        // (but may need to descend into its right child)
-                        branch.addKey(unode[upos]);
-                    }
-                    //      ukey[0..upos+1) and uchild[0..upos+1) have been dealt with
+                upos = find + 1;
+            }
+            // copy the rest of the branch and exit
+            if (upos <= usz)
+            {
+                builder.copyPreceding(unode, usz, upos, usz - upos);
+                builder.addChild((Object[]) unode[usz + usz]);
+            }
 
-                    boolean noMoreToInsert = false; // exitLoop
-                    boolean descend = isInRange; // assume !found; update in found block otherwise
-                    // now we need to restore our invariants
-                    if (found)
-                    {
-                        // we have consumed ik and uk; first walk inode forwards
-                        if (!isLeaf(inode))
-                        {
-                            insert.descendIntoNextLeaf(inode, ipos, isz);
-                            inode = insert.node(); ipos = insert.position(); isz = sizeOfLeaf(inode);
-                        }
-                        else if (++ipos == isz && !(noMoreToInsert = !insert.ascendToUnfinishedParent()))
-                        {
-                            inode = insert.node(); ipos = insert.position(); isz = shallowSizeOfBranch(inode);
-                        }
+            return ik;
+        }
 
-                        // now we have to restore our invariant, as we last copied a key,
-                        // so we need to either descend or copy the next child
-
-                        // compare ik to the next key:
-                        //   if doneInsert then we have no ik, so treat it as infinity;
-                        //   if exhausted unode's keys, compare with uub for final child
-                        Existing ub = 1 + upos >= usz ? uub : (Existing) unode[1 + upos];
-                        int c = noMoreToInsert ? 1 : compareWithMaybeInfinity(comparator, (Compare) inode[ipos], ub);
-                        descend = c < 0; // descend if ik occurs in the range owned by the child
-                        if (!descend) // otherwise copy the child and continue the loop
-                            branch.addChild((Object[]) unode[usz + ++upos]);
-                    }
+        private Insert updateRecursive(Insert ik, Object[] unode, Existing uub, LeafBuilder builder)
+        {
+            int upos = 0;
+            int usz = sizeOfLeaf(unode);
+            Existing uk = (Existing) unode[upos];
+            int c = comparator.compare(uk, ik);
 
-                    if (descend)
-                    {
-                        // ik occurs in the child, so descend
-                        // if we're descending, we _want_ to have a dangling child that we populate on ascent
-                        update.descendIntoNextChild(unode, upos, usz);
-                        unode = update.node(); upos = update.position(); uub = (Existing) update.upperBound(); usz = shallowSize(unode);
-                        level = level.child;
-                        branch = level == leaf() ? null : (BranchBuilder) level;
-                    }
-                    else if (upos >= usz)
+            while (true)
+            {
+                if (c == 0)
+                {
+                    leaf().addKey(updateF.apply(uk, ik));
+                    if (++upos < usz)
+                        uk = (Existing) unode[upos];
+                    ik = insert.next();
+                    if (ik == null)
                     {
-                        // -> we're done with unode
-                        if (null == (level = branch = finish(level, unode, usz)))
-                            break;
-                        unode = update.node(); upos = update.position(); uub = (Existing) update.upperBound(); usz = shallowSize(unode);
+                        builder.copy(unode, upos, usz - upos);
+                        return null;
                     }
-
-                    if (noMoreToInsert)
+                    if (upos == usz)
                         break;
+                    c = comparator.compare(uk, ik);
                 }
-            }
-
-            if (level != null)
-            {
-                // finish copying each in-progress node upto but excluding root unode
-                // NOTE: we stop maintaining uub at this point, as we no longer need it
-                while (true)
+                else if (c < 0)
                 {
-                    if (level == leaf()) leaf().copy(unode, upos, usz - upos);
-                    else branch.copy(unode, usz, upos, usz - upos);
-
-                    if (null == (level = branch = finish(level, unode, usz)))
+                    int ulim = exponentialSearch(comparator, unode, upos + 1, usz, ik);
+                    c = -searchResultToComparison(ulim); // 0 if match, 1 otherwise
+                    if (ulim < 0)
+                        ulim = -(1 + ulim);
+                    builder.copy(unode, upos, ulim - upos);
+                    if ((upos = ulim) == usz)
                         break;
-
-                    unode = update.node(); upos = update.position(); usz = shallowSizeOfBranch(unode);
+                    uk = (Existing) unode[upos];
+                }
+                else
+                {
+                    builder.addKey(updateF.apply(ik));
+                    c = insert.copyKeysSmallerThan(uk, comparator, builder, updateF); // 0 on match, -1 otherwise

Review Comment:
   Removed



##########
src/java/org/apache/cassandra/utils/btree/BTree.java:
##########
@@ -4212,63 +4060,101 @@ boolean ascendToParent()
                 return false;
             return --depth >= 0;
         }
-
-        boolean isRoot()
-        {
-            return depth == 0;
-        }
     }
 
-    /**
-     * Like {@code SimpleTreeIterator} except we may not visit every node.
-     * When we first enter branch, we begin before its first key to consider if we should descend into the first child.
-     * Should only be used on non-empty trees.
-     */
-    private static final class SimpleTreeNavigator extends SimpleTreeStack
+
+    private static class SimpleTreeKeysIterator<Compare, Insert extends Compare>
     {
-        Object[] upperBounds;
+        int leafSize;
+        int leafPos;
+        Object[] leaf;
+        Object[][] nodes;
+        int[] positions;
+        int depth;
 
         void init(Object[] tree)
         {
-            int height = height(tree);
-            leafDepth = height - 1;
-            if (positions == null || height >= positions.length)
+            int maxHeight = maxRootHeight(size(tree));
+            if (positions == null || maxHeight >= positions.length)
             {
-                positions = new int[height];
-                nodes = new Object[height][];
-                upperBounds = new Object[height];
+                positions = new int[maxHeight + 1];
+                nodes = new Object[maxHeight + 1][];
             }
-            nodes[this.depth = 0] = tree;
-            positions[0] = leafDepth == 0 ? 0 : -1;
+            depth = 0;
+            descendToLeaf(tree);
         }
 
-        void descendIntoNextChild(Object[] parent, int pos, int sz)
+        void reset()
         {
-            positions[depth] = ++pos;
-            ++depth;
-            Object[] child = (Object[]) parent[sz + pos];
-            nodes[depth] = child;
-            upperBounds[depth] = pos < sz ? parent[pos] : upperBounds[depth - 1];
-            positions[depth] = depth == leafDepth ? 0 : -1;
+            leaf = null;
+            Arrays.fill(nodes, 0, nodes.length, null);
         }
 
-        void reset()
+        Insert next()
         {
-            super.reset();
-            Arrays.fill(upperBounds, 0, leafDepth + 1, null);
+            if (leafPos < leafSize) // fast path
+                return (Insert) leaf[leafPos++];
+
+            if (depth == 0)
+                return null;
+
+            Object[] node = nodes[depth - 1];
+            final int position = positions[depth - 1];
+            Insert result = (Insert) node[position];
+            advanceBranch(node, position + 1);
+            return result;
         }
 
-        boolean ascendToParent()
+        private void advanceBranch(Object[] node, int position)
         {
-            if (depth == 0)
-                return false;
-            --depth;
-            return true;
+            int count = shallowSizeOfBranch(node);
+            if (position < count)
+                positions[depth - 1] = position;
+            else
+                --depth;  // no more children in this branch, remove from stack
+            descendToLeaf((Object[]) node[count + position]);
         }
 
-        Object upperBound()
+        void descendToLeaf(Object[] node)
         {
-            return upperBounds[depth];
+            while (!isLeaf(node))
+            {
+                nodes[depth] = node;
+                positions[depth] = 0;
+                node = (Object[]) node[shallowSizeOfBranch(node)];
+                ++depth;
+            }
+            leaf = node;
+            leafPos = 0;
+            leafSize = sizeOfLeaf(node);
+        }
+
+        <Update> int copyKeysSmallerThan(Compare bound, Comparator<? super Compare> comparator, LeafBuilder builder, UpdateFunction<Insert, Update> transformer)
+        {
+            while (true)
+            {
+                int lim = exponentialSearchForMaybeInfinity(comparator, leaf, leafPos, leafSize, bound);
+                int end = lim >= 0 ? lim : -1 - lim;
+                if (end > leafPos)
+                {
+                    builder.copy(leaf, leafPos, end - leafPos, transformer);
+                    leafPos = end;
+                }
+                if (end < leafSize)
+                    return searchResultToComparison(lim);    // 0 if next is a match for bound, -1 otherwise
+
+                if (depth == 0)
+                    return -1;
+
+                Object[] node = nodes[depth - 1];
+                final int position = positions[depth - 1];
+                Insert branchKey = (Insert) node[position];
+                int cmp = compareWithMaybeInfinity(comparator, branchKey, bound);
+                if (cmp >= 0)
+                    return -cmp;
+                builder.addKey(transformer.apply(branchKey));

Review Comment:
   Done



##########
src/java/org/apache/cassandra/utils/btree/BTree.java:
##########
@@ -3405,301 +3396,161 @@ private void clearBranchBuffer(Object[] array)
 
         /**
          * Precondition: {@code update} should not be empty.
-         *
+         * <p>
          * Inserts {@code insert} into {@code update}, after applying {@code updateF} to each item, or matched item pairs.
          */
         Object[] update(Object[] update, Object[] insert, Comparator<? super Compare> comparator, UpdateFunction<Insert, Existing> updateF)
         {
-            this.update.init(update);
             this.insert.init(insert);
+            this.updateF = updateF;
+            this.comparator = comparator;
             this.allocated = isSimple(updateF) ? -1 : 0;
-            return update(comparator, updateF);
+            int leafDepth = BTree.depth(update) - 1;
+            LeafOrBranchBuilder builder = leaf();
+            for (int i = 0; i < leafDepth; ++i)
+                builder = builder.ensureParent();
+
+            Insert ik = this.insert.next();
+            ik = updateRecursive(ik, update, null, builder);
+            assert ik == null;
+            Object[] result = completeBuild(builder);
+
+            if (allocated > 0)
+                updateF.onAllocated(allocated);
+
+            return result;
         }
 
         /**
-         * We base our operation on the shape of {@code update}, trying to steal as much of the original tree as
-         * possible for our new tree
+         * Merge a BTree recursively with the contents of {@code insert} up to the given upper bound.
+         *
+         * @param ik The next key from the inserted data.
+         * @param unode The source branch to update.
+         * @param uub The branch's upper bound
+         * @param builder The builder that will receive the data. It needs to be at the same level of the hierarchy
+         *                as the source unode.
+         * @return The next key from the inserted data, >= uub.
          */
-        private Object[] update(Comparator<? super Compare> comparator, UpdateFunction<Insert, Existing> updateF)
-        {
-            Existing uub = null;
-            Object[] unode = update.node(), inode = insert.node();
-            int upos = update.position(), usz = shallowSize(unode);
-            int ipos = insert.position(), isz = shallowSize(inode);
-
-            LeafOrBranchBuilder level = leaf(); // once level == null, we're done
-            BranchBuilder branch = null; // branch == null if level == leaf(); this is unfortunate, but we need to invoke branch-specific methods
-            for (int i = 0; i < update.leafDepth; ++i)
-                level = branch = level.ensureParent();
-
-            // define:      xnode[xsz] == xub (i.e. the key to our right in our parent can be treated below as a key past the end of the node)
-            // invariants:  iub > unode[upos], uub > inode[ipos]
-            // invariants:  ipos >=  0 && ipos  < isz (imposed by SimpleTreeIterator)
-            // invariants:  upos >= -1 && upos <= usz
-            while (true)
-            {
-                if (level == leaf())
-                {
-                    if (isLeaf(inode))
-                    {
-                        // Both are leaves, so can simply linearly merge them until we reach the update upper bound.
-                        // This is most easily achieved by looping until one of the leaves is consumed, then
-                        // continuing to consume inode until we hit uub
+        private Insert updateRecursive(Insert ik, Object[] unode, Existing uub, LeafOrBranchBuilder builder)
+        {
+            return builder == leaf()
+                   ? updateRecursive(ik, unode, uub, (LeafBuilder) builder)
+                   : updateRecursive(ik, unode, uub, (BranchBuilder) builder);
+        }
 
-                        // Note that unode can already be exhausted, because we use the original tree to define the
-                        // shape of our updated tree and so we retain the _bounds_ of each sub-tree, only creating
-                        // new bounds during overflow, so once we have copied a leaf we continue inserting elements
+        private Insert updateRecursive(Insert ik, Object[] unode, Existing uub, BranchBuilder builder)
+        {
+            int upos = 0;
+            int usz = shallowSizeOfBranch(unode);
 
-                        Insert ik =  (Insert) inode[ipos];
-                        if (upos < usz)
-                        {
-                            Existing uk = (Existing) unode[upos];
-                            int c = comparator.compare(uk, ik);
-                            while (true)
-                            {
-                                if (c == 0)
-                                {
-                                    leaf().addKey(updateF.apply(uk, ik));
-                                    if (++upos < usz)
-                                        uk = (Existing) unode[upos];
-                                    if (++ipos < isz)
-                                        ik = (Insert) inode[ipos];
-                                    if (upos == usz || ipos == isz)
-                                        break;
-                                    c = comparator.compare(uk, ik);
-                                }
-                                else if (c < 0)
-                                {
-                                    int ulim = exponentialSearch(comparator, unode, upos + 1, usz, ik);
-                                    c = ulim < 0 ? 1 : 0; // positive or zero
-                                    if (ulim < 0) ulim = -(1 + ulim);
-                                    leaf().copy(unode, upos, ulim - upos);
-                                    if ((upos = ulim) == usz)
-                                        break;
-                                    uk = (Existing) unode[upos];
-                                }
-                                else
-                                {
-                                    int ilim = exponentialSearch(comparator, inode, ipos + 1, isz, uk);
-                                    c = ilim & 0x80000000; // negative or zero
-                                    if (ilim < 0) ilim = -(1 + ilim);
-                                    leaf().copy(inode, ipos, ilim - ipos, updateF);
-                                    if ((ipos = ilim) == isz)
-                                        break;
-                                    ik = (Insert) inode[ipos];
-                                }
-                            }
-                        }
+            while (ik != null)
+            {
+                int find = exponentialSearchWithUpperBound(comparator, unode, upos, usz, uub, ik);
+                int c = searchResultToComparison(find);
+                if (find < 0)
+                    find = -1 - find;
 
-                        // once here we have either exhausted inode, or inode[ipos] > unode[usz-1]
-                        // if inode is not exhausted, we want to insert its contents up to unode's upper bound (uub)
-                        if (ipos < isz)
-                        {
-                            int ilim = exponentialSearchForMaybeInfinity(comparator, inode, ipos, isz, uub);
-                            if (ilim != -(1 + ipos))
-                            {
-                                if (ilim < 0) ilim = -(1 + ilim);
-                                leaf().copy(inode, ipos, ilim - ipos, updateF);
-                                ipos = ilim;
-                            }
-                        }
+                if (find > usz)
+                    break;  // nothing else needs to be inserted in this branch
+                if (find > upos)
+                    builder.copyPreceding(unode, usz, upos, find - upos);
 
-                        // if inode is _still_ not exhausted, we've reached uub
-                        // so finish unode, and move to its parent
-                        if (ipos < isz)
-                        {
-                            if (null == (level = branch = finish(level, unode, usz)))
-                                break;
-                            unode = update.node(); upos = update.position(); uub = (Existing) update.upperBound(); usz = shallowSizeOfBranch(unode);
-                        }
-                        else if (insert.ascendToUnfinishedParent())
-                        {
-                            // if inode has been exhausted, move it forwards
-                            inode = insert.node(); ipos = insert.position(); isz = shallowSizeOfBranch(inode);
-                        }
-                        else break;
-                    }
-                    else
-                    {
-                        // {@code insert} is a branch: perform a linear merge between the leaf and this one branch key
-                        Insert ik = (Insert) inode[ipos];
-                        int find = exponentialSearchWithUpperBound(comparator, unode, upos, usz, uub, ik);
-                        boolean isInRange = find != -(2 + usz);
-                        boolean found = find >= 0;
-                        if (find < 0) find = isInRange ? -(1 + find) : usz;
-
-                        if (upos < find)
-                        {
-                            copy(unode, upos, find - upos);
-                            upos = find;
-                        }
+                final Existing nextUKey = find < usz ? (Existing) unode[find] : uub;
+                final Object[] childUNode = (Object[]) unode[find + usz];
 
-                        if (isInRange)
-                        {
-                            leaf().addKey(found ? updateF.apply((Existing) unode[upos++], ik)
-                                                : isSimple(updateF)
-                                                  ? ik
-                                                  : updateF.apply(ik));
-
-                            // We are now immediately after ik; since inode is a branch, the next is in a leaf node
-                            insert.descendIntoNextLeaf(inode, ipos, isz);
-                            inode = insert.node(); ipos = insert.position(); isz = sizeOfLeaf(inode);
-                        }
-                        else
-                        {
-                            if (null == (level = branch = finish(level, unode, usz)))
-                                break;
-                            unode = update.node(); upos = update.position(); uub = (Existing) update.upperBound(); usz = shallowSizeOfBranch(unode);
-                        }
-                    }
+                // process next child
+                if (c < 0)
+                {
+                    // ik fall inside it -- recursively merge the child with the update, using next key as an upper bound
+                    LeafOrBranchBuilder childBuilder = builder.child;
+                    ik = updateRecursive(ik, childUNode, nextUKey, childBuilder);
+                    childBuilder.drainAndPropagate(childUNode, shallowSizeOfBranch(childUNode), builder);
+                    if (find == usz)    // this was the right-most child, branch is complete and we can return immediately
+                        return ik;
+                    c = ik != null ? comparator.compare(nextUKey, ik) : -1;
                 }
                 else
+                    builder.addChild(childUNode);
+
+                // process next key
+                if (c == 0)
                 {
-                    // we don't care if the new node is a leaf or branch we just want to decide where we go in the old tree
-                    Insert ik = (Insert) inode[ipos];
-
-                    // invariants:
-                    //      upos >= -1, upos < usz
-                    //      branchBuffers[>bufferIndex] must always end on a key (i.e. have as many children as keys)
-                    //                                  because on ascent they will have a child to add
-                    //      branchBuffers[bufferIndex]  must end _before_ a key, since we enter this loop on a key to handle
-                    //      ukey[0..upos) and uchild[0..upos] have been merged into branchBuffers[branchIndex]
-                    // i.e. we have handled all data in unode prior to unode[upos], the current key
-
-                    int find = exponentialSearchWithUpperBound(comparator, unode, max(0, upos), usz, uub, ik);
-                    boolean found = find >= 0;
-                    if (find < 0) find = -(2 + find);
-                    boolean isInRange = find < usz;
-
-                    // post-conditions:
-                    //           find  >= -1
-                    //      ukey[find] == ik or ukey[find+1] > uk
-                    // ->   ukey[find] <= ik
-                    //    uchild[find]  < ik
-
-                    if (upos < find)
-                    {
-                        branch.copy(unode, usz, upos, find - upos);
-                        upos = find;
-                    }
+                    // ik matches next key
+                    builder.addKey(updateF.apply(nextUKey, ik));
+                    ik = insert.next();
+                }
+                else
+                    builder.addKey(nextUKey);
 
-                    // given: ukey[find] <= ik
-                    // if find is a valid key, we copy it, possibly merging with ik
-                    if (found)
-                    {
-                        // we have an exact match, so merge the key, and we now have as many children as keys
-                        branch.addKey(updateF.apply((Existing) unode[upos], ik));
-                    }
-                    else if (upos >= 0 && upos < usz)
-                    {
-                        // otherwise ik occurs _after_ unode[upos] so we copy it unadulterated
-                        // (but may need to descend into its right child)
-                        branch.addKey(unode[upos]);
-                    }
-                    //      ukey[0..upos+1) and uchild[0..upos+1) have been dealt with
+                upos = find + 1;
+            }
+            // copy the rest of the branch and exit
+            if (upos <= usz)
+            {
+                builder.copyPreceding(unode, usz, upos, usz - upos);
+                builder.addChild((Object[]) unode[usz + usz]);
+            }
 
-                    boolean noMoreToInsert = false; // exitLoop
-                    boolean descend = isInRange; // assume !found; update in found block otherwise
-                    // now we need to restore our invariants
-                    if (found)
-                    {
-                        // we have consumed ik and uk; first walk inode forwards
-                        if (!isLeaf(inode))
-                        {
-                            insert.descendIntoNextLeaf(inode, ipos, isz);
-                            inode = insert.node(); ipos = insert.position(); isz = sizeOfLeaf(inode);
-                        }
-                        else if (++ipos == isz && !(noMoreToInsert = !insert.ascendToUnfinishedParent()))
-                        {
-                            inode = insert.node(); ipos = insert.position(); isz = shallowSizeOfBranch(inode);
-                        }
+            return ik;
+        }
 
-                        // now we have to restore our invariant, as we last copied a key,
-                        // so we need to either descend or copy the next child
-
-                        // compare ik to the next key:
-                        //   if doneInsert then we have no ik, so treat it as infinity;
-                        //   if exhausted unode's keys, compare with uub for final child
-                        Existing ub = 1 + upos >= usz ? uub : (Existing) unode[1 + upos];
-                        int c = noMoreToInsert ? 1 : compareWithMaybeInfinity(comparator, (Compare) inode[ipos], ub);
-                        descend = c < 0; // descend if ik occurs in the range owned by the child
-                        if (!descend) // otherwise copy the child and continue the loop
-                            branch.addChild((Object[]) unode[usz + ++upos]);
-                    }
+        private Insert updateRecursive(Insert ik, Object[] unode, Existing uub, LeafBuilder builder)
+        {
+            int upos = 0;
+            int usz = sizeOfLeaf(unode);
+            Existing uk = (Existing) unode[upos];
+            int c = comparator.compare(uk, ik);
 
-                    if (descend)
-                    {
-                        // ik occurs in the child, so descend
-                        // if we're descending, we _want_ to have a dangling child that we populate on ascent
-                        update.descendIntoNextChild(unode, upos, usz);
-                        unode = update.node(); upos = update.position(); uub = (Existing) update.upperBound(); usz = shallowSize(unode);
-                        level = level.child;
-                        branch = level == leaf() ? null : (BranchBuilder) level;
-                    }
-                    else if (upos >= usz)
+            while (true)
+            {
+                if (c == 0)
+                {
+                    leaf().addKey(updateF.apply(uk, ik));
+                    if (++upos < usz)
+                        uk = (Existing) unode[upos];
+                    ik = insert.next();
+                    if (ik == null)
                     {
-                        // -> we're done with unode
-                        if (null == (level = branch = finish(level, unode, usz)))
-                            break;
-                        unode = update.node(); upos = update.position(); uub = (Existing) update.upperBound(); usz = shallowSize(unode);
+                        builder.copy(unode, upos, usz - upos);
+                        return null;
                     }
-
-                    if (noMoreToInsert)
+                    if (upos == usz)
                         break;
+                    c = comparator.compare(uk, ik);
                 }
-            }
-
-            if (level != null)
-            {
-                // finish copying each in-progress node upto but excluding root unode
-                // NOTE: we stop maintaining uub at this point, as we no longer need it
-                while (true)
+                else if (c < 0)
                 {
-                    if (level == leaf()) leaf().copy(unode, upos, usz - upos);
-                    else branch.copy(unode, usz, upos, usz - upos);
-
-                    if (null == (level = branch = finish(level, unode, usz)))
+                    int ulim = exponentialSearch(comparator, unode, upos + 1, usz, ik);
+                    c = -searchResultToComparison(ulim); // 0 if match, 1 otherwise
+                    if (ulim < 0)
+                        ulim = -(1 + ulim);
+                    builder.copy(unode, upos, ulim - upos);
+                    if ((upos = ulim) == usz)
                         break;
-
-                    unode = update.node(); upos = update.position(); usz = shallowSizeOfBranch(unode);
+                    uk = (Existing) unode[upos];
+                }
+                else
+                {
+                    builder.addKey(updateF.apply(ik));

Review Comment:
   Added



##########
src/java/org/apache/cassandra/utils/btree/BTree.java:
##########
@@ -3405,301 +3396,161 @@ private void clearBranchBuffer(Object[] array)
 
         /**
          * Precondition: {@code update} should not be empty.
-         *
+         * <p>
          * Inserts {@code insert} into {@code update}, after applying {@code updateF} to each item, or matched item pairs.
          */
         Object[] update(Object[] update, Object[] insert, Comparator<? super Compare> comparator, UpdateFunction<Insert, Existing> updateF)
         {
-            this.update.init(update);
             this.insert.init(insert);
+            this.updateF = updateF;
+            this.comparator = comparator;
             this.allocated = isSimple(updateF) ? -1 : 0;
-            return update(comparator, updateF);
+            int leafDepth = BTree.depth(update) - 1;
+            LeafOrBranchBuilder builder = leaf();
+            for (int i = 0; i < leafDepth; ++i)
+                builder = builder.ensureParent();
+
+            Insert ik = this.insert.next();
+            ik = updateRecursive(ik, update, null, builder);
+            assert ik == null;
+            Object[] result = completeBuild(builder);
+
+            if (allocated > 0)
+                updateF.onAllocated(allocated);
+
+            return result;
         }
 
         /**
-         * We base our operation on the shape of {@code update}, trying to steal as much of the original tree as
-         * possible for our new tree
+         * Merge a BTree recursively with the contents of {@code insert} up to the given upper bound.
+         *
+         * @param ik The next key from the inserted data.
+         * @param unode The source branch to update.
+         * @param uub The branch's upper bound
+         * @param builder The builder that will receive the data. It needs to be at the same level of the hierarchy
+         *                as the source unode.
+         * @return The next key from the inserted data, >= uub.
          */
-        private Object[] update(Comparator<? super Compare> comparator, UpdateFunction<Insert, Existing> updateF)
-        {
-            Existing uub = null;
-            Object[] unode = update.node(), inode = insert.node();
-            int upos = update.position(), usz = shallowSize(unode);
-            int ipos = insert.position(), isz = shallowSize(inode);
-
-            LeafOrBranchBuilder level = leaf(); // once level == null, we're done
-            BranchBuilder branch = null; // branch == null if level == leaf(); this is unfortunate, but we need to invoke branch-specific methods
-            for (int i = 0; i < update.leafDepth; ++i)
-                level = branch = level.ensureParent();
-
-            // define:      xnode[xsz] == xub (i.e. the key to our right in our parent can be treated below as a key past the end of the node)
-            // invariants:  iub > unode[upos], uub > inode[ipos]
-            // invariants:  ipos >=  0 && ipos  < isz (imposed by SimpleTreeIterator)
-            // invariants:  upos >= -1 && upos <= usz
-            while (true)
-            {
-                if (level == leaf())
-                {
-                    if (isLeaf(inode))
-                    {
-                        // Both are leaves, so can simply linearly merge them until we reach the update upper bound.
-                        // This is most easily achieved by looping until one of the leaves is consumed, then
-                        // continuing to consume inode until we hit uub
+        private Insert updateRecursive(Insert ik, Object[] unode, Existing uub, LeafOrBranchBuilder builder)
+        {
+            return builder == leaf()
+                   ? updateRecursive(ik, unode, uub, (LeafBuilder) builder)
+                   : updateRecursive(ik, unode, uub, (BranchBuilder) builder);
+        }
 
-                        // Note that unode can already be exhausted, because we use the original tree to define the
-                        // shape of our updated tree and so we retain the _bounds_ of each sub-tree, only creating
-                        // new bounds during overflow, so once we have copied a leaf we continue inserting elements
+        private Insert updateRecursive(Insert ik, Object[] unode, Existing uub, BranchBuilder builder)
+        {
+            int upos = 0;
+            int usz = shallowSizeOfBranch(unode);
 
-                        Insert ik =  (Insert) inode[ipos];
-                        if (upos < usz)
-                        {
-                            Existing uk = (Existing) unode[upos];
-                            int c = comparator.compare(uk, ik);
-                            while (true)
-                            {
-                                if (c == 0)
-                                {
-                                    leaf().addKey(updateF.apply(uk, ik));
-                                    if (++upos < usz)
-                                        uk = (Existing) unode[upos];
-                                    if (++ipos < isz)
-                                        ik = (Insert) inode[ipos];
-                                    if (upos == usz || ipos == isz)
-                                        break;
-                                    c = comparator.compare(uk, ik);
-                                }
-                                else if (c < 0)
-                                {
-                                    int ulim = exponentialSearch(comparator, unode, upos + 1, usz, ik);
-                                    c = ulim < 0 ? 1 : 0; // positive or zero
-                                    if (ulim < 0) ulim = -(1 + ulim);
-                                    leaf().copy(unode, upos, ulim - upos);
-                                    if ((upos = ulim) == usz)
-                                        break;
-                                    uk = (Existing) unode[upos];
-                                }
-                                else
-                                {
-                                    int ilim = exponentialSearch(comparator, inode, ipos + 1, isz, uk);
-                                    c = ilim & 0x80000000; // negative or zero
-                                    if (ilim < 0) ilim = -(1 + ilim);
-                                    leaf().copy(inode, ipos, ilim - ipos, updateF);
-                                    if ((ipos = ilim) == isz)
-                                        break;
-                                    ik = (Insert) inode[ipos];
-                                }
-                            }
-                        }
+            while (ik != null)
+            {
+                int find = exponentialSearchWithUpperBound(comparator, unode, upos, usz, uub, ik);
+                int c = searchResultToComparison(find);
+                if (find < 0)
+                    find = -1 - find;
 
-                        // once here we have either exhausted inode, or inode[ipos] > unode[usz-1]
-                        // if inode is not exhausted, we want to insert its contents up to unode's upper bound (uub)
-                        if (ipos < isz)
-                        {
-                            int ilim = exponentialSearchForMaybeInfinity(comparator, inode, ipos, isz, uub);
-                            if (ilim != -(1 + ipos))
-                            {
-                                if (ilim < 0) ilim = -(1 + ilim);
-                                leaf().copy(inode, ipos, ilim - ipos, updateF);
-                                ipos = ilim;
-                            }
-                        }
+                if (find > usz)
+                    break;  // nothing else needs to be inserted in this branch
+                if (find > upos)
+                    builder.copyPreceding(unode, usz, upos, find - upos);
 
-                        // if inode is _still_ not exhausted, we've reached uub
-                        // so finish unode, and move to its parent
-                        if (ipos < isz)
-                        {
-                            if (null == (level = branch = finish(level, unode, usz)))
-                                break;
-                            unode = update.node(); upos = update.position(); uub = (Existing) update.upperBound(); usz = shallowSizeOfBranch(unode);
-                        }
-                        else if (insert.ascendToUnfinishedParent())
-                        {
-                            // if inode has been exhausted, move it forwards
-                            inode = insert.node(); ipos = insert.position(); isz = shallowSizeOfBranch(inode);
-                        }
-                        else break;
-                    }
-                    else
-                    {
-                        // {@code insert} is a branch: perform a linear merge between the leaf and this one branch key
-                        Insert ik = (Insert) inode[ipos];
-                        int find = exponentialSearchWithUpperBound(comparator, unode, upos, usz, uub, ik);
-                        boolean isInRange = find != -(2 + usz);
-                        boolean found = find >= 0;
-                        if (find < 0) find = isInRange ? -(1 + find) : usz;
-
-                        if (upos < find)
-                        {
-                            copy(unode, upos, find - upos);
-                            upos = find;
-                        }
+                final Existing nextUKey = find < usz ? (Existing) unode[find] : uub;
+                final Object[] childUNode = (Object[]) unode[find + usz];
 
-                        if (isInRange)
-                        {
-                            leaf().addKey(found ? updateF.apply((Existing) unode[upos++], ik)
-                                                : isSimple(updateF)
-                                                  ? ik
-                                                  : updateF.apply(ik));
-
-                            // We are now immediately after ik; since inode is a branch, the next is in a leaf node
-                            insert.descendIntoNextLeaf(inode, ipos, isz);
-                            inode = insert.node(); ipos = insert.position(); isz = sizeOfLeaf(inode);
-                        }
-                        else
-                        {
-                            if (null == (level = branch = finish(level, unode, usz)))
-                                break;
-                            unode = update.node(); upos = update.position(); uub = (Existing) update.upperBound(); usz = shallowSizeOfBranch(unode);
-                        }
-                    }
+                // process next child
+                if (c < 0)
+                {
+                    // ik fall inside it -- recursively merge the child with the update, using next key as an upper bound
+                    LeafOrBranchBuilder childBuilder = builder.child;
+                    ik = updateRecursive(ik, childUNode, nextUKey, childBuilder);
+                    childBuilder.drainAndPropagate(childUNode, shallowSizeOfBranch(childUNode), builder);

Review Comment:
   Done



##########
src/java/org/apache/cassandra/utils/btree/BTree.java:
##########
@@ -3239,6 +3228,21 @@ final boolean producesOnlyDense()
         abstract void reset();
     }
 
+    public static Object[] completeBuild(LeafOrBranchBuilder level)

Review Comment:
   Makes sense. `update` calls it on a the pre-update root, which is not a leaf, so it goes to `LeafOrBranchBuilder`.



##########
src/java/org/apache/cassandra/utils/btree/BTree.java:
##########
@@ -182,35 +180,32 @@ public static Dir desc(boolean desc)
         // first calculate the minimum height needed for this size of tree
         int height = minHeight(size);
         assert height > 1;
+        assert height * BRANCH_SHIFT < 32;
 
         int denseChildSize = denseSize(height - 1);
+        // Divide the size by the child size + 1, adjusting size by +1 to compensate for not having an upper key on the
+        // last child and rounding up, i.e. (size + 1 + div - 1) / div == size / div + 1 where div = childSize + 1
         int childCount = size / (denseChildSize + 1) + 1;
 
         return buildMaximallyDense(source, childCount, size, height, updateF);
     }
 
     /**
-     * Build a tree containing some initial quantity of dense nodes, up to a single node of arbitrary size (between
-     * dense and sparse), and the remainder of nodes all sparse.
-     *
-     * This permits us to build any size of tree:
-     * 1) A root node can have any number of children, so for a given height we can build our smallest tree with two
-     *    sparse nodes.  For a given childCount we can vary up to a whole dense child's size with this scheme,
-     *    and any more difference would require a change in child count anyway. We have at most two sparse nodes.
-     * 2) An internal node of size > 1/2 max can be constructed in exactly the same way, the extra constraint only
-     *    imposed because we must have at least BRANCH_FACTOR/2 children.
-     * 3) A smaller internal node simply uses precisely BRANCH_FACTOR/2 children, and applies the same algorithm.
-     *    This permits up to all children to be sparse, if necessary.
+     * Build a tree containing only dense nodes except at most two on any level. This matches the structure that
+     * a FastBuilder would create, with some optimizations in constructing the dense nodes.
      *
-     * For branches just above the leaf level, we split more evenly, since there's no advantage to imbalance.
+     * We do this by repeatedly constructing fully dense children until we reach a threshold, chosen so that we would
+     * not be able to create another child with fully dense children and at least MIN_KEYS keys. After the threshold,
+     * the remainder may fit a single node, or is otherwise split roughly halfway to create one child with at least
+     * MIN_KEYS+1 fully dense children, and one that has at least MIN_KEYS-1 fully dense and up to two non-dense.
      */
     private static <C, I extends C, O extends C> Object[] buildMaximallyDense(BulkIterator<I> source,

Review Comment:
   Went the other way, as we had precedent in `buildLeafWithoutSizeTracking`.



##########
src/java/org/apache/cassandra/utils/btree/BTree.java:
##########
@@ -4186,12 +4040,6 @@ int init(Object[] tree)
             return leafDepth + 1;
         }
 
-        void advanceLeaf()
-        {
-            if (++positions[depth] == sizeOfLeaf(nodes[depth]))
-                ascendToUnfinishedParent();

Review Comment:
   Removed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org