You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "cloud-fan (via GitHub)" <gi...@apache.org> on 2023/02/22 11:25:10 UTC

[GitHub] [spark] cloud-fan commented on a diff in pull request #40121: [SPARK-42528] Optimize PercentileHeap

cloud-fan commented on code in PR #40121:
URL: https://github.com/apache/spark/pull/40121#discussion_r1114194028


##########
core/src/main/scala/org/apache/spark/util/collection/PercentileHeap.scala:
##########
@@ -20,97 +20,55 @@ package org.apache.spark.util.collection
 import scala.collection.mutable.PriorityQueue
 
 /**
- * PercentileHeap is designed to be used to quickly track the percentile of a group of numbers
- * that may contain duplicates. Inserting a new number has O(log n) time complexity and
- * determining the percentile has O(1) time complexity.
- * The basic idea is to maintain two heaps: a smallerHalf and a largerHalf. The smallerHalf
- * stores the smaller half of all numbers while the largerHalf stores the larger half.
- * The sizes of two heaps need to match the percentage each time when a new number is inserted so
- * that the ratio of their sizes is percentage to (1 - percentage). Therefore each time when
- * percentile() is called we check if the sizes of two heaps match the percentage. If they do,
- * we should return the average of the two top values of heaps. Otherwise we return the top of the
- * heap which exceeds its percentage.
+ * PercentileHeap tracks the percentile of a collection of numbers.
+ *
+ * Insertion is O(log n), Lookup is O(1).
+ *
+ * The implementation keeps two heaps: a bottom heap (`botHeap`) and a top heap (`topHeap`). The
+ * bottom heap stores all the numbers below the percentile and the top heap the ones above the

Review Comment:
   ```suggestion
    * bottom heap stores all the numbers below the percentile and the top heap stores the ones above the
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org