You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/23 06:47:24 UTC

[GitHub] [spark] wankunde opened a new pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

wankunde opened a new pull request #35619:
URL: https://github.com/apache/spark/pull/35619


   
   ### What changes were proposed in this pull request?
   
   Now we will sort the shuffle blocks twice in map side to find out the skewed blocks. 
   We can add a flag in `Utils.median(sizes: Array[Long])` to avoid sort the input array.
   
   ### Why are the changes needed?
   
   To avoid additional sort.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Existing UTs
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #35619:
URL: https://github.com/apache/spark/pull/35619#discussion_r815525460



##########
File path: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala
##########
@@ -265,7 +265,7 @@ private[spark] object HighlyCompressedMapStatus {
     val threshold =
       if (accurateBlockSkewedFactor > 0) {
         val sortedSizes = uncompressedSizes.sorted
-        val medianSize: Long = Utils.median(sortedSizes)
+        val medianSize: Long = Utils.median(sortedSizes, true)

Review comment:
       ok, I am netural on it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wankunde commented on a change in pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

Posted by GitBox <gi...@apache.org>.
wankunde commented on a change in pull request #35619:
URL: https://github.com/apache/spark/pull/35619#discussion_r813581591



##########
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##########
@@ -3221,11 +3221,12 @@ private[spark] object Utils extends Logging {
    * Return the median number of a long array
    *
    * @param sizes
+   * @param alreadySorted
    * @return
    */
-  def median(sizes: Array[Long]): Long = {
+  def median(sizes: Array[Long], alreadySorted: Boolean = false): Long = {

Review comment:
       Hi, @attilapiros , in my opinion,  `median` method should support unsorted input array. But if we change the parameter name to `sortedSizes`, that will be a little confusing?
   WDYT ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wankunde commented on a change in pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

Posted by GitBox <gi...@apache.org>.
wankunde commented on a change in pull request #35619:
URL: https://github.com/apache/spark/pull/35619#discussion_r813633320



##########
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##########
@@ -3221,11 +3221,12 @@ private[spark] object Utils extends Logging {
    * Return the median number of a long array
    *
    * @param sizes
+   * @param alreadySorted
    * @return
    */
-  def median(sizes: Array[Long]): Long = {
+  def median(sizes: Array[Long], alreadySorted: Boolean = false): Long = {

Review comment:
       Agree, I have updated `median` method.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #35619:
URL: https://github.com/apache/spark/pull/35619#discussion_r812641442



##########
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##########
@@ -3221,11 +3221,12 @@ private[spark] object Utils extends Logging {
    * Return the median number of a long array
    *
    * @param sizes
+   * @param alreadySorted
    * @return
    */
-  def median(sizes: Array[Long]): Long = {
+  def median(sizes: Array[Long], alreadySorted: Boolean = false): Long = {

Review comment:
       What about always assuming the `sizes` are sorted (mentioning this in the description and reflecting in the parameter name ie. `sortedSizes` or in the name of the method..)?
    
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #35619:
URL: https://github.com/apache/spark/pull/35619#issuecomment-1066594192


   Merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #35619:
URL: https://github.com/apache/spark/pull/35619#discussion_r813629210



##########
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##########
@@ -3221,11 +3221,12 @@ private[spark] object Utils extends Logging {
    * Return the median number of a long array
    *
    * @param sizes
+   * @param alreadySorted
    * @return
    */
-  def median(sizes: Array[Long]): Long = {
+  def median(sizes: Array[Long], alreadySorted: Boolean = false): Long = {

Review comment:
       Ok, but in that case I won't give a default value for the `alreadySorted`, let's require the client to define it. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #35619:
URL: https://github.com/apache/spark/pull/35619#issuecomment-1049909088


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wankunde commented on a change in pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

Posted by GitBox <gi...@apache.org>.
wankunde commented on a change in pull request #35619:
URL: https://github.com/apache/spark/pull/35619#discussion_r813581591



##########
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##########
@@ -3221,11 +3221,12 @@ private[spark] object Utils extends Logging {
    * Return the median number of a long array
    *
    * @param sizes
+   * @param alreadySorted
    * @return
    */
-  def median(sizes: Array[Long]): Long = {
+  def median(sizes: Array[Long], alreadySorted: Boolean = false): Long = {

Review comment:
       Hi, @attilapiros , in my opinion,  `median` method should support unsorted input array. But if we change the parameter name to `sortedSizes`, that will be a little confusing?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ulysses-you commented on pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

Posted by GitBox <gi...@apache.org>.
ulysses-you commented on pull request #35619:
URL: https://github.com/apache/spark/pull/35619#issuecomment-1066677535


   thank you all !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #35619:
URL: https://github.com/apache/spark/pull/35619#discussion_r813669582



##########
File path: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala
##########
@@ -265,7 +265,7 @@ private[spark] object HighlyCompressedMapStatus {
     val threshold =
       if (accurateBlockSkewedFactor > 0) {
         val sortedSizes = uncompressedSizes.sorted
-        val medianSize: Long = Utils.median(sortedSizes)
+        val medianSize: Long = Utils.median(sortedSizes, true)

Review comment:
       only in this call site `alreadySorted` is `true`, what about only changing this place to utilize the order?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wankunde commented on a change in pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

Posted by GitBox <gi...@apache.org>.
wankunde commented on a change in pull request #35619:
URL: https://github.com/apache/spark/pull/35619#discussion_r814647556



##########
File path: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala
##########
@@ -265,7 +265,7 @@ private[spark] object HighlyCompressedMapStatus {
     val threshold =
       if (accurateBlockSkewedFactor > 0) {
         val sortedSizes = uncompressedSizes.sorted
-        val medianSize: Long = Utils.median(sortedSizes)
+        val medianSize: Long = Utils.median(sortedSizes, true)

Review comment:
       Hi, @zhengruifeng Originally I wrote the median function code here, but  @JoshRosen suggested extracting the common code into a util function https://github.com/apache/spark/pull/34234#discussion_r731369282 .
   Either changing the median method or copying some code here all makes sense to me.
   WDYT?
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

Posted by GitBox <gi...@apache.org>.
mridulm commented on pull request #35619:
URL: https://github.com/apache/spark/pull/35619#issuecomment-1068661887


   late LGTM, thanks for adding this @wankunde !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ulysses-you commented on pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

Posted by GitBox <gi...@apache.org>.
ulysses-you commented on pull request #35619:
URL: https://github.com/apache/spark/pull/35619#issuecomment-1066259005


   can we go ahead since 3.3.0 will be cut soon. thank you ! also cc @mridulm @Ngone51 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng closed pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

Posted by GitBox <gi...@apache.org>.
zhengruifeng closed pull request #35619:
URL: https://github.com/apache/spark/pull/35619


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wankunde commented on a change in pull request #35619: [FOLLOWUP][SPARK-36967][CORE] Report accurate shuffle block size if its skewed

Posted by GitBox <gi...@apache.org>.
wankunde commented on a change in pull request #35619:
URL: https://github.com/apache/spark/pull/35619#discussion_r813581591



##########
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##########
@@ -3221,11 +3221,12 @@ private[spark] object Utils extends Logging {
    * Return the median number of a long array
    *
    * @param sizes
+   * @param alreadySorted
    * @return
    */
-  def median(sizes: Array[Long]): Long = {
+  def median(sizes: Array[Long], alreadySorted: Boolean = false): Long = {

Review comment:
       Hi, @attilapiros , in my opinion,  `median` method should support unsorted input array and it has already been used in some other classes. But if we change the parameter name to `sortedSizes`, that will be a little confusing?
   WDYT ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org