You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/25 00:07:21 UTC

[GitHub] [spark] agrawaldevesh opened a new pull request #29226: Make the block manager decommissioning test be less flaky

agrawaldevesh opened a new pull request #29226:
URL: https://github.com/apache/spark/pull/29226


   ### What changes were proposed in this pull request?
   
   It's possible for this test to schedule the 3 tasks on just 2 out of 3
   executors and then end up decommissioning the third one. Since the third
   executor does not have any blocks, none get replicated, thereby failing
   the test.
   
   This condition can be triggered on a loaded system when the Worker
   registrations are sometimes delayed. It does trigger on Github checks
   frequently enough to annoy me to fix it.
   
   I added some logging to diagnose the problem. But the fix is simple:
   Just require that each executor gets the full worker so that no sharing
   is possible.
   
   ### Why are the changes needed?
   
   I have had bad luck with BlockManagerDecommissionIntegrationSuite and it has failed several times on my PRs. So fixing it.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, unit test only change.
   
   ### How was this patch tested?
   
   Github checks. Ran this test 100 times about 10 at a time in parallel (to create enough load). Ensured that the same running regime, without this fix fails at least once.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663901241






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon edited a comment on pull request #29226: [SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-670286829


   For some reasons, the test seems still flaky:
   - https://github.com/apache/spark/pull/29373/checks?check_run_id=953464967
   - https://github.com/apache/spark/pull/29378/checks?check_run_id=953464638
   
   ```
   BlockManagerDecommissionIntegrationSuite:
   - verify that an already running task which is going to cache data succeeds on a decommissioned executor after task start *** FAILED *** (13 seconds, 145 milliseconds)
     The code passed to eventually never returned normally. Attempted 1829 times over 6.001362179 seconds. Last failure message: getCandidateExecutorToDecom.isDefined was false. (BlockManagerDecommissionIntegrationSuite.scala:179)
     org.scalatest.exceptions.TestFailedDueToTimeoutException:
     at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:185)
     at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:192)
     at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:402)
     at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:401)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.eventually(BlockManagerDecommissionIntegrationSuite.scala:34)
     at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:312)
     at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:311)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.eventually(BlockManagerDecommissionIntegrationSuite.scala:34)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.runDecomTest(BlockManagerDecommissionIntegrationSuite.scala:179)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.$anonfun$new$1(BlockManagerDecommissionIntegrationSuite.scala:45)
     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
     at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
     at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
     at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
     at org.scalatest.Transformer.apply(Transformer.scala:22)
     at org.scalatest.Transformer.apply(Transformer.scala:20)
     at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:189)
     at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:158)
     at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:187)
     at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:199)
     at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:199)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:181)
     at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:60)
     at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
     at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
     at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:60)
     at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:232)
     at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
     at scala.collection.immutable.List.foreach(List.scala:392)
     at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
     at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
     at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:232)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:231)
     at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1562)
     at org.scalatest.Suite.run(Suite.scala:1112)
     at org.scalatest.Suite.run$(Suite.scala:1094)
     at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1562)
     at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:236)
     at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
     at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:236)
     at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:235)
     at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:60)
     at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
     at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
     at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
     at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:60)
     at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
     at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
     at sbt.ForkMain$Run$2.call(ForkMain.java:296)
     at sbt.ForkMain$Run$2.call(ForkMain.java:286)
     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)
     Cause: org.scalatest.exceptions.TestFailedException: getCandidateExecutorToDecom.isDefined was false
     at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
     at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
     at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
     at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.$anonfun$runDecomTest$6(BlockManagerDecommissionIntegrationSuite.scala:180)
     at org.scalatest.enablers.Retrying$$anon$4.makeAValiantAttempt$1(Retrying.scala:150)
     at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:162)
     at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:192)
     at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:402)
     at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:401)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.eventually(BlockManagerDecommissionIntegrationSuite.scala:34)
     at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:312)
     at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:311)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.eventually(BlockManagerDecommissionIntegrationSuite.scala:34)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.runDecomTest(BlockManagerDecommissionIntegrationSuite.scala:179)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.$anonfun$new$1(BlockManagerDecommissionIntegrationSuite.scala:45)
     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
     at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
     at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
     at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
     at org.scalatest.Transformer.apply(Transformer.scala:22)
     at org.scalatest.Transformer.apply(Transformer.scala:20)
     at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:189)
     at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:158)
     at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:187)
     at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:199)
     at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:199)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:181)
     at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:60)
     at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
     at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
     at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:60)
     at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:232)
     at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
     at scala.collection.immutable.List.foreach(List.scala:392)
     at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
     at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
     at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:232)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:231)
     at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1562)
     at org.scalatest.Suite.run(Suite.scala:1112)
     at org.scalatest.Suite.run$(Suite.scala:1094)
     at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1562)
     at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:236)
     at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
     at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:236)
     at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:235)
     at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:60)
     at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
     at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
     at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
     at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:60)
     at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
     at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
     at sbt.ForkMain$Run$2.call(ForkMain.java:296)
     at sbt.ForkMain$Run$2.call(ForkMain.java:286)
     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663808623






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663820927






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663820820


   **[Test build #126531 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126531/testReport)** for PR 29226 at commit [`199ea03`](https://github.com/apache/spark/commit/199ea03f028804dc12039bbca4f25a7138a378a8).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] agrawaldevesh commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
agrawaldevesh commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663875213


   The remaining failure is in SparkR and is probably environment related.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663819995


   **[Test build #126530 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126530/testReport)** for PR 29226 at commit [`4c8bafb`](https://github.com/apache/spark/commit/4c8bafb9d55e54332a83d97fcbe8662a20116215).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29226: [SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664559165






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on a change in pull request #29226: [SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
holdenk commented on a change in pull request #29226:
URL: https://github.com/apache/spark/pull/29226#discussion_r461133372



##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -125,27 +118,26 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
     // Start the computation of RDD - this step will also cache the RDD
     val asyncCount = testRdd.countAsync()
 
-    // Wait for the job to have started.
-    taskStartSem.acquire(1)
-    // Wait for each executor + driver to have it's broadcast info delivered.
-    broadcastSem.acquire((numExecs + 1))
-
     // Make sure the job is either mid run or otherwise has data to migrate.
     if (migrateDuring) {
-      // Give Spark a tiny bit to start executing after the broadcast blocks land.
-      // For me this works at 100, set to 300 for system variance.
-      Thread.sleep(300)
+      // Wait for one of the tasks to succeed and finish writing its blocks.
+      // This way we know that this executor had real data to migrate when it is subsequently
+      // decommissioned below.

Review comment:
       This goes against the purpose of this test: making sure that an executor with a running task that receives a decom has the block migrated. It is not migrating during decommissioning in the same way.

##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -125,27 +118,26 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
     // Start the computation of RDD - this step will also cache the RDD
     val asyncCount = testRdd.countAsync()
 
-    // Wait for the job to have started.
-    taskStartSem.acquire(1)
-    // Wait for each executor + driver to have it's broadcast info delivered.
-    broadcastSem.acquire((numExecs + 1))
-
     // Make sure the job is either mid run or otherwise has data to migrate.
     if (migrateDuring) {
-      // Give Spark a tiny bit to start executing after the broadcast blocks land.
-      // For me this works at 100, set to 300 for system variance.
-      Thread.sleep(300)
+      // Wait for one of the tasks to succeed and finish writing its blocks.
+      // This way we know that this executor had real data to migrate when it is subsequently
+      // decommissioned below.

Review comment:
       So I think another way we could make sure the test covers what we want is to run a job repeatedly until all 3 executors come up.
   
   The block manager (in decom state) does indeed refuses puts, but RDD computation on the executor goes through `getOrElseUpdate` which immediately calls `doPutIterator` if there is not a cache hit *before* the iterator starts being computed. Since the check to see if the block manager is decommissioning occurs before the start of the computation, not at the end we want to ensure that block can be put (and then later migrated).
   
   

##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -125,27 +118,26 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
     // Start the computation of RDD - this step will also cache the RDD
     val asyncCount = testRdd.countAsync()
 
-    // Wait for the job to have started.
-    taskStartSem.acquire(1)
-    // Wait for each executor + driver to have it's broadcast info delivered.
-    broadcastSem.acquire((numExecs + 1))
-
     // Make sure the job is either mid run or otherwise has data to migrate.
     if (migrateDuring) {
-      // Give Spark a tiny bit to start executing after the broadcast blocks land.
-      // For me this works at 100, set to 300 for system variance.
-      Thread.sleep(300)
+      // Wait for one of the tasks to succeed and finish writing its blocks.
+      // This way we know that this executor had real data to migrate when it is subsequently
+      // decommissioned below.

Review comment:
       Maybe we should add a comment where we do the `isDecommissioning` check to explain that it is intentionally done there so that we don't reject blocks which have already started computation. Do you think that would help?

##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -125,27 +118,26 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
     // Start the computation of RDD - this step will also cache the RDD
     val asyncCount = testRdd.countAsync()
 
-    // Wait for the job to have started.
-    taskStartSem.acquire(1)
-    // Wait for each executor + driver to have it's broadcast info delivered.
-    broadcastSem.acquire((numExecs + 1))
-
     // Make sure the job is either mid run or otherwise has data to migrate.
     if (migrateDuring) {
-      // Give Spark a tiny bit to start executing after the broadcast blocks land.
-      // For me this works at 100, set to 300 for system variance.
-      Thread.sleep(300)
+      // Wait for one of the tasks to succeed and finish writing its blocks.
+      // This way we know that this executor had real data to migrate when it is subsequently
+      // decommissioned below.

Review comment:
       I don't believe we need another block manager PR to realize it, I think this is just a flaky test because we took out the original sleep and tried to use TestUtils which doesn't do a good enough job of waiting for the executor to fully come up.
   
   Since doPut is called *before* the task starts computation, we don't throw away any of the in-progress data.
   
   I'll make an alternate PR to this one to illustrate my understanding and hopefully we can iron it out and make the code path clear for everyone :) 

##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -87,36 +113,30 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
     }
 
     // Listen for the job & block updates
-    val taskStartSem = new Semaphore(0)
-    val broadcastSem = new Semaphore(0)
     val executorRemovedSem = new Semaphore(0)
-    val taskEndEvents = ArrayBuffer.empty[SparkListenerTaskEnd]
+    val taskEndEvents = new ConcurrentLinkedQueue[SparkListenerTaskEnd]()
     val blocksUpdated = ArrayBuffer.empty[SparkListenerBlockUpdated]
-    sc.addSparkListener(new SparkListener {
 
+    def getCandidateExecutorToDecom: Option[String] = if (whenToDecom == TaskStarted) {
+      accum.value.asScala.headOption

Review comment:
       I don't think this is going to work as intended, accumulators send updates back at the end of the task, unless something has changed.

##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -87,36 +113,30 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
     }
 
     // Listen for the job & block updates
-    val taskStartSem = new Semaphore(0)
-    val broadcastSem = new Semaphore(0)
     val executorRemovedSem = new Semaphore(0)
-    val taskEndEvents = ArrayBuffer.empty[SparkListenerTaskEnd]
+    val taskEndEvents = new ConcurrentLinkedQueue[SparkListenerTaskEnd]()
     val blocksUpdated = ArrayBuffer.empty[SparkListenerBlockUpdated]
-    sc.addSparkListener(new SparkListener {
 
+    def getCandidateExecutorToDecom: Option[String] = if (whenToDecom == TaskStarted) {
+      accum.value.asScala.headOption

Review comment:
       If you don't believe that is the case, can we add an assertion in here that none of the tasks have finished yet?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664047993


   **[Test build #126583 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126583/testReport)** for PR 29226 at commit [`913e6d7`](https://github.com/apache/spark/commit/913e6d72c0bb3b18739b730a0634bcbc35444437).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Make the block manager decommissioning test be less flaky

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663780638






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663912934


   **[Test build #126554 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126554/testReport)** for PR 29226 at commit [`a939c67`](https://github.com/apache/spark/commit/a939c675dd70ec0368e7d9d8f08f1ffca662cf1d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663901104


   **[Test build #126554 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126554/testReport)** for PR 29226 at commit [`a939c67`](https://github.com/apache/spark/commit/a939c675dd70ec0368e7d9d8f08f1ffca662cf1d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663826074






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663808742


   **[Test build #126526 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126526/testReport)** for PR 29226 at commit [`288c34e`](https://github.com/apache/spark/commit/288c34ebaaec481f0fdf0db1938cf39e3970eb6d).
    * This patch **fails Scala style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663831392






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663783050






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664047993


   **[Test build #126583 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126583/testReport)** for PR 29226 at commit [`913e6d7`](https://github.com/apache/spark/commit/913e6d72c0bb3b18739b730a0634bcbc35444437).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663893861






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663879650






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #29226: [SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664674525






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664110421


   @agrawaldevesh, shell we file a JIRA? I think it's better to file a JIRA to track in general.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663808556


   **[Test build #126526 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126526/testReport)** for PR 29226 at commit [`288c34e`](https://github.com/apache/spark/commit/288c34ebaaec481f0fdf0db1938cf39e3970eb6d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663808753


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126526/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664122085


   @agrawaldevesh I gave some more thought: it must be really about the middle of the job so to wait for task finish is fine here.
   
   I suggest to go for the much simpler fix by checking when the block appears in the `blocksUpdated` collection and trigger the decommissioning only after we have this block on the disk.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663834276






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29226: [SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-670286829


   For some reasons, the test seems still flaky: https://github.com/apache/spark/pull/29373/checks?check_run_id=953464967
   
   ```
   BlockManagerDecommissionIntegrationSuite:
   - verify that an already running task which is going to cache data succeeds on a decommissioned executor after task start *** FAILED *** (13 seconds, 145 milliseconds)
     The code passed to eventually never returned normally. Attempted 1829 times over 6.001362179 seconds. Last failure message: getCandidateExecutorToDecom.isDefined was false. (BlockManagerDecommissionIntegrationSuite.scala:179)
     org.scalatest.exceptions.TestFailedDueToTimeoutException:
     at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:185)
     at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:192)
     at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:402)
     at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:401)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.eventually(BlockManagerDecommissionIntegrationSuite.scala:34)
     at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:312)
     at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:311)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.eventually(BlockManagerDecommissionIntegrationSuite.scala:34)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.runDecomTest(BlockManagerDecommissionIntegrationSuite.scala:179)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.$anonfun$new$1(BlockManagerDecommissionIntegrationSuite.scala:45)
     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
     at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
     at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
     at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
     at org.scalatest.Transformer.apply(Transformer.scala:22)
     at org.scalatest.Transformer.apply(Transformer.scala:20)
     at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:189)
     at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:158)
     at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:187)
     at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:199)
     at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:199)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:181)
     at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:60)
     at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
     at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
     at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:60)
     at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:232)
     at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
     at scala.collection.immutable.List.foreach(List.scala:392)
     at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
     at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
     at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:232)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:231)
     at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1562)
     at org.scalatest.Suite.run(Suite.scala:1112)
     at org.scalatest.Suite.run$(Suite.scala:1094)
     at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1562)
     at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:236)
     at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
     at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:236)
     at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:235)
     at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:60)
     at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
     at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
     at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
     at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:60)
     at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
     at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
     at sbt.ForkMain$Run$2.call(ForkMain.java:296)
     at sbt.ForkMain$Run$2.call(ForkMain.java:286)
     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)
     Cause: org.scalatest.exceptions.TestFailedException: getCandidateExecutorToDecom.isDefined was false
     at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
     at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
     at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
     at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.$anonfun$runDecomTest$6(BlockManagerDecommissionIntegrationSuite.scala:180)
     at org.scalatest.enablers.Retrying$$anon$4.makeAValiantAttempt$1(Retrying.scala:150)
     at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:162)
     at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:192)
     at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:402)
     at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:401)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.eventually(BlockManagerDecommissionIntegrationSuite.scala:34)
     at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:312)
     at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:311)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.eventually(BlockManagerDecommissionIntegrationSuite.scala:34)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.runDecomTest(BlockManagerDecommissionIntegrationSuite.scala:179)
     at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.$anonfun$new$1(BlockManagerDecommissionIntegrationSuite.scala:45)
     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
     at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
     at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
     at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
     at org.scalatest.Transformer.apply(Transformer.scala:22)
     at org.scalatest.Transformer.apply(Transformer.scala:20)
     at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:189)
     at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:158)
     at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:187)
     at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:199)
     at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:199)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:181)
     at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:60)
     at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
     at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
     at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:60)
     at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:232)
     at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
     at scala.collection.immutable.List.foreach(List.scala:392)
     at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
     at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
     at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:232)
     at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:231)
     at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1562)
     at org.scalatest.Suite.run(Suite.scala:1112)
     at org.scalatest.Suite.run$(Suite.scala:1094)
     at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1562)
     at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:236)
     at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
     at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:236)
     at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:235)
     at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:60)
     at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
     at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
     at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
     at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:60)
     at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
     at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
     at sbt.ForkMain$Run$2.call(ForkMain.java:296)
     at sbt.ForkMain$Run$2.call(ForkMain.java:286)
     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664108813


   We are still not there yet.
   
   So according to you analyses:
   > - We wait 300 ms and decommission executor 0.
   >
   > - If the task is not yet done on executor 0, it will now fail because
   the block manager won't be able to save the block. This condition is
   easy to trigger on a loaded machine where the github checks run.
   
   If we would need this block to be saved then there would be a much simpler fix by checking when the block appears in the `blocksUpdated` collection. But this goes against the `migrateDuring` original intention (see the first comment):
   
   https://github.com/apache/spark/blob/913e6d72c0bb3b18739b730a0634bcbc35444437/core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala#L157-L164
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664041781






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #29226:
URL: https://github.com/apache/spark/pull/29226#discussion_r460502249



##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -57,6 +58,11 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
       .set(config.STORAGE_DECOMMISSION_ENABLED, true)
       .set(config.STORAGE_DECOMMISSION_RDD_BLOCKS_ENABLED, persist)
       .set(config.STORAGE_DECOMMISSION_SHUFFLE_BLOCKS_ENABLED, shuffle)
+      // Force exactly one executor per worker such that all block managers
+      // get the shuffle and RDD blocks.
+      .set(config.EXECUTOR_CORES.key, "1")
+      .set(config.CPUS_PER_TASK.key, "1")
+      .set(config.EXECUTOR_MEMORY.key, "1024m")

Review comment:
       I think these changes are redundant and not needed as they are setting the same values as the defaults for these configs:
   
   https://github.com/apache/spark/blob/a939c675dd70ec0368e7d9d8f08f1ffca662cf1d/core/src/main/scala/org/apache/spark/internal/config/package.scala#L293-L302
   
   https://github.com/apache/spark/blob/a939c675dd70ec0368e7d9d8f08f1ffca662cf1d/core/src/main/scala/org/apache/spark/internal/config/package.scala#L524-L525




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663783050






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663880272


   **[Test build #126550 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126550/testReport)** for PR 29226 at commit [`ba12869`](https://github.com/apache/spark/commit/ba12869cd1d87c54566a2584786763aff20e80bf).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663808748


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: [SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664637705






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664055890






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664041627


   **[Test build #126582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126582/testReport)** for PR 29226 at commit [`2cea5f4`](https://github.com/apache/spark/commit/2cea5f4673b746d068bf69f6818f23d1962039ca).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664055664


   **[Test build #126582 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126582/testReport)** for PR 29226 at commit [`2cea5f4`](https://github.com/apache/spark/commit/2cea5f4673b746d068bf69f6818f23d1962039ca).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663820123






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663798252


   **[Test build #126518 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126518/testReport)** for PR 29226 at commit [`558702e`](https://github.com/apache/spark/commit/558702e6f55d6c17439d1942000bea08acde0b29).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663820927






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663795770


   Oh awesome, thanks for fixing this. I totally misdiagnosed this issue
   (thought it was around placement of the partitions).
   
   On Fri, Jul 24, 2020 at 5:36 PM Apache Spark QA <no...@github.com>
   wrote:
   
   > *Test build #126518 has started
   > <https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126518/testReport>*
   > for PR 29226 at commit 558702e
   > <https://github.com/apache/spark/commit/558702e6f55d6c17439d1942000bea08acde0b29>
   > .
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/spark/pull/29226#issuecomment-663784088>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AAAOT5OSXJNHXNXHEYMHJALR5ISHXANCNFSM4PHFIUWQ>
   > .
   >
   -- 
   Cell : 425-233-8271
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663893861






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: [SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664638729






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664041781






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663798416






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663831238


   **[Test build #126530 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126530/testReport)** for PR 29226 at commit [`4c8bafb`](https://github.com/apache/spark/commit/4c8bafb9d55e54332a83d97fcbe8662a20116215).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663780471


   **[Test build #126517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126517/testReport)** for PR 29226 at commit [`7659d22`](https://github.com/apache/spark/commit/7659d22bcc3e2a423a1068dd15d9b11eba76c2df).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] agrawaldevesh commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
agrawaldevesh commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663782885


   cc: @holdenk @attilapiros for review please. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664063689


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663796598






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663893673


   **[Test build #126550 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126550/testReport)** for PR 29226 at commit [`ba12869`](https://github.com/apache/spark/commit/ba12869cd1d87c54566a2584786763aff20e80bf).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664048083






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #29226:
URL: https://github.com/apache/spark/pull/29226#discussion_r460505054



##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -107,6 +115,21 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
       }
 
       override def onBlockUpdated(blockUpdated: SparkListenerBlockUpdated): Unit = {
+        if (blockUpdated.blockUpdatedInfo.blockId.isRDD && persist) {
+          // Persisted RDD blocks are a bit weirder than shuffle blocks: Even though
+          // the tasks are run say on executors (0, 1, 2), the RDD blocks might end up only
+          // on executors 0 and 1. So we cannot just indiscriminately decommission any executor.
+          // Instead we must decommission an executor that actually has an RDD block.
+          // Fortunately, this isn't the case for shuffle blocks which are indeed present on all
+          // executors and thus any executor can be decommissioned when `persist` is false.
+          val candidateExecToDecom = blockUpdated.blockUpdatedInfo.blockManagerId.executorId
+          if (execToDecommission.compareAndSet(null, candidateExecToDecom)) {
+            val decomContext = s"Decommissioning executor ${candidateExecToDecom} for persist"
+            logInfo(decomContext)
+            sched.decommissionExecutor(candidateExecToDecom,
+              ExecutorDecommissionInfo(decomContext, false))

Review comment:
       duplicated code




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663834276






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663808556


   **[Test build #126526 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126526/testReport)** for PR 29226 at commit [`288c34e`](https://github.com/apache/spark/commit/288c34ebaaec481f0fdf0db1938cf39e3970eb6d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664055890






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663901241






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: [SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664638729






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] agrawaldevesh edited a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
agrawaldevesh edited a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664146969


   @attilapiros thanks for thinking about this. Here is my line of reasoning for why I wrote it the way I did:
   
   > If we would need this block to be saved then there would be a much simpler fix by checking when the block appears in the `blocksUpdated` collection. ~But this goes against the `migrateDuring` original intention (see the first comment):~
   
   I considered this approach (among others) but decided against it. Please let me explain my reasoning:
   
   - First, while I could always poll for the block to show up in blocksUpdated, you would agree that it is a lot of machinery particularly when we know that the blocksUpdated is placed by the `onBlockUpdated` listener call. So I decided to just trigger this decommissioning directly in the listener.
   
   - Second question is whether the decommissioning be done in `onBlockUpdated` or `onTaskEnd`. I chose `onTaskEnd` because that is the point of no return for the block to be "live". Consider what happens when a result task fails after the block is updated but before it can be considered "done". The block it wrote would get garbage collected. This garbage collection would race with the decommissioning. It's possible that the migration starts after the GC and does not find the block to replicated. This would fail the assertion below.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663879650






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663808748






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663784088


   **[Test build #126518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126518/testReport)** for PR 29226 at commit [`558702e`](https://github.com/apache/spark/commit/558702e6f55d6c17439d1942000bea08acde0b29).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663825949


   **[Test build #126536 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126536/testReport)** for PR 29226 at commit [`b2c2fad`](https://github.com/apache/spark/commit/b2c2fad1eb1caeeb5d6e135220398ad43a963858).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] agrawaldevesh commented on pull request #29226: [SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
agrawaldevesh commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664694713






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] agrawaldevesh commented on a change in pull request #29226: [SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
agrawaldevesh commented on a change in pull request #29226:
URL: https://github.com/apache/spark/pull/29226#discussion_r461139578



##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -125,27 +118,26 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
     // Start the computation of RDD - this step will also cache the RDD
     val asyncCount = testRdd.countAsync()
 
-    // Wait for the job to have started.
-    taskStartSem.acquire(1)
-    // Wait for each executor + driver to have it's broadcast info delivered.
-    broadcastSem.acquire((numExecs + 1))
-
     // Make sure the job is either mid run or otherwise has data to migrate.
     if (migrateDuring) {
-      // Give Spark a tiny bit to start executing after the broadcast blocks land.
-      // For me this works at 100, set to 300 for system variance.
-      Thread.sleep(300)
+      // Wait for one of the tasks to succeed and finish writing its blocks.
+      // This way we know that this executor had real data to migrate when it is subsequently
+      // decommissioned below.

Review comment:
       In which case, I don't follow how this is supposed to work in the first place. Consider the root cause: You decommission an executor while a (result) task is running. The block manager associated with the executor enters the decom state. The task tries to write the RDD persisted block. The block manager fails that because it is decom'd, thereby failing the task. The task reruns somewhere else and the job succeeds. Now you go to check if that executor has migrated the block: That check fails because the block was never written in the first place.
   
   So I am relaxing the notion of "migrate during" to mean "migrate while the job is running". The question is "which executor" do you want to decommission and force a migration off ? Picking a wrong executor as above can mean that no migrations indeed happen because no real blocks were written (or written and thence discarded).
   
   If that's the intent of the test then I think we need to change the block manager decommissioning (production) code to realize the intent. 
   
   Without this PR: This test is fundamentally racy and thus provides low signal. The race is between the task's end and the block manager's decommission: If the task ends successfully before the decom, the test passes. Otherwise the test fails.
   
   

##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -125,27 +118,26 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
     // Start the computation of RDD - this step will also cache the RDD
     val asyncCount = testRdd.countAsync()
 
-    // Wait for the job to have started.
-    taskStartSem.acquire(1)
-    // Wait for each executor + driver to have it's broadcast info delivered.
-    broadcastSem.acquire((numExecs + 1))
-
     // Make sure the job is either mid run or otherwise has data to migrate.
     if (migrateDuring) {
-      // Give Spark a tiny bit to start executing after the broadcast blocks land.
-      // For me this works at 100, set to 300 for system variance.
-      Thread.sleep(300)
+      // Wait for one of the tasks to succeed and finish writing its blocks.
+      // This way we know that this executor had real data to migrate when it is subsequently
+      // decommissioned below.

Review comment:
       > The block manager (in decom state) does indeed refuses puts, but RDD computation on the executor goes through `getOrElseUpdate` which immediately calls `doPutIterator` if there is not a cache hit _before_ the iterator starts being computed.
   
   I don't see this. This is what I see: (BM stands for BlockManager)
   - RDD.getOrCompute -> BM.getOrElseUpdate -> BM.doPutIterator -> BM.doPut 
   
   BM.doPut throws BlockSavedOnDecommissionedBlockManagerException if `isDecommissioning`. And this fails the job. I am not sure what it should do instead off the top of my head since I am new to this codepath. But it is certainly not continuing on with the computation (as far as I can tell).
   
   If the intent of the test is indeed to test decommissioning "Before the task has ended", I think then we need another BlockManager PR to actually realize it :-) 

##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -125,27 +118,26 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
     // Start the computation of RDD - this step will also cache the RDD
     val asyncCount = testRdd.countAsync()
 
-    // Wait for the job to have started.
-    taskStartSem.acquire(1)
-    // Wait for each executor + driver to have it's broadcast info delivered.
-    broadcastSem.acquire((numExecs + 1))
-
     // Make sure the job is either mid run or otherwise has data to migrate.
     if (migrateDuring) {
-      // Give Spark a tiny bit to start executing after the broadcast blocks land.
-      // For me this works at 100, set to 300 for system variance.
-      Thread.sleep(300)
+      // Wait for one of the tasks to succeed and finish writing its blocks.
+      // This way we know that this executor had real data to migrate when it is subsequently
+      // decommissioned below.

Review comment:
       Cool ! Looking forward to your alternate PR to fix this.
   
   > I don't believe we need another block manager PR to realize it, I think this is just a flaky test because we took out the original sleep and tried to use TestUtils which doesn't do a good enough job of waiting for the executor to fully come up.
   
   I don't think the issue is that the executor does not come up. It does come up. The issue I think is that it tries to run the task and fails because it is decommissioned.
   
   > Since doPut is called _before_ the task starts computation, we don't throw away any of the in-progress data.
   
   I agree that doPut is called _before_ the task starts computation. I didn't follow what you mean by "in-progress data" ?. Per my understanding, the task is simply failing before the iterator is even created. 
   
   Please check this for yourself by doing a decommission in the `onTaskStart` listener callback and ensuring that the task has a long enough sleep time to ensure that it waits for this decommissioning to happen.
   
   > 
   > I'll make an alternate PR to this one to illustrate my understanding and hopefully we can iron it out and make the code path clear for everyone :)
   
   Hurray :-) 
   

##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -36,21 +37,34 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
   val numExecs = 3
   val numParts = 3
 
+  test(s"verify that an already running task which is going to cache data succeeds " +
+    s"on a decommissioned executor after task start") {
+    runDecomTest(true, false, true, 1)
+  }
+
+  test(s"verify that an already running task which is going to cache data succeeds " +
+    s"on a decommissioned executor after iterator start") {
+    runDecomTest(true, false, true, 2)
+  }
+

Review comment:
       @holdenk .. this is what I meant: I have added two tests to test the decom after "Task Start". 
   
   The first test is a negative test. It defines task start as when the driver says that the task is "launched" and as we discussed, it is expected to fail because decommissioning will be triggered before `BlockManager.doPut` is called and thus fail the task without starting the iterator.
   
   The second test is what you want mentioned in your "intention": We want the decommissioning after the "iterator start" (ie user code started) to trigger a migration: The block manager will write the block anyway and it will be migrated.
   
   The first negative test is just to demonstrate that it is a case we don't handle yet. Since we agree that it is not a case worth fixing for production, I will eventually delete it. 
   
   So now we cover the full gamut of interesting decom points in task: When it is just launched, When user code is activated, After it has ended. 

##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -36,21 +37,34 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
   val numExecs = 3
   val numParts = 3
 
+  test(s"verify that an already running task which is going to cache data succeeds " +
+    s"on a decommissioned executor after task start") {
+    runDecomTest(true, false, true, 1)
+  }
+
+  test(s"verify that an already running task which is going to cache data succeeds " +
+    s"on a decommissioned executor after iterator start") {
+    runDecomTest(true, false, true, 2)
+  }
+

Review comment:
       @holdenk .. this is what I meant: I have added two tests to test the decom after "Task Start". 
   
   The first test is a negative test. It defines task start as when the driver says that the task is "launched" and as we discussed, it is expected to fail because decommissioning will be triggered before `BlockManager.doPut` is called and thus fail the task without starting the iterator.
   
   The second test is what you want mentioned in your "intention": We want the decommissioning after the "iterator start" (ie user code started) to trigger a migration: The block manager will write the block anyway and it will be migrated. Do you think this second test indeed captures the intention of the test ?
   
   The first negative test is just to demonstrate that it is a case we don't handle yet. Since we agree that it is not a case worth fixing for production, I will eventually delete it. 
   
   So now we cover the full gamut of interesting decom points in task: When it is just launched, When user code is activated, After it has ended. 
   

##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -125,27 +118,26 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
     // Start the computation of RDD - this step will also cache the RDD
     val asyncCount = testRdd.countAsync()
 
-    // Wait for the job to have started.
-    taskStartSem.acquire(1)
-    // Wait for each executor + driver to have it's broadcast info delivered.
-    broadcastSem.acquire((numExecs + 1))
-
     // Make sure the job is either mid run or otherwise has data to migrate.
     if (migrateDuring) {
-      // Give Spark a tiny bit to start executing after the broadcast blocks land.
-      // For me this works at 100, set to 300 for system variance.
-      Thread.sleep(300)
+      // Wait for one of the tasks to succeed and finish writing its blocks.
+      // This way we know that this executor had real data to migrate when it is subsequently
+      // decommissioned below.

Review comment:
       @holdenk, Please take a look at the PR once again. I have added another test to specifically capture the intent of decommissioning after the task has started but before the task has ended. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663819995


   **[Test build #126530 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126530/testReport)** for PR 29226 at commit [`4c8bafb`](https://github.com/apache/spark/commit/4c8bafb9d55e54332a83d97fcbe8662a20116215).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663796437


   **[Test build #126517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126517/testReport)** for PR 29226 at commit [`7659d22`](https://github.com/apache/spark/commit/7659d22bcc3e2a423a1068dd15d9b11eba76c2df).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663820820


   **[Test build #126531 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126531/testReport)** for PR 29226 at commit [`199ea03`](https://github.com/apache/spark/commit/199ea03f028804dc12039bbca4f25a7138a378a8).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663798416






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664063689






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664063691


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126583/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664041627


   **[Test build #126582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126582/testReport)** for PR 29226 at commit [`2cea5f4`](https://github.com/apache/spark/commit/2cea5f4673b746d068bf69f6818f23d1962039ca).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #29226:
URL: https://github.com/apache/spark/pull/29226#discussion_r460504301



##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -107,6 +115,21 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
       }
 
       override def onBlockUpdated(blockUpdated: SparkListenerBlockUpdated): Unit = {
+        if (blockUpdated.blockUpdatedInfo.blockId.isRDD && persist) {
+          // Persisted RDD blocks are a bit weirder than shuffle blocks: Even though
+          // the tasks are run say on executors (0, 1, 2), the RDD blocks might end up only
+          // on executors 0 and 1. So we cannot just indiscriminately decommission any executor.

Review comment:
       Please investigate why this placement is happening (as at first it sounds a bit strange). 
   You might find a more important bug to fix or very good lesson about the internals of RDD persistence.
   
   Tipp: please double check "though the tasks are run say on executors (0, 1, 2)" part!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros edited a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
attilapiros edited a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664108813


   We are still not there yet.
   
   So according to you analyses:
   > - We wait 300 ms and decommission executor 0.
   >
   > - If the task is not yet done on executor 0, it will now fail because
   the block manager won't be able to save the block. This condition is
   easy to trigger on a loaded machine where the github checks run.
   
   If we would need this block to be saved then there would be a much simpler fix by checking when the block appears in the `blocksUpdated` collection. ~~But this goes against the `migrateDuring` original intention (see the first comment):~~
   
   https://github.com/apache/spark/blob/913e6d72c0bb3b18739b730a0634bcbc35444437/core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala#L157-L164
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664063325


   **[Test build #126583 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126583/testReport)** for PR 29226 at commit [`913e6d7`](https://github.com/apache/spark/commit/913e6d72c0bb3b18739b730a0634bcbc35444437).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663831392


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663820123






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Make the block manager decommissioning test be less flaky

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663780638






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663834089


   **[Test build #126531 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126531/testReport)** for PR 29226 at commit [`199ea03`](https://github.com/apache/spark/commit/199ea03f028804dc12039bbca4f25a7138a378a8).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663839348






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] agrawaldevesh commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
agrawaldevesh commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663802421


   Sorry it didn't work. I will make another go at it. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663825949


   **[Test build #126536 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126536/testReport)** for PR 29226 at commit [`b2c2fad`](https://github.com/apache/spark/commit/b2c2fad1eb1caeeb5d6e135220398ad43a963858).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664048083






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] agrawaldevesh commented on pull request #29226: [SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
agrawaldevesh commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664150010


   > @agrawaldevesh, shell we file a JIRA? I think it's better to file a JIRA to track in general.
   
   Turns out I was lucky: @gaborgsomogyi had already created the Jira some days ago :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663784088


   **[Test build #126518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126518/testReport)** for PR 29226 at commit [`558702e`](https://github.com/apache/spark/commit/558702e6f55d6c17439d1942000bea08acde0b29).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663913064






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663880272


   **[Test build #126550 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126550/testReport)** for PR 29226 at commit [`ba12869`](https://github.com/apache/spark/commit/ba12869cd1d87c54566a2584786763aff20e80bf).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663826074






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663839169


   **[Test build #126536 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126536/testReport)** for PR 29226 at commit [`b2c2fad`](https://github.com/apache/spark/commit/b2c2fad1eb1caeeb5d6e135220398ad43a963858).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663913064






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk edited a comment on pull request #29226: [SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
holdenk edited a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664696816


   Sounds good to me if it works. I’m not sure that checking the executor id in the task start message would guarantee that though.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] agrawaldevesh commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
agrawaldevesh commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664048430


   @attilapiros Thank you for pushing me to do the right thing here and go that extra mile ! I think I have now finally understood why this race condition was happening and I have fixed the PR description and the code accordingly. The fix hasn't changed fundamentally but I now know the right reasons behind it :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663831398


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126530/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663901104


   **[Test build #126554 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126554/testReport)** for PR 29226 at commit [`a939c67`](https://github.com/apache/spark/commit/a939c675dd70ec0368e7d9d8f08f1ffca662cf1d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663808623






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663796598






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663839348






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29226: Make the block manager decommissioning test be less flaky

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-663780471


   **[Test build #126517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126517/testReport)** for PR 29226 at commit [`7659d22`](https://github.com/apache/spark/commit/7659d22bcc3e2a423a1068dd15d9b11eba76c2df).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] asfgit closed pull request #29226: [SPARK-32417] Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #29226:
URL: https://github.com/apache/spark/pull/29226


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] agrawaldevesh commented on pull request #29226: Fix flakyness of BlockManagerDecommissionIntegrationSuite

Posted by GitBox <gi...@apache.org>.
agrawaldevesh commented on pull request #29226:
URL: https://github.com/apache/spark/pull/29226#issuecomment-664146969


   > If we would need this block to be saved then there would be a much simpler fix by checking when the block appears in the `blocksUpdated` collection. ~But this goes against the `migrateDuring` original intention (see the first comment):~
   
   I considered this approach (among others) but decided against it. Please let me explain my reasoning:
   
   - First, while I could always poll for the block to show up in blocksUpdated, you would agree that it is a lot of machinery particularly when we know that the blocksUpdated is placed by the `onBlockUpdated` listener call. So I decided to just trigger this decommissioning directly in the listener.
   
   - Second question is whether the decommissioning be done in `onBlockUpdated` or `onTaskEnd`. I chose `onTaskEnd` because that is the point of no return for the block to be "live". Consider what happens when a result task fails after the block is updated but before it can be considered "done". The block it wrote would get garbage collected. This garbage collection would race with the decommissioning. It's possible that the migration starts after the GC and does not find the block to replicated. This would fail the assertion below.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org