You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/17 04:54:41 UTC

[GitHub] [spark] wypoon opened a new pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

wypoon opened a new pull request #28848:
URL: https://github.com/apache/spark/pull/28848


   …after executor is lost
   
   ### What changes were proposed in this pull request?
   
   If an executor is lost, the `DAGScheduler` handles the executor loss by removing the executor but does not unregister its outputs if the external shuffle service is used. However, if the node on which the executor runs is lost, the shuffle service may not be able to serve the shuffle files.
   In such a case, when fetches from the executor's outputs fail in the same stage, the `DAGScheduler` again removes the executor and by right, should unregister its outputs. It doesn't because the epoch used to track the executor failure has not increased.
   
   We track the epoch for failed executors that result in lost file output separately, so we can unregister the outputs in this scenario. The idea to track a second epoch is due to Attila Zsolt Piros.
   
   ### Why are the changes needed?
   
   Without the changes, the loss of a node could require two stage attempts to recover instead of one.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New unit test. This test fails without the change and passes with it.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-655383707


   **[Test build #125340 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125340/testReport)** for PR 28848 at commit [`0ece4ac`](https://github.com/apache/spark/commit/0ece4acc0bb76c58ff12e710b60289e206a8703f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r441676778



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -540,6 +540,46 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("[SPARK-32003] All shuffle files for executor should be cleaned up on fetch failure") {
+    // reset the test context with the right shuffle service config
+    afterEach()

Review comment:
       Attila summed it up.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646169358






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-655392272


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r441336489



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1986,6 +1986,9 @@ private[spark] class DAGScheduler(
       logInfo("Host added was in lost list earlier: " + host)
       failedEpoch -= execId
     }
+    if (fileLostEpoch.contains(execId)) {

Review comment:
       This `if` is not needed as  removing a non-existing key from a `HashMap` gives back the `HashMap`(a few lines above it must be used because of the `logInfo`).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-655396368






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645330944






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-658532861


   **[Test build #125863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125863/testReport)** for PR 28848 at commit [`0e00862`](https://github.com/apache/spark/commit/0e0086288f6279569e8a11cef9d928b87c40469b).
    * This patch **fails PySpark pip packaging tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650446374






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645657591


   **[Test build #124180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124180/testReport)** for PR 28848 at commit [`4c0b98c`](https://github.com/apache/spark/commit/4c0b98cc30cefb9745fa1c1232022bfcb55077cd).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645152222


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124153/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r451926602



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1933,29 +1951,45 @@ private[spark] class DAGScheduler(
       maybeEpoch = None)
   }
 
+  /**
+   * Handles removing an executor from the BlockManagerMaster as well as unregistering shuffle
+   * outputs for the executor or optionally its host.
+   *
+   * @param execId executor to be removed
+   * @param fileLost If true, indicates that we assume we've lost all shuffle blocks associated
+   *   with the executor; this happens if the executor serves its own blocks (i.e., we're not
+   *   using external shuffle), the Standalone worker (which serves the shuffle data) is lost,

Review comment:
       Other clusters, like YARN, also could use external shuffle service. So they may face the same issue.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647766743


   > Ideally if the node is lost, we should have `SlaveLost` as the executor loss reason, could you explain more about why doesn't this happen in your use case? Thank you!
   
   Only Spark Standalone uses `SlaveLost` with `workerLost=true`. Neither the YARN nor Mesos backend uses that. Our customer uses Spark on YARN with the external shuffle service enabled. When the executor is lost, `DAGScheduler#handleExecutorLost` is called with `workerLost=false`. Since
   ```
       // if the cluster manager explicitly tells us that the entire worker was lost, then
       // we know to unregister shuffle output.  (Note that "worker" specifically refers to the process
       // from a Standalone cluster, where the shuffle service lives in the Worker.)
       val fileLost = workerLost || !env.blockManager.externalShuffleServiceEnabled
   ```
   (note the comment there referring to Spark Standalone) `removeExecutorAndUnregisterOutputs` is then called with `fileLost=false`.
   `DAGScheduler#handleExecutorLost` is called from 
   ```
       case ExecutorLost(execId, reason) =>
         val workerLost = reason match {
           case SlaveLost(_, true) => true
           case _ => false
         }
         dagScheduler.handleExecutorLost(execId, workerLost)
   ```
   As I said, only Spark Standalone creates `SlaveLost(_, true)` instances.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647952077


   **[Test build #124388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124388/testReport)** for PR 28848 at commit [`633a0e7`](https://github.com/apache/spark/commit/633a0e7841c9a9f9175bed510120ef2fe66ebe00).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r443083551



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -540,6 +540,42 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("[SPARK-32003] All shuffle files for executor should be cleaned up on fetch failure") {

Review comment:
       Thank you for adding this UT. BTW, since the recommendation from https://spark.apache.org/contributing.html is `SPARK-32003: `, it would be great if this follows the guideline.

##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -540,6 +540,42 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("[SPARK-32003] All shuffle files for executor should be cleaned up on fetch failure") {

Review comment:
       Thank you for adding this UT. BTW, since the recommendation from https://spark.apache.org/contributing.html is `SPARK-32003: ...`, it would be great if this follows the guideline.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r443814139



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -540,6 +540,42 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("[SPARK-32003] All shuffle files for executor should be cleaned up on fetch failure") {

Review comment:
       No problem. I'll change to use "SPARK-32003: ...". There were several different styles in the file; I didn't realize there was an official/preferred style. The first couple of tests in the file had the style I used.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650014692


   **[Test build #124530 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124530/testReport)** for PR 28848 at commit [`d09ef93`](https://github.com/apache/spark/commit/d09ef9335e5d3657b830497155abb7a0c2bb0cde).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-648344459


   @tgravescs a suggestion to improve the title of the PR is also welcome. It is hard to do justice in one simple sentence. I see how you would fail to grasp what the change is for on reading the current PR title.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645149561






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] jiangxb1987 commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
jiangxb1987 commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647322894


   Ideally if the node is lost, we should have `SlaveLost` as the executor loss reason, could you explain more about why doesn't this happen in your use case? Thank you!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645747557


   **[Test build #124184 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124184/testReport)** for PR 28848 at commit [`4c0b98c`](https://github.com/apache/spark/commit/4c0b98cc30cefb9745fa1c1232022bfcb55077cd).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-658933529


   PySpark packaging tests fail with
   ```
   Writing pyspark-3.1.0.dev0/setup.cfg
   creating dist
   Creating tar archive
   removing 'pyspark-3.1.0.dev0' (and everything under it)
   Installing dist into virtual env
   Processing ./python/dist/pyspark-3.1.0.dev0.tar.gz
   Collecting py4j==0.10.9 (from pyspark==3.1.0.dev0)
     Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
   Installing collected packages: py4j, pyspark
     Found existing installation: py4j 0.10.9
       Uninstalling py4j-0.10.9:
         Successfully uninstalled py4j-0.10.9
     Found existing installation: pyspark 3.1.0.dev0
   Exception:
   Traceback (most recent call last):
     File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 179, in main
       status = self.run(options, args)
     File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 393, in run
       use_user_site=options.use_user_site,
     File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/req/__init__.py", line 50, in install_given_reqs
       auto_confirm=True
     File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 816, in uninstall
       uninstalled_pathset = UninstallPathSet.from_dist(dist)
     File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/req/req_uninstall.py", line 505, in from_dist
       '(at %s)' % (link_pointer, dist.project_name, dist.location)
   AssertionError: Egg-link /home/jenkins/workspace/SparkPullRequestBuilder@4/python does not match installed location of pyspark (at /home/jenkins/workspace/SparkPullRequestBuilder/python)
   Cleaning up temporary directory - /tmp/tmp.nkhMEcKRla
   [error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/run-pip-tests ; received return code 2
   ```
   but I don't understand what that means. Any help?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-657728722


   **[Test build #125785 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125785/testReport)** for PR 28848 at commit [`56f3ae5`](https://github.com/apache/spark/commit/56f3ae546ecdf3afe8d3217db410a5527be09aca).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646903499


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124305/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645658374






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654577955


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125168/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645700340


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r441671240



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,22 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Removing executor $execId, fileLost: $fileLost, currentEpoch: $currentEpoch")

Review comment:
       See explanation in my response to your comment about the previous logDebug.
   There are now two if clauses. I think it is simpler just to log the values at the beginning of the method; knowing the values will tell you which if clauses will get executed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650446374






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646848537


   **[Test build #124300 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124300/testReport)** for PR 28848 at commit [`0069e71`](https://github.com/apache/spark/commit/0069e71118587a30911165e762e22e95fad97fe1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646849205






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662099210


   **[Test build #126279 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126279/testReport)** for PR 28848 at commit [`0e00862`](https://github.com/apache/spark/commit/0e0086288f6279569e8a11cef9d928b87c40469b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r441647042



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,22 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Removing executor $execId, fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
       logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
       blockManagerMaster.removeExecutor(execId)
-      if (fileLost) {
-        hostToUnregisterOutputs match {
-          case Some(host) =>
-            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
-            mapOutputTracker.removeOutputsOnHost(host)
-          case None =>
-            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
-            mapOutputTracker.removeOutputsOnExecutor(execId)
-        }
-        clearCacheLocs()
-
-      } else {
-        logDebug("Additional executor lost message for %s (epoch %d)".format(execId, currentEpoch))
+    }
+    if (fileLost && (!fileLostEpoch.contains(execId) || fileLostEpoch(execId) < currentEpoch)) {

Review comment:
       No as we we would like to avoid calling `clearCacheLocs()` repeatedly for each fetch failure occurring within this stage. Fetch failures are coming from this map side executor but there could be several reducer side executors requesting block from that one and all would be generating a call here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r441671240



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,22 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Removing executor $execId, fileLost: $fileLost, currentEpoch: $currentEpoch")

Review comment:
       See explanation in my response to your comment about the previous logDebug.
   There are now two if clauses. I think it is simpler just to log the values at the beginning of the method; knowing the values will tell you which if clauses will get executed. But more importantly, this logDebug tells you that the method is invoked. Then depending on whether there is additional log output, you know if one or both of the if clauses are executed as well. (If it is inside an if, then you don't know if the method was invoked but if conditions were not met.)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645658081






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r451925255



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1912,12 +1934,8 @@ private[spark] class DAGScheduler(
    * modify the scheduler's internal state. Use executorLost() to post a loss event from outside.
    *
    * We will also assume that we've lost all shuffle blocks associated with the executor if the
-   * executor serves its own blocks (i.e., we're not using external shuffle), the entire slave
-   * is lost (likely including the shuffle service), or a FetchFailed occurred, in which case we
-   * presume all shuffle data related to this executor to be lost.
-   *
-   * Optionally the epoch during which the failure was caught can be passed to avoid allowing
-   * stray fetch failures from possibly retriggering the detection of a node as lost.
+   * executor serves its own blocks (i.e., we're not using external shuffle), or the Standalone
+   * worker (which serves the shuffle data) is lost.

Review comment:
       `or the Standalone worker` -> `or the external shuffle service`?
   
   (Standalone worker itself not really serves the shuffle data)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r452590712



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1912,12 +1934,8 @@ private[spark] class DAGScheduler(
    * modify the scheduler's internal state. Use executorLost() to post a loss event from outside.
    *
    * We will also assume that we've lost all shuffle blocks associated with the executor if the
-   * executor serves its own blocks (i.e., we're not using external shuffle), the entire slave
-   * is lost (likely including the shuffle service), or a FetchFailed occurred, in which case we
-   * presume all shuffle data related to this executor to be lost.
-   *
-   * Optionally the epoch during which the failure was caught can be passed to avoid allowing
-   * stray fetch failures from possibly retriggering the detection of a node as lost.
+   * executor serves its own blocks (i.e., we're not using external shuffle), or the Standalone
+   * worker (which serves the shuffle data) is lost.

Review comment:
       I do mean the Standalone worker is lost. This is the only case in which `workerLost=true`. I believe I already explained to you previously that only Spark Standalone can produce `SlaveLost(_, true)` which is what results in `workerLost=true`. I can remove the phrase "(which serves the shuffle data)" if that is inaccurate.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646167955






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662099772






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662161103






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r443083211



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,24 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Considering removal of executor $execId; " +
+      s"fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
-      logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
+      logInfo(s"Executor lost: $execId (epoch $currentEpoch)")

Review comment:
       nit. If you don't mind, can we avoid a style change like this in order to focus the feature?
   ```
   - logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
   + logInfo(s"Executor lost: $execId (epoch $currentEpoch)")
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646270339


   **[Test build #124226 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124226/testReport)** for PR 28848 at commit [`4b18369`](https://github.com/apache/spark/commit/4b1836961cfcd718589ddd5c6a768748e174b533).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-649707311


   @wypoon if you have not started extending the test with the multiple fetch failures case you can use this I you agree with it:
   https://github.com/attilapiros/spark/commit/be14a51ca766711d793d9a7314a2cf030e2acdc7


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] squito commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
squito commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r444508358



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -177,6 +177,8 @@ private[spark] class DAGScheduler(
   // TODO: Garbage collect information about failure epochs when we know there are no more
   //       stray messages to detect.
   private val failedEpoch = new HashMap[String, Long]
+  // In addition, track epoch for failed executors that result in lost file output

Review comment:
       yeah I really agree with this.  I think we need to rename both of these epochs.  I think I finally get why they both matter. One tracks files owned by the external shuffle service -- so you don't update those files when you learn that an executor is lost.  The other tracks files owned by the executor itself.  So that includes memory-cached blocks always, and also shuffle files w/out the external shuffle service.
   
   Once I wrapped my head around that part, the change in the logic made sense.
   
   I will think some more about better names and comments which would help with that part




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-657812860






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-655482429


   retest this please
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646903272






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] squito edited a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
squito edited a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-648856169


   I've been thinking about this more, and in particular why there are two different epochs and what the meaning of both is.   I tried out another change -- I just moved the `failedEpoch(execId) = currentEpoch` in [`removeExecutorAndUnregisterOutputs`](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1943) inside the `if(fileLost)`.  With that change, all tests in `org.apache.spark.scheduler` pass, including your added test (and I confirmed that your test does indeed fail with the current code in master).
   
   The idea is this:
   the purpose of the epochs is to know what to do when you get a fetch failure.  Epochs tell you if you've already dealt with an "equivalent" failure or not, and whether or not to consider more shuffle output as being lost.  Without the shuffle service, you remove shuffle output when there is an executor failure OR there is a fetch failure.  With the shuffle service, you only do it when there is a fetch failure.  The code already has those two branches, controlled by `fileLost`.
   
   But, I think there is actually a problem there.  
   
   I think I'm mostly repeating stuff @wypoon and @attilapiros said before -- I hope that restating it might help others out.  but it also pointed out to me a lack of test coverage we have.  Particularly that `blockManagerMaster.removeExecutor()` is only called once for multiple fetch failures (or a mix of executorLost + fetch failures).
   
   I'm thinking we should rename the epochs to `blockManagerFailureEpoch` and `externalShuffleFailureEpoch`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650015377


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654557231


   **[Test build #125160 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125160/testReport)** for PR 28848 at commit [`0ece4ac`](https://github.com/apache/spark/commit/0ece4acc0bb76c58ff12e710b60289e206a8703f).
    * This patch **fails to generate documentation**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r441676133



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1986,6 +1986,9 @@ private[spark] class DAGScheduler(
       logInfo("Host added was in lost list earlier: " + host)
       failedEpoch -= execId
     }
+    if (fileLostEpoch.contains(execId)) {

Review comment:
       I see. Ok, I'll remove this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tgravescs commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
tgravescs commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-648171843


   I'll take a look at the code, but could you please clarify the description as I'm not completely understanding what your approach is here.
   
   The intention is that with the external shuffle service an executor lost doesn't mean the node is lost.  Are you saying you are differentiating that somehow?
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r452586406



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -248,17 +252,18 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
    */
   val cacheLocations = new HashMap[(Int, Int), Seq[BlockManagerId]]
   // stub out BlockManagerMaster.getLocations to use our cacheLocations
-  val blockManagerMaster = new BlockManagerMaster(null, null, conf, true) {
-      override def getLocations(blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
-        blockIds.map {
-          _.asRDDId.map(id => (id.rddId -> id.splitIndex)).flatMap(key => cacheLocations.get(key)).
-            getOrElse(Seq())
-        }.toIndexedSeq
-      }
-      override def removeExecutor(execId: String): Unit = {
-        // don't need to propagate to the driver, which we don't have
-      }
+  class MyBlockManagerMaster(conf: SparkConf) extends BlockManagerMaster(null, null, conf, true) {
+    override def getLocations(blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
+      blockIds.map {
+        _.asRDDId.map { id => (id.rddId -> id.splitIndex)
+        }.flatMap { key => cacheLocations.get(key)
+        }.getOrElse(Seq())

Review comment:
       @squito requested this change, to conform with what he said was the preferred scala style (map or flatMap with "=>" should use braces rather than parentheses).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647953071


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645151419


   **[Test build #124153 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124153/testReport)** for PR 28848 at commit [`4e976ab`](https://github.com/apache/spark/commit/4e976ab16e922dfe28125798461e45afaa1d62a7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647867134






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646174885






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r442369811



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -540,6 +540,43 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("[SPARK-32003] All shuffle files for executor should be cleaned up on fetch failure") {
+    // reset the test context with the right shuffle service config
+    afterEach()
+    val conf = new SparkConf()
+    conf.set(config.SHUFFLE_SERVICE_ENABLED.key, "true")
+    init(conf)
+
+    val shuffleMapRdd = new MyRDD(sc, 3, Nil)
+    val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(3))
+    val shuffleId = shuffleDep.shuffleId
+    val reduceRdd = new MyRDD(sc, 3, List(shuffleDep), tracker = mapOutputTracker)
+
+    submit(reduceRdd, Array(0, 1, 2))
+    // Map stage completes successfully,
+    // two tasks are run on an executor on hostA and one on an executor on hostB
+    completeShuffleMapStageSuccessfully(0, 0, 3, Seq("hostA", "hostA", "hostB"))
+    // Now the executor on hostA is lost
+    runEvent(ExecutorLost("hostA-exec", ExecutorExited(-100, false, "Container marked as failed")))
+
+    // The MapOutputTracker has all the shuffle files
+    val initialMapStatuses = mapOutputTracker.shuffleStatuses(shuffleId).mapStatuses
+    assert(initialMapStatuses.count(_ != null) === 3)
+    assert(initialMapStatuses.count(s => s != null && s.location.executorId == "hostA-exec") === 2)
+    assert(initialMapStatuses.count(s => s != null && s.location.executorId == "hostB-exec") === 1)
+
+    // Now a fetch failure from the lost executor occurs
+    complete(taskSets(1), Seq(
+      (FetchFailed(makeBlockManagerId("hostA"), shuffleId, 0L, 0, 0, "ignored"), null)
+    ))
+
+    // Shuffle files for hostA-exec should be lost
+    val mapStatuses = mapOutputTracker.shuffleStatuses(shuffleId).mapStatuses
+    assert(mapStatuses.count(_ != null) === 1)
+    assert(initialMapStatuses.count(s => s != null && s.location.executorId == "hostA-exec") === 0)
+    assert(initialMapStatuses.count(s => s != null && s.location.executorId == "hostB-exec") === 1)

Review comment:
       You are absolutely right; `shuffleStatuses` in the MapOutputTracker is a mutable `Map` and `mapStatuses` in the values are mutable `Array`s. So everything is updated in place. (And I accidentally just used `initialMapStatuses` for the verification anyway! (due to careless copy-and-paste))
   The reason I'd gotten `mapStatuses` twice is because I followed the pattern in the test preceding this one.
   Thanks for catching this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647866462


   **[Test build #124374 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124374/testReport)** for PR 28848 at commit [`633a0e7`](https://github.com/apache/spark/commit/633a0e7841c9a9f9175bed510120ef2fe66ebe00).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] squito commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
squito commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662095979


   Jenkins, retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645748224






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r445965319



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -177,6 +177,8 @@ private[spark] class DAGScheduler(
   // TODO: Garbage collect information about failure epochs when we know there are no more
   //       stray messages to detect.
   private val failedEpoch = new HashMap[String, Long]

Review comment:
       I changed `failedEpoch` to `executorFailureEpoch` and more or less adopted your suggestion for the comment explaining it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645255732


   **[Test build #124160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124160/testReport)** for PR 28848 at commit [`4e976ab`](https://github.com/apache/spark/commit/4e976ab16e922dfe28125798461e45afaa1d62a7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-648016232


   I checked the failure (org.apache.spark.scheduler.BarrierTaskContextSuite.successively sync with allGather and barrier) and locally there is no such problem must be flaky.
   
   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-658485657






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-658485359


   **[Test build #125863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125863/testReport)** for PR 28848 at commit [`0e00862`](https://github.com/apache/spark/commit/0e0086288f6279569e8a11cef9d928b87c40469b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r451925745



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1933,29 +1951,45 @@ private[spark] class DAGScheduler(
       maybeEpoch = None)
   }
 
+  /**
+   * Handles removing an executor from the BlockManagerMaster as well as unregistering shuffle
+   * outputs for the executor or optionally its host.
+   *
+   * @param execId executor to be removed
+   * @param fileLost If true, indicates that we assume we've lost all shuffle blocks associated
+   *   with the executor; this happens if the executor serves its own blocks (i.e., we're not
+   *   using external shuffle), the Standalone worker (which serves the shuffle data) is lost,

Review comment:
       `external shuffle` -> `external shuffle service`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647907550






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-658533182


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-649963283






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654557248


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650115505






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r442599457



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,23 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Removing executor $execId, fileLost: $fileLost, currentEpoch: $currentEpoch")

Review comment:
       > See explanation in my response to your comment about the previous logDebug.
   But more importantly, this logDebug tells you that the method is invoked. Then depending on whether there is additional log output, you know if one or both of the if clauses are executed as well. (Knowing the value of fileLost may rule out one of the clauses.)
   
   
   
   What if the executor does not meet the if condition? In that case, the executor is not going to be removed. @wypoon 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662099772






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-649962628


   **[Test build #124530 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124530/testReport)** for PR 28848 at commit [`d09ef93`](https://github.com/apache/spark/commit/d09ef9335e5d3657b830497155abb7a0c2bb0cde).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645658381


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124180/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646848969






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-657812167


   **[Test build #125784 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125784/testReport)** for PR 28848 at commit [`4301291`](https://github.com/apache/spark/commit/430129195b222a51bd525ad6b012ab9b32981b25).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646902839


   Retest this please.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-649963283






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645657591


   **[Test build #124180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124180/testReport)** for PR 28848 at commit [`4c0b98c`](https://github.com/apache/spark/commit/4c0b98cc30cefb9745fa1c1232022bfcb55077cd).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-652750696






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r442610030



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,23 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Removing executor $execId, fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
-      logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
+      logInfo(s"Executor lost: $execId (epoch $currentEpoch)")
       blockManagerMaster.removeExecutor(execId)
-      if (fileLost) {
-        hostToUnregisterOutputs match {
-          case Some(host) =>
-            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
-            mapOutputTracker.removeOutputsOnHost(host)
-          case None =>
-            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
-            mapOutputTracker.removeOutputsOnExecutor(execId)
-        }
-        clearCacheLocs()
-
-      } else {
-        logDebug("Additional executor lost message for %s (epoch %d)".format(execId, currentEpoch))
+    }
+    if (fileLost && (!fileLostEpoch.contains(execId) || fileLostEpoch(execId) < currentEpoch)) {

Review comment:
       I see.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] squito commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
squito commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r444939999



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,24 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Considering removal of executor $execId; " +
+      s"fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
-      logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
+      logInfo(s"Executor lost: $execId (epoch $currentEpoch)")

Review comment:
       while I don't think this is a big deal either way, I also try to follow the "boy scout rule" as long as its fixing stuff very close to the actual change.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] squito commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
squito commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r458853672



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -548,6 +560,56 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("SPARK-32003: All shuffle files for executor should be cleaned up on fetch failure") {

Review comment:
       personally, I'm OK with this as is, I think its OK for some of the details to be down in the test itself and balance a super-duper long name.
   
   @dongjoon-hyun since you called this a nit I'm assuming you're OK with me merging this anyhow, but if not lemme know, can submit a quick followup.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647953078


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124388/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r452588463



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1912,12 +1934,8 @@ private[spark] class DAGScheduler(
    * modify the scheduler's internal state. Use executorLost() to post a loss event from outside.
    *
    * We will also assume that we've lost all shuffle blocks associated with the executor if the
-   * executor serves its own blocks (i.e., we're not using external shuffle), the entire slave
-   * is lost (likely including the shuffle service), or a FetchFailed occurred, in which case we
-   * presume all shuffle data related to this executor to be lost.
-   *
-   * Optionally the epoch during which the failure was caught can be passed to avoid allowing
-   * stray fetch failures from possibly retriggering the detection of a node as lost.
+   * executor serves its own blocks (i.e., we're not using external shuffle), or the Standalone

Review comment:
       I have no problem with this suggestion. I was simply using what was there before.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-657729251






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r441865418



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -540,6 +540,46 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("[SPARK-32003] All shuffle files for executor should be cleaned up on fetch failure") {
+    // reset the test context with the right shuffle service config
+    afterEach()
+    val conf = new SparkConf()
+    conf.set(config.SHUFFLE_SERVICE_ENABLED.key, "true")
+    init(conf)
+
+    val shuffleMapRdd = new MyRDD(sc, 3, Nil)
+    val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(3))
+    val shuffleId = shuffleDep.shuffleId
+    val reduceRdd = new MyRDD(sc, 3, List(shuffleDep), tracker = mapOutputTracker)
+
+    submit(reduceRdd, Array(0, 1, 2))
+    // Map stage completes successfully,
+    // two tasks are run on an executor on hostA and one on an executor on hostB
+    complete(taskSets(0), Seq(
+      (Success, makeMapStatus("hostA", 3)),
+      (Success, makeMapStatus("hostA", 3)),
+      (Success, makeMapStatus("hostB", 3))))

Review comment:
       I adopted your suggestion.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r445965785



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -177,6 +177,8 @@ private[spark] class DAGScheduler(
   // TODO: Garbage collect information about failure epochs when we know there are no more
   //       stray messages to detect.
   private val failedEpoch = new HashMap[String, Long]
+  // In addition, track epoch for failed executors that result in lost file output

Review comment:
       I changed `fileLostEpoch` to `shuffleFileLostEpoch` and more or less adopted your suggestion for the comment explaining it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-652684878






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r451925745



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1933,29 +1951,45 @@ private[spark] class DAGScheduler(
       maybeEpoch = None)
   }
 
+  /**
+   * Handles removing an executor from the BlockManagerMaster as well as unregistering shuffle
+   * outputs for the executor or optionally its host.
+   *
+   * @param execId executor to be removed
+   * @param fileLost If true, indicates that we assume we've lost all shuffle blocks associated
+   *   with the executor; this happens if the executor serves its own blocks (i.e., we're not
+   *   using external shuffle), the Standalone worker (which serves the shuffle data) is lost,

Review comment:
        `... external shuffle service), or the external shuffle service...`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-657864098


   **[Test build #125785 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125785/testReport)** for PR 28848 at commit [`56f3ae5`](https://github.com/apache/spark/commit/56f3ae546ecdf3afe8d3217db410a5527be09aca).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654552544


   **[Test build #125160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125160/testReport)** for PR 28848 at commit [`0ece4ac`](https://github.com/apache/spark/commit/0ece4acc0bb76c58ff12e710b60289e206a8703f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r443029105



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,23 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Removing executor $execId, fileLost: $fileLost, currentEpoch: $currentEpoch")

Review comment:
       Ah, you're objecting to the *wording* of the debug message?
   I'll change it from "Removing executor ..." to "Considering removal of executor ...".




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r442599457



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,23 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Removing executor $execId, fileLost: $fileLost, currentEpoch: $currentEpoch")

Review comment:
       > See explanation in my response to your comment about the previous logDebug.
   But more importantly, this logDebug tells you that the method is invoked. Then depending on whether there is additional log output, you know if one or both of the if clauses are executed as well. (Knowing the value of fileLost may rule out one of the clauses.)
   
   
   
   What if the executor does not meet the if condition? In that case, the executor is not going to be removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] squito commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
squito commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-649628531


   yeah I was suggesting a change in the logic a bit to match those notions of epochs (same end behavior, I just thought it might make the code easier to follow).  Maybe I'm wrong about that, though ... I might try to play with it a bit later.  But for now, I'll make a pass of suggestions for renaming & comments following what's proposed here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645702279






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tgravescs commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
tgravescs commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-648329200


   sorry if I wasn't clear.  I think this approach of having the fileLostEpoch is better so we avoid the locking in MapOutputTracker. Personally I wouldn't mind fileLost being renamed to something more obvious,  like unregisterShuffleOutput .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646173660


   **[Test build #124226 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124226/testReport)** for PR 28848 at commit [`4b18369`](https://github.com/apache/spark/commit/4b1836961cfcd718589ddd5c6a768748e174b533).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r445721856



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,24 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Considering removal of executor $execId; " +
+      s"fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
-      logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
+      logInfo(s"Executor lost: $execId (epoch $currentEpoch)")
       blockManagerMaster.removeExecutor(execId)
-      if (fileLost) {
-        hostToUnregisterOutputs match {
-          case Some(host) =>
-            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
-            mapOutputTracker.removeOutputsOnHost(host)
-          case None =>
-            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
-            mapOutputTracker.removeOutputsOnExecutor(execId)
-        }
-        clearCacheLocs()
-
-      } else {
-        logDebug("Additional executor lost message for %s (epoch %d)".format(execId, currentEpoch))
+    }
+    if (fileLost && (!fileLostEpoch.contains(execId) || fileLostEpoch(execId) < currentEpoch)) {
+      fileLostEpoch(execId) = currentEpoch
+      hostToUnregisterOutputs match {
+        case Some(host) =>
+          logInfo(s"Shuffle files lost for host: $host (epoch $currentEpoch)")
+          mapOutputTracker.removeOutputsOnHost(host)
+        case None =>
+          logInfo(s"Shuffle files lost for executor: $execId (epoch $currentEpoch)")
+          mapOutputTracker.removeOutputsOnExecutor(execId)
       }
+      clearCacheLocs()

Review comment:
       That's right as it is filled by locations coming from block manager master and removing an executor will [update the bock location](https://github.com/apache/spark/blob/be14a51ca766711d793d9a7314a2cf030e2acdc7/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L278). So the location cache must be invalidated. 
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654577911


   **[Test build #125168 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125168/testReport)** for PR 28848 at commit [`0ece4ac`](https://github.com/apache/spark/commit/0ece4acc0bb76c58ff12e710b60289e206a8703f).
    * This patch **fails to generate documentation**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-648293945


   @tgravescs as you point out, it is ok to call `mapOutputTracker.removeOutputsOnHost` or `mapOutputTracker.removeOutputsOnExecutor` multiple times with the same host/execId. However, there is a lock that gets used.
   For this reason, to avoid calling removeOutputs multiple times, @attilapiros suggested using `fileLostEpoch` to track if the removeOutputs has already been called. Is there a better name for it you wish to suggest?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r451928121



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -248,17 +252,18 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
    */
   val cacheLocations = new HashMap[(Int, Int), Seq[BlockManagerId]]
   // stub out BlockManagerMaster.getLocations to use our cacheLocations
-  val blockManagerMaster = new BlockManagerMaster(null, null, conf, true) {
-      override def getLocations(blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
-        blockIds.map {
-          _.asRDDId.map(id => (id.rddId -> id.splitIndex)).flatMap(key => cacheLocations.get(key)).
-            getOrElse(Seq())
-        }.toIndexedSeq
-      }
-      override def removeExecutor(execId: String): Unit = {
-        // don't need to propagate to the driver, which we don't have
-      }
+  class MyBlockManagerMaster(conf: SparkConf) extends BlockManagerMaster(null, null, conf, true) {
+    override def getLocations(blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
+      blockIds.map {
+        _.asRDDId.map { id => (id.rddId -> id.splitIndex)
+        }.flatMap { key => cacheLocations.get(key)
+        }.getOrElse(Seq())

Review comment:
       unnecessary change?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646848537


   **[Test build #124300 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124300/testReport)** for PR 28848 at commit [`0069e71`](https://github.com/apache/spark/commit/0069e71118587a30911165e762e22e95fad97fe1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646848969






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646903497






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-652750138


   **[Test build #124824 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124824/testReport)** for PR 28848 at commit [`5e233ac`](https://github.com/apache/spark/commit/5e233ac6281e1ccb371401a965f7d9c33ce2f54a).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r446446147



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1933,29 +1946,43 @@ private[spark] class DAGScheduler(
       maybeEpoch = None)
   }
 
+  /**
+   * Handles removing an executor from the BlockManagerMaster as well as unregistering shuffle
+   * outputs for the executor or optionally its host.
+   *
+   * The fileLost parameter being true indicates that we assume we've lost all shuffle blocks

Review comment:
       Added parameters to doc comment.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-658942726


   @wypoon 
   
   Check this out: https://issues.apache.org/jira/browse/SPARK-32303


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-657728722


   **[Test build #125785 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125785/testReport)** for PR 28848 at commit [`56f3ae5`](https://github.com/apache/spark/commit/56f3ae546ecdf3afe8d3217db410a5527be09aca).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646398922


   @wypoon Just a reminder, please do not resolve conversations before we reach into an agreement. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645658081






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646271764






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645255732


   **[Test build #124160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124160/testReport)** for PR 28848 at commit [`4e976ab`](https://github.com/apache/spark/commit/4e976ab16e922dfe28125798461e45afaa1d62a7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646166931


   **[Test build #124225 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124225/testReport)** for PR 28848 at commit [`4b18369`](https://github.com/apache/spark/commit/4b1836961cfcd718589ddd5c6a768748e174b533).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r451924824



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1912,12 +1934,8 @@ private[spark] class DAGScheduler(
    * modify the scheduler's internal state. Use executorLost() to post a loss event from outside.
    *
    * We will also assume that we've lost all shuffle blocks associated with the executor if the
-   * executor serves its own blocks (i.e., we're not using external shuffle), the entire slave
-   * is lost (likely including the shuffle service), or a FetchFailed occurred, in which case we
-   * presume all shuffle data related to this executor to be lost.
-   *
-   * Optionally the epoch during which the failure was caught can be passed to avoid allowing
-   * stray fetch failures from possibly retriggering the detection of a node as lost.
+   * executor serves its own blocks (i.e., we're not using external shuffle), or the Standalone

Review comment:
       `external shuffle` -> `external shuffle service`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650446036


   **[Test build #124553 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124553/testReport)** for PR 28848 at commit [`5e233ac`](https://github.com/apache/spark/commit/5e233ac6281e1ccb371401a965f7d9c33ce2f54a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-657718459






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-657812860






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645486137


   Thanks @attilapiros for the review and suggestions. Thanks @Ngone51 for reviewing as well.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] asfgit closed pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #28848:
URL: https://github.com/apache/spark/pull/28848


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] squito commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
squito commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662502026


   merged to master, thanks everyone!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662160753


   **[Test build #126279 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126279/testReport)** for PR 28848 at commit [`0e00862`](https://github.com/apache/spark/commit/0e0086288f6279569e8a11cef9d928b87c40469b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r446446304



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -764,6 +825,7 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
       (FetchFailed(makeBlockManagerId("hostA"), shuffleId, 0L, 0, 0, "ignored"), null)))
     // this will get called
     // blockManagerMaster.removeExecutor("hostA-exec")

Review comment:
       I left the original comment as an explanation for the verify. But I can remove it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645329948


   **[Test build #124160 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124160/testReport)** for PR 28848 at commit [`4e976ab`](https://github.com/apache/spark/commit/4e976ab16e922dfe28125798461e45afaa1d62a7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-652750696


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r458554090



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -548,6 +560,56 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("SPARK-32003: All shuffle files for executor should be cleaned up on fetch failure") {

Review comment:
       nit. Do we need to describe `when external shuffle service is used` assumption here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646174885






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r443970197



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,24 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Considering removal of executor $execId; " +
+      s"fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
-      logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
+      logInfo(s"Executor lost: $execId (epoch $currentEpoch)")

Review comment:
       Guilty as charged. My thinking was the following: as the string interpolation preferred in the project and the change itself is quite tiny and has zero risk and we are here and touching these lines it would be good to do it now (aka Boy Scout Rule).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r441669857



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,22 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Removing executor $execId, fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
       logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
       blockManagerMaster.removeExecutor(execId)
-      if (fileLost) {
-        hostToUnregisterOutputs match {
-          case Some(host) =>
-            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
-            mapOutputTracker.removeOutputsOnHost(host)
-          case None =>
-            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
-            mapOutputTracker.removeOutputsOnExecutor(execId)
-        }
-        clearCacheLocs()
-
-      } else {
-        logDebug("Additional executor lost message for %s (epoch %d)".format(execId, currentEpoch))

Review comment:
       I think the purpose of this is to know that fileLost=false, as this is in the else part of the
   `if (fileLost) {...} else {...}` and in the if part, there will be logInfo output.
   I think it is simpler to simply have a logDebug on entering the method giving us the values of the key variables, including fileLost. Therefore this is not needed.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r441865293



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -540,6 +540,46 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("[SPARK-32003] All shuffle files for executor should be cleaned up on fetch failure") {
+    // reset the test context with the right shuffle service config
+    afterEach()
+    val conf = new SparkConf()
+    conf.set(config.SHUFFLE_SERVICE_ENABLED.key, "true")
+    init(conf)
+
+    val shuffleMapRdd = new MyRDD(sc, 3, Nil)
+    val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(3))
+    val shuffleId = shuffleDep.shuffleId
+    val reduceRdd = new MyRDD(sc, 3, List(shuffleDep), tracker = mapOutputTracker)
+
+    submit(reduceRdd, Array(0, 1, 2))
+    // Map stage completes successfully,
+    // two tasks are run on an executor on hostA and one on an executor on hostB
+    complete(taskSets(0), Seq(
+      (Success, makeMapStatus("hostA", 3)),
+      (Success, makeMapStatus("hostA", 3)),
+      (Success, makeMapStatus("hostB", 3))))
+    // Now the executor on hostA is lost
+    runEvent(ExecutorLost("hostA-exec", ExecutorExited(-100, false, "Container marked as failed")))
+
+    // The MapOutputTracker has all the shuffle files
+    val initialMapStatuses = mapOutputTracker.shuffleStatuses(shuffleId).mapStatuses
+    assert(initialMapStatuses.count(_ != null) == 3)
+    assert(initialMapStatuses(0).location.executorId === "hostA-exec")
+    assert(initialMapStatuses(1).location.executorId === "hostA-exec")
+    assert(initialMapStatuses(2).location.executorId === "hostB-exec")

Review comment:
       I adopted your suggestion but I just inlined the function.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645658365


   **[Test build #124180 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124180/testReport)** for PR 28848 at commit [`4c0b98c`](https://github.com/apache/spark/commit/4c0b98cc30cefb9745fa1c1232022bfcb55077cd).
    * This patch **fails build dependency tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654577951


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662261403


   cc @holdenk and @dbtsai 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-652683787


   **[Test build #124824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124824/testReport)** for PR 28848 at commit [`5e233ac`](https://github.com/apache/spark/commit/5e233ac6281e1ccb371401a965f7d9c33ce2f54a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654571303






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-658533186


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125863/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645254072


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650044495






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] squito commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
squito commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r445681220



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -177,6 +177,8 @@ private[spark] class DAGScheduler(
   // TODO: Garbage collect information about failure epochs when we know there are no more
   //       stray messages to detect.
   private val failedEpoch = new HashMap[String, Long]

Review comment:
       I suggest we rename this to `blockManagerFailureEpoch`, or perhaps `executorFailureEpoch` and change the comment significantly, to something like
   
   Tracks the latest epoch of a fully processed error related to the given executor.
   
   When an executor fails, it will affect the results of many tasks, and we have to deal with all of them consistently.  We don't simply ignore all future results from that executor, as the failures may have been transient; but we also don't want to "over-react" to the many follow-on errors we'll receive.  Furthermore, we might receive notification of a task success, after we find out the executor has actually failed; we'll assume those successes are, in fact, simply delayed notifications and the results have been lost, if they come from the same epoch.  In particular, we use this to control when we tell the blockManagerMaster that the blockManager has been lost.

##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,24 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Considering removal of executor $execId; " +
+      s"fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
-      logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
+      logInfo(s"Executor lost: $execId (epoch $currentEpoch)")
       blockManagerMaster.removeExecutor(execId)
-      if (fileLost) {
-        hostToUnregisterOutputs match {
-          case Some(host) =>
-            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
-            mapOutputTracker.removeOutputsOnHost(host)
-          case None =>
-            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
-            mapOutputTracker.removeOutputsOnExecutor(execId)
-        }
-        clearCacheLocs()
-
-      } else {
-        logDebug("Additional executor lost message for %s (epoch %d)".format(execId, currentEpoch))
+    }
+    if (fileLost && (!fileLostEpoch.contains(execId) || fileLostEpoch(execId) < currentEpoch)) {
+      fileLostEpoch(execId) = currentEpoch
+      hostToUnregisterOutputs match {
+        case Some(host) =>
+          logInfo(s"Shuffle files lost for host: $host (epoch $currentEpoch)")
+          mapOutputTracker.removeOutputsOnHost(host)
+        case None =>
+          logInfo(s"Shuffle files lost for executor: $execId (epoch $currentEpoch)")
+          mapOutputTracker.removeOutputsOnExecutor(execId)
       }
+      clearCacheLocs()

Review comment:
       shouldn't this be within the `failedEpoch(execId) < currentEpoch` set above?  This is about cached RDD blocks, which has to do with the blockManager; not the shuffle files.

##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -177,6 +177,8 @@ private[spark] class DAGScheduler(
   // TODO: Garbage collect information about failure epochs when we know there are no more
   //       stray messages to detect.
   private val failedEpoch = new HashMap[String, Long]
+  // In addition, track epoch for failed executors that result in lost file output

Review comment:
       how about `shuffleFileLostEpoch` here, and a comment building on the suggested comment above:
   
   Tracks the latest epoch of a fully processed error which indicated shuffle files have been lost from the given executor.
   
   This is closely related to `executorFailureEpoch`.  In fact, anytime an entry is added here, it must also be added to `executorFailureEpoch`, so `shuffleFileLostEpoch(exec) <= executorFailureEpoch(exec)`.  These only differ when the external shuffle service is enabled.  In that case, when an executor is lost, we do *not* update the epoch here; we wait for a fetch failure.  This way, if only the executor fails, we do not remove the shuffle data as it can still be served by the shuffle service; but if there is a failure in the shuffle service, we remove the shuffle data only once, though we are likely to get many failures.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-658485657






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645194360


   Build failure is unrelated. I have checked and there are some other PRs with the same error:
   
   ```
   [ERROR] Failed to execute goal org.apache.maven.plugins:maven-install-plugin:3.0.0-M1:install (default-cli) on project spark-parent_2.12: ArtifactInstallerException: Failed to install metadata org.apache.spark:spark-parent_2.12/maven-metadata.xml: Could not parse metadata /home/jenkins/.m2/repository/org/apache/spark/spark-parent_2.12/maven-metadata-local.xml: in epilog non whitespace content is not allowed but got > (position: END_TAG seen ...</metadata>\n>... @13:2) -> [Help 1]
   ``` 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650043889


   **[Test build #124535 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124535/testReport)** for PR 28848 at commit [`d09ef93`](https://github.com/apache/spark/commit/d09ef9335e5d3657b830497155abb7a0c2bb0cde).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r458554090



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -548,6 +560,56 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("SPARK-32003: All shuffle files for executor should be cleaned up on fetch failure") {

Review comment:
       nit. Do we need to describe `when external shuffle service is used` assumption here because we have `conf.set(config.SHUFFLE_SERVICE_ENABLED.key, "true")` at line 567?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r452593445



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1933,29 +1951,45 @@ private[spark] class DAGScheduler(
       maybeEpoch = None)
   }
 
+  /**
+   * Handles removing an executor from the BlockManagerMaster as well as unregistering shuffle
+   * outputs for the executor or optionally its host.
+   *
+   * @param execId executor to be removed
+   * @param fileLost If true, indicates that we assume we've lost all shuffle blocks associated
+   *   with the executor; this happens if the executor serves its own blocks (i.e., we're not
+   *   using external shuffle), the Standalone worker (which serves the shuffle data) is lost,

Review comment:
       I'm not sure I understand the point you're making here. The 3 conditions when `fileLost=true` are: (1) we're not using the external shuffle service (`!env.blockManager.externalShuffleServiceEnabled`); or (2) the Standalone worker is lost (as I explained above); or (3) a FetchFailed occurred.
   In the case of case of YARN + external shuffle service, `fileLost=true` happens only with a FetchFailed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647907228


   **[Test build #124388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124388/testReport)** for PR 28848 at commit [`633a0e7`](https://github.com/apache/spark/commit/633a0e7841c9a9f9175bed510120ef2fe66ebe00).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650472356


   **[Test build #124553 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124553/testReport)** for PR 28848 at commit [`5e233ac`](https://github.com/apache/spark/commit/5e233ac6281e1ccb371401a965f7d9c33ce2f54a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] squito edited a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
squito edited a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662096205


   lgtm still.  Since its been a few days and there was a bit of flakiness before, I triggered one more set of tests just to make sure everything is still OK and then will merge


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646170970


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-655000416


   Thanks @dongjoon-hyun for the move to disable the doc generation in Jenkins.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r442597972



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -540,6 +540,46 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("[SPARK-32003] All shuffle files for executor should be cleaned up on fetch failure") {
+    // reset the test context with the right shuffle service config
+    afterEach()

Review comment:
       I see, make sense.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645423126


   cc @jiangxb1987 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662003702


   @tgravescs @squito I made some small tweaks to the scaladoc comments and also had to rebase since you approved the change. It has been open for some time now to any additional feedback. Can you give it the once over again and merge it?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-649874411


   > @wypoon if you have not started extending the test with the multiple fetch failures case you can use this I you agree with it:
   > [attilapiros@be14a51](https://github.com/attilapiros/spark/commit/be14a51ca766711d793d9a7314a2cf030e2acdc7)
   
   @attilapiros thanks for the code; that is very helpful. I had an offline chat with @squito, and he had a different test in mind, but in a similar spirit. He was thinking of a test to verify that in `DAGScheduler`, `blockManagerMaster.removeExecutor` is not called more than once after the executor is lost. I can use your approach (using Mockito spy) there as well.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662468893


   still LGTM


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r451924286



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -170,13 +170,34 @@ private[spark] class DAGScheduler(
    */
   private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
 
-  // For tracking failed nodes, we use the MapOutputTracker's epoch number, which is sent with
-  // every task. When we detect a node failing, we note the current epoch number and failed
-  // executor, increment it for new tasks, and use this to ignore stray ShuffleMapTask results.
-  //
-  // TODO: Garbage collect information about failure epochs when we know there are no more
-  //       stray messages to detect.
-  private val failedEpoch = new HashMap[String, Long]
+  /**
+   * Tracks the latest epoch of a fully processed error related to the given executor. (We use
+   * the MapOutputTracker's epoch number, which is sent with every task.)
+   *
+   * When an executor fails, it can affect the results of many tasks, and we have to deal with
+   * all of them consistently. We don't simply ignore all future results from that executor,
+   * as the failures may have been transient; but we also don't want to "overreact" to follow-
+   * on errors we receive. Furthermore, we might receive notification of a task success, after
+   * we find out the executor has actually failed; we'll assume those successes are, in fact,
+   * simply delayed notifications and the results have been lost, if the tasks started in the
+   * same or an earlier epoch. In particular, we use this to control when we tell the
+   * BlockManagerMaster that the BlockManager has been lost.
+   */
+  private val executorFailureEpoch = new HashMap[String, Long]
+
+  /**
+   * Tracks the latest epoch of a fully processed error where shuffle files have been lost from
+   * the given executor.
+   *
+   * This is closely related to executorFailureEpoch.
+   * They only differ for the executor when there is a Standalone worker or an external shuffle

Review comment:
       `Standalone worker` -> `BlockManager`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650042030


   jenkins retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646169373


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124225/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] squito commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
squito commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662096205


   lgtm still.  Since its been a few days and there was a bit of flakiness before, I triggered one more set of tests just to make sure


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645256358






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r444377024



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -177,6 +177,8 @@ private[spark] class DAGScheduler(
   // TODO: Garbage collect information about failure epochs when we know there are no more
   //       stray messages to detect.
   private val failedEpoch = new HashMap[String, Long]
+  // In addition, track epoch for failed executors that result in lost file output

Review comment:
       I'll try to expand on this comment.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-658485359


   **[Test build #125863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125863/testReport)** for PR 28848 at commit [`0e00862`](https://github.com/apache/spark/commit/0e0086288f6279569e8a11cef9d928b87c40469b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650044495






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r445738252



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,24 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Considering removal of executor $execId; " +
+      s"fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
-      logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
+      logInfo(s"Executor lost: $execId (epoch $currentEpoch)")
       blockManagerMaster.removeExecutor(execId)
-      if (fileLost) {
-        hostToUnregisterOutputs match {
-          case Some(host) =>
-            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
-            mapOutputTracker.removeOutputsOnHost(host)
-          case None =>
-            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
-            mapOutputTracker.removeOutputsOnExecutor(execId)
-        }
-        clearCacheLocs()
-
-      } else {
-        logDebug("Additional executor lost message for %s (epoch %d)".format(execId, currentEpoch))
+    }
+    if (fileLost && (!fileLostEpoch.contains(execId) || fileLostEpoch(execId) < currentEpoch)) {
+      fileLostEpoch(execId) = currentEpoch
+      hostToUnregisterOutputs match {
+        case Some(host) =>
+          logInfo(s"Shuffle files lost for host: $host (epoch $currentEpoch)")
+          mapOutputTracker.removeOutputsOnHost(host)
+        case None =>
+          logInfo(s"Shuffle files lost for executor: $execId (epoch $currentEpoch)")
+          mapOutputTracker.removeOutputsOnExecutor(execId)
       }
+      clearCacheLocs()

Review comment:
       Yes, you are right. This should be in the other if clause.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-655396381


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125340/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645149561






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647866462


   **[Test build #124374 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124374/testReport)** for PR 28848 at commit [`633a0e7`](https://github.com/apache/spark/commit/633a0e7841c9a9f9175bed510120ef2fe66ebe00).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646909066


   Thank you so much for this contribution, @wypoon and @attilapiros .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-648017945


   let me take a look at the flay barrier test. we've actually already fixed a few of them.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r458554090



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -548,6 +560,56 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("SPARK-32003: All shuffle files for executor should be cleaned up on fetch failure") {

Review comment:
       Do we need `when external shuffle service is used` condition here?

##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -548,6 +560,56 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("SPARK-32003: All shuffle files for executor should be cleaned up on fetch failure") {

Review comment:
       Do we need `when external shuffle service is used` assumption here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-655392272






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646849212


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124300/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r441671240



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,22 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Removing executor $execId, fileLost: $fileLost, currentEpoch: $currentEpoch")

Review comment:
       See explanation in my response to your comment about the previous logDebug.
   But more importantly, this logDebug tells you that the method is invoked. Then depending on whether there is additional log output, you know if one or both of the if clauses are executed as well. (Knowing the value of fileLost may rule out one of the clauses.)
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r443083344



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,24 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Considering removal of executor $execId; " +
+      s"fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
-      logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
+      logInfo(s"Executor lost: $execId (epoch $currentEpoch)")
       blockManagerMaster.removeExecutor(execId)
-      if (fileLost) {
-        hostToUnregisterOutputs match {
-          case Some(host) =>
-            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
-            mapOutputTracker.removeOutputsOnHost(host)
-          case None =>
-            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
-            mapOutputTracker.removeOutputsOnExecutor(execId)
-        }
-        clearCacheLocs()
-
-      } else {
-        logDebug("Additional executor lost message for %s (epoch %d)".format(execId, currentEpoch))
+    }
+    if (fileLost && (!fileLostEpoch.contains(execId) || fileLostEpoch(execId) < currentEpoch)) {
+      fileLostEpoch(execId) = currentEpoch
+      hostToUnregisterOutputs match {
+        case Some(host) =>
+          logInfo(s"Shuffle files lost for host: $host (epoch $currentEpoch)")

Review comment:
       ditto.

##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,24 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Considering removal of executor $execId; " +
+      s"fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
-      logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
+      logInfo(s"Executor lost: $execId (epoch $currentEpoch)")
       blockManagerMaster.removeExecutor(execId)
-      if (fileLost) {
-        hostToUnregisterOutputs match {
-          case Some(host) =>
-            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
-            mapOutputTracker.removeOutputsOnHost(host)
-          case None =>
-            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
-            mapOutputTracker.removeOutputsOnExecutor(execId)
-        }
-        clearCacheLocs()
-
-      } else {
-        logDebug("Additional executor lost message for %s (epoch %d)".format(execId, currentEpoch))
+    }
+    if (fileLost && (!fileLostEpoch.contains(execId) || fileLostEpoch(execId) < currentEpoch)) {
+      fileLostEpoch(execId) = currentEpoch
+      hostToUnregisterOutputs match {
+        case Some(host) =>
+          logInfo(s"Shuffle files lost for host: $host (epoch $currentEpoch)")
+          mapOutputTracker.removeOutputsOnHost(host)
+        case None =>
+          logInfo(s"Shuffle files lost for executor: $execId (epoch $currentEpoch)")

Review comment:
       ditto.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r442610125



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,22 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Removing executor $execId, fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
       logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
       blockManagerMaster.removeExecutor(execId)
-      if (fileLost) {
-        hostToUnregisterOutputs match {
-          case Some(host) =>
-            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
-            mapOutputTracker.removeOutputsOnHost(host)
-          case None =>
-            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
-            mapOutputTracker.removeOutputsOnExecutor(execId)
-        }
-        clearCacheLocs()
-
-      } else {
-        logDebug("Additional executor lost message for %s (epoch %d)".format(execId, currentEpoch))
+    }
+    if (fileLost && (!fileLostEpoch.contains(execId) || fileLostEpoch(execId) < currentEpoch)) {

Review comment:
       I see.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r446447794



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,24 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Considering removal of executor $execId; " +
+      s"fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
-      logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
+      logInfo(s"Executor lost: $execId (epoch $currentEpoch)")

Review comment:
       @dongjoon-hyun based on Attila's and Imran's comments, I'm inclined to stick with the string interpolation here and below, since it is all within the method I'm actively modifying.
   Are you ok with resolving this then?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646903133


   **[Test build #124305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124305/testReport)** for PR 28848 at commit [`0069e71`](https://github.com/apache/spark/commit/0069e71118587a30911165e762e22e95fad97fe1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647907550






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654557248






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-649670581


   +1 to test the case with multiple fetch failures. For this `mapOutputTracker` could be wrapped around a mockito's spy and then the verify can be used to check the unregistering occurred only once.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646903133


   **[Test build #124305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124305/testReport)** for PR 28848 at commit [`0069e71`](https://github.com/apache/spark/commit/0069e71118587a30911165e762e22e95fad97fe1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon edited a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon edited a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-648293945


   @tgravescs as you point out, it is ok to call `mapOutputTracker.removeOutputsOnHost` or `mapOutputTracker.removeOutputsOnExecutor` multiple times with the same host/execId. However, there is a lock that gets used.
   For this reason, to avoid calling removeOutputs multiple times, @attilapiros suggested using `fileLostEpoch` to track if the removeOutputs has already been called. Is there a better name for it you wish to suggest?
   Or do you still think that it is fine to call `mapOutputTracker.removeOutputsOnHost` or `mapOutputTracker.removeOutputsOnExecutor` multiple times?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r443083211



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,24 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Considering removal of executor $execId; " +
+      s"fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
-      logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
+      logInfo(s"Executor lost: $execId (epoch $currentEpoch)")

Review comment:
       If you don't mind, can we avoid a style change like this in order to focus the feature?
   ```
   - logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
   + logInfo(s"Executor lost: $execId (epoch $currentEpoch)")
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645151419


   **[Test build #124153 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124153/testReport)** for PR 28848 at commit [`4e976ab`](https://github.com/apache/spark/commit/4e976ab16e922dfe28125798461e45afaa1d62a7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646903272






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654557256


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125160/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r441678652



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,22 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Removing executor $execId, fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
       logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
       blockManagerMaster.removeExecutor(execId)
-      if (fileLost) {
-        hostToUnregisterOutputs match {
-          case Some(host) =>
-            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
-            mapOutputTracker.removeOutputsOnHost(host)
-          case None =>
-            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
-            mapOutputTracker.removeOutputsOnExecutor(execId)
-        }
-        clearCacheLocs()
-
-      } else {
-        logDebug("Additional executor lost message for %s (epoch %d)".format(execId, currentEpoch))
+    }
+    if (fileLost && (!fileLostEpoch.contains(execId) || fileLostEpoch(execId) < currentEpoch)) {

Review comment:
       As Attila explained, the purpose of fileLostEpoch is to track if this code has been called once for that executor and epoch, to avoid repeated calls to unregister the outputs, since there will usually be multiple fetch failures causing this method to be called.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tgravescs commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
tgravescs commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-648187620


   I assume this is the case where spark.files.fetchFailure.unRegisterOutputOnHost is turned on with external shuffle service.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654577951






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650115505






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654572280






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650446036


   **[Test build #124553 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124553/testReport)** for PR 28848 at commit [`5e233ac`](https://github.com/apache/spark/commit/5e233ac6281e1ccb371401a965f7d9c33ce2f54a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] squito commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
squito commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662502752


   @wypoon  sorry, I mistakenly only merged this to master, but I should have also merged it to branch-3.0.  Would you mind also opening a pr targeting branch-3.0?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r446445975



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -170,13 +170,29 @@ private[spark] class DAGScheduler(
    */
   private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
 
-  // For tracking failed nodes, we use the MapOutputTracker's epoch number, which is sent with
-  // every task. When we detect a node failing, we note the current epoch number and failed
-  // executor, increment it for new tasks, and use this to ignore stray ShuffleMapTask results.
+  // Tracks the latest epoch of a fully processed error related to the given executor. (We use
+  // the MapOutputTracker's epoch number, which is sent with every task.)
   //
-  // TODO: Garbage collect information about failure epochs when we know there are no more
-  //       stray messages to detect.
-  private val failedEpoch = new HashMap[String, Long]
+  // When an executor fails, it can affect the results of many tasks, and we have to deal with
+  // all of them consistently. We don't simply ignore all future results from that executor,
+  // as the failures may have been transient; but we also don't want to "overreact" to follow-
+  // on errors we receive. Furthermore, we might receive notification of a task success, after
+  // we find out the executor has actually failed; we'll assume those successes are, in fact,
+  // simply delayed notifications and the results have been lost, if they come from the same
+  // epoch. In particular, we use this to control when we tell the BlockManagerMaster that the
+  // BlockManager has been lost.

Review comment:
       Adopted.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-655392280






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r441393834



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -540,6 +540,46 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("[SPARK-32003] All shuffle files for executor should be cleaned up on fetch failure") {
+    // reset the test context with the right shuffle service config
+    afterEach()
+    val conf = new SparkConf()
+    conf.set(config.SHUFFLE_SERVICE_ENABLED.key, "true")
+    init(conf)
+
+    val shuffleMapRdd = new MyRDD(sc, 3, Nil)
+    val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(3))
+    val shuffleId = shuffleDep.shuffleId
+    val reduceRdd = new MyRDD(sc, 3, List(shuffleDep), tracker = mapOutputTracker)
+
+    submit(reduceRdd, Array(0, 1, 2))
+    // Map stage completes successfully,
+    // two tasks are run on an executor on hostA and one on an executor on hostB
+    complete(taskSets(0), Seq(
+      (Success, makeMapStatus("hostA", 3)),
+      (Success, makeMapStatus("hostA", 3)),
+      (Success, makeMapStatus("hostB", 3))))
+    // Now the executor on hostA is lost
+    runEvent(ExecutorLost("hostA-exec", ExecutorExited(-100, false, "Container marked as failed")))
+
+    // The MapOutputTracker has all the shuffle files
+    val initialMapStatuses = mapOutputTracker.shuffleStatuses(shuffleId).mapStatuses
+    assert(initialMapStatuses.count(_ != null) == 3)
+    assert(initialMapStatuses(0).location.executorId === "hostA-exec")
+    assert(initialMapStatuses(1).location.executorId === "hostA-exec")
+    assert(initialMapStatuses(2).location.executorId === "hostB-exec")

Review comment:
       This solution is fine but I think what we really care for is the number of shuffle files on `hostA-exec` and not their order. So what I would do here is:
   
   ```suggestion
       val isLocatedOnHostA = 
         (status: MapStatus) => status != null && status.location.executorId == "hostA-exec"
       assert(initialMapStatuses.count(isLocatedOnHostA) === 2)
   ```
   
   And down below this filter can be reused like:
   
   ```scala
   assert(mapStatuses.exists(isLocatedOnHostA) === false)
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645702279






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r450568119



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -248,17 +252,17 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
    */
   val cacheLocations = new HashMap[(Int, Int), Seq[BlockManagerId]]
   // stub out BlockManagerMaster.getLocations to use our cacheLocations
-  val blockManagerMaster = new BlockManagerMaster(null, null, conf, true) {
-      override def getLocations(blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
-        blockIds.map {
-          _.asRDDId.map(id => (id.rddId -> id.splitIndex)).flatMap(key => cacheLocations.get(key)).
-            getOrElse(Seq())
-        }.toIndexedSeq
-      }
-      override def removeExecutor(execId: String): Unit = {
-        // don't need to propagate to the driver, which we don't have
-      }
+  class MyBlockManagerMaster(conf: SparkConf) extends BlockManagerMaster(null, null, conf, true) {
+    override def getLocations(blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
+      blockIds.map {
+        _.asRDDId.map(id => (id.rddId -> id.splitIndex)).flatMap(key => cacheLocations.get(key)).

Review comment:
       This was existing code that I simply moved into a class. Nevertheless, I have now changed it to use braces for the map and flatMap.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-657717862


   **[Test build #125784 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125784/testReport)** for PR 28848 at commit [`4301291`](https://github.com/apache/spark/commit/430129195b222a51bd525ad6b012ab9b32981b25).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r441644246



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -540,6 +540,46 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("[SPARK-32003] All shuffle files for executor should be cleaned up on fetch failure") {
+    // reset the test context with the right shuffle service config
+    afterEach()

Review comment:
       It is called by explicitly as the `init` which is called by the `beforeEach` was called with the `SHUFFLE_SERVICE_ENABLED` default config. So that one should be cleaned up and a new initialisation is needed see a few lines below.
   
   Some tests follow this pattern within this suite. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-652750704


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124824/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tgravescs commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
tgravescs commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r444282018



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -177,6 +177,8 @@ private[spark] class DAGScheduler(
   // TODO: Garbage collect information about failure epochs when we know there are no more
   //       stray messages to detect.
   private val failedEpoch = new HashMap[String, Long]
+  // In addition, track epoch for failed executors that result in lost file output

Review comment:
       I think a better description here would be nice, because without the specific context in removeExecutorAndUnregisterOutputs lost file isn't a usual term we use.  Perhaps at least clarify saying shuffle output file(s) lost.
   it might also be nice to say why we need ti separate from the failedEpoch




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647867134






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646166931


   **[Test build #124225 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124225/testReport)** for PR 28848 at commit [`4b18369`](https://github.com/apache/spark/commit/4b1836961cfcd718589ddd5c6a768748e174b533).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662161103






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-652684886


   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/29432/
   Test PASSed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-657718459






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-655383183


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645149861


   @attilapiros 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645152215






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647906200


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650015377






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-657717862


   **[Test build #125784 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125784/testReport)** for PR 28848 at commit [`4301291`](https://github.com/apache/spark/commit/430129195b222a51bd525ad6b012ab9b32981b25).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-657866669






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654571950


   **[Test build #125168 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125168/testReport)** for PR 28848 at commit [`0ece4ac`](https://github.com/apache/spark/commit/0ece4acc0bb76c58ff12e710b60289e206a8703f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tgravescs commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
tgravescs commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-648870480


   > I'm thinking we should rename the epochs to blockManagerFailureEpoch and externalShuffleFailureEpoch
    
   @squito Maybe I'm miss understanding what you are suggesting (over what is in patch now) but the fileLost case right now is more than external shuffle, it is if an executor is lost in standalone mode or if external shuffle is not enabled, in those cases we still unregister the output for the executor.  This is why I suggested the parameter fileLost to be unregisterShuffleOutput.  The new Epoch we are tracking is whether it has already removed the output from that shuffle so I don't think externalShuffleFailureEpoch is inclusive enough.
   
   Unless you are suggesting the change that the remove output is in 2 places - 1 stays under blockManagerFailureEpoch in the case its not exteranal shuffle and then you have an additional check for exteranal shuffle and only do remove shuffle output with externalShuffleFailureEpoch.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654552863






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-658533182






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650043889


   **[Test build #124535 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124535/testReport)** for PR 28848 at commit [`d09ef93`](https://github.com/apache/spark/commit/d09ef9335e5d3657b830497155abb7a0c2bb0cde).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] squito commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
squito commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r448726041



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -248,17 +252,17 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
    */
   val cacheLocations = new HashMap[(Int, Int), Seq[BlockManagerId]]
   // stub out BlockManagerMaster.getLocations to use our cacheLocations
-  val blockManagerMaster = new BlockManagerMaster(null, null, conf, true) {
-      override def getLocations(blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
-        blockIds.map {
-          _.asRDDId.map(id => (id.rddId -> id.splitIndex)).flatMap(key => cacheLocations.get(key)).
-            getOrElse(Seq())
-        }.toIndexedSeq
-      }
-      override def removeExecutor(execId: String): Unit = {
-        // don't need to propagate to the driver, which we don't have
-      }
+  class MyBlockManagerMaster(conf: SparkConf) extends BlockManagerMaster(null, null, conf, true) {
+    override def getLocations(blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
+      blockIds.map {
+        _.asRDDId.map(id => (id.rddId -> id.splitIndex)).flatMap(key => cacheLocations.get(key)).

Review comment:
       nit: I think the style is that anytime you have a `=>`, you wrap in braces -- so both `map` and `flatMap` here should use braces




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646903497


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-657729251






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646169358


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r443811065



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,24 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Considering removal of executor $execId; " +

Review comment:
       I deliberately used a semicolon; it was not a typo.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646169326


   **[Test build #124225 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124225/testReport)** for PR 28848 at commit [`4b18369`](https://github.com/apache/spark/commit/4b1836961cfcd718589ddd5c6a768748e174b533).
    * This patch **fails build dependency tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-652683787


   **[Test build #124824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124824/testReport)** for PR 28848 at commit [`5e233ac`](https://github.com/apache/spark/commit/5e233ac6281e1ccb371401a965f7d9c33ce2f54a).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645256358






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-648290364


   @tgravescs thanks for reviewing.
   Our customer was not using spark.files.fetchFailure.unRegisterOutputOnHost.
   In case of `FetchFailure`, in `DAGScheduler#handleTaskCompletion`, there is this code:
   ```
             if (bmAddress != null) {
               val hostToUnregisterOutputs = if (env.blockManager.externalShuffleServiceEnabled &&
                 unRegisterOutputOnHostOnFetchFailure) {
                 // We had a fetch failure with the external shuffle service, so we
                 // assume all shuffle data on the node is bad.
                 Some(bmAddress.host)
               } else {
                 // Unregister shuffle data just for one executor (we don't have any
                 // reason to believe shuffle data has been lost for the entire host).
                 None
               }
               removeExecutorAndUnregisterOutputs(
                 execId = bmAddress.executorId,
                 fileLost = true,
                 hostToUnregisterOutputs = hostToUnregisterOutputs,
                 maybeEpoch = Some(task.epoch))
             }
   ```
   This is the only place where `removeExecutorAndUnregisterOutputs` is called with `fileLost=true` for us. For us, since we're not using Spark Standalone (we're using Spark on YARN) and we're using the external shuffle service, the call to `removeExecutorAndUnregisterOutputs` from `DAGScheduler#handleExecutorLost` will always be with `fileLost=false`.
   Returning to handling the `FetchFailure`, we're calling `removeExecutorAndUnregisterOutputs` with `hostToUnregisterOutputs=None`. We should be unregistering the outputs for that executor, but when the `FetchFailure` happens in the same epoch after the executor is lost (so `removeExecutorAndUnregisterOutputs` has already been called from `handleExecutorLost`), this (before my change) is a no-op (since `failedEpoch(execId)` for that executor has already been set to that epoch). In general, there are multiple such `FetchFailure`s.
   
   In the Jira, I have summarized the bug. The `FetchFailure`s after the executor is lost will lead to a stage re-attempt, and in the new stage attempt, `FetchFailure`s will continue to happen from the lost executor's outputs, but now the epoch is higher, so this time, the executor's outputs will be unregistered. But this means it takes two stage attempts instead of one to recover from the lost executor.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650472551






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646173660


   **[Test build #124226 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124226/testReport)** for PR 28848 at commit [`4b18369`](https://github.com/apache/spark/commit/4b1836961cfcd718589ddd5c6a768748e174b533).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r441329936



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,22 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Removing executor $execId, fileLost: $fileLost, currentEpoch: $currentEpoch")
     if (!failedEpoch.contains(execId) || failedEpoch(execId) < currentEpoch) {
       failedEpoch(execId) = currentEpoch
       logInfo("Executor lost: %s (epoch %d)".format(execId, currentEpoch))
       blockManagerMaster.removeExecutor(execId)
-      if (fileLost) {
-        hostToUnregisterOutputs match {
-          case Some(host) =>
-            logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))
-            mapOutputTracker.removeOutputsOnHost(host)
-          case None =>
-            logInfo("Shuffle files lost for executor: %s (epoch %d)".format(execId, currentEpoch))
-            mapOutputTracker.removeOutputsOnExecutor(execId)
-        }
-        clearCacheLocs()
-
-      } else {
-        logDebug("Additional executor lost message for %s (epoch %d)".format(execId, currentEpoch))
+    }
+    if (fileLost && (!fileLostEpoch.contains(execId) || fileLostEpoch(execId) < currentEpoch)) {
+      hostToUnregisterOutputs match {
+        case Some(host) =>
+          logInfo("Shuffle files lost for host: %s (epoch %d)".format(host, currentEpoch))

Review comment:
       Nit: I would prefer string interpolation:
   ```suggestion
             logInfo(s"Shuffle files lost for host: $host (epoch $currentEpoch)")
   ```
   
   And also a few lines below.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646849205


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 edited a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
Ngone51 edited a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-648017945


   let me take a look at the flaky barrier test. we've actually already fixed a few of them.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wypoon commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662246115


   I created https://github.com/apache/spark/pull/29182 for the backport to branch-2.4.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645152215


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654552863






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654572280






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-650015385


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124530/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646167955






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-654571950


   **[Test build #125168 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125168/testReport)** for PR 28848 at commit [`0ece4ac`](https://github.com/apache/spark/commit/0ece4acc0bb76c58ff12e710b60289e206a8703f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646903891


   Thank you for making a PR, @wypoon .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647867136


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124374/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r442079384



##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -540,6 +540,43 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
     assert(mapStatus2(2).location.host === "hostB")
   }
 
+  test("[SPARK-32003] All shuffle files for executor should be cleaned up on fetch failure") {
+    // reset the test context with the right shuffle service config
+    afterEach()
+    val conf = new SparkConf()
+    conf.set(config.SHUFFLE_SERVICE_ENABLED.key, "true")
+    init(conf)
+
+    val shuffleMapRdd = new MyRDD(sc, 3, Nil)
+    val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(3))
+    val shuffleId = shuffleDep.shuffleId
+    val reduceRdd = new MyRDD(sc, 3, List(shuffleDep), tracker = mapOutputTracker)
+
+    submit(reduceRdd, Array(0, 1, 2))
+    // Map stage completes successfully,
+    // two tasks are run on an executor on hostA and one on an executor on hostB
+    completeShuffleMapStageSuccessfully(0, 0, 3, Seq("hostA", "hostA", "hostB"))
+    // Now the executor on hostA is lost
+    runEvent(ExecutorLost("hostA-exec", ExecutorExited(-100, false, "Container marked as failed")))
+
+    // The MapOutputTracker has all the shuffle files
+    val initialMapStatuses = mapOutputTracker.shuffleStatuses(shuffleId).mapStatuses
+    assert(initialMapStatuses.count(_ != null) === 3)
+    assert(initialMapStatuses.count(s => s != null && s.location.executorId == "hostA-exec") === 2)
+    assert(initialMapStatuses.count(s => s != null && s.location.executorId == "hostB-exec") === 1)
+
+    // Now a fetch failure from the lost executor occurs
+    complete(taskSets(1), Seq(
+      (FetchFailed(makeBlockManagerId("hostA"), shuffleId, 0L, 0, 0, "ignored"), null)
+    ))
+
+    // Shuffle files for hostA-exec should be lost
+    val mapStatuses = mapOutputTracker.shuffleStatuses(shuffleId).mapStatuses
+    assert(mapStatuses.count(_ != null) === 1)
+    assert(initialMapStatuses.count(s => s != null && s.location.executorId == "hostA-exec") === 0)
+    assert(initialMapStatuses.count(s => s != null && s.location.executorId == "hostB-exec") === 1)

Review comment:
       As `mapStatuses` is reflecting the underlying change it is not needed to get it in the second time:
   
   ```suggestion
       val mapStatuses = mapOutputTracker.shuffleStatuses(shuffleId).mapStatuses
       assert(mapStatuses.count(_ != null) === 3)
       assert(mapStatuses.count(s => s != null && s.location.executorId == "hostA-exec") === 2)
       assert(mapStatuses.count(s => s != null && s.location.executorId == "hostB-exec") === 1)
   
       // Now a fetch failure from the lost executor occurs
       complete(taskSets(1), Seq(
         (FetchFailed(makeBlockManagerId("hostA"), shuffleId, 0L, 0, 0, "ignored"), null)
       ))
   
       // Shuffle files for hostA-exec should be lost
       assert(mapStatuses.count(_ != null) === 1)
       assert(mapStatuses.count(s => s != null && s.location.executorId == "hostA-exec") === 0)
       assert(mapStatuses.count(s => s != null && s.location.executorId == "hostB-exec") === 1)
   
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-645701890


   **[Test build #124184 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124184/testReport)** for PR 28848 at commit [`4c0b98c`](https://github.com/apache/spark/commit/4c0b98cc30cefb9745fa1c1232022bfcb55077cd).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r446034258



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -170,13 +170,29 @@ private[spark] class DAGScheduler(
    */
   private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
 
-  // For tracking failed nodes, we use the MapOutputTracker's epoch number, which is sent with
-  // every task. When we detect a node failing, we note the current epoch number and failed
-  // executor, increment it for new tasks, and use this to ignore stray ShuffleMapTask results.
+  // Tracks the latest epoch of a fully processed error related to the given executor. (We use
+  // the MapOutputTracker's epoch number, which is sent with every task.)
   //
-  // TODO: Garbage collect information about failure epochs when we know there are no more
-  //       stray messages to detect.
-  private val failedEpoch = new HashMap[String, Long]
+  // When an executor fails, it can affect the results of many tasks, and we have to deal with
+  // all of them consistently. We don't simply ignore all future results from that executor,
+  // as the failures may have been transient; but we also don't want to "overreact" to follow-
+  // on errors we receive. Furthermore, we might receive notification of a task success, after
+  // we find out the executor has actually failed; we'll assume those successes are, in fact,
+  // simply delayed notifications and the results have been lost, if they come from the same
+  // epoch. In particular, we use this to control when we tell the BlockManagerMaster that the
+  // BlockManager has been lost.
+  private val executorFailureEpoch = new HashMap[String, Long]
+  // Tracks the latest epoch of a fully processed error where shuffle files have been lost from

Review comment:
       Nit: add empty line to separate this two
   ```suggestion
   
     // Tracks the latest epoch of a fully processed error where shuffle files have been lost from
   ```

##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -170,13 +170,29 @@ private[spark] class DAGScheduler(
    */
   private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
 
-  // For tracking failed nodes, we use the MapOutputTracker's epoch number, which is sent with
-  // every task. When we detect a node failing, we note the current epoch number and failed
-  // executor, increment it for new tasks, and use this to ignore stray ShuffleMapTask results.
+  // Tracks the latest epoch of a fully processed error related to the given executor. (We use
+  // the MapOutputTracker's epoch number, which is sent with every task.)
   //
-  // TODO: Garbage collect information about failure epochs when we know there are no more
-  //       stray messages to detect.
-  private val failedEpoch = new HashMap[String, Long]
+  // When an executor fails, it can affect the results of many tasks, and we have to deal with
+  // all of them consistently. We don't simply ignore all future results from that executor,
+  // as the failures may have been transient; but we also don't want to "overreact" to follow-
+  // on errors we receive. Furthermore, we might receive notification of a task success, after
+  // we find out the executor has actually failed; we'll assume those successes are, in fact,
+  // simply delayed notifications and the results have been lost, if they come from the same
+  // epoch. In particular, we use this to control when we tell the BlockManagerMaster that the
+  // BlockManager has been lost.

Review comment:
       As tasks started for an earlier epoch are also considered to be lost:
   
   https://github.com/apache/spark/blob/d09ef9335e5d3657b830497155abb7a0c2bb0cde/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1585-L1586
   
   Please mention that case too, like:
   
   ```suggestion
     // simply delayed notifications and the results have been lost, if those tasks started for an
     // earlier or the same epoch. In particular, we use this to control when we tell the 
     // BlockManagerMaster that the BlockManager has been lost.
   ```

##########
File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##########
@@ -764,6 +825,7 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
       (FetchFailed(makeBlockManagerId("hostA"), shuffleId, 0L, 0, 0, "ignored"), null)))
     // this will get called
     // blockManagerMaster.removeExecutor("hostA-exec")

Review comment:
       This comment is not needed as ensured by the `verify`.
   ```suggestion
   ```

##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1933,29 +1946,43 @@ private[spark] class DAGScheduler(
       maybeEpoch = None)
   }
 
+  /**
+   * Handles removing an executor from the BlockManagerMaster as well as unregistering shuffle
+   * outputs for the executor or optionally its host.
+   *
+   * The fileLost parameter being true indicates that we assume we've lost all shuffle blocks

Review comment:
       Nit: what about documenting the params (at least the `fileLost` and `maybeEpoch`) with a "@param" scaladoc annotation?
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-662099210


   **[Test build #126279 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126279/testReport)** for PR 28848 at commit [`0e00862`](https://github.com/apache/spark/commit/0e0086288f6279569e8a11cef9d928b87c40469b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-646849193


   **[Test build #124300 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124300/testReport)** for PR 28848 at commit [`0069e71`](https://github.com/apache/spark/commit/0069e71118587a30911165e762e22e95fad97fe1).
    * This patch **fails build dependency tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-647867127


   **[Test build #124374 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124374/testReport)** for PR 28848 at commit [`633a0e7`](https://github.com/apache/spark/commit/633a0e7841c9a9f9175bed510120ef2fe66ebe00).
    * This patch **fails build dependency tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #28848: [SPARK-32003][CORE] Unregister outputs for executor on fetch failure …

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #28848:
URL: https://github.com/apache/spark/pull/28848#discussion_r443291018



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1939,24 +1941,24 @@ private[spark] class DAGScheduler(
       hostToUnregisterOutputs: Option[String],
       maybeEpoch: Option[Long] = None): Unit = {
     val currentEpoch = maybeEpoch.getOrElse(mapOutputTracker.getEpoch)
+    logDebug(s"Considering removal of executor $execId; " +

Review comment:
       nit: `;` -> `,`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org