Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/17 08:31:25 UTC

[GitHub] [spark] Ngone51 commented on pull request #28839: [SPARK-32000][CORE][TESTS] Fix the flaky testcase for partially launched task in barrier-mode.

Ngone51 commented on pull request #28839:
URL: https://github.com/apache/spark/pull/28839#issuecomment-645234503


   Hi @sarutak, thanks for reporting this and for the fix.
   
   First of all, I think it's very unlikely that we'd reach the locality wait timeout (default 3s), since that is still a very long time for such a unit test.
   
   After checking the [log](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124086/testReport/org.apache.spark.scheduler/BarrierTaskContextSuite/SPARK_31485__barrier_stage_should_fail_if_only_partial_tasks_are_launched/), I believe the real root cause should be:
   
   Two test cases from different test suites were submitted at the same time because of concurrent execution. In this particular case, the two test cases (from DistributedSuite and BarrierTaskContextSuite) both launch in local-cluster mode. The two applications were submitted at the SAME time, so they got the same application ID (app-20200615210132-0000). Thus, when the cluster of BarrierTaskContextSuite launched its executors, it failed to create the directories for executors 0 and 1, because the path (/home/jenkins/workspace/work/app-app-20200615210132-0000/0) was already in use by the cluster of DistributedSuite. It therefore had to launch executors 2 and 3 instead. As a result, none of the tasks could get their preferred locality, so they were all scheduled together, which led to the test failure.
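   
   To make the collision concrete, here is a toy model in Python (not Spark code). It assumes the application ID is built from a second-granularity timestamp plus a per-master counter, which is what the shape of app-20200615210132-0000 suggests, and that both clusters share one work directory. The helper names `make_app_id` and `launch_executors` are hypothetical and exist only for this sketch.

```python
import os
import tempfile
from datetime import datetime


def make_app_id(submit_time: datetime, app_number: int) -> str:
    # Assumed format: timestamp to the second plus a per-master counter.
    return "app-{}-{:04d}".format(submit_time.strftime("%Y%m%d%H%M%S"), app_number)


def launch_executors(work_dir: str, app_id: str, wanted: int, max_attempts: int = 10) -> list:
    """Toy model: try executor IDs in order until `wanted` directories exist."""
    launched = []
    exec_id = 0
    while len(launched) < wanted and exec_id < max_attempts:
        path = os.path.join(work_dir, app_id, str(exec_id))
        try:
            os.makedirs(path)  # raises if another cluster already owns this path
            launched.append(exec_id)
        except FileExistsError:
            pass  # this executor fails; the next attempt gets a new ID
        exec_id += 1
    return launched


submit_time = datetime(2020, 6, 15, 21, 1, 32)
# Two independent masters, each with its counter at 0, submit in the same second.
id_a = make_app_id(submit_time, 0)
id_b = make_app_id(submit_time, 0)

with tempfile.TemporaryDirectory() as work_dir:
    first = launch_executors(work_dir, id_a, wanted=2)
    second = launch_executors(work_dir, id_b, wanted=2)

print(id_a == id_b, first, second)
```

   Under these assumptions the first cluster gets executors 0 and 1, while the second cluster, finding those directories taken, ends up with executors 2 and 3.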
   
   You can download the log from `https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124086/artifact/core/` and search for appId app-20200615210132-0000 to confirm the root cause.
   
   
   I think the right fix is to use the dynamic executor ID from the SparkContext instead of hardcoding it. I'd like to open a separate PR for the fix if you don't mind.
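   
   As a rough illustration of why a hardcoded ID breaks the test while a dynamic one would not, here is a toy single-round scheduler in Python. It is a simplification and not Spark's scheduling logic: it assumes one free core per executor, that a task whose preferred executor is alive waits for it, and that a task whose preference matches no live executor may run anywhere. The function `first_offer_round` is hypothetical.

```python
def first_offer_round(preferred, live_executors):
    """Return the indices of the tasks launched in one round of offers."""
    free = set(live_executors)  # one free core per live executor
    launched = []
    for task, pref in enumerate(preferred):
        if pref in live_executors:
            # Preference is satisfiable: launch only if that executor is still free.
            if pref in free:
                free.remove(pref)
                launched.append(task)
        elif free:
            # Preference points at an executor that does not exist: run anywhere.
            free.remove(min(free))
            launched.append(task)
    return launched


live = ["2", "3"]  # executors 0 and 1 failed to start

# Hardcoded preference for executor "0", which is not alive.
hardcoded = first_offer_round(["0", "0"], live)
# Preference taken from the executors that are actually alive.
dynamic = first_offer_round([live[0], live[0]], live)

print(hardcoded, dynamic)
```

   With the hardcoded ID both tasks launch in the same round, so the "partial launch" condition the test relies on never occurs. With an ID taken from the live executors, only one task launches in the first round.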

