Posted to reviews@spark.apache.org by vanzin <gi...@git.apache.org> on 2018/06/15 23:42:01 UTC

[GitHub] spark pull request #21577: [WIP] [SPARK-24552][core] Correctly identify task...

GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/21577

    [WIP] [SPARK-24552][core] Correctly identify tasks in output commit coordinator.

    When an output stage is retried, it's possible that tasks from the previous
    attempt are still running. In that case, there would be a new task for the
    same partition in the new attempt, and the coordinator would allow both
    tasks to commit their output since it did not keep track of stage attempts.
    
    The change adds more information to the stage state tracked by the coordinator,
    so that only one task is allowed to commit the output in the above case.
    
    This also removes some code added in SPARK-18113 that allowed for duplicate
    commit requests; with the RPC code used in Spark 2, that situation cannot
    happen, so there is no need to handle it.
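    
    A minimal sketch of the idea (simplified, and not the literal patch; the names
    follow the diff quoted later in this thread). Commits are keyed by both the stage
    attempt and the task attempt, so once any committer is recorded for a partition,
    every other task - including one with the same task attempt number from a retried
    stage - is denied:
    
        case class TaskIdentifier(stageAttempt: Int, taskAttempt: Int)
    
        class StageState(numPartitions: Int) {
          // null means "no authorized committer yet" for that partition.
          val authorizedCommitters = new Array[TaskIdentifier](numPartitions)
        }
    
        def canCommit(state: StageState, stageAttempt: Int, partition: Int,
            taskAttempt: Int): Boolean = state.synchronized {
          if (state.authorizedCommitters(partition) == null) {
            state.authorizedCommitters(partition) = TaskIdentifier(stageAttempt, taskAttempt)
            true
          } else {
            // A task (possibly from a previous stage attempt) already holds the lock.
            false
          }
        }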


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-24552

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21577.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21577
    
----
commit 09e5d158e5dda6af7d83e9714dad6a64c21adf17
Author: Marcelo Vanzin <va...@...>
Date:   2018-06-15T23:21:58Z

    [SPARK-24552][core] Correctly identify tasks in output commit coordinator.
    
    When an output stage is retried, it's possible that tasks from the previous
    attempt are still running. In that case, there would be a new task for the
    same partition in the new attempt, and the coordinator would allow both
    tasks to commit their output since it did not keep track of stage attempts.
    
    The change adds more information to the stage state tracked by the coordinator,
    so that only one task is allowed to commit the output in the above case.
    
    This also removes some code added in SPARK-18113 that allowed for duplicate
    commit requests; with the RPC code used in Spark 2, that situation cannot
    happen, so there is no need to handle it.

----


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21577: [WIP] [SPARK-24552][core] Correctly identify task...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196164787
  
    --- Diff: core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala ---
    @@ -81,7 +81,7 @@ object SparkHadoopMapRedUtil extends Logging {
               logInfo(message)
               // We need to abort the task so that the driver can reschedule new attempts, if necessary
               committer.abortTask(mrTaskContext)
    -          throw new CommitDeniedException(message, stageId, splitId, taskAttemptNumber)
    +          throw new CommitDeniedException(message, ctx.stageId(), splitId, ctx.attemptNumber())
    --- End diff --
    
    Sure. Was trying to minimize changes in the first version, for testing.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4225/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/337/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/334/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    > * t2 finishes before that kill message arrives, is allowed to commit.
    > If that can happen it would generate a duplicate map output; but my guess (hope?) is that the map output tracker would only keep one of them.
    
    This should not happen after this PR for two reasons:
    a) we do not clear the status until the stage finishes (which should be sufficient to prevent the bug in its entirety)
    b) in (other) cases where we allow a task to commit but then kill it (perhaps for other reasons - like a user-initiated kill, executor pre-emption, etc.), the task failure will be recorded and the commit state for that partition will be cleared - and the resubmitted task for the partition will commit.
    
    There is always an inherent race in (b) - where the task is killed after the output commit but before task completion - that is something we cannot fix, and which is to be handled/resolved by the final job commit.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92109/
    Test FAILed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4219/
    Test PASSed.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #91949 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91949/testReport)** for PR 21577 at commit [`09e5d15`](https://github.com/apache/spark/commit/09e5d158e5dda6af7d83e9714dad6a64c21adf17).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #92109 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92109/testReport)** for PR 21577 at commit [`264c533`](https://github.com/apache/spark/commit/264c533737410786faae24df8cb5b27218f804cd).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/331/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24552][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92039/
    Test FAILed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #92096 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92096/testReport)** for PR 21577 at commit [`adb0d18`](https://github.com/apache/spark/commit/adb0d18fa1bb43488245b7d6b7ee02d4997d6215).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4226/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24552][core] Correctly identify tasks in output c...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    I tried to create a test based on actually running a job, but I'd have to do a lot of hacking to control what the result stage does, and it was starting to feel not much better than the unit test I added here already, so I gave up.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21577: [WIP] [SPARK-24552][core] Correctly identify task...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196174240
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -155,47 +164,41 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
     
       // Marked private[scheduler] instead of private so this can be mocked in tests
       private[scheduler] def handleAskPermissionToCommit(
    -      stage: StageId,
    -      partition: PartitionId,
    -      attemptNumber: TaskAttemptNumber): Boolean = synchronized {
    +      stage: Int,
    +      stageAttempt: Int,
    +      partition: Int,
    +      attemptNumber: Int): Boolean = synchronized {
         stageStates.get(stage) match {
    -      case Some(state) if attemptFailed(state, partition, attemptNumber) =>
    -        logInfo(s"Denying attemptNumber=$attemptNumber to commit for stage=$stage," +
    -          s" partition=$partition as task attempt $attemptNumber has already failed.")
    +      case Some(state) if attemptFailed(state, stageAttempt, partition, attemptNumber) =>
    +        logInfo(s"Commit denied for stage=$stage/$attemptNumber, partition=$partition: " +
    +          s"task attempt $attemptNumber already marked as failed.")
             false
           case Some(state) =>
    -        state.authorizedCommitters(partition) match {
    -          case NO_AUTHORIZED_COMMITTER =>
    -            logDebug(s"Authorizing attemptNumber=$attemptNumber to commit for stage=$stage, " +
    -              s"partition=$partition")
    -            state.authorizedCommitters(partition) = attemptNumber
    -            true
    -          case existingCommitter =>
    -            // Coordinator should be idempotent when receiving AskPermissionToCommit.
    -            if (existingCommitter == attemptNumber) {
    -              logWarning(s"Authorizing duplicate request to commit for " +
    -                s"attemptNumber=$attemptNumber to commit for stage=$stage," +
    -                s" partition=$partition; existingCommitter = $existingCommitter." +
    -                s" This can indicate dropped network traffic.")
    -              true
    -            } else {
    -              logDebug(s"Denying attemptNumber=$attemptNumber to commit for stage=$stage, " +
    -                s"partition=$partition; existingCommitter = $existingCommitter")
    -              false
    -            }
    +        val existing = state.authorizedCommitters(partition)
    +        if (existing == null) {
    +          logDebug(s"Commit allowed for stage=$stage/$attemptNumber, partition=$partition: " +
    +            s"task attempt $attemptNumber")
    +          state.authorizedCommitters(partition) = TaskIdentifier(stageAttempt, attemptNumber)
    +          true
    +        } else {
    +          logDebug(s"Commit denied for stage=$stage/$attemptNumber, partition=$partition: " +
    --- End diff --
    
    would be nice to include the stage attempt in the log messages as well.
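    
    Something like this, for instance (just a sketch; these are the variables already
    in scope in the diffed method):
    
        logDebug(s"Commit denied for stage=$stage.$stageAttempt, partition=$partition: " +
          s"task attempt $attemptNumber; existing committer: $existing.")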


---



[GitHub] spark pull request #21577: [SPARK-24552][core] Correctly identify tasks in o...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196247811
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
        * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
        *                       the maximum possible value of `context.partitionId`).
        */
    -  private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
    +  private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
         stageStates(stage) = new StageState(maxPartitionId + 1)
    --- End diff --
    
    I don't think the semantics are changing. It's always been racy, in that either of the concurrent tasks from different stage attempts may succeed first. And I'm almost sure the assumption is that both task attempts are equivalent (i.e. the output is deterministic or at least should be), so it should be fine for either to be committed.
    
    The problem is that without this change the coordinator would allow both attempts to commit, and that is kinda bad.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    I'm fine with separating them but we need a jira or need to update the v2 jira to handle all cases


---



[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196284982
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
        * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
        *                       the maximum possible value of `context.partitionId`).
        */
    -  private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
    +  private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
         stageStates(stage) = new StageState(maxPartitionId + 1)
    --- End diff --
    
    `T1_1.2` will not be allowed, it has a different task attempt number.


---



[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196290828
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
        * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
        *                       the maximum possible value of `context.partitionId`).
        */
    -  private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
    +  private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
         stageStates(stage) = new StageState(maxPartitionId + 1)
    --- End diff --
    
    I checked the related code in `DAGScheduler`: if `T1_1.1` succeeds, the re-tried stage won't launch a task for this partition, because Spark tracks finished tasks for a job.


---



[GitHub] spark issue #21577: [SPARK-24552][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4179/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #92104 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92104/testReport)** for PR 21577 at commit [`0e91f1b`](https://github.com/apache/spark/commit/0e91f1b49ee4720c9b89a2090080efd7c89ccaf4).


---



[GitHub] spark pull request #21577: [SPARK-24552][core] Correctly identify tasks in o...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196214944
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -131,16 +139,17 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
         reason match {
           case Success =>
           // The task output has been committed successfully
    -      case denied: TaskCommitDenied =>
    -        logInfo(s"Task was denied committing, stage: $stage, partition: $partition, " +
    -          s"attempt: $attemptNumber")
    -      case otherReason =>
    +      case _: TaskCommitDenied =>
    +        logInfo(s"Task was denied committing, stage: $stage / $stageAttempt, " +
    --- End diff --
    
    Nit: Should this be `s"$stage.$stageAttempt"`?


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    So I think the commit/delete thing is also an issue for the existing v1 and hadoop committers, so this doesn't fully solve the problem. Spark uses a file name format like (HadoopMapReduceWriteConfigUtil/HadoopMapRedWriteConfigUtil): {date}_{rddid}_{m/r}_{partitionid}_{task attempt number}
    I believe the same fix as the v2 would work using the taskAttemptId instead of the attemptNumber. (see 
    
    In the case where we have a stage failure and a second stage attempt, the task attempt number could be the same and thus both tasks write to the same place. If one of them fails or is told not to commit, it could delete the output which is being used by both.
    
    Need to think through all the scenarios to make sure they're covered.
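    
    Hypothetical illustration of the collision (not Spark code; just the naming scheme
    above spelled out): both stage attempts restart the task attempt counter at 0, so
    their tasks for the same partition can resolve to the same path.
    
        def attemptPath(date: String, rddId: Int, partitionId: Int, attemptNumber: Int): String =
          s"${date}_${rddId}_m_${partitionId}_${attemptNumber}"
    
        val fromStageAttempt0 = attemptPath("20180615", 3, 7, attemptNumber = 0)
        val fromStageAttempt1 = attemptPath("20180615", 3, 7, attemptNumber = 0)
        // Same location: a failed attempt's cleanup can delete output both rely on.
        assert(fromStageAttempt0 == fromStageAttempt1)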


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #92138 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92138/testReport)** for PR 21577 at commit [`264c533`](https://github.com/apache/spark/commit/264c533737410786faae24df8cb5b27218f804cd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24552][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21577: [SPARK-24552][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Thanks for fixing this, @vanzin!


---



[GitHub] spark pull request #21577: [SPARK-24552][core] Correctly identify tasks in o...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196246952
  
    --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
    @@ -399,7 +399,8 @@ private[spark] object JsonProtocol {
             ("Full Stack Trace" -> exceptionFailure.fullStackTrace) ~
             ("Accumulator Updates" -> accumUpdates)
           case taskCommitDenied: TaskCommitDenied =>
    -        ("Job ID" -> taskCommitDenied.jobID) ~
    +        ("Job ID" -> taskCommitDenied.stageID) ~
    +        ("Job Attempt Number" -> taskCommitDenied.stageAttempt) ~
    --- End diff --
    
    For the new property, I'm just following what the old property says, even though it's wrong. I think having `Job ID` and `Stage Attempt Number` would just be even more confusing...


---



[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196514319
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
        * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
        *                       the maximum possible value of `context.partitionId`).
        */
    -  private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
    +  private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
         stageStates(stage) = new StageState(maxPartitionId + 1)
    --- End diff --
    
    Yep, I'm going down that path. I just want to add a proper test to make sure the behavior is correct and that's a little bit tricky.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    > Are there other docs that need to be updated for v2 datasource api?
    
    Yes we do, but this can be done in a different PR


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/325/
    Test PASSed.


---



[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196482353
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
        * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
        *                       the maximum possible value of `context.partitionId`).
        */
    -  private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
    +  private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
         stageStates(stage) = new StageState(maxPartitionId + 1)
    --- End diff --
    
    Yeah, I think this can happen. The problem is that with the current way it's used, the output committer forgets the commit status between stage retries. I think the right thing would be for the committer to keep the stage-related state until the scheduler is done with all its attempts.
    
    > whether it will generate corrupted data when the commit process of T1_1.1 didn't finish
    
    I don't think that's the problem. The problem is that if both the initial task and the speculative task finish successfully, but across stage attempt barriers (so the output committer is "reset" in between), both will be allowed to commit, so you get duplicate data.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    So anyone wants to do the actual merging?


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/223/
    Test PASSed.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/332/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Sounds good to me (although I'm trying the change locally and unit tests are so far happy).


---



[GitHub] spark issue #21577: [SPARK-24552][core] Correctly identify tasks in output c...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    +1. This fixes the commit coordinator problem where two separate tasks can be authorized. That case could lead to duplicate data (if, for example, both tasks generated unique file names using a random UUID).
    
    However, this doesn't address the problem I hit in practice, where a file was created twice and deleted once because the same task attempt number was both allowed to commit by the coordinator and denied commit by the coordinator (after the stage had finished).
    
    We still need the solution proposed in https://github.com/apache/spark/pull/21558 for the v2 API. But that's more of a v2 API problem because that API makes the guarantee that implementations can rely on the attempt ID.
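    
    To make the duplicate-data case concrete (hypothetical naming, not any committer's
    actual scheme): when attempts pick unique file names, two authorized commits leave
    two files for one partition instead of one overwriting the other.
    
        import java.util.UUID
    
        def outputFile(partition: Int): String =
          s"part-$partition-${UUID.randomUUID()}.parquet"
        // Two committed attempts for partition 7 leave e.g. part-7-<uuid1>.parquet
        // and part-7-<uuid2>.parquet behind - duplicate rows downstream.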


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    +1 
    
    this is a bit of a side note: while looking through the scenarios I filed https://issues.apache.org/jira/browse/SPARK-24622. It shouldn't be a problem here though with this fix.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    FYI I plan to fix the mima issue later. Haven't decided whether to revert the change or just add excludes... probably the latter since it's a developer api.


---



[GitHub] spark pull request #21577: [SPARK-24552][core] Correctly identify tasks in o...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196246700
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -131,16 +139,17 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
         reason match {
           case Success =>
           // The task output has been committed successfully
    -      case denied: TaskCommitDenied =>
    -        logInfo(s"Task was denied committing, stage: $stage, partition: $partition, " +
    -          s"attempt: $attemptNumber")
    -      case otherReason =>
    +      case _: TaskCommitDenied =>
    +        logInfo(s"Task was denied committing, stage: $stage / $stageAttempt, " +
    +          s"partition: $partition, attempt: $attemptNumber")
    +      case _ =>
             // Mark the attempt as failed to blacklist from future commit protocol
    -        stageState.failures.getOrElseUpdate(partition, mutable.Set()) += attemptNumber
    -        if (stageState.authorizedCommitters(partition) == attemptNumber) {
    +        val taskId = TaskIdentifier(stageAttempt, attemptNumber)
    +        stageState.failures.getOrElseUpdate(partition, mutable.Set()) += taskId
    +        if (stageState.authorizedCommitters(partition) == taskId) {
               logDebug(s"Authorized committer (attemptNumber=$attemptNumber, stage=$stage, " +
                 s"partition=$partition) failed; clearing lock")
    -          stageState.authorizedCommitters(partition) = NO_AUTHORIZED_COMMITTER
    +          stageState.authorizedCommitters(partition) = null
    --- End diff --
    
    Less memory usage, at least. Not sure what advantage using `Option` would bring here.


---



[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196479588
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
        * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
        *                       the maximum possible value of `context.partitionId`).
        */
    -  private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
    +  private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
         stageStates(stage) = new StageState(maxPartitionId + 1)
    --- End diff --
    
    Actually @mridulm has a point here: on Scenario 2 we'll finally have two attempts of a single task committed. Normally this shall not cause any problem because the two attempts shall generate the same output. But I'm still wondering whether it will generate corrupted data when the commit process of T1_1.1 didn't finish (so it may write to the same dir as T1_1.2 at the same time) or T1_1.2 didn't finish the commit process (the executor gets killed in the middle of the commit process).


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92106/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    > Ah, right, d'oh. I just checked about whether stages register with the coordinator, and saw the duplicate registration for the resubmitted map stage.
    
    Yeah I noticed that too, but I think we should perhaps file a separate jira and only do that in 2.4 and maybe 2.3.2.


---



[GitHub] spark issue #21577: [SPARK-24552][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21577: [SPARK-24552][core] Correctly identify tasks in o...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196247047
  
    --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
    @@ -399,7 +399,8 @@ private[spark] object JsonProtocol {
             ("Full Stack Trace" -> exceptionFailure.fullStackTrace) ~
             ("Accumulator Updates" -> accumUpdates)
           case taskCommitDenied: TaskCommitDenied =>
    -        ("Job ID" -> taskCommitDenied.jobID) ~
    +        ("Job ID" -> taskCommitDenied.stageID) ~
    +        ("Job Attempt Number" -> taskCommitDenied.stageAttempt) ~
    --- End diff --
    
    And it shouldn't affect compatibility. Given the code, even an old history server would be able to read these new log files.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #92138 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92138/testReport)** for PR 21577 at commit [`264c533`](https://github.com/apache/spark/commit/264c533737410786faae24df8cb5b27218f804cd).


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92104/
    Test PASSed.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    I think Ryan's change might still be good to introduce (i.e. a change that replaces the attempt id in that code with something a little more unique), regardless of any fix here.
    
    The unit test I added artificially re-creates the calls that would lead to the situation, but I haven't tried to create a test case that would run things through the scheduler.
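    
    Roughly this shape (a sketch with assumed method names, not the actual test code):
    
        coordinator.stageStart(stage = 1, maxPartitionId = 0)
        // The first task attempt asks and is authorized.
        assert(coordinator.canCommit(stage = 1, stageAttempt = 0, partition = 0, attemptNumber = 0))
        // A task for the same partition from a retried stage attempt must be denied,
        // even though its task attempt number is also 0.
        assert(!coordinator.canCommit(stage = 1, stageAttempt = 1, partition = 0, attemptNumber = 0))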


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/221/
    Test FAILed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92138/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    >  * fixed the issue Mridul brought up, but I think the race that Tom describes still exists. I'm just not sure it would cause problems, since as far as I can tell it can only happen in a map stage, not a result stage.
    
    @vanzin  which race were you referring to here? I think tracking the stage across attempts fixes both the ones I mentioned in reference to scenario 2 for Mridul.
    
    > There is another case though here where T1_1.1 could have just asked to be committed, but not yet committed, then if it gets delayed committing, the new stage attempt starts and T1_1.2 asks if it could commit and is granted, so then both try to commit at the same time causing corruption.
    
    Fixed: T1_1.2 won't be allowed to commit because we track the first stage attempt as committing.
    
    > The caveat there though would be if since T1_1.1 was committed, the second stage attempt could finish and call commitJob while T1_1.2 is committing since spark thinks it doesn't need to wait for T1_1.2. Anyway this seems very unlikely but we should protect against it.
    
    T1_1.2 shouldn't ever be allowed to commit since we track across the attempts so it wouldn't ever commit after the stage itself has completed.


---



[GitHub] spark pull request #21577: [SPARK-24552][core] Correctly identify tasks in o...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196217742
  
    --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
    @@ -399,7 +399,8 @@ private[spark] object JsonProtocol {
             ("Full Stack Trace" -> exceptionFailure.fullStackTrace) ~
             ("Accumulator Updates" -> accumUpdates)
           case taskCommitDenied: TaskCommitDenied =>
    -        ("Job ID" -> taskCommitDenied.jobID) ~
    +        ("Job ID" -> taskCommitDenied.stageID) ~
    +        ("Job Attempt Number" -> taskCommitDenied.stageAttempt) ~
    --- End diff --
    
    Also, will this affect the compatibility of the history server files?


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    > I pushed the change for that in: vanzin/spark@e6a862e
    
    I like it, it's simpler to use task id to replace stage attempt id and task attempt id. For safety we should do it in master only after this PR is merged.
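    
    A sketch of what that simplification could look like (assuming the coordinator keys
    committers by the globally unique task id, as the linked commit suggests; not the
    literal change):
    
        class StageState(numPartitions: Int) {
          // -1 means no authorized committer; otherwise the unique task id (TID).
          val authorizedCommitters = Array.fill[Long](numPartitions)(-1L)
        }
    
        def canCommit(state: StageState, partition: Int, taskId: Long): Boolean =
          state.synchronized {
            if (state.authorizedCommitters(partition) == -1L) {
              state.authorizedCommitters(partition) = taskId
              true
            } else {
              false
            }
          }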


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196513151
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
        * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
        *                       the maximum possible value of `context.partitionId`).
        */
    -  private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
    +  private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
         stageStates(stage) = new StageState(maxPartitionId + 1)
    --- End diff --
    
    >  I think the right thing would be for the committer to keep the stage-related state until the scheduler is done with all its attempts.
    
    We should change the `DAGScheduler` a little bit so that, if a stage is killed and going to be re-tried, we do not clear the stage state in the output coordinator.
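    
    A hedged guess at what that guard could look like, extending the `CommitTracker` sketch posted earlier in this thread (hypothetical; the eventual change may differ):
    
    ```scala
    // Only create fresh state the first time a stage starts; a retry of the
    // same stage id keeps the authorized committers and failures recorded by
    // earlier attempts instead of clearing them.
    def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
      stageStates.get(stage) match {
        case Some(existing) =>
          require(existing.authorizedCommitters.length == maxPartitionId + 1,
            s"partition count changed between attempts of stage $stage")
        case None =>
          stageStates(stage) = new StageState(maxPartitionId + 1)
      }
    }
    ```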


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    this was along the lines of what I was thinking as well. Will do a full review later.
    Just curious if you were able to create a test to actually reproduce it?
    
    From the other PR:
    >> and data source v2 API assumes (job id, partition id, task attempt id) can uniquely define a write task, even counting the failure cases.
    
    Are there other docs that need to be updated for v2 datasource api?  @rdblue @cloud-fan 


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Yeah sorry about that, my fault.  I merged the fix - SPARK-22897


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4252/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #92106 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92106/testReport)** for PR 21577 at commit [`5ece2f1`](https://github.com/apache/spark/commit/5ece2f12a820d6438146758f0e944f3b1c70d489).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    A few notes about the latest updates:
    
    - I reverted the `TaskCommitDenied` changes so that this patch can be backported more easily. I'm not against the change but I think it'd be better if we made it only in master, so we can postpone it. The information there is also not critical, since the end reason is generally attached to a task, which has the needed info anyway.
    
    - I fixed the issue Mridul brought up, but I think the race that Tom describes still exists. I'm just not sure it would cause problems, since as far as I can tell it can only happen in a map stage, not a result stage.
    
    The test I added can sort of illustrate that if you look at what happens. There are two stages (map stage 2, result stage 3), and the fetch failure causes a retry of stage 3 *plus* a resubmission of stage 2 - which means that stage 2 is starting in the coordinator with a fresh list of committers.
    
    So if there's still a speculative task from stage 2 that hasn't been properly killed, it might be allowed to commit. But this being a map stage, I assume the map output tracker would take care of filtering out duplicates?
    
    That's obviously really hard to hit, but if it can be an issue we could look at it separately.


---



[GitHub] spark pull request #21577: [SPARK-24552][core] Correctly identify tasks in o...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196235040
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
        * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
        *                       the maximum possible value of `context.partitionId`).
        */
    -  private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
    +  private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
         stageStates(stage) = new StageState(maxPartitionId + 1)
    --- End diff --
    
    My memory is a bit rusty here, but are we changing the semantics of which task can commit here ?
    Couple of queries:
    * Are we allowing a task from a previous stage attempt to commit for the current stage attempt ?
    ** If yes, we should not overwrite `stageStates(stage)` if it exists.
    *** Based on `TaskIdentifier` above, I think the answer is yes ?
    ** If no, we should check and reject commit requests from tasks of an 'older' stage attempt when the current stage attempt is different.



---



[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196282241
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
        * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
        *                       the maximum possible value of `context.partitionId`).
        */
    -  private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
    +  private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
         stageStates(stage) = new StageState(maxPartitionId + 1)
    --- End diff --
    
    There are two cases here (both not handled in existing/earlier code).
    
    Handled in PR:
    * Stage S1 attempt A1 launched.
    * Tasks T1_1 launched for partition P1
    * A1 fails
    * Stage S1 attempt A2 launched.
    * Tasks T1_2 for partition P1 launched.
    * T1_1 finishes, and is allowed to commit.
    
    IMO not handled in PR:
    * Stage S1 attempt A1 launched.
    * Tasks T1_1.1 launched for partition P1
    * Tasks T1_1.2 launched for partition P1 (speculative)
    * Task T1_1.1 committed.
    * A1 fails
    * Stage S1 attempt A2 launched for some other pending partitions.
    * Tasks T1_1.2 wants to commit.
    
    T1_1.2 will be allowed to commit.
    Now we have two tasks for the same partition successfully committing.
    
    Did I miss something here ?


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #92106 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92106/testReport)** for PR 21577 at commit [`5ece2f1`](https://github.com/apache/spark/commit/5ece2f12a820d6438146758f0e944f3b1c70d489).


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91950/
    Test PASSed.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    thanks! the fix LGTM


---



[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196291657
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
        * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
        *                       the maximum possible value of `context.partitionId`).
        */
    -  private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
    +  private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
         stageStates(stage) = new StageState(maxPartitionId + 1)
    --- End diff --
    
    Ah, I see your confusion.
    Please note that T1_1.2 is a speculative task, not a retried task.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    test this please


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    > I think the commit/delete thing is also an issue for existing v1 and hadoop committers as well.
    
    I took a look at that code and I agree with you. It's actually quite annoying how "task id" and "task attempt id" are used interchangeably with "task attempt number" in a number of places when they aren't the same thing.
    
    Anyway, I think that should either be a separate change or perhaps integrated with @rdblue's patch, since it's more similar to that one. The fix should actually be quite simple: `createTaskAttemptContext` should take the actual task attempt id (type `Long`) instead of the attempt number as an argument.
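    
    For reference, the identifiers being conflated are all exposed on `TaskContext`; a small illustration (the comments reflect my reading of the existing API, nothing new):
    
    ```scala
    import org.apache.spark.TaskContext
    
    // Inside a running task:
    val ctx = TaskContext.get()
    ctx.partitionId()    // which partition this task computes
    ctx.attemptNumber()  // 0, 1, ... per (re)try of this task; NOT app-wide unique
    ctx.taskAttemptId()  // a Long that is unique across all task attempts in the app
    ```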


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4230/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24552][core] Correctly identify tasks in output c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #92039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92039/testReport)** for PR 21577 at commit [`a2f4c1b`](https://github.com/apache/spark/commit/a2f4c1bde8b6d293c20983bd019f15c3c4f2703d).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    I filed SPARK-24611 to track some enhancements to this part of the code that have been discussed here. Of those, I'd consider the "use task IDs instead of TaskIdentifier" as something we could potentially do here, but at the same time I don't really want to delay this patch too much.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #91950 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91950/testReport)** for PR 21577 at commit [`d471b74`](https://github.com/apache/spark/commit/d471b74d4562ca8aa8e68a7ff90d881a67de5e59).


---



[GitHub] spark pull request #21577: [SPARK-24552][core] Correctly identify tasks in o...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196246754
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -131,16 +139,17 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
         reason match {
           case Success =>
           // The task output has been committed successfully
    -      case denied: TaskCommitDenied =>
    -        logInfo(s"Task was denied committing, stage: $stage, partition: $partition, " +
    -          s"attempt: $attemptNumber")
    -      case otherReason =>
    +      case _: TaskCommitDenied =>
    +        logInfo(s"Task was denied committing, stage: $stage / $stageAttempt, " +
    --- End diff --
    
    That looks better and saves some space, so will do.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    the code here lgtm. I was trying to make one more pass through all the scenarios but got stuck in meetings; will try to do it later tonight or tomorrow morning, but we can always have another follow-up if we find another case.
    



---



[GitHub] spark issue #21577: [SPARK-24552][core] Correctly identify tasks in output c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #92039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92039/testReport)** for PR 21577 at commit [`a2f4c1b`](https://github.com/apache/spark/commit/a2f4c1bde8b6d293c20983bd019f15c3c4f2703d).


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    This in general looks good. IMO we should focus on fixing the output commit coordinator issue in this PR, and discuss the data source issue in a separate thread.
    I'm OOO this week but will still look into this issue in more detail.


---



[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196514785
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
        * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
        *                       the maximum possible value of `context.partitionId`).
        */
    -  private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
    +  private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
         stageStates(stage) = new StageState(maxPartitionId + 1)
    --- End diff --
    
    I agree @vanzin, @cloud-fan. We should remove the stage info only after the stage is done.


---



[GitHub] spark pull request #21577: [WIP] [SPARK-24552][core] Correctly identify task...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196163389
  
    --- Diff: core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala ---
    @@ -81,7 +81,7 @@ object SparkHadoopMapRedUtil extends Logging {
               logInfo(message)
               // We need to abort the task so that the driver can reschedule new attempts, if necessary
               committer.abortTask(mrTaskContext)
    -          throw new CommitDeniedException(message, stageId, splitId, taskAttemptNumber)
    +          throw new CommitDeniedException(message, ctx.stageId(), splitId, ctx.attemptNumber())
    --- End diff --
    
    shall we also include stage attempt number in the exception?


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    I will 


---



[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196494533
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
        * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
        *                       the maximum possible value of `context.partitionId`).
        */
    -  private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
    +  private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
         stageStates(stage) = new StageState(maxPartitionId + 1)
    --- End diff --
    
    So in scenario 2, once the first task finishes and is committed, the taskset manager will kill the speculative task T1_1.2. But since the kill is an async message, the task could actually try to commit after another task fails, causes the stage to remove itself from the output commit coordinator, and starts another stage attempt. So it could actually end up committing the task output for T1_1.2. I'm not sure this case by itself is a problem, though: if T1_1.1 actually committed and T1_1.2 is allowed to commit, they should have the same output, and commitJob would handle it in at least most cases. The caveat would be that since T1_1.1 was committed, the second stage attempt could finish and call commitJob while T1_1.2 is still committing, since Spark thinks it doesn't need to wait for T1_1.2. Anyway, this seems very unlikely but we should protect against it.
    
    There is another case though here where T1_1.1 could have just asked to be committed, but not yet committed, then if it gets delayed committing, the new stage attempt starts and T1_1.2 asks if it could commit and is granted, so then both try to commit at the same time causing corruption.
    



---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92096/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4115/
    Test PASSed.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #91949 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91949/testReport)** for PR 21577 at commit [`09e5d15`](https://github.com/apache/spark/commit/09e5d158e5dda6af7d83e9714dad6a64c21adf17).


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24552][core] Correctly identify tasks in output c...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    > We still need the solution proposed in #21558 for the v2 API
    
    Should we have a separate bug for these then? I just piggybacked on the bug you filed, but if they're separate issues, even if complementary, might be better to separate them.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #91950 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91950/testReport)** for PR 21577 at commit [`d471b74`](https://github.com/apache/spark/commit/d471b74d4562ca8aa8e68a7ff90d881a67de5e59).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    I was referring to a race caused by asynchronously killing speculative tasks. Granted it's incredibly unlikely to occur in real life:
    
    - in s1a1, t1 and t2 are started for the same partition, t1 succeeds, a kill is sent for t2
    - s1 finishes, coordinator state is cleared for that stage
    - s2a1 fails, causes s1 to be re-submitted
    - t2 finishes before that kill message arrives, is allowed to commit.
    
    If that can happen it would generate a duplicate map output; but my guess (hope?) is that the map output tracker would only keep one of them.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21577: [SPARK-24552][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/285/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    > if its a map stage then I don't expect it to be asking to commit.
    
    Ah, right, d'oh. I just checked whether stages register with the coordinator, and saw the duplicate registration for the resubmitted map stage.
    
    I guess we could avoid that too. The scheduler does that explicitly, but if it doesn't serve any purpose, it would save some memory:
    
    ```
          case s: ShuffleMapStage =>
            outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    ```
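    
    A hypothetical variant that registers only result stages could look like this (sketch only; whether anything relies on the map-stage registration is the open question here):
    
    ```scala
    stage match {
      case s: ResultStage =>
        // Only result stages ask the coordinator for permission to commit.
        outputCommitCoordinator.stageStart(
          stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
      case _: ShuffleMapStage =>
        // Map tasks never ask to commit, so skip registration to save memory.
    }
    ```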


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #92104 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92104/testReport)** for PR 21577 at commit [`0e91f1b`](https://github.com/apache/spark/commit/0e91f1b49ee4720c9b89a2090080efd7c89ccaf4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by zzcclp <gi...@git.apache.org>.
Github user zzcclp commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    @vanzin @tgravescs , after merging this PR into branch-2.2, there is an error "stageAttemptNumber is not a member of org.apache.spark.TaskContext" in SparkHadoopMapRedUtil; I think PR-20082 needs to be merged first.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Thanks for the changes @vanzin, looks good to me !
    Ideally it would have been great to test the speculative execution part as well, but that would be fairly nasty to reliably reproduce, I guess.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91949/
    Test FAILed.


---



[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21577


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4233/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Also, one idea to try to fix the remaining race would be to move the commit-related state to the `org.apache.spark.scheduler.Stage` class, which is reused across attempts (even in the resubmission case I mentioned), and keep the coordinator class purely for the RPC stuff.
    
    But that would be a bigger change.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #92096 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92096/testReport)** for PR 21577 at commit [`adb0d18`](https://github.com/apache/spark/commit/adb0d18fa1bb43488245b7d6b7ee02d4997d6215).


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    I pushed the change for that in: https://github.com/vanzin/spark/commit/e6a862ecb83c64a0ea2f5bd469bc0febe25e15ba
    
    In case anyone wants to take a look.


---



[GitHub] spark pull request #21577: [SPARK-24589][core] Correctly identify tasks in o...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196288823
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
        * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
        *                       the maximum possible value of `context.partitionId`).
        */
    -  private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
    +  private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
         stageStates(stage) = new StageState(maxPartitionId + 1)
    --- End diff --
    
    If I read @vanzin's PR right, T1_1.2 will be allowed to commit - since there is a stageEnd + stageStart in between (which clears the existing stage state).


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/356/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    **[Test build #92109 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92109/testReport)** for PR 21577 at commit [`264c533`](https://github.com/apache/spark/commit/264c533737410786faae24df8cb5b27218f804cd).


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    merged to master, 2.3, and 2.2


---



[GitHub] spark pull request #21577: [SPARK-24552][core] Correctly identify tasks in o...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196215961
  
    --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
    @@ -399,7 +399,8 @@ private[spark] object JsonProtocol {
             ("Full Stack Trace" -> exceptionFailure.fullStackTrace) ~
             ("Accumulator Updates" -> accumUpdates)
           case taskCommitDenied: TaskCommitDenied =>
    -        ("Job ID" -> taskCommitDenied.jobID) ~
    +        ("Job ID" -> taskCommitDenied.stageID) ~
    +        ("Job Attempt Number" -> taskCommitDenied.stageAttempt) ~
    --- End diff --
    
    Why does this use "Job" and not "Stage"?


---



[GitHub] spark pull request #21577: [SPARK-24552][core] Correctly identify tasks in o...

Posted by rdblue <gi...@git.apache.org>.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196214788
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -131,16 +139,17 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
         reason match {
           case Success =>
           // The task output has been committed successfully
    -      case denied: TaskCommitDenied =>
    -        logInfo(s"Task was denied committing, stage: $stage, partition: $partition, " +
    -          s"attempt: $attemptNumber")
    -      case otherReason =>
    +      case _: TaskCommitDenied =>
    +        logInfo(s"Task was denied committing, stage: $stage / $stageAttempt, " +
    +          s"partition: $partition, attempt: $attemptNumber")
    +      case _ =>
             // Mark the attempt as failed to blacklist from future commit protocol
    -        stageState.failures.getOrElseUpdate(partition, mutable.Set()) += attemptNumber
    -        if (stageState.authorizedCommitters(partition) == attemptNumber) {
    +        val taskId = TaskIdentifier(stageAttempt, attemptNumber)
    +        stageState.failures.getOrElseUpdate(partition, mutable.Set()) += taskId
    +        if (stageState.authorizedCommitters(partition) == taskId) {
               logDebug(s"Authorized committer (attemptNumber=$attemptNumber, stage=$stage, " +
                 s"partition=$partition) failed; clearing lock")
    -          stageState.authorizedCommitters(partition) = NO_AUTHORIZED_COMMITTER
    +          stageState.authorizedCommitters(partition) = null
    --- End diff --
    
    Nit: why not use Option[TaskIdentifier] and None here?
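    
    For illustration, the `Option`-based bookkeeping being suggested might look like this (sketch):
    
    ```scala
    // Hypothetical: use Option instead of null for the per-partition lock.
    val authorizedCommitters = Array.fill[Option[TaskIdentifier]](numPartitions)(None)
    
    // Clearing the lock when the authorized attempt later fails:
    if (authorizedCommitters(partition).contains(taskId)) {
      authorizedCommitters(partition) = None
    }
    ```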


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    I'll try to create a test to exercise this in a real job (aside from the exception changes @cloud-fan suggested), but wouldn't hold my breath.


---



[GitHub] spark issue #21577: [WIP] [SPARK-24552][core] Correctly identify tasks in ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4117/
    Test PASSed.


---



[GitHub] spark issue #21577: [SPARK-24589][core] Correctly identify tasks in output c...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21577
  
    > The test I added can sort of illustrate that if you look at what happens. There are two stages (map stage 2, result stage 3), and the fetch failure causes a retry of stage 3 plus a resubmission of stage 2 - which means that stage 2 is starting in the coordinator with a fresh list of committers.
    > 
    > So if there's still a speculative task from stage 2's first run that hasn't been properly killed yet, it might be allowed to commit. But this being a map stage, I assume the map output tracker would take care of filtering out duplicates?
    
    if it's a map stage then I don't expect it to be asking to commit.  Is there a case you know of that does?
    



---
