Posted to issues@spark.apache.org by "Marcelo Vanzin (JIRA)" <ji...@apache.org> on 2019/03/15 22:16:00 UTC

[jira] [Resolved] (SPARK-26634) OutputCommitCoordinator may allow task of FetchFailureStage commit again

     [ https://issues.apache.org/jira/browse/SPARK-26634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin resolved SPARK-26634.
------------------------------------
    Resolution: Duplicate

> OutputCommitCoordinator may allow task of FetchFailureStage commit again
> ------------------------------------------------------------------------
>
>                 Key: SPARK-26634
>                 URL: https://issues.apache.org/jira/browse/SPARK-26634
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: liupengcheng
>            Priority: Major
>
> In our production Spark cluster, we encountered a case where a task of a stage retried due to FetchFailure was denied to commit, even though that task was the first attempt of the retry stage.
> After careful investigation, we found that OutputCommitCoordinator.canCommit allows a task of the FetchFailure stage (with the same partition number as the new task of the retry stage) to commit. This results in TaskCommitDenied for every task of the retry stage for that partition, and because TaskCommitDenied does not count towards task failures, the application can hang forever. The log excerpt and the sketch after it illustrate the race.
>  
> {code:java}
> 2019-01-09,08:39:53,676 INFO org.apache.spark.scheduler.TaskSetManager: Starting task 138.0 in stage 5.1 (TID 31437, zjy-hadoop-prc-st159.bj, executor 456, partition 138, PROCESS_LOCAL, 5829 bytes)
> 2019-01-09,08:43:37,514 INFO org.apache.spark.scheduler.TaskSetManager: Finished task 138.0 in stage 5.0 (TID 30634) in 466958 ms on zjy-hadoop-prc-st1212.bj (executor 1632) (674/5000)
> 2019-01-09,08:45:57,372 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 138.0 in stage 5.1 (TID 31437, zjy-hadoop-prc-st159.bj, executor 456): TaskCommitDenied (Driver denied task commit) for job: 5, partition: 138, attemptNumber: 1
> 166483 2019-01-09,08:45:57,373 INFO org.apache.spark.scheduler.OutputCommitCoordinator: Task was denied committing, stage: 5, partition: 138, attempt number: 0, attempt number(counting failed stage): 1
> {code}
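> A minimal sketch of the failure mode (hypothetical, simplified names; this is not Spark's actual OutputCommitCoordinator code): if the commit authorization is keyed only by (stage, partition) with no notion of the stage attempt, a straggler task from the old FetchFailure stage attempt can claim the commit lock first, and every task of the retry attempt for that partition is then denied.
> {code:scala}
> // Hypothetical, simplified coordinator: the commit lock is keyed by
> // (stageId, partition) only, so it cannot distinguish stage attempts.
> object SimpleCommitCoordinator {
>   // (stageId, partition) -> task attemptNumber authorized to commit
>   private val authorized = scala.collection.mutable.Map[(Int, Int), Int]()
>
>   def canCommit(stageId: Int, partition: Int, attemptNumber: Int): Boolean =
>     synchronized {
>       authorized.get((stageId, partition)) match {
>         case Some(existing) => existing == attemptNumber // lock already claimed
>         case None =>
>           authorized((stageId, partition)) = attemptNumber
>           true
>       }
>     }
> }
>
> // A straggler task 138.0 from the old stage attempt (stage 5.0) can still call
> // canCommit(5, 138, 0) and claim the lock. Task 138.0 of the retry attempt
> // (stage 5.1) then gets TaskCommitDenied, and since TaskCommitDenied does not
> // count towards task failures, the partition is rescheduled indefinitely.
> {code}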



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org