Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/12/12 03:36:00 UTC

[jira] [Assigned] (SPARK-41483) MetricsSystem report takes too much time, which may lead to spark application failed on yarn.

     [ https://issues.apache.org/jira/browse/SPARK-41483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41483:
------------------------------------

    Assignee:     (was: Apache Spark)

> MetricsSystem report takes too much time, which may lead to spark application failed on yarn.
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-41483
>                 URL: https://issues.apache.org/jira/browse/SPARK-41483
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.8
>            Reporter: Deng An
>            Priority: Major
>
> My issue is similar to SPARK-31625 (https://github.com/apache/spark/pull/28435).
> In the scenario where the shutdown hook does not run (e.g., timeouts, etc.), the application is not unregistered, resulting in YARN RM resubmitting the application even if it succeeded.
> {code:java}
> 22/12/08 09:28:06 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
> 22/12/08 09:28:06 INFO SparkContext: Invoking stop() from shutdown hook
> 22/12/08 09:28:06 INFO SparkUI: Stopped Spark web UI at xxx
> 22/12/08 09:28:16 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>     at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>     at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
> 22/12/08 09:28:26 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>     at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>     at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
> 22/12/08 09:28:36 ERROR ShutdownHookManager: ShutdownHookManger shutdown forcefully. {code}
> From the log, the shutdown hook of SparkContext hangs after the web UI is stopped. Eventually Hadoop's ShutdownHookManager throws a TimeoutException and shuts the hooks down forcefully.
> As a result, the Spark application is marked FAILED by YARN, because the unregister call in the ApplicationMaster was never executed.
>  
> In the code of SparkContext#stop(), the step that follows closing the web UI is env.metricsSystem.report(). This call can block for a long time for various reasons (such as a network timeout), which is the root cause of the shutdown hook timing out.
> In our scenario, the network was unstable for a period of time, so the sinks took a long time to throw a connection timeout exception, and the SparkContext could not finish stopping within the 10s hook budget.
> {code:java}
> Utils.tryLogNonFatalError {
>   _ui.foreach(_.stop())
> }
> if (env != null) {
>   Utils.tryLogNonFatalError {
>     env.metricsSystem.report()
>   }
> } {code}
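One general way to keep a single slow step from consuming the whole shutdown budget is to run it with its own hard time limit. The sketch below is not Spark code; the class and method names (BoundedReport, runWithTimeout) are hypothetical, and it only illustrates bounding a blocking call such as metricsSystem.report() with an ExecutorService so a hung sink cannot exhaust the hook's 10s allowance:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedReport {
    // Runs a potentially blocking task with a hard time budget.
    // Returns true if the task finished in time, false if it was abandoned.
    public static boolean runWithTimeout(Runnable task, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> f = pool.submit(task);
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // task did not finish in time; give up on it
        } catch (InterruptedException | ExecutionException e) {
            return false;
        } finally {
            pool.shutdownNow(); // interrupt the worker so the JVM can exit
        }
    }

    public static void main(String[] args) {
        // A fast task completes; a sleeping task (stand-in for a sink
        // blocked on a dead network connection) is abandoned.
        boolean fast = runWithTimeout(() -> {}, 1000);
        boolean slow = runWithTimeout(() -> {
            try { Thread.sleep(5000); } catch (InterruptedException ignored) {}
        }, 200);
        System.out.println(fast + " " + slow);
    }
}
```

With this pattern, a stop() sequence could spend at most a fixed slice of the hook timeout on the metrics report and still reach the YARN unregister step.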



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org