Posted to issues@spark.apache.org by "Deng An (Jira)" <ji...@apache.org> on 2022/12/12 03:30:00 UTC

[jira] [Updated] (SPARK-41483) MetricsSystem report takes too much time, which may cause the Spark application to fail on YARN.

     [ https://issues.apache.org/jira/browse/SPARK-41483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deng An updated SPARK-41483:
----------------------------
    Description: 
My issue is similar to SPARK-31625 ([https://github.com/apache/spark/pull/28435]).

In the scenario where the shutdown hook does not run to completion (e.g., because it times out), the application is never unregistered, and the YARN RM resubmits the application even though it succeeded.
{code:java}
22/12/08 09:28:06 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
22/12/08 09:28:06 INFO SparkContext: Invoking stop() from shutdown hook
22/12/08 09:28:06 INFO SparkUI: Stopped Spark web UI at xxx
22/12/08 09:28:16 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
22/12/08 09:28:26 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
22/12/08 09:28:36 ERROR ShutdownHookManager: ShutdownHookManger shutdown forcefully.
{code}
From the log, the SparkContext shutdown hook appears to hang after the web UI is stopped. Eventually, Hadoop's ShutdownHookManager threw a TimeoutException and shut down forcefully.
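
For context, the stack traces above come from Hadoop's ShutdownHookManager, which runs each registered hook on an executor and waits on it with a timed get (10 seconds per hook in the version involved here, judging by the timestamps). A hook that blocks past that budget is abandoned with a TimeoutException. A minimal sketch of that waiting pattern, with Thread.sleep standing in for a blocked metrics report:
{code:java}
import java.util.concurrent.{Executors, TimeUnit, TimeoutException}

object HookTimeoutSketch {
  def main(args: Array[String]): Unit = {
    val executor = Executors.newSingleThreadExecutor()
    // Stand-in for SparkContext.stop() blocked inside metricsSystem.report().
    val hook = executor.submit(new Runnable {
      override def run(): Unit = Thread.sleep(60000)
    })
    try {
      // Bounded wait, like the FutureTask.get(...) frame in the stack trace.
      hook.get(10, TimeUnit.SECONDS)
    } catch {
      case _: TimeoutException =>
        // The manager gives up on the hook; whatever the hook would have
        // done after the blocking call never happens.
        println("hook timed out; later shutdown steps are skipped")
    } finally {
      executor.shutdownNow()
    }
  }
}
{code}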

This eventually led to the Spark application being marked as FAILED by YARN, because the unregister call in the ApplicationMaster was never executed.
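
For background on that unregister step (this is a bare YARN AM skeleton for illustration, not Spark's actual ApplicationMaster code; the host, port, and messages are placeholders): the AM reports its final status to the RM through AMRMClient#unregisterApplicationMaster, and if the process dies before reaching that call, the RM treats the attempt as failed and may retry the application.
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus
import org.apache.hadoop.yarn.client.api.AMRMClient

object UnregisterSketch {
  def main(args: Array[String]): Unit = {
    val amClient = AMRMClient.createAMRMClient[AMRMClient.ContainerRequest]()
    amClient.init(new Configuration())
    amClient.start()
    amClient.registerApplicationMaster("localhost", 0, "")
    try {
      // ... run the application ...
    } finally {
      // The step that never runs when the shutdown hook is killed first.
      // Without it, the RM never learns that the application SUCCEEDED.
      amClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "")
      amClient.stop()
    }
  }
}
{code}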


In the code of SparkContext#stop(), the web UI is stopped first and metricsSystem#report() is called right after. That report call may block for a long time for various reasons (such as a network timeout), and this is the root cause of the shutdown hook timeout.

In our scenario, the network was unstable for a period of time, so the sinks took a long time before throwing a connection timeout exception, which directly prevented the SparkContext from stopping within the 10-second budget.
{code:java}
Utils.tryLogNonFatalError {
  _ui.foreach(_.stop())
}
if (env != null) {
  Utils.tryLogNonFatalError {
    env.metricsSystem.report()
  }
} {code}
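
One way to keep a stuck sink from eating the whole shutdown budget would be to bound the report call itself. This is only a sketch of the idea, not Spark's actual fix for this issue; runWithTimeout and the 5-second budget are invented for illustration:
{code:java}
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BoundedReportSketch {
  // Hypothetical wrapper: wait a bounded time for a blocking action, then
  // move on. The action may still finish later on its background thread;
  // the point is that the shutdown hook stops waiting for it, so the
  // YARN unregister can still run within the hook's time budget.
  def runWithTimeout(label: String, timeout: FiniteDuration)(action: => Unit)
                    (implicit ec: ExecutionContext): Unit = {
    try {
      Await.result(Future(action), timeout)
    } catch {
      case _: TimeoutException =>
        println(s"$label did not finish within $timeout; continuing shutdown")
    }
  }

  // At the call site above it would look roughly like this (5s is arbitrary):
  //   runWithTimeout("metricsSystem.report()", 5.seconds) {
  //     env.metricsSystem.report()
  //   }(ExecutionContext.global)
}
{code}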

  was:
My issue is similar to SPARK-31625 ([https://github.com/apache/spark/pull/28435]).

In the scenario where the shutdown hook does not run to completion (e.g., because it times out), the application is never unregistered, and the YARN RM resubmits the application even though it succeeded.

```scala

22/12/08 09:28:06 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
22/12/08 09:28:06 INFO SparkContext: Invoking stop() from shutdown hook
22/12/08 09:28:06 INFO SparkUI: Stopped Spark web UI at xxx
22/12/08 09:28:16 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
22/12/08 09:28:26 WARN ShutdownHookManager: ShutdownHook 'ClientFinalizer' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
22/12/08 09:28:36 ERROR ShutdownHookManager: ShutdownHookManger shutdown forcefully.

```

From the log, the SparkContext shutdown hook appears to hang after the web UI is stopped. Eventually, Hadoop's ShutdownHookManager threw a TimeoutException and shut down forcefully.

This eventually led to the Spark application being marked as FAILED by YARN, because the unregister call in the ApplicationMaster was never executed.



> MetricsSystem report takes too much time, which may cause the Spark application to fail on YARN.
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-41483
>                 URL: https://issues.apache.org/jira/browse/SPARK-41483
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.8
>            Reporter: Deng An
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org