Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:34:33 UTC

[jira] [Resolved] (SPARK-10568) Error thrown in stopping one component in SparkContext.stop() doesn't allow other components to be stopped

     [ https://issues.apache.org/jira/browse/SPARK-10568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-10568.
----------------------------------
    Resolution: Incomplete

> Error thrown in stopping one component in SparkContext.stop() doesn't allow other components to be stopped
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10568
>                 URL: https://issues.apache.org/jira/browse/SPARK-10568
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.1
>            Reporter: Matt Cheah
>            Priority: Minor
>              Labels: bulk-closed
>
> When I shut down a Java process that is running a SparkContext, a shutdown hook eventually calls SparkContext.stop(), and inside SparkContext.stop() each individual component (DiskBlockManager, scheduler backend, and so on) is stopped in turn. If stopping one of these components throws an exception, none of the remaining components are stopped cleanly either. This caused problems when I killed a Java process running a SparkContext in yarn-client mode, because YarnSchedulerBackend was never stopped properly and the application was left behind in YARN.
> The steps I ran are as follows:
> 1. Create one job which fills the cluster
> 2. Kick off another job which creates a Spark Context
> 3. Kill the Java process with the Spark Context in #2
> 4. The job remains in the YARN UI as ACCEPTED
> Looking at the logs, we see the following:
> {code}
> 2015-09-07 10:32:43,446 ERROR [Thread-3] o.a.s.u.Utils - Uncaught exception in thread Thread-3
> java.lang.NullPointerException: null
>         at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:162) ~[spark-core_2.10-1.4.1.jar:1.4.1]
>         at org.apache.spark.storage.DiskBlockManager$$anonfun$addShutdownHook$1.apply$mcV$sp(DiskBlockManager.scala:144) ~[spark-core_2.10-1.4.1.jar:1.4.1]
>         at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2308) ~[spark-core_2.10-1.4.1.jar:1.4.1]
> {code}
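> In effect the shutdown sequence behaves like the following minimal sketch (hypothetical stub classes, not Spark's actual code): the stop() calls run back to back, so the first exception short-circuits every later one.
> {code}
> // Hypothetical stand-ins for SparkContext's components; not Spark code.
> trait Stoppable { def stop(): Unit }
>
> class DiskManagerStub extends Stoppable {
>   // Simulates DiskBlockManager being asked to stop before it was initialized.
>   private val localDirs: Array[java.io.File] = null
>   override def stop(): Unit = localDirs.foreach(_.delete())  // NullPointerException
> }
>
> class YarnBackendStub extends Stoppable {
>   override def stop(): Unit = println("YarnSchedulerBackend stopped")  // never reached
> }
>
> object NaiveShutdown {
>   def main(args: Array[String]): Unit = {
>     val components: Seq[Stoppable] = Seq(new DiskManagerStub, new YarnBackendStub)
>     // The NPE from the first component aborts the loop, so the YARN backend
>     // is never told to stop and the application lingers in YARN.
>     components.foreach(_.stop())
>   }
> }
> {code}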
> I think what's going on is that when we kill the application while it is still queued, the driver runs SparkContext.stop() and tries to stop each component. It dies trying to stop the DiskBlockManager, because that component hasn't been initialized yet - the application is still waiting to be scheduled by the YARN RM - and as a result YarnClient.stop() is never invoked, leaving the application stuck in the ACCEPTED state.
> Because of what appear to be bugs in the YARN scheduler, entering this state leaves the YARN scheduler unable to schedule any more jobs until we manually remove the application via the YARN CLI. We can tackle the stuck YARN state separately, but ensuring that every component gets at least a chance to stop when a SparkContext stops seems like a good idea. We can still throw an exception and/or log the exceptions for everything that went wrong once the context has finished stopping.
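> A minimal sketch of the kind of defensive stop I mean (hypothetical helper names, not Spark's actual implementation): wrap each component's stop() in its own try/catch, log the failure, keep going, and optionally rethrow the first error once everything has had its chance.
> {code}
> import scala.util.control.NonFatal
>
> object SafeShutdown {
>
>   // Runs one stop action, logging non-fatal errors instead of propagating them.
>   private def tryStop(name: String)(body: => Unit): Option[Throwable] =
>     try { body; None } catch {
>       case NonFatal(e) =>
>         System.err.println(s"Error while stopping $name: $e")
>         Some(e)
>     }
>
>   // Every component gets a chance to stop; the first failure is rethrown at the end.
>   def stopAll(components: Seq[(String, () => Unit)]): Unit = {
>     val failures = components.flatMap { case (name, stop) => tryStop(name)(stop()) }
>     failures.headOption.foreach(e => throw e)
>   }
>
>   def main(args: Array[String]): Unit =
>     stopAll(Seq(
>       "DiskBlockManager"     -> (() => throw new NullPointerException("not initialized")),
>       "YarnSchedulerBackend" -> (() => println("YarnSchedulerBackend stopped"))  // still runs
>     ))
> }
> {code}
> With this pattern the YarnSchedulerBackend (and YarnClient) would still be stopped even when DiskBlockManager blows up first, which should keep the application from being left in the ACCEPTED state.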


