Posted to issues@spark.apache.org by "Jungtaek Lim (Jira)" <ji...@apache.org> on 2020/07/14 01:42:00 UTC

[jira] [Commented] (SPARK-32197) 'Spark driver' stays running even though 'spark application' has FAILED

    [ https://issues.apache.org/jira/browse/SPARK-32197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157070#comment-17157070 ] 

Jungtaek Lim commented on SPARK-32197:
--------------------------------------

Lowering the priority, as Critical+ requires committer's judgement.

> 'Spark driver' stays running even though 'spark application' has FAILED
> -----------------------------------------------------------------------
>
>                 Key: SPARK-32197
>                 URL: https://issues.apache.org/jira/browse/SPARK-32197
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.4.6
>            Reporter: t oo
>            Priority: Major
>         Attachments: app_executors.png, applog.txt, driverlog.txt, failed1.png, failed_stages.png, failedapp.png, j1.out, stuckdriver.png
>
>
> The app failed in 6 minutes, but the driver has been stuck for > 8 hours. I would expect the driver to exit if the app fails.
>  
> A thread dump from jstack (taken on the driver PID) is attached (j1.out); a capture sketch follows below the quoted description.
> The last part of the driver's stdout log is attached (the full log is 23 MB; the stderr log contains only the launch command).
> The last part of the application logs is attached.
>  
> You can see that the "org.apache.spark.util.ShutdownHookManager - Shutdown hook called" line never appears in the driver log after "org.apache.spark.SparkContext - Successfully stopped SparkContext" (a log-scan sketch follows below).
>  
> Using Spark 2.4.6 in standalone mode; jobs are submitted via spark-submit against the REST API (port 6066) in cluster mode (a submission sketch follows below). Other drivers/apps have worked fine with this setup; only this one got stuck. My cluster has one EC2 instance dedicated as the Spark master and one Spot EC2 instance dedicated as the Spark worker. They can auto-heal/spot-terminate at any time. From checking the AWS logs: the worker was terminated at 01:53:38.
>  
> I think you can replicate this by tearing down the worker machine while an app is running (a termination sketch follows below). You might have to try several times.
>  
> Similar to https://issues.apache.org/jira/browse/SPARK-24617, which I raised before!
>  
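
For anyone triaging a similar hang, a minimal sketch of capturing the driver thread dump described above, assuming jstack is on the PATH; the driver PID and output path are placeholders:

    import subprocess

    # Capture a thread dump of the (hypothetical) driver PID with jstack.
    # The -l flag also prints lock/synchronizer information, which helps
    # spot non-daemon threads keeping the JVM alive after SparkContext stops.
    driver_pid = "12345"  # placeholder: the actual driver PID
    dump = subprocess.run(
        ["jstack", "-l", driver_pid],
        capture_output=True, text=True, check=True,
    ).stdout
    with open("j1.out", "w") as f:
        f.write(dump)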

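A minimal sketch of the log check the reporter describes: confirm that "Shutdown hook called" never follows "Successfully stopped SparkContext" in the driver log. The log path is a placeholder:

    # Scan the driver stdout log for the two lines quoted in the report.
    stopped_at = hook_at = None
    with open("driver-stdout.log") as f:  # placeholder path
        for lineno, line in enumerate(f, start=1):
            if "Successfully stopped SparkContext" in line:
                stopped_at = lineno
            elif "Shutdown hook called" in line:
                hook_at = lineno

    if stopped_at is not None and hook_at is None:
        print(f"SparkContext stopped at line {stopped_at}, but the "
              "shutdown hook never ran: the driver is likely stuck.")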

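For context, a sketch of a cluster-mode submission against the standalone master's REST endpoint on port 6066, as the reporter used. The request schema follows the (undocumented) standalone REST submission protocol, and the master host, jar path, and class name are placeholders:

    import json
    import urllib.request

    # CreateSubmissionRequest against the standalone master's REST endpoint.
    payload = {
        "action": "CreateSubmissionRequest",
        "clientSparkVersion": "2.4.6",
        "appResource": "hdfs:///jars/myapp.jar",   # placeholder jar
        "mainClass": "com.example.MyApp",          # placeholder class
        "appArgs": [],
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
        "sparkProperties": {
            "spark.app.name": "MyApp",
            "spark.master": "spark://master-host:7077",  # placeholder master
            "spark.submit.deployMode": "cluster",
        },
    }
    req = urllib.request.Request(
        "http://master-host:6066/v1/submissions/create",  # placeholder host
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    print(urllib.request.urlopen(req).read().decode("utf-8"))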

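And a sketch of the reproduction step the reporter suggests: terminating the worker's EC2 instance while the app is running. This assumes boto3 with configured AWS credentials/region; the instance ID is a placeholder:

    import boto3

    # Tear down the Spark worker's EC2 instance mid-run to try to reproduce
    # the stuck-driver state; per the report it may take several attempts.
    ec2 = boto3.client("ec2")
    ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])  # placeholder ID
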
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org