Posted to issues@spark.apache.org by "Jungtaek Lim (Jira)" <ji...@apache.org> on 2020/07/14 01:42:00 UTC
[jira] [Commented] (SPARK-32197) 'Spark driver' stays running even though 'spark application' has FAILED
[ https://issues.apache.org/jira/browse/SPARK-32197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157070#comment-17157070 ]
Jungtaek Lim commented on SPARK-32197:
--------------------------------------
Lowering the priority, as Critical or higher requires a committer's judgement.
> 'Spark driver' stays running even though 'spark application' has FAILED
> -----------------------------------------------------------------------
>
> Key: SPARK-32197
> URL: https://issues.apache.org/jira/browse/SPARK-32197
> Project: Spark
> Issue Type: Bug
> Components: Scheduler, Spark Core
> Affects Versions: 2.4.6
> Reporter: t oo
> Priority: Major
> Attachments: app_executors.png, applog.txt, driverlog.txt, failed1.png, failed_stages.png, failedapp.png, j1.out, stuckdriver.png
>
>
> The app failed within 6 minutes, but the driver has been stuck for more than 8 hours. I would expect the driver to fail if the app fails.
>
> A thread dump from jstack (taken on the driver PID) is attached (j1.out).
> The last part of the driver stdout log is attached (the full log is 23MB; the stderr log only contains the launch command).
> The last part of the app logs is attached.
>
> The driver log shows that the "org.apache.spark.util.ShutdownHookManager - Shutdown hook called" line never appears after "org.apache.spark.SparkContext - Successfully stopped SparkContext".
>
> Using Spark 2.4.6 in standalone mode. The app was submitted with spark-submit to the REST API (port 6066) in cluster mode. Other drivers/apps have worked fine with this setup; only this one got stuck. My cluster has one EC2 instance dedicated as the Spark master and one Spot EC2 instance dedicated as the Spark worker. Either can auto-heal or be spot-terminated at any time. According to the AWS logs, the worker was terminated at 01:53:38.
>
> I think you can reproduce this by tearing down the worker machine while the app is running. You might have to try several times.
>
> Similar to https://issues.apache.org/jira/browse/SPARK-24617, which I raised before.
>
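For anyone who wants to check their own driver log for the symptom described above, here is a minimal Python sketch. It only looks for the two message fragments quoted in the report; the function name and the sample lines are hypothetical, and it assumes the log is available as plain text lines. It says nothing about why the hook did not run.

```python
# Message fragments taken from the log lines quoted in the issue description.
STOPPED = "Successfully stopped SparkContext"
HOOK = "Shutdown hook called"


def shutdown_hook_missing(lines):
    """Return True if the SparkContext-stopped message appears in `lines`
    and no shutdown-hook message appears on any later line."""
    stopped_seen = False
    for line in lines:
        if STOPPED in line:
            stopped_seen = True
        elif stopped_seen and HOOK in line:
            return False
    return stopped_seen


# Hypothetical sample lines, shaped like the fragments quoted in the report.
stuck_log = [
    "INFO org.apache.spark.SparkContext - Successfully stopped SparkContext",
]
clean_log = stuck_log + [
    "INFO org.apache.spark.util.ShutdownHookManager - Shutdown hook called",
]

print(shutdown_hook_missing(stuck_log))
print(shutdown_hook_missing(clean_log))
```

To run it against a real log, pass an open file object (for example the attached driverlog.txt, if saved locally) instead of the sample lists.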