Posted to issues@spark.apache.org by "t oo (Jira)" <ji...@apache.org> on 2020/07/06 14:46:00 UTC

[jira] [Created] (SPARK-32197) 'Spark driver' stays running even though 'spark application' has FAILED

t oo created SPARK-32197:
----------------------------

             Summary: 'Spark driver' stays running even though 'spark application' has FAILED
                 Key: SPARK-32197
                 URL: https://issues.apache.org/jira/browse/SPARK-32197
             Project: Spark
          Issue Type: Bug
          Components: Scheduler, Spark Core
    Affects Versions: 2.4.6
            Reporter: t oo
         Attachments: applog.txt, driverlog.txt, j1.out

The app failed after 6 minutes, but the driver has been stuck for more than 8 hours. I would expect the driver to fail when the app fails.

A thread dump from jstack (run against the driver pid) is attached (j1.out).

The last part of the stdout driver log is attached (the full log is 23 MB; the stderr log contains only the launch command).

The last part of the app logs is attached.

Using Spark 2.4.6 in standalone mode; spark-submit to the REST API (port 6066) in cluster mode was used. Other drivers/apps have worked fine with this setup; only this one got stuck. My cluster has one EC2 instance dedicated as the Spark master and one Spot EC2 instance dedicated as the Spark worker. They can auto-heal/spot-terminate at any time. From checking the AWS logs, the worker was terminated at 01:53:38.
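
For reference, a minimal sketch of the submission path described above. The master hostname, main class, and jar path are placeholders, not taken from this report; port 6066 is the standalone master's REST submission endpoint, and --deploy-mode cluster makes the master launch the driver on a worker.

```shell
# Hedged sketch of the spark-submit invocation (placeholders, not the
# reporter's actual values). Port 6066 = standalone REST submission port.
spark-submit \
  --master spark://spark-master-host:6066 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  /path/to/my-app.jar
```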

I think you can replicate this by tearing down the worker machine while an app is running. You might have to try several times.
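
The reproduction steps above can be sketched as follows. This is an assumption-laden outline, not the reporter's exact procedure: the instance id and worker hostname are hypothetical, and whether a single termination triggers the hang is not guaranteed (hence the "try several times" note).

```shell
# Reproduction sketch (hypothetical instance id / hostname):
# 1. Submit a long-running app in cluster mode (see spark-submit above).
# 2. Terminate the worker while the app is running, either via AWS:
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
#    or, without AWS, by killing the Worker JVM on the worker host:
#    ssh worker-host 'pkill -f org.apache.spark.deploy.worker.Worker'
# 3. Watch the standalone Master UI (port 8080 by default): the reported
#    symptom is the application going to FAILED while its driver stays
#    in RUNNING state indefinitely.
```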

Similar to https://issues.apache.org/jira/browse/SPARK-24617, which I raised before.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org