Posted to issues@spark.apache.org by "Sridhar Rana (JIRA)" <ji...@apache.org> on 2017/03/11 15:57:04 UTC

[jira] [Created] (SPARK-19920) How to capture reasons (log/trace/info/anything?) for "[ERROR]Driver 172.31.25.77:45151 disassociated! Shutting down."

Sridhar Rana created SPARK-19920:
------------------------------------

             Summary: How to capture reasons (log/trace/info/anything?) for "[ERROR]Driver 172.31.25.77:45151 disassociated! Shutting down."
                 Key: SPARK-19920
                 URL: https://issues.apache.org/jira/browse/SPARK-19920
             Project: Spark
          Issue Type: Question
          Components: Spark Submit, YARN
    Affects Versions: 2.1.0
            Reporter: Sridhar Rana
            Priority: Critical


We have an AWS Cloudera Spark environment: a YARN cluster with 1 driver node and 3 executor nodes. We use Spark SQL heavily and Log4j for logging. Ours is a 24x7 long-running process in an iterative loop. The process runs fine, but after several iterations (a few hours), it reports the error "[ERROR]Driver 172.31.25.77:45151 disassociated! Shutting down.". In the same second, there is this warning: "[WARN ]Error sending message [message = Heartbeat(1,[Lscala.Tuple2;@24452d3d,BlockManagerId(1, ip-172-31-21-121.ec2.internal, 40378))] in 1 attempts". The Spark process is able to recover from this failure but takes more time to finish that iteration. Other than that, there is not much information available. How do we determine the cause of this error condition, so that appropriate measures can be taken? Can we capture that using log4j?
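For context, a minimal sketch of one way to get more detail out of Log4j around these messages. This is an assumption about where the relevant messages originate (the RPC, executor, and network packages), not a confirmed diagnosis; the logger names and file name below are illustrative and should be adapted to the cluster's own conf/log4j.properties:

    # log4j.properties fragment (sketch) - raise verbosity for the
    # components that emit the disassociation and heartbeat warnings.
    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
    # Assumed-relevant packages; DEBUG here can be noisy, use per iteration of testing.
    log4j.logger.org.apache.spark.rpc=DEBUG
    log4j.logger.org.apache.spark.executor=DEBUG
    log4j.logger.org.apache.spark.network=DEBUG
    log4j.logger.org.apache.spark.storage.BlockManagerMaster=DEBUG

To have the executors (not just the driver) pick this up on YARN, the file can be shipped with the job, e.g.:

    # Sketch: ship the custom log4j.properties to driver and executors.
    spark-submit \
      --master yarn \
      --files log4j.properties \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
      ...

With DEBUG output from these packages, the YARN container logs (yarn logs -applicationId <appId>) around the timestamp of the disassociation message are the place to look for what preceded the heartbeat failure.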



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org