Posted to issues@spark.apache.org by "Shekhar Gupta (Jira)" <ji...@apache.org> on 2021/09/28 02:42:00 UTC
[jira] [Created] (SPARK-36872) Decommissioning executors get killed before transferring their data because of the hardcoded timeout of 60 secs
Shekhar Gupta created SPARK-36872:
-------------------------------------
Summary: Decommissioning executors get killed before transferring their data because of the hardcoded timeout of 60 secs
Key: SPARK-36872
URL: https://issues.apache.org/jira/browse/SPARK-36872
Project: Spark
Issue Type: Improvement
Components: Kubernetes
Affects Versions: 3.1.2, 3.1.1, 3.2.0
Reporter: Shekhar Gupta
During the graceful decommissioning phase, executors need to transfer all of their shuffle and cache data to peer executors. However, they get killed before they can transfer all of that data because of the hardcoded timeout value of 60 secs in the decommissioning script. When executors die prematurely, the Spark tasks running on other executors fail, which causes application failures that are hard to debug. To work around the issue, we ended up writing a custom script with a different timeout and rebuilding the Spark image, but we would prefer a solution that does not require rebuilding the image.
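As a rough illustration of the workaround described above, the hardcoded wait could be replaced by one driven by an environment variable. This is only a sketch: DECOM_TIMEOUT_SECS and wait_for_migration are hypothetical names, not actual Spark configuration or code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a configurable decommission grace period,
# replacing a hardcoded 60-second sleep in the decommissioning script.
# DECOM_TIMEOUT_SECS is an illustrative variable name, not a real
# Spark setting.

wait_for_migration() {
  # Fall back to the current hardcoded value of 60 seconds when unset.
  local timeout="${DECOM_TIMEOUT_SECS:-60}"
  echo "Waiting ${timeout}s for shuffle/cache block migration to finish"
  sleep "${timeout}"
}
```

An operator could then tune the grace period per deployment (e.g. via the pod spec's env section) instead of rebuilding the image.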
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org