Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:13:27 UTC

[jira] [Resolved] (SPARK-18691) Spark can hang if a node goes down during a shuffle

     [ https://issues.apache.org/jira/browse/SPARK-18691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-18691.
----------------------------------
    Resolution: Incomplete

> Spark can hang if a node goes down during a shuffle
> ---------------------------------------------------
>
>                 Key: SPARK-18691
>                 URL: https://issues.apache.org/jira/browse/SPARK-18691
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.6.1
>         Environment: Running on an AWS EMR cluster using yarn
>            Reporter: Nicholas Brown
>            Priority: Major
>              Labels: bulk-closed
>         Attachments: retries.txt
>
>
> We have a 200-node cluster that sometimes hangs if a node goes down. The failure is detected and a replacement node is spun up just fine.  However, the other 199 nodes then appear to become unresponsive.  From looking at the logs and code, this is what appears to be happening.
> The other nodes each have 36 threads (as far as I can tell, they are processing the shuffle) that time out while trying to fetch data from the now-dead node.  With the default configuration, they are supposed to retry 3 times with 5 seconds between retries, for a maximum delay of 15 seconds.  However, because spark.shuffle.io.numConnectionsPerPeer is set to the default value of 1, those fetches can only go one at a time.  And with the two-minute default network timeout, each blocked attempt waits out the full timeout, so the overall delay stretches to several hours, which is obviously unacceptable (see the rough calculation below).  I'll attach a portion of our logs which shows the retries getting further and further behind.
> Now I can work around this partially by playing with the configuration options mentioned above, particularly by upping numConnectionsPerPeer (see the configuration sketch below).  But it seems Spark should be able to handle this.  One thought is to make maxRetries apply across all threads that are accessing the same node.
> I've seen this on 1.6.1, but from looking at the code I suspect it exists in the latest version as well.
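>
> A rough worst-case reading of those numbers, assuming the ~36 fetch threads on a node serialize behind the single connection to the dead peer and each attempt blocks for the full network timeout before failing (the thread count and serialization come from the description above; the arithmetic itself is only illustrative):
>
>     // back-of-the-envelope estimate, not a measurement
>     val threads = 36                 // fetch threads per node, per the description above
>     val attemptsPerThread = 1 + 3    // first try plus spark.shuffle.io.maxRetries (default 3)
>     val timeoutSeconds = 120         // spark.network.timeout default
>     val worstCaseSeconds = threads * attemptsPerThread * timeoutSeconds
>     // 36 * 4 * 120 = 17280 seconds, roughly 4.8 hours, consistent with the "several hours" above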
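>
> And a minimal sketch of the partial workaround, using the standard shuffle-client settings named above (the values shown are illustrative, not our actual settings):
>
>     import org.apache.spark.SparkConf
>
>     val conf = new SparkConf()
>       // let fetch threads fail against the dead node in parallel instead of queueing
>       // behind a single connection
>       .set("spark.shuffle.io.numConnectionsPerPeer", "4")  // default is 1
>       .set("spark.shuffle.io.maxRetries", "3")             // default is 3
>       .set("spark.shuffle.io.retryWait", "5s")             // default is 5s
>       .set("spark.network.timeout", "120s")                // default; the two-minute timeout referred to above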



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org