Posted to issues@spark.apache.org by "Romi Kuntsman (JIRA)" <ji...@apache.org> on 2015/10/13 18:56:05 UTC
[jira] [Commented] (SPARK-2563) Re-open sockets to handle connect timeouts
[ https://issues.apache.org/jira/browse/SPARK-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955264#comment-14955264 ]
Romi Kuntsman commented on SPARK-2563:
--------------------------------------
I got a socket timeout in Spark 1.4.0.
Is this still relevant for the latest version, or has this issue been abandoned?
> Re-open sockets to handle connect timeouts
> ------------------------------------------
>
> Key: SPARK-2563
> URL: https://issues.apache.org/jira/browse/SPARK-2563
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Shivaram Venkataraman
> Priority: Minor
>
> In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions.
> If the connection attempt times out, the socket gets closed and, per [1], we get a ClosedChannelException. We should check whether the socket was closed due to a timeout and, if so, open a new socket and retry the connection.
> FWIW, I was able to work around my problems by increasing the number of SYN retries in Linux. (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries)
> [1] http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573
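For illustration only, the retry proposed in the description might look roughly like the Java sketch below. It is not Spark's actual connection code: the class ReconnectSketch, the method connectWithRetry, and the parameters maxAttempts and timeoutMs are hypothetical names chosen for this example. It assumes a blocking SocketChannel and retries only on a connect timeout, opening a fresh channel each time because a channel whose connect failed is closed and cannot be reused.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.nio.channels.SocketChannel;

// Hypothetical helper, not part of Spark: retries a timed-out connect
// on a brand-new channel instead of reusing the closed one.
final class ReconnectSketch {

    static SocketChannel connectWithRetry(InetSocketAddress addr,
                                          int maxAttempts,
                                          int timeoutMs) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException lastTimeout = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // A failed connect leaves the channel closed, so open a new one per attempt.
            SocketChannel channel = SocketChannel.open();
            try {
                // Blocking connect with a timeout via the channel's socket adaptor.
                channel.socket().connect(addr, timeoutMs);
                return channel;
            } catch (SocketTimeoutException e) {
                // Timeout: discard this channel and try again with a fresh one.
                lastTimeout = e;
                channel.close();
            } catch (IOException e) {
                // Any other failure (e.g. connection refused) is not retried here.
                channel.close();
                throw e;
            }
        }
        // Every attempt timed out; surface the last timeout to the caller.
        throw lastTimeout;
    }
}
```

A caller would use something like ReconnectSketch.connectWithRetry(new InetSocketAddress(host, port), 3, 5000), where host, port, and the numeric values are placeholders. Whether to retry other failures as well, or to back off between attempts, is a design choice this sketch leaves open.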