You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by shivaram <gi...@git.apache.org> on 2014/07/18 01:02:50 UTC

[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

GitHub user shivaram opened a pull request:

    https://github.com/apache/spark/pull/1471

    [SPARK-2563] Make connection retries configurable

    In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. This patch makes the number of connection retries configurable to handle these cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shivaram/spark-1 connect-retries

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1471.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1471
    
----
commit a4522b1bd23f53819f09d318adf859d9a86be298
Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
Date:   2014-07-17T22:51:34Z

    Make number of connection retries configurable

commit 2b9452fd81403343c3e378754fd30e0cec7d247c
Author: Shivaram Venkataraman <sh...@cs.berkeley.edu>
Date:   2014-07-17T22:54:57Z

    Document new config variable

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1471#issuecomment-50209854
  
    I see, got it. It sounds like we should open a JIRA for creating a new socket then. It's pretty strange that you can't reuse the same one in Java, but I guess that's how it works.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1471#issuecomment-49384160
  
    QA results for PR 1471:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16793/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/1471#issuecomment-49911839
  
    Actually hold off on merging this -- I found that this patch doesn't completely solve the problem. The issue I think is that `finishConnect` throws an IOException [1] if there is a connectTimeout and we don't retry if we hit any exceptions. (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/network/Connection.scala#L327). 
    
    I'll try to test another fix and update this PR.
    
    [1] http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1471#issuecomment-50226768
  
    Sure, you can modify the existing one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1471#issuecomment-50191160
  
    QA results for PR 1471:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17192/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/1471#issuecomment-50210528
  
    Yeah I will close this PR -- Should I just modify SPARK-2563 for the Socket re-opening issue or do you think a new JIRA is better ? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/1471#issuecomment-50208195
  
    @mateiz So I looked at this more closely today -- It turns out these retries don't help much with Connection timed out exceptions. If the connection attempt times out, the socket gets closed and from [1] we get a ClosedChannelException. So even if we sleep and call `finishConnect` again, we get back the same exception.
    
    The right thing to do if the Socket was closed due to a timeout, is to open a new socket and try to connect. The ConnectionManager doesn't have support for that right now and it seems like a much bigger change. Do you think that is something that might be useful to do ?
    
    FWIW, I was able to work around my problems by increasing the number of SYN retries in Linux. (I ran `echo 8 > /proc/sys/net/ipv4/tcp_syn_retries`)
    
    [1] http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram closed the pull request at:

    https://github.com/apache/spark/pull/1471


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1471#discussion_r15302438
  
    --- Diff: docs/configuration.md ---
    @@ -755,6 +755,13 @@ Apart from these, the following properties are also available, and may be useful
       </td>
     </tr>
     <tr>
    +  <td><code>spark.core.connection.retries</code></td>
    --- End diff --
    
    Looks good, but call this spark.shuffle.connection.retries. I also wonder, should we be sleeping more after each retry? We seem to be sleeping only 1 ms after each one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1471#issuecomment-50206285
  
    @shivaram what did you think about sleeping longer after each attempt? Does each attempt already take some time to time out? Otherwise we are only sleeping 1 ms.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1471#issuecomment-50184891
  
    QA tests have started for PR 1471. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17192/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/1471#issuecomment-50371592
  
    Updated the JIRA -- Closing this issue


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1471#issuecomment-50209890
  
    If this PR doesn't help by the way, make sure to close it too so it doesn't stay in the list.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2563] Make connection retries configura...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1471#issuecomment-49378078
  
    QA tests have started for PR 1471. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16793/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---