Posted to dev@giraph.apache.org by "Avery Ching (JIRA)" <ji...@apache.org> on 2012/08/14 09:30:37 UTC

[jira] [Created] (GIRAPH-300) Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Avery Ching created GIRAPH-300:
----------------------------------

             Summary: Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning
                 Key: GIRAPH-300
                 URL: https://issues.apache.org/jira/browse/GIRAPH-300
             Project: Giraph
          Issue Type: Improvement
            Reporter: Avery Ching
            Assignee: Avery Ching


* Upgrade to the most recent stable version of Netty (3.5.3.Final)
* Retry failed connection attempts up to n times (see the sketch right after this list)
* Track requests throughout the system by assigning each request an id and matching that id to its response (a minor refactoring of WritableRequest simplifies requests and carries the request id; see the second sketch below)
* Improve handling of Netty exceptions by dumping the exception stack to help debug failures
* Fix a bug in HashWorkerPartitioner by making partitionList thread-safe (the unsynchronized list causes divide-by-zero exceptions in real life; see the sketch at the end of this description)
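
A minimal sketch of the bounded connection-retry idea against the Netty 3.x ClientBootstrap API; the class and parameter names here (RetryingConnector, maxAttempts, waitMsecs) are illustrative, not the actual NettyClient code:

import java.net.InetSocketAddress;

import org.jboss.netty.bootstrap.ClientBootstrap;
import org.jboss.netty.channel.Channel;
import org.jboss.netty.channel.ChannelFuture;

/** Illustrative sketch of bounded connection retries with Netty 3.x. */
public class RetryingConnector {
  private final ClientBootstrap bootstrap;
  private final int maxAttempts;
  private final long waitMsecs;

  public RetryingConnector(ClientBootstrap bootstrap, int maxAttempts,
      long waitMsecs) {
    this.bootstrap = bootstrap;
    this.maxAttempts = maxAttempts;
    this.waitMsecs = waitMsecs;
  }

  /** Connect to an address, retrying on failure up to maxAttempts times. */
  public Channel connect(InetSocketAddress address)
      throws InterruptedException {
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
      ChannelFuture future = bootstrap.connect(address);
      future.awaitUninterruptibly();
      if (future.isSuccess()) {
        return future.getChannel();
      }
      // Dump the cause so the failure can be debugged, then back off.
      System.err.println("Connect attempt " + attempt + " of " + maxAttempts
          + " to " + address + " failed: " + future.getCause());
      Thread.sleep(waitMsecs);
    }
    throw new IllegalStateException("Failed to connect to " + address
        + " after " + maxAttempts + " attempts");
  }
}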

Currently, Netty connection failures cause problems with more than 75 workers in my setup.  This change allows us to reach 200+ workers on a reasonably reliable network that doesn't kill connections.

This code passes the local Hadoop regressions and the single-node Hadoop instance regressions.  It also succeeded on large runs (200+ workers) on a real Hadoop cluster.
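
The request-tracking item boils down to keeping an id-keyed map of outstanding requests and clearing each entry when its response arrives; whatever is still in the map is unanswered. A minimal sketch, with illustrative names (RequestTracker, RequestInfo) rather than the classes added by the patch:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative sketch of request-id bookkeeping. */
public class RequestTracker {
  /** Simple stand-in for per-request bookkeeping. */
  public static class RequestInfo {
    public final long startedNanos = System.nanoTime();
    public final Object destination;
    public RequestInfo(Object destination) { this.destination = destination; }
  }

  private final AtomicLong nextRequestId = new AtomicLong(0);
  private final Map<Long, RequestInfo> openRequests =
      new ConcurrentHashMap<Long, RequestInfo>();

  /** Register an outgoing request and return the id to send with it. */
  public long registerRequest(Object destination) {
    long id = nextRequestId.getAndIncrement();
    openRequests.put(id, new RequestInfo(destination));
    return id;
  }

  /** Called when the response with this id arrives; true if it was open. */
  public boolean completeRequest(long requestId) {
    return openRequests.remove(requestId) != null;
  }

  /** Requests still waiting for a response (candidates for resending). */
  public Map<Long, RequestInfo> openRequests() {
    return openRequests;
  }
}

The HashWorkerPartitioner fix amounts to making sure concurrent readers never see the partition list mid-update; an unsynchronized list can briefly look empty, and hashing a vertex id modulo a size of zero is the divide-by-zero mentioned above. Conceptually (illustrative, not the committed change):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative only: guard the partition list against concurrent updates. */
public class SafePartitionList {
  // Element type is a placeholder; the real list holds partition owners.
  private final List<Object> partitionList =
      Collections.synchronizedList(new ArrayList<Object>());

  public List<Object> getPartitionList() {
    return partitionList;
  }
}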

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-300) Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436168#comment-13436168 ] 

Avery Ching commented on GIRAPH-300:
------------------------------------

Glad to hear it, Eli.  I have another patch, which I worked on last night, that makes Netty much more reliable by implementing retries and reconnections for failed requests.  Still cleaning it up and testing.
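
For context, a rough and purely illustrative sketch of that request-retry idea (hypothetical names, not this patch): keep an open-request map like the one sketched in the issue description above and periodically resend anything that has gone unanswered for too long.

import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Illustrative only: resend requests unanswered for longer than a threshold. */
public class StaleRequestResender {
  /** Minimal view of an outstanding request. */
  public interface OpenRequest {
    long sentNanos();   // when the request was last sent
    void resend();      // send the request again
  }

  private final long maxWaitNanos;

  public StaleRequestResender(long maxWaitMsecs) {
    this.maxWaitNanos = TimeUnit.MILLISECONDS.toNanos(maxWaitMsecs);
  }

  /** Called periodically with the open-request map (request id -> request). */
  public void resendStaleRequests(Map<Long, OpenRequest> openRequests) {
    long now = System.nanoTime();
    for (OpenRequest request : openRequests.values()) {
      if (now - request.sentNanos() > maxWaitNanos) {
        request.resend();
      }
    }
  }
}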
                

[jira] [Commented] (GIRAPH-300) Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436144#comment-13436144 ] 

Eli Reisman commented on GIRAPH-300:
------------------------------------

This patch is great: I have now completed several large runs on a cluster busy enough to upset Netty under the old code, wow! The use case where Giraph shares a cluster with other Hadoop (and especially Pig) jobs that come on and off the grid during a long Giraph run has been a pain point all summer, and this is really helping. I think this use case will be typical for a lot of users, especially those with an existing Hadoop test cluster where people are already debugging jobs and want to give Giraph a try; handling this situation with some grace is really a big step. Great contribution, thanks again!

                

[jira] [Commented] (GIRAPH-300) Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435477#comment-13435477 ] 

Hudson commented on GIRAPH-300:
-------------------------------

Integrated in Giraph-trunk-Commit #173 (See [https://builds.apache.org/job/Giraph-trunk-Commit/173/])
    GIRAPH-300: Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning (aching via apresta). (Revision 1373609)

     Result = SUCCESS
aching : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1373609
Files : 
* /giraph/trunk/CHANGELOG
* /giraph/trunk/pom.xml
* /giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMessagesRequest.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMutationsRequest.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/SendVertexRequest.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java
* /giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java
* /giraph/trunk/src/test/java/org/apache/giraph/comm/ConnectionTest.java

                

[jira] [Commented] (GIRAPH-300) Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by "Alessandro Presta (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435413#comment-13435413 ] 

Alessandro Presta commented on GIRAPH-300:
------------------------------------------

Looks good, +1
                

[jira] [Commented] (GIRAPH-300) Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434507#comment-13434507 ] 

Eli Reisman commented on GIRAPH-300:
------------------------------------

We are using tons more workers; I rarely use fewer than 200. But our memory/resource profile is much lower per worker, and we are competing with existing jobs coming on and off the grid, since we run pretty much all the time. So it takes a few runs of the same job to get a real baseline of how it will behave in any generalized way (the threads and network resources in use during a given run vary widely here, etc.).

Anyway, this will definitely be helpful for us right away; looking forward to running it.
                

[jira] [Commented] (GIRAPH-300) Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434328#comment-13434328 ] 

Eli Reisman commented on GIRAPH-300:
------------------------------------

Can't wait to try this out, fantastic!

We have not found any hard limit on the number of workers at which Netty fails here, but when any one worker, for whatever reason, starts getting its Netty buffers backed up, that's when it happens. Sometimes it's during the temporary partition shuffle on INPUT_SUPERSTEP; sometimes (more rarely) it's during computation in a message-intensive algorithm.

I got one run in this morning to try out some code and hope to get more in over the next few days, and this just made my short list. Review forthcoming, looking forward to it!

                

[jira] [Updated] (GIRAPH-300) Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-300:
-------------------------------

    Attachment: GIRAPH-300.patch

Here is a patch to match https://reviews.apache.org/r/6600/
                

[jira] [Updated] (GIRAPH-300) Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-300:
-------------------------------

    Attachment: GIRAPH-300.2.patch

Synced with ReviewBoard diff 2.
                

[jira] [Commented] (GIRAPH-300) Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434341#comment-13434341 ] 

Avery Ching commented on GIRAPH-300:
------------------------------------

Thanks, Eli.  Your network is probably more stable than ours, or you're not using as many workers?

This patch should really help with tracking down network issues, which is impossible to do now =).
                

[jira] [Commented] (GIRAPH-300) Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435675#comment-13435675 ] 

Eli Reisman commented on GIRAPH-300:
------------------------------------

Whew! It turns out this was probably related to our cluster maintenance today; things seem to be OK as of now. I thought it might relate to some of the errors Avery was correcting with the patch, so I posted it here just in case. Hopefully a false alarm, unless anyone else is seeing this?

                

Re: [jira] [Commented] (GIRAPH-300) Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by Avery Ching <ac...@apache.org>.
Yes, this will happen, but it should be okay, since the connection retries will (hopefully) take care of it. This already happened with the old code (as you mentioned).

I'm also working on a more robust implementation that will retry failed requests going forward (and re-establish broken connections).

Avery

On 8/15/12 3:04 PM, Eli Reisman (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435564#comment-13435564 ]
>
> Eli Reisman commented on GIRAPH-300:
> ------------------------------------
>
> Getting errors like this during input superstep on about 20% of my workers, happens on small and large jobs. This happened before this patch got committed, but seems to be happening now too. Anyone seeing this on your runs?
>
>
> Aug 15, 2012 9:55:25 PM org.jboss.netty.channel.DefaultChannelPipeline
> WARNING: An exception was thrown by a user handler while handling an exception event ([id: 0x48433545] EXCEPTION: java.net.ConnectException: Connection timed out)
> java.lang.IllegalStateException: exceptionCaught: Channel failed with remote address null
> 	at org.apache.giraph.comm.ResponseClientHandler.exceptionCaught(ResponseClientHandler.java:107)
> 	at org.jboss.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:244)
> 	at org.apache.giraph.comm.ByteCounter.handleUpstream(ByteCounter.java:61)
> 	at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:426)
> 	at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:406)
> 	at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:362)
> 	at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:284)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> 	at java.lang.Thread.run(Thread.java:662)
> Caused by: java.net.ConnectException: Connection timed out
> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
> 	at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:400)
> 	... 5 more
>


[jira] [Commented] (GIRAPH-300) Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435564#comment-13435564 ] 

Eli Reisman commented on GIRAPH-300:
------------------------------------

I'm getting errors like this during the input superstep on about 20% of my workers; it happens on both small and large jobs. This happened before this patch got committed, but it seems to be happening now too. Is anyone seeing this on your runs?


Aug 15, 2012 9:55:25 PM org.jboss.netty.channel.DefaultChannelPipeline
WARNING: An exception was thrown by a user handler while handling an exception event ([id: 0x48433545] EXCEPTION: java.net.ConnectException: Connection timed out)
java.lang.IllegalStateException: exceptionCaught: Channel failed with remote address null
	at org.apache.giraph.comm.ResponseClientHandler.exceptionCaught(ResponseClientHandler.java:107)
	at org.jboss.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:244)
	at org.apache.giraph.comm.ByteCounter.handleUpstream(ByteCounter.java:61)
	at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:426)
	at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:406)
	at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:362)
	at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:284)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.ConnectException: Connection timed out
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
	at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:400)
	... 5 more

                