You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@giraph.apache.org by Avery Ching <av...@gmail.com> on 2012/08/14 09:32:06 UTC

Review Request: GIRAPH-300 : Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6600/
-----------------------------------------------------------

Review request for giraph.


Description
-------

* Upgrade to the most recent stable version of Netty (3.5.3.Final)
* Try multiple connection attempts up to n failures
* Track requests throughout the system by keeping track of the request id and then matching the request id to the response (minor refactoring of WritableRequest to make requests simpler and support the request id)
* Improved handling of netty exceptions by dumping the exception stack to help debug failures
* Fixes bug in HashWorkerPartitioner by making partitionList thread-safe (this causes divide by zero exceptions in real life)


This addresses bug GIRAPH-300.
    https://issues.apache.org/jira/browse/GIRAPH-300


Diffs
-----

  http://svn.apache.org/repos/asf/giraph/trunk/pom.xml 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java PRE-CREATION 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMessagesRequest.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMutationsRequest.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendVertexRequest.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/ConnectionTest.java 1372575 

Diff: https://reviews.apache.org/r/6600/diff/


Testing
-------

Currently, netty connection failures causes issues with more than 75 workers in my setup. This allows us to reach over 200+ in a reasonably reliable network that doesn't kill connections.

This code passes the local Hadoop regressions and the single node Hadoop instance regressions. It also succeeded on large runs (200+ workers) on a real Hadoop cluster.


Thanks,

Avery Ching


Re: Review Request: GIRAPH-300 : Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by Avery Ching <av...@gmail.com>.

> On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote:
> > This is a big win, thanks!
> > So it was a reliability issue masked as a scalability one: more workers -> increased probability of network failure -> waiting forever on lost requests.
> > Now is there anything that can be done to minimize those failures in the first place, or does it just depend on the cluster setup?
> > 
> > I didn't know we had that partitioning bug. When is updatePartitionOwners() called concurrently with getPartitionOwner()? I guess we might be processing vertex requests while doing the partition exchange?

Thanks for the review!

Next patch can improve the situation by retrying. =)  The problem with updatePartitionOwners() is that it is called by the main thread in BspServiceWorker#setup(), but getPartitionOwners() is called by the netty server threads.


> On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java, line 389
> > <https://reviews.apache.org/r/6600/diff/1/?file=139832#file139832line389>
> >
> >     This looks a bit weird (using a TimedLogger for the timing but doing the actual logging on the raw Logger) although I see where the problem is (need to call it multiple times without waiting for the next deadline).

yeah, we can probably augment TimedLogger to have a level in a future patch.


> On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java, line 63
> > <https://reviews.apache.org/r/6600/diff/1/?file=139833#file139833line63>
> >
> >     Missing a javadoc here.

Thanks.  Fixed.


> On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java, line 51
> > <https://reviews.apache.org/r/6600/diff/1/?file=139845#file139845line51>
> >
> >     We could call isPrintable() here to avoid duplication.

Great suggestion.  Fixed.


- Avery


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6600/#review10330
-----------------------------------------------------------


On Aug. 15, 2012, 6:52 p.m., Avery Ching wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6600/
> -----------------------------------------------------------
> 
> (Updated Aug. 15, 2012, 6:52 p.m.)
> 
> 
> Review request for giraph.
> 
> 
> Description
> -------
> 
> * Upgrade to the most recent stable version of Netty (3.5.3.Final)
> * Try multiple connection attempts up to n failures
> * Track requests throughout the system by keeping track of the request id and then matching the request id to the response (minor refactoring of WritableRequest to make requests simpler and support the request id)
> * Improved handling of netty exceptions by dumping the exception stack to help debug failures
> * Fixes bug in HashWorkerPartitioner by making partitionList thread-safe (this causes divide by zero exceptions in real life)
> 
> 
> This addresses bug GIRAPH-300.
>     https://issues.apache.org/jira/browse/GIRAPH-300
> 
> 
> Diffs
> -----
> 
>   http://svn.apache.org/repos/asf/giraph/trunk/pom.xml 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java PRE-CREATION 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMessagesRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMutationsRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendVertexRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/ConnectionTest.java 1372575 
> 
> Diff: https://reviews.apache.org/r/6600/diff/
> 
> 
> Testing
> -------
> 
> Currently, netty connection failures causes issues with more than 75 workers in my setup. This allows us to reach over 200+ in a reasonably reliable network that doesn't kill connections.
> 
> This code passes the local Hadoop regressions and the single node Hadoop instance regressions. It also succeeded on large runs (200+ workers) on a real Hadoop cluster.
> 
> 
> Thanks,
> 
> Avery Ching
> 
>


Re: Review Request: GIRAPH-300 : Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by Alessandro Presta <al...@fb.com>.

> On Aug. 15, 2012, 2:03 p.m., Alessandro Presta wrote:
> > http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java, line 389
> > <https://reviews.apache.org/r/6600/diff/1/?file=139832#file139832line389>
> >
> >     This looks a bit weird (using a TimedLogger for the timing but doing the actual logging on the raw Logger) although I see where the problem is (need to call it multiple times without waiting for the next deadline).
> 
> Avery Ching wrote:
>     yeah, we can probably augment TimedLogger to have a level in a future patch.

Yeah, also making it a full-fledged Logger (with its own isInfoEnabled() for example) would improve encapsulation. But this is very minor stuff.


- Alessandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6600/#review10330
-----------------------------------------------------------


On Aug. 15, 2012, 6:52 p.m., Avery Ching wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6600/
> -----------------------------------------------------------
> 
> (Updated Aug. 15, 2012, 6:52 p.m.)
> 
> 
> Review request for giraph.
> 
> 
> Description
> -------
> 
> * Upgrade to the most recent stable version of Netty (3.5.3.Final)
> * Try multiple connection attempts up to n failures
> * Track requests throughout the system by keeping track of the request id and then matching the request id to the response (minor refactoring of WritableRequest to make requests simpler and support the request id)
> * Improved handling of netty exceptions by dumping the exception stack to help debug failures
> * Fixes bug in HashWorkerPartitioner by making partitionList thread-safe (this causes divide by zero exceptions in real life)
> 
> 
> This addresses bug GIRAPH-300.
>     https://issues.apache.org/jira/browse/GIRAPH-300
> 
> 
> Diffs
> -----
> 
>   http://svn.apache.org/repos/asf/giraph/trunk/pom.xml 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java PRE-CREATION 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMessagesRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMutationsRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendVertexRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/ConnectionTest.java 1372575 
> 
> Diff: https://reviews.apache.org/r/6600/diff/
> 
> 
> Testing
> -------
> 
> Currently, netty connection failures causes issues with more than 75 workers in my setup. This allows us to reach over 200+ in a reasonably reliable network that doesn't kill connections.
> 
> This code passes the local Hadoop regressions and the single node Hadoop instance regressions. It also succeeded on large runs (200+ workers) on a real Hadoop cluster.
> 
> 
> Thanks,
> 
> Avery Ching
> 
>


Re: Review Request: GIRAPH-300 : Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by Alessandro Presta <al...@fb.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6600/#review10330
-----------------------------------------------------------


This is a big win, thanks!
So it was a reliability issue masked as a scalability one: more workers -> increased probability of network failure -> waiting forever on lost requests.
Now is there anything that can be done to minimize those failures in the first place, or does it just depend on the cluster setup?

I didn't know we had that partitioning bug. When is updatePartitionOwners() called concurrently with getPartitionOwner()? I guess we might be processing vertex requests while doing the partition exchange?


http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
<https://reviews.apache.org/r/6600/#comment21950>

    This looks a bit weird (using a TimedLogger for the timing but doing the actual logging on the raw Logger) although I see where the problem is (need to call it multiple times without waiting for the next deadline).



http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java
<https://reviews.apache.org/r/6600/#comment21947>

    Missing a javadoc here.



http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java
<https://reviews.apache.org/r/6600/#comment21949>

    We could call isPrintable() here to avoid duplication.


- Alessandro Presta


On Aug. 14, 2012, 7:32 a.m., Avery Ching wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6600/
> -----------------------------------------------------------
> 
> (Updated Aug. 14, 2012, 7:32 a.m.)
> 
> 
> Review request for giraph.
> 
> 
> Description
> -------
> 
> * Upgrade to the most recent stable version of Netty (3.5.3.Final)
> * Try multiple connection attempts up to n failures
> * Track requests throughout the system by keeping track of the request id and then matching the request id to the response (minor refactoring of WritableRequest to make requests simpler and support the request id)
> * Improved handling of netty exceptions by dumping the exception stack to help debug failures
> * Fixes bug in HashWorkerPartitioner by making partitionList thread-safe (this causes divide by zero exceptions in real life)
> 
> 
> This addresses bug GIRAPH-300.
>     https://issues.apache.org/jira/browse/GIRAPH-300
> 
> 
> Diffs
> -----
> 
>   http://svn.apache.org/repos/asf/giraph/trunk/pom.xml 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java PRE-CREATION 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMessagesRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMutationsRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendVertexRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java 1372575 
>   http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/ConnectionTest.java 1372575 
> 
> Diff: https://reviews.apache.org/r/6600/diff/
> 
> 
> Testing
> -------
> 
> Currently, netty connection failures causes issues with more than 75 workers in my setup. This allows us to reach over 200+ in a reasonably reliable network that doesn't kill connections.
> 
> This code passes the local Hadoop regressions and the single node Hadoop instance regressions. It also succeeded on large runs (200+ workers) on a real Hadoop cluster.
> 
> 
> Thanks,
> 
> Avery Ching
> 
>


Re: Review Request: GIRAPH-300 : Improve netty reliability with retrying failed connections, tracking requests, thread-safe hash partitioning

Posted by Avery Ching <av...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6600/
-----------------------------------------------------------

(Updated Aug. 15, 2012, 6:52 p.m.)


Review request for giraph.


Changes
-------

Addressed Alessandro's comments.


Description
-------

* Upgrade to the most recent stable version of Netty (3.5.3.Final)
* Try multiple connection attempts up to n failures
* Track requests throughout the system by keeping track of the request id and then matching the request id to the response (minor refactoring of WritableRequest to make requests simpler and support the request id)
* Improved handling of netty exceptions by dumping the exception stack to help debug failures
* Fixes bug in HashWorkerPartitioner by making partitionList thread-safe (this causes divide by zero exceptions in real life)


This addresses bug GIRAPH-300.
    https://issues.apache.org/jira/browse/GIRAPH-300


Diffs (updated)
-----

  http://svn.apache.org/repos/asf/giraph/trunk/pom.xml 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java PRE-CREATION 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMessagesRequest.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendPartitionMutationsRequest.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/SendVertexRequest.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/partition/HashWorkerPartitioner.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/utils/TimedLogger.java 1372575 
  http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/ConnectionTest.java 1372575 

Diff: https://reviews.apache.org/r/6600/diff/


Testing
-------

Currently, netty connection failures causes issues with more than 75 workers in my setup. This allows us to reach over 200+ in a reasonably reliable network that doesn't kill connections.

This code passes the local Hadoop regressions and the single node Hadoop instance regressions. It also succeeded on large runs (200+ workers) on a real Hadoop cluster.


Thanks,

Avery Ching