Posted to dev@giraph.apache.org by "Jakob Homan (Updated) (JIRA)" <ji...@apache.org> on 2011/10/15 03:08:11 UTC

[jira] [Updated] (GIRAPH-37) Implement Netty-backed rpc solution

     [ https://issues.apache.org/jira/browse/GIRAPH-37?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jakob Homan updated GIRAPH-37:
------------------------------

    Attachment: GIRAPH-37-wip.patch

Here's a work-in-progress patch for review; I have to spend next week working on something else, so I wanted to get it out before it went stale.  It uses Finagle with Thrift.  The experience was at first challenging due to Finagle ramp-up costs, then nice, and is now challenging again due to stability issues.  95% of the size of the patch is generated Thrift code; I'm not usually a fan of including generated code, but as explained below, it's a reasonable approach for Finagle.
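
To make the Thrift piece concrete: stock Apache Thrift generates a blocking interface, while the forked compiler additionally generates a {{Future}}-based interface plus client/server adapters that plug directly into Finagle's asynchronous stack.  Roughly (the names and signatures here are illustrative, not the actual generated code in the patch):

{code:java}
// Stock Apache Thrift emits a synchronous, blocking interface:
public interface Iface {
  byte[] call(byte[] request) throws org.apache.thrift.TException;
}

// The forked compiler additionally emits an asynchronous variant built on
// com.twitter.util.Future, which is the type Finagle's client and server
// stacks compose on:
public interface ServiceIface {
  com.twitter.util.Future<byte[]> call(byte[] request);
}
{code}

This is also why the generated code ends up checked in: the stock compiler simply can't produce the second interface.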

The good:
* With this patch I can scale up to about 1k workers, although not reliably (see bad points)
* This approach moves us away from Hadoop RPC, which is good for the upcoming Yarn work and because Hadoop RPC itself is not ideal.
* Judging by what Hyunsik had to go through when he was exploring Netty+PB, Finagle definitely saves quite a lot of work.
* This exercise has identified several improvements to the overall codebase that need to be made.  I've opened GIRAPH-57, GIRAPH-55 and GIRAPH-54 for these.

The bad:
* The Thrift-Finagle combination uses a forked version of the Thrift compiler to generate the interface Finagle expects (sketched above).  Once up and running this is fine, but it means we'd be dependent on this oddity.  It also means we'd need to include the generated code, since it's too much to ask regular developers (those not interested in the rpc layer) to download a forked Thrift compiler from github, compile it, keep it around, etc.
* Quite a lot of knobs are necessary to get a reliable run with a large number of mappers (see the configuration sketch toward the end of this comment).  This is partially a fact of life with a distributed rpc, and we can probably determine some of the values programmatically, but at the moment I can only get successful runs about 2/3 of the time.  The rest of the time I get stack traces that are very difficult to decipher, such as:
{noformat}
WARNING: An exception was thrown by a user handler while handling an exception event ([id: 0x4b7f1841, /172.18.67.79:46082 :> esv4-hcl227.corp.linkedin.com/172.18.66.182:30047] EXCEPTION: com.twitter.util.Promise$ImmutableResult: Result set multiple times: Throw(java.lang.RuntimeException: Hit exception in proxied call))
java.lang.RuntimeException: Hit exception in proxied call
	at org.apache.giraph.comm.finaglerpc.ThriftRPCProxyClient$CDLListener.onFailure(ThriftRPCProxyClient.java:91)
	at com.twitter.util.Future$$anonfun$addEventListener$1.apply(Future.scala:277)
	at com.twitter.util.Future$$anonfun$addEventListener$1.apply(Future.scala:276)
	at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:471)
	at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:467)
	at com.twitter.concurrent.IVar.set(IVar.scala:50)
	at com.twitter.util.Promise.updateIfEmpty(Future.scala:462)
	at com.twitter.util.Promise.update(Future.scala:450)
	at com.twitter.util.Promise$$anon$2$$anonfun$8.apply(Future.scala:506)
	at com.twitter.util.Promise$$anon$2$$anonfun$8.apply(Future.scala:497)
	at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:471)
	at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:467)
	at com.twitter.concurrent.IVar.set(IVar.scala:50)
	at com.twitter.util.Promise.updateIfEmpty(Future.scala:462)
	at com.twitter.util.Promise.update(Future.scala:450)
	at com.twitter.finagle.service.RetryingFilter$$anonfun$1.apply(RetryingFilter.scala:73)
	at com.twitter.finagle.service.RetryingFilter$$anonfun$1.apply(RetryingFilter.scala:56)
	at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:471)
	at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:467)
	at com.twitter.concurrent.IVar.set(IVar.scala:50)
	at com.twitter.concurrent.IVar.set(IVar.scala:55)
	at com.twitter.util.Promise.updateIfEmpty(Future.scala:462)
	at com.twitter.util.Promise.update(Future.scala:450)
	at com.twitter.util.Promise$$anon$2$$anonfun$8$$anonfun$apply$7.apply(Future.scala:502)
	at com.twitter.util.Promise$$anon$2$$anonfun$8$$anonfun$apply$7.apply(Future.scala:502)
	at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:471)
	at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:467)
	at com.twitter.concurrent.IVar.set(IVar.scala:50)
	at com.twitter.concurrent.IVar.set(IVar.scala:55)
	at com.twitter.concurrent.IVar.set(IVar.scala:55)
	at com.twitter.concurrent.IVar.set(IVar.scala:55)
	at com.twitter.concurrent.IVar.set(IVar.scala:55)
	at com.twitter.util.Promise.updateIfEmpty(Future.scala:462)
	at com.twitter.util.Promise.update(Future.scala:450)
	at com.twitter.util.Promise$$anon$1$$anonfun$7.apply(Future.scala:491)
	at com.twitter.util.Promise$$anon$1$$anonfun$7.apply(Future.scala:490)
	at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:471)
	at com.twitter.util.Promise$$anonfun$respond$1.apply(Future.scala:467)
	at com.twitter.concurrent.IVar.set(IVar.scala:50)
	at com.twitter.concurrent.IVar.set(IVar.scala:55)
	at com.twitter.util.Promise.updateIfEmpty(Future.scala:462)
	at com.twitter.util.Promise.update(Future.scala:450)
	at com.twitter.finagle.channel.ChannelService.com$twitter$finagle$channel$ChannelService$$reply(ChannelService.scala:51)
	at com.twitter.finagle.channel.ChannelService$$anon$1.exceptionCaught(ChannelService.scala:74)
	at org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:66)
	at org.jboss.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:238)
	at com.twitter.finagle.thrift.ThriftFrameCodec.handleUpstream(ThriftFrameCodec.scala:11)
	at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:432)
	at org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:52)
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:302)
	at org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:76)
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:302)
	at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:317)
	at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:299)
	at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:216)
	at com.twitter.finagle.thrift.ThriftFrameCodec.handleUpstream(ThriftFrameCodec.scala:11)
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274)
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261)
	at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:349)
	at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)
	at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)
{noformat}
Another one that happens quite a lot is {{Caused by: com.twitter.finagle.UnknownChannelException: com.twitter.util.Promise$ImmutableResult: Result set multiple times: Throw(java.lang.RuntimeException: Hit exception in proxied call)}}.  I think I need some aid from someone more experienced with Finagle, but I'm a bit nervous about the underlying framework being difficult to debug and configure.
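
For what it's worth, the {{Result set multiple times}} portion of both errors means a {{com.twitter.util.Promise}} is being completed twice: a Promise holds exactly one result, and {{update()}} throws {{ImmutableResult}} on a second completion, presumably because the retry filter and my listener both try to fail the same call.  A guard along these lines (an illustrative sketch, not what the patch currently does) would at least turn the crash into a log line while we figure out the root cause:

{code:java}
import com.twitter.util.Promise;
import com.twitter.util.Throw;

public class PromiseGuard {
  // Complete the promise at most once.  update() throws
  // Promise$ImmutableResult if a result was already set, whereas
  // updateIfEmpty() returns false and keeps the first result.
  public static void failAtMostOnce(Promise<byte[]> promise, Throwable cause) {
    if (!promise.updateIfEmpty(new Throw<byte[]>(cause))) {
      System.err.println("Promise already completed; ignoring: " + cause);
    }
  }
}
{code}

Of course, papering over the double completion wouldn't explain why it happens in the first place.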

Currently the patch passes all unit tests (though it needs more coverage for the Finagle section itself).  Overall, I think the patch is worth pursuing: it could be committed with Hadoop RPC remaining the default, and the config/stability issues resolved in follow-up patches.  Perhaps it's just an issue of lousy configuration on my part.  Another option would be to look in a different direction, such as MessagePack.
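
To give a flavor of the configuration involved, here's roughly what wiring up a client looks like; the builder is where most of the knobs live.  ({{MyService}}, the host/port, and the specific values are illustrative, not from the patch.)

{code:java}
import java.net.InetSocketAddress;

import org.apache.thrift.protocol.TBinaryProtocol;

import com.twitter.finagle.Service;
import com.twitter.finagle.builder.ClientBuilder;
import com.twitter.finagle.thrift.ThriftClientFramedCodec;
import com.twitter.finagle.thrift.ThriftClientRequest;

public class FinagleClientSketch {
  public static MyService.ServiceIface buildClient(String host, int port) {
    // Raw transport: a Finagle Service speaking framed Thrift.  Each
    // builder setting is one of the knobs that has to be tuned when
    // running with a large number of mappers.
    Service<ThriftClientRequest, byte[]> transport =
        ClientBuilder.safeBuild(ClientBuilder.get()
            .hosts(new InetSocketAddress(host, port))
            .codec(ThriftClientFramedCodec.get())
            .hostConnectionLimit(2)   // concurrent connections per host
            .retries(3));             // retry attempts before failing the call

    // Typed client: the adapter generated by the forked compiler wraps the
    // transport in the Future-based interface sketched earlier.
    return new MyService.ServiceToClient(transport, new TBinaryProtocol.Factory());
  }
}
{code}

Multiply those settings by timeouts, buffer sizes, etc. across a thousand workers, and it's easy to see how a configuration mistake on my part could account for the flakiness.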

Thoughts?
                
> Implement Netty-backed rpc solution
> -----------------------------------
>
>                 Key: GIRAPH-37
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-37
>             Project: Giraph
>          Issue Type: New Feature
>            Reporter: Jakob Homan
>            Assignee: Jakob Homan
>         Attachments: GIRAPH-37-wip.patch
>
>
> GIRAPH-12 considered replacing the current Hadoop-based rpc method with Netty, but instead went in another direction. I think there is still value in this approach, and will also look at Finagle.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira