Posted to dev@giraph.apache.org by "Avery Ching (JIRA)" <ji...@apache.org> on 2012/08/18 09:52:37 UTC

[jira] [Created] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Avery Ching created GIRAPH-306:
----------------------------------

Summary: Netty requests should be reliable and implement exactly once semantics
Key: GIRAPH-306
URL: https://issues.apache.org/jira/browse/GIRAPH-306
Project: Giraph
Issue Type: Improvement
Reporter: Avery Ching
Priority: Critical

One of the biggest scalability challenges is getting Giraph to run reliably on a large number of tasks (i.e. > 200). Several problems exist:

1) If the connection fails after the initial connection was made, the job will die.
2) Requests must be completed exactly once. This is difficult to implement, but required since we cannot have multiple retried requests succeed (e.g. a vertex would get more messages than expected).
3) Sometimes there are unresolved addresses, causing failure.

This patch addresses these issues by re-establishing failed connections and keeping track of every request sent to every worker. If a request fails or exceeds a timeout, it will be resent. The server keeps track of requests that succeeded to ensure that the same request is not processed more than once. The structure for tracking the succeeded requests on the server (IncreasingBitSet) is efficient for handling increasing request ids. For handling unresolved addresses, I added retry logic that keeps trying to resolve them.
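
The description names IncreasingBitSet but does not show it. Purely as an illustration of the idea (this is not the patch's code), a server could remember processed request ids with a low-water mark plus a small bit window, which stays compact as long as ids mostly arrive in increasing order. The names SeenRequestIds, markIfFirst and MAX_WINDOW below are hypothetical.

```java
import java.util.BitSet;

/**
 * Hypothetical sketch of server-side duplicate detection for increasing,
 * non-negative request ids. Not Giraph's IncreasingBitSet.
 */
final class SeenRequestIds {
    /** Assumed cap on how far an id may run ahead of the low-water mark. */
    private static final long MAX_WINDOW = 1L << 20;

    /** Every id below this value has already been processed. */
    private long lowWaterMark = 0;

    /** Bit i set means id (lowWaterMark + i) has been processed. */
    private BitSet window = new BitSet();

    /** Returns true the first time an id is seen, false for a resend. */
    synchronized boolean markIfFirst(long requestId) {
        if (requestId < lowWaterMark) {
            return false; // already processed and compacted away
        }
        long distance = requestId - lowWaterMark;
        if (distance >= MAX_WINDOW) {
            throw new IllegalArgumentException("request id too far ahead: " + requestId);
        }
        int offset = (int) distance;
        if (window.get(offset)) {
            return false; // duplicate inside the current window
        }
        window.set(offset);

        // Slide the window past the contiguous prefix of processed ids so
        // memory use tracks the number of gaps, not the largest id seen.
        int firstClear = window.nextClearBit(0);
        if (firstClear > 0) {
            lowWaterMark += firstClear;
            window = window.get(firstClear, window.length());
        }
        return true;
    }

    synchronized long lowWaterMark() {
        return lowWaterMark;
    }
}
```

A request handler would call markIfFirst(id) and apply the request only when it returns true; a resend of an already applied request is then acknowledged without being applied again.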

This patch also adds several unit tests that use fault injection to simulate a lost response or a closed channel exception on the server. It also has unit tests for IncreasingBitSet to ensure it works correctly and efficiently.
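
The tests themselves are not included in this message. A minimal sketch of the lost-response case, assuming an in-memory stand-in for the server rather than Netty, could look like the following. FlakyServer and RetryingClient are hypothetical names, and the resend here is triggered by a missing acknowledgement instead of a real timeout.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory server that can "lose" its first few responses. */
final class FlakyServer {
    private final Set<Long> processedIds = new HashSet<>();
    private int responsesToDrop;
    private int appliedCount = 0;

    FlakyServer(int responsesToDrop) {
        this.responsesToDrop = responsesToDrop;
    }

    /** Applies a request at most once; returns false when the ack is dropped. */
    boolean handle(long requestId) {
        if (processedIds.add(requestId)) {
            appliedCount++; // the side effect happens only on first delivery
        }
        if (responsesToDrop > 0) {
            responsesToDrop--;
            return false; // injected fault: the response never reaches the client
        }
        return true;
    }

    int appliedCount() {
        return appliedCount;
    }
}

/** Hypothetical client that resends until it sees an acknowledgement. */
final class RetryingClient {
    /** Returns the number of attempts used, or -1 if no ack ever arrived. */
    static int sendWithRetry(FlakyServer server, long requestId, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (server.handle(requestId)) {
                return attempt;
            }
        }
        return -1;
    }
}
```

With two dropped responses the client needs three attempts, yet the server applies the request only once, which is the exactly-once property the tests are meant to check.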

This passes all unit tests (including the new ones). I also have some experimental results.

Previously, I was unable to run reliably with more than 200 workers. With this change I can reliably run 500+ workers. I also ran with 600 workers successfully. This is a really big reliability win for us.

I can see the code working to do reconnections and re-issue requests when necessary. It's very cool.

For example:

2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Fixing disconnected channel to xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Fixing disconnected channel to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
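
The body of checkAndFixChannel is not shown in this thread. As a hedged sketch of the behaviour the log lines suggest (notice a channel that is no longer open, then connect again), the following uses a hypothetical SimpleChannel interface and a caller-supplied connector function in place of the Netty types; ChannelFixer and maxAttempts are also assumptions.

```java
import java.net.InetSocketAddress;
import java.util.function.Function;

/** Hypothetical stand-in for a network channel; not the Netty Channel type. */
interface SimpleChannel {
    boolean isOpen();
}

/** Hypothetical sketch of a check-and-reconnect step. */
final class ChannelFixer {
    /**
     * Returns the current channel if it is still open, otherwise asks the
     * connector for a fresh one, giving up after maxAttempts tries.
     */
    static SimpleChannel checkAndFix(
            SimpleChannel current,
            InetSocketAddress remote,
            Function<InetSocketAddress, SimpleChannel> connector,
            int maxAttempts) {
        if (current != null && current.isOpen()) {
            return current; // healthy channel, nothing to fix
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            SimpleChannel fresh = connector.apply(remote);
            if (fresh != null && fresh.isOpen()) {
                return fresh; // reconnected
            }
        }
        throw new IllegalStateException(
            "could not reconnect to " + remote + " after " + maxAttempts + " attempts");
    }
}
```

Any requests that were outstanding on the old channel would still need to be resent on the new one; that part is not covered by this sketch.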

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] [Updated] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by Eli Reisman <in...@gmail.com>.

I've run this many times this weekend in its original form "version 1". Ran
1000+ workers on it here with no problem. Barring any check style issues I
skipped, ;) this thing is solid. In general Netty is less happy than before
with large numbers of connections to maintain as we scale out, but I
suspect that is transitional, and I didn't get very far into tweaking the
perfect configuration for its current incarnation yet either.


On Mon, Aug 20, 2012 at 11:49 AM, Avery Ching (JIRA) <ji...@apache.org> wrote:

>
>      [
> https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Avery Ching updated GIRAPH-306:
> -------------------------------
>
>     Attachment: GIRAPH-306.2.patch
>
> - Added small change to not connect to one's self
> - Also changed the backlog to be the number of workers by default
>
> Also updated https://reviews.apache.org/r/6687/
>
> > Netty requests should be reliable and implement exactly once semantics
> > ----------------------------------------------------------------------
> >
> >                 Key: GIRAPH-306
> >                 URL: https://issues.apache.org/jira/browse/GIRAPH-306
> >             Project: Giraph
> >          Issue Type: Improvement
> >            Reporter: Avery Ching
> >            Assignee: Avery Ching
> >            Priority: Critical
> >         Attachments: GIRAPH-306.2.patch, GIRAPH-306.patch
> >

Re: [jira] [Commented] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by Eli Reisman <in...@gmail.com>.

Damn, I wish I'd read this last night! Thanks for the tip. I will try
that. As I (optimistically!) tested at the low memory levels, I am finding
that the logs just don't make it to me at all. As I ramp it up a bit, I finally
start to get them again. This is why I didn't know if the 246-NEW-FIX-2
patch was working or not on Friday. I see now it's Netty connection errors
(timed out, host to connect to == null, etc.) and simple GC OutOfMemory
exceptions from Netty pipeline handlers most of the time.

Still, Giraph is more resilient to working with existing MR jobs on the
cluster coming and going without causing us to fail, etc. This is real
progress. Netty problems will settle out as we find ways to configure the
new improvements to work for us here, I'm sure. In general Giraph is running
great now. Keep it up!

I love the bit set idea too. I have heard the standard Java implementation
is not so hot. Is there an alternate library (or maybe we can build one
directly into the class) that would be lower profile? Anyway, all of this
stuff seems like required pieces for Netty to be reliable. Great work.


On Sun, Aug 19, 2012 at 2:48 PM, Eli Reisman (JIRA) <ji...@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437585#comment-13437585]
>
> Eli Reisman commented on GIRAPH-306:
> ------------------------------------
>
> Yeah, that was the impression I had too. Just to clarify, as of the recent
> Netty upgrades + this one, we are in no way attempting to handle worker
> restarts with any grace, right? This is all purely connection reliability
> for healthy worker nodes?
>
> I am having a lot more trouble scaling out to more workers than I used to.
> I know you guys had mentioned this, but I have not been testing again until
> the last few days and it's definitely gotten trickier, not least
> because I'm having trouble getting logs to see what happened during a
> failure. I don't have dumps saved from those jobs, but if I see more I will
> put them here.
>
> Mostly the logs I get are reconnection logs after reincarnation in which
> they all fail (of course) and no logs for the failed portion of the run
> that triggered the worker to reincarnate.
>
>

[jira] [Commented] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439068#comment-13439068 ] 

Jan van der Lugt commented on GIRAPH-306:
-----------------------------------------

Ok, let me test it as well, then :-) Will let you know in a few hours!
                

[jira] [Commented] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438989#comment-13438989 ] 

Avery Ching commented on GIRAPH-306:
------------------------------------

Here are some tests with the RandomMessageBenchmark.  Note that variance is high.  Let's ignore the setup and input superstep times as they vary based on the connection attempts.  The superstep times are pretty close.  I'm running this in a shared cluster.  I don't think the overhead is significant and reliability is important.

My arguments:

hadoop jar giraph-0.2-SNAPSHOT-for-hadoop-0.20.1-jar-with-dependencies.jar  org.apache.giraph.benchmark.RandomMessageBenchmark -Dmapred.child.java.opts="-Xms1500m -Xmx1500m -Xss160k" -Dgiraph.useNetty=true -Dmapred.map.max.attempts=1 -Dmapred.fairscheduler.pool=di.graphonly  -Dmapreduce.job.user.classpath.first=true -s 5 -e 1 -v -w 100 -n 40 -b 1024 -V 1000000

Before GIRAPH-306:

12/08/21 12:37:02 INFO mapred.JobClient:   Giraph Timers
12/08/21 12:37:02 INFO mapred.JobClient:     Total (milliseconds)=209108
12/08/21 12:37:02 INFO mapred.JobClient:     Superstep 3 (milliseconds)=16142
12/08/21 12:37:02 INFO mapred.JobClient:     Setup (milliseconds)=88873
12/08/21 12:37:02 INFO mapred.JobClient:     Vertex input superstep (milliseconds)=37255
12/08/21 12:37:02 INFO mapred.JobClient:     Shutdown (milliseconds)=1251
12/08/21 12:37:02 INFO mapred.JobClient:     Superstep 0 (milliseconds)=15088
12/08/21 12:37:02 INFO mapred.JobClient:     Superstep 4 (milliseconds)=16251
12/08/21 12:37:02 INFO mapred.JobClient:     Superstep 5 (milliseconds)=1529
12/08/21 12:37:02 INFO mapred.JobClient:     Superstep 2 (milliseconds)=16043
12/08/21 12:37:02 INFO mapred.JobClient:     Superstep 1 (milliseconds)=16671

12/08/21 12:27:05 INFO mapred.JobClient:   Giraph Timers
12/08/21 12:27:05 INFO mapred.JobClient:     Total (milliseconds)=269081
12/08/21 12:27:05 INFO mapred.JobClient:     Superstep 3 (milliseconds)=14929
12/08/21 12:27:05 INFO mapred.JobClient:     Setup (milliseconds)=46770
12/08/21 12:27:05 INFO mapred.JobClient:     Vertex input superstep (milliseconds)=123033
12/08/21 12:27:05 INFO mapred.JobClient:     Shutdown (milliseconds)=1359
12/08/21 12:27:05 INFO mapred.JobClient:     Superstep 0 (milliseconds)=17098
12/08/21 12:27:05 INFO mapred.JobClient:     Superstep 4 (milliseconds)=16759
12/08/21 12:27:05 INFO mapred.JobClient:     Superstep 5 (milliseconds)=11882
12/08/21 12:27:05 INFO mapred.JobClient:     Superstep 2 (milliseconds)=18835
12/08/21 12:27:05 INFO mapred.JobClient:     Superstep 1 (milliseconds)=18409

12/08/21 12:41:31 INFO mapred.JobClient:   Giraph Timers
12/08/21 12:41:31 INFO mapred.JobClient:     Total (milliseconds)=191158
12/08/21 12:41:31 INFO mapred.JobClient:     Superstep 3 (milliseconds)=19005
12/08/21 12:41:31 INFO mapred.JobClient:     Setup (milliseconds)=49267
12/08/21 12:41:31 INFO mapred.JobClient:     Vertex input superstep (milliseconds)=39635
12/08/21 12:41:31 INFO mapred.JobClient:     Shutdown (milliseconds)=2483
12/08/21 12:41:31 INFO mapred.JobClient:     Superstep 0 (milliseconds)=20668
12/08/21 12:41:31 INFO mapred.JobClient:     Superstep 4 (milliseconds)=17100
12/08/21 12:41:31 INFO mapred.JobClient:     Superstep 5 (milliseconds)=7636
12/08/21 12:41:31 INFO mapred.JobClient:     Superstep 2 (milliseconds)=17253
12/08/21 12:41:31 INFO mapred.JobClient:     Superstep 1 (milliseconds)=18106

After GIRAPH-306:

12/08/21 12:46:35 INFO mapred.JobClient:   Giraph Timers
12/08/21 12:46:35 INFO mapred.JobClient:     Total (milliseconds)=233213
12/08/21 12:46:35 INFO mapred.JobClient:     Superstep 3 (milliseconds)=13991
12/08/21 12:46:35 INFO mapred.JobClient:     Setup (milliseconds)=81516
12/08/21 12:46:35 INFO mapred.JobClient:     Vertex input superstep (milliseconds)=68620
12/08/21 12:46:35 INFO mapred.JobClient:     Shutdown (milliseconds)=794
12/08/21 12:46:35 INFO mapred.JobClient:     Superstep 0 (milliseconds)=16599
12/08/21 12:46:35 INFO mapred.JobClient:     Superstep 4 (milliseconds)=15746
12/08/21 12:46:35 INFO mapred.JobClient:     Superstep 5 (milliseconds)=1543
12/08/21 12:46:35 INFO mapred.JobClient:     Superstep 2 (milliseconds)=16284
12/08/21 12:46:35 INFO mapred.JobClient:     Superstep 1 (milliseconds)=18110

12/08/21 12:53:02 INFO mapred.JobClient:   Giraph Timers
12/08/21 12:53:02 INFO mapred.JobClient:     Total (milliseconds)=285832
12/08/21 12:53:02 INFO mapred.JobClient:     Superstep 3 (milliseconds)=15915
12/08/21 12:53:02 INFO mapred.JobClient:     Setup (milliseconds)=48762
12/08/21 12:53:02 INFO mapred.JobClient:     Vertex input superstep (milliseconds)=152074
12/08/21 12:53:02 INFO mapred.JobClient:     Shutdown (milliseconds)=2438
12/08/21 12:53:02 INFO mapred.JobClient:     Superstep 0 (milliseconds)=18609
12/08/21 12:53:02 INFO mapred.JobClient:     Superstep 4 (milliseconds)=14075
12/08/21 12:53:02 INFO mapred.JobClient:     Superstep 5 (milliseconds)=2248
12/08/21 12:53:02 INFO mapred.JobClient:     Superstep 2 (milliseconds)=15277
12/08/21 12:53:02 INFO mapred.JobClient:     Superstep 1 (milliseconds)=16422
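
To make the comparison easier to read, the superstep timings above can be averaged. The sketch below copies the Superstep 0 through Superstep 4 values from the five runs and computes the mean per group. Leaving out Superstep 5 is a choice made for this illustration, since it ranges from 1529 ms to 11882 ms across the runs; the class name SuperstepAverages is hypothetical.

```java
import java.util.Arrays;

/** Averages the Superstep 0-4 timings (milliseconds) quoted in this message. */
final class SuperstepAverages {
    /** Rows are runs, columns are supersteps 0..4, before GIRAPH-306. */
    static final long[][] BEFORE = {
        {15088, 16671, 16043, 16142, 16251}, // run finished 12:37:02
        {17098, 18409, 18835, 14929, 16759}, // run finished 12:27:05
        {20668, 18106, 17253, 19005, 17100}, // run finished 12:41:31
    };

    /** Rows are runs, columns are supersteps 0..4, after GIRAPH-306. */
    static final long[][] AFTER = {
        {16599, 18110, 16284, 13991, 15746}, // run finished 12:46:35
        {18609, 16422, 15277, 15915, 14075}, // run finished 12:53:02
    };

    /** Mean over every superstep of every run in the group. */
    static double mean(long[][] runs) {
        return Arrays.stream(runs)
            .flatMapToLong(run -> Arrays.stream(run))
            .average()
            .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        System.out.printf("before: %.1f ms%n", mean(BEFORE));
        System.out.printf("after:  %.1f ms%n", mean(AFTER));
    }
}
```

On these five runs the means come out at about 17.2 s before and 16.1 s after. With so few runs on a shared cluster, the difference is within the run-to-run spread, which fits the reading that the overhead is not significant.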

                

[jira] [Commented] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437462#comment-13437462 ] 

Eli Reisman commented on GIRAPH-306:
------------------------------------

FYI: Trunk + this patch makes it most of the way through restart when a worker dies, but tries to reconnect with itself as well as all the other workers, and cannot reconnect with itself even when all the other connections seem to succeed. Not sure what would happen next with regard to the InputSplit the reincarnated worker was reading at death either, but we didn't get that far. Seems like a minor detail; otherwise this is doing everything you said it would. Will keep testing, nice work!

                

[jira] [Commented] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Maja Kabiljo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437763#comment-13437763 ] 

Maja Kabiljo commented on GIRAPH-306:
-------------------------------------

As for connecting to ourselves: even though we don't send any requests, we still make a connection (we are not skipping our own address in NettyWorkerClient.fixPartitionIdToSocketAddrMap(); we should add that). If a worker died, maybe it opened its server on a different port and tried to reconnect to the old one.
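
A rough sketch of the suggested skip, assuming the client holds a map from partition id to the owning worker's address. The class ConnectTargets and its method are hypothetical and do not reflect the actual contents of NettyWorkerClient.

```java
import java.net.InetSocketAddress;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper that decides which workers to open connections to. */
final class ConnectTargets {
    /**
     * Collects the distinct addresses of partition owners, leaving out this
     * worker's own address so it never opens a connection to itself.
     */
    static Set<InetSocketAddress> addressesToConnect(
            Map<Integer, InetSocketAddress> partitionOwners,
            InetSocketAddress self) {
        Set<InetSocketAddress> targets = new LinkedHashSet<>();
        for (InetSocketAddress owner : partitionOwners.values()) {
            if (!owner.equals(self)) {
                targets.add(owner);
            }
        }
        return targets;
    }
}
```

Note that this only helps if the worker's own entry in the map matches its current address; if a restarted worker reopened its server on a different port, a stale entry would not compare equal to the new address.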
                
> Netty requests should be reliable and implement exactly once semantics
> ----------------------------------------------------------------------
>
>                 Key: GIRAPH-306
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-306
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>            Priority: Critical
>         Attachments: GIRAPH-306.patch
>
>
> One of the biggest scalability challenges is getting Giraph to run reliably on a large number of tasks (i.e. > 200).  Several problems exist:
> 1) If the connection fails after the initial connection was made, the job will die.
> 2) Requests must be completed exactly once.  This is difficult to implement, but required since we cannot have multiple retried requests succeed (i.e. a vertex gets more messages than expected).
> 3) Sometimes there are unresolved addresses, causing failure.
> This patch addresses these issues by re-establishing failed connections and keep tracking of every request sent to every worker.  If the request fails or passes a timeout, it will be resent.  The server will keep track of requests that succeeded to insure that the same request won't be processed more than once.  The structure for keeping track of the succeeded requests on the server is efficient for handling increasing request ids (IncreasingBitSet).  For handling unresolved addresses, I added retry logic to keep trying to resolve the problem.
> This patch also adds several unit tests that use fault injection to simulate a lost response or a closed channel exception on the server.  It also has unittests for IncreasingBitSet to insure it is working correctly and efficiently.
> This passes all unittests (including the new ones).  Additionally, I have some experience results as well.
> Previously, I was unable to run reliably with more than 200 workers.  With this change I can reliably run 500+ workers.  I also ran with 600 workers successfully.  This is a really big reliability win for us.
> I can see the code working to do reconnections and re-issue requests when necessary.  It's very cool.
> I.e.
> 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Fixing disconnected channel to xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
> 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
> 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Fixing disconnected channel to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
> 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
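
As a rough illustration of the duplicate suppression described above, a server could remember which request ids it has already completed for a given client and drop any repeats. This is only a sketch, not the IncreasingBitSet from the patch (its code does not appear in this thread). The class name RequestIdWindow and the method markIfNew are hypothetical, and the sketch assumes that request ids for one client start at 0 and mostly increase, so the fully completed prefix can be collapsed into a single counter.

```java
import java.util.BitSet;

/**
 * Hypothetical sketch: tracks completed request ids for one client.
 * Ids below "base" are all known to be completed; ids at or above it
 * are tracked individually in a bit set relative to "base".
 */
final class RequestIdWindow {
    /** Every id strictly below this value has already been completed. */
    private long base = 0;
    /** Bit i is set when id (base + i) has been completed. */
    private BitSet seen = new BitSet();

    /**
     * Records the id and reports whether it is new.
     *
     * @return true if the request should be processed,
     *         false if it is a retry of a completed request
     */
    synchronized boolean markIfNew(long requestId) {
        if (requestId < base) {
            return false; // already folded into the completed prefix
        }
        // Assumption: the gap between base and the newest id fits in an int.
        int offset = (int) (requestId - base);
        if (seen.get(offset)) {
            return false; // duplicate inside the tracked window
        }
        seen.set(offset);
        // Fold the contiguous completed prefix into "base" so the bit set
        // stays small while ids keep increasing.
        int firstClear = seen.nextClearBit(0);
        if (firstClear > 0) {
            seen = seen.get(firstClear, seen.length());
            base += firstClear;
        }
        return true;
    }

    synchronized long getBase() {
        return base;
    }
}
```

With this shape, a resent request whose original already succeeded gets false from markIfNew, so the server can acknowledge it without applying it a second time. Memory use depends on how far ahead of the oldest outstanding id the newest one is, not on the total number of requests.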

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437451#comment-13437451 ] 

Eli Reisman commented on GIRAPH-306:
------------------------------------

Nice, I will try this out right away. After Friday I have been able to get in some better-instrumented runs to test a bunch of patches, and even on trunk I am running into these errors all the time. I was not able to see logs for a while on Friday and could not determine what was happening, but it's always either memory issues or Netty connection errors. If this solves it I will be a very happy guy; Giraph is performing very well up to a scale limit now and then hitting this wall. Will report back the results...

Thanks again!
                

[jira] [Commented] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438857#comment-13438857 ] 

Eli Reisman commented on GIRAPH-306:
------------------------------------

As far as reconnecting on live workers and generally making Netty behave in a more resilient way, this worked great in testing. When A/B testing runs where memory is very tight, this does seem to make life a bit trickier, but this is an important upgrade (and works!) and there are still other large opportunities to save memory, so I'm +1 on this.

                

[jira] [Commented] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439060#comment-13439060 ] 

Jan van der Lugt commented on GIRAPH-306:
-----------------------------------------

I don't know if this is related, but I'm having problems with running Giraph/Netty on many workers. As long as I run jobs on 20-80 workers (spread over 20 machines), it's usually fine, but once I start running 100+ workers (I try to get up to 320), I get the error below. If I use Hadoop RPC, it all runs fine. It just takes 2+ minutes for the vertex input superstep (7 seconds using Netty). If there are no errors, Netty is about 20-30% faster. Don't know if it matters, but the machines are pretty beefy (2x 8 cores with 256 GB of RAM), so I run a maximum of 16 map tasks per machine. Any ideas? If it's not related, I'll just post this on the mailing list instead.

java.lang.IllegalStateException: connectAllAddresses: Unresolved address bunch10.ib.bunch:30107
	at org.apache.giraph.comm.NettyClient.connectAllAddresses(NettyClient.java:215)
	at org.apache.giraph.comm.NettyWorkerClient.fixPartitionIdToSocketAddrMap(NettyWorkerClient.java:140)
	at org.apache.giraph.comm.NettyWorkerClientServer.setup(NettyWorkerClientServer.java:120)
	at org.apache.giraph.graph.BspServiceWorker.setup(BspServiceWorker.java:560)
	at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:368)
	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:570)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Unknown Source)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
	at org.apache.hadoop.mapred.Child.main(Child.java:2
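
For reference, retrying an unresolved address (the kind of retry logic the issue description mentions) could look roughly like the sketch below. It is not the code in the patch. The names AddressResolver, resolveWithRetry, maxAttempts and waitMillis are made up for illustration. It relies only on the JDK: constructing an InetSocketAddress from a host name attempts the lookup, and isUnresolved() reports whether that lookup failed.

```java
import java.net.InetSocketAddress;

/** Hypothetical helper: retries host name resolution a bounded number of times. */
final class AddressResolver {
    private AddressResolver() { }

    /**
     * Tries to resolve host:port, waiting between attempts.
     *
     * @throws IllegalStateException if the address is still unresolved
     *         after maxAttempts lookups, or if the wait is interrupted
     */
    static InetSocketAddress resolveWithRetry(
            String host, int port, int maxAttempts, long waitMillis) {
        // The constructor performs the first lookup.
        InetSocketAddress address = new InetSocketAddress(host, port);
        int attempt = 1;
        while (address.isUnresolved() && attempt < maxAttempts) {
            try {
                Thread.sleep(waitMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(
                    "Interrupted while resolving " + host + ":" + port, e);
            }
            // A fresh instance triggers a fresh lookup.
            address = new InetSocketAddress(host, port);
            attempt++;
        }
        if (address.isUnresolved()) {
            throw new IllegalStateException("Unresolved address " + host +
                ":" + port + " after " + attempt + " attempts");
        }
        return address;
    }
}
```

One caveat: the JVM may cache failed lookups for a short time (the networkaddress.cache.negative.ttl security property), so a very short wait between attempts may simply return the cached failure.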
                

[jira] [Commented] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439078#comment-13439078 ] 

Jan van der Lugt commented on GIRAPH-306:
-----------------------------------------

Works like a charm! It actually slows down when going from 100 to 320 workers (on the LiveJournal graph), but I guess that's because of a lot of unnecessary extra synchronization overhead, since the graph is reasonably small. Will try it later today on the Twitter graph. Ship it! ;-)
                

[jira] [Updated] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-306:
-------------------------------

    Attachment: GIRAPH-306.2.patch

- Added small change to not connect to one's self
- Also changed the backlog to be the number of workers by default

Also updated https://reviews.apache.org/r/6687/
                

[jira] [Commented] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437566#comment-13437566 ] 

Avery Ching commented on GIRAPH-306:
------------------------------------

Thanks for taking a look, Eli.  It's kind of strange that it tries to reconnect with itself and fails... do you have a log you could post?

I think in general, we shouldn't be connecting to ourselves (as local requests are supposed to be bypassed), although the code for it existed prior to this patch.

                

[jira] [Updated] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-306:
-------------------------------

    Attachment: GIRAPH-306.patch

Added a patch identical to https://reviews.apache.org/r/6687/

                

[jira] [Commented] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437607#comment-13437607 ] 

Avery Ching commented on GIRAPH-306:
------------------------------------

> Yeah that was the impression I had too. Just to clarify, as of the recent Netty upgrades + this one, we are in no way attempting to handle worker restarts with any grace right? This is all purely connection reliability for healthy worker nodes?

Yeah, this is purely for reliability of connections and requests, nothing else.

> I am having a lot more trouble scaling out to more workers than I used to. I know you guys had mentioned this, but I have not been testing again until the last few days and its definitely gotten trickier, not the least of which because I'm having trouble getting logs to see what happened during a fail. I don't have dumps I saved from those jobs, but if I see more I will put them here.

Here's a trick you can try.  Add -Dmapred.map.max.attempts=1 to ensure that any failure will fail the job.  Then you can see the logs for the failed task and try to figure out what the problem is.

> Mostly the logs I get are reconnection logs after reincarnation in which they all fail (of course) and no logs for the failed portion of the run that triggered the worker to reincarnate.

The above should help us narrow down your problem.  =)
                
> Netty requests should be reliable and implement exactly once semantics
> ----------------------------------------------------------------------
>
>                 Key: GIRAPH-306
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-306
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Priority: Critical
>         Attachments: GIRAPH-306.patch
>
>
> One of the biggest scalability challenges is getting Giraph to run reliably on a large number of tasks (i.e. > 200).  Several problems exist:
> 1) If the connection fails after the initial connection was made, the job will die.
> 2) Requests must be completed exactly once.  This is difficult to implement, but required since we cannot have multiple retried requests succeed (i.e. a vertex gets more messages than expected).
> 3) Sometimes there are unresolved addresses, causing failure.
> This patch addresses these issues by re-establishing failed connections and keeping track of every request sent to every worker.  If a request fails or passes a timeout, it will be resent.  The server will keep track of requests that succeeded to ensure that the same request won't be processed more than once.  The structure for keeping track of the succeeded requests on the server is efficient for handling increasing request ids (IncreasingBitSet).  For handling unresolved addresses, I added retry logic to keep trying to resolve them.
> This patch also adds several unit tests that use fault injection to simulate a lost response or a closed channel exception on the server.  It also has unit tests for IncreasingBitSet to ensure it is working correctly and efficiently.
> This passes all unit tests (including the new ones).  I have some experimental results as well.
> Previously, I was unable to run reliably with more than 200 workers.  With this change I can reliably run 500+ workers.  I also ran with 600 workers successfully.  This is a really big reliability win for us.
> I can see the code working to do reconnections and re-issue requests when necessary.  It's very cool.
> I.e.
> 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Fixing disconnected channel to xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
> 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
> 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Fixing disconnected channel to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
> 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient: checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
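
The server-side duplicate check described in the quoted text can be illustrated roughly as follows. This is only a sketch under assumptions, not the IncreasingBitSet from the patch: the class name SeenRequestIds and the method markIfNew are hypothetical, and it assumes each client hands out request ids that start at 0 and mostly arrive in increasing order, so that the set of processed ids can be kept as a low-water mark plus a small bit window.

```java
import java.util.BitSet;

/**
 * Hypothetical sketch of per-client duplicate detection for request ids
 * that mostly increase. Not the actual Giraph IncreasingBitSet.
 */
class SeenRequestIds {
    /** Every id strictly below this value has already been processed. */
    private long lowestUnseen = 0;
    /** Bit i set means id (lowestUnseen + i) has already been processed. */
    private BitSet window = new BitSet();

    /**
     * Records the id and reports whether it is new.
     *
     * @return true if the request should be processed,
     *         false if it is a retry of an already processed request
     */
    synchronized boolean markIfNew(long requestId) {
        if (requestId < lowestUnseen) {
            return false; // already covered by the low-water mark
        }
        long distance = requestId - lowestUnseen;
        if (distance > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("request id too far ahead: " + requestId);
        }
        int offset = (int) distance;
        if (window.get(offset)) {
            return false; // seen before, inside the window
        }
        window.set(offset);
        // Slide the window past the contiguous run of processed ids so the
        // bit set stays small while ids keep increasing.
        int run = window.nextClearBit(0);
        if (run > 0) {
            window = window.get(run, window.length());
            lowestUnseen += run;
        }
        return true;
    }

    /** Smallest id that has not been processed yet. */
    synchronized long lowestUnseen() {
        return lowestUnseen;
    }
}
```

With a structure like this, a server would process a request only when markIfNew returns true and would simply acknowledge it again otherwise, so a resent request after a lost response does not deliver its messages twice.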

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching reassigned GIRAPH-306:
----------------------------------

    Assignee: Avery Ching
    

[jira] [Comment Edited] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439106#comment-13439106 ] 

Avery Ching edited comment on GIRAPH-306 at 8/22/12 9:23 AM:
-------------------------------------------------------------

Thanks for the review Alessandro and everyone else for the comments.  Committed.
                
      was (Author: aching):
    Thanks for the review Alessandro and everyone else for the comments.  Committing...
                  

[jira] [Commented] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437585#comment-13437585 ] 

Eli Reisman commented on GIRAPH-306:
------------------------------------

Yeah, that was the impression I had too. Just to clarify: as of the recent Netty upgrades + this one, we are in no way attempting to handle worker restarts with any grace, right? This is all purely connection reliability for healthy worker nodes?

I am having a lot more trouble scaling out to more workers than I used to. I know you guys had mentioned this, but I had not been testing again until the last few days and it's definitely gotten trickier, not least because I'm having trouble getting logs to see what happened during a failure. I don't have dumps saved from those jobs, but if I see more I will put them here.

Mostly the logs I get are reconnection logs after reincarnation in which they all fail (of course) and no logs for the failed portion of the run that triggered the worker to reincarnate.

                

[jira] [Commented] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439104#comment-13439104 ] 

Hudson commented on GIRAPH-306:
-------------------------------

Integrated in Giraph-trunk-Commit #183 (See [https://builds.apache.org/job/Giraph-trunk-Commit/183/])
    GIRAPH-306: Netty requests should be reliable and implement exactly
once semantics. (aching) (Revision 1375824)

     Result = SUCCESS
aching : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1375824
Files : 
* /giraph/trunk/CHANGELOG
* /giraph/trunk/src/main/java/org/apache/giraph/comm/AddressRequestIdGenerator.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/ChannelRotater.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/ClientRequestId.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/IncreasingBitSet.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/RequestDecoder.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerRequestReservedMap.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java
* /giraph/trunk/src/test/java/org/apache/giraph/comm/IncreasingBitSetTest.java
* /giraph/trunk/src/test/java/org/apache/giraph/comm/RequestFailureTest.java
* /giraph/trunk/src/test/java/org/apache/giraph/comm/RequestTest.java

                

[jira] [Commented] (GIRAPH-306) Netty requests should be reliable and implement exactly once semantics

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439066#comment-13439066 ] 

Avery Ching commented on GIRAPH-306:
------------------------------------

This patch should address that issue, Jan.  It tries multiple times to resolve the hostname (configurable with giraph.maxResolveAddressAttempt).
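
As a rough illustration of retrying hostname resolution, here is a minimal sketch. It is not the code from the patch: the class ResolveRetry, its method names, and the waitMillis parameter are hypothetical, and maxAttempts only stands in for whatever value the configuration option above supplies. It relies on the standard behavior that new InetSocketAddress(host, port) attempts a lookup and marks the address as unresolved when the lookup fails.

```java
import java.net.InetSocketAddress;
import java.util.function.Supplier;

/** Hypothetical sketch of bounded retries for hostname resolution. */
final class ResolveRetry {
    private ResolveRetry() { }

    /**
     * Calls the lookup until it yields a resolved address or the attempts
     * run out. The lookup is passed in so it can be replaced in tests.
     */
    static InetSocketAddress resolve(Supplier<InetSocketAddress> lookup,
                                     int maxAttempts, long waitMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            InetSocketAddress address = lookup.get();
            if (!address.isUnresolved()) {
                return address;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(waitMillis); // pause before the next lookup
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while resolving", e);
                }
            }
        }
        throw new IllegalStateException(
            "Address still unresolved after " + maxAttempts + " attempts");
    }

    /** Convenience overload: a fresh InetSocketAddress triggers a new lookup. */
    static InetSocketAddress resolve(String host, int port,
                                     int maxAttempts, long waitMillis) {
        return resolve(() -> new InetSocketAddress(host, port), maxAttempts, waitMillis);
    }
}
```

A caller would use something like ResolveRetry.resolve(hostname, port, maxAttempts, 1000L) before opening the channel, and treat the exception as a fatal failure for that worker.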
                