You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@storm.apache.org by danielschonfeld <gi...@git.apache.org> on 2015/10/15 23:17:39 UTC

[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...

GitHub user danielschonfeld opened a pull request:

    https://github.com/apache/storm/pull/802

    [STORM-1115] Stale leader-lock key effectively bans all nodes from becoming leaders

    From the issue:
    
    ```
    I believe this curator bug is what's in play causing the above described situation.
    https://issues.apache.org/jira/browse/CURATOR-202
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/schonfeld/storm update-curator

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/802.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #802
    
----
commit 580de48c91ab32a1777f0e375d065d3a01b8e3e9
Author: Michael Schonfeld <mi...@schonfeld.org>
Date:   2015-10-06T17:55:17Z

    more logging to nimbusinfo

commit ea2c3a81855dded8c48d946a7017c5ee1ded5e5b
Author: Michael Schonfeld <mi...@schonfeld.org>
Date:   2015-10-09T19:21:52Z

    debug logging

commit 25ad78e3c6c47e3da3c6cab86d155274ce5ad561
Author: Michael Schonfeld <mi...@schonfeld.org>
Date:   2015-10-09T19:57:45Z

    wrong encasing

commit b7db4434f10b78c6259b82f6b6edaac676900bef
Author: Michael Schonfeld <mi...@schonfeld.org>
Date:   2015-10-09T20:36:53Z

    wrap in do()

commit 9ca4d5d4821959b7b9df002fbdc589f6232c37da
Author: Michael Schonfeld <mi...@schonfeld.org>
Date:   2015-10-09T20:39:49Z

    missing )

commit 24c4ce0056d40f82f17c15603e5ab83a3b9596ab
Author: Michael Schonfeld <mi...@schonfeld.org>
Date:   2015-10-09T21:15:21Z

    more debug

commit 8253f83ece8a63c076a9a5e833e2871728419c83
Author: Michael Schonfeld <mi...@schonfeld.org>
Date:   2015-10-09T21:33:23Z

    no need for ()

commit 278206baf7567f270b5bcca41dbeec75c765aaf4
Author: Michael Schonfeld <mi...@schonfeld.org>
Date:   2015-10-09T21:39:05Z

    blha

commit 9f3e8d55b9083d59423ac9153f0aff8c912242bb
Author: Michael Schonfeld <mi...@schonfeld.org>
Date:   2015-10-09T23:11:28Z

    dont use local var

commit faf9bda82e1c4e1875c87a4920fb2c1c2a33ba56
Author: Daniel Schonfeld <da...@schonfeld.org>
Date:   2015-10-15T21:11:45Z

    update curator to version 2.9.0 to combat stale leader issue as described in https://issues.apache.org/jira/browse/CURATOR-202

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on the pull request:

    https://github.com/apache/storm/pull/802#issuecomment-148763402
  
    @revans2 The log concerns were from the origin PR that @danielschonfeld which he has fixed but I guess he force pushed the branch. I am +1 on this change too. 
    
    @danielschonfeld On a side note, can you provide any steps to reproduce this locally?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/802#issuecomment-148734244
  
    @danielschonfeld It looks fine to me.  Personally am +1 on it, but I want to hear back from @Parth-Brahmbhatt.  He has a concern about unnecessary log statements.  I assume that these are coming directly out of curator itself, so I am not really sure how he wants to handle this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/storm/pull/802


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/802#issuecomment-148728541
  
    @danielschonfeld the test failure is because we have too many tests that don't use ephemeral ports.  In all likelihood the JDK8 test was running on the same box at the same time and got the port before this test could.  JDK8 tends to run through tests faster then JDK7 does.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...

Posted by danielschonfeld <gi...@git.apache.org>.
Github user danielschonfeld commented on the pull request:

    https://github.com/apache/storm/pull/802#issuecomment-148731922
  
    @revans2 what can I do to make this PR approval ready?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...

Posted by danielschonfeld <gi...@git.apache.org>.
Github user danielschonfeld commented on the pull request:

    https://github.com/apache/storm/pull/802#issuecomment-148807939
  
    @Parth-Brahmbhatt that's a tricky one.  I haven't found a way to reproduce but leaving nimbus work for a day or so with number of nimbuses > 1 and a good load on the system we see the number of ZK nodes/keys go up to (X*nimbuses)+1 under /leader-lock.  When that happens, we have problems trying to do anything as no nimbus thinks it's the leader which is exactly what's described in CURATOR-202.
    
    If you can think of a way to disconnect the ZK connection but reconnect using the same session programmatically you'll have a reproduction of this bug as this always starts showing up after something like the following log lines:
    
    ```
    2015-10-16 18:16:13 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: RECONNECTED
    2015-10-16 18:16:14 o.a.s.s.o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 6668ms for sessionid 0x1506caf14ab005f, closing socket connection and attempting reconnect
    2015-10-16 18:16:14 o.a.s.s.o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 6672ms for sessionid 0x1506caf14ab0060, closing socket connection and attempting reconnect
    2015-10-16 18:16:15 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server 10.101.1.2/10.101.1.2:2181. Will not attempt to authenticate using SASL (unknown error)
    2015-10-16 18:16:15 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket connection established to 10.101.1.2/10.101.1.2:2181, initiating session
    2015-10-16 18:16:15 o.a.s.s.o.a.z.ClientCnxn [INFO] Session establishment complete on server 10.101.1.2/10.101.1.2:2181, sessionid = 0x1506caf14ab005f, negotiated timeout = 20000
    2015-10-16 18:16:15 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: RECONNECTED
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on the pull request:

    https://github.com/apache/storm/pull/802#issuecomment-148524703
  
    lot of unnecessary log statements, can you remove them?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...

Posted by danielschonfeld <gi...@git.apache.org>.
Github user danielschonfeld commented on the pull request:

    https://github.com/apache/storm/pull/802#issuecomment-148527966
  
    any idea why the JDK7 test fails? I'm not entirely clear on why DRPC server is having problems starting up


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---