Posted to issues@storm.apache.org by "Rui Li (Jira)" <ji...@apache.org> on 2020/11/09 14:50:00 UTC

[jira] [Updated] (STORM-3713) Possible race condition between zookeeper sync-up and killing topology

     [ https://issues.apache.org/jira/browse/STORM-3713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rui Li updated STORM-3713:
--------------------------
    Description: 
When Nimbus regains leadership, the leaderCallback syncs up with ZooKeeper:

[https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/nimbus/LeaderListenerCallback.java#L106] [https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/cluster/StormClusterStateImpl.java#L212]  

When a topology is killed, both ZooKeeper and the in-memory assignments map are cleaned up.

[https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L313]  

However, the syncRemoteAssignments call first reads the information from ZooKeeper into stormIds. Then, after some processing (including deserialization), it puts that information into the local in-memory assignments backend. If the ZooKeeper deletion happens between these two steps, the remote ZooKeeper state and the local backend will no longer match.
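The interleaving described above can be replayed deterministically in a few lines. This is only a sketch under stated assumptions: the class name, the two maps, and the topology ids are hypothetical stand-ins for the ZooKeeper state and the in-memory assignments backend, and none of it is Storm's actual code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Deterministic replay of the interleaving; every name here is hypothetical. */
public class SyncKillRaceSketch {
    // Stand-ins for the remote ZooKeeper state and the local in-memory backend.
    static final Map<String, String> remote = new HashMap<>();
    static final Map<String, String> local = new HashMap<>();

    /** Replays sync step 1, then the kill, then sync step 2. Returns ids left only in the local map. */
    static List<String> replay() {
        remote.clear();
        local.clear();
        remote.put("topo-1", "assignment-1");
        remote.put("topo-2", "assignment-2");

        // Sync step 1: snapshot the ids and data from the remote store.
        Map<String, String> snapshot = new HashMap<>(remote);

        // The kill runs in between: both stores are cleaned up for topo-1.
        remote.remove("topo-1");
        local.remove("topo-1");

        // Sync step 2: the now-stale snapshot is written to the local backend.
        local.putAll(snapshot);

        // Anything in local but not in remote is a leftover of the race.
        List<String> stale = new ArrayList<>(local.keySet());
        stale.removeAll(remote.keySet());
        return stale;
    }

    public static void main(String[] args) {
        System.out.println("Ids only in local backend: " + replay());
    }
}
```

In this replay the local map ends up holding topo-1 even though the kill removed it from both stores, which is the mismatch reported here.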

We found this issue because we observed an NPE when making assignments.

{code}
2020-11-04 19:56:17.703 o.a.s.d.n.Nimbus timer [ERROR] Error while processing event
java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.storm.daemon.nimbus.Nimbus.lambda$launchServer$17(Nimbus.java:1419) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.StormTimer$1.run(StormTimer.java:110) ~[storm-client-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:226) [storm-client-2.3.0.y.jar:2.3.0.y]
Caused by: java.lang.NullPointerException
    at org.apache.storm.daemon.nimbus.HeartbeatCache.getAliveExecutors(HeartbeatCache.java:199) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.aliveExecutors(Nimbus.java:2029) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.computeTopologyToAliveExecutors(Nimbus.java:2109) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.computeNewSchedulerAssignments(Nimbus.java:2272) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.lockingMkAssignments(Nimbus.java:2467) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.mkAssignments(Nimbus.java:2453) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.mkAssignments(Nimbus.java:2397) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    at org.apache.storm.daemon.nimbus.Nimbus.lambda$launchServer$17(Nimbus.java:1415) ~[storm-server-2.3.0.y.jar:2.3.0.y]
    ... 2 more
2020-11-04 19:56:17.703 o.a.s.u.Utils timer [ERROR] Halting process: Error while processing event
{code}

[https://github.com/apache/storm/blob/fe2f7102e244336e288d26f2dde8089198ee4c33/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2108]  

The existingAssignment comes from the in-memory backend, while topologyToExecutors comes from ZooKeeper, which no longer includes the deleted topology id. [https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2108] [https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2111|https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L2108] [https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/HeartbeatCache.java#L199]

So the NPE happens in HeartbeatCache.getAliveExecutors.
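To make the failure mode concrete, the sketch below iterates ids taken from one view and looks them up in another view that lacks one of them. All names, signatures, and sample data are hypothetical illustrations rather than the real Nimbus or HeartbeatCache code. The null check in the second method is one possible mitigation for this kind of mismatch, not necessarily the fix chosen for this issue.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrates an NPE caused by two views that disagree; every name here is hypothetical. */
public class MismatchNpeSketch {
    /** Local view: still contains the killed topology topo-1. */
    static Map<String, String> sampleAssignments() {
        Map<String, String> m = new HashMap<>();
        m.put("topo-1", "assignment-1");
        m.put("topo-2", "assignment-2");
        return m;
    }

    /** Remote view: topo-1 has already been deleted. */
    static Map<String, Set<String>> sampleExecutors() {
        Map<String, Set<String>> m = new HashMap<>();
        Set<String> executors = new HashSet<>();
        executors.add("exec-1");
        executors.add("exec-2");
        m.put("topo-2", executors);
        return m;
    }

    /** Stand-in for a method that dereferences the executor set. */
    static int countExecutors(Set<String> allExecutors) {
        return allExecutors.size(); // throws NullPointerException when allExecutors is null
    }

    /** Looks up every local id in the remote view without a guard. */
    static boolean unguardedThrowsNpe(Map<String, String> existing,
                                      Map<String, Set<String>> topologyToExecutors) {
        try {
            for (String id : existing.keySet()) {
                countExecutors(topologyToExecutors.get(id));
            }
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    /** Same loop, but skips ids that are missing from the remote view. */
    static int guardedExecutorCount(Map<String, String> existing,
                                    Map<String, Set<String>> topologyToExecutors) {
        int total = 0;
        for (String id : existing.keySet()) {
            Set<String> executors = topologyToExecutors.get(id);
            if (executors == null) {
                continue; // stale local entry: the topology is gone from the remote view
            }
            total += countExecutors(executors);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("unguarded throws NPE: "
            + unguardedThrowsNpe(sampleAssignments(), sampleExecutors()));
        System.out.println("guarded executor count: "
            + guardedExecutorCount(sampleAssignments(), sampleExecutors()));
    }
}
```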


> Possible race condition between zookeeper sync-up and killing topology
> ----------------------------------------------------------------------
>
>                 Key: STORM-3713
>                 URL: https://issues.apache.org/jira/browse/STORM-3713
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: Rui Li
>            Assignee: Rui Li
>            Priority: Minor


