You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@storm.apache.org by "Stig Rohde Døssing (JIRA)" <ji...@apache.org> on 2018/03/20 19:41:00 UTC

[jira] [Resolved] (STORM-2987) PaceMakerStateStorage should deal with InterruptedException correctly

     [ https://issues.apache.org/jira/browse/STORM-2987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stig Rohde Døssing resolved STORM-2987.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

> PaceMakerStateStorage should deal with InterruptedException correctly
> ---------------------------------------------------------------------
>
>                 Key: STORM-2987
>                 URL: https://issues.apache.org/jira/browse/STORM-2987
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Ethan Li
>            Assignee: Ethan Li
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> We found an issue that when nimbus restarted, it can only get leadership after a few seconds (15~20s). 
>  
> {code:java}
> 2018-02-27 08:18:43.420 main o.a.s.z.LeaderElectorImp [INFO] Queued up for leader lock.
> 2018-02-27 08:18:43.481 main o.a.s.d.m.MetricsUtils [INFO] Using statistics reporter plugin:org.apache.storm.daemon.metrics.reporters.JmxPreparableReporter
> 2018-02-27 08:18:43.483 main o.a.s.d.m.r.JmxPreparableReporter [INFO] Preparing...
> 2018-02-27 08:18:43.499 main o.a.s.m.StormMetricsRegistry [INFO] Started statistics report plugin...
> 2018-02-27 08:18:43.543 main o.a.s.m.n.Login [INFO] successfully logged in.
> 2018-02-27 08:18:43.551 main o.a.s.z.ClientZookeeper [INFO] Staring ZK Curator
> 2018-02-27 08:18:43.551 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Starting
> 2018-02-27 08:18:43.552 Refresh-TGT o.a.s.m.n.Login [INFO] TGT refresh thread started.
> 2018-02-27 08:18:43.553 Refresh-TGT o.a.s.m.n.Login [INFO] TGT valid starting at:        Tue Feb 27 08:18:43 UTC 2018
> 2018-02-27 08:18:43.553 Refresh-TGT o.a.s.m.n.Login [INFO] TGT expires:                  Wed Feb 28 08:18:43 UTC 2018
> 2018-02-27 08:18:43.553 Refresh-TGT o.a.s.m.n.Login [INFO] TGT refresh sleeping until: Wed Feb 28 04:35:55 UTC 2018
> 2018-02-27 08:18:43.553 main o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=openqe74blue-gw.blue.ygrid.yahoo.com:2181 sessionTimeout=60000 watcher=org.apache.
> curator.ConnectionState@2e185cd7
> 2018-02-27 08:18:43.559 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Default schema
> 2018-02-27 08:18:43.560 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.c.ZooKeeperSaslClient [INFO] Client will use GSSAPI as SASL mechanism.
> 2018-02-27 08:18:43.561 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Opening socket connection to server openqe74blue-gw.blue.ygrid.yahoo.co
> m/10.215.68.156:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
> 2018-02-27 08:18:43.562 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Socket connection established to openqe74blue-gw.blue.ygrid.yahoo.com/1
> 0.215.68.156:2181, initiating session
> 2018-02-27 08:18:43.565 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Session establishment complete on server openqe74blue-gw.blue.ygrid.yah
> oo.com/10.215.68.156:2181, sessionid = 0x161d5f1ae970099, negotiated timeout = 40000
> 2018-02-27 08:18:43.565 main-EventThread o.a.c.f.s.ConnectionStateManager [INFO] State change: CONNECTED
> 2018-02-27 08:18:43.605 Curator-Framework-0 o.a.c.f.i.CuratorFrameworkImpl [INFO] backgroundOperationsLoop exiting
> 2018-02-27 08:18:43.625 main o.a.z.ZooKeeper [INFO] Session: 0x161d5f1ae970099 closed
> 2018-02-27 08:18:43.625 main-EventThread o.a.z.ClientCnxn [INFO] EventThread shut down
> 2018-02-27 08:18:43.626 main o.a.s.z.ClientZookeeper [INFO] Staring ZK Curator
> 2018-02-27 08:18:43.626 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Starting
> 2018-02-27 08:18:43.635 main o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=openqe74blue-gw.blue.ygrid.yahoo.com:2181/storm_ystormQE_CI sessionTimeout=60000 w
> atcher=org.apache.curator.ConnectionState@46cc127b
> 2018-02-27 08:18:43.654 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.c.ZooKeeperSaslClient [INFO] Client will use GSSAPI as SASL mechanism.
> 2018-02-27 08:18:43.660 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Opening socket connection to server openqe74blue-gw.blue.ygrid.yahoo.co
> m/10.215.68.156:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
> 2018-02-27 08:18:43.663 main o.a.c.f.i.CuratorFrameworkImpl [INFO] Default schema
> 2018-02-27 08:18:43.663 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Socket connection established to openqe74blue-gw.blue.ygrid.yahoo.com/1
> 0.215.68.156:2181, initiating session
> 2018-02-27 08:18:43.666 main-SendThread(openqe74blue-gw.blue.ygrid.yahoo.com:2181) o.a.z.ClientCnxn [INFO] Session establishment complete on server openqe74blue-gw.blue.ygrid.yah
> oo.com/10.215.68.156:2181, sessionid = 0x161d5f1ae97009a, negotiated timeout = 40000
> 2018-02-27 08:18:43.669 main-EventThread o.a.c.f.s.ConnectionStateManager [INFO] State change: CONNECTED
> 2018-02-27 08:18:43.790 main o.a.s.d.n.Nimbus [INFO] Starting nimbus server for storm version '2.0.0.y'
> 2018-02-27 08:18:44.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping assignments
> 2018-02-27 08:18:44.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping cleanup
> 2018-02-27 08:18:44.300 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: blob-5-1518767144-stormcode.ser
> 2018-02-27 08:18:44.301 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: TestZkErrorNodesHaveCorrectAcls-3-1519540302-stormcode.ser
> 2018-02-27 08:18:44.302 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: logviewer-ui-groups-test-1-1518940914-stormcode.ser
> 2018-02-27 08:18:44.303 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: blob-5-1518800831-stormcode.ser
> 2018-02-27 08:18:44.304 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: blob-5-1518767144-stormconf.ser
> 2018-02-27 08:18:44.306 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: logviewer-ui-groups-test-1-1518940914-stormconf.ser
> 2018-02-27 08:18:44.307 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: TestZkErrorNodesHaveCorrectAcls-3-1519540302-stormconf.ser
> 2018-02-27 08:18:44.308 timer o.a.s.b.BlobStoreUtils [ERROR] Could not download the blob with key: blob-5-1518800831-stormconf.ser
> 2018-02-27 08:18:44.367 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping credential renewal.
> 2018-02-27 08:18:54.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping assignments
> 2018-02-27 08:18:54.274 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping cleanup
> 2018-02-27 08:18:59.367 timer o.a.s.d.n.Nimbus [INFO] not a leader, skipping credential renewal.
> 2018-02-27 08:19:02.059 main-EventThread o.a.s.z.Zookeeper [INFO] active-topology-blobs [] local-topology-blobs [] diff-topology-blobs []
> 2018-02-27 08:19:02.059 main-EventThread o.a.s.z.Zookeeper [INFO] active-topology-dependencies [] local-blobs [] diff-topology-dependencies []
> 2018-02-27 08:19:02.059 main-EventThread o.a.s.z.Zookeeper [INFO] Accepting leadership, all active topologies and corresponding dependencies found locally.
> 2018-02-27 08:19:04.754 timer o.a.s.d.n.Nimbus [INFO] Scheduling took 442 ms for 0 topologies
> {code}
>  
> This can be re-produced by the following steps:
> {code:java}
> 1. restart pacemaker;
> 2. before pacemaker is up, restart nimbus{code}
> When we restart nimbus process,  it runs ShutDownHooks and stuck on timer.close(). 
> [https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L4227]
> The timer is not able to close because it's waiting for  doCleanup() to stop. However, the interruptedException is caught and ate in : [https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/pacemaker/PacemakerClient.java#L180-L192]
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)