Posted to dev@storm.apache.org by "Jungtaek Lim (JIRA)" <ji...@apache.org> on 2015/04/22 09:47:58 UTC

[jira] [Commented] (STORM-794) Trident Topology doesn't handle deactivate during graceful shutdown

    [ https://issues.apache.org/jira/browse/STORM-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506599#comment-14506599 ] 

Jungtaek Lim commented on STORM-794:
------------------------------------

Out of curiosity: waiting 1 second for the worker to prepare for shutdown seems too short.

It seems the worker couldn't execute the whole shutdown function before it was killed by the supervisor.
(Please correct me if I'm wrong! :) )
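
For context, the supervisor-side kill path in 0.9.x sends SIGTERM, waits one second, then force-kills the process. A simplified paraphrase of that sequence (a sketch, not the exact supervisor.clj source):

{code}
;; Simplified sketch of the supervisor's shutdown-worker sequence.
(defn shutdown-worker-sketch [pids]
  (doseq [pid pids]
    (.exec (Runtime/getRuntime) (str "kill " pid)))     ;; SIGTERM: worker begins shutdown*
  (when (seq pids)
    (Thread/sleep 1000))                                ;; the 1-second grace period in question
  (doseq [pid pids]
    (.exec (Runtime/getRuntime) (str "kill -9 " pid)))) ;; force kill, even mid-shutdown
{code}

Meanwhile, the worker's shutdown* function has a lot to get through in that one second: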

{code}
        shutdown* (fn []
                    (log-message "Shutting down worker " storm-id " " assignment-id " " port)
                    (doseq [[_ socket] @(:cached-node+port->socket worker)]
                      ;; this will do best effort flushing since the linger period
                      ;; was set on creation
                      (.close socket))
                    (log-message "Shutting down receive thread")
                    (receive-thread-shutdown)
                    (log-message "Shut down receive thread")
                    (log-message "Terminating messaging context")
                    (log-message "Shutting down executors")
                    (doseq [executor @executors] (.shutdown executor))
                    (log-message "Shut down executors")
                                        
                    ;;this is fine because the only time this is shared is when it's a local context,
                    ;;in which case it's a noop
                    (.term ^IContext (:mq-context worker))
                    (log-message "Shutting down transfer thread")
                    (disruptor/halt-with-interrupt! (:transfer-queue worker))

                    (.interrupt transfer-thread)
                    (.join transfer-thread)
                    (log-message "Shut down transfer thread")
                    (cancel-timer (:heartbeat-timer worker))
                    (cancel-timer (:refresh-connections-timer worker))
                    (cancel-timer (:refresh-active-timer worker))
                    (cancel-timer (:executor-heartbeat-timer worker))
                    (cancel-timer (:user-timer worker))
                    
                    (close-resources worker)
                    
                    ;; TODO: here need to invoke the "shutdown" method of WorkerHook
                    
                    (.remove-worker-heartbeat! (:storm-cluster-state worker) storm-id assignment-id port)
                    (log-message "Disconnecting from storm cluster state context")
                    (.disconnect (:storm-cluster-state worker))
                    (.close (:cluster-state worker))
                    (log-message "Shut down worker " storm-id " " assignment-id " " port))
{code}
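
If so, instrumenting shutdown* along these lines (a hypothetical wrapper, not in the Storm source) should show it regularly exceeding the 1-second grace period:

{code}
;; Hypothetical instrumentation: wrap a fn to log its wall-clock time.
(defn timed [label f]
  (fn [& args]
    (let [start (System/nanoTime)
          ret   (apply f args)]
      (println label "took" (long (/ (- (System/nanoTime) start) 1e6)) "ms")
      ret)))

;; usage inside worker.clj: ((timed "shutdown*" shutdown*))
{code}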

> Trident Topology doesn't handle deactivate during graceful shutdown
> -------------------------------------------------------------------
>
>                 Key: STORM-794
>                 URL: https://issues.apache.org/jira/browse/STORM-794
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.3
>            Reporter: Jungtaek Lim
>
> I ran into an issue with a Trident Topology in a production environment.
> Normally, when we kill a topology via the UI, Nimbus changes the topology's status to "killed"; once the spout sees the new status it is deactivated, so bolts can handle the remaining tuples within the wait time.
> AFAIK that's how Storm guarantees graceful shutdown.
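> For reference, that kill path can also be driven via the Thrift client; a minimal sketch using Storm 0.9.x classes ("my-topology" is a placeholder name):
> {code}
> (import '[backtype.storm.utils NimbusClient Utils]
>         '[backtype.storm.generated KillOptions])
> 
> ;; Ask Nimbus to kill the topology with a 120-second wait, as the UI does:
> ;; the status flips to :killed, spouts get deactivate(), and the topology
> ;; has wait_secs to drain in-flight tuples before removal.
> (let [nimbus (.getClient (NimbusClient/getConfiguredClient (Utils/readStormConfig)))
>       opts   (doto (KillOptions.) (.set_wait_secs 120))]
>   (.killTopologyWithOpts nimbus "my-topology" opts))
> {code}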
> But a Trident Topology doesn't seem to handle "deactivate" while we try to shut the topology down gracefully.
> MasterBatchCoordinator never stops starting the next transaction, so the Trident spout never stops emitting and the bolts (functions) keep processing tuples.
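> If I read it right, handling "deactivate" would mean gating new transactions on an active flag, roughly like this hypothetical sketch (names are mine, not the actual MasterBatchCoordinator code):
> {code}
> ;; Hypothetical sketch: stop starting new transactions once deactivated,
> ;; while letting in-flight batches drain during the kill wait time.
> (def active? (atom true))
> (def next-txid (atom 0))
> 
> (defn deactivate! [] (reset! active? false))
> 
> (defn sync-batches! [emit!]
>   ;; Called wherever the coordinator would start the next transaction.
>   (when @active?
>     (emit! (swap! next-txid inc))))
> 
> (sync-batches! #(println "emit batch" %)) ;; emits batch 1
> (deactivate!)
> (sync-batches! #(println "emit batch" %)) ;; no-op after deactivate
> {code}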
> Topology settings (see the Config sketch after this list):
> - 1 worker, 1 acker
> - max spout pending: 1
> - TOPOLOGY_TRIDENT_BATCH_EMIT_INTERVAL_MILLIS: 5
> -- It may look odd, but MasterBatchCoordinator's default value is 1
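> The same settings expressed as a Config sketch (Storm 0.9.x API):
> {code}
> (import '[backtype.storm Config])
> 
> (doto (Config.)
>   (.setNumWorkers 1)
>   (.setNumAckers 1)
>   (.setMaxSpoutPending 1)
>   (.put Config/TOPOLOGY_TRIDENT_BATCH_EMIT_INTERVAL_MILLIS 5))
> {code}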
> * Nimbus log
> {code}
> 2015-04-20 09:59:07.954 INFO  [pool-5-thread-41][nimbus] Delaying event :remove for 120 secs for BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
> ...
> 2015-04-20 09:59:07.955 INFO  [pool-5-thread-41][nimbus] Updated BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015 with status {:type :killed, :kill-time-secs 120}
> ...
> 2015-04-20 10:01:07.956 INFO  [timer][nimbus] Killing topology: BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
> ...
> 2015-04-20 10:01:14.448 INFO  [timer][nimbus] Cleaning up BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
> {code}
> * Supervisor log
> {code}
> 2015-04-20 10:01:07.960 INFO  [Thread-1][supervisor] Removing code for storm id BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015
> 2015-04-20 10:01:07.962 INFO  [Thread-2][supervisor] Shutting down and clearing state for id 9719259e-528c-4336-abf9-592c1bb9a00b. Current supervisor time: 1429491667. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1429491667, :storm-id "BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015", :executors #{[2 2] [3 3] [4 4] [5 5] [6 6] [7 7] [8 8] [9 9] [10 10] [11 11] [12 12] [13 13] [14 14] [-1 -1] [1 1]}, :port 6706}
> 2015-04-20 10:01:07.962 INFO  [Thread-2][supervisor] Shutting down 5bc084a2-b668-4610-86f6-9b93304d40a8:9719259e-528c-4336-abf9-592c1bb9a00b
> 2015-04-20 10:01:08.974 INFO  [Thread-2][supervisor] Shut down 5bc084a2-b668-4610-86f6-9b93304d40a8:9719259e-528c-4336-abf9-592c1bb9a00b
> {code}
> * Worker log
> {code}
> 2015-04-20 10:01:07.985 INFO  [Thread-33][worker] Shutting down worker BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015 5bc084a2-b668-4610-86f6-9b93304d40a8 6706
> 2015-04-20 10:01:07.985 INFO  [Thread-33][worker] Shutting down receive thread
> 2015-04-20 10:01:07.988 WARN  [Thread-33][ExponentialBackoffRetry] maxRetries too large (300). Pinning to 29
> 2015-04-20 10:01:07.988 INFO  [Thread-33][StormBoundedExponentialBackoffRetry] The baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries [300]
> 2015-04-20 10:01:07.988 INFO  [Thread-33][Client] New Netty Client, connect to localhost, 6706, config: , buffer_size: 5242880
> 2015-04-20 10:01:07.991 INFO  [client-schedule-service-1][Client] Reconnect started for Netty-Client-localhost/127.0.0.1:6706... [0]
> 2015-04-20 10:01:07.996 INFO  [Thread-33][loader] Shutting down receiving-thread: [BFDC-topology-DynamicCollect-68c9d7b4-72-1429491015, 6706]
> ...
> 2015-04-20 10:01:08.044 INFO  [Thread-33][Client] Closing Netty Client Netty-Client-localhost/127.0.0.1:6706
> 2015-04-20 10:01:08.044 INFO  [Thread-33][Client] Waiting for pending batchs to be sent with Netty-Client-localhost/127.0.0.1:6706..., timeout: 600000ms, pendings: 1
> {code}
> I found the activating log, but cannot find a corresponding deactivating log.
> {code}
> 2015-04-20 09:50:24.556 INFO  [Thread-30-$mastercoord-bg0][executor] Activating spout $mastercoord-bg0:(1)
> {code}
> Please note that deactivation also doesn't take effect when I just push the "Deactivate" button for the topology via the UI.
> We're changing our topology to a normal spout-bolt topology, but personally I'd like to see this resolved.


