You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@storm.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/08/18 15:22:20 UTC

[jira] [Commented] (STORM-1707) Improve supervisor latency by removing 2-min wait

    [ https://issues.apache.org/jira/browse/STORM-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426632#comment-15426632 ] 

ASF GitHub Bot commented on STORM-1707:
---------------------------------------

Github user abellina commented on a diff in the pull request:

    https://github.com/apache/storm/pull/1370#discussion_r75329245
  
    --- Diff: storm-core/src/jvm/org/apache/storm/daemon/supervisor/SyncProcessEvent.java ---
    @@ -110,34 +117,47 @@ public void run() {
                         keeperWorkerIds.add(entry.getKey());
                         keepPorts.add(stateHeartbeat.getHeartbeat().get_port());
                     }
    +                if (stateHeartbeat.getState() == State.NOT_STARTED) {
    +                    keeperWorkerIds.add(entry.getKey());
    +                    keepPorts.add(supervisorData.getWorkerIdsToPorts().get(entry.getKey()));
    +                }
                 }
                 Map<Integer, LocalAssignment> reassignExecutors = getReassignExecutors(assignedExecutors, keepPorts);
                 Map<Integer, String> newWorkerIds = new HashMap<>();
                 for (Integer port : reassignExecutors.keySet()) {
                     newWorkerIds.put(port, Utils.uuid());
                 }
    +
                 LOG.debug("Assigned executors: {}", assignedExecutors);
                 LOG.debug("Allocated: {}", localWorkerStats);
    +            LOG.debug("Keeper worker ids: {}", keeperWorkerIds);
    +            LOG.debug("Keep ports: {}", keepPorts);
    +            LOG.debug("LaunchTimes: {}", supervisorData.getWorkerIdsToLaunchTimes());
    +            LOG.debug("Ids Ports: {}", supervisorData.getWorkerIdsToPorts());
     
                 for (Map.Entry<String, StateHeartbeat> entry : localWorkerStats.entrySet()) {
                     StateHeartbeat stateHeartbeat = entry.getValue();
    -                if (stateHeartbeat.getState() != State.VALID) {
    +                if ((stateHeartbeat.getState() != State.VALID && stateHeartbeat.getState() != State.NOT_STARTED) ||
    +                    (stateHeartbeat.getState() == State.NOT_STARTED && isWorkerStartTimeoutExpired(entry.getKey()))) { 
                         LOG.info("Shutting down and clearing state for id {}, Current supervisor time: {}, State: {}, Heartbeat: {}", entry.getKey(), now,
    --- End diff --
    
    The clojure equivalent for this logs also when the worker could not be started. So, in addition to the "Shutting down" message, there is a "Worker X failed to start" message in the case where the timeout expired (e.g. state is NOT_STARTED). That's a nice thing to have I think (should probably be a LOG.error).


> Improve supervisor latency by removing 2-min wait
> -------------------------------------------------
>
>                 Key: STORM-1707
>                 URL: https://issues.apache.org/jira/browse/STORM-1707
>             Project: Apache Storm
>          Issue Type: Improvement
>            Reporter: Paul Poulosky
>            Assignee: Paul Poulosky
>
> After launching workers, the supervisor waits up to 2 minutes synchronously for the workers to be "launched".
> We should remove this, and instead keep track of launch time, making the "killer" function smart enough to determine the difference between a worker that's still launching, one that's timed out, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)