Posted to dev@storm.apache.org by "Derek Dagit (JIRA)" <ji...@apache.org> on 2015/09/28 18:33:04 UTC

[jira] [Created] (STORM-1072) Nimbus gives incomplete cluster data to scheduler (hides dead worker slots)

Derek Dagit created STORM-1072:
----------------------------------

             Summary: Nimbus gives incomplete cluster data to scheduler (hides dead worker slots)
                 Key: STORM-1072
                 URL: https://issues.apache.org/jira/browse/STORM-1072
             Project: Apache Storm
          Issue Type: Bug
    Affects Versions: 0.11.0
            Reporter: Derek Dagit


1. Describe observed behavior.

Slots that have been assigned, but whose workers have not yet sent a heartbeat, are treated as "dead" slots and are left out of the cluster summary data that is passed to the scheduler.

[link|https://github.com/apache/storm/blob/8dd9e6e213210009968f39483cb69f271b2e8415/storm-core/src/clj/backtype/storm/daemon/nimbus.clj#L527] to nimbus code

For topologies whose payload is very large, this can produce scheduling results that never quite converge, because some of the slots are missing on each call to schedule().


2. What is the expected behavior?

Nimbus may be too smart here: it seems better to give the full cluster information to the scheduler and let the scheduler make the appropriate decision about how to handle workers that are not yet up.
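
The difference between the observed and the proposed behavior can be sketched with a toy model. This is not the Nimbus code: the slot representation and the function names filtered_view and full_view are hypothetical, chosen only to illustrate the idea of hiding slots versus passing them through with their worker state.

```python
from typing import Dict, Set, Tuple

# Hypothetical slot representation: (supervisor host, port).
Slot = Tuple[str, int]


def filtered_view(assignments: Dict[Slot, str], heartbeating: Set[Slot]) -> Dict[Slot, str]:
    """Observed behavior (toy model): drop assigned slots whose worker has not heartbeated."""
    return {slot: topo for slot, topo in assignments.items() if slot in heartbeating}


def full_view(assignments: Dict[Slot, str], heartbeating: Set[Slot]) -> Dict[Slot, Tuple[str, bool]]:
    """Proposed behavior (toy model): keep every assigned slot and report the worker state."""
    return {slot: (topo, slot in heartbeating) for slot, topo in assignments.items()}


assignments = {("host-a", 6700): "big-topology", ("host-b", 6700): "big-topology"}
heartbeating = {("host-a", 6700)}  # the worker on host-b is still starting up

print(filtered_view(assignments, heartbeating))  # host-b is hidden from the scheduler
print(full_view(assignments, heartbeating))      # host-b is present, flagged as not yet up
```

With the full view, a scheduler could choose for itself whether to wait for the worker on host-b or to reassign it.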

3. Outline the steps to reproduce the problem.

Either launch a topology with a very large jar file that takes minutes to download, or simulate this by adding a sleep to the supervisor code just after the jar is downloaded. This causes a significant delay before the worker is up and heartbeating. On each scheduling run during that delay, such slots will not even be present for the scheduler logic.
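
The effect of the startup delay on what the scheduler sees can be approximated with a small simulation. This is only a sketch under stated assumptions: time is measured in whole scheduling rounds, a worker starts heartbeating a fixed number of rounds after its slot is assigned, and the function name visible_slots_per_round is hypothetical. It does not touch any Storm code.

```python
from typing import Dict, List


def visible_slots_per_round(
    assigned_round: Dict[str, int], startup_rounds: int, rounds: int
) -> List[List[str]]:
    """For each scheduling round, list the assigned slots that are visible
    when slots are hidden until their worker has heartbeated (toy model)."""
    history = []
    for current in range(rounds):
        visible = sorted(
            slot
            for slot, start in assigned_round.items()
            # A worker is assumed to heartbeat startup_rounds after assignment.
            if current - start >= startup_rounds
        )
        history.append(visible)
    return history


# slot-1 is assigned in round 0 and slot-2 in round 1; startup takes 3 rounds.
timeline = visible_slots_per_round({"slot-1": 0, "slot-2": 1}, startup_rounds=3, rounds=5)
for round_no, visible in enumerate(timeline):
    print(round_no, visible)
```

In this model both slots are invisible for the first three rounds even though they are already assigned, which is the window in which a scheduler working from the filtered data could hand out the assignment again.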




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)