Posted to dev@storm.apache.org by "Howard Lee (JIRA)" <ji...@apache.org> on 2016/09/06 08:28:20 UTC

[jira] [Commented] (STORM-2083) Blacklist Scheduler

    [ https://issues.apache.org/jira/browse/STORM-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15466809#comment-15466809 ] 

Howard Lee commented on STORM-2083:
-----------------------------------

We found that Storm scheduling already has a blacklist, which is used by the Isolation scheduler, and we have decided to reuse it. The only thing we will do is add the unstable nodes to the blacklist and leave the actual scheduling to the underlying scheduler (the Default Scheduler for now).
Some configs:
1.	*blacklist.scheduler.tolerance.time.secs*: The length, in seconds, of the window over which the blacklist scheduler tracks bad slots or supervisors. Default: 5 min.
2.	*blacklist.scheduler.tolerance.count*: The number of hits within the tolerance time that triggers blacklisting. Default: 3.
3.	*blacklist.scheduler.resume.time.secs*: The number of seconds after which blacklisted slots or supervisors are resumed. Default: 30 min.
4.	*blacklist.scheduler.reporter*: The class the blacklist scheduler uses to report the blacklist. We do not want Storm to add nodes to the blacklist silently; additions may be reported by email, for example. Default: org.apache.storm.scheduler.blacklist.reporters.LogReporter.
5.	*blacklist.scheduler.strategy*: The class that specifies the eviction strategy used by the blacklist scheduler. Default: org.apache.storm.scheduler.blacklist.strategies.DefaultBlacklistStrategy.
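
To make the settings concrete, here is a minimal Java sketch that collects the keys and defaults listed above into a plain map and converts a duration into a number of scheduling rounds. The class and method names are hypothetical, and the 10-second scheduling interval used in the note below is only an assumed value for monitor.freq, not something stated in this comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: keys and defaults come from the list above;
// the class and method names are hypothetical.
class BlacklistConfigSketch {

    static Map<String, Object> defaults() {
        Map<String, Object> conf = new LinkedHashMap<>();
        conf.put("blacklist.scheduler.tolerance.time.secs", 300);  // 5 min
        conf.put("blacklist.scheduler.tolerance.count", 3);
        conf.put("blacklist.scheduler.resume.time.secs", 1800);    // 30 min
        conf.put("blacklist.scheduler.reporter",
                "org.apache.storm.scheduler.blacklist.reporters.LogReporter");
        conf.put("blacklist.scheduler.strategy",
                "org.apache.storm.scheduler.blacklist.strategies.DefaultBlacklistStrategy");
        return conf;
    }

    // Number of scheduling rounds that fit into a duration,
    // given the interval between rounds (monitor.freq).
    static int rounds(int durationSecs, int monitorFreqSecs) {
        return durationSecs / monitorFreqSecs;
    }
}
```

With an assumed monitor.freq of 10 seconds, rounds(300, 10) gives 30 rounds for the tolerance time and rounds(1800, 10) gives 180 rounds for the resume time.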

The blacklist scheduler maintains a cache of supervisors. It compares all incoming supervisors against the cache, adds new ones, and removes those that have not appeared at any point within the tolerance time. (We can assume these have already been removed from the cluster; if not, they will be added back to the cache as soon as they appear again.)
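
A rough Java sketch of such a supervisor cache follows. It is not the actual implementation: the class name, the use of supervisor ids as strings, the last-seen timestamps, and the missing() helper (which treats cached supervisors absent from the current round as candidates for being bad) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: remembers when each supervisor was last seen.
class SupervisorCacheSketch {
    private final Map<String, Long> lastSeenSecs = new HashMap<>();
    private final long toleranceSecs;

    SupervisorCacheSketch(long toleranceSecs) {
        this.toleranceSecs = toleranceSecs;
    }

    // Called once per scheduling round with the supervisors currently reported.
    void update(Set<String> liveSupervisors, long nowSecs) {
        // Add new supervisors and refresh the ones already cached.
        for (String id : liveSupervisors) {
            lastSeenSecs.put(id, nowSecs);
        }
        // Drop supervisors that have not appeared within the tolerance time.
        lastSeenSecs.values().removeIf(seen -> nowSecs - seen > toleranceSecs);
    }

    // Supervisors currently held in the cache.
    Set<String> cached() {
        return new TreeSet<>(lastSeenSecs.keySet());
    }

    // Cached supervisors that are absent from the current round (assumed "bad").
    Set<String> missing(Set<String> liveSupervisors) {
        Set<String> result = new TreeSet<>(lastSeenSecs.keySet());
        result.removeAll(liveSupervisors);
        return result;
    }
}
```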
The blacklist scheduler also maintains a circular buffer with a fixed length of _tolerance.time / monitor.freq_, used as a sliding window. On every scheduling round, the bad slots or supervisors are added to the sliding window. (We implemented the circular buffer ourselves instead of using the disruptor RingBuffer inside Storm, which I think is not meant to serve as a sliding window. This is open for discussion.)
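
The sliding window could look roughly like the Java sketch below: a fixed number of slots, one per scheduling round, where the newest round overwrites the oldest. The class name, the string node ids, and the hitCounts() helper are assumptions for illustration, not the code in the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a fixed-length circular buffer used as a sliding window.
class SlidingWindowSketch {
    private final List<Set<String>> slots;
    private int next = 0; // index of the slot the next round will overwrite

    // size would be tolerance.time / monitor.freq
    SlidingWindowSketch(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        slots = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            slots.add(Collections.<String>emptySet());
        }
    }

    // Record the bad nodes of one scheduling round, replacing the oldest round.
    void add(Set<String> badNodes) {
        slots.set(next, new HashSet<>(badNodes));
        next = (next + 1) % slots.size();
    }

    // How many rounds in the window each node was reported bad.
    Map<String, Integer> hitCounts() {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> round : slots) {
            for (String node : round) {
                counts.merge(node, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Because each round occupies exactly one slot, a node's hit count can never exceed the window length, and old reports fall out of the window without any explicit timestamps.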
The blacklist map in the blacklist scheduler maps node info to a counter that is initialized to _resume.time / monitor.freq_, decreased by 1 on each scheduling round, and removed when it reaches 0. Nodes that appear more than tolerance.count times in the sliding window are added to this blacklist map.
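
The countdown behaviour of the blacklist map can be sketched in Java as follows. The names are hypothetical, and several details are assumptions the description leaves open: existing entries are aged before new ones are added, a node already on the blacklist does not have its counter reset, and "more than tolerance.count" is read as a strict comparison.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the blacklist map with a per-node countdown.
class BlacklistCountdownSketch {
    private final Map<String, Integer> remainingRounds = new HashMap<>();
    private final int toleranceCount;
    private final int resumeRounds; // resume.time / monitor.freq

    BlacklistCountdownSketch(int toleranceCount, int resumeRounds) {
        this.toleranceCount = toleranceCount;
        this.resumeRounds = resumeRounds;
    }

    // Called once per scheduling round with per-node hit counts from the window.
    void onSchedule(Map<String, Integer> hitCounts) {
        // Age existing entries and release the ones whose counter reached 0.
        remainingRounds.replaceAll((node, rounds) -> rounds - 1);
        remainingRounds.values().removeIf(rounds -> rounds <= 0);

        // Blacklist nodes that exceeded the tolerance count (assumed strict ">").
        for (Map.Entry<String, Integer> e : hitCounts.entrySet()) {
            if (e.getValue() > toleranceCount) {
                remainingRounds.putIfAbsent(e.getKey(), resumeRounds);
            }
        }
    }

    // Nodes currently excluded from scheduling.
    Set<String> blacklisted() {
        return new TreeSet<>(remainingRounds.keySet());
    }
}
```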


> Blacklist Scheduler
> -------------------
>
>                 Key: STORM-2083
>                 URL: https://issues.apache.org/jira/browse/STORM-2083
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-core
>            Reporter: Howard Lee
>              Labels: blacklist, scheduling
>             Fix For: 1.0.1, 1.0.2, 1.1.0, 1.0.3
>
>
> My company experienced a production fault in which a critical switch caused an unstable network for a set of machines, with a packet loss rate of 30%-50%. In such a fault, the supervisors and workers on the machines are not definitely dead, a case that would be easy to handle. Instead, they are still alive but very unstable, and they occasionally lose their heartbeat to nimbus. In such circumstances nimbus will still assign jobs to these machines but will soon find them invalid again, resulting in very slow convergence to a stable state.
> To deal with such unstable cases, we intend to implement a blacklist scheduler, which will temporarily add the unstable nodes (supervisors, slots) to the blacklist and resume them later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)