Posted to dev@storm.apache.org by "Robert Joseph Evans (JIRA)" <ji...@apache.org> on 2016/08/03 17:53:21 UTC

[jira] [Commented] (STORM-2018) Simplify Threading Model of the Supervisor

    [ https://issues.apache.org/jira/browse/STORM-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406303#comment-15406303 ] 

Robert Joseph Evans commented on STORM-2018:
--------------------------------------------

I would propose that we move to a model where we have 3 main types of threads for the supervisor itself. Threads in the localizer are different.

1) A single heartbeat (HB) thread (very much like it is now)
2) A single scheduling sync thread that would do roughly the following:
{code}
while (!done) {
  read scheduling from ZK && sanity check with retry like today;
  for (int port: set.union(scheduling.ports, slots.keys)) {
    Slot s = slots.get(port);
    if (s == null) {
      s = new Slot();
      slots.put(port, s);
    }
    s.setNewAssignment(scheduling.get(port));
  }
  sleep(...);
}
{code}
3) A Slot thread per slot. This thread would more or less do the following:
{code}
while(!done) {
  Assignment newAssignment = this.newAssignment;
  StateMachine.transitionIfNeeded(newAssignment,...);
}
{code}
The state machine itself is described in [Slot.dot|https://issues.apache.org/jira/secure/attachment/12821873/Slot.dot], and you can see a visualization in Slot.svg:
!Slot.svg!
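
To make the scheduling sync pass from item 2 more concrete, here is a minimal Java sketch of a single iteration. It is only an illustration under assumptions: {{SketchSlot}}, {{SchedulingSync}} and {{syncOnce}} are hypothetical names, assignments are reduced to plain strings, and reading from ZK, the retry logic and starting the slot threads are all left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a slot; only the hand-off of the assignment is modelled.
final class SketchSlot {
    private final AtomicReference<String> newAssignment = new AtomicReference<>();

    void setNewAssignment(String assignment) {
        newAssignment.set(assignment);
    }

    String getNewAssignment() {
        return newAssignment.get();
    }
}

// Hypothetical holder for the state owned by the scheduling sync thread.
final class SchedulingSync {
    // Only the sync thread touches this map, so a plain HashMap is enough in this sketch.
    private final Map<Integer, SketchSlot> slots = new HashMap<>();

    /**
     * One pass of the loop body. The scheduling argument maps port -> assignment;
     * using String for the assignment is an assumption made for brevity.
     */
    void syncOnce(Map<Integer, String> scheduling) {
        // Union of the ports that are scheduled now and the ports we already track.
        Set<Integer> ports = new TreeSet<>(scheduling.keySet());
        ports.addAll(slots.keySet());

        for (int port : ports) {
            SketchSlot s = slots.computeIfAbsent(port, p -> new SketchSlot());
            // scheduling.get(port) is null for ports that are no longer scheduled,
            // which tells the slot that it should end up empty.
            s.setNewAssignment(scheduling.get(port));
        }
    }

    SketchSlot slot(int port) {
        return slots.get(port);
    }
}
```

Iterating over the union means a slot that has lost its assignment still gets told about it (with a null assignment) instead of being silently skipped.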

Slot would have just a few methods to set things asynchronously:
{code}
public void setNewAssignment(Assignment...);
public void informWorkerDied(String workerId...);
{code}
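
One way these asynchronous setters could hand data to the slot thread is sketched below. This is a guess at the shape, not the proposed implementation: {{AsyncSlot}}, {{step}} and {{close}} are hypothetical names, the assignment is a plain string, and the state machine is reduced to "adopt the new assignment if it differs".

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical slot: callers only publish data, the slot thread does all the work.
final class AsyncSlot implements Runnable {
    private final AtomicReference<String> newAssignment = new AtomicReference<>();
    private final ConcurrentLinkedQueue<String> deadWorkers = new ConcurrentLinkedQueue<>();
    private volatile boolean done = false;

    // Touched only by the thread that calls step(), so no synchronization is needed.
    private String currentAssignment;

    public void setNewAssignment(String assignment) {
        newAssignment.set(assignment);
    }

    public void informWorkerDied(String workerId) {
        deadWorkers.add(workerId);
    }

    public void close() {
        done = true;
    }

    /** One iteration of the slot loop; returns true if anything needed handling. */
    boolean step() {
        boolean changed = false;

        // Drain death notifications; a real state machine would clean up or relaunch here.
        while (deadWorkers.poll() != null) {
            changed = true;
        }

        // Take a snapshot of the requested assignment and transition only if it differs.
        String wanted = newAssignment.get();
        if (!Objects.equals(wanted, currentAssignment)) {
            currentAssignment = wanted;
            changed = true;
        }
        return changed;
    }

    @Override
    public void run() {
        while (!done) {
            step();
            try {
                Thread.sleep(100); // polling interval is an arbitrary choice for this sketch
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

Because the setters never touch the slot's own state directly, all transitions happen on the slot thread, which is the property that keeps the threading model simple.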

Every time the current assignment is updated, it would also be written out to disk, so that if we crash we can recover.
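
As a rough illustration of the write-to-disk step, the sketch below writes to a temporary file and then renames it over the real one, so a crash in the middle of a write does not leave a half-written file behind. {{AssignmentStore}} and the file layout are assumptions for illustration only, the assignment is again a plain string, and the sketch does not force the data to disk with an fsync.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical on-disk store for one slot's current assignment.
final class AssignmentStore {
    private final Path file;

    AssignmentStore(Path file) {
        this.file = file;
    }

    /** Persist the assignment by writing a sibling temp file and renaming it into place. */
    void save(String assignment) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, assignment.getBytes(StandardCharsets.UTF_8));
        // On POSIX file systems this is a rename that replaces the old file in one step.
        // Whether an existing target is replaced is implementation specific on other platforms.
        Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Returns the last saved assignment, or null if nothing has been saved yet. */
    String load() throws IOException {
        if (!Files.exists(file)) {
            return null;
        }
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }
}
```

On restart, the slot could call {{load()}} to find out what it was running before the crash and resume from that state.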

> Simplify Threading Model of the Supervisor
> ------------------------------------------
>
>                 Key: STORM-2018
>                 URL: https://issues.apache.org/jira/browse/STORM-2018
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-core
>    Affects Versions: 1.0.0, 2.0.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>         Attachments: Slot.dot, Slot.svg
>
>
> We have been trying to roll out CGROUP enforcement and right now are running into a number of race conditions in the supervisor.  When using CGROUPS the timing of some operations is different and is exposing issues that we would not see otherwise.
> In order to make progress with testing/deploying CGROUP and RAS we are going to try to refactor the supervisor to have a simpler threading model, but likely with more threads.  We will base the code off of the Java code currently in master, and may replace that in the 2.0 release, but plan on having it be a part of 1.x too, if it truly is more stable.
> I will try to keep this JIRA up to date with what we are doing and the architecture to keep the community informed.  We need to move quickly to meet some of our company goals but will not just shove this in.  We welcome any feedback on the design and code before it goes into the community.


