Posted to issues@storm.apache.org by "Erik Weathers (JIRA)" <ji...@apache.org> on 2017/08/22 02:05:00 UTC

[jira] [Commented] (STORM-2126) Fix NPE due to race condition in nimbus.clj when attempting to get resources from SupervisorDetail

    [ https://issues.apache.org/jira/browse/STORM-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136150#comment-16136150 ] 

Erik Weathers commented on STORM-2126:
--------------------------------------

[~revans2]: hi again Bobby!  This is another change that broke storm-on-mesos.  It's *much* worse for us than the breakage caused by the earlier change I brought up ([STORM-2018|https://issues.apache.org/jira/browse/STORM-2018?focusedCommentId=16108307&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16108307]), which we have already worked around.  This change, however, switched the nimbus scheduling from being Slot-centric to being Supervisor-centric.  Specifically, it changed the behavior of the code path that takes the slots returned from {{INimbus.allSlotsAvailableForScheduling()}} and eventually propagates them into {{INimbus.assignSlots()}}: if a Supervisor doesn't already exist, then its slot information is no longer respected.  For storm-on-mesos the supervisors do not exist a priori; they are launched onto hosts when a topology needs to be launched.  So the storm-on-mesos integration was relying on the nimbus to take the slots and call {{assignSlots}} even when no supervisor existed yet.
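
To make the difference concrete, here is a minimal, self-contained sketch of the two behaviors. It is only an illustration under stated assumptions: the class {{SlotSchedulingSketch}}, the {{Slot}} type, and the methods {{slotCentric}} and {{supervisorCentric}} are all hypothetical names invented for this example and are not Storm's actual classes or the real nimbus code path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model. None of these names are Storm's real API.
final class SlotSchedulingSketch {

    /** A worker slot: a port on the node where a supervisor would run. */
    static final class Slot {
        final String nodeId;
        final int port;

        Slot(String nodeId, int port) {
            this.nodeId = nodeId;
            this.port = port;
        }
    }

    /** Slot-centric: every offered slot is usable, whether or not a supervisor exists. */
    static List<Slot> slotCentric(List<Slot> offered) {
        return new ArrayList<>(offered);
    }

    /** Supervisor-centric: a slot is usable only if its supervisor is already registered. */
    static List<Slot> supervisorCentric(List<Slot> offered, Set<String> registeredSupervisors) {
        List<Slot> usable = new ArrayList<>();
        for (Slot slot : offered) {
            if (registeredSupervisors.contains(slot.nodeId)) {
                usable.add(slot);
            }
        }
        return usable;
    }

    public static void main(String[] args) {
        // Slots offered by the resource manager (assumed host names and ports).
        List<Slot> offered = Arrays.asList(
                new Slot("mesos-host-a", 31000),
                new Slot("mesos-host-b", 31001));

        // On storm-on-mesos no supervisor has been launched yet.
        Set<String> registered = new HashSet<>();

        System.out.println(slotCentric(offered).size());                   // prints 2
        System.out.println(supervisorCentric(offered, registered).size()); // prints 0
    }
}
```

In the supervisor-centric variant, nothing is left to assign until a supervisor registers, which is the chicken-and-egg situation storm-on-mesos now hits.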

It is unfortunate that we inherited this messy interface between Storm and Mesos, which allowed a change that looks like a pure improvement (it *is* much more readable!) to actually break storm-on-mesos.  I have some ideas for hacking around it, which may eventually get done, but I'm curious whether you have any ideas for how we can fake the existence of Supervisors when none have actually been launched yet.

> Fix NPE due to race condition in nimbus.clj when attempting to get resources from SupervisorDetail
> --------------------------------------------------------------------------------------------------
>
>                 Key: STORM-2126
>                 URL: https://issues.apache.org/jira/browse/STORM-2126
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: Alessandro Bellina
>            Assignee: Alessandro Bellina
>            Priority: Minor
>             Fix For: 2.0.0, 1.1.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Work done by [~dagit] at Yahoo.
> In nimbus.clj there is some code to work around a race condition in compute-new-scheduler-assignments and read-all-supervisor-details. However, the race can be avoided, hence removing the need for the workaround.
> This was exposed because the RAS code expects the SupervisorDetail object to contain a resources map, but when the race happens these objects have that map set to null.


