Posted to dev@mesos.apache.org by Benjamin Hindman <be...@gmail.com> on 2014/06/28 05:22:08 UTC

framework unregistration bug for Java and Python frameworks

If you have written or maintain a Mesos framework please read on.

*What:* Today a long-standing bug was found in the MesosSchedulerDriver
for Java and Python that causes a framework to get unregistered from Mesos
without the framework doing so explicitly.

*How:* In the normal lifecycle of a framework, the scheduler calls 'stop()'
on its instance of MesosSchedulerDriver when it's done using the driver.
IMPORTANT: If the framework plans to fail over, it must pass 'true' to
'stop()'; otherwise it passes 'false' (the default).

Some very old code (from before the introduction of the 'failover' boolean
argument) that gets invoked when a Java or Python MesosSchedulerDriver gets
garbage collected was calling 'stop()' with the default value of 'false',
which tells Mesos that the framework will not be failing over and
reconnecting, so Mesos unregisters it.
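
To make the failure mode concrete, here is a minimal Python sketch. It does
not use the real Mesos bindings: FakeMaster and FakeSchedulerDriver are
hypothetical stand-ins written only for this illustration, and the
finalizer below is an assumption about the shape of the old cleanup code,
not a copy of it.

```python
import gc


class FakeMaster:
    """Hypothetical stand-in for the Mesos master; records unregistrations."""

    def __init__(self):
        self.unregistered = []


class FakeSchedulerDriver:
    """Hypothetical stand-in for MesosSchedulerDriver (not the real binding)."""

    def __init__(self, master, framework_id):
        self.master = master
        self.framework_id = framework_id
        self.stopped = False

    def stop(self, failover=False):
        # A second call is a no-op, so an explicit stop() wins over the finalizer.
        if self.stopped:
            return
        self.stopped = True
        if not failover:
            # failover=False means "I am done for good": the master unregisters us.
            self.master.unregistered.append(self.framework_id)

    def __del__(self):
        # The problematic pattern: cleanup calls stop() with the default 'false'.
        self.stop()


def simulate_collection_without_stop():
    """Drop the only reference to a driver that was never stopped explicitly."""
    master = FakeMaster()
    driver = FakeSchedulerDriver(master, "framework-1")
    del driver
    gc.collect()
    return master.unregistered


if __name__ == "__main__":
    print(simulate_collection_without_stop())
```

In this sketch the framework never asked to be unregistered, yet once the
driver object is collected the stand-in master records "framework-1" as
unregistered, which mirrors the behavior described above.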

*Why:* In particular, why wasn't this bug found before? This behavior only
occurs when the MesosSchedulerDriver instance actually gets garbage
collected _AND_ 'stop()' has not already been called. Moreover, in most
applications that don't call 'stop()', the MesosSchedulerDriver does not
get garbage collected, either because a reference is maintained for the
lifetime of the application _OR_ because the application is terminated
before the garbage collector kicks in! Our best guess as to why this was
uncovered today is that, for whatever reason, the garbage collector kicked
in and 'stop()' got invoked.

*Short-term Mitigation:*

(1) Never destroy your reference to MesosSchedulerDriver (so the garbage
collector never cleans it up).
(2) Always call 'stop(true)' after you're done with the
MesosSchedulerDriver if you plan on failing over!

In addition, we'll be releasing a *0.19.1* bug-fix release that fixes this
issue.

Apologies for any inconvenience this may cause folks. Big thanks to
Whitney Sorensen for reporting the bug and Vinod Kone for tracking it down.

Ben.

Re: framework unregistration bug for Java and Python frameworks

Posted by Benjamin Hindman <be...@gmail.com>.
Forgot to include the JIRA link for folks to follow along:
https://issues.apache.org/jira/browse/MESOS-1550


On Fri, Jun 27, 2014 at 8:22 PM, Benjamin Hindman <
benjamin.hindman@gmail.com> wrote:

> If you have written or maintain a Mesos framework please read on.
>
> *What:* Today a long-standing bug was found in the MesosSchedulerDriver
> for Java and Python that causes a framework to get unregistered from Mesos
> without the framework doing so explicitly.
>
> *How:* In the normal lifecycle of a framework, the scheduler calls
> 'stop()' on its instance of MesosSchedulerDriver when it's done using the
> driver. IMPORTANT: If the framework plans to fail over, it must pass 'true'
> to 'stop()'; otherwise it passes 'false' (the default).
>
> Some very old code (from before the introduction of the 'failover' boolean
> argument) that gets invoked when a Java or Python MesosSchedulerDriver gets
> garbage collected was calling 'stop()' with the default value of 'false',
> which tells Mesos that the framework will not be failing over and
> reconnecting, so Mesos unregisters it.
>
> *Why:* In particular, why wasn't this bug found before? This behavior
> only occurs when the MesosSchedulerDriver instance actually gets garbage
> collected _AND_ 'stop()' has not already been called. Moreover, in most
> applications that don't call 'stop()', the MesosSchedulerDriver does not
> get garbage collected, either because a reference is maintained for the
> lifetime of the application _OR_ because the application is terminated
> before the garbage collector kicks in! Our best guess as to why this was
> uncovered today is that, for whatever reason, the garbage collector kicked
> in and 'stop()' got invoked.
>
> *Short-term Mitigation:*
>
> (1) Never destroy your reference to MesosSchedulerDriver (so the garbage
> collector never cleans it up).
> (2) Always call 'stop(true)' after you're done with the
> MesosSchedulerDriver if you plan on failing over!
>
> In addition, we'll be releasing a *0.19.1* bug-fix release that fixes
> this issue.
>
> Apologies for any inconvenience this may cause folks. Big thanks to
> Whitney Sorensen for reporting the bug and Vinod Kone for tracking it down.
>
> Ben.
>
