Posted to issues@mesos.apache.org by "Andrei Sekretenko (JIRA)" <ji...@apache.org> on 2019/06/13 12:20:00 UTC
[jira] [Commented] (MESOS-9763) Race between two re-subscriptions
against an empty master.
[ https://issues.apache.org/jira/browse/MESOS-9763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863008#comment-16863008 ]
Andrei Sekretenko commented on MESOS-9763:
------------------------------------------
In [https://reviews.apache.org/r/70668], the validation of the new FrameworkInfo against the current one was moved into the `_subscribe()` continuation, which also applies the update. This fixes the race.
No deterministic test for this race has been implemented yet, though.
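
A toy model of the reordering described above, not the Mesos code: the `Master` class, its `validate` and `_subscribe` methods, and the "principal must not change" rule are hypothetical Python stand-ins, and authorization is left out entirely. It only illustrates why validating in the same step that applies the update closes the window:

```python
class ValidationError(Exception):
    pass


class Master:
    """Hypothetical stand-in for the master's framework registry."""

    def __init__(self):
        self.frameworks = {}  # framework id -> stored info (a dict)

    def validate(self, framework_id, info):
        # Assumed rule for illustration: the principal must not change.
        current = self.frameworks.get(framework_id)
        if current is not None and current["principal"] != info["principal"]:
            raise ValidationError("principal changed on re-subscription")

    def _subscribe(self, framework_id, info):
        # Continuation that runs after authorization: validate, then apply,
        # with nothing else interleaved between the two.
        self.validate(framework_id, info)
        self.frameworks[framework_id] = info


master = Master()
master._subscribe("fw", {"principal": "A"})
try:
    master._subscribe("fw", {"principal": "B"})
    rejected = False
except ValidationError:
    rejected = True
```

In this sketch the second call is validated against the state left by the first, so the conflicting principal is rejected rather than dropped.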
> Race between two re-subscriptions against an empty master.
> ----------------------------------------------------------
>
> Key: MESOS-9763
> URL: https://issues.apache.org/jira/browse/MESOS-9763
> Project: Mesos
> Issue Type: Bug
> Components: master, scheduler api
> Reporter: Andrei Sekretenko
> Priority: Major
> Labels: foundations
>
> Currently, subscription (and re-subscription) is not atomic.
> It consists of three steps performed by two actors:
> - Validating the supplied FrameworkInfo against the master state (which possibly includes an existing FrameworkInfo)
> - Authorizing the (re-)subscribing framework
> - Applying the update
> A partitioned or buggy (or both) framework can trigger a race by sending two SUBSCRIBE calls with differing FrameworkInfos on master failover.
> One of the possible sequences of events:
> 1. FrameworkInfo A is validated by master (which has no data about this framework)
> 2. Conflicting FrameworkInfo B is validated by master (which stores no data about this framework, as Scheduler A is not even authorized yet)
> 3. Scheduler A is authorized
> 4. Scheduler B is authorized
> 5. FrameworkInfo A is applied
> 6. Master attempts to apply FrameworkInfo B, which is no longer valid after the previous step.
> One simple example is an attempt to re-subscribe with two different principals: currently scheduler B's principal will be silently ignored at step 6 (instead of a validation error being sent to B).
> At the moment of writing I'm not sure if there are other problems caused by this race.
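
The sequence in the quoted description can be replayed with a small simulation. This is only a sketch under stated assumptions: `Master`, `validate`, and `apply` are hypothetical names, authorization is treated as a no-op, and the "keep the stored principal on update" behaviour is an assumed model of the silent ignore, not the actual master logic.

```python
class Master:
    """Hypothetical model with validation and application as separate steps."""

    def __init__(self):
        self.frameworks = {}  # framework id -> stored info (a dict)

    def validate(self, framework_id, info):
        # Returns an error string, or None when the info is acceptable.
        current = self.frameworks.get(framework_id)
        if current is not None and current["principal"] != info["principal"]:
            return "principal mismatch"
        return None

    def apply(self, framework_id, info):
        current = self.frameworks.get(framework_id)
        if current is None:
            self.frameworks[framework_id] = dict(info)
        else:
            # Assumed behaviour: other fields are updated, while the stored
            # principal is kept and the new one is silently dropped.
            current["name"] = info["name"]


master = Master()
info_a = {"name": "fw-a", "principal": "A"}
info_b = {"name": "fw-b", "principal": "B"}

# Steps 1 and 2: both infos are validated against an empty master.
early_checks = [master.validate("fw", info_a), master.validate("fw", info_b)]

# Steps 3 and 4: authorization of A and B (a no-op in this sketch).

# Steps 5 and 6: both updates are applied, in order.
master.apply("fw", info_a)
master.apply("fw", info_b)

# Validating B now, against the stored state, would have caught the conflict.
late_check = master.validate("fw", info_b)
```

Both early checks pass because neither sees the other's data, yet the stored principal stays `"A"` and scheduler B never receives an error.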