Posted to dev@mesos.apache.org by Chun-Hung Hsiao <ch...@apache.org> on 2018/04/16 21:04:24 UTC

Convention for Backward Compatibility for New Operations in Mesos 1.6

Hi all,

As some of you may already know, we are currently working on patches to
implement the new GROW_VOLUME and SHRINK_VOLUME operations [1].

One problem that has surfaced is that, since the new operations are not
supported in Mesos 1.5, they will crash the agent during the operation
application cycle if a Mesos 1.6 master sends them to a Mesos 1.5 agent [2].

We are now considering two possibilities to address this compatibility problem:

1) The Mesos 1.6 master should check the agent's Mesos version in
`Master::accept` [3]. Moving forward, if we add new operations in future Mesos
releases, we would have code like the following:

```
// Get the Mesos version of the slave of the offer.
Version slaveVersion = ...;
switch (operation.type()) {
  ...
  case SOME_NEW_OPERATION: {
    if (slaveVersion < minVersionForSomeNewOperation) {
      ... // Drop the operation.
    }
    break;
  }
  ...
}
```

Pros and cons:
+ The new operation never enters the operation application cycle, since it is
  rejected at the very beginning. This means no resource metadata is touched.
- Explicit slave version checks on the master side make the code less clean,
  and we will need to update this list every time we add a new operation.

2) Treat this issue as an agent crash bug. The Mesos master would forward
the operation to the agent, regardless of the agent's Mesos version. In the
agent, we would deploy and backport the following logic in
`Slave::applyOperation` [4]:

```
if (message.operation_info().type() == OPERATION_UNKNOWN) {
  ... // Drop the operation and trigger a re-registration or send an
      // `UpdateSlaveMessage` to force the master to update the total
      // resources of the slave.
}
```

Pros and cons:
+ Easier to add new operations, since no new logic needs to be added for
  backward compatibility.
- Since the old agent won't know whether the new operations are speculative
  or not, a re-registration or an `UpdateSlaveMessage` is required.
- Mesos 1.5.0 agents will still have the bug and crash when a new master
  sends a new operation to them.

Since both options are viable and there seems to be no clear winner, we'd
like to check with the community to see which convention is preferable.
Please let us know what you think. Thanks!

Best,
Chun-Hung


[1] https://issues.apache.org/jira/browse/MESOS-4965
[2] https://github.com/apache/mesos/blob/1.5.x/src/common/protobuf_utils.cpp#L851
[3] https://github.com/apache/mesos/blob/master/src/master/master.cpp#L3899
[4] https://github.com/apache/mesos/blob/1.5.x/src/slave/slave.cpp#L4359

Re: Convention for Backward Compatibility for New Operations in Mesos 1.6

Posted by Chun-Hung Hsiao <ch...@mesosphere.io>.
If we do option 1, then there will be no agent crash since the master won't
send any unknown operation to an old agent,
so option 2 is not a necessity.

On Mon, Apr 16, 2018 at 2:12 PM, Silas Snider <sw...@apple.com> wrote:

> I think we should definitely do option 2 regardless of whether we do
> option 1 as well, since although in this case it will still crash 1.5.0, at
> least in the future we won't have to worry about this again.

Re: Convention for Backward Compatibility for New Operations in Mesos 1.6

Posted by Silas Snider <sw...@apple.com>.
I think we should definitely do option 2 regardless of whether we do option 1 as well, since although in this case it will still crash 1.5.0, at least in the future we won't have to worry about this again.

Re: Convention for Backward Compatibility for New Operations in Mesos 1.6

Posted by James Peach <jo...@gmail.com>.
> On Apr 16, 2018, at 2:04 PM, Chun-Hung Hsiao <ch...@apache.org> wrote:
> 
> Hi all,
> 
> As some of you may already know, we are currently working on patches to
> implement the new GROW_VOLUME and SHRINK_VOLUME operations [1].
>
> One problem that has surfaced is that, since the new operations are not
> supported in Mesos 1.5, they will crash the agent during the operation
> application cycle if a Mesos 1.6 master sends them to a Mesos 1.5 agent [2].
>
> We are now considering two possibilities to address this compatibility problem:
>
> 1) The Mesos 1.6 master should check the agent's Mesos version in
> `Master::accept` [3]. Moving forward, if we add new operations in future Mesos
> releases, we would have code like the following:

Using a capability follows the existing practice. I'm also sympathetic to the argument that this is an experimental feature and will cause 1.5 agents to crash.

> 2) Treat this issue as an agent crash bug. The Mesos master would forward
> the operation to the agent, regardless of the agent's Mesos version. In the
> agent, we would deploy and backport the following logic in
> `Slave::applyOperation` [4]:
> 
> ```
> if (message.operation_info().type() == OPERATION_UNKNOWN) {
>  ... // Drop the operation and trigger a re-registration or send an
>      // `UpdateSlaveMessage` to force the master to update the total
>      // resources of the slave.
> }
> ```

You should never drop operations. This should respond with some sort of "UNKNOWN/UNSUPPORTED" status.
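
A rough sketch of that idea, assuming a hypothetical `UNSUPPORTED` state: the
agent answers an unrecognized operation type with an explicit status instead of
crashing or silently dropping it. `OperationType`, `OperationState`,
`OperationStatus` and this `applyOperation` are illustrative stand-ins, not the
Mesos protobufs or the real `Slave::applyOperation`.

```cpp
#include <cassert>
#include <string>

// Hypothetical types for illustration only; not the Mesos protobufs.
enum class OperationType { UNKNOWN, RESERVE, CREATE };
enum class OperationState { FINISHED, UNSUPPORTED };

struct OperationStatus {
  OperationState state;
  std::string message;
};

// Applies an operation on the agent. An unrecognized type yields an
// explicit UNSUPPORTED status rather than a crash or a silent drop, so
// the sender always learns the outcome.
inline OperationStatus applyOperation(OperationType type) {
  switch (type) {
    case OperationType::RESERVE:
    case OperationType::CREATE:
      // The real work of applying the operation is omitted here.
      return {OperationState::FINISHED, ""};
    case OperationType::UNKNOWN:
      break;
  }
  return {OperationState::UNSUPPORTED,
          "Operation type is not supported by this agent"};
}
```

The point of the design is that every operation gets exactly one terminal
answer, which the master or framework can act on without waiting for a
re-registration.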

J

Re: Convention for Backward Compatibility for New Operations in Mesos 1.6

Posted by Chun-Hung Hsiao <ch...@mesosphere.io>.
Are you suggesting that for every new operation we'll introduce a new
capability?

On Mon, Apr 16, 2018 at 2:14 PM, Vinod Kone <vi...@apache.org> wrote:

> Crashing the agent is definitely not a viable option IMO.
>
> Why can't we use agent capabilities instead of agent version and reject
> such operations at master? This is one of the main reasons we introduced
> the concept of framework, master, agent capabilities.

Re: Convention for Backward Compatibility for New Operations in Mesos 1.6

Posted by Greg Mann <gr...@mesosphere.io>.
> Crashing the agent is definitely not a viable option IMO.
>
> Why can't we use agent capabilities instead of agent version and reject
> such operations at master? This is one of the main reasons we introduced
> the concept of framework, master, agent capabilities.
>
>
One thing worth mentioning is that this crash would only manifest when:
1) The operator has set the experimental RESOURCE_PROVIDER capability on
the 1.5 agents in the cluster, and
2) Frameworks/operators begin using the new GROW_VOLUME/SHRINK_VOLUME
operations before the upgrade is complete, while some agents are still on 1.5.

We can definitely use a capability to address this scenario. For some
reason I find myself hesitant to add capabilities for small features or
edge cases like this, but perhaps there's no reason for such hesitation?

Cheers,
Greg

Re: Convention for Backward Compatibility for New Operations in Mesos 1.6

Posted by Vinod Kone <vi...@apache.org>.
Crashing the agent is definitely not a viable option IMO.

Why can't we use agent capabilities instead of agent version and reject
such operations at master? This is one of the main reasons we introduced
the concept of framework, master, agent capabilities.
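
To make the capability-based alternative concrete, here is a minimal sketch
under stated assumptions: the `RESIZE_VOLUME` capability name, the `Agent`
struct and `agentSupports` are hypothetical and chosen for the example, not
taken from the Mesos code base.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-ins; not the Mesos protobuf definitions.
enum class Capability { RESOURCE_PROVIDER, RESIZE_VOLUME };
enum class OperationType { RESERVE, CREATE, GROW_VOLUME, SHRINK_VOLUME };

struct Agent {
  // Capabilities the agent advertised when it registered.
  std::set<Capability> capabilities;
};

// Returns true if the master may forward the operation to this agent.
inline bool agentSupports(const Agent& agent, OperationType type) {
  switch (type) {
    case OperationType::GROW_VOLUME:
    case OperationType::SHRINK_VOLUME:
      // Assumed: a single capability gates both resize operations.
      return agent.capabilities.count(Capability::RESIZE_VOLUME) > 0;
    case OperationType::RESERVE:
    case OperationType::CREATE:
      return true;
  }
  return false; // Unexpected enum value: reject rather than forward.
}
```

Compared with a version comparison, the master here only asks whether the agent
advertised the feature, so no per-release version thresholds need to be kept.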
