You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mesos.apache.org by haosdent <ha...@gmail.com> on 2016/09/05 18:14:01 UTC

Question about the deprecated policy after 1.0

Hi, folks. As I mentioned in the previous email
http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1.
We have added `type` in the `HealthCheck` protobuf definition in 1.1.0 and
health checks without `type` specified will be deprecated since 1.1.0.

For backwards compatibility, we still support the command health check if
the
type is not specified for now. But we plan to make `type` become a required
field
and return `TASK_ERROR` if the type is not specified after 6 months. The
question
is if this meets the deprecated policy since 1.0 ? If 6 months is too short
and
we have to deprecate it after 2.0 ?

Looking forward the answers. Any concerns and questions are appreciated,
thanks a lot!

-- 
Best Regards,
Haosdent Huang

Re: Question about the deprecated policy after 1.0

Posted by Yan Xu <xu...@apple.com>.

Thanks @haosdent. Sorry there's been offline chats and I was waiting for that to circle back to the list.

Comments inlined. Besides this particular issue I think we all agree that we can improve the versioning doc to avoid confusion in the future. I'll send a RR for that but let's focus on this particular issue first.

> On Sep 14, 2016, at 7:42 PM, haosdent <ha...@gmail.com> wrote:
> 
> Thx @yan's update!
> 
> >The remaining issue we'd like to discuss with the community is, of the things in we plan to support in v1 API and hope to deprecate in Mesos 2.0, should we slate them for deprecation before we have defined their replacement?
> 
> IMO, if we sure a feature would be deprecated in Mesos 2.0, we should deprecate it immediately although could not give a clear replacement at that time.
> Then users would think that feature is not recommended to use and avoid to use it.

There has been misunderstandings about whether the v1 HTTP health check API is supported but I thought that the eventual conclusion was that we are supporting it in 1.x.
See http://mesos.slackarchive.io/health-check/-/1472748463.000051/1473696580.000102/1473235632000088/ <http://mesos.slackarchive.io/health-check/-/1472748463.000051/1473696580.000102/1473235632000088/>

But this is not even about it.

If the API is totally not supposed to be used, no need to wait until 2.0, remove it today.
If the API is supported in v1 and we are not sunsetting this feature, we need to provide an "upgrade path" whenever we plan to replace it with something else (in addition to conforming to the process here <http://mesos.apache.org/documentation/latest/versioning/>)

In either case, "deprecate this API in 2.0 without an alternative" doesn't make sense.

This is not a hypothetical issue - there are people who depend on this field.

> 
> Otherwise, we would encounter the same problem again:
> After few releases, we finish the better replacement of that old feature. Unfortunately, we find it has been used for a lot of users because we didn' mark it deprecated before. Then we have to deprecate it in Mesos 3.0.
> A worse case is we forget and ignore that TODO item as time flows, and left this tech debt to other guys. This is which we should avoid now if we could.

Think about the users who depend on this who find out that alternative API doesn't make into the v2 but the old API is removed from it in v2. What should they do?

I don't think we should ever do this.

Like I said, it's perfectly fine if we want to deprecate this in 2.0. To do that, let's make sure the alternative API makes into 2.0 first. If we fail to achieve that, why shouldn't we postpone the deprecation further?

> 
> Bases on the above two points, I think we should slate the deprecated feature although we don't have defined their replacement.
> 
> On Thu, Sep 15, 2016 at 4:31 AM, Yan Xu <xujyan@apple.com <ma...@apple.com>> wrote:
> To follow up on this, after discussions we (contributors and reviewers of MESOS-6110) have agreed to support the existing HTTP health check API in Mesos 1.x. (See https://reviews.apache.org/r/51803 <https://reviews.apache.org/r/51803>)
> 
> The remaining issue we'd like to discuss with the community is, of the things in we plan to support in v1 API and hope to deprecate in Mesos 2.0, should we slate them for deprecation before we have defined their replacement?
> 
> Two contrasting examples In /r/51803:
> `HTTPCheckInfo::type`:
> - v1 API: setting `type` is not required.
> - Proposal: Deprecate the above in 2.0, i.e., requiring it to be set.
> - This is clearly OK because it's clear to the users how to migrate to the new API -> just set the `type`.
> 
> `HTTPCheckInfo::statuses`:
> - v1 API: This field can be set, even though not acted on by Mesos default executors, they can be used by custom executors.
> - Proposal: Deprecate the above in 2.0, i.e., removing it.
> - IMO we cannot slate this API for deprecation in 2.0 right now because there's no replacement defined yet. I think ultimately we don't want to be in the situation where we remove `statuses` without replacing it with something else. If that's the case, I don't think we need to rush to decide that it has to be deprecated in 2.0. What if the replacement doesn't make it into 2.0?
> 
> Feel free to provide your feedback here or on https://reviews.apache.org/r/51803 <https://reviews.apache.org/r/51803> where more context may help clarify things.
> 
> 
> > On Sep 6, 2016, at 12:05 PM, Yan Xu <xujyan@apple.com <ma...@apple.com>> wrote:
> >
> > Should we think of this not as "whether this change should be subject to the current policy" but rather "whether this change presents an exception case that shouldn't be subject to the current policy"?
> >
> > We shouldn't make exceptions liberally but given that the community as a whole is new to the 1.0 world, we should allow the policy to evolve into something that works for both the developers and the frameworks writers/users. If we discover exceptions we should amend the policy document.
> >
> > The current policy on this is very clear on this (as Silas has cited): "The deprecation clock for vN-1 API will start as soon as we release “N.0.0” version of Mesos. We will strive to give enough time (e.g., 6 months) for frameworks/operators to upgrade to vN API before we stop supporting vN-1 API."
> >
> > In this case it translates to "Start the deprecation for this change with Mesos 2.0 and thrive to not break it until 2.0.0 release date + 6 months".
> >
> > Can we list the technical challenges that would make adhering to this policy not worthwhile? If this is the case, how should we change policy document accordingly?
> >
> > Separately from the details of this change but motivated by it, I think we need to improve the policy document on:
> > Clearly state what in the versioned protobufs that don't constitute as the official API. IMO we should give the developers the flexibility to add to the API before the feature is fully developed (and they are not subject to the policy) but in the policy doc we should point out the words to look for for such exceptions, e.g., "Unimplemented. DO NOT USE." or "Experimental. Unstable API."
> > Clearly state what to do with deprecations by developers and users. People will likely appreciate early heads-ups so perhaps the developer should be able to start warning people as soon as the change is made. e.g., print a warning message that references a future release: "This API/flag will be deprecated in Mesos 2.0... "
> > To help folks discover deprecations we can have a live document that lists the deprecated features by version. Currently the CHANGELOG file only lists deprecation in the next release so there's not a place to put the deprecations for 2.0 when we are only at 1.1.0 (WIP).
> > Thoughts?
> >
> > Jiang Yan Xu 
> >
> > On Tue, Sep 6, 2016 at 6:40 AM, Silas Snider <swsnider@apple.com <ma...@apple.com> <mailto:swsnider@apple.com <ma...@apple.com>>> wrote:
> > Responses inline
> >
> > > On Sep 6, 2016, at 1:33 AM, haosdent <haosdent@gmail.com <ma...@gmail.com> <mailto:haosdent@gmail.com <ma...@gmail.com>>> wrote:
> > >
> > > Hi, Silas. Thanks a lot to help test the health check changes recently.
> > >
> > > According to my understanding about your email, you mentioned two problems:
> > >
> > > 1. The bug that broken exists HTTP/command health check caused by r50812 <https://reviews.apache.org/r/50812 <https://reviews.apache.org/r/50812><https://reviews.apache.org/r/50812 <https://reviews.apache.org/r/50812>>> and r50996 <https://reviews.apache.org/r/50996 <https://reviews.apache.org/r/50996> <https://reviews.apache.org/r/50996 <https://reviews.apache.org/r/50996>>>
> > >
> > > >It is now true that even with the proposed change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code working in 1.0.0.
> > > >Even in the case of the command health checks, which are once again supported in 51560, we now get deprecation warnings, suggesting that mesos will again break us in 1.4.
> > >
> > > As you mentioned, this is a bug and we definitely should fix before release 1.1.0.
> > > I have updated r51560 <https://reviews.apache.org/r/51560 <https://reviews.apache.org/r/51560> <https://reviews.apache.org/r/51560 <https://reviews.apache.org/r/51560>>> yesterday and verify it fix the problem via r51635. As you see in
> > > the r51560 <https://reviews.apache.org/r/51560 <https://reviews.apache.org/r/51560> <https://reviews.apache.org/r/51560 <https://reviews.apache.org/r/51560>>>, we make sure the protobuf compatible again and didn't lose any
> > > fields. Would you help to double check if it fixes your problem when you free?
> > > It would be highly appreciated that if you could help to verify it.
> > >
> > > After this bug fix, we could ensure all tasks with HTTP/command health check are not when upgrading to 1.1.0.
> > >
> >
> > I see those changes now (I’m very very bad at the review board UI, so I’m sorry if it was always there and I missed it somehow).
> >
> > > 2. Should we make the `HealthCheck::type` required after v2 ?
> > >
> > > To be honest, I think 6 months should be enough and it also should be changed in
> > > v1 because it is a minor change and we didn't make it `required` in protobuf
> > > message level. We still keeping it `option` in protobuf message definition and
> > > add a check about it in Mesos code.
> > > But your concerns make sense as well, so let's see what other users/developers say to
> > > see if we could make an agreement on this.
> > >
> >
> > This is an important point. It doesn’t make sense to me that the compatibility policy is talking about only whether a protobuf field is optional or required — it seems to me that any change that takes a protocol exchange that did not result in a TASK_ERROR before, and changes it to cause a TASK_ERROR now, *is* making that protobuf field semantically required, whether or not the protobuf def says so.
> >
> > I’ll also point out that there is no definition of ‘minor’ change in the compatibility document, and therefore, whether or not a change appears to be ‘minor’ under some rubric (and I agree that this change could seem minor), if it’s part of the v1 mesos.proto, it affects downstream users (such as me, a writer of schedulers).
> >
> > > On Tue, Sep 6, 2016 at 1:18 PM, Silas Snider <swsnider@apple.com <ma...@apple.com> <mailto:swsnider@apple.com <ma...@apple.com>> <mailto:swsnider@apple.com <ma...@apple.com><mailto:swsnider@apple.com <ma...@apple.com>>>> wrote:
> > > There’s a little history to this:
> > >
> > > In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>>> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/><https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>>>>, on the 8th of August, the HTTP health check message was changed to be entirely incompatible with the previous HTTP health check message. Not only was its name changed (breaking compatibility with anyone using the feature with libmesos), but the field tags were rearranged, making it truly wire-format incompatible. This change also introduced a ‘type’ field to the HealthCheck message as an optional enum.
> > >
> > > Next, in https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/><https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>>> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>>>>, on the 13th of August, the health checking code was changed to make the new ‘type’ field mandatory — if the protobuf field is not present, the mesos master rejects your task with TASK_ERROR.
> > >
> > > A colleague of mine was testing our internal scheduler against HEAD of mesos, and discovered that any task they submitted was being rejected as TASK_ERROR, since we were setting health checks, but not sending type. I filed MESOS-6110, on the 30th of August, and haosdent huang has kindly created https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/>> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/><https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/>>> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/>> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/>>>> to try to fix this.
> > >
> > > In the course of reviewing that fix, I noticed that it only addresses the case of a command health check, and does not continue to support HTTP health checks in the way they were in 1.0.0. This is a problem for our scheduler, as we have ~always (before mesos actually added support) passed our HTTP health checks in the message, depending on our custom executor to actually perform the check. It is now true that even with the proposed change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code working in 1.0.0.
> > >
> > > Even in the case of the command health checks, which are once again supported in 51560, we now get deprecation warnings, suggesting that mesos will again break us in 1.4.
> > >
> > > It is my team’s belief that the mesos compatibility guarantee, as documented on this page: http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/> <http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/>> <http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/><http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/>>><http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/> <http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/>> <http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/> <http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/>>>> would prohibit this sort of change from occurring. Specifically, the ‘API Versioning’ section says "The API version is only bumped if we need to make a backwards incompatible API change. We will strive to support a given API version for at least a year.” and under the ‘API compatibility’ the change is considered to be breaking if it would involve "Adding new required fields to existing requests to “/scheduler”.”
> > >
> > > The proposed change does indeed add a new required field — ‘type’ to the v1 api, in the case of command health checks in 6 months, in the case of http health checks, immediately. Therefore, it seems clear that this constitutes a new ‘v2’ api, and it’s very clear that 6 months is too short, especially as another part of the 'API Versioning’ section says "The deprecation clock for vN-1 API will start as soon as we release “N.0.0” version of Mesos. […]”
> > >
> > > Please believe me, I understand the need to be able to change broken api and implementation quickly, without spending years maintaining technical debt. This is why I believe the mesos project decided to move to a model where the internal protobufs are separate from the v1/v2/etc. protobufs, and evolvers/devolvers are proposed. It seems clear that the right way of doing this is to modify the internal protobuf to look the way you’d like (better message name, clearer field order, etc.) and write an evolver from the v1 api to the internal api.
> > >
> > > Also, I think it’s important to note that the compatibility guarantees I’m citing are exactly the things that make it possible at all to write a scheduler against mesos and actually use it in production. Deciding that this case is too insignificant to really bother with the compatibility guarantees means that you’ve just pushed the tech debt issue one level higher to the scheduler writers.
> > >
> > > I’m sorry this email ended up so long, but thank you for taking some time to read it — I believe that this issue is critical to the ongoing health of the mesos project.
> > >
> > >
> > > > On Sep 5, 2016, at 11:14 AM, haosdent <haosdent@gmail.com <ma...@gmail.com> <mailto:haosdent@gmail.com <ma...@gmail.com>> <mailto:haosdent@gmail.com <ma...@gmail.com><mailto:haosdent@gmail.com <ma...@gmail.com>>>> wrote:
> > > >
> > > > Hi, folks. As I mentioned in the previous email http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>>> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1><http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>>>>.
> > > > We have added `type` in the `HealthCheck` protobuf definition in 1.1.0 and
> > > > health checks without `type` specified will be deprecated since 1.1.0.
> > > >
> > > > For backwards compatibility, we still support the command health check if the
> > > > type is not specified for now. But we plan to make `type` become a required field
> > > > and return `TASK_ERROR` if the type is not specified after 6 months. The question
> > > > is if this meets the deprecated policy since 1.0 ? If 6 months is too short and
> > > > we have to deprecate it after 2.0 ?
> > > >
> > > > Looking forward the answers. Any concerns and questions are appreciated, thanks a lot!
> > > >
> > > > --
> > > > Best Regards,
> > > > Haosdent Huang
> > >
> > >
> > >
> > >
> > > --
> > > Best Regards,
> > > Haosdent Huang
> >
> >
> 
> 
> 
> 
> -- 
> Best Regards,
> Haosdent Huang

Re: Question about the deprecated policy after 1.0

Posted by Yan Xu <xu...@apple.com>.

Thanks @haosdent. Sorry there's been offline chats and I was waiting for that to circle back to the list.

Comments inlined. Besides this particular issue I think we all agree that we can improve the versioning doc to avoid confusion in the future. I'll send a RR for that but let's focus on this particular issue first.

> On Sep 14, 2016, at 7:42 PM, haosdent <ha...@gmail.com> wrote:
> 
> Thx @yan's update!
> 
> >The remaining issue we'd like to discuss with the community is, of the things in we plan to support in v1 API and hope to deprecate in Mesos 2.0, should we slate them for deprecation before we have defined their replacement?
> 
> IMO, if we sure a feature would be deprecated in Mesos 2.0, we should deprecate it immediately although could not give a clear replacement at that time.
> Then users would think that feature is not recommended to use and avoid to use it.

There has been misunderstandings about whether the v1 HTTP health check API is supported but I thought that the eventual conclusion was that we are supporting it in 1.x.
See http://mesos.slackarchive.io/health-check/-/1472748463.000051/1473696580.000102/1473235632000088/ <http://mesos.slackarchive.io/health-check/-/1472748463.000051/1473696580.000102/1473235632000088/>

But this is not even about it.

If the API is totally not supposed to be used, no need to wait until 2.0, remove it today.
If the API is supported in v1 and we are not sunsetting this feature, we need to provide an "upgrade path" whenever we plan to replace it with something else (in addition to conforming to the process here <http://mesos.apache.org/documentation/latest/versioning/>)

In either case, "deprecate this API in 2.0 without an alternative" doesn't make sense.

This is not a hypothetical issue - there are people who depend on this field.

> 
> Otherwise, we would encounter the same problem again:
> After few releases, we finish the better replacement of that old feature. Unfortunately, we find it has been used for a lot of users because we didn' mark it deprecated before. Then we have to deprecate it in Mesos 3.0.
> A worse case is we forget and ignore that TODO item as time flows, and left this tech debt to other guys. This is which we should avoid now if we could.

Think about the users who depend on this who find out that alternative API doesn't make into the v2 but the old API is removed from it in v2. What should they do?

I don't think we should ever do this.

Like I said, it's perfectly fine if we want to deprecate this in 2.0. To do that, let's make sure the alternative API makes into 2.0 first. If we fail to achieve that, why shouldn't we postpone the deprecation further?

> 
> Bases on the above two points, I think we should slate the deprecated feature although we don't have defined their replacement.
> 
> On Thu, Sep 15, 2016 at 4:31 AM, Yan Xu <xujyan@apple.com <ma...@apple.com>> wrote:
> To follow up on this, after discussions we (contributors and reviewers of MESOS-6110) have agreed to support the existing HTTP health check API in Mesos 1.x. (See https://reviews.apache.org/r/51803 <https://reviews.apache.org/r/51803>)
> 
> The remaining issue we'd like to discuss with the community is, of the things in we plan to support in v1 API and hope to deprecate in Mesos 2.0, should we slate them for deprecation before we have defined their replacement?
> 
> Two contrasting examples In /r/51803:
> `HTTPCheckInfo::type`:
> - v1 API: setting `type` is not required.
> - Proposal: Deprecate the above in 2.0, i.e., requiring it to be set.
> - This is clearly OK because it's clear to the users how to migrate to the new API -> just set the `type`.
> 
> `HTTPCheckInfo::statuses`:
> - v1 API: This field can be set, even though not acted on by Mesos default executors, they can be used by custom executors.
> - Proposal: Deprecate the above in 2.0, i.e., removing it.
> - IMO we cannot slate this API for deprecation in 2.0 right now because there's no replacement defined yet. I think ultimately we don't want to be in the situation where we remove `statuses` without replacing it with something else. If that's the case, I don't think we need to rush to decide that it has to be deprecated in 2.0. What if the replacement doesn't make it into 2.0?
> 
> Feel free to provide your feedback here or on https://reviews.apache.org/r/51803 <https://reviews.apache.org/r/51803> where more context may help clarify things.
> 
> 
> > On Sep 6, 2016, at 12:05 PM, Yan Xu <xujyan@apple.com <ma...@apple.com>> wrote:
> >
> > Should we think of this not as "whether this change should be subject to the current policy" but rather "whether this change presents an exception case that shouldn't be subject to the current policy"?
> >
> > We shouldn't make exceptions liberally but given that the community as a whole is new to the 1.0 world, we should allow the policy to evolve into something that works for both the developers and the frameworks writers/users. If we discover exceptions we should amend the policy document.
> >
> > The current policy on this is very clear on this (as Silas has cited): "The deprecation clock for vN-1 API will start as soon as we release “N.0.0” version of Mesos. We will strive to give enough time (e.g., 6 months) for frameworks/operators to upgrade to vN API before we stop supporting vN-1 API."
> >
> > In this case it translates to "Start the deprecation for this change with Mesos 2.0 and thrive to not break it until 2.0.0 release date + 6 months".
> >
> > Can we list the technical challenges that would make adhering to this policy not worthwhile? If this is the case, how should we change policy document accordingly?
> >
> > Separately from the details of this change but motivated by it, I think we need to improve the policy document on:
> > Clearly state what in the versioned protobufs that don't constitute as the official API. IMO we should give the developers the flexibility to add to the API before the feature is fully developed (and they are not subject to the policy) but in the policy doc we should point out the words to look for for such exceptions, e.g., "Unimplemented. DO NOT USE." or "Experimental. Unstable API."
> > Clearly state what to do with deprecations by developers and users. People will likely appreciate early heads-ups so perhaps the developer should be able to start warning people as soon as the change is made. e.g., print a warning message that references a future release: "This API/flag will be deprecated in Mesos 2.0... "
> > To help folks discover deprecations we can have a live document that lists the deprecated features by version. Currently the CHANGELOG file only lists deprecation in the next release so there's not a place to put the deprecations for 2.0 when we are only at 1.1.0 (WIP).
> > Thoughts?
> >
> > Jiang Yan Xu 
> >
> > On Tue, Sep 6, 2016 at 6:40 AM, Silas Snider <swsnider@apple.com <ma...@apple.com> <mailto:swsnider@apple.com <ma...@apple.com>>> wrote:
> > Responses inline
> >
> > > On Sep 6, 2016, at 1:33 AM, haosdent <haosdent@gmail.com <ma...@gmail.com> <mailto:haosdent@gmail.com <ma...@gmail.com>>> wrote:
> > >
> > > Hi, Silas. Thanks a lot to help test the health check changes recently.
> > >
> > > According to my understanding about your email, you mentioned two problems:
> > >
> > > 1. The bug that broken exists HTTP/command health check caused by r50812 <https://reviews.apache.org/r/50812 <https://reviews.apache.org/r/50812><https://reviews.apache.org/r/50812 <https://reviews.apache.org/r/50812>>> and r50996 <https://reviews.apache.org/r/50996 <https://reviews.apache.org/r/50996> <https://reviews.apache.org/r/50996 <https://reviews.apache.org/r/50996>>>
> > >
> > > >It is now true that even with the proposed change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code working in 1.0.0.
> > > >Even in the case of the command health checks, which are once again supported in 51560, we now get deprecation warnings, suggesting that mesos will again break us in 1.4.
> > >
> > > As you mentioned, this is a bug and we definitely should fix before release 1.1.0.
> > > I have updated r51560 <https://reviews.apache.org/r/51560 <https://reviews.apache.org/r/51560> <https://reviews.apache.org/r/51560 <https://reviews.apache.org/r/51560>>> yesterday and verify it fix the problem via r51635. As you see in
> > > the r51560 <https://reviews.apache.org/r/51560 <https://reviews.apache.org/r/51560> <https://reviews.apache.org/r/51560 <https://reviews.apache.org/r/51560>>>, we make sure the protobuf compatible again and didn't lose any
> > > fields. Would you help to double check if it fixes your problem when you free?
> > > It would be highly appreciated that if you could help to verify it.
> > >
> > > After this bug fix, we could ensure all tasks with HTTP/command health check are not when upgrading to 1.1.0.
> > >
> >
> > I see those changes now (I’m very very bad at the review board UI, so I’m sorry if it was always there and I missed it somehow).
> >
> > > 2. Should we make the `HealthCheck::type` required after v2 ?
> > >
> > > To be honest, I think 6 months should be enough and it also should be changed in
> > > v1 because it is a minor change and we didn't make it `required` in protobuf
> > > message level. We still keeping it `option` in protobuf message definition and
> > > add a check about it in Mesos code.
> > > But your concerns make sense as well, so let's see what other users/developers say to
> > > see if we could make an agreement on this.
> > >
> >
> > This is an important point. It doesn’t make sense to me that the compatibility policy is talking about only whether a protobuf field is optional or required — it seems to me that any change that takes a protocol exchange that did not result in a TASK_ERROR before, and changes it to cause a TASK_ERROR now, *is* making that protobuf field semantically required, whether or not the protobuf def says so.
> >
> > I’ll also point out that there is no definition of ‘minor’ change in the compatibility document, and therefore, whether or not a change appears to be ‘minor’ under some rubric (and I agree that this change could seem minor), if it’s part of the v1 mesos.proto, it affects downstream users (such as me, a writer of schedulers).
> >
> > > On Tue, Sep 6, 2016 at 1:18 PM, Silas Snider <swsnider@apple.com <ma...@apple.com> <mailto:swsnider@apple.com <ma...@apple.com>> <mailto:swsnider@apple.com <ma...@apple.com><mailto:swsnider@apple.com <ma...@apple.com>>>> wrote:
> > > There’s a little history to this:
> > >
> > > In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>>> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/><https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>>>>, on the 8th of August, the HTTP health check message was changed to be entirely incompatible with the previous HTTP health check message. Not only was its name changed (breaking compatibility with anyone using the feature with libmesos), but the field tags were rearranged, making it truly wire-format incompatible. This change also introduced a ‘type’ field to the HealthCheck message as an optional enum.
> > >
> > > Next, in https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/><https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>>> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>>>>, on the 13th of August, the health checking code was changed to make the new ‘type’ field mandatory — if the protobuf field is not present, the mesos master rejects your task with TASK_ERROR.
> > >
> > > A colleague of mine was testing our internal scheduler against HEAD of mesos, and discovered that any task they submitted was being rejected as TASK_ERROR, since we were setting health checks, but not sending type. I filed MESOS-6110, on the 30th of August, and haosdent huang has kindly created https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/>> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/><https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/>>> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/>> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/>>>> to try to fix this.
> > >
> > > In the course of reviewing that fix, I noticed that it only addresses the case of a command health check, and does not continue to support HTTP health checks in the way they were in 1.0.0. This is a problem for our scheduler, as we have ~always (before mesos actually added support) passed our HTTP health checks in the message, depending on our custom executor to actually perform the check. It is now true that even with the proposed change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code working in 1.0.0.
> > >
> > > Even in the case of the command health checks, which are once again supported in 51560, we now get deprecation warnings, suggesting that mesos will again break us in 1.4.
> > >
> > > It is my team’s belief that the mesos compatibility guarantee, as documented on this page: http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/> <http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/>> <http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/><http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/>>><http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/> <http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/>> <http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/> <http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/>>>> would prohibit this sort of change from occurring. Specifically, the ‘API Versioning’ section says "The API version is only bumped if we need to make a backwards incompatible API change. We will strive to support a given API version for at least a year.” and under the ‘API compatibility’ the change is considered to be breaking if it would involve "Adding new required fields to existing requests to “/scheduler”.”
> > >
> > > The proposed change does indeed add a new required field — ‘type’ to the v1 api, in the case of command health checks in 6 months, in the case of http health checks, immediately. Therefore, it seems clear that this constitutes a new ‘v2’ api, and it’s very clear that 6 months is too short, especially as another part of the 'API Versioning’ section says "The deprecation clock for vN-1 API will start as soon as we release “N.0.0” version of Mesos. […]”
> > >
> > > Please believe me, I understand the need to be able to change broken api and implementation quickly, without spending years maintaining technical debt. This is why I believe the mesos project decided to move to a model where the internal protobufs are separate from the v1/v2/etc. protobufs, and evolvers/devolvers are proposed. It seems clear that the right way of doing this is to modify the internal protobuf to look the way you’d like (better message name, clearer field order, etc.) and write an evolver from the v1 api to the internal api.
> > >
> > > Also, I think it’s important to note that the compatibility guarantees I’m citing are exactly the things that make it possible at all to write a scheduler against mesos and actually use it in production. Deciding that this case is too insignificant to really bother with the compatibility guarantees means that you’ve just pushed the tech debt issue one level higher to the scheduler writers.
> > >
> > > I’m sorry this email ended up so long, but thank you for taking some time to read it — I believe that this issue is critical to the ongoing health of the mesos project.
> > >
> > >
> > > > On Sep 5, 2016, at 11:14 AM, haosdent <haosdent@gmail.com <ma...@gmail.com> <mailto:haosdent@gmail.com <ma...@gmail.com>> <mailto:haosdent@gmail.com <ma...@gmail.com><mailto:haosdent@gmail.com <ma...@gmail.com>>>> wrote:
> > > >
> > > > Hi, folks. As I mentioned in the previous email http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>>> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1><http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>>>>.
> > > > We have added `type` in the `HealthCheck` protobuf definition in 1.1.0 and
> > > > health checks without `type` specified will be deprecated since 1.1.0.
> > > >
> > > > For backwards compatibility, we still support the command health check if the
> > > > type is not specified for now. But we plan to make `type` become a required field
> > > > and return `TASK_ERROR` if the type is not specified after 6 months. The question
> > > > is if this meets the deprecated policy since 1.0 ? If 6 months is too short and
> > > > we have to deprecate it after 2.0 ?
> > > >
> > > > Looking forward the answers. Any concerns and questions are appreciated, thanks a lot!
> > > >
> > > > --
> > > > Best Regards,
> > > > Haosdent Huang
> > >
> > >
> > >
> > >
> > > --
> > > Best Regards,
> > > Haosdent Huang
> >
> >
> 
> 
> 
> 
> -- 
> Best Regards,
> Haosdent Huang

Re: Question about the deprecated policy after 1.0

Posted by haosdent <ha...@gmail.com>.

Thx @yan's update!

>The remaining issue we'd like to discuss with the community is, of the
things in we plan to support in v1 API and hope to deprecate in Mesos 2.0,
should we slate them for deprecation before we have defined their
replacement?

IMO, if we sure a feature would be deprecated in Mesos 2.0, we should
deprecate it immediately although could not give a clear replacement at
that time.
Then users would think that feature is not recommended to use and avoid to
use it.

Otherwise, we would encounter the same problem again:
After few releases, we finish the better replacement of that old
feature. Unfortunately,
we find it has been used for a lot of users because we didn' mark it
deprecated before. Then we have to deprecate it in Mesos 3.0.
A worse case is we forget and ignore that TODO item as time flows, and left
this tech debt to other guys. This is which we should avoid now if we could.

Bases on the above two points, I think we should slate the deprecated
feature although we don't have defined their replacement.

On Thu, Sep 15, 2016 at 4:31 AM, Yan Xu <xu...@apple.com> wrote:

> To follow up on this, after discussions we (contributors and reviewers of
> MESOS-6110) have agreed to support the existing HTTP health check API in
> Mesos 1.x. (See https://reviews.apache.org/r/51803)
>
> The remaining issue we'd like to discuss with the community is, of the
> things in we plan to support in v1 API and hope to deprecate in Mesos 2.0,
> should we slate them for deprecation before we have defined their
> replacement?
>
> Two contrasting examples In /r/51803:
> `HTTPCheckInfo::type`:
> - v1 API: setting `type` is not required.
> - Proposal: Deprecate the above in 2.0, i.e., requiring it to be set.
> - This is clearly OK because it's clear to the users how to migrate to the
> new API -> just set the `type`.
>
> `HTTPCheckInfo::statuses`:
> - v1 API: This field can be set, even though not acted on by Mesos default
> executors, they can be used by custom executors.
> - Proposal: Deprecate the above in 2.0, i.e., removing it.
> - IMO we cannot slate this API for deprecation in 2.0 right now because
> there's no replacement defined yet. I think ultimately we don't want to be
> in the situation where we remove `statuses` without replacing it with
> something else. If that's the case, I don't think we need to rush to decide
> that it has to be deprecated in 2.0. What if the replacement doesn't make
> it into 2.0?
>
> Feel free to provide your feedback here or on
> https://reviews.apache.org/r/51803 where more context may help clarify
> things.
>
>
> > On Sep 6, 2016, at 12:05 PM, Yan Xu <xu...@apple.com> wrote:
> >
> > Should we think of this not as "whether this change should be subject to
> the current policy" but rather "whether this change presents an exception
> case that shouldn't be subject to the current policy"?
> >
> > We shouldn't make exceptions liberally but given that the community as a
> whole is new to the 1.0 world, we should allow the policy to evolve into
> something that works for both the developers and the frameworks
> writers/users. If we discover exceptions we should amend the policy
> document.
> >
> > The current policy on this is very clear on this (as Silas has cited):
> "The deprecation clock for vN-1 API will start as soon as we release
> “N.0.0” version of Mesos. We will strive to give enough time (e.g., 6
> months) for frameworks/operators to upgrade to vN API before we stop
> supporting vN-1 API."
> >
> > In this case it translates to "Start the deprecation for this change
> with Mesos 2.0 and thrive to not break it until 2.0.0 release date + 6
> months".
> >
> > Can we list the technical challenges that would make adhering to this
> policy not worthwhile? If this is the case, how should we change policy
> document accordingly?
> >
> > Separately from the details of this change but motivated by it, I think
> we need to improve the policy document on:
> > Clearly state what in the versioned protobufs that don't constitute as
> the official API. IMO we should give the developers the flexibility to add
> to the API before the feature is fully developed (and they are not subject
> to the policy) but in the policy doc we should point out the words to look
> for for such exceptions, e.g., "Unimplemented. DO NOT USE." or
> "Experimental. Unstable API."
> > Clearly state what to do with deprecations by developers and users.
> People will likely appreciate early heads-ups so perhaps the developer
> should be able to start warning people as soon as the change is made. e.g.,
> print a warning message that references a future release: "This API/flag
> will be deprecated in Mesos 2.0... "
> > To help folks discover deprecations we can have a live document that
> lists the deprecated features by version. Currently the CHANGELOG file only
> lists deprecation in the next release so there's not a place to put the
> deprecations for 2.0 when we are only at 1.1.0 (WIP).
> > Thoughts?
> >
> > Jiang Yan Xu 
> >
> > On Tue, Sep 6, 2016 at 6:40 AM, Silas Snider <swsnider@apple.com
> <ma...@apple.com>> wrote:
> > Responses inline
> >
> > > On Sep 6, 2016, at 1:33 AM, haosdent <haosdent@gmail.com <mailto:
> haosdent@gmail.com>> wrote:
> > >
> > > Hi, Silas. Thanks a lot to help test the health check changes recently.
> > >
> > > According to my understanding about your email, you mentioned two
> problems:
> > >
> > > 1. The bug that broken exists HTTP/command health check caused by
> r50812 <https://reviews.apache.org/r/50812 <https://reviews.apache.org/r/
> 50812>> and r50996 <https://reviews.apache.org/r/50996 <
> https://reviews.apache.org/r/50996>>
> > >
> > > >It is now true that even with the proposed change (51560), we will
> still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact
> code working in 1.0.0.
> > > >Even in the case of the command health checks, which are once again
> supported in 51560, we now get deprecation warnings, suggesting that mesos
> will again break us in 1.4.
> > >
> > > As you mentioned, this is a bug and we definitely should fix before
> release 1.1.0.
> > > I have updated r51560 <https://reviews.apache.org/r/51560 <
> https://reviews.apache.org/r/51560>> yesterday and verify it fix the
> problem via r51635. As you see in
> > > the r51560 <https://reviews.apache.org/r/51560 <
> https://reviews.apache.org/r/51560>>, we make sure the protobuf
> compatible again and didn't lose any
> > > fields. Would you help to double check if it fixes your problem when
> you free?
> > > It would be highly appreciated that if you could help to verify it.
> > >
> > > After this bug fix, we could ensure all tasks with HTTP/command health
> check are not when upgrading to 1.1.0.
> > >
> >
> > I see those changes now (I’m very very bad at the review board UI, so
> I’m sorry if it was always there and I missed it somehow).
> >
> > > 2. Should we make the `HealthCheck::type` required after v2 ?
> > >
> > > To be honest, I think 6 months should be enough and it also should be
> changed in
> > > v1 because it is a minor change and we didn't make it `required` in
> protobuf
> > > message level. We still keeping it `option` in protobuf message
> definition and
> > > add a check about it in Mesos code.
> > > But your concerns make sense as well, so let's see what other
> users/developers say to
> > > see if we could make an agreement on this.
> > >
> >
> > This is an important point. It doesn’t make sense to me that the
> compatibility policy is talking about only whether a protobuf field is
> optional or required — it seems to me that any change that takes a protocol
> exchange that did not result in a TASK_ERROR before, and changes it to
> cause a TASK_ERROR now, *is* making that protobuf field semantically
> required, whether or not the protobuf def says so.
> >
> > I’ll also point out that there is no definition of ‘minor’ change in the
> compatibility document, and therefore, whether or not a change appears to
> be ‘minor’ under some rubric (and I agree that this change could seem
> minor), if it’s part of the v1 mesos.proto, it affects downstream users
> (such as me, a writer of schedulers).
> >
> > > On Tue, Sep 6, 2016 at 1:18 PM, Silas Snider <swsnider@apple.com
> <ma...@apple.com> <mailto:swsnider@apple.com <mailto:
> swsnider@apple.com>>> wrote:
> > > There’s a little history to this:
> > >
> > > In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/
> 50812/> <https://reviews.apache.org/r/50812/ <
> https://reviews.apache.org/r/50812/>> <https://reviews.apache.org/r/50812/
> <https://reviews.apache.org/r/50812/> <https://reviews.apache.org/r/50812/
> <https://reviews.apache.org/r/50812/>>>, on the 8th of August, the HTTP
> health check message was changed to be entirely incompatible with the
> previous HTTP health check message. Not only was its name changed (breaking
> compatibility with anyone using the feature with libmesos), but the field
> tags were rearranged, making it truly wire-format incompatible. This change
> also introduced a ‘type’ field to the HealthCheck message as an optional
> enum.
> > >
> > > Next, in https://reviews.apache.org/r/50996/ <
> https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/
> <https://reviews.apache.org/r/50996/>> <https://reviews.apache.org/r/
> 50996/ <https://reviews.apache.org/r/50996/> <
> https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>>>,
> on the 13th of August, the health checking code was changed to make the new
> ‘type’ field mandatory — if the protobuf field is not present, the mesos
> master rejects your task with TASK_ERROR.
> > >
> > > A colleague of mine was testing our internal scheduler against HEAD of
> mesos, and discovered that any task they submitted was being rejected as
> TASK_ERROR, since we were setting health checks, but not sending type. I
> filed MESOS-6110, on the 30th of August, and haosdent huang has kindly
> created https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/
> 51560/> <https://reviews.apache.org/r/51560/ <
> https://reviews.apache.org/r/51560/>> <https://reviews.apache.org/r/51560/
> <https://reviews.apache.org/r/51560/> <https://reviews.apache.org/r/51560/
> <https://reviews.apache.org/r/51560/>>> to try to fix this.
> > >
> > > In the course of reviewing that fix, I noticed that it only addresses
> the case of a command health check, and does not continue to support HTTP
> health checks in the way they were in 1.0.0. This is a problem for our
> scheduler, as we have ~always (before mesos actually added support) passed
> our HTTP health checks in the message, depending on our custom executor to
> actually perform the check. It is now true that even with the proposed
> change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0,
> despite the same exact code working in 1.0.0.
> > >
> > > Even in the case of the command health checks, which are once again
> supported in 51560, we now get deprecation warnings, suggesting that mesos
> will again break us in 1.4.
> > >
> > > It is my team’s belief that the mesos compatibility guarantee, as
> documented on this page: http://mesos.apache.org/documentation/latest/
> versioning/ <http://mesos.apache.org/documentation/latest/versioning/> <
> http://mesos.apache.org/documentation/latest/versioning/ <
> http://mesos.apache.org/documentation/latest/versioning/>><http://mesos.
> apache.org/documentation/latest/versioning/ <http://mesos.apache.org/
> documentation/latest/versioning/> <http://mesos.apache.org/
> documentation/latest/versioning/ <http://mesos.apache.org/
> documentation/latest/versioning/>>> would prohibit this sort of change
> from occurring. Specifically, the ‘API Versioning’ section says "The API
> version is only bumped if we need to make a backwards incompatible API
> change. We will strive to support a given API version for at least a year.”
> and under the ‘API compatibility’ the change is considered to be breaking
> if it would involve "Adding new required fields to existing requests to
> “/scheduler”.”
> > >
> > > The proposed change does indeed add a new required field — ‘type’ to
> the v1 api, in the case of command health checks in 6 months, in the case
> of http health checks, immediately. Therefore, it seems clear that this
> constitutes a new ‘v2’ api, and it’s very clear that 6 months is too short,
> especially as another part of the 'API Versioning’ section says "The
> deprecation clock for vN-1 API will start as soon as we release “N.0.0”
> version of Mesos. […]”
> > >
> > > Please believe me, I understand the need to be able to change broken
> api and implementation quickly, without spending years maintaining
> technical debt. This is why I believe the mesos project decided to move to
> a model where the internal protobufs are separate from the v1/v2/etc.
> protobufs, and evolvers/devolvers are proposed. It seems clear that the
> right way of doing this is to modify the internal protobuf to look the way
> you’d like (better message name, clearer field order, etc.) and write an
> evolver from the v1 api to the internal api.
> > >
> > > Also, I think it’s important to note that the compatibility guarantees
> I’m citing are exactly the things that make it possible at all to write a
> scheduler against mesos and actually use it in production. Deciding that
> this case is too insignificant to really bother with the compatibility
> guarantees means that you’ve just pushed the tech debt issue one level
> higher to the scheduler writers.
> > >
> > > I’m sorry this email ended up so long, but thank you for taking some
> time to read it — I believe that this issue is critical to the ongoing
> health of the mesos project.
> > >
> > >
> > > > On Sep 5, 2016, at 11:14 AM, haosdent <haosdent@gmail.com <mailto:
> haosdent@gmail.com> <mailto:haosdent@gmail.com <ma...@gmail.com>>>
> wrote:
> > > >
> > > > Hi, folks. As I mentioned in the previous email
> http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/
> 0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <
> http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>> <http://search-hadoop.com/m/
> 0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1> <
> http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/
> 0Vlr6Ma9DWqzG3M1>>>.
> > > > We have added `type` in the `HealthCheck` protobuf definition in
> 1.1.0 and
> > > > health checks without `type` specified will be deprecated since
> 1.1.0.
> > > >
> > > > For backwards compatibility, we still support the command health
> check if the
> > > > type is not specified for now. But we plan to make `type` become a
> required field
> > > > and return `TASK_ERROR` if the type is not specified after 6 months.
> The question
> > > > is if this meets the deprecated policy since 1.0 ? If 6 months is
> too short and
> > > > we have to deprecate it after 2.0 ?
> > > >
> > > > Looking forward the answers. Any concerns and questions are
> appreciated, thanks a lot!
> > > >
> > > > --
> > > > Best Regards,
> > > > Haosdent Huang
> > >
> > >
> > >
> > >
> > > --
> > > Best Regards,
> > > Haosdent Huang
> >
> >
>
>


-- 
Best Regards,
Haosdent Huang

Re: Question about the deprecated policy after 1.0

Posted by haosdent <ha...@gmail.com>.

Thx @yan's update!

>The remaining issue we'd like to discuss with the community is, of the
things in we plan to support in v1 API and hope to deprecate in Mesos 2.0,
should we slate them for deprecation before we have defined their
replacement?

IMO, if we sure a feature would be deprecated in Mesos 2.0, we should
deprecate it immediately although could not give a clear replacement at
that time.
Then users would think that feature is not recommended to use and avoid to
use it.

Otherwise, we would encounter the same problem again:
After few releases, we finish the better replacement of that old
feature. Unfortunately,
we find it has been used for a lot of users because we didn' mark it
deprecated before. Then we have to deprecate it in Mesos 3.0.
A worse case is we forget and ignore that TODO item as time flows, and left
this tech debt to other guys. This is which we should avoid now if we could.

Bases on the above two points, I think we should slate the deprecated
feature although we don't have defined their replacement.

On Thu, Sep 15, 2016 at 4:31 AM, Yan Xu <xu...@apple.com> wrote:

> To follow up on this, after discussions we (contributors and reviewers of
> MESOS-6110) have agreed to support the existing HTTP health check API in
> Mesos 1.x. (See https://reviews.apache.org/r/51803)
>
> The remaining issue we'd like to discuss with the community is, of the
> things in we plan to support in v1 API and hope to deprecate in Mesos 2.0,
> should we slate them for deprecation before we have defined their
> replacement?
>
> Two contrasting examples In /r/51803:
> `HTTPCheckInfo::type`:
> - v1 API: setting `type` is not required.
> - Proposal: Deprecate the above in 2.0, i.e., requiring it to be set.
> - This is clearly OK because it's clear to the users how to migrate to the
> new API -> just set the `type`.
>
> `HTTPCheckInfo::statuses`:
> - v1 API: This field can be set, even though not acted on by Mesos default
> executors, they can be used by custom executors.
> - Proposal: Deprecate the above in 2.0, i.e., removing it.
> - IMO we cannot slate this API for deprecation in 2.0 right now because
> there's no replacement defined yet. I think ultimately we don't want to be
> in the situation where we remove `statuses` without replacing it with
> something else. If that's the case, I don't think we need to rush to decide
> that it has to be deprecated in 2.0. What if the replacement doesn't make
> it into 2.0?
>
> Feel free to provide your feedback here or on
> https://reviews.apache.org/r/51803 where more context may help clarify
> things.
>
>
> > On Sep 6, 2016, at 12:05 PM, Yan Xu <xu...@apple.com> wrote:
> >
> > Should we think of this not as "whether this change should be subject to
> the current policy" but rather "whether this change presents an exception
> case that shouldn't be subject to the current policy"?
> >
> > We shouldn't make exceptions liberally but given that the community as a
> whole is new to the 1.0 world, we should allow the policy to evolve into
> something that works for both the developers and the frameworks
> writers/users. If we discover exceptions we should amend the policy
> document.
> >
> > The current policy on this is very clear on this (as Silas has cited):
> "The deprecation clock for vN-1 API will start as soon as we release
> “N.0.0” version of Mesos. We will strive to give enough time (e.g., 6
> months) for frameworks/operators to upgrade to vN API before we stop
> supporting vN-1 API."
> >
> > In this case it translates to "Start the deprecation for this change
> with Mesos 2.0 and thrive to not break it until 2.0.0 release date + 6
> months".
> >
> > Can we list the technical challenges that would make adhering to this
> policy not worthwhile? If this is the case, how should we change policy
> document accordingly?
> >
> > Separately from the details of this change but motivated by it, I think
> we need to improve the policy document on:
> > Clearly state what in the versioned protobufs that don't constitute as
> the official API. IMO we should give the developers the flexibility to add
> to the API before the feature is fully developed (and they are not subject
> to the policy) but in the policy doc we should point out the words to look
> for for such exceptions, e.g., "Unimplemented. DO NOT USE." or
> "Experimental. Unstable API."
> > Clearly state what to do with deprecations by developers and users.
> People will likely appreciate early heads-ups so perhaps the developer
> should be able to start warning people as soon as the change is made. e.g.,
> print a warning message that references a future release: "This API/flag
> will be deprecated in Mesos 2.0... "
> > To help folks discover deprecations we can have a live document that
> lists the deprecated features by version. Currently the CHANGELOG file only
> lists deprecation in the next release so there's not a place to put the
> deprecations for 2.0 when we are only at 1.1.0 (WIP).
> > Thoughts?
> >
> > Jiang Yan Xu 
> >
> > On Tue, Sep 6, 2016 at 6:40 AM, Silas Snider <swsnider@apple.com
> <ma...@apple.com>> wrote:
> > Responses inline
> >
> > > On Sep 6, 2016, at 1:33 AM, haosdent <haosdent@gmail.com <mailto:
> haosdent@gmail.com>> wrote:
> > >
> > > Hi, Silas. Thanks a lot to help test the health check changes recently.
> > >
> > > According to my understanding about your email, you mentioned two
> problems:
> > >
> > > 1. The bug that broken exists HTTP/command health check caused by
> r50812 <https://reviews.apache.org/r/50812 <https://reviews.apache.org/r/
> 50812>> and r50996 <https://reviews.apache.org/r/50996 <
> https://reviews.apache.org/r/50996>>
> > >
> > > >It is now true that even with the proposed change (51560), we will
> still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact
> code working in 1.0.0.
> > > >Even in the case of the command health checks, which are once again
> supported in 51560, we now get deprecation warnings, suggesting that mesos
> will again break us in 1.4.
> > >
> > > As you mentioned, this is a bug and we definitely should fix before
> release 1.1.0.
> > > I have updated r51560 <https://reviews.apache.org/r/51560 <
> https://reviews.apache.org/r/51560>> yesterday and verify it fix the
> problem via r51635. As you see in
> > > the r51560 <https://reviews.apache.org/r/51560 <
> https://reviews.apache.org/r/51560>>, we make sure the protobuf
> compatible again and didn't lose any
> > > fields. Would you help to double check if it fixes your problem when
> you free?
> > > It would be highly appreciated that if you could help to verify it.
> > >
> > > After this bug fix, we could ensure all tasks with HTTP/command health
> check are not when upgrading to 1.1.0.
> > >
> >
> > I see those changes now (I’m very very bad at the review board UI, so
> I’m sorry if it was always there and I missed it somehow).
> >
> > > 2. Should we make the `HealthCheck::type` required after v2 ?
> > >
> > > To be honest, I think 6 months should be enough and it also should be
> changed in
> > > v1 because it is a minor change and we didn't make it `required` in
> protobuf
> > > message level. We still keeping it `option` in protobuf message
> definition and
> > > add a check about it in Mesos code.
> > > But your concerns make sense as well, so let's see what other
> users/developers say to
> > > see if we could make an agreement on this.
> > >
> >
> > This is an important point. It doesn’t make sense to me that the
> compatibility policy is talking about only whether a protobuf field is
> optional or required — it seems to me that any change that takes a protocol
> exchange that did not result in a TASK_ERROR before, and changes it to
> cause a TASK_ERROR now, *is* making that protobuf field semantically
> required, whether or not the protobuf def says so.
> >
> > I’ll also point out that there is no definition of ‘minor’ change in the
> compatibility document, and therefore, whether or not a change appears to
> be ‘minor’ under some rubric (and I agree that this change could seem
> minor), if it’s part of the v1 mesos.proto, it affects downstream users
> (such as me, a writer of schedulers).
> >
> > > On Tue, Sep 6, 2016 at 1:18 PM, Silas Snider <swsnider@apple.com
> <ma...@apple.com> <mailto:swsnider@apple.com <mailto:
> swsnider@apple.com>>> wrote:
> > > There’s a little history to this:
> > >
> > > In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/
> 50812/> <https://reviews.apache.org/r/50812/ <
> https://reviews.apache.org/r/50812/>> <https://reviews.apache.org/r/50812/
> <https://reviews.apache.org/r/50812/> <https://reviews.apache.org/r/50812/
> <https://reviews.apache.org/r/50812/>>>, on the 8th of August, the HTTP
> health check message was changed to be entirely incompatible with the
> previous HTTP health check message. Not only was its name changed (breaking
> compatibility with anyone using the feature with libmesos), but the field
> tags were rearranged, making it truly wire-format incompatible. This change
> also introduced a ‘type’ field to the HealthCheck message as an optional
> enum.
> > >
> > > Next, in https://reviews.apache.org/r/50996/ <
> https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/
> <https://reviews.apache.org/r/50996/>> <https://reviews.apache.org/r/
> 50996/ <https://reviews.apache.org/r/50996/> <
> https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>>>,
> on the 13th of August, the health checking code was changed to make the new
> ‘type’ field mandatory — if the protobuf field is not present, the mesos
> master rejects your task with TASK_ERROR.
> > >
> > > A colleague of mine was testing our internal scheduler against HEAD of
> mesos, and discovered that any task they submitted was being rejected as
> TASK_ERROR, since we were setting health checks, but not sending type. I
> filed MESOS-6110, on the 30th of August, and haosdent huang has kindly
> created https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/
> 51560/> <https://reviews.apache.org/r/51560/ <
> https://reviews.apache.org/r/51560/>> <https://reviews.apache.org/r/51560/
> <https://reviews.apache.org/r/51560/> <https://reviews.apache.org/r/51560/
> <https://reviews.apache.org/r/51560/>>> to try to fix this.
> > >
> > > In the course of reviewing that fix, I noticed that it only addresses
> the case of a command health check, and does not continue to support HTTP
> health checks in the way they were in 1.0.0. This is a problem for our
> scheduler, as we have ~always (before mesos actually added support) passed
> our HTTP health checks in the message, depending on our custom executor to
> actually perform the check. It is now true that even with the proposed
> change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0,
> despite the same exact code working in 1.0.0.
> > >
> > > Even in the case of the command health checks, which are once again
> supported in 51560, we now get deprecation warnings, suggesting that mesos
> will again break us in 1.4.
> > >
> > > It is my team’s belief that the mesos compatibility guarantee, as
> documented on this page: http://mesos.apache.org/documentation/latest/
> versioning/ <http://mesos.apache.org/documentation/latest/versioning/> <
> http://mesos.apache.org/documentation/latest/versioning/ <
> http://mesos.apache.org/documentation/latest/versioning/>><http://mesos.
> apache.org/documentation/latest/versioning/ <http://mesos.apache.org/
> documentation/latest/versioning/> <http://mesos.apache.org/
> documentation/latest/versioning/ <http://mesos.apache.org/
> documentation/latest/versioning/>>> would prohibit this sort of change
> from occurring. Specifically, the ‘API Versioning’ section says "The API
> version is only bumped if we need to make a backwards incompatible API
> change. We will strive to support a given API version for at least a year.”
> and under the ‘API compatibility’ the change is considered to be breaking
> if it would involve "Adding new required fields to existing requests to
> “/scheduler”.”
> > >
> > > The proposed change does indeed add a new required field — ‘type’ to
> the v1 api, in the case of command health checks in 6 months, in the case
> of http health checks, immediately. Therefore, it seems clear that this
> constitutes a new ‘v2’ api, and it’s very clear that 6 months is too short,
> especially as another part of the 'API Versioning’ section says "The
> deprecation clock for vN-1 API will start as soon as we release “N.0.0”
> version of Mesos. […]”
> > >
> > > Please believe me, I understand the need to be able to change broken
> api and implementation quickly, without spending years maintaining
> technical debt. This is why I believe the mesos project decided to move to
> a model where the internal protobufs are separate from the v1/v2/etc.
> protobufs, and evolvers/devolvers are proposed. It seems clear that the
> right way of doing this is to modify the internal protobuf to look the way
> you’d like (better message name, clearer field order, etc.) and write an
> evolver from the v1 api to the internal api.
> > >
> > > Also, I think it’s important to note that the compatibility guarantees
> I’m citing are exactly the things that make it possible at all to write a
> scheduler against mesos and actually use it in production. Deciding that
> this case is too insignificant to really bother with the compatibility
> guarantees means that you’ve just pushed the tech debt issue one level
> higher to the scheduler writers.
> > >
> > > I’m sorry this email ended up so long, but thank you for taking some
> time to read it — I believe that this issue is critical to the ongoing
> health of the mesos project.
> > >
> > >
> > > > On Sep 5, 2016, at 11:14 AM, haosdent <haosdent@gmail.com <mailto:
> haosdent@gmail.com> <mailto:haosdent@gmail.com <ma...@gmail.com>>>
> wrote:
> > > >
> > > > Hi, folks. As I mentioned in the previous email
> http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/
> 0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <
> http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>> <http://search-hadoop.com/m/
> 0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1> <
> http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/
> 0Vlr6Ma9DWqzG3M1>>>.
> > > > We have added `type` in the `HealthCheck` protobuf definition in
> 1.1.0 and
> > > > health checks without `type` specified will be deprecated since
> 1.1.0.
> > > >
> > > > For backwards compatibility, we still support the command health
> check if the
> > > > type is not specified for now. But we plan to make `type` become a
> required field
> > > > and return `TASK_ERROR` if the type is not specified after 6 months.
> The question
> > > > is if this meets the deprecated policy since 1.0 ? If 6 months is
> too short and
> > > > we have to deprecate it after 2.0 ?
> > > >
> > > > Looking forward the answers. Any concerns and questions are
> appreciated, thanks a lot!
> > > >
> > > > --
> > > > Best Regards,
> > > > Haosdent Huang
> > >
> > >
> > >
> > >
> > > --
> > > Best Regards,
> > > Haosdent Huang
> >
> >
>
>


-- 
Best Regards,
Haosdent Huang

Re: Question about the deprecated policy after 1.0

Posted by Yan Xu <xu...@apple.com>.

To follow up on this, after discussions we (contributors and reviewers of MESOS-6110) have agreed to support the existing HTTP health check API in Mesos 1.x. (See https://reviews.apache.org/r/51803)

The remaining issue we'd like to discuss with the community is, of the things in we plan to support in v1 API and hope to deprecate in Mesos 2.0, should we slate them for deprecation before we have defined their replacement?

Two contrasting examples In /r/51803: 
`HTTPCheckInfo::type`:
- v1 API: setting `type` is not required.
- Proposal: Deprecate the above in 2.0, i.e., requiring it to be set.
- This is clearly OK because it's clear to the users how to migrate to the new API -> just set the `type`.

`HTTPCheckInfo::statuses`:
- v1 API: This field can be set, even though not acted on by Mesos default executors, they can be used by custom executors.
- Proposal: Deprecate the above in 2.0, i.e., removing it.
- IMO we cannot slate this API for deprecation in 2.0 right now because there's no replacement defined yet. I think ultimately we don't want to be in the situation where we remove `statuses` without replacing it with something else. If that's the case, I don't think we need to rush to decide that it has to be deprecated in 2.0. What if the replacement doesn't make it into 2.0?

Feel free to provide your feedback here or on https://reviews.apache.org/r/51803 where more context may help clarify things. 


> On Sep 6, 2016, at 12:05 PM, Yan Xu <xu...@apple.com> wrote:
> 
> Should we think of this not as "whether this change should be subject to the current policy" but rather "whether this change presents an exception case that shouldn't be subject to the current policy"? 
> 
> We shouldn't make exceptions liberally but given that the community as a whole is new to the 1.0 world, we should allow the policy to evolve into something that works for both the developers and the frameworks writers/users. If we discover exceptions we should amend the policy document.
> 
> The current policy on this is very clear on this (as Silas has cited): "The deprecation clock for vN-1 API will start as soon as we release “N.0.0” version of Mesos. We will strive to give enough time (e.g., 6 months) for frameworks/operators to upgrade to vN API before we stop supporting vN-1 API." 
> 
> In this case it translates to "Start the deprecation for this change with Mesos 2.0 and thrive to not break it until 2.0.0 release date + 6 months".
> 
> Can we list the technical challenges that would make adhering to this policy not worthwhile? If this is the case, how should we change policy document accordingly?
> 
> Separately from the details of this change but motivated by it, I think we need to improve the policy document on:
> Clearly state what in the versioned protobufs that don't constitute as the official API. IMO we should give the developers the flexibility to add to the API before the feature is fully developed (and they are not subject to the policy) but in the policy doc we should point out the words to look for for such exceptions, e.g., "Unimplemented. DO NOT USE." or "Experimental. Unstable API."
> Clearly state what to do with deprecations by developers and users. People will likely appreciate early heads-ups so perhaps the developer should be able to start warning people as soon as the change is made. e.g., print a warning message that references a future release: "This API/flag will be deprecated in Mesos 2.0... "
> To help folks discover deprecations we can have a live document that lists the deprecated features by version. Currently the CHANGELOG file only lists deprecation in the next release so there's not a place to put the deprecations for 2.0 when we are only at 1.1.0 (WIP).
> Thoughts?
> 
> Jiang Yan Xu 
> 
> On Tue, Sep 6, 2016 at 6:40 AM, Silas Snider <swsnider@apple.com <ma...@apple.com>> wrote:
> Responses inline
> 
> > On Sep 6, 2016, at 1:33 AM, haosdent <haosdent@gmail.com <ma...@gmail.com>> wrote:
> >
> > Hi, Silas. Thanks a lot to help test the health check changes recently.
> >
> > According to my understanding about your email, you mentioned two problems:
> >
> > 1. The bug that broken exists HTTP/command health check caused by r50812 <https://reviews.apache.org/r/50812 <https://reviews.apache.org/r/50812>> and r50996 <https://reviews.apache.org/r/50996 <https://reviews.apache.org/r/50996>>
> >
> > >It is now true that even with the proposed change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code working in 1.0.0.
> > >Even in the case of the command health checks, which are once again supported in 51560, we now get deprecation warnings, suggesting that mesos will again break us in 1.4.
> >
> > As you mentioned, this is a bug and we definitely should fix before release 1.1.0.
> > I have updated r51560 <https://reviews.apache.org/r/51560 <https://reviews.apache.org/r/51560>> yesterday and verify it fix the problem via r51635. As you see in
> > the r51560 <https://reviews.apache.org/r/51560 <https://reviews.apache.org/r/51560>>, we make sure the protobuf compatible again and didn't lose any
> > fields. Would you help to double check if it fixes your problem when you free?
> > It would be highly appreciated that if you could help to verify it.
> >
> > After this bug fix, we could ensure all tasks with HTTP/command health check are not when upgrading to 1.1.0.
> >
> 
> I see those changes now (I’m very very bad at the review board UI, so I’m sorry if it was always there and I missed it somehow).
> 
> > 2. Should we make the `HealthCheck::type` required after v2 ?
> >
> > To be honest, I think 6 months should be enough and it also should be changed in
> > v1 because it is a minor change and we didn't make it `required` in protobuf
> > message level. We still keeping it `option` in protobuf message definition and
> > add a check about it in Mesos code.
> > But your concerns make sense as well, so let's see what other users/developers say to
> > see if we could make an agreement on this.
> >
> 
> This is an important point. It doesn’t make sense to me that the compatibility policy is talking about only whether a protobuf field is optional or required — it seems to me that any change that takes a protocol exchange that did not result in a TASK_ERROR before, and changes it to cause a TASK_ERROR now, *is* making that protobuf field semantically required, whether or not the protobuf def says so.
> 
> I’ll also point out that there is no definition of ‘minor’ change in the compatibility document, and therefore, whether or not a change appears to be ‘minor’ under some rubric (and I agree that this change could seem minor), if it’s part of the v1 mesos.proto, it affects downstream users (such as me, a writer of schedulers).
> 
> > On Tue, Sep 6, 2016 at 1:18 PM, Silas Snider <swsnider@apple.com <ma...@apple.com> <mailto:swsnider@apple.com <ma...@apple.com>>> wrote:
> > There’s a little history to this:
> >
> > In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>>>, on the 8th of August, the HTTP health check message was changed to be entirely incompatible with the previous HTTP health check message. Not only was its name changed (breaking compatibility with anyone using the feature with libmesos), but the field tags were rearranged, making it truly wire-format incompatible. This change also introduced a ‘type’ field to the HealthCheck message as an optional enum.
> >
> > Next, in https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>>>, on the 13th of August, the health checking code was changed to make the new ‘type’ field mandatory — if the protobuf field is not present, the mesos master rejects your task with TASK_ERROR.
> >
> > A colleague of mine was testing our internal scheduler against HEAD of mesos, and discovered that any task they submitted was being rejected as TASK_ERROR, since we were setting health checks, but not sending type. I filed MESOS-6110, on the 30th of August, and haosdent huang has kindly created https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/>> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/>>> to try to fix this.
> >
> > In the course of reviewing that fix, I noticed that it only addresses the case of a command health check, and does not continue to support HTTP health checks in the way they were in 1.0.0. This is a problem for our scheduler, as we have ~always (before mesos actually added support) passed our HTTP health checks in the message, depending on our custom executor to actually perform the check. It is now true that even with the proposed change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code working in 1.0.0.
> >
> > Even in the case of the command health checks, which are once again supported in 51560, we now get deprecation warnings, suggesting that mesos will again break us in 1.4.
> >
> > It is my team’s belief that the mesos compatibility guarantee, as documented on this page: http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/> <http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/>><http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/> <http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/>>> would prohibit this sort of change from occurring. Specifically, the ‘API Versioning’ section says "The API version is only bumped if we need to make a backwards incompatible API change. We will strive to support a given API version for at least a year.” and under the ‘API compatibility’ the change is considered to be breaking if it would involve "Adding new required fields to existing requests to “/scheduler”.”
> >
> > The proposed change does indeed add a new required field — ‘type’ to the v1 api, in the case of command health checks in 6 months, in the case of http health checks, immediately. Therefore, it seems clear that this constitutes a new ‘v2’ api, and it’s very clear that 6 months is too short, especially as another part of the 'API Versioning’ section says "The deprecation clock for vN-1 API will start as soon as we release “N.0.0” version of Mesos. […]”
> >
> > Please believe me, I understand the need to be able to change broken api and implementation quickly, without spending years maintaining technical debt. This is why I believe the mesos project decided to move to a model where the internal protobufs are separate from the v1/v2/etc. protobufs, and evolvers/devolvers are proposed. It seems clear that the right way of doing this is to modify the internal protobuf to look the way you’d like (better message name, clearer field order, etc.) and write an evolver from the v1 api to the internal api.
> >
> > Also, I think it’s important to note that the compatibility guarantees I’m citing are exactly the things that make it possible at all to write a scheduler against mesos and actually use it in production. Deciding that this case is too insignificant to really bother with the compatibility guarantees means that you’ve just pushed the tech debt issue one level higher to the scheduler writers.
> >
> > I’m sorry this email ended up so long, but thank you for taking some time to read it — I believe that this issue is critical to the ongoing health of the mesos project.
> >
> >
> > > On Sep 5, 2016, at 11:14 AM, haosdent <haosdent@gmail.com <ma...@gmail.com> <mailto:haosdent@gmail.com <ma...@gmail.com>>> wrote:
> > >
> > > Hi, folks. As I mentioned in the previous email http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>>>.
> > > We have added `type` in the `HealthCheck` protobuf definition in 1.1.0 and
> > > health checks without `type` specified will be deprecated since 1.1.0.
> > >
> > > For backwards compatibility, we still support the command health check if the
> > > type is not specified for now. But we plan to make `type` become a required field
> > > and return `TASK_ERROR` if the type is not specified after 6 months. The question
> > > is if this meets the deprecated policy since 1.0 ? If 6 months is too short and
> > > we have to deprecate it after 2.0 ?
> > >
> > > Looking forward the answers. Any concerns and questions are appreciated, thanks a lot!
> > >
> > > --
> > > Best Regards,
> > > Haosdent Huang
> >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
> 
>

Re: Question about the deprecated policy after 1.0

Posted by Yan Xu <xu...@apple.com>.

To follow up on this, after discussions we (contributors and reviewers of MESOS-6110) have agreed to support the existing HTTP health check API in Mesos 1.x. (See https://reviews.apache.org/r/51803)

The remaining issue we'd like to discuss with the community is, of the things in we plan to support in v1 API and hope to deprecate in Mesos 2.0, should we slate them for deprecation before we have defined their replacement?

Two contrasting examples In /r/51803: 
`HTTPCheckInfo::type`:
- v1 API: setting `type` is not required.
- Proposal: Deprecate the above in 2.0, i.e., requiring it to be set.
- This is clearly OK because it's clear to the users how to migrate to the new API -> just set the `type`.

`HTTPCheckInfo::statuses`:
- v1 API: This field can be set, even though not acted on by Mesos default executors, they can be used by custom executors.
- Proposal: Deprecate the above in 2.0, i.e., removing it.
- IMO we cannot slate this API for deprecation in 2.0 right now because there's no replacement defined yet. I think ultimately we don't want to be in the situation where we remove `statuses` without replacing it with something else. If that's the case, I don't think we need to rush to decide that it has to be deprecated in 2.0. What if the replacement doesn't make it into 2.0?

Feel free to provide your feedback here or on https://reviews.apache.org/r/51803 where more context may help clarify things. 


> On Sep 6, 2016, at 12:05 PM, Yan Xu <xu...@apple.com> wrote:
> 
> Should we think of this not as "whether this change should be subject to the current policy" but rather "whether this change presents an exception case that shouldn't be subject to the current policy"? 
> 
> We shouldn't make exceptions liberally but given that the community as a whole is new to the 1.0 world, we should allow the policy to evolve into something that works for both the developers and the frameworks writers/users. If we discover exceptions we should amend the policy document.
> 
> The current policy on this is very clear on this (as Silas has cited): "The deprecation clock for vN-1 API will start as soon as we release “N.0.0” version of Mesos. We will strive to give enough time (e.g., 6 months) for frameworks/operators to upgrade to vN API before we stop supporting vN-1 API." 
> 
> In this case it translates to "Start the deprecation for this change with Mesos 2.0 and thrive to not break it until 2.0.0 release date + 6 months".
> 
> Can we list the technical challenges that would make adhering to this policy not worthwhile? If this is the case, how should we change policy document accordingly?
> 
> Separately from the details of this change but motivated by it, I think we need to improve the policy document on:
> Clearly state what in the versioned protobufs that don't constitute as the official API. IMO we should give the developers the flexibility to add to the API before the feature is fully developed (and they are not subject to the policy) but in the policy doc we should point out the words to look for for such exceptions, e.g., "Unimplemented. DO NOT USE." or "Experimental. Unstable API."
> Clearly state what to do with deprecations by developers and users. People will likely appreciate early heads-ups so perhaps the developer should be able to start warning people as soon as the change is made. e.g., print a warning message that references a future release: "This API/flag will be deprecated in Mesos 2.0... "
> To help folks discover deprecations we can have a live document that lists the deprecated features by version. Currently the CHANGELOG file only lists deprecation in the next release so there's not a place to put the deprecations for 2.0 when we are only at 1.1.0 (WIP).
> Thoughts?
> 
> Jiang Yan Xu 
> 
> On Tue, Sep 6, 2016 at 6:40 AM, Silas Snider <swsnider@apple.com <ma...@apple.com>> wrote:
> Responses inline
> 
> > On Sep 6, 2016, at 1:33 AM, haosdent <haosdent@gmail.com <ma...@gmail.com>> wrote:
> >
> > Hi, Silas. Thanks a lot to help test the health check changes recently.
> >
> > According to my understanding about your email, you mentioned two problems:
> >
> > 1. The bug that broken exists HTTP/command health check caused by r50812 <https://reviews.apache.org/r/50812 <https://reviews.apache.org/r/50812>> and r50996 <https://reviews.apache.org/r/50996 <https://reviews.apache.org/r/50996>>
> >
> > >It is now true that even with the proposed change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code working in 1.0.0.
> > >Even in the case of the command health checks, which are once again supported in 51560, we now get deprecation warnings, suggesting that mesos will again break us in 1.4.
> >
> > As you mentioned, this is a bug and we definitely should fix before release 1.1.0.
> > I have updated r51560 <https://reviews.apache.org/r/51560 <https://reviews.apache.org/r/51560>> yesterday and verify it fix the problem via r51635. As you see in
> > the r51560 <https://reviews.apache.org/r/51560 <https://reviews.apache.org/r/51560>>, we make sure the protobuf compatible again and didn't lose any
> > fields. Would you help to double check if it fixes your problem when you free?
> > It would be highly appreciated that if you could help to verify it.
> >
> > After this bug fix, we could ensure all tasks with HTTP/command health check are not when upgrading to 1.1.0.
> >
> 
> I see those changes now (I’m very very bad at the review board UI, so I’m sorry if it was always there and I missed it somehow).
> 
> > 2. Should we make the `HealthCheck::type` required after v2 ?
> >
> > To be honest, I think 6 months should be enough and it also should be changed in
> > v1 because it is a minor change and we didn't make it `required` in protobuf
> > message level. We still keeping it `option` in protobuf message definition and
> > add a check about it in Mesos code.
> > But your concerns make sense as well, so let's see what other users/developers say to
> > see if we could make an agreement on this.
> >
> 
> This is an important point. It doesn’t make sense to me that the compatibility policy is talking about only whether a protobuf field is optional or required — it seems to me that any change that takes a protocol exchange that did not result in a TASK_ERROR before, and changes it to cause a TASK_ERROR now, *is* making that protobuf field semantically required, whether or not the protobuf def says so.
> 
> I’ll also point out that there is no definition of ‘minor’ change in the compatibility document, and therefore, whether or not a change appears to be ‘minor’ under some rubric (and I agree that this change could seem minor), if it’s part of the v1 mesos.proto, it affects downstream users (such as me, a writer of schedulers).
> 
> > On Tue, Sep 6, 2016 at 1:18 PM, Silas Snider <swsnider@apple.com <ma...@apple.com> <mailto:swsnider@apple.com <ma...@apple.com>>> wrote:
> > There’s a little history to this:
> >
> > In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>>>, on the 8th of August, the HTTP health check message was changed to be entirely incompatible with the previous HTTP health check message. Not only was its name changed (breaking compatibility with anyone using the feature with libmesos), but the field tags were rearranged, making it truly wire-format incompatible. This change also introduced a ‘type’ field to the HealthCheck message as an optional enum.
> >
> > Next, in https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>>>, on the 13th of August, the health checking code was changed to make the new ‘type’ field mandatory — if the protobuf field is not present, the mesos master rejects your task with TASK_ERROR.
> >
> > A colleague of mine was testing our internal scheduler against HEAD of mesos, and discovered that any task they submitted was being rejected as TASK_ERROR, since we were setting health checks, but not sending type. I filed MESOS-6110, on the 30th of August, and haosdent huang has kindly created https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/>> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/>>> to try to fix this.
> >
> > In the course of reviewing that fix, I noticed that it only addresses the case of a command health check, and does not continue to support HTTP health checks in the way they were in 1.0.0. This is a problem for our scheduler, as we have ~always (before mesos actually added support) passed our HTTP health checks in the message, depending on our custom executor to actually perform the check. It is now true that even with the proposed change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code working in 1.0.0.
> >
> > Even in the case of the command health checks, which are once again supported in 51560, we now get deprecation warnings, suggesting that mesos will again break us in 1.4.
> >
> > It is my team’s belief that the mesos compatibility guarantee, as documented on this page: http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/> <http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/>><http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/> <http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/>>> would prohibit this sort of change from occurring. Specifically, the ‘API Versioning’ section says "The API version is only bumped if we need to make a backwards incompatible API change. We will strive to support a given API version for at least a year.” and under the ‘API compatibility’ the change is considered to be breaking if it would involve "Adding new required fields to existing requests to “/scheduler”.”
> >
> > The proposed change does indeed add a new required field — ‘type’ to the v1 api, in the case of command health checks in 6 months, in the case of http health checks, immediately. Therefore, it seems clear that this constitutes a new ‘v2’ api, and it’s very clear that 6 months is too short, especially as another part of the 'API Versioning’ section says "The deprecation clock for vN-1 API will start as soon as we release “N.0.0” version of Mesos. […]”
> >
> > Please believe me, I understand the need to be able to change broken api and implementation quickly, without spending years maintaining technical debt. This is why I believe the mesos project decided to move to a model where the internal protobufs are separate from the v1/v2/etc. protobufs, and evolvers/devolvers are proposed. It seems clear that the right way of doing this is to modify the internal protobuf to look the way you’d like (better message name, clearer field order, etc.) and write an evolver from the v1 api to the internal api.
> >
> > Also, I think it’s important to note that the compatibility guarantees I’m citing are exactly the things that make it possible at all to write a scheduler against mesos and actually use it in production. Deciding that this case is too insignificant to really bother with the compatibility guarantees means that you’ve just pushed the tech debt issue one level higher to the scheduler writers.
> >
> > I’m sorry this email ended up so long, but thank you for taking some time to read it — I believe that this issue is critical to the ongoing health of the mesos project.
> >
> >
> > > On Sep 5, 2016, at 11:14 AM, haosdent <haosdent@gmail.com <ma...@gmail.com> <mailto:haosdent@gmail.com <ma...@gmail.com>>> wrote:
> > >
> > > Hi, folks. As I mentioned in the previous email http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>>>.
> > > We have added `type` in the `HealthCheck` protobuf definition in 1.1.0 and
> > > health checks without `type` specified will be deprecated since 1.1.0.
> > >
> > > For backwards compatibility, we still support the command health check if the
> > > type is not specified for now. But we plan to make `type` become a required field
> > > and return `TASK_ERROR` if the type is not specified after 6 months. The question
> > > is if this meets the deprecated policy since 1.0 ? If 6 months is too short and
> > > we have to deprecate it after 2.0 ?
> > >
> > > Looking forward the answers. Any concerns and questions are appreciated, thanks a lot!
> > >
> > > --
> > > Best Regards,
> > > Haosdent Huang
> >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
> 
>

Re: Question about the deprecated policy after 1.0

Posted by Yan Xu <xu...@apple.com>.

Should we think of this *not* as "whether this change should be subject to
the current policy" but rather "whether this change presents an *exception
case* that *shouldn't* be subject to the current policy"?

We shouldn't make exceptions liberally but given that the community as a
whole is new to the 1.0 world, we should allow the policy to evolve into
something that works for both the developers and the frameworks
writers/users. If we discover exceptions we should amend the policy
document.

The current policy on this is very clear on this (as Silas has cited): "The
deprecation clock for vN-1 API will start as soon as we release “N.0.0”
version of Mesos. We will strive to give enough time (e.g., 6 months) for
frameworks/operators to upgrade to vN API before we stop supporting vN-1
API."

In this case it translates to "Start the deprecation for this change with
Mesos 2.0 and thrive to not break it until 2.0.0 release date + 6 months".

Can we list the technical challenges that would make adhering to this
policy not worthwhile? If this is the case, how should we change policy
document accordingly?

Separately from the details of this change but motivated by it, I think we
need to improve the policy document on:

   - Clearly state what in the versioned protobufs that don't constitute as
   the official API. IMO we should give the developers the flexibility to add
   to the API before the feature is fully developed (and they are not subject
   to the policy) but in the policy doc we should point out the words to look
   for for such exceptions, e.g., "Unimplemented. DO NOT USE."
   or "Experimental. Unstable API."
   - Clearly state what to do with deprecations by developers and users.
   People will likely appreciate early heads-ups so perhaps the developer
   should be able to start warning people as soon as the change is made. e.g.,
   print a warning message that references a *future *release: "This
   API/flag will be deprecated in Mesos 2.0... "
   - To help folks discover deprecations we can have a live document that
   lists the deprecated features *by version*. Currently the CHANGELOG file
   only lists deprecation in the next release so there's not a place to put
   the deprecations for 2.0 when we are only at 1.1.0 (WIP).

Thoughts?

Jiang Yan Xu 

On Tue, Sep 6, 2016 at 6:40 AM, Silas Snider <sw...@apple.com> wrote:

> Responses inline
>
> > On Sep 6, 2016, at 1:33 AM, haosdent <ha...@gmail.com> wrote:
> >
> > Hi, Silas. Thanks a lot to help test the health check changes recently.
> >
> > According to my understanding about your email, you mentioned two
> problems:
> >
> > 1. The bug that broken exists HTTP/command health check caused by r50812
> <https://reviews.apache.org/r/50812> and r50996 <
> https://reviews.apache.org/r/50996>
> >
> > >It is now true that even with the proposed change (51560), we will
> still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact
> code working in 1.0.0.
> > >Even in the case of the command health checks, which are once again
> supported in 51560, we now get deprecation warnings, suggesting that mesos
> will again break us in 1.4.
> >
> > As you mentioned, this is a bug and we definitely should fix before
> release 1.1.0.
> > I have updated r51560 <https://reviews.apache.org/r/51560> yesterday
> and verify it fix the problem via r51635. As you see in
> > the r51560 <https://reviews.apache.org/r/51560>, we make sure the
> protobuf compatible again and didn't lose any
> > fields. Would you help to double check if it fixes your problem when you
> free?
> > It would be highly appreciated that if you could help to verify it.
> >
> > After this bug fix, we could ensure all tasks with HTTP/command health
> check are not when upgrading to 1.1.0.
> >
>
> I see those changes now (I’m very very bad at the review board UI, so I’m
> sorry if it was always there and I missed it somehow).
>
> > 2. Should we make the `HealthCheck::type` required after v2 ?
> >
> > To be honest, I think 6 months should be enough and it also should be
> changed in
> > v1 because it is a minor change and we didn't make it `required` in
> protobuf
> > message level. We still keeping it `option` in protobuf message
> definition and
> > add a check about it in Mesos code.
> > But your concerns make sense as well, so let's see what other
> users/developers say to
> > see if we could make an agreement on this.
> >
>
> This is an important point. It doesn’t make sense to me that the
> compatibility policy is talking about only whether a protobuf field is
> optional or required — it seems to me that any change that takes a protocol
> exchange that did not result in a TASK_ERROR before, and changes it to
> cause a TASK_ERROR now, *is* making that protobuf field semantically
> required, whether or not the protobuf def says so.
>
> I’ll also point out that there is no definition of ‘minor’ change in the
> compatibility document, and therefore, whether or not a change appears to
> be ‘minor’ under some rubric (and I agree that this change could seem
> minor), if it’s part of the v1 mesos.proto, it affects downstream users
> (such as me, a writer of schedulers).
>
> > On Tue, Sep 6, 2016 at 1:18 PM, Silas Snider <swsnider@apple.com
> <ma...@apple.com>> wrote:
> > There’s a little history to this:
> >
> > In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/
> 50812/> <https://reviews.apache.org/r/50812/ <
> https://reviews.apache.org/r/50812/>>, on the 8th of August, the HTTP
> health check message was changed to be entirely incompatible with the
> previous HTTP health check message. Not only was its name changed (breaking
> compatibility with anyone using the feature with libmesos), but the field
> tags were rearranged, making it truly wire-format incompatible. This change
> also introduced a ‘type’ field to the HealthCheck message as an optional
> enum.
> >
> > Next, in https://reviews.apache.org/r/50996/ <
> https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/
> <https://reviews.apache.org/r/50996/>>, on the 13th of August, the health
> checking code was changed to make the new ‘type’ field mandatory — if the
> protobuf field is not present, the mesos master rejects your task with
> TASK_ERROR.
> >
> > A colleague of mine was testing our internal scheduler against HEAD of
> mesos, and discovered that any task they submitted was being rejected as
> TASK_ERROR, since we were setting health checks, but not sending type. I
> filed MESOS-6110, on the 30th of August, and haosdent huang has kindly
> created https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/
> 51560/> <https://reviews.apache.org/r/51560/ <
> https://reviews.apache.org/r/51560/>> to try to fix this.
> >
> > In the course of reviewing that fix, I noticed that it only addresses
> the case of a command health check, and does not continue to support HTTP
> health checks in the way they were in 1.0.0. This is a problem for our
> scheduler, as we have ~always (before mesos actually added support) passed
> our HTTP health checks in the message, depending on our custom executor to
> actually perform the check. It is now true that even with the proposed
> change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0,
> despite the same exact code working in 1.0.0.
> >
> > Even in the case of the command health checks, which are once again
> supported in 51560, we now get deprecation warnings, suggesting that mesos
> will again break us in 1.4.
> >
> > It is my team’s belief that the mesos compatibility guarantee, as
> documented on this page: http://mesos.apache.org/documentation/latest/
> versioning/ <http://mesos.apache.org/documentation/latest/versioning/><
> http://mesos.apache.org/documentation/latest/versioning/ <
> http://mesos.apache.org/documentation/latest/versioning/>> would prohibit
> this sort of change from occurring. Specifically, the ‘API Versioning’
> section says "The API version is only bumped if we need to make a backwards
> incompatible API change. We will strive to support a given API version for
> at least a year.” and under the ‘API compatibility’ the change is
> considered to be breaking if it would involve "Adding new required fields
> to existing requests to “/scheduler”.”
> >
> > The proposed change does indeed add a new required field — ‘type’ to the
> v1 api, in the case of command health checks in 6 months, in the case of
> http health checks, immediately. Therefore, it seems clear that this
> constitutes a new ‘v2’ api, and it’s very clear that 6 months is too short,
> especially as another part of the 'API Versioning’ section says "The
> deprecation clock for vN-1 API will start as soon as we release “N.0.0”
> version of Mesos. […]”
> >
> > Please believe me, I understand the need to be able to change broken api
> and implementation quickly, without spending years maintaining technical
> debt. This is why I believe the mesos project decided to move to a model
> where the internal protobufs are separate from the v1/v2/etc. protobufs,
> and evolvers/devolvers are proposed. It seems clear that the right way of
> doing this is to modify the internal protobuf to look the way you’d like
> (better message name, clearer field order, etc.) and write an evolver from
> the v1 api to the internal api.
> >
> > Also, I think it’s important to note that the compatibility guarantees
> I’m citing are exactly the things that make it possible at all to write a
> scheduler against mesos and actually use it in production. Deciding that
> this case is too insignificant to really bother with the compatibility
> guarantees means that you’ve just pushed the tech debt issue one level
> higher to the scheduler writers.
> >
> > I’m sorry this email ended up so long, but thank you for taking some
> time to read it — I believe that this issue is critical to the ongoing
> health of the mesos project.
> >
> >
> > > On Sep 5, 2016, at 11:14 AM, haosdent <haosdent@gmail.com <mailto:
> haosdent@gmail.com>> wrote:
> > >
> > > Hi, folks. As I mentioned in the previous email
> http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/
> 0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <
> http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>>.
> > > We have added `type` in the `HealthCheck` protobuf definition in 1.1.0
> and
> > > health checks without `type` specified will be deprecated since 1.1.0.
> > >
> > > For backwards compatibility, we still support the command health check
> if the
> > > type is not specified for now. But we plan to make `type` become a
> required field
> > > and return `TASK_ERROR` if the type is not specified after 6 months.
> The question
> > > is if this meets the deprecated policy since 1.0 ? If 6 months is too
> short and
> > > we have to deprecate it after 2.0 ?
> > >
> > > Looking forward the answers. Any concerns and questions are
> appreciated, thanks a lot!
> > >
> > > --
> > > Best Regards,
> > > Haosdent Huang
> >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
>
>

Re: Question about the deprecated policy after 1.0

Posted by Yan Xu <xu...@apple.com>.

Should we think of this *not* as "whether this change should be subject to
the current policy" but rather "whether this change presents an *exception
case* that *shouldn't* be subject to the current policy"?

We shouldn't make exceptions liberally but given that the community as a
whole is new to the 1.0 world, we should allow the policy to evolve into
something that works for both the developers and the frameworks
writers/users. If we discover exceptions we should amend the policy
document.

The current policy on this is very clear on this (as Silas has cited): "The
deprecation clock for vN-1 API will start as soon as we release “N.0.0”
version of Mesos. We will strive to give enough time (e.g., 6 months) for
frameworks/operators to upgrade to vN API before we stop supporting vN-1
API."

In this case it translates to "Start the deprecation for this change with
Mesos 2.0 and thrive to not break it until 2.0.0 release date + 6 months".

Can we list the technical challenges that would make adhering to this
policy not worthwhile? If this is the case, how should we change policy
document accordingly?

Separately from the details of this change but motivated by it, I think we
need to improve the policy document on:

   - Clearly state what in the versioned protobufs that don't constitute as
   the official API. IMO we should give the developers the flexibility to add
   to the API before the feature is fully developed (and they are not subject
   to the policy) but in the policy doc we should point out the words to look
   for for such exceptions, e.g., "Unimplemented. DO NOT USE."
   or "Experimental. Unstable API."
   - Clearly state what to do with deprecations by developers and users.
   People will likely appreciate early heads-ups so perhaps the developer
   should be able to start warning people as soon as the change is made. e.g.,
   print a warning message that references a *future *release: "This
   API/flag will be deprecated in Mesos 2.0... "
   - To help folks discover deprecations we can have a live document that
   lists the deprecated features *by version*. Currently the CHANGELOG file
   only lists deprecation in the next release so there's not a place to put
   the deprecations for 2.0 when we are only at 1.1.0 (WIP).

Thoughts?

Jiang Yan Xu 

On Tue, Sep 6, 2016 at 6:40 AM, Silas Snider <sw...@apple.com> wrote:

> Responses inline
>
> > On Sep 6, 2016, at 1:33 AM, haosdent <ha...@gmail.com> wrote:
> >
> > Hi, Silas. Thanks a lot to help test the health check changes recently.
> >
> > According to my understanding about your email, you mentioned two
> problems:
> >
> > 1. The bug that broken exists HTTP/command health check caused by r50812
> <https://reviews.apache.org/r/50812> and r50996 <
> https://reviews.apache.org/r/50996>
> >
> > >It is now true that even with the proposed change (51560), we will
> still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact
> code working in 1.0.0.
> > >Even in the case of the command health checks, which are once again
> supported in 51560, we now get deprecation warnings, suggesting that mesos
> will again break us in 1.4.
> >
> > As you mentioned, this is a bug and we definitely should fix before
> release 1.1.0.
> > I have updated r51560 <https://reviews.apache.org/r/51560> yesterday
> and verify it fix the problem via r51635. As you see in
> > the r51560 <https://reviews.apache.org/r/51560>, we make sure the
> protobuf compatible again and didn't lose any
> > fields. Would you help to double check if it fixes your problem when you
> free?
> > It would be highly appreciated that if you could help to verify it.
> >
> > After this bug fix, we could ensure all tasks with HTTP/command health
> check are not when upgrading to 1.1.0.
> >
>
> I see those changes now (I’m very very bad at the review board UI, so I’m
> sorry if it was always there and I missed it somehow).
>
> > 2. Should we make the `HealthCheck::type` required after v2 ?
> >
> > To be honest, I think 6 months should be enough and it also should be
> changed in
> > v1 because it is a minor change and we didn't make it `required` in
> protobuf
> > message level. We still keeping it `option` in protobuf message
> definition and
> > add a check about it in Mesos code.
> > But your concerns make sense as well, so let's see what other
> users/developers say to
> > see if we could make an agreement on this.
> >
>
> This is an important point. It doesn’t make sense to me that the
> compatibility policy is talking about only whether a protobuf field is
> optional or required — it seems to me that any change that takes a protocol
> exchange that did not result in a TASK_ERROR before, and changes it to
> cause a TASK_ERROR now, *is* making that protobuf field semantically
> required, whether or not the protobuf def says so.
>
> I’ll also point out that there is no definition of ‘minor’ change in the
> compatibility document, and therefore, whether or not a change appears to
> be ‘minor’ under some rubric (and I agree that this change could seem
> minor), if it’s part of the v1 mesos.proto, it affects downstream users
> (such as me, a writer of schedulers).
>
> > On Tue, Sep 6, 2016 at 1:18 PM, Silas Snider <swsnider@apple.com
> <ma...@apple.com>> wrote:
> > There’s a little history to this:
> >
> > In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/
> 50812/> <https://reviews.apache.org/r/50812/ <
> https://reviews.apache.org/r/50812/>>, on the 8th of August, the HTTP
> health check message was changed to be entirely incompatible with the
> previous HTTP health check message. Not only was its name changed (breaking
> compatibility with anyone using the feature with libmesos), but the field
> tags were rearranged, making it truly wire-format incompatible. This change
> also introduced a ‘type’ field to the HealthCheck message as an optional
> enum.
> >
> > Next, in https://reviews.apache.org/r/50996/ <
> https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/
> <https://reviews.apache.org/r/50996/>>, on the 13th of August, the health
> checking code was changed to make the new ‘type’ field mandatory — if the
> protobuf field is not present, the mesos master rejects your task with
> TASK_ERROR.
> >
> > A colleague of mine was testing our internal scheduler against HEAD of
> mesos, and discovered that any task they submitted was being rejected as
> TASK_ERROR, since we were setting health checks, but not sending type. I
> filed MESOS-6110, on the 30th of August, and haosdent huang has kindly
> created https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/
> 51560/> <https://reviews.apache.org/r/51560/ <
> https://reviews.apache.org/r/51560/>> to try to fix this.
> >
> > In the course of reviewing that fix, I noticed that it only addresses
> the case of a command health check, and does not continue to support HTTP
> health checks in the way they were in 1.0.0. This is a problem for our
> scheduler, as we have ~always (before mesos actually added support) passed
> our HTTP health checks in the message, depending on our custom executor to
> actually perform the check. It is now true that even with the proposed
> change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0,
> despite the same exact code working in 1.0.0.
> >
> > Even in the case of the command health checks, which are once again
> supported in 51560, we now get deprecation warnings, suggesting that mesos
> will again break us in 1.4.
> >
> > It is my team’s belief that the mesos compatibility guarantee, as
> documented on this page: http://mesos.apache.org/documentation/latest/
> versioning/ <http://mesos.apache.org/documentation/latest/versioning/><
> http://mesos.apache.org/documentation/latest/versioning/ <
> http://mesos.apache.org/documentation/latest/versioning/>> would prohibit
> this sort of change from occurring. Specifically, the ‘API Versioning’
> section says "The API version is only bumped if we need to make a backwards
> incompatible API change. We will strive to support a given API version for
> at least a year.” and under the ‘API compatibility’ the change is
> considered to be breaking if it would involve "Adding new required fields
> to existing requests to “/scheduler”.”
> >
> > The proposed change does indeed add a new required field — ‘type’ to the
> v1 api, in the case of command health checks in 6 months, in the case of
> http health checks, immediately. Therefore, it seems clear that this
> constitutes a new ‘v2’ api, and it’s very clear that 6 months is too short,
> especially as another part of the 'API Versioning’ section says "The
> deprecation clock for vN-1 API will start as soon as we release “N.0.0”
> version of Mesos. […]”
> >
> > Please believe me, I understand the need to be able to change broken api
> and implementation quickly, without spending years maintaining technical
> debt. This is why I believe the mesos project decided to move to a model
> where the internal protobufs are separate from the v1/v2/etc. protobufs,
> and evolvers/devolvers are proposed. It seems clear that the right way of
> doing this is to modify the internal protobuf to look the way you’d like
> (better message name, clearer field order, etc.) and write an evolver from
> the v1 api to the internal api.
> >
> > Also, I think it’s important to note that the compatibility guarantees
> I’m citing are exactly the things that make it possible at all to write a
> scheduler against mesos and actually use it in production. Deciding that
> this case is too insignificant to really bother with the compatibility
> guarantees means that you’ve just pushed the tech debt issue one level
> higher to the scheduler writers.
> >
> > I’m sorry this email ended up so long, but thank you for taking some
> time to read it — I believe that this issue is critical to the ongoing
> health of the mesos project.
> >
> >
> > > On Sep 5, 2016, at 11:14 AM, haosdent <haosdent@gmail.com <mailto:
> haosdent@gmail.com>> wrote:
> > >
> > > Hi, folks. As I mentioned in the previous email
> http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/
> 0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <
> http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>>.
> > > We have added `type` in the `HealthCheck` protobuf definition in 1.1.0
> and
> > > health checks without `type` specified will be deprecated since 1.1.0.
> > >
> > > For backwards compatibility, we still support the command health check
> if the
> > > type is not specified for now. But we plan to make `type` become a
> required field
> > > and return `TASK_ERROR` if the type is not specified after 6 months.
> The question
> > > is if this meets the deprecated policy since 1.0 ? If 6 months is too
> short and
> > > we have to deprecate it after 2.0 ?
> > >
> > > Looking forward the answers. Any concerns and questions are
> appreciated, thanks a lot!
> > >
> > > --
> > > Best Regards,
> > > Haosdent Huang
> >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
>
>

Re: Question about the deprecated policy after 1.0

Posted by Silas Snider <sw...@apple.com>.

Responses inline

> On Sep 6, 2016, at 1:33 AM, haosdent <ha...@gmail.com> wrote:
> 
> Hi, Silas. Thanks a lot to help test the health check changes recently.
> 
> According to my understanding about your email, you mentioned two problems:
> 
> 1. The bug that broken exists HTTP/command health check caused by r50812 <https://reviews.apache.org/r/50812> and r50996 <https://reviews.apache.org/r/50996>
> 
> >It is now true that even with the proposed change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code working in 1.0.0.
> >Even in the case of the command health checks, which are once again supported in 51560, we now get deprecation warnings, suggesting that mesos will again break us in 1.4.
> 
> As you mentioned, this is a bug and we definitely should fix before release 1.1.0.
> I have updated r51560 <https://reviews.apache.org/r/51560> yesterday and verify it fix the problem via r51635. As you see in
> the r51560 <https://reviews.apache.org/r/51560>, we make sure the protobuf compatible again and didn't lose any
> fields. Would you help to double check if it fixes your problem when you free?
> It would be highly appreciated that if you could help to verify it.
> 
> After this bug fix, we could ensure all tasks with HTTP/command health check are not when upgrading to 1.1.0.
> 

I see those changes now (I’m very very bad at the review board UI, so I’m sorry if it was always there and I missed it somehow).

> 2. Should we make the `HealthCheck::type` required after v2 ?
> 
> To be honest, I think 6 months should be enough and it also should be changed in
> v1 because it is a minor change and we didn't make it `required` in protobuf 
> message level. We still keeping it `option` in protobuf message definition and
> add a check about it in Mesos code.
> But your concerns make sense as well, so let's see what other users/developers say to
> see if we could make an agreement on this. 
> 

This is an important point. It doesn’t make sense to me that the compatibility policy is talking about only whether a protobuf field is optional or required — it seems to me that any change that takes a protocol exchange that did not result in a TASK_ERROR before, and changes it to cause a TASK_ERROR now, *is* making that protobuf field semantically required, whether or not the protobuf def says so.

I’ll also point out that there is no definition of ‘minor’ change in the compatibility document, and therefore, whether or not a change appears to be ‘minor’ under some rubric (and I agree that this change could seem minor), if it’s part of the v1 mesos.proto, it affects downstream users (such as me, a writer of schedulers).

> On Tue, Sep 6, 2016 at 1:18 PM, Silas Snider <swsnider@apple.com <ma...@apple.com>> wrote:
> There’s a little history to this:
> 
> In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>>, on the 8th of August, the HTTP health check message was changed to be entirely incompatible with the previous HTTP health check message. Not only was its name changed (breaking compatibility with anyone using the feature with libmesos), but the field tags were rearranged, making it truly wire-format incompatible. This change also introduced a ‘type’ field to the HealthCheck message as an optional enum.
> 
> Next, in https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>>, on the 13th of August, the health checking code was changed to make the new ‘type’ field mandatory — if the protobuf field is not present, the mesos master rejects your task with TASK_ERROR.
> 
> A colleague of mine was testing our internal scheduler against HEAD of mesos, and discovered that any task they submitted was being rejected as TASK_ERROR, since we were setting health checks, but not sending type. I filed MESOS-6110, on the 30th of August, and haosdent huang has kindly created https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/>> to try to fix this.
> 
> In the course of reviewing that fix, I noticed that it only addresses the case of a command health check, and does not continue to support HTTP health checks in the way they were in 1.0.0. This is a problem for our scheduler, as we have ~always (before mesos actually added support) passed our HTTP health checks in the message, depending on our custom executor to actually perform the check. It is now true that even with the proposed change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code working in 1.0.0.
> 
> Even in the case of the command health checks, which are once again supported in 51560, we now get deprecation warnings, suggesting that mesos will again break us in 1.4.
> 
> It is my team’s belief that the mesos compatibility guarantee, as documented on this page: http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/><http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/>> would prohibit this sort of change from occurring. Specifically, the ‘API Versioning’ section says "The API version is only bumped if we need to make a backwards incompatible API change. We will strive to support a given API version for at least a year.” and under the ‘API compatibility’ the change is considered to be breaking if it would involve "Adding new required fields to existing requests to “/scheduler”.”
> 
> The proposed change does indeed add a new required field — ‘type’ to the v1 api, in the case of command health checks in 6 months, in the case of http health checks, immediately. Therefore, it seems clear that this constitutes a new ‘v2’ api, and it’s very clear that 6 months is too short, especially as another part of the 'API Versioning’ section says "The deprecation clock for vN-1 API will start as soon as we release “N.0.0” version of Mesos. […]”
> 
> Please believe me, I understand the need to be able to change broken api and implementation quickly, without spending years maintaining technical debt. This is why I believe the mesos project decided to move to a model where the internal protobufs are separate from the v1/v2/etc. protobufs, and evolvers/devolvers are proposed. It seems clear that the right way of doing this is to modify the internal protobuf to look the way you’d like (better message name, clearer field order, etc.) and write an evolver from the v1 api to the internal api.
> 
> Also, I think it’s important to note that the compatibility guarantees I’m citing are exactly the things that make it possible at all to write a scheduler against mesos and actually use it in production. Deciding that this case is too insignificant to really bother with the compatibility guarantees means that you’ve just pushed the tech debt issue one level higher to the scheduler writers.
> 
> I’m sorry this email ended up so long, but thank you for taking some time to read it — I believe that this issue is critical to the ongoing health of the mesos project.
> 
> 
> > On Sep 5, 2016, at 11:14 AM, haosdent <haosdent@gmail.com <ma...@gmail.com>> wrote:
> >
> > Hi, folks. As I mentioned in the previous email http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>>.
> > We have added `type` in the `HealthCheck` protobuf definition in 1.1.0 and
> > health checks without `type` specified will be deprecated since 1.1.0.
> >
> > For backwards compatibility, we still support the command health check if the
> > type is not specified for now. But we plan to make `type` become a required field
> > and return `TASK_ERROR` if the type is not specified after 6 months. The question
> > is if this meets the deprecated policy since 1.0 ? If 6 months is too short and
> > we have to deprecate it after 2.0 ?
> >
> > Looking forward the answers. Any concerns and questions are appreciated, thanks a lot!
> >
> > --
> > Best Regards,
> > Haosdent Huang
> 
> 
> 
> 
> -- 
> Best Regards,
> Haosdent Huang

Re: Question about the deprecated policy after 1.0

Posted by Silas Snider <sw...@apple.com>.

Responses inline

> On Sep 6, 2016, at 1:33 AM, haosdent <ha...@gmail.com> wrote:
> 
> Hi, Silas. Thanks a lot to help test the health check changes recently.
> 
> According to my understanding about your email, you mentioned two problems:
> 
> 1. The bug that broken exists HTTP/command health check caused by r50812 <https://reviews.apache.org/r/50812> and r50996 <https://reviews.apache.org/r/50996>
> 
> >It is now true that even with the proposed change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code working in 1.0.0.
> >Even in the case of the command health checks, which are once again supported in 51560, we now get deprecation warnings, suggesting that mesos will again break us in 1.4.
> 
> As you mentioned, this is a bug and we definitely should fix before release 1.1.0.
> I have updated r51560 <https://reviews.apache.org/r/51560> yesterday and verify it fix the problem via r51635. As you see in
> the r51560 <https://reviews.apache.org/r/51560>, we make sure the protobuf compatible again and didn't lose any
> fields. Would you help to double check if it fixes your problem when you free?
> It would be highly appreciated that if you could help to verify it.
> 
> After this bug fix, we could ensure all tasks with HTTP/command health check are not when upgrading to 1.1.0.
> 

I see those changes now (I’m very very bad at the review board UI, so I’m sorry if it was always there and I missed it somehow).

> 2. Should we make the `HealthCheck::type` required after v2 ?
> 
> To be honest, I think 6 months should be enough and it also should be changed in
> v1 because it is a minor change and we didn't make it `required` in protobuf 
> message level. We still keeping it `option` in protobuf message definition and
> add a check about it in Mesos code.
> But your concerns make sense as well, so let's see what other users/developers say to
> see if we could make an agreement on this. 
> 

This is an important point. It doesn’t make sense to me that the compatibility policy is talking about only whether a protobuf field is optional or required — it seems to me that any change that takes a protocol exchange that did not result in a TASK_ERROR before, and changes it to cause a TASK_ERROR now, *is* making that protobuf field semantically required, whether or not the protobuf def says so.

I’ll also point out that there is no definition of ‘minor’ change in the compatibility document, and therefore, whether or not a change appears to be ‘minor’ under some rubric (and I agree that this change could seem minor), if it’s part of the v1 mesos.proto, it affects downstream users (such as me, a writer of schedulers).

> On Tue, Sep 6, 2016 at 1:18 PM, Silas Snider <swsnider@apple.com <ma...@apple.com>> wrote:
> There’s a little history to this:
> 
> In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>>, on the 8th of August, the HTTP health check message was changed to be entirely incompatible with the previous HTTP health check message. Not only was its name changed (breaking compatibility with anyone using the feature with libmesos), but the field tags were rearranged, making it truly wire-format incompatible. This change also introduced a ‘type’ field to the HealthCheck message as an optional enum.
> 
> Next, in https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>>, on the 13th of August, the health checking code was changed to make the new ‘type’ field mandatory — if the protobuf field is not present, the mesos master rejects your task with TASK_ERROR.
> 
> A colleague of mine was testing our internal scheduler against HEAD of mesos, and discovered that any task they submitted was being rejected as TASK_ERROR, since we were setting health checks, but not sending type. I filed MESOS-6110, on the 30th of August, and haosdent huang has kindly created https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/>> to try to fix this.
> 
> In the course of reviewing that fix, I noticed that it only addresses the case of a command health check, and does not continue to support HTTP health checks in the way they were in 1.0.0. This is a problem for our scheduler, as we have ~always (before mesos actually added support) passed our HTTP health checks in the message, depending on our custom executor to actually perform the check. It is now true that even with the proposed change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code working in 1.0.0.
> 
> Even in the case of the command health checks, which are once again supported in 51560, we now get deprecation warnings, suggesting that mesos will again break us in 1.4.
> 
> It is my team’s belief that the mesos compatibility guarantee, as documented on this page: http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/><http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/>> would prohibit this sort of change from occurring. Specifically, the ‘API Versioning’ section says "The API version is only bumped if we need to make a backwards incompatible API change. We will strive to support a given API version for at least a year.” and under the ‘API compatibility’ the change is considered to be breaking if it would involve "Adding new required fields to existing requests to “/scheduler”.”
> 
> The proposed change does indeed add a new required field — ‘type’ to the v1 api, in the case of command health checks in 6 months, in the case of http health checks, immediately. Therefore, it seems clear that this constitutes a new ‘v2’ api, and it’s very clear that 6 months is too short, especially as another part of the 'API Versioning’ section says "The deprecation clock for vN-1 API will start as soon as we release “N.0.0” version of Mesos. […]”
> 
> Please believe me, I understand the need to be able to change broken api and implementation quickly, without spending years maintaining technical debt. This is why I believe the mesos project decided to move to a model where the internal protobufs are separate from the v1/v2/etc. protobufs, and evolvers/devolvers are proposed. It seems clear that the right way of doing this is to modify the internal protobuf to look the way you’d like (better message name, clearer field order, etc.) and write an evolver from the v1 api to the internal api.
> 
> Also, I think it’s important to note that the compatibility guarantees I’m citing are exactly the things that make it possible at all to write a scheduler against mesos and actually use it in production. Deciding that this case is too insignificant to really bother with the compatibility guarantees means that you’ve just pushed the tech debt issue one level higher to the scheduler writers.
> 
> I’m sorry this email ended up so long, but thank you for taking some time to read it — I believe that this issue is critical to the ongoing health of the mesos project.
> 
> 
> > On Sep 5, 2016, at 11:14 AM, haosdent <haosdent@gmail.com <ma...@gmail.com>> wrote:
> >
> > Hi, folks. As I mentioned in the previous email http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>>.
> > We have added `type` in the `HealthCheck` protobuf definition in 1.1.0 and
> > health checks without `type` specified will be deprecated since 1.1.0.
> >
> > For backwards compatibility, we still support the command health check if the
> > type is not specified for now. But we plan to make `type` become a required field
> > and return `TASK_ERROR` if the type is not specified after 6 months. The question
> > is if this meets the deprecated policy since 1.0 ? If 6 months is too short and
> > we have to deprecate it after 2.0 ?
> >
> > Looking forward the answers. Any concerns and questions are appreciated, thanks a lot!
> >
> > --
> > Best Regards,
> > Haosdent Huang
> 
> 
> 
> 
> -- 
> Best Regards,
> Haosdent Huang

Re: Question about the deprecated policy after 1.0

Posted by haosdent <ha...@gmail.com>.

Hi, Silas. Thanks a lot to help test the health check changes recently.

According to my understanding about your email, you mentioned two problems:

1. The bug that broken exists HTTP/command health check caused by r50812
<https://reviews.apache.org/r/50812> and r50996
<https://reviews.apache.org/r/50996>

>It is now true that even with the proposed change (51560), we will still
get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code
working in 1.0.0.
>Even in the case of the command health checks, which are once again
supported in 51560, we now get deprecation warnings, suggesting that mesos
will again break us in 1.4.

As you mentioned, this is a bug and we definitely should fix before release
1.1.0.
I have updated r51560 <https://reviews.apache.org/r/51560> yesterday and
verify it fix the problem via r51635. As you see in
the r51560 <https://reviews.apache.org/r/51560>, we make sure the protobuf
compatible again and didn't lose any
fields. Would you help to double check if it fixes your problem when you
free?
It would be highly appreciated that if you could help to verify it.

After this bug fix, we could ensure all tasks with HTTP/command health
check are not when upgrading to 1.1.0.

2. Should we make the `HealthCheck::type` required after v2 ?

To be honest, I think 6 months should be enough and it also should be
changed in
v1 because it is a minor change and we didn't make it `required` in
protobuf
message level. We still keeping it `option` in protobuf message definition
and
add a check about it in Mesos code.
But your concerns make sense as well, so let's see what other
users/developers say to
see if we could make an agreement on this.

On Tue, Sep 6, 2016 at 1:18 PM, Silas Snider <sw...@apple.com> wrote:

> There’s a little history to this:
>
> In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/
> 50812/>, on the 8th of August, the HTTP health check message was changed
> to be entirely incompatible with the previous HTTP health check message.
> Not only was its name changed (breaking compatibility with anyone using the
> feature with libmesos), but the field tags were rearranged, making it truly
> wire-format incompatible. This change also introduced a ‘type’ field to the
> HealthCheck message as an optional enum.
>
> Next, in https://reviews.apache.org/r/50996/ <
> https://reviews.apache.org/r/50996/>, on the 13th of August, the health
> checking code was changed to make the new ‘type’ field mandatory — if the
> protobuf field is not present, the mesos master rejects your task with
> TASK_ERROR.
>
> A colleague of mine was testing our internal scheduler against HEAD of
> mesos, and discovered that any task they submitted was being rejected as
> TASK_ERROR, since we were setting health checks, but not sending type. I
> filed MESOS-6110, on the 30th of August, and haosdent huang has kindly
> created https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/
> 51560/> to try to fix this.
>
> In the course of reviewing that fix, I noticed that it only addresses the
> case of a command health check, and does not continue to support HTTP
> health checks in the way they were in 1.0.0. This is a problem for our
> scheduler, as we have ~always (before mesos actually added support) passed
> our HTTP health checks in the message, depending on our custom executor to
> actually perform the check. It is now true that even with the proposed
> change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0,
> despite the same exact code working in 1.0.0.
>
> Even in the case of the command health checks, which are once again
> supported in 51560, we now get deprecation warnings, suggesting that mesos
> will again break us in 1.4.
>
> It is my team’s belief that the mesos compatibility guarantee, as
> documented on this page: http://mesos.apache.org/documentation/latest/
> versioning/ <http://mesos.apache.org/documentation/latest/versioning/>
> would prohibit this sort of change from occurring. Specifically, the ‘API
> Versioning’ section says "The API version is only bumped if we need to make
> a backwards incompatible API change. We will strive to support a given API
> version for at least a year.” and under the ‘API compatibility’ the change
> is considered to be breaking if it would involve "Adding new required
> fields to existing requests to “/scheduler”.”
>
> The proposed change does indeed add a new required field — ‘type’ to the
> v1 api, in the case of command health checks in 6 months, in the case of
> http health checks, immediately. Therefore, it seems clear that this
> constitutes a new ‘v2’ api, and it’s very clear that 6 months is too short,
> especially as another part of the 'API Versioning’ section says "The
> deprecation clock for vN-1 API will start as soon as we release “N.0.0”
> version of Mesos. […]”
>
> Please believe me, I understand the need to be able to change broken api
> and implementation quickly, without spending years maintaining technical
> debt. This is why I believe the mesos project decided to move to a model
> where the internal protobufs are separate from the v1/v2/etc. protobufs,
> and evolvers/devolvers are proposed. It seems clear that the right way of
> doing this is to modify the internal protobuf to look the way you’d like
> (better message name, clearer field order, etc.) and write an evolver from
> the v1 api to the internal api.
>
> Also, I think it’s important to note that the compatibility guarantees I’m
> citing are exactly the things that make it possible at all to write a
> scheduler against mesos and actually use it in production. Deciding that
> this case is too insignificant to really bother with the compatibility
> guarantees means that you’ve just pushed the tech debt issue one level
> higher to the scheduler writers.
>
> I’m sorry this email ended up so long, but thank you for taking some time
> to read it — I believe that this issue is critical to the ongoing health of
> the mesos project.
>
>
> > On Sep 5, 2016, at 11:14 AM, haosdent <ha...@gmail.com> wrote:
> >
> > Hi, folks. As I mentioned in the previous email
> http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/
> 0Vlr6Ma9DWqzG3M1>.
> > We have added `type` in the `HealthCheck` protobuf definition in 1.1.0
> and
> > health checks without `type` specified will be deprecated since 1.1.0.
> >
> > For backwards compatibility, we still support the command health check
> if the
> > type is not specified for now. But we plan to make `type` become a
> required field
> > and return `TASK_ERROR` if the type is not specified after 6 months. The
> question
> > is if this meets the deprecated policy since 1.0 ? If 6 months is too
> short and
> > we have to deprecate it after 2.0 ?
> >
> > Looking forward the answers. Any concerns and questions are appreciated,
> thanks a lot!
> >
> > --
> > Best Regards,
> > Haosdent Huang
>
>


-- 
Best Regards,
Haosdent Huang

Re: Question about the deprecated policy after 1.0

Posted by haosdent <ha...@gmail.com>.

Hi, Silas. Thanks a lot to help test the health check changes recently.

According to my understanding about your email, you mentioned two problems:

1. The bug that broken exists HTTP/command health check caused by r50812
<https://reviews.apache.org/r/50812> and r50996
<https://reviews.apache.org/r/50996>

>It is now true that even with the proposed change (51560), we will still
get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code
working in 1.0.0.
>Even in the case of the command health checks, which are once again
supported in 51560, we now get deprecation warnings, suggesting that mesos
will again break us in 1.4.

As you mentioned, this is a bug and we definitely should fix before release
1.1.0.
I have updated r51560 <https://reviews.apache.org/r/51560> yesterday and
verify it fix the problem via r51635. As you see in
the r51560 <https://reviews.apache.org/r/51560>, we make sure the protobuf
compatible again and didn't lose any
fields. Would you help to double check if it fixes your problem when you
free?
It would be highly appreciated that if you could help to verify it.

After this bug fix, we could ensure all tasks with HTTP/command health
check are not when upgrading to 1.1.0.

2. Should we make the `HealthCheck::type` required after v2 ?

To be honest, I think 6 months should be enough and it also should be
changed in
v1 because it is a minor change and we didn't make it `required` in
protobuf
message level. We still keeping it `option` in protobuf message definition
and
add a check about it in Mesos code.
But your concerns make sense as well, so let's see what other
users/developers say to
see if we could make an agreement on this.

On Tue, Sep 6, 2016 at 1:18 PM, Silas Snider <sw...@apple.com> wrote:

> There’s a little history to this:
>
> In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/
> 50812/>, on the 8th of August, the HTTP health check message was changed
> to be entirely incompatible with the previous HTTP health check message.
> Not only was its name changed (breaking compatibility with anyone using the
> feature with libmesos), but the field tags were rearranged, making it truly
> wire-format incompatible. This change also introduced a ‘type’ field to the
> HealthCheck message as an optional enum.
>
> Next, in https://reviews.apache.org/r/50996/ <
> https://reviews.apache.org/r/50996/>, on the 13th of August, the health
> checking code was changed to make the new ‘type’ field mandatory — if the
> protobuf field is not present, the mesos master rejects your task with
> TASK_ERROR.
>
> A colleague of mine was testing our internal scheduler against HEAD of
> mesos, and discovered that any task they submitted was being rejected as
> TASK_ERROR, since we were setting health checks, but not sending type. I
> filed MESOS-6110, on the 30th of August, and haosdent huang has kindly
> created https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/
> 51560/> to try to fix this.
>
> In the course of reviewing that fix, I noticed that it only addresses the
> case of a command health check, and does not continue to support HTTP
> health checks in the way they were in 1.0.0. This is a problem for our
> scheduler, as we have ~always (before mesos actually added support) passed
> our HTTP health checks in the message, depending on our custom executor to
> actually perform the check. It is now true that even with the proposed
> change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0,
> despite the same exact code working in 1.0.0.
>
> Even in the case of the command health checks, which are once again
> supported in 51560, we now get deprecation warnings, suggesting that mesos
> will again break us in 1.4.
>
> It is my team’s belief that the mesos compatibility guarantee, as
> documented on this page: http://mesos.apache.org/documentation/latest/
> versioning/ <http://mesos.apache.org/documentation/latest/versioning/>
> would prohibit this sort of change from occurring. Specifically, the ‘API
> Versioning’ section says "The API version is only bumped if we need to make
> a backwards incompatible API change. We will strive to support a given API
> version for at least a year.” and under the ‘API compatibility’ the change
> is considered to be breaking if it would involve "Adding new required
> fields to existing requests to “/scheduler”.”
>
> The proposed change does indeed add a new required field — ‘type’ to the
> v1 api, in the case of command health checks in 6 months, in the case of
> http health checks, immediately. Therefore, it seems clear that this
> constitutes a new ‘v2’ api, and it’s very clear that 6 months is too short,
> especially as another part of the 'API Versioning’ section says "The
> deprecation clock for vN-1 API will start as soon as we release “N.0.0”
> version of Mesos. […]”
>
> Please believe me, I understand the need to be able to change broken api
> and implementation quickly, without spending years maintaining technical
> debt. This is why I believe the mesos project decided to move to a model
> where the internal protobufs are separate from the v1/v2/etc. protobufs,
> and evolvers/devolvers are proposed. It seems clear that the right way of
> doing this is to modify the internal protobuf to look the way you’d like
> (better message name, clearer field order, etc.) and write an evolver from
> the v1 api to the internal api.
>
> Also, I think it’s important to note that the compatibility guarantees I’m
> citing are exactly the things that make it possible at all to write a
> scheduler against mesos and actually use it in production. Deciding that
> this case is too insignificant to really bother with the compatibility
> guarantees means that you’ve just pushed the tech debt issue one level
> higher to the scheduler writers.
>
> I’m sorry this email ended up so long, but thank you for taking some time
> to read it — I believe that this issue is critical to the ongoing health of
> the mesos project.
>
>
> > On Sep 5, 2016, at 11:14 AM, haosdent <ha...@gmail.com> wrote:
> >
> > Hi, folks. As I mentioned in the previous email
> http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/
> 0Vlr6Ma9DWqzG3M1>.
> > We have added `type` in the `HealthCheck` protobuf definition in 1.1.0
> and
> > health checks without `type` specified will be deprecated since 1.1.0.
> >
> > For backwards compatibility, we still support the command health check
> if the
> > type is not specified for now. But we plan to make `type` become a
> required field
> > and return `TASK_ERROR` if the type is not specified after 6 months. The
> question
> > is if this meets the deprecated policy since 1.0 ? If 6 months is too
> short and
> > we have to deprecate it after 2.0 ?
> >
> > Looking forward the answers. Any concerns and questions are appreciated,
> thanks a lot!
> >
> > --
> > Best Regards,
> > Haosdent Huang
>
>


-- 
Best Regards,
Haosdent Huang

Re: Question about the deprecated policy after 1.0

Posted by Silas Snider <sw...@apple.com>.

There’s a little history to this:

In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>, on the 8th of August, the HTTP health check message was changed to be entirely incompatible with the previous HTTP health check message. Not only was its name changed (breaking compatibility with anyone using the feature with libmesos), but the field tags were rearranged, making it truly wire-format incompatible. This change also introduced a ‘type’ field to the HealthCheck message as an optional enum.

Next, in https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>, on the 13th of August, the health checking code was changed to make the new ‘type’ field mandatory — if the protobuf field is not present, the mesos master rejects your task with TASK_ERROR.

A colleague of mine was testing our internal scheduler against HEAD of mesos, and discovered that any task they submitted was being rejected as TASK_ERROR, since we were setting health checks, but not sending type. I filed MESOS-6110, on the 30th of August, and haosdent huang has kindly created https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> to try to fix this.

In the course of reviewing that fix, I noticed that it only addresses the case of a command health check, and does not continue to support HTTP health checks in the way they were in 1.0.0. This is a problem for our scheduler, as we have ~always (before mesos actually added support) passed our HTTP health checks in the message, depending on our custom executor to actually perform the check. It is now true that even with the proposed change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code working in 1.0.0.

Even in the case of the command health checks, which are once again supported in 51560, we now get deprecation warnings, suggesting that mesos will again break us in 1.4.

It is my team’s belief that the mesos compatibility guarantee, as documented on this page: http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/> would prohibit this sort of change from occurring. Specifically, the ‘API Versioning’ section says "The API version is only bumped if we need to make a backwards incompatible API change. We will strive to support a given API version for at least a year.” and under the ‘API compatibility’ the change is considered to be breaking if it would involve "Adding new required fields to existing requests to “/scheduler”.”

The proposed change does indeed add a new required field — ‘type’ to the v1 api, in the case of command health checks in 6 months, in the case of http health checks, immediately. Therefore, it seems clear that this constitutes a new ‘v2’ api, and it’s very clear that 6 months is too short, especially as another part of the 'API Versioning’ section says "The deprecation clock for vN-1 API will start as soon as we release “N.0.0” version of Mesos. […]”

Please believe me, I understand the need to be able to change broken api and implementation quickly, without spending years maintaining technical debt. This is why I believe the mesos project decided to move to a model where the internal protobufs are separate from the v1/v2/etc. protobufs, and evolvers/devolvers are proposed. It seems clear that the right way of doing this is to modify the internal protobuf to look the way you’d like (better message name, clearer field order, etc.) and write an evolver from the v1 api to the internal api.

Also, I think it’s important to note that the compatibility guarantees I’m citing are exactly the things that make it possible at all to write a scheduler against mesos and actually use it in production. Deciding that this case is too insignificant to really bother with the compatibility guarantees means that you’ve just pushed the tech debt issue one level higher to the scheduler writers.

I’m sorry this email ended up so long, but thank you for taking some time to read it — I believe that this issue is critical to the ongoing health of the mesos project.

> On Sep 5, 2016, at 11:14 AM, haosdent <ha...@gmail.com> wrote:
> 
> Hi, folks. As I mentioned in the previous email http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>.
> We have added `type` in the `HealthCheck` protobuf definition in 1.1.0 and
> health checks without `type` specified will be deprecated since 1.1.0.
> 
> For backwards compatibility, we still support the command health check if the
> type is not specified for now. But we plan to make `type` become a required field
> and return `TASK_ERROR` if the type is not specified after 6 months. The question
> is if this meets the deprecated policy since 1.0 ? If 6 months is too short and
> we have to deprecate it after 2.0 ?
> 
> Looking forward the answers. Any concerns and questions are appreciated, thanks a lot!
> 
> -- 
> Best Regards,
> Haosdent Huang

Re: Question about the deprecated policy after 1.0

Posted by Silas Snider <sw...@apple.com>.

There’s a little history to this:

In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>, on the 8th of August, the HTTP health check message was changed to be entirely incompatible with the previous HTTP health check message. Not only was its name changed (breaking compatibility with anyone using the feature with libmesos), but the field tags were rearranged, making it truly wire-format incompatible. This change also introduced a ‘type’ field to the HealthCheck message as an optional enum.

Next, in https://reviews.apache.org/r/50996/ <https://reviews.apache.org/r/50996/>, on the 13th of August, the health checking code was changed to make the new ‘type’ field mandatory — if the protobuf field is not present, the mesos master rejects your task with TASK_ERROR.

A colleague of mine was testing our internal scheduler against HEAD of mesos, and discovered that any task they submitted was being rejected as TASK_ERROR, since we were setting health checks, but not sending type. I filed MESOS-6110, on the 30th of August, and haosdent huang has kindly created https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> to try to fix this.

In the course of reviewing that fix, I noticed that it only addresses the case of a command health check, and does not continue to support HTTP health checks in the way they were in 1.0.0. This is a problem for our scheduler, as we have ~always (before mesos actually added support) passed our HTTP health checks in the message, depending on our custom executor to actually perform the check. It is now true that even with the proposed change (51560), we will still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code working in 1.0.0.

Even in the case of the command health checks, which are once again supported in 51560, we now get deprecation warnings, suggesting that mesos will again break us in 1.4.

It is my team’s belief that the mesos compatibility guarantee, as documented on this page: http://mesos.apache.org/documentation/latest/versioning/ <http://mesos.apache.org/documentation/latest/versioning/> would prohibit this sort of change from occurring. Specifically, the ‘API Versioning’ section says "The API version is only bumped if we need to make a backwards incompatible API change. We will strive to support a given API version for at least a year.” and under the ‘API compatibility’ the change is considered to be breaking if it would involve "Adding new required fields to existing requests to “/scheduler”.”

The proposed change does indeed add a new required field — ‘type’ to the v1 api, in the case of command health checks in 6 months, in the case of http health checks, immediately. Therefore, it seems clear that this constitutes a new ‘v2’ api, and it’s very clear that 6 months is too short, especially as another part of the 'API Versioning’ section says "The deprecation clock for vN-1 API will start as soon as we release “N.0.0” version of Mesos. […]”

Please believe me, I understand the need to be able to change broken api and implementation quickly, without spending years maintaining technical debt. This is why I believe the mesos project decided to move to a model where the internal protobufs are separate from the v1/v2/etc. protobufs, and evolvers/devolvers are proposed. It seems clear that the right way of doing this is to modify the internal protobuf to look the way you’d like (better message name, clearer field order, etc.) and write an evolver from the v1 api to the internal api.

Also, I think it’s important to note that the compatibility guarantees I’m citing are exactly the things that make it possible at all to write a scheduler against mesos and actually use it in production. Deciding that this case is too insignificant to really bother with the compatibility guarantees means that you’ve just pushed the tech debt issue one level higher to the scheduler writers.

I’m sorry this email ended up so long, but thank you for taking some time to read it — I believe that this issue is critical to the ongoing health of the mesos project.

> On Sep 5, 2016, at 11:14 AM, haosdent <ha...@gmail.com> wrote:
> 
> Hi, folks. As I mentioned in the previous email http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>.
> We have added `type` in the `HealthCheck` protobuf definition in 1.1.0 and
> health checks without `type` specified will be deprecated since 1.1.0.
> 
> For backwards compatibility, we still support the command health check if the
> type is not specified for now. But we plan to make `type` become a required field
> and return `TASK_ERROR` if the type is not specified after 6 months. The question
> is if this meets the deprecated policy since 1.0 ? If 6 months is too short and
> we have to deprecate it after 2.0 ?
> 
> Looking forward the answers. Any concerns and questions are appreciated, thanks a lot!
> 
> -- 
> Best Regards,
> Haosdent Huang