
Posted to dev@reef.apache.org by Markus Weimer <ma...@weimo.de> on 2017/04/11 16:26:15 UTC

[Discuss] Timely releases with known issues vs. rare issue-free releases

Hi,

the ongoing saga of integration tests that never fully complete
reminded me that we never actually had a discussion about what our bar
for a release is.


Our informal agreement right now seems to be that we want all the
integration tests to finish all the time on our CI servers. I admire
our dedication to high-quality releases, and don't want to detract
from it. Over time, we must fix all of those issues and strive to
provide the most stable software we are capable of.

At the same time, I haven't had a test failure on any of my own
machines when reviewing pull requests in a very long time, which makes
me wonder whether our CI servers just set us up for failure. I believe
the failures on the CI servers are real, and point to interesting edge
cases in REEF we haven't fully solved. Hence, we absolutely must
investigate and fix them.

However, I am not sure whether this needs to happen in a way that
blocks the next release, because there is a competing interest in
timely releases. Our last actual release, 0.15, was in May of 2016*.
At that time, we did not have IMRU, we did not have working group
communications, and a bunch of bug fixes were not in yet. Our current
`master` is all around better than that release. Hence, we'd do our
users a service by making a release with a `known issues` section in
the release notes. This also would help us get feedback on the current
code from actual users, as many won't (and shouldn't) use a developer
version.

In summary, we are faced with two opposing goals: (1) fixing all the
known issues before a release to make that release the best it can be
and (2) releasing frequently to get our latest fixes and features out.

What do you think? Which approach would you like us to follow?

Thanks!

Markus


*: Let's just not talk about the disaster that is 0.15.1

Re: [Discuss] Timely releases with known issues vs. rare issue-free releases

Posted by Byung-Gon Chun <bg...@gmail.com>.

We have two great student GSoC proposals. One of them is about formal
verification of REEF protocols. We may be in a better position to deal with
corner cases when we're done with formal verification.


On Wed, Apr 12, 2017 at 2:07 PM, Byung-Gon Chun <bg...@gmail.com> wrote:

> Mariia, thanks for the insightful note.
>
> It's hard to decide ( :( ) since both waiting to fix the failures and
> not releasing too late make sense.
>
>
>
>
> On Wed, Apr 12, 2017 at 12:39 PM, Mariia Mykhailova <
> mamykhai@microsoft.com.invalid> wrote:
>
>> First, we should note that 0.16 is not the first release preceded by a
>> rather long bug hunt in our code; we spent at least a month fixing bugs for
>> each of 0.15 and 0.14. We tend to prioritize work on actual features between
>> releases, and we either start investigating known test failures only before
>> the release, or find bugs only when we start testing a release candidate on
>> various platforms/machines/environments. During the previous release
>> discussions we used to agree that we shouldn’t release code with known
>> reproducible test failures.
>>
>> Second, the fact that the test failure only happens in CI (as opposed to
>> our dev boxes) doesn't imply that this isn't an actual bug in our code,
>> that it can't happen in production or that it won't be bad if it happens.
>> We have a history of actual nasty bugs in code which we discovered only
>> using CI servers, and they were very much not obvious (which is why it
>> takes so long to track them down). Two of the bugs which are the freshest
>> in my memory were related to a REEF job never terminating; I don't think
>> doing a release with a known issue "REEF job sometimes never terminates"
>> (that will stay a known issue for at least half a year) is doing a service
>> to our users.
>>
>> On the other hand, the nature of our CI servers is such that occasional
>> test failures are inevitable. I've observed a lot of one-time transient
>> test failures which I didn't consider JIRA-worthy because they were
>> obviously caused by some resource issue on the CI server. Besides, our .NET
>> tests, especially IMRU ones, deal with a lot of concurrency and may
>> sometimes not account for certain valid sequences of events.
>>
>> Right now we have a whole list of problems:
>> 1. Test failures don't necessarily indicate a bug in the system, but they
>> might.
>> 2. We don't have sufficient understanding of when the failures indicate a
>> serious bug and when they don't.
>> 3. The tests we run in CI don't reflect the actual usage patterns of REEF
>> in production (we only run tests on the local runtime, not on YARN).
>> 4. We don't have sufficient motivation to debug until we approach a
>> release.
>> 5. When we approach a release, we have motivation to mark any failures as
>> known issues and proceed :-)
>>
>> I guess we should balance our belief that the test failures are
>> benign/not frequent enough to cause problems if used in production vs our
>> need to have some motivation for investigating them (if we always dismiss
>> them as transient CI failures, we'll be missing actual bugs).
>>
>> -Mariia
>>
>> -----Original Message-----
>> From: Taegeon Um [mailto:taegeonum@gmail.com]
>> Sent: Tuesday, April 11, 2017 6:37 PM
>> To: dev@reef.apache.org
>> Subject: Re: [Discuss] Timely releases with known issues vs. rare
>> issue-free releases
>>
>> Hi,
>>
>> I totally agree with Markus on timely releases and on making a release
>> with a `known issues` section.
>> If the test failures are rare, it would be good to note down the known
>> issues and move forward with the release.
>>
>> My main concern is how long we can keep them as known issues. We cannot
>> keep them as known issues forever.
>> I think we need at least a cut-off date. For example, if we find
>> transient failures in the 0.16 snapshot, should we resolve them by the next
>> release (0.17)?
>>
>> Thanks,
>> Taegeon
>>
>>
>> On Wed, Apr 12, 2017 at 6:09 AM, "Byung-Gon Chun" <bg...@gmail.com> wrote:
>>
>> I understand this concern. It's been about a month since we started to
>> talk about release 0.16. Also, our last release was a long time ago.
>>
>> Mariia, Taegeon, Sergiy, and Julia have been working to fix test failure
>> issues for release 0.16. Since we heard from Julia, I hope to hear from
>> Mariia, Taegeon, and Sergiy as well. What're your thoughts?
>>
>>
>>
>> On Wed, Apr 12, 2017 at 3:15 AM, Julia Wang (QIUHE) <
>> Qiuhe.Wang@microsoft.com.invalid> wrote:
>>
>> > I totally agree with Markus's comments.
>> >
>> > If we have test failures that show some bugs in the system and impact
>> > the quality of the code, we should resolve them before a release.
>> >
>> > The few current transient test failures only happen on AppVeyor. They
>> > could be test issues that hit some edge scenarios, most likely related
>> > to the timing of events. I fixed some of them a couple of weeks ago:
>> > we didn't dispose the active context in one of the tests (a bug in the
>> > test), the validation condition in a test was too strict because events
>> > may be received in a different sequence, we depended on log messages in
>> > test handlers, which may receive events later than the driver does, etc.
>> >
>> > The failing tests may be just test issues or may imply some system
>> > issues; we don't know yet. But as long as there is no obvious defect in
>> > the current code base, I don't think they should block the release.
>> >
>> > Thanks,
>> > Julia
>> >
>> >
>>
>>
>>
>> --
>> Byung-Gon Chun
>>
>
>
>
> --
> Byung-Gon Chun
>



-- 
Byung-Gon Chun
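
One of the test fixes Julia describes in the quoted text above (a validation condition that was too strict because events may be received in a different sequence) can be illustrated with a small sketch. This is not taken from the REEF test suite: the event names and both helper functions are hypothetical, and Python is used only for brevity. It assumes the test only cares about which events arrived, not about their order.

```python
from collections import Counter

# Hypothetical event names, for illustration only; they are not REEF identifiers.
EXPECTED_EVENTS = ["context_active", "task_running", "task_completed"]


def strict_check(observed):
    """Too strict: passes only if the events arrive in one exact order."""
    return observed == EXPECTED_EVENTS


def order_insensitive_check(observed):
    """Passes if the same events arrived the same number of times, in any order."""
    return Counter(observed) == Counter(EXPECTED_EVENTS)


# A run in which two events were delivered in a different order.
reordered = ["task_running", "context_active", "task_completed"]
print(strict_check(reordered))             # False
print(order_insensitive_check(reordered))  # True
```

Comparing multisets still fails when an event is missing or duplicated, so the check loosens only the ordering requirement. If some orderings really are invalid, a test would need to encode those constraints explicitly instead.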

Re: [Discuss] Timely releases with known issues vs. rare issue-free releases

Posted by Byung-Gon Chun <bg...@gmail.com>.

Good point! :)


On Sat, Apr 22, 2017 at 4:27 AM, Mariia Mykhailova <
mamykhai@microsoft.com.invalid> wrote:

> Just got email about promoting releases at the upcoming ApacheCon:
>
> If your project is planning a release in the next few weeks, please
> consider doing this during the week of ApacheCon, and letting us know
> (email press@apache.org and rbowen@apache.org, please) so that we can
> make a splash around your release.
>
> Just something to consider :-)
>
> -Mariia
>


-- 
Byung-Gon Chun

RE: [Discuss] Timely releases with known issues vs. rare issue-free releases

Posted by Mariia Mykhailova <ma...@microsoft.com.INVALID>.

Just got email about promoting releases at the upcoming ApacheCon:

If your project is planning a release in the next few weeks, please
consider doing this during the week of ApacheCon, and letting us know
(email press@apache.org and rbowen@apache.org, please) so that we can
make a splash around your release.

Just something to consider :-)

-Mariia


Re: [Discuss] Timely releases with known issues vs. rare issue-free releases

Posted by Tae-Geon Um <ta...@gmail.com>.

We may need a week to check whether the transient failures recur.
So, how about working on resolving the failures until the end of next week, and then waiting one week after that?

Anyway, we should decide on a policy: timely releases vs. rare issue-free releases.
On second thought, I prefer *rare issue-free releases*. I agree with Mariia that writing down “known issues” and saying “REEF tests could fail sometimes” doesn’t look good.

However, I’m not saying that we should give up on timely releases. I think we can achieve *both* if we give the transient failures high priority. As Mariia pointed out, we don’t have sufficient motivation to debug transient failures until we approach a release. But if we try to resolve the transient failures well before a release, maybe we can make issue-free releases that are also timely.

Still, it’s really hard to decide. I would like to hear other thoughts. 

Thanks,
Taegeon
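
As a rough way to reason about how long to wait before concluding that a transient failure is gone, here is a back-of-the-envelope sketch. It assumes that CI runs are independent and that an unfixed test fails with a fixed probability per run; neither assumption comes from this thread, and the function name and the numbers are hypothetical.

```python
import math


def clean_runs_needed(failure_rate, miss_probability):
    """Consecutive passing runs needed so that a test which still fails with
    probability `failure_rate` per run would go unnoticed with probability
    at most `miss_probability` (assumes independent runs)."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < miss_probability < 1.0:
        raise ValueError("miss_probability must be strictly between 0 and 1")
    # Solve (1 - failure_rate) ** n <= miss_probability for the smallest integer n.
    return math.ceil(math.log(miss_probability) / math.log(1.0 - failure_rate))


# A test that still fails in 5% of runs needs 59 consecutive green runs
# before the chance of having missed it drops to 5% or less.
print(clean_runs_needed(0.05, 0.05))  # 59
```

Under these assumptions, whether one week is enough depends on how many CI runs happen in that week and on how rare the failure is.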


> On Apr 13, 2017, at 7:06 AM, Byung-Gon Chun <bg...@gmail.com> wrote:
> 
> How about setting a deadline to work on test failure cases for release
> 0.16.0 to the end of next week? If we would like to spend more time, we can
> set it to the last day of April.
> 
> -Gon
> 
> On Wed, Apr 12, 2017 at 2:07 PM, Byung-Gon Chun <bg...@gmail.com> wrote:
> 
>> Mariia, thanks for the insightful note.
>> 
>> It's hard to decide ( :( ) since both waiting to fix failures and
>> releasing not too late make sense.
>> 
>> 
>> 
>> 
>> On Wed, Apr 12, 2017 at 12:39 PM, Mariia Mykhailova <
>> mamykhai@microsoft.com.invalid> wrote:
>> 
>>> First, we should note that 0.16 is not the first release preceded by a
>>> rather long bug hunt in our code; we spent at least a month fixing bugs for
>>> each 0.15 and 0.14. We tend to prioritize work on actual features between
>>> releases, and we either start investigating known test failures only before
>>> the release, or find bugs only when we start testing release candidate on
>>> various platforms/machines/environments. During the previous release
>>> discussions we used to agree that we shouldn’t release code with known
>>> reproducible test failures.
>>> 
>>> Second, the fact that the test failure only happens in CI (as opposed to
>>> our dev boxes) doesn't imply that this isn't an actual bug in our code,
>>> that it can't happen in production or that it won't be bad if it happens.
>>> We have a history of actual nasty bugs in code which we discovered only
>>> using CI servers, and they were very much not obvious (which is why it
>>> takes so long to track them down). Two of the bugs which are the freshest
>>> in my memory were related to REEF job never terminating; I don't think
>>> doing a release with a known issue "REEF job sometimes never terminates"
>>> (that will stay a known issue for at least half a year) is doing a service
>>> to our users.
>>> 
>>> On the other hand, the nature of our CI servers is such that occasional
>>> test failures are inevitable. I've observed a lot of one-time transient
>>> test failures which I didn't consider JIRA-worthy because they were
>>> obviously caused by some resource issue on the CI server. Besides, our .NET
>>> tests, especially IMRU ones, deal with a lot of concurrency and may
>>> sometimes not account for certain valid sequences of events.
>>> 
>>> Right now we have a whole list of problems:
>>> 1. Test failures don't necessarily indicate a bug in the system, but they
>>> might.
>>> 2. We don't have sufficient understanding of when the failures indicate a
>>> serious bug and when they don't.
>>> 3. The tests we run in CI don't reflect the actual usage patterns of REEF
>>> in production (we only run tests on local runtime, not on Yarn)
>>> 4. We don't have sufficient motivation to debug until we approach a
>>> release.
>>> 5. When we approach a release, we have motivation to mark any failures as
>>> known issues and proceed :-)
>>> 
>>> I guess we should balance our belief that the test failures are
>>> benign/not frequent enough to cause problems if used in production vs our
>>> need to have some motivation for investigating them (if we always dismiss
>>> them as transient CI failures, we'll be missing actual bugs).
>>> 
>>> -Mariia
>>> 
>>> -----Original Message-----
>>> From: Taegeon Um [mailto:taegeonum@gmail.com]
>>> Sent: Tuesday, April 11, 2017 6:37 PM
>>> To: dev@reef.apache.org
>>> Subject: Re: [Discuss] Timely releases with known issues vs. rare
>>> issue-free releases
>>> 
>>> Hi,
>>> 
>>> I am totally agree with Markus on timely releases and making a release
>>> with a `known issues` section.
>>> If the test failures are rare, it would be good to note down the known
>>> issues and move forward to the release.
>>> 
>>> My main concern is that "until when do we have to keep them as known
>>> issues?". We cannot keep them as known issues forever.
>>> I think at least we need a cut-off date. For example, if we find
>>> transient failures in 0.16 snapshot, we should resolve them until the next
>>> release (0.17)?
>>> 
>>> Thanks,
>>> Taegeon
>>> 
>>> 
>>> On Apr 12, 2017, at 6:09 AM, "Byung-Gon Chun" <bg...@gmail.com> wrote:
>>> 
>>> I understand this concern. It's been about a month since we started to
>>> talk about release 0.16. Also, our last release occurred long time ago.
>>> 
>>> Mariia, Taegeon, Sergiy, and Julia have been working to fix test failure
>>> issues for release 0.16. Since we heard from Julia, I hope to hear from
>>> Mariia, Taegeon, and Sergiy as well. What're your thoughts?
>>> 
>>> 
>>> 
>>> On Wed, Apr 12, 2017 at 3:15 AM, Julia Wang (QIUHE) <
>>> Qiuhe.Wang@microsoft.com.invalid> wrote:
>>> 
>>>> I totally agree with Markus's comments.
>>>> 
>>>> If we have test failures that show some bugs in the system and impact
>>>> the quality of the code, we should resolve before a release.
>>>> 
>>>> Current few transit test failures only happen on AppVayer. It could be
>>>> test issue that hit some edge scenarios, most possibly is related to
>>>> the timing of the events. I have fixed some of them couple of weeks
>>>> ago, like we didn't dispose active context in one of the test code (it
>>>> was a bug in test), the validation condition in a test was too strict
>>>> as the event may be received in different sequence, we depended on log
>>>> messages in test handlers which may receive events later than driver
>>> receives, etc.
>>>> 
>>>> The failing tests may be just test issues, may imply some system
>>>> issues, we don't know yet. But as long as there is no obvious defect
>>>> in the
>>> current
>>>> code base, I would think it should not block the release.
>>>> 
>>>> Thanks,
>>>> Julia
>>>> 
>>>> -----Original Message-----
>>>> From: Markus Weimer [mailto:markus@weimo.de]
>>>> Sent: Tuesday, April 11, 2017 9:26 AM
>>>> To: REEF Developers Mailinglist <de...@reef.apache.org>
>>>> Subject: [Discuss] Timely releases with known issues vs. rare
>>>> issue-free releases
>>>> 
>>>> Hi,
>>>> 
>>>> the current saga of ever not fully completing integration tests
>>>> reminded me that we never actually had a discussion about what our bar
>>>> for a
>>> release
>>>> is.
>>>> 
>>>> 
>>>> Our informal agreement right now seems to be that we want all the
>>>> integration tests to finish all the time on our CI servers. I admire
>>>> our dedication to high-quality releases, and don't want to distract
>>> from it.
>>>> Over time, we must fix all of those issues and strive to provide the
>>>> most stable software we are capable off.
>>>> 
>>>> At the same time, I haven't had a test failure on any of my own
>>>> machines when reviewing pull requests in a very long time. Which makes
>>>> me wonder whether our CI servers just set us up for failure. I believe
>>>> the failures on the CI servers are real, and point to interesting edge
>>>> cases in REEF we haven't fully solved. Hence, we absolutely must
>>> investigate and fix them.
>>>> 
>>>> However, I am not sure whether this needs to happen in a way that
>>>> blocks the next release. Because there is a competing interest in
>>>> timely
>>> releases.
>>>> Our last actual release, 0.15, was in May of 2016*.
>>>> At that time, we did not have IMRU, we did not have a working group
>>>> communications and a bunch of bug fixes were not in yet. Our current
>>>> `master` is all around better than that release. Hence, we'd do our
>>>> users
>>> a
>>>> service by making a release with a `known issues` section in the
>>>> release notes. This also would help us get feedback on the current
>>>> code from
>>> actual
>>>> users, as many won't (and shouldn't) use a developer version.
>>>> 
>>>> In summary, we are faced with two opposing goals: (1) Fixing all the
>>>> known issues before a release to make that release the best it can be
>>>> and (2) release frequently to get our latest fixes and features out.
>>>> 
>>>> What do you think? Which approach would you like us to follow?
>>>> 
>>>> Thanks!
>>>> 
>>>> Markus
>>>> 
>>>> 
>>>> *: Let's just not talk about the disaster that is 0.15.1
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Byung-Gon Chun
>>> 
>> 
>> 
>> 
>> --
>> Byung-Gon Chun
>> 
> 
> 
> 
> -- 
> Byung-Gon Chun

Re: [Discuss] Timely releases with known issues vs. rare issue-free releases

Posted by Byung-Gon Chun <bg...@gmail.com>.

How about setting the deadline for working on the test failures for release
0.16.0 to the end of next week? If we would like to spend more time, we can
set it to the last day of April.

-Gon

On Wed, Apr 12, 2017 at 2:07 PM, Byung-Gon Chun <bg...@gmail.com> wrote:

> Mariia, thanks for the insightful note.
>
> It's hard to decide ( :( ) since both waiting to fix failures and
> releasing not too late make sense.
>
>
>
>
> On Wed, Apr 12, 2017 at 12:39 PM, Mariia Mykhailova <
> mamykhai@microsoft.com.invalid> wrote:
>
>> First, we should note that 0.16 is not the first release preceded by a
>> rather long bug hunt in our code; we spent at least a month fixing bugs for
>> each 0.15 and 0.14. We tend to prioritize work on actual features between
>> releases, and we either start investigating known test failures only before
>> the release, or find bugs only when we start testing release candidate on
>> various platforms/machines/environments. During the previous release
>> discussions we used to agree that we shouldn’t release code with known
>> reproducible test failures.
>>
>> Second, the fact that the test failure only happens in CI (as opposed to
>> our dev boxes) doesn't imply that this isn't an actual bug in our code,
>> that it can't happen in production or that it won't be bad if it happens.
>> We have a history of actual nasty bugs in code which we discovered only
>> using CI servers, and they were very much not obvious (which is why it
>> takes so long to track them down). Two of the bugs which are the freshest
>> in my memory were related to REEF job never terminating; I don't think
>> doing a release with a known issue "REEF job sometimes never terminates"
>> (that will stay a known issue for at least half a year) is doing a service
>> to our users.
>>
>> On the other hand, the nature of our CI servers is such that occasional
>> test failures are inevitable. I've observed a lot of one-time transient
>> test failures which I didn't consider JIRA-worthy because they were
>> obviously caused by some resource issue on the CI server. Besides, our .NET
>> tests, especially IMRU ones, deal with a lot of concurrency and may
>> sometimes not account for certain valid sequences of events.
>>
>> Right now we have a whole list of problems:
>> 1. Test failures don't necessarily indicate a bug in the system, but they
>> might.
>> 2. We don't have sufficient understanding of when the failures indicate a
>> serious bug and when they don't.
>> 3. The tests we run in CI don't reflect the actual usage patterns of REEF
>> in production (we only run tests on local runtime, not on Yarn)
>> 4. We don't have sufficient motivation to debug until we approach a
>> release.
>> 5. When we approach a release, we have motivation to mark any failures as
>> known issues and proceed :-)
>>
>> I guess we should balance our belief that the test failures are
>> benign/not frequent enough to cause problems if used in production vs our
>> need to have some motivation for investigating them (if we always dismiss
>> them as transient CI failures, we'll be missing actual bugs).
>>
>> -Mariia
>>
>> -----Original Message-----
>> From: Taegeon Um [mailto:taegeonum@gmail.com]
>> Sent: Tuesday, April 11, 2017 6:37 PM
>> To: dev@reef.apache.org
>> Subject: Re: [Discuss] Timely releases with known issues vs. rare
>> issue-free releases
>>
>> Hi,
>>
>> I am totally agree with Markus on timely releases and making a release
>> with a `known issues` section.
>> If the test failures are rare, it would be good to note down the known
>> issues and move forward to the release.
>>
>> My main concern is that "until when do we have to keep them as known
>> issues?". We cannot keep them as known issues forever.
>> I think at least we need a cut-off date. For example, if we find
>> transient failures in 0.16 snapshot, we should resolve them until the next
>> release (0.17)?
>>
>> Thanks,
>> Taegeon
>>
>>
>> On Apr 12, 2017, at 6:09 AM, "Byung-Gon Chun" <bg...@gmail.com> wrote:
>>
>> I understand this concern. It's been about a month since we started to
>> talk about release 0.16. Also, our last release occurred long time ago.
>>
>> Mariia, Taegeon, Sergiy, and Julia have been working to fix test failure
>> issues for release 0.16. Since we heard from Julia, I hope to hear from
>> Mariia, Taegeon, and Sergiy as well. What're your thoughts?
>>
>>
>>
>> On Wed, Apr 12, 2017 at 3:15 AM, Julia Wang (QIUHE) <
>> Qiuhe.Wang@microsoft.com.invalid> wrote:
>>
>> > I totally agree with Markus's comments.
>> >
>> > If we have test failures that show some bugs in the system and impact
>> > the quality of the code, we should resolve before a release.
>> >
>> > Current few transit test failures only happen on AppVayer. It could be
>> > test issue that hit some edge scenarios, most possibly is related to
>> > the timing of the events. I have fixed some of them couple of weeks
>> > ago, like we didn't dispose active context in one of the test code (it
>> > was a bug in test), the validation condition in a test was too strict
>> > as the event may be received in different sequence, we depended on log
>> > messages in test handlers which may receive events later than driver
>> receives, etc.
>> >
>> > The failing tests may be just test issues, may imply some system
>> > issues, we don't know yet. But as long as there is no obvious defect
>> > in the
>> current
>> > code base, I would think it should not block the release.
>> >
>> > Thanks,
>> > Julia
>> >
>> > -----Original Message-----
>> > From: Markus Weimer [mailto:markus@weimo.de]
>> > Sent: Tuesday, April 11, 2017 9:26 AM
>> > To: REEF Developers Mailinglist <de...@reef.apache.org>
>> > Subject: [Discuss] Timely releases with known issues vs. rare
>> > issue-free releases
>> >
>> > Hi,
>> >
>> > the current saga of ever not fully completing integration tests
>> > reminded me that we never actually had a discussion about what our bar
>> > for a
>> release
>> > is.
>> >
>> >
>> > Our informal agreement right now seems to be that we want all the
>> > integration tests to finish all the time on our CI servers. I admire
>> > our dedication to high-quality releases, and don't want to distract
>> from it.
>> > Over time, we must fix all of those issues and strive to provide the
>> > most stable software we are capable off.
>> >
>> > At the same time, I haven't had a test failure on any of my own
>> > machines when reviewing pull requests in a very long time. Which makes
>> > me wonder whether our CI servers just set us up for failure. I believe
>> > the failures on the CI servers are real, and point to interesting edge
>> > cases in REEF we haven't fully solved. Hence, we absolutely must
>> investigate and fix them.
>> >
>> > However, I am not sure whether this needs to happen in a way that
>> > blocks the next release. Because there is a competing interest in
>> > timely
>> releases.
>> > Our last actual release, 0.15, was in May of 2016*.
>> > At that time, we did not have IMRU, we did not have a working group
>> > communications and a bunch of bug fixes were not in yet. Our current
>> > `master` is all around better than that release. Hence, we'd do our
>> > users
>> a
>> > service by making a release with a `known issues` section in the
>> > release notes. This also would help us get feedback on the current
>> > code from
>> actual
>> > users, as many won't (and shouldn't) use a developer version.
>> >
>> > In summary, we are faced with two opposing goals: (1) Fixing all the
>> > known issues before a release to make that release the best it can be
>> > and (2) release frequently to get our latest fixes and features out.
>> >
>> > What do you think? Which approach would you like us to follow?
>> >
>> > Thanks!
>> >
>> > Markus
>> >
>> >
>> > *: Let's just not talk about the disaster that is 0.15.1
>> >
>>
>>
>>
>> --
>> Byung-Gon Chun
>>
>
>
>
> --
> Byung-Gon Chun
>



-- 
Byung-Gon Chun

Re: [Discuss] Timely releases with known issues vs. rare issue-free releases

Posted by Byung-Gon Chun <bg...@gmail.com>.

Mariia, thanks for the insightful note.

It's hard to decide ( :( ) since both waiting to fix the failures and
releasing without too much delay make sense.




On Wed, Apr 12, 2017 at 12:39 PM, Mariia Mykhailova <
mamykhai@microsoft.com.invalid> wrote:

> First, we should note that 0.16 is not the first release preceded by a
> rather long bug hunt in our code; we spent at least a month fixing bugs for
> each 0.15 and 0.14. We tend to prioritize work on actual features between
> releases, and we either start investigating known test failures only before
> the release, or find bugs only when we start testing release candidate on
> various platforms/machines/environments. During the previous release
> discussions we used to agree that we shouldn’t release code with known
> reproducible test failures.
>
> Second, the fact that the test failure only happens in CI (as opposed to
> our dev boxes) doesn't imply that this isn't an actual bug in our code,
> that it can't happen in production or that it won't be bad if it happens.
> We have a history of actual nasty bugs in code which we discovered only
> using CI servers, and they were very much not obvious (which is why it
> takes so long to track them down). Two of the bugs which are the freshest
> in my memory were related to REEF job never terminating; I don't think
> doing a release with a known issue "REEF job sometimes never terminates"
> (that will stay a known issue for at least half a year) is doing a service
> to our users.
>
> On the other hand, the nature of our CI servers is such that occasional
> test failures are inevitable. I've observed a lot of one-time transient
> test failures which I didn't consider JIRA-worthy because they were
> obviously caused by some resource issue on the CI server. Besides, our .NET
> tests, especially IMRU ones, deal with a lot of concurrency and may
> sometimes not account for certain valid sequences of events.
>
> Right now we have a whole list of problems:
> 1. Test failures don't necessarily indicate a bug in the system, but they
> might.
> 2. We don't have sufficient understanding of when the failures indicate a
> serious bug and when they don't.
> 3. The tests we run in CI don't reflect the actual usage patterns of REEF
> in production (we only run tests on local runtime, not on Yarn)
> 4. We don't have sufficient motivation to debug until we approach a
> release.
> 5. When we approach a release, we have motivation to mark any failures as
> known issues and proceed :-)
>
> I guess we should balance our belief that the test failures are benign/not
> frequent enough to cause problems if used in production vs our need to have
> some motivation for investigating them (if we always dismiss them as
> transient CI failures, we'll be missing actual bugs).
>
> -Mariia
>
> -----Original Message-----
> From: Taegeon Um [mailto:taegeonum@gmail.com]
> Sent: Tuesday, April 11, 2017 6:37 PM
> To: dev@reef.apache.org
> Subject: Re: [Discuss] Timely releases with known issues vs. rare
> issue-free releases
>
> Hi,
>
> I am totally agree with Markus on timely releases and making a release
> with a `known issues` section.
> If the test failures are rare, it would be good to note down the known
> issues and move forward to the release.
>
> My main concern is that "until when do we have to keep them as known
> issues?". We cannot keep them as known issues forever.
> I think at least we need a cut-off date. For example, if we find transient
> failures in 0.16 snapshot, we should resolve them until the next release
> (0.17)?
>
> Thanks,
> Taegeon
>
>
> On Apr 12, 2017, at 6:09 AM, "Byung-Gon Chun" <bg...@gmail.com> wrote:
>
> I understand this concern. It's been about a month since we started to
> talk about release 0.16. Also, our last release occurred long time ago.
>
> Mariia, Taegeon, Sergiy, and Julia have been working to fix test failure
> issues for release 0.16. Since we heard from Julia, I hope to hear from
> Mariia, Taegeon, and Sergiy as well. What're your thoughts?
>
>
>
> On Wed, Apr 12, 2017 at 3:15 AM, Julia Wang (QIUHE) <
> Qiuhe.Wang@microsoft.com.invalid> wrote:
>
> > I totally agree with Markus's comments.
> >
> > If we have test failures that show some bugs in the system and impact
> > the quality of the code, we should resolve before a release.
> >
> > Current few transit test failures only happen on AppVayer. It could be
> > test issue that hit some edge scenarios, most possibly is related to
> > the timing of the events. I have fixed some of them couple of weeks
> > ago, like we didn't dispose active context in one of the test code (it
> > was a bug in test), the validation condition in a test was too strict
> > as the event may be received in different sequence, we depended on log
> > messages in test handlers which may receive events later than driver
> receives, etc.
> >
> > The failing tests may be just test issues, may imply some system
> > issues, we don't know yet. But as long as there is no obvious defect
> > in the
> current
> > code base, I would think it should not block the release.
> >
> > Thanks,
> > Julia
> >
> > -----Original Message-----
> > From: Markus Weimer [mailto:markus@weimo.de]
> > Sent: Tuesday, April 11, 2017 9:26 AM
> > To: REEF Developers Mailinglist <de...@reef.apache.org>
> > Subject: [Discuss] Timely releases with known issues vs. rare
> > issue-free releases
> >
> > Hi,
> >
> > the current saga of ever not fully completing integration tests
> > reminded me that we never actually had a discussion about what our bar
> > for a
> release
> > is.
> >
> >
> > Our informal agreement right now seems to be that we want all the
> > integration tests to finish all the time on our CI servers. I admire
> > our dedication to high-quality releases, and don't want to distract from
> it.
> > Over time, we must fix all of those issues and strive to provide the
> > most stable software we are capable off.
> >
> > At the same time, I haven't had a test failure on any of my own
> > machines when reviewing pull requests in a very long time. Which makes
> > me wonder whether our CI servers just set us up for failure. I believe
> > the failures on the CI servers are real, and point to interesting edge
> > cases in REEF we haven't fully solved. Hence, we absolutely must
> investigate and fix them.
> >
> > However, I am not sure whether this needs to happen in a way that
> > blocks the next release. Because there is a competing interest in
> > timely
> releases.
> > Our last actual release, 0.15, was in May of 2016*.
> > At that time, we did not have IMRU, we did not have a working group
> > communications and a bunch of bug fixes were not in yet. Our current
> > `master` is all around better than that release. Hence, we'd do our
> > users
> a
> > service by making a release with a `known issues` section in the
> > release notes. This also would help us get feedback on the current
> > code from
> actual
> > users, as many won't (and shouldn't) use a developer version.
> >
> > In summary, we are faced with two opposing goals: (1) Fixing all the
> > known issues before a release to make that release the best it can be
> > and (2) release frequently to get our latest fixes and features out.
> >
> > What do you think? Which approach would you like us to follow?
> >
> > Thanks!
> >
> > Markus
> >
> >
> > *: Let's just not talk about the disaster that is 0.15.1
> >
>
>
>
> --
> Byung-Gon Chun
>



-- 
Byung-Gon Chun

RE: [Discuss] Timely releases with known issues vs. rare issue-free releases

Posted by Mariia Mykhailova <ma...@microsoft.com.INVALID>.

First, we should note that 0.16 is not the first release preceded by a rather long bug hunt in our code; we spent at least a month fixing bugs for each of 0.15 and 0.14. We tend to prioritize work on actual features between releases, and we either start investigating known test failures only right before the release, or find bugs only when we start testing a release candidate on various platforms/machines/environments. During the previous release discussions we used to agree that we shouldn’t release code with known reproducible test failures.

Second, the fact that a test failure only happens in CI (as opposed to our dev boxes) doesn't imply that it isn't an actual bug in our code, that it can't happen in production, or that it won't be bad if it happens. We have a history of nasty bugs in our code which we discovered only on the CI servers, and they were very much not obvious (which is why it takes so long to track them down). The two bugs freshest in my memory were related to a REEF job never terminating; I don't think doing a release with a known issue "REEF job sometimes never terminates" (that will stay a known issue for at least half a year) is doing a service to our users.

On the other hand, the nature of our CI servers is such that occasional test failures are inevitable. I've observed a lot of one-time transient test failures which I didn't consider JIRA-worthy because they were obviously caused by some resource issue on the CI server. Besides, our .NET tests, especially IMRU ones, deal with a lot of concurrency and may sometimes not account for certain valid sequences of events.

Right now we have a whole list of problems:
1. Test failures don't necessarily indicate a bug in the system, but they might. 
2. We don't have sufficient understanding of when the failures indicate a serious bug and when they don't.
3. The tests we run in CI don't reflect the actual usage patterns of REEF in production (we only run tests on the local runtime, not on YARN).
4. We don't have sufficient motivation to debug until we approach a release.
5. When we approach a release, we have motivation to mark any failures as known issues and proceed :-)

I guess we should balance our belief that the test failures are benign/not frequent enough to cause problems if used in production vs our need to have some motivation for investigating them (if we always dismiss them as transient CI failures, we'll be missing actual bugs). 

-Mariia

-----Original Message-----
From: Taegeon Um [mailto:taegeonum@gmail.com] 
Sent: Tuesday, April 11, 2017 6:37 PM
To: dev@reef.apache.org
Subject: Re: [Discuss] Timely releases with known issues vs. rare issue-free releases

Hi,

I totally agree with Markus on timely releases and making a release with a `known issues` section.
If the test failures are rare, it would be good to note down the known issues and move forward with the release.

My main concern is: until when do we have to keep them as known issues? We cannot keep them as known issues forever.
I think we need at least a cut-off date. For example, if we find transient failures in the 0.16 snapshot, should we resolve them by the next release (0.17)?

Thanks,
Taegeon

On Apr 12, 2017, at 6:09 AM, "Byung-Gon Chun" <bg...@gmail.com> wrote:

I understand this concern. It's been about a month since we started to talk about release 0.16. Also, our last release was a long time ago.

Mariia, Taegeon, Sergiy, and Julia have been working to fix test failure issues for release 0.16. Since we heard from Julia, I hope to hear from Mariia, Taegeon, and Sergiy as well. What're your thoughts?

On Wed, Apr 12, 2017 at 3:15 AM, Julia Wang (QIUHE) < Qiuhe.Wang@microsoft.com.invalid> wrote:

> I totally agree with Markus's comments.
>
> If we have test failures that show some bugs in the system and impact
> the quality of the code, we should resolve them before a release.
>
> The few current transient test failures only happen on AppVeyor. They
> could be test issues that hit some edge scenarios, most likely related
> to the timing of events. I fixed some of them a couple of weeks ago.
> For example, we didn't dispose of the active context in one of the
> tests (a bug in the test), the validation condition in a test was too
> strict because events may be received in a different order, and we
> depended on log messages in test handlers, which may receive events
> later than the driver does.
>
> The failing tests may be just test issues or may imply some system
> issues; we don't know yet. But as long as there is no obvious defect
> in the current code base, I would think it should not block the release.
>
> Thanks,
> Julia
>
> -----Original Message-----
> From: Markus Weimer [mailto:markus@weimo.de]
> Sent: Tuesday, April 11, 2017 9:26 AM
> To: REEF Developers Mailinglist <de...@reef.apache.org>
> Subject: [Discuss] Timely releases with known issues vs. rare 
> issue-free releases
>
> Hi,
>
> the current saga of ever not fully completing integration tests
> reminded me that we never actually had a discussion about what our bar
> for a release is.
>
>
> Our informal agreement right now seems to be that we want all the 
> integration tests to finish all the time on our CI servers. I admire 
> our dedication to high-quality releases, and don't want to distract from it.
> Over time, we must fix all of those issues and strive to provide the
> most stable software we are capable of.
>
> At the same time, I haven't had a test failure on any of my own 
> machines when reviewing pull requests in a very long time. Which makes 
> me wonder whether our CI servers just set us up for failure. I believe 
> the failures on the CI servers are real, and point to interesting edge 
> cases in REEF we haven't fully solved. Hence, we absolutely must investigate and fix them.
>
> However, I am not sure whether this needs to happen in a way that
> blocks the next release. Because there is a competing interest in
> timely releases.
> Our last actual release, 0.15, was in May of 2016*.
> At that time, we did not have IMRU, we did not have a working group
> communications and a bunch of bug fixes were not in yet. Our current
> `master` is all around better than that release. Hence, we'd do our
> users a service by making a release with a `known issues` section in
> the release notes. This also would help us get feedback on the current
> code from actual users, as many won't (and shouldn't) use a developer
> version.
>
> In summary, we are faced with two opposing goals: (1) Fixing all the 
> known issues before a release to make that release the best it can be 
> and (2) release frequently to get our latest fixes and features out.
>
> What do you think? Which approach would you like us to follow?
>
> Thanks!
>
> Markus
>
>
> *: Let's just not talk about the disaster that is 0.15.1
>

--
Byung-Gon Chun

Re: [Discuss] Timely releases with known issues vs. rare issue-free releases

Posted by Taegeon Um <ta...@gmail.com>.

Hi,

I totally agree with Markus on timely releases and making a release with
a `known issues` section.
If the test failures are rare, it would be good to note down the known
issues and move forward with the release.

My main concern is: until when do we have to keep them as known issues?
We cannot keep them as known issues forever.
I think we need at least a cut-off date. For example, if we find transient
failures in the 0.16 snapshot, should we resolve them by the next release
(0.17)?

Thanks,
Taegeon


On Apr 12, 2017, at 6:09 AM, "Byung-Gon Chun" <bg...@gmail.com> wrote:

I understand this concern. It's been about a month since we started to talk
about release 0.16. Also, our last release was a long time ago.

Mariia, Taegeon, Sergiy, and Julia have been working to fix test failure
issues for release 0.16. Since we heard from Julia, I hope to hear from
Mariia, Taegeon, and Sergiy as well. What're your thoughts?



On Wed, Apr 12, 2017 at 3:15 AM, Julia Wang (QIUHE) <
Qiuhe.Wang@microsoft.com.invalid> wrote:

> I totally agree with Markus's comments.
>
> If we have test failures that show some bugs in the system and impact the
> quality of the code, we should resolve them before a release.
>
> The few current transient test failures only happen on AppVeyor. They could
> be test issues that hit some edge scenarios, most likely related to the
> timing of events. I fixed some of them a couple of weeks ago. For example,
> we didn't dispose of the active context in one of the tests (a bug in the
> test), the validation condition in a test was too strict because events may
> be received in a different order, and we depended on log messages in test
> handlers, which may receive events later than the driver does.
>
> The failing tests may be just test issues or may imply some system issues;
> we don't know yet. But as long as there is no obvious defect in the current
> code base, I would think it should not block the release.
>
> Thanks,
> Julia
>
> -----Original Message-----
> From: Markus Weimer [mailto:markus@weimo.de]
> Sent: Tuesday, April 11, 2017 9:26 AM
> To: REEF Developers Mailinglist <de...@reef.apache.org>
> Subject: [Discuss] Timely releases with known issues vs. rare issue-free
> releases
>
> Hi,
>
> the current saga of ever not fully completing integration tests reminded
> me that we never actually had a discussion about what our bar for a
release
> is.
>
>
> Our informal agreement right now seems to be that we want all the
> integration tests to finish all the time on our CI servers. I admire our
> dedication to high-quality releases, and don't want to distract from it.
> Over time, we must fix all of those issues and strive to provide the most
> stable software we are capable off.
>
> At the same time, I haven't had a test failure on any of my own machines
> when reviewing pull requests in a very long time. Which makes me wonder
> whether our CI servers just set us up for failure. I believe the failures
> on the CI servers are real, and point to interesting edge cases in REEF we
> haven't fully solved. Hence, we absolutely must investigate and fix them.
>
> However, I am not sure whether this needs to happen in a way that blocks
> the next release. Because there is a competing interest in timely
releases.
> Our last actual release, 0.15, was in May of 2016*.
> At that time, we did not have IMRU, we did not have a working group
> communications and a bunch of bug fixes were not in yet. Our current
> `master` is all around better than that release. Hence, we'd do our users
a
> service by making a release with a `known issues` section in the release
> notes. This also would help us get feedback on the current code from
actual
> users, as many won't (and shouldn't) use a developer version.
>
> In summary, we are faced with two opposing goals: (1) Fixing all the known
> issues before a release to make that release the best it can be and (2)
> release frequently to get our latest fixes and features out.
>
> What do you think? Which approach would you like us to follow?
>
> Thanks!
>
> Markus
>
>
> *: Let's just not talk about the disaster that is 0.15.1
>



--
Byung-Gon Chun

Re: [Discuss] Timely releases with known issues vs. rare issue-free releases

Posted by Byung-Gon Chun <bg...@gmail.com>.

I understand this concern. It's been about a month since we started to talk
about release 0.16. Also, our last release occurred a long time ago.

Mariia, Taegeon, Sergiy, and Julia have been working to fix test failure
issues for release 0.16. Since we heard from Julia, I hope to hear from
Mariia, Taegeon, and Sergiy as well. What're your thoughts?



On Wed, Apr 12, 2017 at 3:15 AM, Julia Wang (QIUHE) <
Qiuhe.Wang@microsoft.com.invalid> wrote:

> I totally agree with Markus's comments.
>
> If we have test failures that show some bugs in the system and impact the
> quality of the code, we should resolve them before a release.
>
> The few current transient test failures only happen on AppVeyor. They could
> be test issues that hit some edge scenarios, most likely related to the
> timing of events. I fixed some of them a couple of weeks ago. For example:
> we didn't dispose of the active context in one of the tests (a bug in the
> test); the validation condition in a test was too strict, as events may be
> received in a different order; and we depended on log messages in test
> handlers, which may receive events later than the driver does.
>
> The failing tests may be just test issues or may indicate system issues;
> we don't know yet. But as long as there is no obvious defect in the current
> code base, I think they should not block the release.
>
> Thanks,
> Julia



-- 
Byung-Gon Chun

RE: [Discuss] Timely releases with known issues vs. rare issue-free releases

Posted by "Julia Wang (QIUHE)" <Qi...@microsoft.com.INVALID>.

I totally agree with Markus's comments. 

If we have test failures that show some bugs in the system and impact the quality of the code, we should resolve them before a release.

The few current transient test failures only happen on AppVeyor. They could be test issues that hit some edge scenarios, most likely related to the timing of events. I fixed some of them a couple of weeks ago. For example: we didn't dispose of the active context in one of the tests (a bug in the test); the validation condition in a test was too strict, as events may be received in a different order; and we depended on log messages in test handlers, which may receive events later than the driver does.

The failing tests may be just test issues or may indicate system issues; we don't know yet. But as long as there is no obvious defect in the current code base, I think they should not block the release.

Thanks,
Julia

RE: [Discuss] Timely releases with known issues vs. rare issue-free releases

Posted by Shouheng Yi <sh...@microsoft.com.INVALID>.

Hi REEF devs,

I've been experiencing the same CI issues, while the build succeeds on my local machine. The errors seem random to me. I agree that we should take some time to investigate our build system so that we can pinpoint where the problems are: whether they are in the CI build environment or our code is broken in the first place.

Going back to Markus' point, I'm in favor of the second option: "release frequently to get our latest fixes and features out." The reasons are that we have a relatively agile community that allows us to work swiftly, the customers' feedback loop is shorter, and, last but not least, we can "move fast" (and hopefully not break things) to deliver more features to our users, and not invest more time if these features are not gaining any traction.

Best,
Shouheng
