Posted to dev@cloudstack.apache.org by Remi Bergsma <RB...@schubergphilis.com> on 2015/09/25 21:51:18 UTC

Blameless post mortem

Hi all,

This mail is intended to be blameless. We need to learn something from it. That's why I left out who exactly did what because it’s not relevant. There are multiple examples but it's about the why. Let's learn from this without blaming anyone.

We know we need automated testing. We have integration tests, but we are unable to run all of them on every Pull Request we receive. If we had that in place, it'd be much easier to spot errors, regressions and so on. It'd also be more rewarding to write more tests.

Unfortunately we're not there yet, so we need to do something else until we get there. If we do nothing, we know we will have many issues, because a master that breaks on a regular basis is one of the most frustrating things. We said we'd use Pull Requests, with at least two humans reviewing and giving their OK for each Pull Request in the form of an LGTM: Looks Good To Me. So the LGTMs are there because we have no automated testing. Keep that in mind: as reviewers, we are supposed to replace automated testing until it's there.

Since we started doing this, master has become a lot more stable. But every now and then we still have issues. Let's look at how we do manual reviews. Again, this is not to blame anyone. It's to open our eyes and make us realise what we're doing and what results we get out of that.


Example Pull Request #784: 
Title: CLOUDSTACK-8799 fixed the default routes

That's nice, it has a Jira id and a short description (as it should be).

The first person comes along and makes a comment:
"There was also an issue with VPC VRs" ... "Have you seen this issue? Does your change affects the VPC VR (single/redundant)?"

Actually a good question. Unfortunately, no answer came. After a reminder, it was promised that tests would be done against VPC networks. Great!

The Jenkins builds both succeed and Travis is green as well. But how much value does this have? They give the impression of automated testing, and although you could argue they provide some, it's far from complete. If a build breaks, you know you have an issue. But it doesn't work the other way around: a green build does not mean everything works.

Back to our example PR. In the meantime, another commit gets pushed to it: "CLOUDSTACK-8799 fixed for vpc networks." But if you look at the Jira issue, you see it is about redundant virtual routers. The non-VPC ones. So this is vague at best. Then a reviewer gives an LGTM because the person could create a VPC. That doesn't have anything to do with the problem being fixed in this PR, nor with the comments made earlier. But at least the person said what he did, and we should all do that. What nobody knew back then was that this broke the default route on VPCs.

Then something strange happens: the two commits from the PR end up on master as direct commits, with just one LGTM and no verification from the person who commented about the linked issue. This happened on Friday September 11th.

That day 21 commits came in, from 7 Pull Requests and unfortunately also from some direct commits. We noticed the direct commits and notified the list (http://cloudstack.markmail.org/message/srmszloyipkxml36). As a lot came in at the same time, it was decided not to revert them. Looking back, we should have done so.

From this point on, VPCs were broken as they wouldn't get a default route. So: no public internet access from VMs in VPC tiers, no VPNs working, etc. This was mentioned on the list on Thursday September 15th, after some chats and debugging over the weekend (http://cloudstack.markmail.org/message/73ulpu4p75ex24tc).
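
As a side note, a quick way to confirm this symptom is to look at the routing table on the VPC router VM itself. This is only a minimal sketch; the link-local address, ssh port and key path follow the usual CloudStack system-VM access convention and may differ in your setup:

  # from the hypervisor hosting the router VM, ssh to its link-local address
  ssh -i /root/.ssh/id_rsa.cloud -p 3922 root@169.254.x.x

  # on the router: a healthy VPC VR shows a default route via the public gateway
  ip route show default
  # in the broken case this prints nothing, so tier VMs have no way out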

Here we are: master is broken functionality-wise, and new Pull Requests come in to fix blockers. But we cannot verify they actually work, because VPCs are broken in master and therefore also in every PR branched off of it, with or without the change in the PR.

It starts to escalate as the days go by.

I’ll leave out the bit on how this frustrated people, although it’s good to realise we do not want to end up in this situation again.

Eventually Wilder and I spent an evening and a day working on a branch where we loaded 7 PRs on top of each other (all VR related), only to find the VPC was still broken. It allowed us to zoom in and find the default route was missing again. We knew it had worked 3 weeks before, because the same tests that succeeded back then were now failing. We had already fixed this in PR #738 on August 25, so we were sure about it.
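
For those curious how such a combined branch is built: roughly as sketched below, assuming a remote that exposes GitHub's read-only pull refs (the branch name and PR numbers are placeholders, not the actual seven):

  # throwaway integration branch, starting from current master
  git fetch origin
  git checkout -b vr-testing origin/master

  # pull in each PR via refs/pull/<id>/head and merge it on top
  for pr in 111 222 333; do            # placeholder PR numbers
    git fetch origin pull/${pr}/head:pr-${pr}
    git merge --no-ff pr-${pr}
  done

  # then build, deploy and run the VPC tests against this branch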

After some digging we could trace it back to Pull Request #784. Imagine the feeling of seeing your own comment there mentioning the previous issue with the default gateways. Fair to say our human review process clearly failed here. Many, many hours were spent on this problem over the past two weeks. Could we have prevented this from happening? I think so, yes.


This example clearly shows why:

- we should use Pull Requests
  It made the change visible: Great!

- we do reviews and ask for feedback
  We got feedback and questions: Also great!

- we should always respond to feedback and verify it is resolved, before merging
  We need to improve here. Even with two reviewers that say LGTM, we should still address any feedback before merging.

- we should have two humans doing a review
  We need to improve here as well. Not one reviewer, we need two. Really.

- we need to document why we say LGTM. 
  Another improvement. It’s nice to say LGTM, but a review of only 4 characters and nothing more is useless. We need to know what was tested and how. Test results, screenshots or anything else that shows what's been verified. If you only reviewed the code, that's also fine, but at least say so. Then the next reviewer can do another type of review to get the complete picture. Remember: you're replacing automated testing!

- we should always merge Pull Requests
  We made it easy, merging is the de facto standard, and it has even more benefits. You can trace commits back to their Pull Request (and find all comments and discussion there: saves time, trust me). It also allows for easier reverting of a Pull Request; see the git sketch after this list. We’ll see even more benefits once 4.7 is there. Although the intention was to merge this Pull Request, it still didn't happen. We should always check before we push. As committers we just need to be sure.

- we need automated testing!
  The sooner the better. It’s all about the missing automated testing. After 4.6, we all need to focus on this. It will save a lot of time, and a lot of frustration.
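
To make the merging point above concrete, here is a minimal sketch of what a committer could do, again assuming the remote exposes GitHub's pull refs (the commit message is only an example):

  # fetch the PR into a local branch and merge it with an explicit merge commit
  git fetch origin pull/784/head:pr-784
  git checkout master
  git merge --no-ff pr-784 -m "Merge pull request #784: CLOUDSTACK-8799 fixed the default routes"
  git push origin master

  # because it is a merge commit, the whole PR can later be reverted in one go
  git revert -m 1 <merge-commit-sha>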



We're doing final testing on PR #887 and will merge it soon. From that point on we can look into new issues. Be aware that any PR created after September 10 needs to be rebased onto current master once #887 is merged. Without that, no serious testing can be done.
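
For PR authors, rebasing onto the new master is a matter of the following (the branch name is a placeholder; only force-push your own PR branch):

  git fetch origin
  git checkout my-feature            # the branch behind your PR
  git rebase origin/master           # replay your commits on top of current master
  # resolve conflicts if any, re-run your tests, then update the PR
  git push --force-with-lease origin my-feature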

Let's be careful about what we land on master. I'll only be merging Pull Requests that have had proper reviews, with information on what was tested. At least one reviewer needs to actually verify it works (and show the rest of us). We simply cannot assume it will work.

If we do this, I think we can start resolving the remaining blockers one-by-one and go into the first RC round. Please help out where you can so we can make this a success together. Thanks!

Looking forward to the day we have our automated testing in place ;-)

Regards,
Remi


Re: Blameless post mortem

Posted by Sebastien Goasguen <ru...@gmail.com>.
> On Sep 28, 2015, at 7:22 AM, Sanjeev N <sa...@apache.org> wrote:
> 
> I have a concern here. Some of us are actively involved in reviewing the
> PRs related to marvin tests(Enhancing existing tests/Adding new tests). If
> we have to test a PR it requires an environment to be created with actual
> resources and this is going to take lot of time. Some of the tests can run
> on simulator but most of the tests require real hardware to test. PR
> submitter is already testing and submitting the test results along with the
> PR.

In lots of cases we don’t see those test results. 
We should make sure we do or at a minimum explain what tests we did.

> So is it require to test these PRs by reviewers?
> 

If you LGTM a PR, explain why and what tests you did.
Just “LGTM” is not enough.


Re: Blameless post mortem

Posted by Bharat Kumar <bh...@citrix.com>.
Hi Remi,

Thank you for the blameless post mortem.

I think there is a bigger problem here than just the review process and running tests. Even if we run the tests, we cannot be sure that everything will work as intended. The tests only give some level of confidence and may not cover all the use cases.

I think the problem here is the way major changes to the code base are dealt with. For example, the VR refactoring was done without discussing the design implications and the amount of change it would bring. I could not find any design document. The VR refactor changed a lot of code and the way the VR used to work, and in my opinion it was incomplete: VPN, isolated networks, basic networks, iptables rules, RVR in the isolated case, etc. were not implemented. Most of us are still in the process of understanding this. Even before reaching this state we had to spend a lot of time fixing the issues mentioned in the thread [Blocker/Critical] VR related Issues.

When a change of this magnitude is being made, we should call out all the changes and document them properly. This will help people create better fixes. Currently, when we attempt to fix the isolated VR case it affects the VPC case, and vice versa. For example, PR #738 fixed it for VPC networks but broke it for the isolated case. I believe it is not too late to at least start documenting the changes now.

Thanks,
Bharat.


Re: Blameless post mortem

Posted by Sanjeev N <sa...@apache.org>.
I have a concern here. Some of us are actively involved in reviewing the
PRs related to Marvin tests (enhancing existing tests / adding new tests). If
we have to test a PR, it requires an environment to be created with actual
resources, and this is going to take a lot of time. Some of the tests can run
on the simulator, but most of them require real hardware. The PR submitter is
already testing and submitting the test results along with the PR. So are
reviewers also required to test these PRs?


Re: Blameless post mortem

Posted by sebgoa <ru...@gmail.com>.
Remi, thanks for the detailed post-mortem, it's a good read and a great learning experience.
I hope everyone reads it.

The one thing to emphasize is that we now have a very visible way to get code into master, and we have folks investing time to provide reviews (great). We need the submitters to do due diligence and answer all comments in the reviews.

In another project I work on, nothing can be added to the code without unit tests. I think we could go down the route of asking for new integration tests and unit tests for any change; if not, the PR does not get merged. But let's digest your post-mortem and we can discuss after 4.6.0.

I see that you reverted one commit that was not made by you, that's great.

Let's focus on the blockers now, everything else can wait.

The big bonus of doing what we are doing is that once 4.6.0 is out, we can merge PRs again (assuming they are properly rebased and tested) and we can release 4.6.1 really quickly after.

-sebastien


RE: Blameless post mortem

Posted by Raja Pullela <ra...@citrix.com>.
Thanks for the update Remi!
