Posted to hdfs-dev@hadoop.apache.org by Steve Loughran <st...@hortonworks.com> on 2015/11/22 13:21:21 UTC

Jenkins stability and patching

Jenkins is pretty much dead in the water these days; a test run that works is a rare miracle rather than the default state. That also means most patches are being +1'd in even though their test runs are failing, with comments like "the test failures are probably unrelated".

I think everyone has to be grateful that I'm not volunteering to be release manager for 2.8, as if I were I'd have already imposed a block on any patches going in until Jenkins was stable. That is: nothing but test fixes would go in.

As it is, at least for the next couple of weeks, I'm going to experiment with reverting patches which break the build. Usually those breakages are being fixed, eventually, with follow-up patches. With a "patches which break the build get reverted" policy, whoever submitted that first patch gets to write the fix *and test it again*. This should encourage people to be more rigorous first time round.

1. Yes, I'm going to have to be ruthless and do this for myself too. Or others can. I'm not doing much (any?) core Hadoop coding right now, so I'm more isolated.
2. No, I don't plan to show favouritism: break the build and it gets rolled back.
3. We can review this in a week or two to see how it goes. And someone else can volunteer to keep Jenkins happy.
4. I'll get a smaller fix for HDFS-9263 in.
5. I've also started running Slider 0.90-SNAPSHOT test runs with Hadoop 2.8.0-SNAPSHOT, so I'm being the first to find problems beyond Jenkins. So far HADOOP-12050 is the first blocker. It went in in August, which shows we aren't doing enough cross-version testing beyond just Jenkins. That breakage (HADOOP-12587) is stopping my test code working against secure clusters. If I were being really harsh I'd have reverted that too, but it's been in long enough that I think a fix is probably the best solution.
6. Finally: everyone should feel free to fix tests. Don't be shy now!

Given that this is a US vacation week, it should be a quieter week for breakages.

Sorry, but if we can't even get Jenkins stable, then what hope do we have for a 2.8 release working?

-Steve

Re: Jenkins stability and patching

Posted by Steve Loughran <st...@hortonworks.com>.

> On 23 Nov 2015, at 21:57, Colin P. McCabe <cm...@apache.org> wrote:
> 
> On Mon, Nov 23, 2015 at 1:53 PM, Colin P. McCabe <cm...@apache.org> wrote:
>> I agree that our tests are in a bad state.  It would help if we could
>> maintain a list of "flaky tests" somewhere in git and have Yetus
>> consider the flakiness of a test before -1ing a patch.  Right now, we
>> pretty much all have that list in our heads, and we're not applying it
>> very consistently.  Having this list would also let us know where to
>> concentrate our efforts to fix things.
>> 
>> On Sun, Nov 22, 2015 at 4:21 AM, Steve Loughran <st...@hortonworks.com> wrote:
>>> 
>>> Jenkins is pretty much dead in the water these days; a test run that works is a rare miracle rather than the default state. Which also means most patches are being +1'd in even though patches are failing, with comments like "the test failures are probably unrelated"
>>> 
>>> 
>>> I think everyone has to be grateful that I'm not volunteering to be release manager for 2.8, as if I were i'd have already imposed a block on any patches going in until jenkins was stable. That is: nothing but test fixes would go in.
>>> 
>>> as it is, at least for the next couple of weeks, I'm going to experiment with reverting patches which break the build. Usually those breakages are being fixed, eventually, with followup patches. With a "patches which break the build get reverted" policy, whoever submitted that first patch gets to write the fix *and test it again*. This should encourage people to be more rigorous first time round.
>>> 
>>> 
>>>  1.  Yes, I'm going to have to be ruthless and do this for myself too. Or others can. I'm not doing much (any?) core hadoop coding right now, so more isolated.
>>>  2.  No, I don't plan to show favouritism: break the build and it gets rolled back.
>>>  3.  We can review this in a week or two  to see how it goes. And someone else can volunteer to keep jenkins happy.
>>>  4.  I'll get a smaller fix for HDFS-9263 in.
>>>  5.  I've also started running slider 0.90-SNAPSHOT test runs with Hadoop 2.8.0-SNAPSHOT, so I'm being the first to find problems beyond jenkins. So far HADOOP-12050 is the first blocker. It went in in August, which shows we aren't doing enough cross-version testing beyond just Jenkins. That breakage (HADOOP-12587) is stopping my test code working against secure clusters —if I was being really harsh I'd have reverted that too, but's been in long enough I think a fix is probably the best solution.
>> 
>> Well, this is already directly contradicting point #2, isn't it? :)
> 

Yes. I'm not happy about how that patch has broken some of my tests, though. Reverting it would benefit me, and I am sorely tempted, but I think starting with the recent commits is the way to gently adopt a stricter process.

> Just to be clear, I'm not trying to imply that this was favoritism (I
> don't think it was) but just that a revert is not always the right
> solution.  A short discussion usually helps to find the right
> solution, which could be a revert, a follow-on fix, or something else;
> 
> best,
> Colin


I think if a patch that goes in immediately causes a problem then it should be rolled back. Chris has just done that to HADOOP-12572 and the LZ4 codec upgrade; I think after a while people will just expect that as the outcome of anything going in which turns out to have problems in Jenkins or other downstream builds and tests. (My code is all ASF code here: people are free to check out branch develop from https://git-wip-us.apache.org/repos/asf/incubator-slider.git; build it with -Pbranch-2 and see what happens.)

To be really ambitious, we could think about having Jenkins builds for downstream projects, the way Apache Gump used to do for the entire Ant-based ASF stack. That still wouldn't catch the cross-version incompatibilities that I've hit this week, but it'd catch more immediate things like changed APIs, poms and dependencies.

Re: Jenkins stability and patching

Posted by "Colin P. McCabe" <cm...@apache.org>.

On Mon, Nov 23, 2015 at 1:53 PM, Colin P. McCabe <cm...@apache.org> wrote:
> I agree that our tests are in a bad state.  It would help if we could
> maintain a list of "flaky tests" somewhere in git and have Yetus
> consider the flakiness of a test before -1ing a patch.  Right now, we
> pretty much all have that list in our heads, and we're not applying it
> very consistently.  Having this list would also let us know where to
> concentrate our efforts to fix things.
>
> On Sun, Nov 22, 2015 at 4:21 AM, Steve Loughran <st...@hortonworks.com> wrote:
>>
>> Jenkins is pretty much dead in the water these days; a test run that works is a rare miracle rather than the default state. Which also means most patches are being +1'd in even though patches are failing, with comments like "the test failures are probably unrelated"
>>
>>
>> I think everyone has to be grateful that I'm not volunteering to be release manager for 2.8, as if I were i'd have already imposed a block on any patches going in until jenkins was stable. That is: nothing but test fixes would go in.
>>
>> as it is, at least for the next couple of weeks, I'm going to experiment with reverting patches which break the build. Usually those breakages are being fixed, eventually, with followup patches. With a "patches which break the build get reverted" policy, whoever submitted that first patch gets to write the fix *and test it again*. This should encourage people to be more rigorous first time round.
>>
>>
>>   1.  Yes, I'm going to have to be ruthless and do this for myself too. Or others can. I'm not doing much (any?) core hadoop coding right now, so more isolated.
>>   2.  No, I don't plan to show favouritism: break the build and it gets rolled back.
>>   3.  We can review this in a week or two  to see how it goes. And someone else can volunteer to keep jenkins happy.
>>   4.  I'll get a smaller fix for HDFS-9263 in.
>>   5.  I've also started running slider 0.90-SNAPSHOT test runs with Hadoop 2.8.0-SNAPSHOT, so I'm being the first to find problems beyond jenkins. So far HADOOP-12050 is the first blocker. It went in in August, which shows we aren't doing enough cross-version testing beyond just Jenkins. That breakage (HADOOP-12587) is stopping my test code working against secure clusters —if I was being really harsh I'd have reverted that too, but's been in long enough I think a fix is probably the best solution.
>
> Well, this is already directly contradicting point #2, isn't it? :)

Just to be clear, I'm not trying to imply that this was favoritism (I
don't think it was) but just that a revert is not always the right
solution.  A short discussion usually helps to find the right
solution, which could be a revert, a follow-on fix, or something else.

best,
Colin

>
> I am open to being more critical about patches going in, but I think
> we should have some very minimal discussion before reverting things.
> It's just polite.
>
> Colin
>
>
>>   6.  Finally: everyone should feel free to fix tests. Don't be shy now!
>>
>> Giving this is a US vacation week, it should be a quieter week for breakages.
>>
>> Sorry —but if we can't even get Jenkins stable, then what hope do we have for a 2.8 release working?
>>
>> -Steve
>>
>>

Re: Jenkins stability and patching

Posted by "Colin P. McCabe" <cm...@apache.org>.

I agree that our tests are in a bad state.  It would help if we could
maintain a list of "flaky tests" somewhere in git and have Yetus
consider the flakiness of a test before -1ing a patch.  Right now, we
pretty much all have that list in our heads, and we're not applying it
very consistently.  Having this list would also let us know where to
concentrate our efforts to fix things.
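
A minimal sketch of the flaky-list idea, assuming a plain-text file with one test class name per line and "#" for comments. The file format, the function names, the "0" vote and the test names in the usage example are all hypothetical; none of them are taken from Yetus.

```python
# Hypothetical sketch of a flaky-test filter; not an existing Yetus feature.

def load_flaky_tests(lines):
    """Parse a flaky-test list: one test class name per line, '#' starts a comment."""
    flaky = set()
    for line in lines:
        # Drop trailing comments and surrounding whitespace.
        name = line.split("#", 1)[0].strip()
        if name:
            flaky.add(name)
    return flaky


def vote(failed_tests, flaky_tests):
    """Return (vote, real_failures, flaky_failures) for one test run.

    "-1" if any failure is not on the flaky list,
    "0"  if every failure is a known flaky test,
    "+1" if nothing failed.
    """
    failed = set(failed_tests)
    real = sorted(failed - flaky_tests)
    flaky = sorted(failed & flaky_tests)
    if real:
        return "-1", real, flaky
    if flaky:
        return "0", real, flaky
    return "+1", real, flaky


if __name__ == "__main__":
    # Hypothetical list contents and test names, for illustration only.
    known = load_flaky_tests([
        "# tests we currently consider flaky",
        "TestSometimesTimesOut   # tracked in a JIRA",
    ])
    print(vote(["TestSometimesTimesOut", "TestNewFeature"], known))
```

Reporting the flaky failures separately, rather than hiding them, would keep the list visible as a to-do list of tests to fix.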

On Sun, Nov 22, 2015 at 4:21 AM, Steve Loughran <st...@hortonworks.com> wrote:
>
> Jenkins is pretty much dead in the water these days; a test run that works is a rare miracle rather than the default state. Which also means most patches are being +1'd in even though patches are failing, with comments like "the test failures are probably unrelated"
>
>
> I think everyone has to be grateful that I'm not volunteering to be release manager for 2.8, as if I were I'd have already imposed a block on any patches going in until jenkins was stable. That is: nothing but test fixes would go in.
>
> As it is, at least for the next couple of weeks, I'm going to experiment with reverting patches which break the build. Usually those breakages are being fixed, eventually, with followup patches. With a "patches which break the build get reverted" policy, whoever submitted that first patch gets to write the fix *and test it again*. This should encourage people to be more rigorous first time round.
>
>
>   1.  Yes, I'm going to have to be ruthless and do this for myself too. Or others can. I'm not doing much (any?) core hadoop coding right now, so more isolated.
>   2.  No, I don't plan to show favouritism: break the build and it gets rolled back.
>   3.  We can review this in a week or two  to see how it goes. And someone else can volunteer to keep jenkins happy.
>   4.  I'll get a smaller fix for HDFS-9263 in.
>   5.  I've also started running slider 0.90-SNAPSHOT test runs with Hadoop 2.8.0-SNAPSHOT, so I'm being the first to find problems beyond jenkins. So far HADOOP-12050 is the first blocker. It went in in August, which shows we aren't doing enough cross-version testing beyond just Jenkins. That breakage (HADOOP-12587) is stopping my test code working against secure clusters; if I was being really harsh I'd have reverted that too, but it's been in long enough that I think a fix is probably the best solution.

Well, this is already directly contradicting point #2, isn't it? :)

I am open to being more critical about patches going in, but I think
we should have some very minimal discussion before reverting things.
It's just polite.

Colin


>   6.  Finally: everyone should feel free to fix tests. Don't be shy now!
>
> Given this is a US vacation week, it should be a quieter week for breakages.
>
> Sorry —but if we can't even get Jenkins stable, then what hope do we have for a 2.8 release working?
>
> -Steve
>
>

Re: Jenkins stability and patching

Posted by Steve Loughran <st...@hortonworks.com>.

> On 22 Nov 2015, at 16:12, Tsuyoshi Ozawa <oz...@apache.org> wrote:
> 
> Thank you for starting discussion, Steve. It sounds good to me.  I'll
> check the test failures.

thx

Here's a start on one:

https://issues.apache.org/jira/browse/HADOOP-11149 TestZKFailoverController times out


> 
> - Tsuyoshi
> 
> On Sun, Nov 22, 2015 at 9:21 PM, Steve Loughran <st...@hortonworks.com> wrote:
>> 
>> Jenkins is pretty much dead in the water these days; a test run that works is a rare miracle rather than the default state. Which also means most patches are being +1'd in even though patches are failing, with comments like "the test failures are probably unrelated"
>> 
>> 
>> I think everyone has to be grateful that I'm not volunteering to be release manager for 2.8, as if I were I'd have already imposed a block on any patches going in until jenkins was stable. That is: nothing but test fixes would go in.
>> 
>> As it is, at least for the next couple of weeks, I'm going to experiment with reverting patches which break the build. Usually those breakages are being fixed, eventually, with followup patches. With a "patches which break the build get reverted" policy, whoever submitted that first patch gets to write the fix *and test it again*. This should encourage people to be more rigorous first time round.
>> 
>> 
>>  1.  Yes, I'm going to have to be ruthless and do this for myself too. Or others can. I'm not doing much (any?) core hadoop coding right now, so more isolated.
>>  2.  No, I don't plan to show favouritism: break the build and it gets rolled back.
>>  3.  We can review this in a week or two  to see how it goes. And someone else can volunteer to keep jenkins happy.
>>  4.  I'll get a smaller fix for HDFS-9263 in.
>>  5.  I've also started running slider 0.90-SNAPSHOT test runs with Hadoop 2.8.0-SNAPSHOT, so I'm being the first to find problems beyond jenkins. So far HADOOP-12050 is the first blocker. It went in in August, which shows we aren't doing enough cross-version testing beyond just Jenkins. That breakage (HADOOP-12587) is stopping my test code working against secure clusters; if I was being really harsh I'd have reverted that too, but it's been in long enough that I think a fix is probably the best solution.
>>  6.  Finally: everyone should feel free to fix tests. Don't be shy now!
>> 
>> Given this is a US vacation week, it should be a quieter week for breakages.
>> 
>> Sorry —but if we can't even get Jenkins stable, then what hope do we have for a 2.8 release working?
>> 
>> -Steve
>> 
>> 
>

Re: Jenkins stability and patching

Posted by Tsuyoshi Ozawa <oz...@apache.org>.

Thank you for starting discussion, Steve. It sounds good to me.  I'll
check the test failures.

- Tsuyoshi

On Sun, Nov 22, 2015 at 9:21 PM, Steve Loughran <st...@hortonworks.com> wrote:
>
> Jenkins is pretty much dead in the water these days; a test run that works is a rare miracle rather than the default state. Which also means most patches are being +1'd in even though patches are failing, with comments like "the test failures are probably unrelated"
>
>
> I think everyone has to be grateful that I'm not volunteering to be release manager for 2.8, as if I were I'd have already imposed a block on any patches going in until jenkins was stable. That is: nothing but test fixes would go in.
>
> As it is, at least for the next couple of weeks, I'm going to experiment with reverting patches which break the build. Usually those breakages are being fixed, eventually, with followup patches. With a "patches which break the build get reverted" policy, whoever submitted that first patch gets to write the fix *and test it again*. This should encourage people to be more rigorous first time round.
>
>
>   1.  Yes, I'm going to have to be ruthless and do this for myself too. Or others can. I'm not doing much (any?) core hadoop coding right now, so more isolated.
>   2.  No, I don't plan to show favouritism: break the build and it gets rolled back.
>   3.  We can review this in a week or two  to see how it goes. And someone else can volunteer to keep jenkins happy.
>   4.  I'll get a smaller fix for HDFS-9263 in.
>   5.  I've also started running slider 0.90-SNAPSHOT test runs with Hadoop 2.8.0-SNAPSHOT, so I'm being the first to find problems beyond jenkins. So far HADOOP-12050 is the first blocker. It went in in August, which shows we aren't doing enough cross-version testing beyond just Jenkins. That breakage (HADOOP-12587) is stopping my test code working against secure clusters; if I was being really harsh I'd have reverted that too, but it's been in long enough that I think a fix is probably the best solution.
>   6.  Finally: everyone should feel free to fix tests. Don't be shy now!
>
> Given this is a US vacation week, it should be a quieter week for breakages.
>
> Sorry —but if we can't even get Jenkins stable, then what hope do we have for a 2.8 release working?
>
> -Steve
>
>
