Posted to common-dev@hadoop.apache.org by Akira Ajisaka <aa...@apache.org> on 2020/10/22 19:14:12 UTC

Fixing flaky tests in Apache Hadoop

Hi Hadoop developers,

There are now a lot of failing unit tests, and there is a JIRA issue to
tackle this bad situation:
https://issues.apache.org/jira/browse/HDFS-15646

Although the issue is filed in the HDFS project, it concerns all
Hadoop developers. Please check the above URL, read the
description, and volunteer to dedicate more time to fixing flaky tests.
Your contribution to fixing the flaky tests will be really
appreciated!

Thank you Ahmed Hussein for your report.

Regards,
Akira

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-dev-help@hadoop.apache.org


Re: Fixing flaky tests in Apache Hadoop

Posted by Ahmed Hussein <a...@ahussein.me>.
>
> 2. Dedicate a community "Bug Bash Day" / "Fix it Day". We had a bug bash
> day two years ago, and maybe it's time to repeat it again:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75965105
>  this
> is going to be tricky as we are in a pandemic and most of the community are
> working from home, unlike the last time when we can lock ourselves in a
> conference room and force everybody to work :)


How about the following idea:

We set a monthly window during which only unit test fixes can be merged.
Any other commit that is not directly linked to JUnit test failures
would be blocked until the end of this "Bug-Window".
For example, we set "Bug-days" to be from the 25th to the 31st of each
month. All commits during those days are meant to fix and improve the
testing environment.

Any thoughts?

On Thu, Oct 22, 2020 at 11:53 PM Ahmed Hussein <a...@ahussein.me> wrote:

> Thank you Akira and Wei-Chiu.
> IMHO, the situation is more than just flaky tests. It has more depth:
> - Every developer stays committed to keep the code healthy.
> - Those flaky tests are actually "*bugs*" that need to be fixed. It is
> evident that there is a major problem in handling
>   the resources as I will explain below.
>
> 1. Other projects such as HBase have a tool to exclude flaky tests from
>> being executed. They track flaky tests and display them in a dashboard.
>> This will allow good tests to pass while leaving time for folks to fix
>> them. Or we could manually exclude tests (this is what we used to do at
>> Cloudera)
>>
>
> I like the idea of having a tool that gives a view of broken tests.
>
>  I spent a long time converting HDFS flaky tests into sub-tasks under
> HDFS-15646 <https://issues.apache.org/jira/browse/HDFS-15646>. I
> believe there are still tons
> on the loose.
> I remember I explored a tool called DeFlaker
> <https://www.jonbell.net/icse18-deflaker.pdf> which detects flaky tests.
> Then it reruns the tests to verify that they still
> pass.
>
> I do not think we necessarily want to exclude the flaky tests, but at
> least they should be enumerated and addressed
> regularly because they are after all "bugs". Having few flaky tests that
> cause everything to blow up indicates
> that there is a major problem with handling resources.
> I pointed out this issue in YARN-10334
> <https://issues.apache.org/jira/browse/YARN-10334> where I found that
> TestDistributedShell is nothing but a black hole that sucks up
> all the resources (memory/CPU/ports).
> As another example, I ran a few unit tests on my local machine. In less
> than an hour, I found that there are 6 java
> processes still listening to ports.
>
> The point is flaky tests should not be *undermined* for such a long time
> as they could be indicators of a serious bug.
> In this current situation, we should find what is eating all those
> resources.
>
> 2. Dedicate a community "Bug Bash Day" / "Fix it Day". We had a bug bash
>> day two years ago, and maybe it's time to repeat it again:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75965105
>>  this
>> is going to be tricky as we are in a pandemic and most of the community
>> are
>> working from home, unlike the last time when we can lock ourselves in a
>> conference room and force everybody to work :)
>>
> This sounds fun and I like it actually but I doubt it is feasible to apply
> :)
>
> I also wondered if the hardware was too stressed since all Hadoop related
>> projects all use the same set of Jenkins servers.
>> However, HBase just recently moved to their own dedicated machines, so I'm
>> actually surprised to see a lot of resource related failures even now.
>>
> As I mentioned in my response on the first point, a black-hole is created
> once the tests are triggered.
> I could not even run TestDistributedShell on my local machine. The tests
> run out of everything after the first 11 unit tests.
> It takes only one unit test failure to break the rest.
>
> On Thu, Oct 22, 2020 at 5:28 PM Wei-Chiu Chuang <we...@apache.org>
> wrote:
>
>> I also wondered if the hardware was too stressed since all Hadoop related
>> projects all use the same set of Jenkins servers.
>> However, HBase just recently moved to their own dedicated machines, so I'm
>> actually surprised to see a lot of resource related failures even now.
>>
>> On Thu, Oct 22, 2020 at 2:03 PM Wei-Chiu Chuang <we...@apache.org>
>> wrote:
>>
>> > Thanks for raising the issue, Akira and Ahmed,
>> >
>> > Fixing flaky tests is a thankless job so I want to take this opportunity
>> > to recognize the time and effort.
>> >
>> > We will always have flaky tests due to bad tests or simply infra issues.
>> > Fixing flaky tests will take time but if they are not addressed it
>> wastes
>> > everybody's time.
>> >
>> > Recognizing this problem, I have two suggestions:
>> >
>> > 1. Other projects such as HBase have a tool to exclude flaky tests from
>> > being executed. They track flaky tests and display them in a dashboard.
>> > This will allow good tests to pass while leaving time for folks to fix
>> > them. Or we could manually exclude tests (this is what we used to do at
>> > Cloudera)
>> >
>> > 2. Dedicate a community "Bug Bash Day" / "Fix it Day". We had a bug bash
>> > day two years ago, and maybe it's time to repeat it again:
>> >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75965105
>> this
>> > is going to be tricky as we are in a pandemic and most of the community
>> are
>> > working from home, unlike the last time when we can lock ourselves in a
>> > conference room and force everybody to work :)
>> >
>> > Thoughts?
>> >
>> >
>> > On Thu, Oct 22, 2020 at 12:14 PM Akira Ajisaka <aa...@apache.org>
>> > wrote:
>> >
>> >> Hi Hadoop developers,
>> >>
>> >> Now there are a lot of failing unit tests and there is an issue to
>> >> tackle this bad situation.
>> >> https://issues.apache.org/jira/browse/HDFS-15646
>> >>
>> >> Although this issue is in HDFS project, this issue is related to all
>> >> the Hadoop developers. Please check the above URL, read the
>> >> description, and volunteer to dedicate more time to fix flaky tests.
>> >> Your contribution to fixing the flaky tests will be really
>> >> appreciated!
>> >>
>> >> Thank you Ahmed Hussein for your report.
>> >>
>> >> Regards,
>> >> Akira
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
>> >> For additional commands, e-mail: yarn-dev-help@hadoop.apache.org
>> >>
>> >>
>>
>
>
> --
> Best Regards,
>
> *Ahmed Hussein, PhD*
>


-- 
Best Regards,

*Ahmed Hussein, PhD*

Re: Fixing flaky tests in Apache Hadoop

Posted by Ahmed Hussein <a...@ahussein.me>.
Thank you Akira and Wei-Chiu.
IMHO, the situation is more than just flaky tests. It has more depth:
- Every developer stays committed to keeping the code healthy.
- Those flaky tests are actually "*bugs*" that need to be fixed. It is
evident that there is a major problem in handling
  resources, as I will explain below.

1. Other projects such as HBase have a tool to exclude flaky tests from
> being executed. They track flaky tests and display them in a dashboard.
> This will allow good tests to pass while leaving time for folks to fix
> them. Or we could manually exclude tests (this is what we used to do at
> Cloudera)
>

I like the idea of having a tool that gives a view of broken tests.
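
To illustrate the "exclude but keep tracking" idea in JUnit 4 terms, a
minimal sketch of a rule that skips tests listed in a plain-text exclusion
file could look like the following (the rule name, file location, and entry
format are made up for illustration; this is not HBase's actual tooling):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

import org.junit.Assume;
import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

/**
 * Hypothetical rule: skips (rather than fails) any test whose
 * "fully.qualified.ClassName#method" id appears in an exclusion file,
 * so known-flaky tests stay visible in one tracked list.
 */
public class KnownFlakyRule implements TestRule {
  private final Set<String> excluded;

  public KnownFlakyRule(Path exclusionFile) {
    Set<String> entries = new HashSet<>();
    try {
      if (Files.exists(exclusionFile)) {
        entries.addAll(Files.readAllLines(exclusionFile));
      }
    } catch (IOException e) {
      // If the list cannot be read, run everything rather than hide failures.
    }
    this.excluded = Collections.unmodifiableSet(entries);
  }

  @Override
  public Statement apply(Statement base, Description description) {
    String id = description.getClassName() + "#" + description.getMethodName();
    return new Statement() {
      @Override
      public void evaluate() throws Throwable {
        // assumeFalse() marks the test as skipped instead of failed.
        Assume.assumeFalse("Known flaky test, tracked in JIRA: " + id,
            excluded.contains(id));
        base.evaluate();
      }
    };
  }
}

A test class would attach it with something like
@Rule public KnownFlakyRule flaky = new KnownFlakyRule(Paths.get("target/flaky-tests.txt"));
so the exclusion list stays in one reviewable place.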

I spent a long time converting HDFS flaky tests into sub-tasks under
HDFS-15646 <https://issues.apache.org/jira/browse/HDFS-15646>. I
believe there are still tons on the loose.
I remember exploring a tool called DeFlaker
<https://www.jonbell.net/icse18-deflaker.pdf>, which detects flaky tests
and then reruns them to verify that they still pass.
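
As a rough sketch of that rerun-to-verify step (the generic retry idea
only, not DeFlaker itself; the class name and the plain System.err report
are invented for illustration):

import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

/**
 * Hypothetical retry rule: a test that fails and then passes on a re-run
 * is reported as flaky instead of silently passing or hard-failing;
 * a test that fails on every attempt is rethrown as a genuine failure.
 */
public class FlakyDetectingRule implements TestRule {
  private final int retries;

  public FlakyDetectingRule(int retries) {
    this.retries = retries;
  }

  @Override
  public Statement apply(Statement base, Description description) {
    return new Statement() {
      @Override
      public void evaluate() throws Throwable {
        Throwable firstFailure = null;
        for (int attempt = 0; attempt <= retries; attempt++) {
          try {
            base.evaluate();
            if (firstFailure != null) {
              // Failed earlier but passed now: classic flaky behaviour.
              System.err.println("FLAKY: " + description
                  + " passed on attempt " + (attempt + 1)
                  + " after: " + firstFailure);
            }
            return;
          } catch (Throwable t) {
            if (firstFailure == null) {
              firstFailure = t;
            }
          }
        }
        // Failed on every attempt: treat it as a genuine failure.
        throw firstFailure;
      }
    };
  }
}

Maven Surefire's rerunFailingTestsCount option gives a similar rerun at the
build level if retrying inside the JVM is not desirable.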

I do not think we necessarily want to exclude the flaky tests, but at least
they should be enumerated and addressed regularly because they are, after
all, "bugs". Having a few flaky tests that cause everything to blow up
indicates that there is a major problem with handling resources.
I pointed out this issue in YARN-10334
<https://issues.apache.org/jira/browse/YARN-10334>, where I found that
TestDistributedShell is nothing but a black hole that sucks up
all the resources (memory/CPU/ports).
As another example, I ran a few unit tests on my local machine. In less
than an hour, I found that there were 6 Java processes still listening
on ports.
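
A cheap guard against that kind of leak is a teardown check. A minimal
JUnit 4 sketch, assuming a known fixed test port (the rule name and the
port are hypothetical, and it only reports leaks rather than failing the
test):

import java.io.IOException;
import java.net.ServerSocket;
import java.util.Map;

import org.junit.rules.ExternalResource;

/**
 * Hypothetical teardown guard: after each test it reports non-daemon
 * threads that appeared during the test and are still alive, and checks
 * that a well-known test port can be bound again, so leaks show up at
 * the offending test instead of as mysterious failures later in the run.
 */
public class LeakCheckRule extends ExternalResource {
  private final int portToCheck;
  private Map<Thread, StackTraceElement[]> threadsBefore;

  public LeakCheckRule(int portToCheck) {
    this.portToCheck = portToCheck;
  }

  @Override
  protected void before() {
    threadsBefore = Thread.getAllStackTraces();
  }

  @Override
  protected void after() {
    // Report threads created during the test that were never shut down.
    for (Thread t : Thread.getAllStackTraces().keySet()) {
      if (!threadsBefore.containsKey(t) && t.isAlive() && !t.isDaemon()) {
        System.err.println("Leaked thread after test: " + t.getName());
      }
    }
    // A port that cannot be re-bound usually means a server started by the
    // test (e.g. a mini cluster daemon) was never stopped.
    try (ServerSocket probe = new ServerSocket(portToCheck)) {
      // Port was released; nothing to report.
    } catch (IOException e) {
      System.err.println("Port " + portToCheck + " still in use after test: " + e);
    }
  }
}

Attached with @Rule public LeakCheckRule leaks = new LeakCheckRule(54321);
(the port is just an example), it at least names the test that left a port
busy or a thread running.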

The point is that flaky tests should not be *underestimated* for such a
long time, as they could be indicators of a serious bug.
In the current situation, we should find out what is eating all those
resources.

2. Dedicate a community "Bug Bash Day" / "Fix it Day". We had a bug bash
> day two years ago, and maybe it's time to repeat it again:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75965105
>  this
> is going to be tricky as we are in a pandemic and most of the community are
> working from home, unlike the last time when we can lock ourselves in a
> conference room and force everybody to work :)
>
This sounds fun, and I actually like it, but I doubt it is feasible to
apply :)

I also wondered if the hardware was too stressed since all Hadoop related
> projects all use the same set of Jenkins servers.
> However, HBase just recently moved to their own dedicated machines, so I'm
> actually surprised to see a lot of resource related failures even now.
>
As I mentioned in my response to the first point, a black hole is created
once the tests are triggered.
I could not even run TestDistributedShell on my local machine. The tests
run out of everything after the first 11 unit tests.
It takes only one unit test failure to break the rest.

On Thu, Oct 22, 2020 at 5:28 PM Wei-Chiu Chuang <we...@apache.org> wrote:

> I also wondered if the hardware was too stressed since all Hadoop related
> projects all use the same set of Jenkins servers.
> However, HBase just recently moved to their own dedicated machines, so I'm
> actually surprised to see a lot of resource related failures even now.
>
> On Thu, Oct 22, 2020 at 2:03 PM Wei-Chiu Chuang <we...@apache.org>
> wrote:
>
> > Thanks for raising the issue, Akira and Ahmed,
> >
> > Fixing flaky tests is a thankless job so I want to take this opportunity
> > to recognize the time and effort.
> >
> > We will always have flaky tests due to bad tests or simply infra issues.
> > Fixing flaky tests will take time but if they are not addressed it wastes
> > everybody's time.
> >
> > Recognizing this problem, I have two suggestions:
> >
> > 1. Other projects such as HBase have a tool to exclude flaky tests from
> > being executed. They track flaky tests and display them in a dashboard.
> > This will allow good tests to pass while leaving time for folks to fix
> > them. Or we could manually exclude tests (this is what we used to do at
> > Cloudera)
> >
> > 2. Dedicate a community "Bug Bash Day" / "Fix it Day". We had a bug bash
> > day two years ago, and maybe it's time to repeat it again:
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75965105
> this
> > is going to be tricky as we are in a pandemic and most of the community
> are
> > working from home, unlike the last time when we can lock ourselves in a
> > conference room and force everybody to work :)
> >
> > Thoughts?
> >
> >
> > On Thu, Oct 22, 2020 at 12:14 PM Akira Ajisaka <aa...@apache.org>
> > wrote:
> >
> >> Hi Hadoop developers,
> >>
> >> Now there are a lot of failing unit tests and there is an issue to
> >> tackle this bad situation.
> >> https://issues.apache.org/jira/browse/HDFS-15646
> >>
> >> Although this issue is in HDFS project, this issue is related to all
> >> the Hadoop developers. Please check the above URL, read the
> >> description, and volunteer to dedicate more time to fix flaky tests.
> >> Your contribution to fixing the flaky tests will be really
> >> appreciated!
> >>
> >> Thank you Ahmed Hussein for your report.
> >>
> >> Regards,
> >> Akira
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
> >> For additional commands, e-mail: yarn-dev-help@hadoop.apache.org
> >>
> >>
>


-- 
Best Regards,

*Ahmed Hussein, PhD*

Re: Fixing flaky tests in Apache Hadoop

Posted by Wei-Chiu Chuang <we...@apache.org>.
I also wondered if the hardware was too stressed, since all Hadoop-related
projects use the same set of Jenkins servers.
However, HBase just recently moved to its own dedicated machines, so I'm
actually surprised to see a lot of resource-related failures even now.

On Thu, Oct 22, 2020 at 2:03 PM Wei-Chiu Chuang <we...@apache.org> wrote:

> Thanks for raising the issue, Akira and Ahmed,
>
> Fixing flaky tests is a thankless job so I want to take this opportunity
> to recognize the time and effort.
>
> We will always have flaky tests due to bad tests or simply infra issues.
> Fixing flaky tests will take time but if they are not addressed it wastes
> everybody's time.
>
> Recognizing this problem, I have two suggestions:
>
> 1. Other projects such as HBase have a tool to exclude flaky tests from
> being executed. They track flaky tests and display them in a dashboard.
> This will allow good tests to pass while leaving time for folks to fix
> them. Or we could manually exclude tests (this is what we used to do at
> Cloudera)
>
> 2. Dedicate a community "Bug Bash Day" / "Fix it Day". We had a bug bash
> day two years ago, and maybe it's time to repeat it again:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75965105 this
> is going to be tricky as we are in a pandemic and most of the community are
> working from home, unlike the last time when we can lock ourselves in a
> conference room and force everybody to work :)
>
> Thoughts?
>
>
> On Thu, Oct 22, 2020 at 12:14 PM Akira Ajisaka <aa...@apache.org>
> wrote:
>
>> Hi Hadoop developers,
>>
>> Now there are a lot of failing unit tests and there is an issue to
>> tackle this bad situation.
>> https://issues.apache.org/jira/browse/HDFS-15646
>>
>> Although this issue is in HDFS project, this issue is related to all
>> the Hadoop developers. Please check the above URL, read the
>> description, and volunteer to dedicate more time to fix flaky tests.
>> Your contribution to fixing the flaky tests will be really
>> appreciated!
>>
>> Thank you Ahmed Hussein for your report.
>>
>> Regards,
>> Akira
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
>> For additional commands, e-mail: yarn-dev-help@hadoop.apache.org
>>
>>

Re: Fixing flaky tests in Apache Hadoop

Posted by Wei-Chiu Chuang <we...@apache.org>.
Thanks for raising the issue, Akira and Ahmed,

Fixing flaky tests is a thankless job, so I want to take this opportunity
to recognize the time and effort.

We will always have flaky tests due to bad tests or simply infra issues.
Fixing flaky tests will take time, but if they are not addressed, they
waste everybody's time.

Recognizing this problem, I have two suggestions:

1. Other projects such as HBase have a tool to exclude flaky tests from
being executed. They track flaky tests and display them in a dashboard.
This allows good tests to pass while leaving time for folks to fix the
flaky ones. Or we could manually exclude tests (this is what we used to
do at Cloudera); a rough sketch of what that could look like is below.

2. Dedicate a community "Bug Bash Day" / "Fix it Day". We had a bug bash
day two years ago, and maybe it's time to do it again:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75965105
This is going to be tricky, as we are in a pandemic and most of the
community is working from home, unlike last time, when we could lock
ourselves in a conference room and force everybody to work :)
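
For illustration only, here is a minimal sketch of the "manually exclude"
option, assuming a per-module plain-text list of known flaky tests (the
file name and wiring are hypothetical, not an existing Hadoop or HBase
convention). Maven Surefire's excludesFile parameter reads one exclude
pattern per line:

    <!-- hypothetical addition to a module's pom.xml -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <!-- known-flaky-tests.txt holds patterns such as
             **/TestDistributedShell.java, one per line -->
        <excludesFile>${project.basedir}/known-flaky-tests.txt</excludesFile>
      </configuration>
    </plugin>

The excluded tests would still need to be tracked (for example in the
dashboard from suggestion 1) so the list shrinks over time instead of
becoming a dumping ground.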

Thoughts?


On Thu, Oct 22, 2020 at 12:14 PM Akira Ajisaka <aa...@apache.org> wrote:

> Hi Hadoop developers,
>
> Now there are a lot of failing unit tests, and there is an issue filed
> to tackle this bad situation:
> https://issues.apache.org/jira/browse/HDFS-15646
>
> Although this issue is filed in the HDFS project, it concerns all
> Hadoop developers. Please check the above URL, read the description,
> and volunteer to dedicate more time to fixing flaky tests.
> Your contribution to fixing the flaky tests will be really
> appreciated!
>
> Thank you Ahmed Hussein for your report.
>
> Regards,
> Akira
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
> For additional commands, e-mail: yarn-dev-help@hadoop.apache.org
>
>
