Posted to dev@hbase.apache.org by Stack <st...@duboce.net> on 2013/07/22 20:49:42 UTC

Getting unit tests to pass

Below is the state of HBase 0.95/trunk unit tests (including a little
taxonomy of test-failure types).

On Andrew's ec2 build box, 0.95 is passing most of the time:

http://54.241.6.143/job/HBase-0.95/
http://54.241.6.143/job/HBase-0.95-Hadoop-2/

It is not as good on the Apache build boxes, but it is getting better:

https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95/
https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95-on-hadoop2/

On Apache, I have seen loads up in the 500s and all file descriptors used,
according to the little resources report printed at the end of each test.
If these numbers are to be believed (TBD), we may never achieve a 100%
pass rate on Apache builds.

Andrew's ec2 builds run the integration tests too, where the apache builds
do not -- sometimes we'll fail an integration test run, which makes the
Andrew ec2 red/green ratio look worse than it actually is.

Trunk builds lag.  They are being worked on.

We seem to be over the worst of the flaky unit tests.  We have a few
stragglers still, but they are being hunted down by the likes of the
merciless Jimmy Xiang and Jeffrey Zhong.

The "zombies" have mostly been nailed too (where "zombies" are tests that
refuse to die, continuing on after the suite has completed and causing the
build to fail).  The zombie trap from test-patch.sh was ported over to the
apache and ec2 builds, and it caught the last of the undying.
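The trap boils down to looking for surefire JVMs that outlive the suite.
A rough shell sketch of the idea (not the actual test-patch.sh code; the
process-name match is an assumption based on surefire's forked-JVM naming):

```shell
# count_zombies: given `jps -v`-style output on stdin, count test JVMs
# that are still alive after the suite (surefire forks typically show
# "surefirebooter" in their command line).
count_zombies() {
  grep -c surefirebooter || true
}

# Example: a listing with one leftover fork plus the jps process itself.
printf '4242 surefirebooter3141.jar -ea\n7001 Jps\n' | count_zombies
# → 1
```

A build wrapper would fail the run whenever the count comes back nonzero.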

We are now into a new phase where "all" tests pass but the build still
fails.  Here is an example:
http://54.241.6.143/job/HBase-TRUNK/429/org.apache.hbase$hbase-server/
The only clue I have to go on is that when we fail, the number of tests
run is less than the total shown for a successful run.

Unless anyone has a better idea for figuring out why the hang, I compare
the list of tests that show in a good run vs. those of a bad run.  Tests
that are in the good run but missing from the bad run are deemed suspect.
In the absence of other evidence or other ideas, I am blaming these
"invisibles" for the build failures.
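The comparison can be done mechanically against the two console logs — a
sketch (the log filenames and contents here are stand-ins, not real runs):

```shell
# Pull the "Running <test class>" lines out of a good and a bad console
# log, then list classes that ran in the good run but not the bad one.
printf 'Running o.a.h.TestA\nRunning o.a.h.TestB\n' > good-run.log  # stand-in log
printf 'Running o.a.h.TestA\n' > bad-run.log                        # stand-in log
grep '^Running ' good-run.log | sort > good-tests.txt
grep '^Running ' bad-run.log  | sort > bad-tests.txt
comm -23 good-tests.txt bad-tests.txt   # suspects: only in the good run
```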

Here is an example:

This is a good 0.95 hadoop2 run (notice how we are running integration
tests tooooo and they succeed!!  On hadoop2!!!!):

http://54.241.6.143/job/HBase-0.95-Hadoop-2/669/

In hbase-server module:

Tests run: 1491, Failures: 0, Errors: 0, Skipped: 19


This is a bad run:

http://54.241.6.143/job/HBase-0.95-Hadoop-2/668/

Tests run: 1458, Failures: 0, Errors: 0, Skipped: 18


If I compare tests, the successful run has:

> Running org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed


... where the bad run does not show the above test.
TestHLogSplitCompressed has 34 tests, one of which is disabled, so that
would seem to account for the discrepancy.
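The arithmetic checks out: the good run ran 33 more tests than the bad
one, and a 34-test class with one test disabled contributes exactly 33.

```shell
# Check that the missing TestHLogSplitCompressed class accounts for the
# difference in test counts between the good and bad runs.
good=1491; bad=1458
class_tests=34; disabled=1
echo "missing=$((good - bad)) contributed=$((class_tests - disabled))"
# → missing=33 contributed=33
```

The skipped counts (19 vs. 18) fit too, if the disabled test is the one
reported as skipped in the good run.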

I've started to disable tests that fail like this, putting them aside for
the original authors or the interested to take a look at why they fail
occasionally.  I put them aside so we can enjoy passing builds in the
meantime.  I've already moved aside or disabled a few tests and test
classes:

TestMultiTableInputFormat
TestReplicationKillSlaveRS
TestHCM.testDeleteForZKConnLeak was disabled

... and a few others.
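To see what is currently set aside, one can grep the test tree for JUnit's
@Ignore annotation — a sketch against a stand-in source tree, since the
real module paths vary:

```shell
# List test source files carrying a JUnit @Ignore annotation.
mkdir -p demo/src/test/java
printf '@Ignore\npublic class TestFlaky {}\n' > demo/src/test/java/TestFlaky.java
grep -rl '@Ignore' demo/src/test/java
# → demo/src/test/java/TestFlaky.java
```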

Finally (if you are still reading), I would suggest that test failures in
hadoopqa are now more worthy of investigation.  Illustrative is what
happened recently around "HBASE-8983 HBaseConnection#deleteAllConnections",
where the patch had +1s and, on its first run, a unit test failed (though
it passed locally).  The second run obscured the first run's failure.
After digging by another contributor, it turned out the patch had actually
broken that first test (though it looked unrelated).  I would suggest that
now that tests are healthier, test failures are worth paying more
attention to.

Yours,
St.Ack

Re: Getting unit tests to pass

Posted by Stack <st...@duboce.net>.
On Mon, Jul 22, 2013 at 11:54 PM, Lars Francke <la...@gmail.com> wrote:

> Slightly related, sorry for hijacking: I can't get HBase trunk to
> build. In particular TestHCM.testClusterStatus always fails for me. I
> tried on my own Jenkins as well as my IDE (IntelliJ) with the same
> result (two different machines, CentOS & Mac OS).
>
> mvn -U -PrunAllTests -Dmaven.test.redirectTestOutputToFile=true
> -Dit.test=noItTest clean install
> <http://pastebin.com/upFjq09A>
>
> From my MacBook's command line I got the test to pass using the same
> command but not in Jenkins or from IntelliJ.
>
> I'm happy to post in a new thread if this is distracting and no one
> else has seen this before.
>
> Any ideas?
>

Open an issue and paste the output, I'd say, LarsF.  I notice that Francis
has had trouble getting this to pass recently too, over in this issue:
https://issues.apache.org/jira/browse/HBASE-8015.  It seems to pass on our
Jenkins: http://54.241.6.143/job/HBase-TRUNK/ and here:
https://builds.apache.org/view/H-L/view/HBase/job/HBase-TRUNK/  IIRC, I've
seen it fail too; I just lost its failure detail in the rain of general
test failures (now a drizzle).
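To chase it in isolation, it is usually enough to run just the one test
method with surefire's -Dtest filter (the Class#method form needs a
reasonably recent surefire):

```shell
# Run only the failing test method, redirecting output to file as in the
# full command quoted above.
mvn test -Dtest=TestHCM#testClusterStatus \
    -Dmaven.test.redirectTestOutputToFile=true
```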

Thanks Lars,
St.Ack

Re: Getting unit tests to pass

Posted by Lars Francke <la...@gmail.com>.
Slightly related, sorry for hijacking: I can't get HBase trunk to
build. In particular TestHCM.testClusterStatus always fails for me. I
tried on my own Jenkins as well as my IDE (IntelliJ) with the same
result (two different machines, CentOS & Mac OS).

mvn -U -PrunAllTests -Dmaven.test.redirectTestOutputToFile=true
-Dit.test=noItTest clean install
<http://pastebin.com/upFjq09A>

From my MacBook's command line I got the test to pass using the same
command but not in Jenkins or from IntelliJ.

I'm happy to post in a new thread if this is distracting and no one
else has seen this before.

Any ideas?

Thanks,
Lars


Re: Getting unit tests to pass

Posted by Stack <st...@duboce.net>.
nvm.  I read the ResourceChecker code.  It is just printing out befores
and afters, so my speculation that we are up against fd limits is just off.

Back to figuring out why tests fail at random....

St.Ack



Re: Getting unit tests to pass

Posted by Stack <st...@duboce.net>.
Here is another from tail of
https://issues.apache.org/jira/browse/HBASE-5995

2013-07-23 01:23:29,574 INFO  [pool-1-thread-1] hbase.ResourceChecker(171):
after: regionserver.wal.TestLogRolling#testLogRollOnPipelineRestart
Thread=39 (was 31) - Thread LEAK? -, OpenFileDescriptor=312 (was 272) -
OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000),
SystemLoadAverage=351 (was 368), ProcessCount=144 (was 142) - ProcessCount
LEAK? -, AvailableMemoryMB=906 (was 1995), ConnectionCount=0 (was 0)

This one showed up as a zombie too; stuck.
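When eyeballing many of these lines, the before/after pairs can be pulled
out mechanically — a sketch (field names as in the report above; the
helper name is made up):

```shell
# fd_delta: read a ResourceChecker "after:" line on stdin and print the
# growth in open file descriptors (after minus before).
fd_delta() {
  sed -n 's/.*OpenFileDescriptor=\([0-9][0-9]*\) (was \([0-9][0-9]*\)).*/\1 \2/p' |
    awk '{ print $1 - $2 }'
}

echo 'OpenFileDescriptor=312 (was 272) - OpenFileDescriptor LEAK? -' | fd_delta
# → 40
```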

Or here, https://builds.apache.org/view/H-L/view/HBase/job/HBase-TRUNK/,
where we'd had a nice run of passing tests, all of a sudden a test that
I've not seen fail before fails:

https://builds.apache.org/view/H-L/view/HBase/job/HBase-TRUNK/4282/

org.apache.hadoop.hbase.master.TestActiveMasterManager.testActiveMasterManagerFromZK

Near the end of the test, the resource checker reports:

 - Thread LEAK? -, OpenFileDescriptor=100 (was 92) -
OpenFileDescriptor LEAK? -, MaxFileDescriptor=40000 (was 40000),
SystemLoadAverage=328 (was 331), ProcessCount=138 (was 138),
AvailableMemoryMB=1223 (was 1246), ConnectionCount=0 (was 0)



Getting tests to pass on these build boxes (other than hadoopqa which is a
different set of machines) seems unattainable.

I will write infra about the 40k to see if they can do something about that.

St.Ack





Re: Getting unit tests to pass

Posted by Stack <st...@duboce.net>.
By way of illustration of how loaded Apache build boxes can be:

Thread LEAK? -, OpenFileDescriptor=174 (was 162) - OpenFileDescriptor
LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351
(was 383), ProcessCount=142 (was 144), AvailableMemoryMB=819 (was
892), ConnectionCount=0 (was 0)

This seems to have caused a test that usually passes to fail:
https://issues.apache.org/jira/browse/HBASE-9023

St.Ack



Re: Getting unit tests to pass

Posted by Andrew Purtell <ap...@apache.org>.
On Thu, Aug 1, 2013 at 8:25 AM, Stack <st...@duboce.net> wrote:

> I'd also suggest that we tend away from big fat integration-type unit
> tests.  Apache infrastructure is overloaded and it's a PITA setting timeouts
> and retries so tests will pass in this "hostile" setting.  Consider making
> an hbase-it contrib. instead.
>

Please feel free to run these using the EC2 resources.  Each Jenkins job
runs on a dedicated VM.  For 0.94 and 0.95 it makes sense (at least to me)
to have jobs that kick off only the IT tests on each checkin.  That will
give coverage not found elsewhere.


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Getting unit tests to pass

Posted by Stack <st...@duboce.net>.
Update:

Look at these beautiful columns of trunk and 0.95 blue and green dots:

+ Andrew's ec2 build: http://54.241.6.143/ (click on the 0.95 and trunk
builds)
+ Apache: https://builds.apache.org/view/H-L/view/HBase/ (ditto)

Thanks to all who helped get tests passing over the hump (Jimmy, Matteo,
Jeffrey, JD, etc.).

There are still a few flakies, and it looks like DistributedLogSplitting
can go 'invisible' on occasion, but their time is nigh!

From here on out, let's keep the dots blue or green.  A failed test,
though it may seem unrelated, probably is related somehow, so I'd suggest
paying closer attention to fails (sign up for the builds mailing list if
you have not already).

I'd also suggest that we tend away from big fat integration-type unit
tests.  Apache infrastructure is overloaded, and it's a PITA setting
timeouts and retries so tests will pass in this "hostile" setting.
Consider making an hbase-it contrib. instead.  These are run with less
regularity but are approaching quotidian, hopefully on a test rig near you.

Yours,
St.Ack





