Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2015/10/29 18:19:27 UTC

[jira] [Commented] (HBASE-14589) Looking for the surefire-killer; builds being killed...

    [ https://issues.apache.org/jira/browse/HBASE-14589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980839#comment-14980839 ] 

stack commented on HBASE-14589:
-------------------------------

Looking for the surefire-killer... what is causing these:

{code}
ExecutionException: java.lang.RuntimeException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
{code}

Looking at a recent failure in the 1.3 build:

{code}
kalashnikov:hbase.git.commit2 stack$ python ./dev-support/findHangingTests.py  https://builds.apache.org/view/H-L/view/HBase/job/HBase-1.3/322/jdk=latest1.7,label=Hadoop/consoleText
Fetching https://builds.apache.org/view/H-L/view/HBase/job/HBase-1.3/322/jdk=latest1.7,label=Hadoop/consoleText
Building remotely on H4 (Mapreduce zookeeper Hadoop Pig falcon Hdfs) in workspace /home/jenkins/jenkins-slave/workspace/HBase-1.3/jdk/latest1.7/label/Hadoop
Printing hanging tests
Hanging test : org.apache.hadoop.hbase.client.TestMetaWithReplicas
Hanging test : org.apache.hadoop.hbase.client.TestHCM
Hanging test : org.apache.hadoop.hbase.client.TestSnapshotFromClientWithRegionReplicas
Hanging test : org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClient
{code}

I notice that the above 4 hanging tests don't produce xml files, just .txt files (I was hoping that an unclosed xml file would help identify the bad tests...).
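A minimal sketch of that check, assuming surefire's usual report naming in target/surefire-reports: `TEST-<class>.xml` for the XML report, `<class>.txt` for the plain-text summary, and `<class>-output.txt` for redirected stdout. The helper names are hypothetical and not part of the HBase dev-support tooling; the core function takes a list of file names so it can be fed from `os.listdir`.

```python
import os


def missing_xml(file_names):
    """Return test classes that have a <class>.txt summary but no TEST-<class>.xml report."""
    names = set(file_names)
    missing = []
    for name in sorted(names):
        # Only look at summary files; skip redirected stdout dumps.
        if not name.endswith(".txt") or name.endswith("-output.txt"):
            continue
        test_class = name[: -len(".txt")]
        if "TEST-" + test_class + ".xml" not in names:
            missing.append(test_class)
    return missing


def missing_xml_in(report_dir):
    """Same check, reading the file names from a surefire-reports directory."""
    return missing_xml(os.listdir(report_dir))
```

For example, `missing_xml_in("target/surefire-reports")` would list the classes that wrote a summary but never got an XML report.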

Also, the above are a mix of medium and large tests: the first two are medium and the latter two are large.

The above are described as 'hanging' tests... but all that means is that they started and never reported an ending.
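A rough sketch of that start-without-end matching, assuming the surefire console prints `Running <class>` when a test class starts and a `Tests run: ... - in <class>` summary when it finishes. This only illustrates the idea; it is not the actual logic of findHangingTests.py.

```python
import re

# Assumed surefire console line shapes for test start and test completion.
START_RE = re.compile(r"^Running (\S+)")
END_RE = re.compile(r"^Tests run: .* - in (\S+)")


def find_hanging_tests(console_text):
    """Return test classes that logged a start line but no completion summary, in start order."""
    started, finished = [], set()
    for line in console_text.splitlines():
        line = line.strip()
        m = START_RE.match(line)
        if m:
            started.append(m.group(1))
            continue
        m = END_RE.match(line)
        if m:
            finished.add(m.group(1))
    return [test for test in started if test not in finished]
```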

I see this in the output:

{code}
Killed
Killed
Killed
Killed
Killed
{code}

So 5 were killed, but only 4 show as started without an ending.
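To make that arithmetic explicit, here is a small hypothetical helper that counts bare `Killed` lines in the console log and subtracts the number of start-without-end tests. With the 5 kills and 4 hanging tests above, it leaves 1 kill that no started test accounts for.

```python
def count_killed(console_text):
    """Count console lines that consist solely of the word 'Killed'."""
    return sum(1 for line in console_text.splitlines() if line.strip() == "Killed")


def unexplained_kills(console_text, hanging_tests):
    """Kills not matched by a started-but-unfinished test (never negative)."""
    return max(count_killed(console_text) - len(hanging_tests), 0)
```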

Looking at the first test, it runs for more than two minutes and doesn't seem to finish properly: at least two of its methods take longer than the prescribed medium test time of 50 seconds. Let me move it to large. None of the tests have a timeout, so let me also add a category-based timeout. Hopefully, once the test is large and has a timeout, a failure will bubble up as something other than the mysterious surefire exception. Let me make TestHCM large too.
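Here is a sketch of the sizing check behind that move. The 50-second medium budget comes from the text. The per-method durations and the "any method over budget means large" rule are hypothetical simplifications, not HBase's actual categorization logic.

```python
MEDIUM_BUDGET_SECS = 50.0  # medium-test sizing cited above


def methods_over_budget(durations, budget=MEDIUM_BUDGET_SECS):
    """Return (method, seconds) pairs slower than the budget, slowest first."""
    over = [(name, secs) for name, secs in durations.items() if secs > budget]
    return sorted(over, key=lambda pair: pair[1], reverse=True)


def suggested_category(durations, budget=MEDIUM_BUDGET_SECS):
    """Hypothetical rule: a single method over the medium budget pushes the class to large."""
    return "large" if methods_over_budget(durations, budget) else "medium"
```

With made-up timings such as `{"testA": 70.0, "testB": 12.0, "testC": 55.5}`, two methods overrun the budget and the class would be flagged as large.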

TestSnapshotFromClientWithRegionReplicas is killed two seconds into the test, while running:

client.TestSnapshotFromClientWithRegionReplicas#testListTableSnapshotsWithRegex

When I run it locally, it completes promptly.

org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClient is cut off in the middle of testCloneSnapshotOfCloned after seven seconds. Normally it runs to completion in two and a half minutes.

These tests do spew megabytes of output.
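One way to see which tests produce the most output is sketched below. It assumes surefire is configured to redirect test output to `<class>-output.txt` files; otherwise those files won't exist. The helper names are hypothetical.

```python
import os


def largest_outputs(sizes, top=3):
    """Given {file name: size in bytes}, return the biggest *-output.txt files, largest first."""
    outputs = [(name, size) for name, size in sizes.items() if name.endswith("-output.txt")]
    outputs.sort(key=lambda pair: pair[1], reverse=True)
    return outputs[:top]


def sizes_in(report_dir):
    """Map each file in a surefire-reports directory to its size in bytes."""
    return {
        name: os.path.getsize(os.path.join(report_dir, name))
        for name in os.listdir(report_dir)
    }


def as_megabytes(size):
    """Format a byte count as megabytes with one decimal place."""
    return f"{size / (1024 * 1024):.1f} MB"
```

Usage: `for name, size in largest_outputs(sizes_in("target/surefire-reports")): print(name, as_megabytes(size))`.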

> Looking for the surefire-killer; builds being killed...
> -------------------------------------------------------
>
>                 Key: HBASE-14589
>                 URL: https://issues.apache.org/jira/browse/HBASE-14589
>             Project: HBase
>          Issue Type: Sub-task
>          Components: test
>            Reporter: stack
>            Assignee: stack
>         Attachments: 14589.mx.patch, 14589.timeout.txt, 14589.txt, 14598.addendum.sufire.timeout.patch
>
>
> I see this in a build that started two hours ago... about 6:45... it's build 15941 on ubuntu-6
> {code}
> WARNING: 2 rogue build processes detected, terminating.
> /bin/kill -9 18640 
> /bin/kill -9 22625 
> {code}
> If I go back to build 15939, started about 3 1/2 hours ago (say, 5:15), I see:
> {code}
> Running org.apache.hadoop.hbase.client.TestShell
> Killed
> {code}
> ... but it was running on ubuntu-1, so it doesn't look like we are killing ourselves when we do this in test-patch.sh:
> {code}
>   ### Kill any rogue build processes from the last attempt
>   $PS auxwww | $GREP ${PROJECT_NAME}PatchProcess | $AWK '{print $2}' | /usr/bin/xargs -t -I {} /bin/kill -9 {} > /dev/null
> {code}
> The above code runs in a few places in test-patch.sh.
> Let me try and add some more info around what is being killed... 
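Below is a sketch of a more informative version of that kill step, written in Python rather than shell for illustration. It parses `ps auxwww` rows and keeps those whose COMMAND column contains the marker, which is whatever `${PROJECT_NAME}PatchProcess` expands to. It skips a grep that is only searching for the marker, since a `ps | grep` pipeline can otherwise match its own grep. It then prints each pid with its full command line before killing. Unlike the shell original, it matches only the COMMAND column, and `dry_run` defaults to True so nothing is signalled unless asked.

```python
import os
import signal


def select_rogue(ps_text, marker):
    """Return (pid, command) for `ps auxwww` rows whose command line contains marker."""
    matches = []
    for line in ps_text.splitlines():
        # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND (command may contain spaces)
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header or malformed row
        pid, command = int(fields[1]), fields[10]
        if marker not in command:
            continue
        if command.split()[0].endswith("grep"):
            continue  # a grep that is merely searching for the marker
        matches.append((pid, command))
    return matches


def kill_rogue(ps_text, marker, dry_run=True):
    """Log every matching process before (optionally) sending it SIGKILL."""
    for pid, command in select_rogue(ps_text, marker):
        print(f"{'would kill' if dry_run else 'killing'} {pid}: {command}")
        if not dry_run:
            os.kill(pid, signal.SIGKILL)
```

Logging the full command line, not just the pid, is what would show which process the "rogue build process" check actually hit.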

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)