Posted to issues@hbase.apache.org by "Sean Busbey (JIRA)" <ji...@apache.org> on 2018/09/12 13:56:00 UTC

[jira] [Comment Edited] (HBASE-21187) The HBase UTs are extremely slow on some jenkins node

    [ https://issues.apache.org/jira/browse/HBASE-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612154#comment-16612154 ] 

Sean Busbey edited comment on HBASE-21187 at 9/12/18 1:55 PM:
--------------------------------------------------------------

How does the machine info compare for the two? Maybe we have a thrashing neighbor on the node?

I believe our Yetus version is new enough that it should have the [Process Reaper|http://yetus.apache.org/documentation/0.7.0/precommit-advanced/#process-reaper] functionality. IIRC, it came with some underlying functionality to monitor processes (e.g. the "process+thread count" in current reports). We could make a plugin that uses the same thing to, e.g., measure CPU or memory use as precommit runs.
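
For illustration, here is a rough sketch of the kind of sampling such a plugin could do, written as a standalone Python script rather than as an actual Yetus plugin; the sampling interval, output path, and /proc parsing below are assumptions made for the sketch only:

{noformat}
#!/usr/bin/env python
# Hypothetical resource sampler: record load average and available memory
# every few seconds while a precommit run is in progress. This is only an
# illustration of the idea, not the Yetus plugin API.
import time

SAMPLE_INTERVAL_SECS = 10              # assumed sampling interval
OUTPUT_FILE = "resource-samples.log"   # assumed output path

def read_loadavg():
    # /proc/loadavg looks like "0.52 0.58 0.59 1/389 12345"
    with open("/proc/loadavg") as f:
        return f.read().split()[:3]

def read_mem_available_kb():
    # Parse the "MemAvailable:  123456 kB" line from /proc/meminfo
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    return -1

if __name__ == "__main__":
    with open(OUTPUT_FILE, "a") as out:
        while True:
            load1, load5, load15 = read_loadavg()
            mem_kb = read_mem_available_kb()
            out.write("%s load=%s/%s/%s mem_avail_kb=%d\n"
                      % (time.strftime("%Y-%m-%d %H:%M:%S"),
                         load1, load5, load15, mem_kb))
            out.flush()
            time.sleep(SAMPLE_INTERVAL_SECS)
{noformat}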


was (Author: busbey):
How does the machine info compare for the two? Maybe we have a thrashing neighbor on the node?

I believe our Yetus version is new enough that it should have the [http://yetus.apache.org/documentation/0.7.0/precommit-advanced/#process-reaper|Process Reaper] functionality. IIRC, it came with some underlying functionality to monitor processes (e.g. the "process+thread count" in current reports).  we could make a plugin that uses the same thing to e.g. measure CPU or memory use as precommit goes.

> The HBase UTs are extremely slow on some jenkins node
> -----------------------------------------------------
>
>                 Key: HBASE-21187
>                 URL: https://issues.apache.org/jira/browse/HBASE-21187
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>            Reporter: Duo Zhang
>            Priority: Major
>
> Looking at the flaky dashboard for the master branch, the top several UTs tend to fail at the same time. One thing the failed flaky-test jobs have in common is that their execution time is more than one hour, while successful executions usually take only about half an hour.
> I have also compared the output for TestRestoreSnapshotFromClientWithRegionReplicas: in a successful run the DisableTableProcedure finishes within one second, while in a failed run it can take more than half a minute.
> I am not sure what the real problem is, but in the failed runs there appear to be time holes in the output, i.e., stretches of several seconds with no log output. Like this:
> {noformat}
> 2018-09-11 21:08:08,152 INFO  [PEWorker-4] procedure2.ProcedureExecutor(1500): Finished pid=490, state=SUCCESS, hasLock=false; CreateTableProcedure table=testRestoreSnapshotAfterTruncate in 12.9380sec
> 2018-09-11 21:08:15,590 DEBUG [RpcServer.default.FPBQ.Fifo.handler=1,queue=0,port=33663] master.MasterRpcServices(1174): Checking to see if procedure is done pid=490
> {noformat}
> No log output for about 7 seconds.
> And for a successful run, at the same place:
> {noformat}
> 2018-09-12 07:47:32,488 INFO  [PEWorker-7] procedure2.ProcedureExecutor(1500): Finished pid=490, state=SUCCESS, hasLock=false; CreateTableProcedure table=testRestoreSnapshotAfterTruncate in 1.2220sec
> 2018-09-12 07:47:32,881 DEBUG [RpcServer.default.FPBQ.Fifo.handler=3,queue=0,port=59079] master.MasterRpcServices(1174): Checking to see if procedure is done pid=490
> {noformat}
> There is no such hole.
> Maybe there is a big GC pause? (A small script for spotting such gaps is sketched after this description.)
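
One way to quantify the holes described above would be a small script that scans a test log for consecutive lines whose timestamps are far apart. A minimal, hypothetical Python sketch follows; the timestamp format and the 5-second threshold are assumptions taken from the snippets quoted above:

{noformat}
#!/usr/bin/env python
# Hypothetical helper: find "time holes" in a test log by flagging
# consecutive log lines whose timestamps differ by more than a threshold.
import re
import sys
from datetime import datetime

# Matches the leading "2018-09-11 21:08:08,152" timestamp of a log line.
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")
GAP_THRESHOLD_SECS = 5.0  # assumed threshold

def parse_ts(line):
    m = TIMESTAMP_RE.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return ts.timestamp() + int(m.group(2)) / 1000.0

def find_gaps(path):
    prev_ts, prev_line = None, None
    with open(path) as f:
        for line in f:
            ts = parse_ts(line)
            if ts is None:
                continue
            if prev_ts is not None and ts - prev_ts > GAP_THRESHOLD_SECS:
                print("gap of %.1f seconds:" % (ts - prev_ts))
                print("  " + prev_line.rstrip())
                print("  " + line.rstrip())
            prev_ts, prev_line = ts, line

if __name__ == "__main__":
    find_gaps(sys.argv[1])
{noformat}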



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)