You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hbase.apache.org by "Duo Zhang (Jira)" <ji...@apache.org> on 2022/07/13 02:55:00 UTC

[jira] [Resolved] (HBASE-27193) TestZooKeeper is flaky

     [ https://issues.apache.org/jira/browse/HBASE-27193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang resolved HBASE-27193.
-------------------------------
    Fix Version/s: 2.5.0
                   3.0.0-alpha-4
                   2.4.14
     Hadoop Flags: Reviewed
       Resolution: Fixed

Pushed to branch-2.4+.

Thanks [~zghao] for reviewing!

> TestZooKeeper is flaky
> ----------------------
>
>                 Key: HBASE-27193
>                 URL: https://issues.apache.org/jira/browse/HBASE-27193
>             Project: HBase
>          Issue Type: Bug
>          Components: test, Zookeeper
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Major
>             Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14
>
>
> https://nightlies.apache.org/hbase/HBase-Flaky-Tests/master/3847/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.TestZooKeeper-output.txt
> Found this in the log output
> {noformat}
> 2022-07-11T20:46:34,839 ERROR [zk-event-processor-pool-0] master.HMaster(3054): Master server abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> 2022-07-11T20:46:34,839 ERROR [zk-event-processor-pool-0] master.HMaster(3059): ***** ABORTING master jenkins-hbase19.apache.org,44207,1657572391676: Received an unexpected KeeperException, aborting *****
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/master
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:134) ~[zookeeper-3.5.7.jar:3.5.7]
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.5.7.jar:3.5.7]
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2021) ~[zookeeper-3.5.7.jar:3.5.7]
> 	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:262) ~[hbase-zookeeper-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> 	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:250) ~[hbase-zookeeper-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> 	at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:108) ~[hbase-zookeeper-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> 	at org.apache.hadoop.hbase.master.ActiveMasterManager.handleMasterNodeChange(ActiveMasterManager.java:197) ~[classes/:3.0.0-alpha-4-SNAPSHOT]
> 	at org.apache.hadoop.hbase.master.ActiveMasterManager.handle(ActiveMasterManager.java:130) ~[classes/:3.0.0-alpha-4-SNAPSHOT]
> 	at org.apache.hadoop.hbase.master.ActiveMasterManager.nodeDeleted(ActiveMasterManager.java:125) ~[classes/:3.0.0-alpha-4-SNAPSHOT]
> 	at org.apache.hadoop.hbase.zookeeper.ZKWatcher.lambda$processEvent$0(ZKWatcher.java:574) ~[hbase-zookeeper-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> 	at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:187) ~[hbase-common-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> 	at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:177) ~[hbase-common-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> 	at org.apache.hadoop.hbase.zookeeper.ZKWatcher.processEvent(ZKWatcher.java:556) ~[hbase-zookeeper-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> 	at org.apache.hadoop.hbase.zookeeper.ZKWatcher.lambda$process$1(ZKWatcher.java:609) ~[hbase-zookeeper-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> 	at org.apache.hadoop.hbase.trace.TraceUtil.lambda$tracedRunnable$2(TraceUtil.java:155) ~[hbase-common-3.0.0-alpha-4-SNAPSHOT.jar:3.0.0-alpha-4-SNAPSHOT]
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_291]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_291]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_291]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_291]
> 	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_291]
> 2022-07-11T20:46:34,839 DEBUG [Listener at localhost.localdomain/39949] master.HMaster(3049): Abort called but aborted=true, stopped=false
> 2022-07-11T20:46:34,839 INFO  [zk-event-processor-pool-0] master.HMaster(3153): ***** STOPPING master 'jenkins-hbase19.apache.org,44207,1657572391676' *****
> 2022-07-11T20:46:34,840 INFO  [zk-event-processor-pool-0] master.HMaster(3155): STOPPED: Stopped by zk-event-processor-pool-0
> {noformat}
> So the problem here is that, we call abort manually in UT, but it is possible that after closing zookeeper client, the master itself will call abort in some background threads, and there is a race that, the abortedRequested is set, but stopped is not set while we call abort manually, so we will return immediately and once we check for master.isStopped, it is still false, as the log output above.
> It is just a test issue. Let me provide a very simple patch to fix it, just a wait is enough...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)