Posted to issues@hbase.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2019/08/14 00:48:00 UTC

[jira] [Commented] (HBASE-22601) Misconfigured addition of peers leads to cluster shutdown.

    [ https://issues.apache.org/jira/browse/HBASE-22601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906763#comment-16906763 ] 

Hadoop QA commented on HBASE-22601:
-----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 22m 28s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:orange}-0{color} | {color:orange} test4tests {color} | {color:orange}  0m  0s{color} | {color:orange} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-1.3 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m  4s{color} | {color:green} branch-1.3 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 39s{color} | {color:green} branch-1.3 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 20s{color} | {color:green} branch-1.3 passed {color} |
| {color:red}-1{color} | {color:red} shadedjars {color} | {color:red}  0m 12s{color} | {color:red} branch has 10 errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 36s{color} | {color:green} branch-1.3 passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  2m 18s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 16s{color} | {color:green} branch-1.3 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedjars {color} | {color:red}  0m 10s{color} | {color:red} patch has 10 errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  8m 57s{color} | {color:green} Patch does not cause any errors with Hadoop 2.4.1 2.5.2 2.6.5 2.7.7. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 17s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}115m 17s{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 22s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}167m 31s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.client.replication.TestReplicationAdminWithTwoDifferentZKClusters |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-490/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hbase/pull/490 |
| JIRA Issue | HBASE-22601 |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 7a96ffc0e253 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/HBase-PreCommit-GitHub-PR_PR-490/out/precommit/personality/provided.sh |
| git revision | branch-1.3 / 8e5e868 |
| Default Java | 1.8.0_222 |
| shadedjars | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-490/1/artifact/out/branch-shadedjars.txt |
| shadedjars | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-490/1/artifact/out/patch-shadedjars.txt |
| unit | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-490/1/artifact/out/patch-unit-hbase-server.txt |
|  Test Results | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-490/1/testReport/ |
| Max. process+thread count | 3994 (vs. ulimit of 10000) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-490/1/console |
| versions | git=1.9.1 maven=3.0.5 findbugs=3.1.0-RC1 |
| Powered by | Apache Yetus 0.10.0 http://yetus.apache.org |


This message was automatically generated.



> Misconfigured addition of peers leads to cluster shutdown.
> ----------------------------------------------------------
>
>                 Key: HBASE-22601
>                 URL: https://issues.apache.org/jira/browse/HBASE-22601
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.3.2
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>            Priority: Major
>
> Recently we added a peer that was in a different Kerberos realm to a production cluster.
> *Steps to reproduce:*
>  1. Add a misconfigured peer that is in a different Kerberos realm.
>  2. Remove that peer.
>  3. All region servers will start to crash.
> *RCA*
>  I enabled trace logging on one region server for a short period.
>  After adding the peer, I saw the following log lines.
> {noformat}
> 2019-06-18 22:19:20,949 INFO  [main-EventThread] replication.ReplicationTrackerZKImpl - /hbase/replication/peers znode expired, triggering peerListChanged event
> 2019-06-18 22:19:20,992 INFO  [main-EventThread] replication.ReplicationPeersZKImpl - Added new peer cluster=<dev-cluster-zookeeper>:/hbase
> 2019-06-18 22:19:21,113 INFO  [main-EventThread] zookeeper.RecoverableZooKeeper - Process identifier=hconnection-0x794a56d6 connecting to ZooKeeper ensemble=<dev-cluster-zookeeper>
> 2019-06-18 22:20:01,280 WARN  [main-EventThread] zookeeper.ZKUtil - hconnection-0x794a56d6-0x16b56265fbebb1b, quorum=<dev-cluster-zookeeper>, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase/hbaseid
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:220)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:421)
>         at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>         at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:105)
>         at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:922)
>         at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:706)
>         at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:638)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
>         at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:432)
>         at org.apache.hadoop.hbase.client.ConnectionManager.createConnectionInternal(ConnectionManager.java:341)
>         at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:144)
>         at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.init(HBaseInterClusterReplicationEndpoint.java:135)
>         at com.salesforce.hbase.replication.TenantReplicationEndpoint.init(TenantReplicationEndpoint.java:30)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.getReplicationSource(ReplicationSourceManager.java:517)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.addSource(ReplicationSourceManager.java:273)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.peerListChanged(ReplicationSourceManager.java:635)
>         at org.apache.hadoop.hbase.replication.ReplicationTrackerZKImpl$PeersWatcher.nodeChildrenChanged(ReplicationTrackerZKImpl.java:192)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:643)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:544)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:519)
> 2019-06-18 22:20:42,999 WARN  [Source,<dev-cluster>] zookeeper.ZKUtil - connection to cluster: <dev-cluster>-0x26b56265fe7b5cd, quorum=<dev-cluster-zookeeper>, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase/hbaseid
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:220)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:421)
>         at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>         at org.apache.hadoop.hbase.zookeeper.ZKClusterId.getUUIDForCluster(ZKClusterId.java:96)
>         at org.apache.hadoop.hbase.replication.HBaseReplicationEndpoint.getPeerUUID(HBaseReplicationEndpoint.java:104)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:304)
> 2019-06-18 22:20:43,002 TRACE [Source,<dev-cluster>] regionserver.ReplicationSource - Cannot contact the peer's zk ensemble, sleeping 1000 times 1
> <same stack trace as before>
> 2019-06-18 22:20:44,008 TRACE [Source,<dev-cluster>] regionserver.ReplicationSource - Cannot contact the peer's zk ensemble, sleeping 1000 times 2
> <same stack trace as before>
> {noformat}
> This went on until we removed the peer.
>  After removing the peer:
> {noformat}
> 2019-06-18 22:21:20,731 INFO  [main-EventThread] replication.ReplicationTrackerZKImpl - /hbase/replication/peers/<dev-cluster> znode expired, triggering peerRemoved event
> 2019-06-18 22:21:20,731 INFO  [main-EventThread] regionserver.ReplicationSourceManager - Closing the following queue <dev-cluster>, currently have 2 and another 0 that were recovered
> 2019-06-18 22:21:20,733 INFO  [main-EventThread] regionserver.ReplicationSourceManager - Number of deleted recovered sources for <dev-cluster>: 0
> 2019-06-18 22:21:20,734 INFO  [main-EventThread] regionserver.ReplicationSource - Closing source <dev-cluster> because: Replication stream was removed by a user
> 2019-06-18 22:21:20,734 INFO  [main-EventThread] replication.TenantReplicationEndpoint - Stopping endpoint
> 2019-06-18 22:21:20,736 INFO  [main-EventThread] client.ConnectionManager$HConnectionImplementation - Closing zookeeper sessionid=0x16b56265fbebb1b
> 2019-06-18 22:21:20,736 DEBUG [main-EventThread] ipc.RpcClientImpl - Stopping rpc client
> 2019-06-18 22:21:20,738 INFO  [main-EventThread] regionserver.ReplicationSourceManager - Done with the queue <dev-cluster>
> 2019-06-18 22:21:20,744 DEBUG [main-EventThread] replication.ReplicationQueuesZKImpl - Peer /hbase/replication/hfile-refs/<dev-cluster> not found in hfile reference queue.
> 2019-06-18 22:21:28,066 INFO  [Source,<dev-cluster>] regionserver.ReplicationSource - Replicating d12a5eb3-16bd-4910-91ed-41e0e7990007 -> null
> 2019-06-18 22:21:28,067 DEBUG [Source,<dev-cluster>] regionserver.ReplicationSource - Starting up worker for wal group <rs-serner-name>%2C60020%2C1560896050398
> 2019-06-18 22:21:28,069 ERROR [050398,<dev-cluster>] regionserver.ReplicationSource - Unexpected exception in ReplicationSourceWorkerThread, currentPath=null
> java.lang.IllegalArgumentException: Peer with id= <dev-cluster> is not connected
>         at org.apache.hadoop.hbase.replication.ReplicationPeersZKImpl.getStatusOfPeer(ReplicationPeersZKImpl.java:217)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.isPeerEnabled(ReplicationSource.java:363)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:549)
> 2019-06-18 22:21:28,070 INFO  [050398,<dev-cluster>] regionserver.HRegionServer - STOPPED: Unexpected exception in ReplicationSourceWorkerThread
> 2019-06-18 22:21:28,071 INFO  [/10.231.90.212:60020] regionserver.SplitLogWorker - Sending interrupt to stop the worker thread
> 2019-06-18 22:21:28,073 INFO  [/10.231.90.212:60020] regionserver.HRegionServer - Stopping infoServer
> 2019-06-18 22:21:28,075 INFO  [as-dnds4-3-prd:60020] regionserver.SplitLogWorker - SplitLogWorker interrupted. Exiting. 
> 2019-06-18 22:21:28,076 INFO  [as-dnds4-3-prd:60020] regionserver.SplitLogWorker - SplitLogWorker <rs-server-name>,60020,1560896050398 exiting
> 2019-06-18 22:21:28,181 INFO  [/10.231.90.212:60020] regionserver.HeapMemoryManager - Stopping HeapMemoryTuner chore.
> 2019-06-18 22:21:28,181 INFO  [/10.231.90.212:60020] flush.RegionServerFlushTableProcedureManager - Stopping region server flush procedure manager gracefully.
> {noformat}
> I also verified from jstacks captured between the addition and the removal of the peer that the replication source thread was sleeping, waiting to connect to the peer.
> {noformat}
> "main-EventThread.replicationSource,<dev-cluster>" #1923 daemon prio=5 os_prio=0 tid=0x00007f1ad18f4800 nid=0x25a47 
> waiting on condition [0x00007f1220314000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.sleepForRetries(ReplicationSource.java:349)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:306)
> {noformat}
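The jstack above shows the source thread parked in {{sleepForRetries}}, which sleeps the base interval times a multiplier and lets the caller grow the multiplier up to a cap, so the retry loop never gives up on its own. A minimal, dependency-free sketch of that pattern (the class and method names below are simplified illustrations, not the actual HBase code):

```java
// Simplified illustration of the capped linear-backoff retry pattern used by
// ReplicationSource.sleepForRetries. Hypothetical names, not the HBase code.
public class RetryBackoff {
    private final long baseSleepMs;         // base sleep interval, e.g. 1000 ms
    private final int maxRetriesMultiplier; // cap on the sleep multiplier

    public RetryBackoff(long baseSleepMs, int maxRetriesMultiplier) {
        this.baseSleepMs = baseSleepMs;
        this.maxRetriesMultiplier = maxRetriesMultiplier;
    }

    /** Sleep time for this attempt: grows linearly, then stays at the cap. */
    public long sleepTimeMs(int sleepMultiplier) {
        return baseSleepMs * Math.min(sleepMultiplier, maxRetriesMultiplier);
    }

    /** True while the caller may still increase the multiplier (mirrors
     *  sleepForRetries' boolean return). Note it never signals "give up". */
    public boolean canIncrease(int sleepMultiplier) {
        return sleepMultiplier < maxRetriesMultiplier;
    }
}
```

Because the backoff only caps the sleep and never terminates the loop, the source keeps logging "Cannot contact the peer's zk ensemble" indefinitely until the peer is removed.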
> I believe the bug lies here:
> {code:java|title=ReplicationSource.java|borderStyle=solid}
> // Some comments here
>   public void run() {
>     // ...
>  while (this.isSourceActive() && this.peerClusterId == null) {
>       this.peerClusterId = replicationEndpoint.getPeerUUID();
>       if (this.isSourceActive() && this.peerClusterId == null) {
>         if (sleepForRetries("Cannot contact the peer's zk ensemble", sleepMultiplier)) {
>           sleepMultiplier++;
>         }
>       }
>     }
>     // BUG: peerClusterId may still be null here (the loop also exits when
>     // the source is stopped). We should check for null and, if so,
>     // terminate the thread instead of falling through.
>     // In rare case, zookeeper setting may be messed up. That leads to the incorrect
>     // peerClusterId value, which is the same as the source clusterId
>     if (clusterId.equals(peerClusterId) && !replicationEndpoint.canReplicateToSameCluster()) {
>       this.terminate("ClusterId " + clusterId + " is replicating to itself: peerClusterId "
>           + peerClusterId + " which is not allowed by ReplicationEndpoint:"
>           + replicationEndpoint.getClass().getName(), null, false);
>       this.manager.closeQueue(this);
>       return;
>     }
>     // ...
> }
> {code}
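To make the proposed guard concrete, here is a hedged, dependency-free sketch of the fix the comment above describes: if the retry loop exits because the source was stopped (e.g. the peer was removed), {{peerClusterId}} is still null, and the source should terminate instead of continuing into the worker threads. The {{BooleanSupplier}}/{{Supplier}} parameters stand in for {{isSourceActive()}} and {{replicationEndpoint.getPeerUUID()}}; this is an illustration, not the actual patch.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

public class PeerResolver {
    /**
     * Sketch of the peer-UUID resolution loop from ReplicationSource.run()
     * with the proposed guard. Returns the peer cluster id, or null when the
     * source was stopped before the peer's ZK ensemble could be reached --
     * in which case the caller must terminate the source rather than start
     * worker threads that will throw "Peer ... is not connected".
     */
    public static String resolvePeer(BooleanSupplier sourceActive,
                                     Supplier<String> peerUuid) {
        String peerClusterId = null;
        while (sourceActive.getAsBoolean() && peerClusterId == null) {
            peerClusterId = peerUuid.get();
            // the real code sleeps with a growing multiplier between attempts
        }
        // Proposed fix: surface the null instead of falling through to the
        // clusterId.equals(peerClusterId) check with a half-initialized source.
        return peerClusterId;
    }
}
```

With this shape, the caller terminates the replication source cleanly on a null return, so removing a misconfigured peer no longer leaves worker threads running against a disconnected peer.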



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)