Posted to issues@hbase.apache.org by "Jun Yuan (Jira)" <ji...@apache.org> on 2020/03/20 22:44:00 UTC
[jira] [Comment Edited] (HBASE-19617) Remove ReplicationQueues, use ReplicationQueueStorage directly

    [ https://issues.apache.org/jira/browse/HBASE-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063653#comment-17063653 ] 

Jun Yuan edited comment on HBASE-19617 at 3/20/20, 10:43 PM:
-------------------------------------------------------------

[~zhangduo], after this patch, ReplicationSyncUp is broken:

{noformat}
Start Replication Server start
20/03/20 22:12:12 INFO regionserver.Replication: replication.initialize()
20/03/20 22:12:12 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
20/03/20 22:12:12 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
20/03/20 22:12:12 INFO impl.MetricsSystemImpl: HBase metrics system started
20/03/20 22:12:12 INFO metrics.MetricRegistries: Loaded MetricRegistries class org.apache.hadoop.hbase.metrics.impl.MetricRegistriesImpl
20/03/20 22:12:12 INFO regionserver.ReplicationSource: queueId=1, ReplicationSource : 1, currentBandwidth=0
20/03/20 22:12:12 INFO regionserver.ReplicationSourceManager: Current list of replicators: [stl-colo-srv102.splicemachine.colo,16020,1584661797974, stl-colo-srv101.splicemachine.colo,16020,1584661797971, stl-colo-srv099.splicemachine.colo,16020,1584661798080, stl-colo-srv097.splicemachine.colo,16020,1584661797540] other RSs: [stl-colo-srv102.splicemachine.colo,16020,1584661797974, stl-colo-srv101.splicemachine.colo,16020,1584661797971, stl-colo-srv097.splicemachine.colo,16020,1584661797540, stl-colo-srv099.splicemachine.colo,16020,1584661798080]
20/03/20 22:12:12 INFO regionserver.ReplicationSource: [Source for peer 1]: Closing source 1 because: Region server is closing
20/03/20 22:12:12 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=srv051:2181 sessionTimeout=120000 watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$28/306168495@573d4e8c
20/03/20 22:12:12 INFO zookeeper.ClientCnxn: Opening socket connection to server srv051/10.1.1.151:2181. Will not attempt to authenticate using SASL (unknown error)
20/03/20 22:12:12 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.1.1.195:48284, server: srv051/10.1.1.151:2181
20/03/20 22:12:12 WARN client.ConnectionImplementation: Retrieve cluster id failed
java.lang.InterruptedException
 at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:347)
 at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
 at org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId(ConnectionImplementation.java:549)
 at org.apache.hadoop.hbase.client.ConnectionImplementation.<init>(ConnectionImplementation.java:287)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
 at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:220)
 at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:115)
 at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.init(HBaseInterClusterReplicationEndpoint.java:136)
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initAndStartReplicationEndpoint(ReplicationSource.java:285)
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:480)
 at java.lang.Thread.run(Thread.java:745)
20/03/20 22:12:12 INFO zookeeper.ClientCnxn: Session establishment complete on server srv051/10.1.1.151:2181, sessionid = 0x17087b240999737, negotiated timeout = 120000
20/03/20 22:12:12 INFO zookeeper.RecoverableZooKeeper: Process identifier=connection to cluster: 1 connecting to ZooKeeper ensemble=srv051:2181
20/03/20 22:12:12 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=srv051:2181 sessionTimeout=120000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@1af1f864
20/03/20 22:12:12 INFO zookeeper.ClientCnxn: Opening socket connection to server srv051/10.1.1.151:2181. Will not attempt to authenticate using SASL (unknown error)
20/03/20 22:12:12 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.1.1.195:48288, server: srv051/10.1.1.151:2181
20/03/20 22:12:12 INFO zookeeper.ClientCnxn: Session establishment complete on server srv051/10.1.1.151:2181, sessionid = 0x17087b240999738, negotiated timeout = 120000
{noformat}

The following code change caused the problem:
{noformat}
manager.init().get();
while (manager.activeFailoverTaskCount() > 0) {
  Thread.sleep(SLEEP_TIME);
}
while (manager.getOldSources().size() > 0) {
  Thread.sleep(SLEEP_TIME);
}
{noformat}

It looks like both while loops fall through immediately, so ReplicationSyncUp does not wait and finishes too soon. As a result, the WALs are not synced.
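To illustrate how a polling wait like this can fall through, here is a minimal, self-contained sketch. It is not HBase code and not a diagnosis of the actual cause: FakeManager, activeTasks, walsShipped and the gate latch are all hypothetical stand-ins. It only assumes that the background work has not yet registered itself when the loop first checks the counter.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class SyncUpWaitSketch {
    static final long SLEEP_TIME = 10;

    /** Hypothetical stand-in for the manager; NOT the HBase ReplicationSourceManager. */
    static class FakeManager {
        final AtomicInteger activeTasks = new AtomicInteger(0);
        final AtomicInteger walsShipped = new AtomicInteger(0);
        final CountDownLatch finished = new CountDownLatch(1);

        /** Starts background work that registers itself only once the gate opens. */
        void start(CountDownLatch gate) {
            Thread worker = new Thread(() -> {
                try {
                    gate.await();                  // work is not yet visible to pollers
                    activeTasks.incrementAndGet(); // task registers itself late
                    walsShipped.incrementAndGet(); // the "sync" itself
                    activeTasks.decrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    finished.countDown();
                }
            });
            worker.start();
        }
    }

    /** Polling wait in the style of the quoted loops; returns how many times it slept. */
    static int pollUntilIdle(FakeManager m) throws InterruptedException {
        int sleeps = 0;
        while (m.activeTasks.get() > 0) {
            Thread.sleep(SLEEP_TIME);
            sleeps++;
        }
        return sleeps;
    }

    /** Returns {sleeps, shipped after polling, shipped after waiting on the latch}. */
    static int[] demo() {
        try {
            FakeManager m = new FakeManager();
            CountDownLatch gate = new CountDownLatch(1);
            m.start(gate);
            int sleeps = pollUntilIdle(m);           // counter is still 0: falls through
            int shippedAfterPoll = m.walsShipped.get();
            gate.countDown();                        // now let the work run
            m.finished.await();                      // explicit completion signal
            int shippedAfterLatch = m.walsShipped.get();
            return new int[] {sleeps, shippedAfterPoll, shippedAfterLatch};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println("sleeps=" + r[0] + " shippedAfterPoll=" + r[1]
            + " shippedAfterLatch=" + r[2]);
    }
}
```

In this sketch the polling loop never sleeps because the counter is zero when it is first checked, while waiting on an explicit completion signal does observe the work. Whether something similar applies to activeFailoverTaskCount() and getOldSources() here would need to be confirmed against the real code.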


> Remove ReplicationQueues, use ReplicationQueueStorage directly
> --------------------------------------------------------------
>
>                 Key: HBASE-19617
>                 URL: https://issues.apache.org/jira/browse/HBASE-19617
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Replication
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Major
>             Fix For: 3.0.0, 2.1.0
>
>         Attachments: HBASE-19617-HBASE-19397-v1.patch, HBASE-19617-HBASE-19397-v2.patch, HBASE-19617-HBASE-19397-v3.patch, HBASE-19617-HBASE-19397-v3.patch, HBASE-19617-HBASE-19397-v4.patch, HBASE-19617-HBASE-19397.patch
>
>



