You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by GitBox <gi...@apache.org> on 2021/03/03 21:46:19 UTC

[GitHub] [hbase] bharathv commented on a change in pull request #3009: HBASE-25627: HBase replication should have a metric to represent if it cannot talk to peer's zk

bharathv commented on a change in pull request #3009:
URL: https://github.com/apache/hbase/pull/3009#discussion_r586812579



##########
File path: hbase-hadoop2-compat/src/main/java/org/apache/hadoop/hbase/replication/regionserver/MetricsReplicationGlobalSourceSource.java
##########
@@ -197,6 +205,20 @@ public void incrFailedRecoveryQueue() {
     failedRecoveryQueue.incr(1L);
   }
 
+  @Override
+  public void setPeerZkConnectionFailures(boolean success) {

Review comment:
       This logic in a global source doesn't give the intended result here because `peerZkConnectionFailures` is reset to 0 if _any_ source succeeds. That doesn't seem right.

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
##########
@@ -293,9 +293,12 @@ public void run() {
     while (this.isSourceActive() && this.peerClusterId == null) {
       this.peerClusterId = replicationEndpoint.getPeerUUID();
       if (this.isSourceActive() && this.peerClusterId == null) {
+        metrics.setPeerZkConnectionFailures(false);

Review comment:
       Had a brief discussion offline with @sandeepvinayak on how to generalize this. I think the intent here is capture and flag any sources that are stuck during initialization and ZK connection failure is just a symptom. So capturing those ZK failure count may not add much value, instead we track _number of uninitialized sources_, (gauge) that'd be much more helpful.
   
   So one way forward is to track number of such uninitialized sources at a global scope (that the monitoring tooling can flag if its > 0 for a time window) and then backport https://issues.apache.org/jira/browse/HBASE-22731 to branch-1. These two together should help us narrow down it to the right RS and root cause.
   
   Thoughts?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org