Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/05/07 00:51:59 UTC

[GitHub] [incubator-doris] morningman opened a new pull request, #9424: [fix] fix bug that replica can not be repaired duo to DECOMMISSION state

morningman opened a new pull request, #9424:
URL: https://github.com/apache/incubator-doris/pull/9424

   # Proposed changes
   
   Issue Number: close #9422
   
   ## Problem Summary:
   
   Reset the state of any replica that is still in the DECOMMISSION state after scheduling has finished.
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: (No)
   2. Has unit tests been added: (No)
   3. Has document been added or modified: (No Need)
   4. Does it need to update dependencies: (No)
   5. Are there any changes that cannot be rolled back: (No)
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] morningman commented on a diff in pull request #9424: [fix] fix bug that replica can not be repaired duo to DECOMMISSION state

Posted by GitBox <gi...@apache.org>.
morningman commented on code in PR #9424:
URL: https://github.com/apache/incubator-doris/pull/9424#discussion_r867326361


##########
fe/fe-core/src/main/java/org/apache/doris/clone/TabletSchedCtx.java:
##########
@@ -1184,4 +1186,24 @@ public int compare(Replica r1, Replica r2) {
             }
         }
     }
+
+    // call this when releaseTabletCtx()
+    public void resetReplicaState() {
+        if (tablet != null) {
+            for (Replica replica : tablet.getReplicas()) {
+                // To address issue: https://github.com/apache/incubator-doris/issues/9422
+                // the DECOMMISSION state is set in TabletScheduler and not persist to meta.
+                // So it is reasonable to reset this state if we failed to scheduler this tablet.
+                // That is, if the TabletScheduler cannot process the tablet, then it should reset
+                // any intermediate state it set during the scheduling process.
+                if (replica.getState() == ReplicaState.DECOMMISSION) {
+                    replica.setState(ReplicaState.NORMAL);
+                    replica.setWatermarkTxnId(-1);
+                    LOG.debug("reset replica {} on backend {} of tablet {} state from DECOMMISSION to NORMAL",

Review Comment:
   It may print a lot of log lines. And actually, this is not an error; it is a common case.
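
The hunk above is cut off at the `LOG.debug` call. As a rough, self-contained sketch of the reset it describes: the `Replica` and `ReplicaState` types below are simplified stand-ins written for illustration, not the Doris classes, and returning a count of reset replicas is an assumption added here so the effect can be checked (the method in the diff returns `void`).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for illustration only; these are NOT the Doris classes.
class ResetReplicaStateSketch {
    // Only the two states named in the diff; the real enum has more.
    enum ReplicaState { NORMAL, DECOMMISSION }

    static class Replica {
        final long id;
        final long backendId;
        ReplicaState state = ReplicaState.NORMAL;
        long watermarkTxnId = -1;

        Replica(long id, long backendId) {
            this.id = id;
            this.backendId = backendId;
        }
    }

    // Puts every DECOMMISSION replica back to NORMAL and clears its watermark
    // txn id, mirroring the two setters in the diff.
    // Returning the number of reset replicas is an assumption of this sketch.
    static int resetReplicaState(List<Replica> replicas) {
        if (replicas == null) {
            return 0;
        }
        int reset = 0;
        for (Replica replica : replicas) {
            if (replica.state == ReplicaState.DECOMMISSION) {
                replica.state = ReplicaState.NORMAL;
                replica.watermarkTxnId = -1;
                reset++;
            }
        }
        return reset;
    }

    public static void main(String[] args) {
        List<Replica> replicas = new ArrayList<>();
        Replica healthy = new Replica(1L, 10001L);
        Replica stuck = new Replica(2L, 10002L);
        stuck.state = ReplicaState.DECOMMISSION;
        stuck.watermarkTxnId = 42L;
        replicas.add(healthy);
        replicas.add(stuck);

        int reset = resetReplicaState(replicas);
        System.out.println("reset " + reset + " replica(s); replica 2 is now " + stuck.state);
    }
}
```

Because the DECOMMISSION state is set only in memory by the scheduler (per the comment in the diff), calling the reset a second time on the same replicas is a no-op.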





[GitHub] [incubator-doris] github-actions[bot] commented on pull request #9424: [fix] fix bug that replica can not be repaired duo to DECOMMISSION state

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #9424:
URL: https://github.com/apache/incubator-doris/pull/9424#issuecomment-1127417899

   PR approved by at least one committer and no changes requested.




[GitHub] [incubator-doris] morningman commented on pull request #9424: [fix] fix bug that replica can not be repaired duo to DECOMMISSION state

Posted by GitBox <gi...@apache.org>.
morningman commented on PR #9424:
URL: https://github.com/apache/incubator-doris/pull/9424#issuecomment-1124479788

   > In this situation, which status of this tablet will cause a repair task?
   > REPLICA_MISSING OR VERSION_INCOMPLETE ?
   
   REPLICA_MISSING




[GitHub] [incubator-doris] wykjLDF commented on pull request #9424: [fix] fix bug that replica can not be repaired duo to DECOMMISSION state

Posted by GitBox <gi...@apache.org>.
wykjLDF commented on PR #9424:
URL: https://github.com/apache/incubator-doris/pull/9424#issuecomment-1131265134

   Hi, I have also hit a bug related to BE decommission recently. Could you please help me check whether it is connected to this bug?
   Before the decommission, the cluster had 5 BEs and no unhealthy tablets in any database, and I wanted to decommission one of the BEs.
   After I issued the decommission command, some tablets became unhealthy. Most of them recovered within about 20 minutes, but 3 tablets could not recover, even after about 20 hours. I have attached some screenshots of the status:
   Using the command "SHOW PROC '/statistic';":
   ![image](https://user-images.githubusercontent.com/41881379/169223339-09ffaf46-f181-41b4-a5d8-fb5a87695fa6.png)
   And checking the cluster_balance jobs with the command "SHOW PROC '/cluster_balance/history_tablets';":
   ![image](https://user-images.githubusercontent.com/41881379/169223792-503d876a-ba65-42bd-bc07-0e316e9ec58d.png)
   It seems that the TabletChecker keeps trying to relocate the three remaining tablets, but it still does not work.
   Could you please help me look into this problem? I would really appreciate it!




[GitHub] [incubator-doris] morningman merged pull request #9424: [fix] fix bug that replica can not be repaired duo to DECOMMISSION state

Posted by GitBox <gi...@apache.org>.
morningman merged PR #9424:
URL: https://github.com/apache/incubator-doris/pull/9424




[GitHub] [incubator-doris] pengxiangyu commented on a diff in pull request #9424: [fix] fix bug that replica can not be repaired duo to DECOMMISSION state

Posted by GitBox <gi...@apache.org>.
pengxiangyu commented on code in PR #9424:
URL: https://github.com/apache/incubator-doris/pull/9424#discussion_r867289520


##########
fe/fe-core/src/main/java/org/apache/doris/clone/TabletScheduler.java:
##########
@@ -1569,7 +1574,10 @@ public void handleRunningTablets() {
 
         // 2. release ctx
         timeoutTablets.stream().forEach(t -> {
-            releaseTabletCtx(t, TabletSchedCtx.State.CANCELLED);
+            // Set "resetReplicaState" to true because
+            // the timeout task should also be considered as UNRECOVERABLE,
+            // so need to reset replica state.
+            releaseTabletCtx(t, TabletSchedCtx.State.CANCELLED, true);

Review Comment:
   It would be better to add a log here; it would help us find out why resetReplicaState was called. There will not be too many of these log lines.
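
For readers without the full file, here is a minimal sketch of the pattern this hunk introduces: a boolean passed to the release call that decides whether replica state is reset, with the timeout path passing `true`. Only the three-argument call shape and the `CANCELLED` state come from the diff. The `TabletCtx` type, the `handleTimeouts` helper, and the order of operations inside `releaseTabletCtx` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the release path; not the Doris implementation.
class ReleaseCtxSketch {
    enum CtxState { RUNNING, CANCELLED }
    enum ReplicaState { NORMAL, DECOMMISSION }

    static class TabletCtx {
        CtxState state = CtxState.RUNNING;
        final List<ReplicaState> replicaStates = new ArrayList<>();

        // Turn every DECOMMISSION entry back into NORMAL.
        void resetReplicaState() {
            replicaStates.replaceAll(
                    s -> s == ReplicaState.DECOMMISSION ? ReplicaState.NORMAL : s);
        }
    }

    // The third parameter plays the role of "resetReplicaState" in the diff.
    static void releaseTabletCtx(TabletCtx ctx, CtxState finalState, boolean resetReplicaState) {
        ctx.state = finalState;
        if (resetReplicaState) {
            ctx.resetReplicaState();
        }
    }

    // Timed-out tablets are treated like unrecoverable ones: cancel and reset.
    static void handleTimeouts(List<TabletCtx> timeoutTablets) {
        timeoutTablets.forEach(t -> releaseTabletCtx(t, CtxState.CANCELLED, true));
    }

    public static void main(String[] args) {
        TabletCtx ctx = new TabletCtx();
        ctx.replicaStates.add(ReplicaState.NORMAL);
        ctx.replicaStates.add(ReplicaState.DECOMMISSION);

        List<TabletCtx> timedOut = new ArrayList<>();
        timedOut.add(ctx);
        handleTimeouts(timedOut);

        System.out.println(ctx.state + " " + ctx.replicaStates);
    }
}
```

Passing `false` instead leaves any DECOMMISSION marker in place, which matches the stuck state the PR title describes.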



##########
fe/fe-core/src/main/java/org/apache/doris/clone/TabletSchedCtx.java:
##########
@@ -1184,4 +1186,24 @@ public int compare(Replica r1, Replica r2) {
             }
         }
     }
+
+    // call this when releaseTabletCtx()
+    public void resetReplicaState() {
+        if (tablet != null) {
+            for (Replica replica : tablet.getReplicas()) {
+                // To address issue: https://github.com/apache/incubator-doris/issues/9422
+                // the DECOMMISSION state is set in TabletScheduler and not persist to meta.
+                // So it is reasonable to reset this state if we failed to scheduler this tablet.
+                // That is, if the TabletScheduler cannot process the tablet, then it should reset
+                // any intermediate state it set during the scheduling process.
+                if (replica.getState() == ReplicaState.DECOMMISSION) {
+                    replica.setState(ReplicaState.NORMAL);
+                    replica.setWatermarkTxnId(-1);
+                    LOG.debug("reset replica {} on backend {} of tablet {} state from DECOMMISSION to NORMAL",

Review Comment:
   LOG.warn() would be better. resetReplicaState will not be called frequently, so there will not be too many of these log lines, but we need to know which tablet was reset in order to find out why it ended up in this state.
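
The practical difference between the two levels can be sketched as follows. This example uses `java.util.logging` rather than the logger used in the Doris code, purely so that it runs without dependencies; `FINE` stands in for debug and `WARNING` for warn. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Illustrative only: shows which messages survive at a given logger level.
class LogLevelSketch {
    // Collects every record that is actually published to the handler.
    static class CapturingHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            if (isLoggable(record)) {
                messages.add(record.getLevel() + ": " + record.getMessage());
            }
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    // Logs the same event at debug-like and warn-like levels and returns
    // the messages that passed the logger's level threshold.
    static List<String> logReset(Level loggerLevel) {
        Logger log = Logger.getAnonymousLogger();
        log.setUseParentHandlers(false); // keep the console quiet
        log.setLevel(loggerLevel);

        CapturingHandler handler = new CapturingHandler();
        handler.setLevel(Level.ALL);
        log.addHandler(handler);

        log.fine("reset replica state (debug-like level)");
        log.warning("reset replica state (warn-like level)");
        return handler.messages;
    }

    public static void main(String[] args) {
        System.out.println("logger at INFO: " + logReset(Level.INFO));
        System.out.println("logger at FINE: " + logReset(Level.FINE));
    }
}
```

With the logger at an INFO-like level, which is a common production setting, the debug-level message is dropped, so the reset would leave no trace; at warn level it is always recorded, at the cost of more output if resets turn out to be common.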





[GitHub] [incubator-doris] morningman commented on a diff in pull request #9424: [fix] fix bug that replica can not be repaired duo to DECOMMISSION state

Posted by GitBox <gi...@apache.org>.
morningman commented on code in PR #9424:
URL: https://github.com/apache/incubator-doris/pull/9424#discussion_r867326447


##########
fe/fe-core/src/main/java/org/apache/doris/clone/TabletScheduler.java:
##########
@@ -1569,7 +1574,10 @@ public void handleRunningTablets() {
 
         // 2. release ctx
         timeoutTablets.stream().forEach(t -> {
-            releaseTabletCtx(t, TabletSchedCtx.State.CANCELLED);
+            // Set "resetReplicaState" to true because
+            // the timeout task should also be considered as UNRECOVERABLE,
+            // so need to reset replica state.
+            releaseTabletCtx(t, TabletSchedCtx.State.CANCELLED, true);

Review Comment:
   No need. This is the original logic, and we did not expect a log here in the past.


