
Posted to commits@rocketmq.apache.org by "ShannonDing (via GitHub)" <gi...@apache.org> on 2023/03/24 08:03:13 UTC

[GitHub] [rocketmq] ShannonDing opened a new issue, #6468: The HA switch is not triggered when the disk I/O load is continuously high.

ShannonDing opened a new issue, #6468:
URL: https://github.com/apache/rocketmq/issues/6468

   With a fault simulated by injecting disk load, the current disk usage reaches 99%.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@rocketmq.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] The HA switch is not triggered when the broker disk I/O load is continuously high. [rocketmq]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on issue #6468:
URL: https://github.com/apache/rocketmq/issues/6468#issuecomment-2021683837

   This issue is stale because it has been open for 365 days with no activity. It will be closed in 3 days if no further activity occurs.



[GitHub] [rocketmq] echooymxq commented on issue #6468: The HA switch is not triggered when the broker disk I/O load is continuously high.

Posted by "echooymxq (via GitHub)" <gi...@apache.org>.

echooymxq commented on issue #6468:
URL: https://github.com/apache/rocketmq/issues/6468#issuecomment-1482780754

   I think it's a good idea to implement a health check endpoint on the broker side. Then brokerIsActive would no longer check the heartbeat but invoke the endpoint instead. The endpoint can be as simple as the heartbeat and just return whether the broker is healthy. When electing a broker from the syncStateSet, the endpoint should also be invoked to check that the broker is healthy before it can be elected as master.
   > At this point, the broker may reconnect quickly. Is an HA switch necessary?
   
   IMO, it's not necessary. What's more, the current controller has three places that trigger the election:
   1. The broker channel closed.
   2. The BrokerHeartbeatManager scanNotActiveBroker.
   3. The DLedgerController scanInactiveMasterAndTriggerReelect.
   
   I think scanInactiveMasterAndTriggerReelect alone is enough. In the current implementation, it is easy to run into a concurrent election problem.
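
An illustrative sketch of the health-aware election idea above. The class name, method name, and the probe passed in as a `Predicate` are hypothetical and are not RocketMQ's actual API; the probe stands in for a call to the proposed broker health check endpoint.

```java
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical illustration; not the RocketMQ controller implementation.
final class HealthAwareElection {
    private HealthAwareElection() {
    }

    // Returns the first broker (in the set's iteration order) that passes
    // the health probe, or an empty Optional if no candidate is healthy.
    static Optional<String> electMaster(Set<String> syncStateSet,
                                        Predicate<String> isHealthy) {
        return syncStateSet.stream().filter(isHealthy).findFirst();
    }
}
```

With this shape, a broker that still sends heartbeats but fails the probe is simply skipped as a master candidate.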



[GitHub] [rocketmq] RongtongJin commented on issue #6468: The HA switch is not triggered when the broker disk I/O load is continuously high.

Posted by "RongtongJin (via GitHub)" <gi...@apache.org>.

RongtongJin commented on issue #6468:
URL: https://github.com/apache/rocketmq/issues/6468#issuecomment-1483978256

   > I think it's a good idea to implement a health check endpoint on the broker side. Then brokerIsActive would no longer check the heartbeat but invoke the endpoint instead. The endpoint can be as simple as the heartbeat and just return whether the broker is healthy. When electing a broker from the syncStateSet, the endpoint should also be invoked to check that the broker is healthy before it can be elected as master.
   > 
   > > At this point, the broker may reconnect quickly. Is an HA switch necessary?
   > 
   > IMO, it's not necessary. What's more, the current controller has three places that trigger the election:
   > 
   > 1. The broker channel closed.
   > 2. The BrokerHeartbeatManager scanNotActiveBroker.
   > 3. The DLedgerController scanInactiveMasterAndTriggerReelect.
   > 
   > I think scanInactiveMasterAndTriggerReelect alone is enough. In the current implementation, it is easy to run into a concurrent election problem.
   
   Good suggestion. I also believe that disconnecting the channel should not immediately trigger the election.



[GitHub] [rocketmq] DongyuanPan commented on issue #6468: The HA switch is not triggered when the broker disk I/O load is continuously high.

Posted by "DongyuanPan (via GitHub)" <gi...@apache.org>.

DongyuanPan commented on issue #6468:
URL: https://github.com/apache/rocketmq/issues/6468#issuecomment-1482626988

   @RongtongJin @ShannonDing 



[GitHub] [rocketmq] ShannonDing commented on issue #6468: The HA switch is not triggered when the broker disk I/O load is continuously high.

Posted by "ShannonDing (via GitHub)" <gi...@apache.org>.

ShannonDing commented on issue #6468:
URL: https://github.com/apache/rocketmq/issues/6468#issuecomment-1482439896

   > Should we consider another scenario at the same time?
   > 
   > When network jitter occurs, a network reconnection may be triggered. In this case, from the controller module's point of view, the broker channel changes. When the controller detects that the old channel is closed, it directly kicks the broker out of the brokerLiveTable and triggers a re-election.
   > 
   > At this point, the broker may reconnect quickly. Is an HA switch necessary?
   
   Maybe we can open another issue to discuss this situation.



[GitHub] [rocketmq] ShannonDing commented on issue #6468: The HA switch is not triggered when the disk I/O load is continuously high.

Posted by "ShannonDing (via GitHub)" <gi...@apache.org>.

ShannonDing commented on issue #6468:
URL: https://github.com/apache/rocketmq/issues/6468#issuecomment-1482412448

   In the current implementation, the controller module mainly depends on the heartbeats between the broker and the controller to determine whether the broker status is normal. However, if the broker disk has a problem, the heartbeat can still be sent to the controller module. In this case, the controller does not determine that the broker node is abnormal, so it does not trigger a new election and an HA switch.
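
A minimal sketch of the gap described above, under the assumption that liveness is derived only from the time of the last heartbeat. `HeartbeatOnlyLiveness` and its methods are hypothetical names for illustration, not RocketMQ's actual classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration; not the RocketMQ controller implementation.
final class HeartbeatOnlyLiveness {
    private final Map<String, Long> lastHeartbeatMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    HeartbeatOnlyLiveness(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Records the arrival time of a heartbeat. Nothing about the
    // broker's disk state is carried or stored here.
    void onHeartbeat(String broker, long nowMillis) {
        lastHeartbeatMillis.put(broker, nowMillis);
    }

    // A broker counts as active as long as its last heartbeat is recent,
    // even if its disk is failing or saturated.
    boolean isActive(String broker, long nowMillis) {
        Long last = lastHeartbeatMillis.get(broker);
        return last != null && nowMillis - last <= timeoutMillis;
    }
}
```

Because `isActive` only looks at timestamps, a broker whose disk is unusable but whose heartbeat thread still runs is never reported as inactive.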



[GitHub] [rocketmq] ShannonDing commented on issue #6468: The HA switch is not triggered when the broker disk I/O load is continuously high.

Posted by "ShannonDing (via GitHub)" <gi...@apache.org>.

ShannonDing commented on issue #6468:
URL: https://github.com/apache/rocketmq/issues/6468#issuecomment-1484627768

   > I think it's a good idea to implement a health check endpoint on the broker side. Then brokerIsActive would no longer check the heartbeat but invoke the endpoint instead. The endpoint can be as simple as the heartbeat and just return whether the broker is healthy. When electing a broker from the syncStateSet, the endpoint should also be invoked to check that the broker is healthy before it can be elected as master.
   > 
   > > At this point, the broker may reconnect quickly. Is an HA switch necessary?
   > 
   > IMO, it's not necessary. What's more, the current controller has three places that trigger the election:
   > 
   > 1. The broker channel closed.
   > 2. The BrokerHeartbeatManager scanNotActiveBroker.
   > 3. The DLedgerController scanInactiveMasterAndTriggerReelect.
   > 
   > I think scanInactiveMasterAndTriggerReelect alone is enough. In the current implementation, it is easy to run into a concurrent election problem.
   
   In fact, what we hope is that when the master fails, re-election is triggered promptly to ensure a timely switchover between the master and slave nodes. However, we don't want small jitters to trigger frequent HA switches.
   Therefore, it is necessary both to scan node status actively at regular intervals and to trigger re-election passively on abnormal node events. What matters is that we clarify the thresholds and conditions for an HA switch, as well as the timing for triggering re-election.
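
One possible way to keep small jitters from causing a switch is to require several consecutive unhealthy observations before re-election fires. The sketch below assumes such a rule; `ConsecutiveFailureDetector` and its threshold parameter are hypothetical and do not correspond to existing RocketMQ settings.

```java
// Hypothetical illustration; not the RocketMQ controller implementation.
final class ConsecutiveFailureDetector {
    private final int threshold;
    private int consecutiveFailures;

    ConsecutiveFailureDetector(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    // Feeds one scan result into the detector. Returns true exactly once,
    // on the observation that brings the unhealthy streak up to the
    // threshold. Any healthy observation resets the streak.
    boolean record(boolean healthy) {
        if (healthy) {
            consecutiveFailures = 0;
            return false;
        }
        consecutiveFailures++;
        return consecutiveFailures == threshold;
    }
}
```

A periodic scan would call `record` once per broker per round, so the threshold multiplied by the scan interval gives the approximate detection delay.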



[GitHub] [rocketmq] DongyuanPan commented on issue #6468: The HA switch is not triggered when the broker disk I/O load is continuously high.

Posted by "DongyuanPan (via GitHub)" <gi...@apache.org>.

DongyuanPan commented on issue #6468:
URL: https://github.com/apache/rocketmq/issues/6468#issuecomment-1482625620

   It does appear to be a problem. We need a clear rule to determine whether the machine is offline. We can discuss this rule.



Re: [I] The HA switch is not triggered when the broker disk I/O load is continuously high. [rocketmq]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] closed issue #6468: The HA switch is not triggered when the broker disk I/O load is continuously high.
URL: https://github.com/apache/rocketmq/issues/6468



[GitHub] [rocketmq] RongtongJin commented on issue #6468: The HA switch is not triggered when the broker disk I/O load is continuously high.

Posted by "RongtongJin (via GitHub)" <gi...@apache.org>.

RongtongJin commented on issue #6468:
URL: https://github.com/apache/rocketmq/issues/6468#issuecomment-1483979298

   @ShannonDing @GenerousMan @DongyuanPan @echooymxq I agree with you. I think that HA switching is necessary when there are fatal I/O issues (such as disk corruption or abnormally high I/O).
   However, we also need to give users an optional switch, because if the disk is normal but the traffic itself is large, resulting in continuously high I/O, the HA switch will further affect the stability of the cluster.
   In addition, defining HA switching standards and threshold settings is very challenging.
   
   



Re: [I] The HA switch is not triggered when the broker disk I/O load is continuously high. [rocketmq]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on issue #6468:
URL: https://github.com/apache/rocketmq/issues/6468#issuecomment-2028503940

   This issue was closed because it has been inactive for 3 days since being marked as stale.



[GitHub] [rocketmq] GenerousMan commented on issue #6468: The HA switch is not triggered when the broker disk I/O load is continuously high.

Posted by "GenerousMan (via GitHub)" <gi...@apache.org>.

GenerousMan commented on issue #6468:
URL: https://github.com/apache/rocketmq/issues/6468#issuecomment-1482433243

   Yes, in such cases, triggering automatic switching can make things better. 
   I think it is possible to add a health check in the master: when some fatal exceptions (such as an IOException caused by disk problems) occur, the master can report to the controller so that a new master is elected, just as Kafka does. Some non-fatal abnormal states may be recorded, which would contribute to the observability of HA.
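
A rough sketch of the fatal versus non-fatal split described above. Treating every `IOException` as fatal is a simplifying assumption made for illustration, and `StoreHealthReporter` and its `Action` enum are hypothetical names, not part of RocketMQ.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; not the RocketMQ broker implementation.
final class StoreHealthReporter {
    enum Action { REPORT_TO_CONTROLLER, RECORD_ONLY }

    private final List<String> recorded = new ArrayList<>();

    // Assumption: any IOException is fatal and should be reported so the
    // controller can elect a new master. Everything else is only recorded.
    Action onException(Exception e) {
        if (e instanceof IOException) {
            return Action.REPORT_TO_CONTROLLER;
        }
        recorded.add(e.getClass().getSimpleName() + ": " + e.getMessage());
        return Action.RECORD_ONLY;
    }

    // Read-only snapshot of the non-fatal events seen so far.
    List<String> recordedEvents() {
        return Collections.unmodifiableList(new ArrayList<>(recorded));
    }
}
```

A real classification would likely need to look at the cause of the exception, since not every I/O error indicates a broken disk.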



[GitHub] [rocketmq] ShannonDing commented on issue #6468: The HA switch is not triggered when the broker disk I/O load is continuously high.

Posted by "ShannonDing (via GitHub)" <gi...@apache.org>.

ShannonDing commented on issue #6468:
URL: https://github.com/apache/rocketmq/issues/6468#issuecomment-1484602643

   > @ShannonDing @GenerousMan @DongyuanPan @echooymxq I agree with you. I think that HA switching is necessary when there are fatal I/O issues (such as disk corruption or abnormally high I/O). However, we also need to give users an optional switch, because if the disk is normal but the traffic itself is large, resulting in continuously high I/O, the HA switch will further affect the stability of the cluster. In addition, defining HA switching standards and threshold settings is very challenging.
   
   Yes, let's clarify the HA switching rules and triggering thresholds.



[GitHub] [rocketmq] ShannonDing commented on issue #6468: The HA switch is not triggered when the broker disk I/O load is continuously high.

Posted by "ShannonDing (via GitHub)" <gi...@apache.org>.

ShannonDing commented on issue #6468:
URL: https://github.com/apache/rocketmq/issues/6468#issuecomment-1482437181

   Should we consider another scenario at the same time?
   
   When network jitter occurs, a network reconnection may be triggered. In this case, from the controller module's point of view, the broker channel changes.
   When the controller detects that the old channel is closed, it directly kicks the broker out of the brokerLiveTable and triggers a re-election.
   
   At this point, the broker may reconnect quickly. Is an HA switch necessary?
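
One way to frame the question is a grace period: a closed channel only marks the broker as suspect, and re-election is considered only if the broker has not reconnected within some window. The sketch below assumes that approach; `ChannelCloseGracePeriod` and its grace value are hypothetical and are not existing RocketMQ behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the RocketMQ controller implementation.
final class ChannelCloseGracePeriod {
    private final long graceMillis;
    private final Map<String, Long> closedAtMillis = new HashMap<>();

    ChannelCloseGracePeriod(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    // Remember when the broker's channel closed instead of evicting it at once.
    void onChannelClosed(String broker, long nowMillis) {
        closedAtMillis.put(broker, nowMillis);
    }

    // A reconnect within the grace period clears the pending close.
    void onReconnected(String broker) {
        closedAtMillis.remove(broker);
    }

    // True only if the channel closed and the grace period has elapsed
    // without a reconnect.
    boolean shouldTriggerElection(String broker, long nowMillis) {
        Long closedAt = closedAtMillis.get(broker);
        return closedAt != null && nowMillis - closedAt >= graceMillis;
    }
}
```

A short jitter followed by a quick reconnect then never reaches the election path, while a broker that stays disconnected is still handled after the grace period.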

