You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@rocketmq.apache.org by GitBox <gi...@apache.org> on 2022/08/12 10:15:12 UTC

[GitHub] [rocketmq] yewenhai-github opened a new issue, #4814: DLedger pagecachert overtime

yewenhai-github opened a new issue, #4814:
URL: https://github.com/apache/rocketmq/issues/4814

   1. Please describe the issue you observed:
       4.9.1版本DLedger,有时消息会出现超时特别严重现象,日志中38秒延迟。
   
   2. Please tell us about your environment:
     生产环境,DLedger1主2从模式,使用默认的mmap模式。
   
   
   3. Other information (e.g. detailed explanation, logs, related issues, suggestions on how to fix, etc):
   项目钉钉告警会抓取PutMessageDistributeTime的broker分析数据,对于超过200ms以上的数据进行告警,监控到10s以上的告警比较频繁,比如图中的38秒告警,在此期间是没有任何数据写入的。
   
   **store.log**
   2022-07-26 22:23:13 WARN QuorumAckChecker-n1 - putMessage not in lock elapsed time(ms)=**38044**, bodyLength=179
   2022-07-26 22:23:45 INFO StoreStatsService - [STORETPS] put_tps 0.08331944675887352 get_found_tps 0.06665555740709882 get_miss_tps 8.765205799033494 get_transfered_tps 0.06665555740709882
   2022-07-26 22:23:45 INFO StoreStatsService - [PAGECACHERT] TotalPut 5, PutMessageDistributeTime [<=0ms]:0 [0~10ms]:4 [10~50ms]:0 [50~100ms]:0 [100~200ms]:0 [200~500ms]:0 [500ms~1s]:0 [1~2s]:0 [2~3s]:0 [3~4s]:0 [4~5s]:0 [5~10s]:0 **[10s~]:1** 
   
   **store.log**
   ![image](https://user-images.githubusercontent.com/59827688/184326779-b4757a24-ade0-43ad-8c3a-bc2ba5bd830a.png)
   **broker_default.log** 3个broker的水位线日志
   ![image](https://user-images.githubusercontent.com/59827688/184326803-7e482fa7-2749-415a-a5af-712849b73330.png)
   ![image](https://user-images.githubusercontent.com/59827688/184326831-ffbfcfb0-3d4e-4ed0-bc1c-adea6f911ece.png)
   ![image](https://user-images.githubusercontent.com/59827688/184326851-f5c748df-5275-4312-a543-7368f71d5721.png)
   
   
   **问题定位**
   1、定位日志打印位置:****INFO StoreStatsService - [PAGECACHERT] ***
      broker启动时会运行此线程StoreStatsService.run(),60秒执行一次异步打印;   
   ![image](https://user-images.githubusercontent.com/59827688/184329537-7d1ac07d-af1b-4a6b-b2de-d2c0a4e2276b.png)
   
   2、定位堵塞代码位置:
      putResultFuture.thenAccept会等待this.commitLog.asyncPutMessage(msg);执行完成,所以耗时的38秒在这里;
   ![image](https://user-images.githubusercontent.com/59827688/184329824-66b4d366-cc94-46c8-b37d-e5431ab38c77.png)
      写入一条数据时,会生成一个AppendFuture,等待QuorumAckChecker异步进程来AppendFuture.complete响应;
   ![image](https://user-images.githubusercontent.com/59827688/184330162-42f4114c-0bf0-453c-9095-198df32ccd10.png)
   ![image](https://user-images.githubusercontent.com/59827688/184330256-9a13bc97-a833-4999-926f-f178421f19cf.png)
   
   3、DLedger数据同步、ACK逻辑分析
      主从数据同步:数据写入时触发DLedgerMmapFileStore.appendAsLeader方法ledgerEndIndex++时,DLedger数据同步触发,DLedgerEntryPusher.EntryDispatcher.dowork将主数据同步给从并更新从的水位线。
     QuorumAckChecker.doWork()移除peerWaterMarksByTerm内的临时数据逻辑中,此时如果写入的数据并未放入peerWaterMarksByTerm时(下图,qa环境中加入了休眠随机1-10秒验证,可以复现此现象),那么将会走if(future == null)逻辑转补偿逻辑,补偿(futureEntry.getKey() < quorumIndex)不会处理此数据,lastQuorumIndex更新为最新水位线,在此期间如果没有数据写入,那么一直走检查超时的逻辑,但超时逻辑从quorumIndex + 1开始检查,补偿逻辑又只处理<quorumIndex的数据,完美的错过了futureEntry.getKey() = quorumIndex的场景,造成此数据只能等待下一次数据来冲洗。
   ![image](https://user-images.githubusercontent.com/59827688/184332319-f6ec255e-1a6e-4772-83ac-c9f094ace526.png)
   ![image](https://user-images.githubusercontent.com/59827688/184333831-3671f802-a011-4380-9a69-72dab11b0d38.png)
   
   4、修复方案
      补偿逻辑中:if (futureEntry.getKey() < quorumIndex)  更改为 if (futureEntry.getKey() <= quorumIndex) 
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@rocketmq.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [rocketmq] mxsm commented on issue #4814: DLedger pagecachert overtime

Posted by GitBox <gi...@apache.org>.
mxsm commented on issue #4814:
URL: https://github.com/apache/rocketmq/issues/4814#issuecomment-1214392951

   @RongtongJin if the guy not subbmit PR to fixed it, I will subbmit PR to fixed this issue in DLedger repo


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@rocketmq.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [rocketmq] RongtongJin commented on issue #4814: DLedger pagecachert overtime

Posted by GitBox <gi...@apache.org>.
RongtongJin commented on issue #4814:
URL: https://github.com/apache/rocketmq/issues/4814#issuecomment-1212976731

   @yewenhai-github Good catch! Could you create an issue and  submit a related pull request in [DLedger repo](https://github.com/openmessaging/dledger)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@rocketmq.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [rocketmq] RongtongJin closed issue #4814: DLedger pagecachert overtime

Posted by GitBox <gi...@apache.org>.
RongtongJin closed issue #4814: DLedger pagecachert overtime
URL: https://github.com/apache/rocketmq/issues/4814


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@rocketmq.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [rocketmq] yewenhai-github commented on issue #4814: DLedger pagecachert overtime

Posted by GitBox <gi...@apache.org>.
yewenhai-github commented on issue #4814:
URL: https://github.com/apache/rocketmq/issues/4814#issuecomment-1214535135

   > Hi @RongtongJin if the guy not subbmit PR to fixed it, I will subbmit PR to fixed this issue in DLedger repo
   
   I subbmit PR to the master,but I'm found this question in the dledger tagv0.2.2,You can go to the next step or modify it directly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@rocketmq.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [rocketmq] RongtongJin commented on issue #4814: DLedger pagecachert overtime

Posted by GitBox <gi...@apache.org>.
RongtongJin commented on issue #4814:
URL: https://github.com/apache/rocketmq/issues/4814#issuecomment-1214582490

   Merged


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@rocketmq.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org