Posted to dev@pegasus.apache.org by GitBox <gi...@apache.org> on 2021/04/23 03:01:02 UTC

[GitHub] [incubator-pegasus] ZhongChaoqiang commented on issue #719: data loss after restarting

ZhongChaoqiang commented on issue #719:
URL: https://github.com/apache/incubator-pegasus/issues/719#issuecomment-825351849


   @zhangyifan27 
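   Thanks for your reply! Sorry that I did not see your message sooner.
   This issue is fairly old. We hit it in our production environment. The exact operations that triggered it were not recorded at the time, but there were most likely several restarts.
   Because we had set the cleanup interval for the err directory (gc_disk_error_replica_interval_seconds) too short, data really was lost. The primary's machine had gone offline, and when the secondary node opened the replica it moved the data into the err directory, so users could no longer read the data.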
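   
   To illustrate why a short interval is risky, here is a minimal sketch of an age-based cleanup check. It assumes a directory under err becomes eligible for deletion once its age exceeds gc_disk_error_replica_interval_seconds. The function and parameter names are hypothetical and this is not the actual Pegasus code.
   ```cpp
   #include <cstdint>

   // Hypothetical helper, not the Pegasus implementation.
   // Returns true when a replica directory that was moved to err is older
   // than the configured interval and would therefore be garbage-collected.
   bool err_dir_expired(uint64_t now_sec, uint64_t last_modified_sec, uint64_t gc_interval_sec)
   {
       // Guard against clock skew so the unsigned subtraction cannot wrap.
       if (now_sec <= last_modified_sec) {
           return false;
       }
       return now_sec - last_modified_sec > gc_interval_sec;
   }
   ```
   Under this assumption, the shorter the interval, the less time there is to notice the problem and recover the data from the err directory before it is removed.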
   
   There is also one key piece of log output: the following lines were printed when the replica was opened.
   ```
   E2020-12-03 21:27:57.929 (1607002077929018092 3b14) replica.replica0.04010000000000d9: replication_app_base.cpp:347:open_internal(): 35.24@10.32.82.225:34801: replica data is not complete coz last_durable_decree(16368) < init_durable_decree(16371)
   E2020-12-03 21:27:57.929 (1607002077929048291 3b14) replica.replica0.04010000000000d9: replication_app_base.cpp:353:open_internal(): 35.24@10.32.82.225:34801: open replica app return ERR_INCOMPLETE_DATA
   ```
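   This shows that, when a replica is opened, the last_durable_decree we recorded must not be less than the init_durable_decree read from .init-info. If it is less, data may have been lost and the replica's data should be considered incomplete, so it has to be moved to the err directory.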
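   
   As a rough illustration of the condition behind those two log lines, here is a simplified sketch. The type and function names are hypothetical; this is only a model of the comparison reported in the log, not the actual code in replication_app_base.cpp.
   ```cpp
   #include <cstdint>

   // Hypothetical names for illustration only.
   using decree = int64_t;

   enum class open_result { ok, incomplete_data };

   // Models the comparison reported in the log:
   //   last_durable_decree(16368) < init_durable_decree(16371)
   // When the durable decree found on disk is behind the one recorded in
   // .init-info, the replica is treated as incomplete.
   open_result check_durable_decree(decree last_durable_decree, decree init_durable_decree)
   {
       if (last_durable_decree < init_durable_decree) {
           return open_result::incomplete_data;
       }
       return open_result::ok;
   }
   ```
   With the values from the log above, check_durable_decree(16368, 16371) yields incomplete_data, which corresponds to the ERR_INCOMPLETE_DATA result.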
   
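   How can we guarantee that last_durable_decree is not less than init_durable_decree? We do not understand the decree mechanism very deeply. The approach we have come up with so far is to change last_committed_decree to last_durable_decree, which should mitigate the problem.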
   
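   I have not thought of a good way to reproduce this, so I analyzed the problem from the code instead. Could you please take another look and check whether this problem can really occur? Thanks!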


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


