Posted to dev@pegasus.apache.org by GitBox <gi...@apache.org> on 2021/05/28 10:53:28 UTC

[GitHub] [incubator-pegasus] Shuo-Jia opened a new issue #740: plog may be lost on going duplicating when replica re-open and then learn

Shuo-Jia opened a new issue #740:
URL: https://github.com/apache/incubator-pegasus/issues/740


   https://github.com/apache/incubator-pegasus/issues/693 reported one problem, but it hasn't been reproduced in many later tests. However, I found the following problem, which can usually be reproduced with enough testing, so I am opening this issue to focus on fixing it.
   
   ## Bug Report
   
   Please answer these questions before submitting your issue. Thanks!
   
   ### 1. What did you do?
   * dup status is `start`
   * set meta level to `lively` to trigger balance after adding three nodes
   
   ### 2. What did you see instead?
   The server crashed; the core stack:
   ```
   #0  0x0000003e8d8328a5 in raise () from /lib64/libc.so.6
   #1  0x0000003e8d834085 in abort () from /lib64/libc.so.6
   #2  0x00007f24bfcf0d4e in dsn_coredump () at /home/jiashuo1/work/pegasus/rdsn/src/runtime/service_api_c.cpp:78
   #3  0x00007f24bfac71a3 in dsn::replication::prepare_list::commit (this=this@entry=0x1546d7d90, d=659887, ct=ct@entry=dsn::replication::COMMIT_TO_DECREE_HARD) at /home/jiashuo1/work/pegasus/rdsn/src/replica/prepare_list.cpp:154
   #4  0x00007f24bfac7e6a in dsn::replication::prepare_list::prepare (this=0x1546d7d90, mu=..., status=status@entry=dsn::replication::partition_status::PS_INACTIVE, pop_all_committed_mutations=pop_all_committed_mutations@entry=false)
       at /home/jiashuo1/work/pegasus/rdsn/src/replica/prepare_list.cpp:124
   #5  0x00007f24bfbf6d93 in dsn::replication::mutation_batch::add (this=0x25cb33bc0, mu=...) at /home/jiashuo1/work/pegasus/rdsn/src/replica/duplication/mutation_batch.cpp:52
   #6  0x00007f24bfbf159b in operator() (mu=..., log_bytes_length=2355, __closure=0x7f246d5caaa0) at /home/jiashuo1/work/pegasus/rdsn/src/replica/duplication/load_from_private_log.cpp:124
   #7  std::_Function_handler<bool(int, dsn::ref_ptr<dsn::replication::mutation>&), dsn::replication::load_from_private_log::replay_log_block()::<lambda(int, dsn::replication::mutation_ptr&)> >::_M_invoke(const std::_Any_data &, <unknown type in /home/work/app/pegasus/c4tst-tune/replica/package/bin/libdsn_replica_server.so, CU 0x1ff2be0, DIE 0x20e7113>, dsn::ref_ptr<dsn::replication::mutation> &) (__functor=..., __args#0=<optimized out>, __args#1=...)
       at /home/jiashuo1/app/toolchain/gcc540/output/include/c++/5.4.0/functional:1857
   #8  0x00007f24bfabb00e in operator() (__args#1=..., __args#0=2355, this=0x7f246d5caaa0) at /home/jiashuo1/app/toolchain/gcc540/output/include/c++/5.4.0/functional:2267
   #9  dsn::replication::mutation_log::replay_block(dsn::ref_ptr<dsn::replication::log_file>&, std::function<bool (int, dsn::ref_ptr<dsn::replication::mutation>&)>&, unsigned long, long&) (log=..., callback=..., 
       start_offset=<optimized out>, end_offset=@0x25cb33bb8: 1004176305) at /home/jiashuo1/work/pegasus/rdsn/src/replica/mutation_log_replay.cpp:94
   #10 0x00007f24bfbf0b7c in replay_block (end_offset=<optimized out>, start_offset=<optimized out>, 
       callback=<unknown type in /home/work/app/pegasus/c4tst-tune/replica/package/bin/libdsn_replica_server.so, CU 0x1ff2be0, DIE 0x20b8705>, log=...) at /home/jiashuo1/work/pegasus/rdsn/src/replica/mutation_log.h:149
   #11 dsn::replication::load_from_private_log::replay_log_block (this=this@entry=0x25cb33b00) at /home/jiashuo1/work/pegasus/rdsn/src/replica/duplication/load_from_private_log.cpp:131
   #12 0x00007f24bfbf12ed in dsn::replication::load_from_private_log::run (this=0x25cb33b00) at /home/jiashuo1/work/pegasus/rdsn/src/replica/duplication/load_from_private_log.cpp:68
   #13 0x00007f24bfd1bbbd in dsn::task::exec_internal (this=this@entry=0x114855877) at /home/jiashuo1/work/pegasus/rdsn/src/runtime/task/task.cpp:176
   #14 0x00007f24bfd3671a in dsn::task_worker::loop (this=0x3040780) at /home/jiashuo1/work/pegasus/rdsn/src/runtime/task/task_worker.cpp:211
   #15 0x00007f24bfd368e8 in dsn::task_worker::run_internal (this=0x3040780) at /home/jiashuo1/work/pegasus/rdsn/src/runtime/task/task_worker.cpp:191
   #16 0x00007f24bca55fd0 in std::execute_native_thread_routine (__p=<optimized out>) at /home/jiashuo1/app/toolchain/gcc540/objdir/../gcc-5.4.0/libstdc++-v3/src/c++11/thread.cc:84
   #17 0x0000003e8dc07851 in start_thread () from /lib64/libpthread.so.0
   #18 0x0000003e8d8e811d in clone () from /lib64/libc.so.6
   
   ```
   
   ```
   F2021-04-20 17:01:01.121 (1618909261121014136 26dc7) replica.rep_long1.04050011002af702: prepare_list.cpp:155:commit(): assertion expression: mu != nullptr && mu->is_logged()
   F2021-04-20 17:01:01.121 (1618909261121033079 26dc7) replica.rep_long1.04050011002af702: prepare_list.cpp:155:commit(): [mutation_batch@210.59@10.132.5.5:32801] mutation 659887 is missing in prepare list
   ```
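   
   For context, the assertion fires in `prepare_list::commit` when a decree being committed has no mutation in the in-memory prepare list. Below is a minimal sketch of that invariant with simplified, hypothetical types (the real `prepare_list` in rdsn is a ring buffer and far more involved):
   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <map>
   #include <memory>
   
   // Simplified, hypothetical stand-ins for rdsn's mutation/prepare_list.
   struct mutation {
       int64_t decree = 0;
       bool logged = false;
       bool is_logged() const { return logged; }
   };
   using mutation_ptr = std::shared_ptr<mutation>;
   
   class prepare_list {
   public:
       void put(const mutation_ptr &mu) { mutations_[mu->decree] = mu; }
   
       // Commit every decree up to and including `d`. Each decree in
       // (last_committed_, d] must be present and already logged,
       // otherwise the process aborts -- the crash seen above, where
       // mutation 659887 is missing after the incomplete plog learn.
       void commit(int64_t d) {
           for (int64_t i = last_committed_ + 1; i <= d; ++i) {
               auto it = mutations_.find(i);
               mutation_ptr mu = (it == mutations_.end()) ? nullptr : it->second;
               assert(mu != nullptr && mu->is_logged() &&
                      "mutation is missing in prepare list");
               last_committed_ = i;
           }
       }
   
   private:
       int64_t last_committed_ = 0;
       std::map<int64_t, mutation_ptr> mutations_;
   };
   
   int main() {
       prepare_list plist;
       auto mu = std::make_shared<mutation>();
       mu->decree = 1;
       mu->logged = true;
       plist.put(mu);
       plist.commit(1);   // ok: decree 1 is present and logged
       // plist.commit(2); // would abort: decree 2 is missing
   }
   ```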
   
   ### 3. What version of Pegasus are you using?
   [Pegasus 2.1.1](https://github.com/apache/incubator-pegasus/tree/v2.1.1) 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] Shuo-Jia edited a comment on issue #740: plog may be lost on going duplicating when replica re-open and then learn

Posted by GitBox <gi...@apache.org>.
Shuo-Jia edited a comment on issue #740:
URL: https://github.com/apache/incubator-pegasus/issues/740#issuecomment-859502272


   https://github.com/XiaoMi/rdsn/pull/838 and https://github.com/XiaoMi/rdsn/pull/845 fix the problem: https://github.com/XiaoMi/rdsn/pull/838 has fixed the `duplicating status` problem, and while fixing the `reset_from` bug and testing https://github.com/XiaoMi/rdsn/pull/845, I found that the learned plog may be incomplete.
   
   For duplication, if some log hasn't been confirmed yet, it is reserved and sent to the `potential secondary`; this case is named `step back`. The log learned by the `potential secondary` into the `/learn` path is then moved to replace the files under the `/plog` path in https://github.com/XiaoMi/rdsn/blob/v2.2/src/replica/replica_learn.cpp#L1526. However, `reset_from` is not implemented correctly, and https://github.com/XiaoMi/rdsn/pull/845 resolves that case. But the following logs show that the learned log may already be incomplete at the source (the learnee node).
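   
   To illustrate the step-back rule (a minimal sketch; the function name and signature are hypothetical, not the actual rdsn API): when the replica is duplicating and some mutations are not yet confirmed by the remote cluster, the learn start decree is lowered to the last confirmed decree so those mutations are shipped instead of dropped.
   ```cpp
   #include <algorithm>
   #include <cstdint>
   #include <iostream>
   
   // Hypothetical helper, NOT the real rdsn signature: pick the decree a
   // learn should start from. When the replica is duplicating and some
   // mutations are not yet confirmed by the remote cluster, the start
   // "steps back" to the last confirmed decree so those mutations are
   // sent to the potential secondary instead of being dropped.
   int64_t get_learn_start_decree(int64_t learner_last_committed,
                                  bool duplicating,
                                  int64_t last_confirmed_decree) {
       int64_t start = learner_last_committed + 1; // normal learn start
       if (duplicating) {
           start = std::min(start, last_confirmed_decree); // step back
       }
       return start;
   }
   
   int main() {
       // Values from the logs below: the learner was locally committed to
       // 9217971, but duplication had only confirmed up to 7255649, so
       // the learn start steps back from 9217972 to 7255649.
       std::cout << get_learn_start_decree(9217971, true, 7255649) << "\n";
   }
   ```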
   
   **Learnee:**
   The learnee receives the learn request and finds it should step back ("last_max_decree(9217992) vs learn_start_decree(7255649)" in the log; learn_start_decree steps back to equal the last confirmed decree for duplication), and the plog file selected is index 292 ("learned files count 1 (292 => 292)" in the log):
   ```
   D2021-06-11 15:40:06.539 (1623397206539476516 119869) replica.replica17.0400d3f800032c69: mutation_log.cpp:1148:get_learn_state(): gpid(265.18) get_learn_state returns false, private logs count 73 (220 => 292), learned files count 1 (292 => 292): learned_file_start_offset(8504932039) >= valid_start_offset(8504932039) && last_max_decree(9217992) > 0 && last_max_decree(9217992) < learn_start_decree(7255649)
   
   
   D2021-06-11 15:40:06.539 (1623397206539545453 119869) replica.replica17.0400d3f800032c69: replica_learn.cpp:482:on_learn(): 265.18@10.132.5.5:32801: on_learn[0000006c00000002]: learner = 10.132.5.21:32801, choose to learn private logs, because learn_start_decree steps back for duplication
   ```
   The index 292 log file information (size = 33 MB):
   ```
   mi mi 33M 6月  11 17:21 log.292.8504932039
   ```
   Its content is as follows; that is to say, the start decree is 9217993, not 7255649:
   ```
   mutation: gpid=265.18, ballot=106, decree=9217993, timestamp=2021-06-11 15:39:42.084, last_committed_decree=9217992
   ```
   
   **Learner:**
   The learner received the response and started copying the remote plog file:
   ```
   D2021-06-11 15:40:06.540 (1623397206540251920 51589) replica.replica17.0405001100000283: replica_learn.cpp:601:on_learn_reply(): 265.18@10.132.5.21:32801: on_learn_reply_start=>jiashuo_debug[0000006c00000002]: learnee = 10.132.5.5:32801, learn_duration = 0 ms, response_err = ERR_OK, remote_committed_decree = 9220655, prepare_start_decree = -1, learn_type = replication::learn_type::LT_LOG, learned_buffer_size = 5828, learned_file_count = 1, to_decree_included = 9220655, learn_start_decree = 7255649[9217971], current_learning_status = replication::learner_status::LearningWithoutPrepare
   
   D2021-06-11 15:40:06.565 (1623397206565705781 51624) replica.rep_long2.0405001100000285: replica_learn.cpp:1065:on_copy_remote_state_completed(): 265.18@10.132.5.21:32801: on_copy_remote_state_completed[0000006c00000002]: learnee = 10.132.5.5:32801, learn_duration = 26 ms, copy remote state done, err = ERR_OK, copy_file_count = 1, copy_file_size = 3501690, copy_time_used = 20 ms, local_committed_decree = 9217971, app_committed_decree = 9217971, app_durable_decree = 9217971, prepare_start_decree = -1, current_learning_status = replication::learner_status::LearningWithoutPrepare
   ```
   The copied file's size is only 3.4 MB (matching the log's "copy_file_size = 3501690"):
   ```
   mi mi 3.4M 6月  11 15:51 log.292.8504932039
   ```
   Its content is:
   ```
   mutation: gpid=265.18, ballot=106, decree=9217993, timestamp=2021-06-11 15:39:42.084, last_committed_decree=9217992
   ```
   
   **That is to say:**
   * the start decree of the plog file selected for learning is not equal to or less than the true start decree of duplication
   * the learnee sent its log to the remote node, but the file size the learner received does not match the original file (see the sketch after this list)
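   
   As a defensive check matching the two findings above (a hypothetical helper, not part of rdsn), the learner could validate a learned file before it replaces anything under `/plog`:
   ```cpp
   #include <cstdint>
   #include <iostream>
   #include <stdexcept>
   #include <string>
   
   // Hypothetical validation, not the real rdsn API: check the two
   // properties whose violation produced this crash.
   struct learned_log_file {
       std::string path;
       int64_t first_decree;    // first mutation decree in the file
       int64_t size_on_learner; // bytes actually received
       int64_t size_on_learnee; // bytes reported by the source
   };
   
   void validate_learned_file(const learned_log_file &f, int64_t learn_start_decree) {
       // Finding 1: the learned plog must start at or before the true
       // duplication start decree, or earlier mutations are lost
       // (here: first decree 9217993 > learn_start_decree 7255649).
       if (f.first_decree > learn_start_decree)
           throw std::runtime_error("learned log starts too late: " + f.path);
   
       // Finding 2: the copied file must match the source's size
       // (here: 3501690 bytes received for a 33 MB file on the learnee).
       if (f.size_on_learner != f.size_on_learnee)
           throw std::runtime_error("learned log truncated during copy: " + f.path);
   }
   
   int main() {
       learned_log_file f{"log.292.8504932039", 9217993, 3501690, 33554432};
       try {
           validate_learned_file(f, 7255649);
       } catch (const std::exception &e) {
           std::cout << e.what() << "\n"; // both checks fail for this file
       }
   }
   ```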
   
   So when the log is replayed, it core dumps because mutations are missing:
   ```
   E2021-06-11 15:40:06.588 (1623397206588520392 51624) replica.rep_long2.0405001100000285: replica_learn.cpp:1617:apply_learned_state_from_private_log(): [265.18@10.132.5.21:32801] repaly_before=>jiashuo_debug: step back=true, last_commit_decree=9217971, dup=true
   D2021-06-11 15:40:06.588 (1623397206588854941 51624) replica.rep_long2.0405001100000285: mutation_log_replay.cpp:36:replay(): start to replay mutation log /home/work/ssd10/pegasus/c4tst-tune/replica/reps/265.18.pegasus/plog/log.292.8504932039, offset = [8504932039, 8508433729), size = 3501690
   F2021-06-11 15:40:06.589 (1623397206589160526 51624) replica.rep_long2.0405001100000285: prepare_list.cpp:155:commit(): assertion expression: mu != nullptr && mu->is_logged()
   F2021-06-11 15:40:06.589 (1623397206589180244 51624) replica.rep_long2.0405001100000285: prepare_list.cpp:155:commit(): [265.18@10.132.5.21:32801] mutation 9217972 is missing in prepare list
   ```
   
   I don't know whether this bug is caused only by learning or also by duplication; I'll mark it and test it later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org


[GitHub] [incubator-pegasus] hycdong closed issue #740: fix duplicating&learning(2/n): plog may be lost on going duplicating when replica re-open and then learn

Posted by GitBox <gi...@apache.org>.
hycdong closed issue #740:
URL: https://github.com/apache/incubator-pegasus/issues/740


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org
For additional commands, e-mail: dev-help@pegasus.apache.org

