Posted to reviews@iotdb.apache.org by GitBox <gi...@apache.org> on 2021/11/10 07:01:20 UTC

[GitHub] [iotdb] cigarl opened a new issue #4352: CatchupTask may cause a cluster crash?

cigarl opened a new issue #4352:
URL: https://github.com/apache/iotdb/issues/4352


   # Description
   When my environment (3 nodes, 3 replicas) experiences network fluctuations, or a node is overloaded and responds slowly, I find that after the node rejoins the cluster, `CatchupTask` may corrupt my cluster. I did some analysis and found the following. Please correct me if anything is wrong.
   # Question
   1. `CatchupTask` does not limit the amount of data on a single slot. In other words, a slot can end up with too many schemas or files _(like slot[981] and slot[911])_, which makes installing it a heavy operation. Besides, since we limit the maximum size of a Thrift frame to 512 MB, a request that exceeds this limit cannot be sent to another node successfully.
   ```
   2021-11-05 17:39:13,148 [DataClientThread-133] INFO  o.a.i.c.s.m.DataGroupMember:396 - Data(x.x.x.x:9003, raftId=0): received a snapshot from RaftNode(node:Node(internalIp:x.x.x.x, metaPort:9003, nodeIdentifier:-206505346, dataPort:40010, clientPort:6667, clientIp:x.x.x.x), raftId:0) with size 285120061 
   2021-11-05 17:39:15,093 [DataClientThread-133] INFO  o.a.i.c.l.s.PartitionedSnapshot$Installer:165 - Data(x.x.x.x:9003, raftId=0): start to install a snapshot of 3948175-98 
   2021-11-05 17:39:15,098 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 9 series, index-term: 0-0} into slot[93] 
   2021-11-05 17:39:15,098 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:202 - Schemas in snapshot are registered 
   2021-11-05 17:39:15,110 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:304 - Data(x.x.x.x:9003, raftId=0): slot 93 is ready 
   2021-11-05 17:39:15,110 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 4 series, index-term: 0-0} into slot[121] 
   2021-11-05 17:39:15,110 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:202 - Schemas in snapshot are registered 
   2021-11-05 17:39:15,116 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:304 - Data(x.x.x.x:9003, raftId=0): slot 121 is ready 
   2021-11-05 17:39:15,116 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 16 series, index-term: 0-0} into slot[153] 
   2021-11-05 17:39:15,117 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:202 - Schemas in snapshot are registered 
   2021-11-05 17:39:15,120 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:304 - Data(x.x.x.x:9003, raftId=0): slot 153 is ready 
   2021-11-05 17:39:15,121 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 4 series, index-term: 0-0} into slot[160] 
   2021-11-05 17:39:15,121 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:202 - Schemas in snapshot are registered 
   2021-11-05 17:39:15,124 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:304 - Data(x.x.x.x:9003, raftId=0): slot 160 is ready 
   2021-11-05 17:39:15,125 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 660 series, index-term: 0-0} into slot[363] 
   2021-11-05 17:39:15,139 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:202 - Schemas in snapshot are registered 
   2021-11-05 17:39:15,143 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:304 - Data(x.x.x.x:9003, raftId=0): slot 363 is ready 
   2021-11-05 17:39:15,143 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 1615 series, index-term: 0-0} into slot[366] 
   2021-11-05 17:39:15,172 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:202 - Schemas in snapshot are registered 
   2021-11-05 17:39:15,176 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:304 - Data(x.x.x.x:9003, raftId=0): slot 366 is ready 
   2021-11-05 17:39:15,176 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 2768 series, index-term: 0-0} into slot[574] 
   2021-11-05 17:39:15,227 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:202 - Schemas in snapshot are registered 
   2021-11-05 17:39:15,229 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:304 - Data(x.x.x.x:9003, raftId=0): slot 574 is ready 
   2021-11-05 17:39:15,230 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 4 series, index-term: 0-0} into slot[711] 
   2021-11-05 17:39:15,230 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:202 - Schemas in snapshot are registered 
   2021-11-05 17:39:15,232 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:304 - Data(x.x.x.x:9003, raftId=0): slot 711 is ready 
   2021-11-05 17:39:15,232 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 309 series, index-term: 0-0} into slot[843] 
   2021-11-05 17:39:15,239 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:202 - Schemas in snapshot are registered 
   2021-11-05 17:39:15,240 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:304 - Data(x.x.x.x:9003, raftId=0): slot 843 is ready 
   2021-11-05 17:39:15,241 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 23 series, index-term: 0-0} into slot[879] 
   2021-11-05 17:39:15,241 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:202 - Schemas in snapshot are registered 
   2021-11-05 17:39:15,243 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:304 - Data(x.x.x.x:9003, raftId=0): slot 879 is ready 
   2021-11-05 17:39:15,243 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 300511 series, index-term: 0-0} into slot[911]
   2021-11-05 17:39:20,456 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:202 - Schemas in snapshot are registered 
   2021-11-05 17:39:20,458 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:304 - Data(x.x.x.x:9003, raftId=0): slot 911 is ready 
   2021-11-05 17:39:20,459 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{1 files, 1573281 series, index-term: 0-0} into slot[981]
   ```
   ![image](https://user-images.githubusercontent.com/44458757/141046373-b15f220d-3b7b-4097-8eef-04f83da61b6b.png)
   
   When a request is blocked on a slot, the request fails and is retried repeatedly. 
   
   At the same time, as more operations accumulate, the request grows larger and can never be executed successfully. The retrying threads keep holding resources, and the number of threads keeps increasing.
   ```
   2021-11-05 17:39:15,098 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 9 series, index-term: 0-0} into slot[93]
   2021-11-05 17:39:15,110 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 4 series, index-term: 0-0} into slot[121]
   2021-11-05 17:39:15,116 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 16 series, index-term: 0-0} into slot[153]
   2021-11-05 17:39:15,121 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 4 series, index-term: 0-0} into slot[160]
   2021-11-05 17:39:15,125 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 660 series, index-term: 0-0} into slot[363]
   2021-11-05 17:39:15,143 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 1615 series, index-term: 0-0} into slot[366]
   2021-11-05 17:39:15,176 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 2768 series, index-term: 0-0} into slot[574]
   2021-11-05 17:39:15,230 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 4 series, index-term: 0-0} into slot[711]
   2021-11-05 17:39:15,232 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 309 series, index-term: 0-0} into slot[843]
   2021-11-05 17:39:15,241 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 23 series, index-term: 0-0} into slot[879]
   2021-11-05 17:39:15,243 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 300511 series, index-term: 0-0} into slot[911]
   2021-11-05 17:39:20,459 [DataClientThread-133] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{1 files, 1573281 series, index-term: 0-0} into slot[981]
   
   2021-11-05 19:38:41,003 [DataClientThread-216] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 9 series, index-term: 0-0} into slot[93]
   2021-11-05 19:38:41,006 [DataClientThread-216] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 4 series, index-term: 0-0} into slot[121]
   2021-11-05 19:38:41,007 [DataClientThread-216] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 16 series, index-term: 0-0} into slot[153]
   2021-11-05 19:38:41,010 [DataClientThread-216] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 4 series, index-term: 0-0} into slot[160]
   2021-11-05 19:38:41,012 [DataClientThread-216] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 660 series, index-term: 0-0} into slot[363]
   2021-11-05 19:38:41,095 [DataClientThread-216] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 1615 series, index-term: 0-0} into slot[366]
   2021-11-05 19:38:41,300 [DataClientThread-216] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 2768 series, index-term: 0-0} into slot[574]
   2021-11-05 19:38:41,684 [DataClientThread-216] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 4 series, index-term: 0-0} into slot[711]
   2021-11-05 19:38:41,686 [DataClientThread-216] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 309 series, index-term: 0-0} into slot[843]
   2021-11-05 19:38:41,726 [DataClientThread-216] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 23 series, index-term: 0-0} into slot[879]
   2021-11-05 19:38:41,730 [DataClientThread-216] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 300511 series, index-term: 0-0} into slot[911]
   
   
   2021-11-05 19:39:54,016 [DataClientThread-218] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 9 series, index-term: 0-0} into slot[93]
   2021-11-05 19:39:54,020 [DataClientThread-218] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 4 series, index-term: 0-0} into slot[121]
   2021-11-05 19:39:54,022 [DataClientThread-218] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 16 series, index-term: 0-0} into slot[153]
   2021-11-05 19:39:54,025 [DataClientThread-218] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 4 series, index-term: 0-0} into slot[160]
   2021-11-05 19:39:54,030 [DataClientThread-218] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 660 series, index-term: 0-0} into slot[363]
   2021-11-05 19:39:54,337 [DataClientThread-218] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 1615 series, index-term: 0-0} into slot[366]
   2021-11-05 19:39:54,535 [DataClientThread-218] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 2768 series, index-term: 0-0} into slot[574]
   2021-11-05 19:39:54,876 [DataClientThread-218] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 4 series, index-term: 0-0} into slot[711]
   2021-11-05 19:39:54,878 [DataClientThread-218] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 309 series, index-term: 0-0} into slot[843]
   2021-11-05 19:39:54,928 [DataClientThread-218] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 23 series, index-term: 0-0} into slot[879]
   2021-11-05 19:39:54,933 [DataClientThread-218] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 300511 series, index-term: 0-0} into slot[911]
   2021-11-05 19:40:43,473 [DataClientThread-218] INFO  o.a.i.c.l.s.FileSnapshot$Installer:200 - Starting to install a snapshot FileSnapshot{0 files, 1573281 series, index-term: 0-0} into slot[981]
   ```
   
   2. When a node is restarted, the order of `local recovery` and `peer recovery` is not controlled. `local recovery` can be slow because of a large `mlog.bin`, yet `peer recovery` has already begun. Although I haven't found exactly what went wrong, it was obvious that the CPU load was climbing and the log files were reporting many errors (because `CatchupTask` has already restored the schema, `local recovery` then repeats the same operations).  
   ![image](https://user-images.githubusercontent.com/44458757/141048390-a51c295b-d147-4dc4-96a1-cc77db8b92e6.png)
   
   # Some thinking
   Maybe in `CatchupTask` we need to control both the size of a request and the amount of data on a single slot.
   Assuming a single slot holds one million schemas, we might need to split it into 10 or more operations. We also need to control the size of the entire request so that it does not exceed the Thrift frame limit (512 MB).  
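   
   The splitting idea above can be sketched as follows. This is only a minimal illustration in plain Java, not IoTDB code: `SlotBatcher`, `split`, `maxEntries`, and `maxBytes` are hypothetical names, and each schema is assumed to be already serialized to a `byte[]`. A real byte budget would also have to leave room for Thrift's own encoding overhead.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   // Hypothetical helper (not part of IoTDB): splits the serialized schemas of
   // one slot into batches bounded by entry count and by total payload bytes.
   final class SlotBatcher {
     private SlotBatcher() {}
   
     static List<List<byte[]>> split(List<byte[]> entries, int maxEntries, long maxBytes) {
       if (maxEntries <= 0 || maxBytes <= 0) {
         throw new IllegalArgumentException("limits must be positive");
       }
       List<List<byte[]>> batches = new ArrayList<>();
       List<byte[]> current = new ArrayList<>();
       long currentBytes = 0;
       for (byte[] entry : entries) {
         if (entry.length > maxBytes) {
           throw new IllegalArgumentException("single entry exceeds the byte budget");
         }
         // Close the current batch if adding this entry would break either limit.
         boolean full = current.size() >= maxEntries || currentBytes + entry.length > maxBytes;
         if (full && !current.isEmpty()) {
           batches.add(current);
           current = new ArrayList<>();
           currentBytes = 0;
         }
         current.add(entry);
         currentBytes += entry.length;
       }
       if (!current.isEmpty()) {
         batches.add(current);
       }
       return batches;
     }
   }
   ```
   
   With one million schemas and a hypothetical `maxEntries` of 100,000, this yields 10 batches as long as the byte budget is not reached first. Each batch could then be sent as its own request.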
   
   In addition, when a node is restarted, `local recovery` should be prioritized: `peer recovery` should start only after `local recovery` is complete. We also need to consider whether the conditions for taking an `mtree-snapshot` are too strict. If no snapshot is taken for a long time, `mlog.bin` grows too large and nodes in the cluster recover at different speeds, which may cause other problems (for example, slowly recovering nodes cannot connect to the others, operations are repeated during recovery, and so on).
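   
   The ordering could look roughly like the gate below. It is a sketch under assumptions, not the actual IoTDB recovery code: `RecoveryGate` and its methods are hypothetical names, and the peer-recovery step stands in for whatever handles a catch-up request.
   
   ```java
   import java.util.concurrent.CompletableFuture;
   
   // Hypothetical gate (not IoTDB code): peer-recovery steps are deferred
   // until local recovery has been marked complete.
   final class RecoveryGate {
     private final CompletableFuture<Void> localRecoveryDone = new CompletableFuture<>();
   
     // Called once by the local recovery thread after it has finished.
     void markLocalRecoveryDone() {
       localRecoveryDone.complete(null);
     }
   
     boolean isLocalRecoveryDone() {
       return localRecoveryDone.isDone();
     }
   
     // Registers a peer-recovery step. It runs immediately if local recovery
     // is already complete, otherwise when markLocalRecoveryDone() is called.
     CompletableFuture<Void> afterLocalRecovery(Runnable peerRecoveryStep) {
       return localRecoveryDone.thenRun(peerRecoveryStep);
     }
   }
   ```
   
   Note that `thenRun` executes the step on the thread that completes the future; for heavy work, `thenRunAsync` with a dedicated executor may be a better fit.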
   
   WDYT?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [iotdb] github-actions[bot] commented on issue #4352: CatchupTask may cause a cluster crash?

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #4352:
URL: https://github.com/apache/iotdb/issues/4352#issuecomment-964840080


   Hi, this is your first issue in IoTDB project. Thanks for your report. Welcome to join the community!





[GitHub] [iotdb] cigarl commented on issue #4352: CatchupTask may cause a cluster crash?

Posted by GitBox <gi...@apache.org>.
cigarl commented on issue #4352:
URL: https://github.com/apache/iotdb/issues/4352#issuecomment-966861563


   > I suggest you ask these questions in community group or mailing list after setting up an issue, so that more people will pay attention and participate in the discussion. This is more likely to be scheduled and resolved sooner.
   
   Thanks for the reminder. I will send an email to the community later and try to address some of the problems in this process next week (e.g., we might have duplicate requests in the `CatchupTask`). 





[GitHub] [iotdb] LebronAl commented on issue #4352: CatchupTask may cause a cluster crash?

Posted by GitBox <gi...@apache.org>.
LebronAl commented on issue #4352:
URL: https://github.com/apache/iotdb/issues/4352#issuecomment-966850524


   For the first point, the framing mechanism (sending a snapshot in chunks) is actually described in the Raft paper where it discusses snapshot implementation, and we should implement it that way as well. Could you record an issue so that we can then evaluate its priority for redevelopment?
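   
   As a rough illustration of that idea: the `InstallSnapshot` RPC in the Raft paper carries an `offset`, a `data` chunk, and a `done` flag. The sketch below only shows the split and reassembly on an in-memory byte array; `SnapshotChunks`, `Chunk`, `split`, and `assemble` are hypothetical names and not IoTDB or Thrift APIs.
   
   ```java
   import java.io.ByteArrayOutputStream;
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;
   
   // Hypothetical sketch (not IoTDB code) of offset/done style snapshot chunks.
   final class SnapshotChunks {
     private SnapshotChunks() {}
   
     // One piece of a snapshot: its byte offset, its payload, and a last-chunk flag.
     static final class Chunk {
       final long offset;
       final byte[] data;
       final boolean done;
   
       Chunk(long offset, byte[] data, boolean done) {
         this.offset = offset;
         this.data = data;
         this.done = done;
       }
     }
   
     // Sender side: cut the snapshot into chunks of at most chunkSize bytes.
     static List<Chunk> split(byte[] snapshot, int chunkSize) {
       if (chunkSize <= 0) {
         throw new IllegalArgumentException("chunkSize must be positive");
       }
       List<Chunk> chunks = new ArrayList<>();
       int offset = 0;
       do {
         int end = (int) Math.min((long) snapshot.length, (long) offset + chunkSize);
         boolean last = end == snapshot.length;
         chunks.add(new Chunk(offset, Arrays.copyOfRange(snapshot, offset, end), last));
         offset = end;
       } while (offset < snapshot.length);
       return chunks;
     }
   
     // Receiver side: append chunks in order and stop at the one marked done.
     static byte[] assemble(List<Chunk> chunks) {
       ByteArrayOutputStream out = new ByteArrayOutputStream();
       for (Chunk chunk : chunks) {
         if (chunk.offset != out.size()) {
           throw new IllegalStateException("unexpected offset " + chunk.offset);
         }
         out.write(chunk.data, 0, chunk.data.length);
         if (chunk.done) {
           return out.toByteArray();
         }
       }
       throw new IllegalStateException("final chunk missing");
     }
   }
   ```
   
   The offset check lets the receiver reject a missing or reordered chunk instead of silently installing a damaged snapshot.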
   
   On the second point, the work on mTree Snapshot seems important in this scenario. I think the judgment about Peer Recovery is also correct.
   
   I suggest you ask these questions in community group or mailing list after setting up an issue, so that more people will pay attention and participate in the discussion. This is more likely to be scheduled and resolved sooner.

