Posted to issues@ozone.apache.org by "Nanda kumar (Jira)" <ji...@apache.org> on 2019/12/05 17:39:00 UTC

[jira] [Assigned] (HDDS-2679) Ratis ring creation might be failed with async pipeline creation

     [ https://issues.apache.org/jira/browse/HDDS-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nanda kumar reassigned HDDS-2679:
---------------------------------

    Assignee: Nanda kumar

> Ratis ring creation might be failed with async pipeline creation 
> -----------------------------------------------------------------
>
>                 Key: HDDS-2679
>                 URL: https://issues.apache.org/jira/browse/HDDS-2679
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>          Components: Ozone Datanode, SCM
>            Reporter: Marton Elek
>            Assignee: Nanda kumar
>            Priority: Blocker
>
> Async pipeline creation introduced the following problem:
>  
>  # Let's say the SCM receives registrations from three datanodes.
>  # A Ratis/THREE pipeline is created on the SCM.
>  # With the next heartbeat (HB), Datanode1 (DN1) receives the CreatePipeline command.
>  # DN1 starts the Ratis server, which tries to get votes from DN2 and DN3.
>  # If DN2 has not yet received the CreatePipeline command (which is quite likely with a 30-second HB), it refuses to vote for DN1.
>  # DN1 requests a pipeline close from the SCM because it got no votes from DN2 and DN3.
>  # The pipeline is closed on the SCM side, but in the meantime DN2 (finally) receives the pipeline creation command and tries to get votes. By then DN1 has a newer group/pipeline ID.
>  # And so on.
> If we are lucky, after a while all datanodes receive the creation command at more or less the same time; if not, the SCM cannot create an open Ratis pipeline.
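
The timing dependency in the steps above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Ozone or Ratis code: the `Datanode` class, the `first_election_succeeds` function, and the `give_up_after` parameter are all hypothetical names, and the model assumes a peer grants its vote only if it has already received CreatePipeline before the first datanode gives up.

```python
from dataclasses import dataclass


@dataclass
class Datanode:
    name: str
    # Seconds after pipeline creation on the SCM at which this datanode's
    # heartbeat response delivers the CreatePipeline command (assumed value).
    receives_command_at: float


def first_election_succeeds(datanodes, give_up_after):
    """Toy model: can the first datanode that learns the group win a majority?

    Assumption: a peer votes only if it already knows the group, i.e. it
    received CreatePipeline no later than the candidate's deadline.
    `give_up_after` is a hypothetical number of seconds the candidate waits
    before it would ask the SCM to close the pipeline.
    """
    ordered = sorted(datanodes, key=lambda dn: dn.receives_command_at)
    candidate, peers = ordered[0], ordered[1:]
    deadline = candidate.receives_command_at + give_up_after
    # The candidate always votes for itself.
    votes = 1 + sum(1 for dn in peers if dn.receives_command_at <= deadline)
    majority = len(datanodes) // 2 + 1
    return votes >= majority


# Heartbeats spread across a 30-second interval: the peers learn the group too late.
unlucky = [Datanode("DN1", 0.0), Datanode("DN2", 12.0), Datanode("DN3", 25.0)]
# Heartbeats that happen to arrive close together: the election can succeed.
lucky = [Datanode("DN1", 0.0), Datanode("DN2", 2.0), Datanode("DN3", 3.0)]

print(first_election_succeeds(unlucky, give_up_after=5.0))
print(first_election_succeeds(lucky, give_up_after=5.0))
```

In this model the outcome depends only on how far apart the heartbeats land relative to how long the first datanode waits, which is the "lucky" versus "unlucky" distinction described above.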
>  
> Possible solutions:
>  * At the very beginning, a datanode can trust its peers and learn the group ID from them (but this doesn't cover the case where a pipeline has been closed on DN1 *and* a new pipeline has been created while DN2 still has the old pipeline).
>  * We can use bidirectional gRPC streaming for datanode-SCM communication (a good idea anyway, since it makes the communication faster). The problem remains, however, if there is a network blip between the SCM and DN1.
> This log shows the initial problem (but in this case we were lucky enough to get the CreatePipeline command at the same time):
> {code}
> 2019-12-05 12:06:08,457 INFO impl.RaftServerImpl: 7d6522ce-f918-4b92-a65f-4cf668c838ee@group-7E586EC20819: changes role from      null to FOLLOWER at term 0 for startAsFollower
> 2019-12-05 12:06:13,604 INFO impl.RaftServerImpl: 7d6522ce-f918-4b92-a65f-4cf668c838ee@group-7E586EC20819: changes role from  FOLLOWER to CANDIDATE at term 0 for changeToCandidate
> 2019-12-05 12:06:13,622 INFO impl.RaftServerImpl: 7d6522ce-f918-4b92-a65f-4cf668c838ee@group-7E586EC20819: changes role from CANDIDATE to LEADER at term 1 for changeToLeader
> 2019-12-05 12:06:44,652 INFO impl.RaftServerImpl: 7d6522ce-f918-4b92-a65f-4cf668c838ee@group-6CDAAB81725E: changes role from      null to FOLLOWER at term 0 for startAsFollower
> 2019-12-05 12:09:45,682 INFO impl.RaftServerImpl: 7d6522ce-f918-4b92-a65f-4cf668c838ee@group-8076DB6A465A: changes role from      null to FOLLOWER at term 0 for startAsFollower
> 2019-12-05 12:09:50,764 INFO impl.RaftServerImpl: 7d6522ce-f918-4b92-a65f-4cf668c838ee@group-8076DB6A465A: changes role from  FOLLOWER to CANDIDATE at term 0 for changeToCandidate
> 2019-12-05 12:09:50,936 INFO impl.RaftServerImpl: 7d6522ce-f918-4b92-a65f-4cf668c838ee@group-8076DB6A465A: changes role from CANDIDATE to FOLLOWER at term 1 for DISCOVERED_A_NEW_TERM
> 2019-12-05 12:09:55,963 INFO impl.RaftServerImpl: 7d6522ce-f918-4b92-a65f-4cf668c838ee@group-8076DB6A465A: changes role from  FOLLOWER to CANDIDATE at term 1 for changeToCandidate
> 2019-12-05 12:09:56,011 INFO impl.RaftServerImpl: 7d6522ce-f918-4b92-a65f-4cf668c838ee@group-8076DB6A465A: changes role from CANDIDATE to LEADER at term 2 for changeToLeader
>  elek  om  ~  projects  …  ozone-0.5.0-SNAPSHOT  compose  ozoneperf  master  2⬆  3⚑  %   docker logs ozoneperf_datanode_2 2>&1 | grep "changes role"
> 2019-12-05 12:06:28,401 INFO impl.RaftServerImpl: ff9ad02f-a8b1-4641-8ddb-fec62ddd3a63@group-77C0C0D747E8: changes role from      null to FOLLOWER at term 0 for startAsFollower
> 2019-12-05 12:06:28,457 INFO impl.RaftServerImpl: ff9ad02f-a8b1-4641-8ddb-fec62ddd3a63@group-6CDAAB81725E: changes role from      null to FOLLOWER at term 0 for startAsFollower
> 2019-12-05 12:06:33,460 INFO impl.RaftServerImpl: ff9ad02f-a8b1-4641-8ddb-fec62ddd3a63@group-77C0C0D747E8: changes role from  FOLLOWER to CANDIDATE at term 0 for changeToCandidate
> 2019-12-05 12:06:33,468 INFO impl.RaftServerImpl: ff9ad02f-a8b1-4641-8ddb-fec62ddd3a63@group-77C0C0D747E8: changes role from CANDIDATE to LEADER at term 1 for changeToLeader
> 2019-12-05 12:06:33,570 INFO impl.RaftServerImpl: ff9ad02f-a8b1-4641-8ddb-fec62ddd3a63@group-6CDAAB81725E: changes role from  FOLLOWER to CANDIDATE at term 0 for changeToCandidate
> 2019-12-05 12:06:33,805 INFO impl.RaftServerImpl: ff9ad02f-a8b1-4641-8ddb-fec62ddd3a63@group-6CDAAB81725E: changes role from CANDIDATE to FOLLOWER at term 1 for DISCOVERED_A_NEW_TERM
> 2019-12-05 12:06:38,835 INFO impl.RaftServerImpl: ff9ad02f-a8b1-4641-8ddb-fec62ddd3a63@group-6CDAAB81725E: changes role from  FOLLOWER to CANDIDATE at term 1 for changeToCandidate
> 2019-12-05 12:06:38,887 INFO impl.RaftServerImpl: ff9ad02f-a8b1-4641-8ddb-fec62ddd3a63@group-6CDAAB81725E: changes role from CANDIDATE to LEADER at term 2 for changeToLeader
> 2019-12-05 12:09:53,856 INFO impl.RaftServerImpl: ff9ad02f-a8b1-4641-8ddb-fec62ddd3a63@group-8076DB6A465A: changes role from      null to FOLLOWER at term 0 for startAsFollower
> 2019-12-05 12:09:56,002 INFO impl.RaftServerImpl: ff9ad02f-a8b1-4641-8ddb-fec62ddd3a63@group-8076DB6A465A: changes role from  FOLLOWER to FOLLOWER at term 2 for recognizeCandidate:7d6522ce-f918-4b92-a65f-4cf668c838ee
>  elek  om  ~  projects  …  ozone-0.5.0-SNAPSHOT  compose  ozoneperf  master  2⬆  3⚑  %   docker logs ozoneperf_datanode_3 2>&1 | grep "changes role"
> 2019-12-05 12:06:28,220 INFO impl.RaftServerImpl: 1483994f-8a56-4838-b941-3c12e79b2f80@group-22BE899E4998: changes role from      null to FOLLOWER at term 0 for startAsFollower
> 2019-12-05 12:06:28,273 INFO impl.RaftServerImpl: 1483994f-8a56-4838-b941-3c12e79b2f80@group-6CDAAB81725E: changes role from      null to FOLLOWER at term 0 for startAsFollower
> 2019-12-05 12:06:33,265 INFO impl.RaftServerImpl: 1483994f-8a56-4838-b941-3c12e79b2f80@group-22BE899E4998: changes role from  FOLLOWER to CANDIDATE at term 0 for changeToCandidate
> 2019-12-05 12:06:33,278 INFO impl.RaftServerImpl: 1483994f-8a56-4838-b941-3c12e79b2f80@group-6CDAAB81725E: changes role from  FOLLOWER to CANDIDATE at term 0 for changeToCandidate
> 2019-12-05 12:06:33,288 INFO impl.RaftServerImpl: 1483994f-8a56-4838-b941-3c12e79b2f80@group-22BE899E4998: changes role from CANDIDATE to LEADER at term 1 for changeToLeader
> 2019-12-05 12:06:33,804 INFO impl.RaftServerImpl: 1483994f-8a56-4838-b941-3c12e79b2f80@group-6CDAAB81725E: changes role from CANDIDATE to FOLLOWER at term 1 for DISCOVERED_A_NEW_TERM
> 2019-12-05 12:06:38,877 INFO impl.RaftServerImpl: 1483994f-8a56-4838-b941-3c12e79b2f80@group-6CDAAB81725E: changes role from  FOLLOWER to FOLLOWER at term 2 for recognizeCandidate:ff9ad02f-a8b1-4641-8ddb-fec62ddd3a63
> 2019-12-05 12:09:53,878 INFO impl.RaftServerImpl: 1483994f-8a56-4838-b941-3c12e79b2f80@group-8076DB6A465A: changes role from      null to FOLLOWER at term 0 for startAsFollower
> {code}
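
To see how far apart the datanodes joined a given Raft group, the `startAsFollower` lines in a log like the one above can be grouped by group ID. The following is a minimal sketch that assumes the exact log line layout shown in the quoted output; the function names `join_times` and `join_spread_seconds` are made up for this illustration, and the three sample lines are copied from the log above.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches only the "null to FOLLOWER" lines, i.e. the moment a peer starts a group.
LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO impl\.RaftServerImpl: "
    r"(?P<peer>[0-9a-f-]+)@(?P<group>group-[0-9A-F]+): "
    r"changes role from\s+null to FOLLOWER"
)


def join_times(lines):
    """Map each group ID to {peer ID: time the peer first started as a follower}."""
    joined = defaultdict(dict)
    for line in lines:
        match = LINE.search(line)
        if match:
            ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S,%f")
            joined[match["group"]].setdefault(match["peer"], ts)
    return joined


def join_spread_seconds(lines):
    """Seconds between the first and the last peer joining each group."""
    return {
        group: (max(peers.values()) - min(peers.values())).total_seconds()
        for group, peers in join_times(lines).items()
    }


sample = [
    "2019-12-05 12:06:44,652 INFO impl.RaftServerImpl: 7d6522ce-f918-4b92-a65f-4cf668c838ee@group-6CDAAB81725E: changes role from      null to FOLLOWER at term 0 for startAsFollower",
    "2019-12-05 12:06:28,457 INFO impl.RaftServerImpl: ff9ad02f-a8b1-4641-8ddb-fec62ddd3a63@group-6CDAAB81725E: changes role from      null to FOLLOWER at term 0 for startAsFollower",
    "2019-12-05 12:06:28,273 INFO impl.RaftServerImpl: 1483994f-8a56-4838-b941-3c12e79b2f80@group-6CDAAB81725E: changes role from      null to FOLLOWER at term 0 for startAsFollower",
]

spread = join_spread_seconds(sample)
print(spread)  # {'group-6CDAAB81725E': 16.379}
```

For group-6CDAAB81725E the first datanode started the group roughly 16 seconds after the other two, which fits within a single 30-second heartbeat interval.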


