Posted to commits@helix.apache.org by "subramanian raghunathan (JIRA)" <ji...@apache.org> on 2017/01/26 20:03:24 UTC

[jira] [Created] (HELIX-652) Double assignment , when participant is not able to establish connection with zookeeper quorum

subramanian raghunathan created HELIX-652:
---------------------------------------------

             Summary: Double assignment , when participant is not able to establish connection with zookeeper quorum
                 Key: HELIX-652
                 URL: https://issues.apache.org/jira/browse/HELIX-652
             Project: Apache Helix
          Issue Type: Bug
          Components: helix-core
    Affects Versions: 0.6.4, 0.7.1
            Reporter: subramanian raghunathan


Double assignment when a participant is unable to establish a connection with the ZooKeeper quorum
 
The setup is as follows.
Version(s):
- Helix: 0.7.1
- ZooKeeper: 3.3.4
 
- State model: OnlineOffline
- Controller (leader elected from one of the cluster nodes)
- Single resource with partitions
- Full-auto rebalancer
- ZooKeeper quorum (3 nodes)
 
When one participant loses its ZooKeeper connection (it cannot connect to any of the ZooKeeper servers; a typical occurrence we faced was a switch failure for that rack or a network switch failure on a node):
 
  ----> The participant (say node N1) keeps the partition (P1) for which it is online.
 
Meanwhile, since N1 loses its ephemeral node in ZooKeeper, the rebalancer is triggered and reassigns partition P1 to another participant (say node N2), which becomes online at time T1.
 
  ----> After this, both N1 and N2 act as online for the same partition (P1).
 
However, as soon as the participant on node N1 is able to re-establish its ZooKeeper connection, at time T2:

  ----> Reset is called on the partition on N1.

Double assignment:
The question is: is it expected behavior that both N1 and N2 can be online for the same partition (P1) between times T1 and T2?
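
To make the window concrete, the timeline above can be modeled with a small, self-contained sketch. It is not Helix code: the class name, the method names, and the values used for T1 and T2 are hypothetical. It only assumes what is described above, namely that N1 keeps P1 until reset is called at T2, and that N2 is online for P1 from T1 onward.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the reported timeline (hypothetical names, not Helix APIs). */
final class DoubleAssignmentTimeline {

    /**
     * Length of the window in which both N1 and N2 are online for P1.
     *
     * @param t1 time (ms) at which the rebalancer brings P1 online on N2
     * @param t2 time (ms) at which N1 reconnects and reset is called on P1
     */
    static long overlapMillis(long t1, long t2) {
        // No overlap if N1 reconnects before N2 ever becomes online.
        return Math.max(0L, t2 - t1);
    }

    /** Nodes that consider themselves online for P1 at time t (ms). */
    static List<String> onlineNodesAt(long t, long t1, long t2) {
        List<String> nodes = new ArrayList<>();
        if (t < t2) {
            nodes.add("N1"); // N1 keeps P1 until reset at T2
        }
        if (t >= t1) {
            nodes.add("N2"); // N2 is online for P1 from T1 onward
        }
        return nodes;
    }

    public static void main(String[] args) {
        long t1 = 30_000L; // assumed value for T1
        long t2 = 90_000L; // assumed value for T2

        for (long t : new long[] {10_000L, 60_000L, 120_000L}) {
            System.out.println("t=" + t + " online for P1: " + onlineNodesAt(t, t1, t2));
        }
        System.out.println("double-assignment window (ms): " + overlapMillis(t1, t2));
    }
}
```

With these assumed values, only N1 is online for P1 before T1, both N1 and N2 are online between T1 and T2, and only N2 is online after T2. The double-assignment window is simply T2 - T1.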



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)