Posted to commits@helix.apache.org by "Craig Murphey (Jira)" <ji...@apache.org> on 2020/03/05 19:40:00 UTC

[jira] [Created] (HELIX-822) OnlineOffline cluster stops rebalancing

Craig Murphey created HELIX-822:
-----------------------------------

             Summary: OnlineOffline cluster stops rebalancing
                 Key: HELIX-822
                 URL: https://issues.apache.org/jira/browse/HELIX-822
             Project: Apache Helix
          Issue Type: Bug
          Components: helix-core
    Affects Versions: 0.8.x
            Reporter: Craig Murphey
         Attachments: Screen Shot 2020-03-05 at 11.28.53 AM.png

We recently upgraded our controller to 0.8.4, then downgraded it back to 0.8.2. Since then, some time after a controller is elected master, we have seen our LiveInstanceChangeListener not being called for a live instance update.

On the controller, a thread started at controller startup continuously logs the external state, and it sees the instance count decrease.

At the time we expect the listener to be notified, we see a large number of ZooKeeper nodes being created and deleted.

!Screen Shot 2020-03-05 at 11.28.53 AM.png!

Upon inspecting our instances with helix-admin.sh, we found that we have many more instances than live instances (20 live instances, 60-100 instances). This is because we register each participant under its hostname, which can change over time.

Looking into these instances, we found that many of the non-live instances still have many leftover messages.

We are able to mitigate the issue by restarting the master controller manually.

How do leftover instances affect overall cluster health? Is it possible that the controller is trying to tell offline instances that their resource is dropped, and that this is preventing the controller from issuing the live instance change event?

Here's a snapshot of what we saw in zk:

 
{noformat}
In DCA, there are many messages in ZooKeeper for instances that are not live ->

$ zkcli -h dlmzk ls /DLM/INSTANCES | awk -F \' '{print $2}' | while read host; do echo -n "$host : "; zkcli -h dlmzk ls /DLM/INSTANCES/$host/MESSAGES | awk -F \' '{print $2}' | grep -v "^$" | wc -l ;done | sort -nk 3
agent1016-dca1_8274 : 0
agent1053-dca1_8274 : 0
agent1100-dca1_8274 : 0
agent1346-dca1_8274 : 0
agent1397-dca1_8274 : 0
agent1406-dca1_8274 : 0
agent1412-dca1_8274 : 0
agent1549-dca1_8274 : 0
agent1558-dca1_8274 : 0
agent1573-dca1_8274 : 0
agent1584-dca1_8274 : 0
agent211-dca1_8274 : 0
agent2124-dca1_8274 : 0
agent2148-dca1_8274 : 0
agent2149-dca1_8274 : 0
agent2153-dca1_8274 : 0
agent2184-dca1_8274 : 0
agent21-dca1_8274 : 0
agent2287-dca1_8274 : 0
agent2713-dca1_8274 : 0
agent2763-dca1_8274 : 0
agent27-dca1_8274 : 0
agent2878-dca1_8274 : 0
agent2900-dca1_8274 : 0
agent2930-dca1_8274 : 0
agent31-dca1_8274 : 0
agent3372-dca1_8274 : 0
agent3376-dca1_8274 : 0
agent3435-dca1_8274 : 0
agent3436-dca1_8274 : 0
agent3473-dca1_8274 : 0
agent3543-dca1_8274 : 0
agent3564-dca1_8274 : 0
agent3572-dca1_8274 : 0
agent3601-dca1_8274 : 0
agent3646-dca1_8274 : 0
agent3647-dca1_8274 : 0
agent3648-dca1_8274 : 0
agent3651-dca1_8274 : 0
agent3671-dca1_8274 : 0
agent3677-dca1_8274 : 0
agent3678-dca1_8274 : 0
agent3699-dca1_8274 : 0
agent3714-dca1_8274 : 0
agent3726-dca1_8274 : 0
agent3991-dca1_8274 : 0
agent4070-dca1_8274 : 0
agent4096-dca1_8274 : 0
agent4121-dca1_8274 : 0
agent4545-dca1_8274 : 0
agent4581-dca1_8274 : 0
agent4601-dca1_8274 : 0
agent4612-dca1_8274 : 0
agent4649-dca1_8274 : 0
agent4650-dca1_8274 : 0
agent4651-dca1_8274 : 0
agent4664-dca1_8274 : 0
agent4672-dca1_8274 : 0
agent4678-dca1_8274 : 0
agent46-dca1_8274 : 0
agent4702-dca1_8274 : 0
agent4722-dca1_8274 : 0
agent4726-dca1_8274 : 0
agent4729-dca1_8274 : 0
agent4730-dca1_8274 : 0
agent5233-dca1_8274 : 0
agent5261-dca1_8274 : 0
agent5284-dca1_8274 : 0
agent63-dca1_8274 : 0
agent6444-dca1_8274 : 0
agent79-dca1_8274 : 0
agent83-dca1_8274 : 0
agent84-dca1_8274 : 0
agent90-dca1_8274 : 0
appdocker1204-dca1_8274 : 0
appdocker1454-dca1_8274 : 0
appdocker1858-dca1_8274 : 0
appdocker1950-dca1_8274 : 0
appdocker1966-dca1_8274 : 0
appdocker1970-dca1_8274 : 0
appdocker1985-dca1_8274 : 0
appdocker2012-dca1_8274 : 0
appdocker2046-dca1_8274 : 0
appdocker255-dca1_8274 : 0
appdocker30-dca1_8274 : 0
appdocker507-dca1_8274 : 0
appdocker568-dca1_8274 : 0
appdocker580-dca1_8274 : 0
appdocker61-dca1_8274 : 0
appdocker661-dca1_8274 : 0
appdocker693-dca1_8274 : 0
appdocker77-dca1_8274 : 0
appdocker791-dca1_8274 : 0
appdocker874-dca1_8274 : 0
appdocker909-dca1_8274 : 0
appdocker949-dca1_8274 : 0
compute1699-dca1_8274 : 0
compute2072-dca1_8274 : 0
compute228-dca1_8274 : 0
compute2527-dca1_8274 : 0
compute2541-dca1_8274 : 0
compute2579-dca1_8274 : 0
compute2608-dca1_8274 : 0
compute2792-dca1_8274 : 0
compute2822-dca1_8274 : 0
compute2842-dca1_8274 : 0
compute2849-dca1_8274 : 0
compute2862-dca1_8274 : 0
compute2928-dca1_8274 : 0
compute2937-dca1_8274 : 0
compute2946-dca1_8274 : 0
compute295-dca1_8274 : 0
compute2964-dca1_8274 : 0
compute2999-dca1_8274 : 0
compute3026-dca1_8274 : 0
compute3045-dca1_8274 : 0
compute3209-dca1_8274 : 0
compute3217-dca1_8274 : 0
compute3244-dca1_8274 : 0
compute3247-dca1_8274 : 0
compute3363-dca1_8274 : 0
compute3373-dca1_8274 : 0
compute3383-dca1_8274 : 0
compute3385-dca1_8274 : 0
compute3391-dca1_8274 : 0
compute3413-dca1_8274 : 0
compute3449-dca1_8274 : 0
compute3452-dca1_8274 : 0
compute3525-dca1_8274 : 0
compute3526-dca1_8274 : 0
compute3530-dca1_8274 : 0
compute3546-dca1_8274 : 0
compute3571-dca1_8274 : 0
compute3584-dca1_8274 : 0
compute3600-dca1_8274 : 0
compute3621-dca1_8274 : 0
compute3678-dca1_8274 : 0
compute3691-dca1_8274 : 0
compute3695-dca1_8274 : 0
compute36-dca1_8274 : 0
compute3750-dca1_8274 : 0
compute3770-dca1_8274 : 0
compute3809-dca1_8274 : 0
compute3846-dca1_8274 : 0
compute3857-dca1_8274 : 0
compute3919-dca1_8274 : 0
compute3985-dca1_8274 : 0
compute4033-dca1_8274 : 0
compute4036-dca1_8274 : 0
compute4103-dca1_8274 : 0
compute4141-dca1_8274 : 0
compute4161-dca1_8274 : 0
compute4191-dca1_8274 : 0
compute4239-dca1_8274 : 0
compute42-dca1_8274 : 0
compute4305-dca1_8274 : 0
compute4339-dca1_8274 : 0
compute4396-dca1_8274 : 0
compute4474-dca1_8274 : 0
compute4502-dca1_8274 : 0
compute4532-dca1_8274 : 0
compute4548-dca1_8274 : 0
compute4716-dca1_8274 : 0
compute4764-dca1_8274 : 0
compute4817-dca1_8274 : 0
compute4873-dca1_8274 : 0
compute4887-dca1_8274 : 0
compute4900-dca1_8274 : 0
compute4924-dca1_8274 : 0
compute4962-dca1_8274 : 0
compute4966-dca1_8274 : 0
compute4967-dca1_8274 : 0
compute4980-dca1_8274 : 0
compute4994-dca1_8274 : 0
compute4998-dca1_8274 : 0
compute5303-dca1_8274 : 0
compute5338-dca1_8274 : 0
compute5659-dca1_8274 : 0
compute5661-dca1_8274 : 0
compute5675-dca1_8274 : 0
compute5698-dca1_8274 : 0
compute5710-dca1_8274 : 0
compute5933-dca1_8274 : 0
compute5978-dca1_8274 : 0
compute6011-dca1_8274 : 0
compute6034-dca1_8274 : 0
compute6089-dca1_8274 : 0
compute6269-dca1_8274 : 0
compute6339-dca1_8274 : 0
compute6358-dca1_8274 : 0
compute6366-dca1_8274 : 0
compute6432-dca1_8274 : 0
compute6716-dca1_8274 : 0
compute6717-dca1_8274 : 0
compute6767-dca1_8274 : 0
compute6791-dca1_8274 : 0
compute6825-dca1_8274 : 0
compute6892-dca1_8274 : 0
compute68-dca1_8274 : 0
compute6905-dca1_8274 : 0
compute6937-dca1_8274 : 0
compute6992-dca1_8274 : 0
compute6994-dca1_8274 : 0
compute7029-dca1_8274 : 0
compute7179-dca1_8274 : 0
compute73-dca1_8274 : 0
compute7582-dca1_8274 : 0
compute7586-dca1_8274 : 0
compute7601-dca1_8274 : 0
compute7614-dca1_8274 : 0
compute7700-dca1_8274 : 0
compute7832-dca1_8274 : 0
compute7837-dca1_8274 : 0
compute8696-dca1_8274 : 0
compute8697-dca1_8274 : 0
compute8786-dca1_8274 : 0
compute8864-dca1_8274 : 0
compute8868-dca1_8274 : 0
mpdocker01-dca1_8274 : 0
mpdocker02-dca1_8274 : 0
mpdocker03-dca1_8274 : 0
mpdocker04-dca1_8274 : 0
mpdocker05-dca1_8274 : 0
mpdocker06-dca1_8274 : 0
mpdocker07-dca1_8274 : 0
mpdocker08-dca1_8274 : 0
mpdocker09-dca1_8274 : 0
agent1601-dca1_8274 : 2
agent201-dca1_8274 : 2
agent1415-dca1_8274 : 3
agent4605-dca1_8274 : 3
agent5212-dca1_8274 : 3
agent5236-dca1_8274 : 3
agent5242-dca1_8274 : 3
compute4763-dca1_8274 : 3
compute4916-dca1_8274 : 3
compute6933-dca1_8274 : 3
compute6984-dca1_8274 : 3
compute7713-dca1_8274 : 3
agent2213-dca1_8274 : 5
agent3394-dca1_8274 : 5
agent3618-dca1_8274 : 5
agent4574-dca1_8274 : 5
agent4677-dca1_8274 : 5
agent47-dca1_8274 : 5
compute2824-dca1_8274 : 5
compute3640-dca1_8274 : 5
compute3861-dca1_8274 : 5
compute7159-dca1_8274 : 5
compute7600-dca1_8274 : 5
compute7839-dca1_8274 : 5
compute2985-dca1_8274 : 6
compute3615-dca1_8274 : 6
compute4692-dca1_8274 : 6
agent2209-dca1_8274 : 8
agent2214-dca1_8274 : 8
compute3710-dca1_8274 : 8
compute6329-dca1_8274 : 8
agent5265-dca1_8274 : 9
compute7746-dca1_8274 : 13
agent5179-dca1_8274 : 14
agent4548-dca1_8274 : 15
agent3611-dca1_8274 : 20
agent3721-dca1_8274 : 23
compute3764-dca1_8274 : 23
agent3989-dca1_8274 : 30
agent4145-dca1_8274 : 51
compute3781-dca1_8274 : 55
agent2168-dca1_8274 : 60
agent5352-dca1_8274 : 68
agent3533-dca1_8274 : 78
compute4857-dca1_8274 : 78
compute2982-dca1_8274 : 110
agent4552-dca1_8274 : 113
appdocker1082-dca1_8274 : 135
appdocker538-dca1_8274 : 137
compute1620-dca1_8274 : 512
None of the LIVEINSTANCES have any messages ->
$ zkcli -h dlmzk ls /DLM/LIVEINSTANCES | awk -F \' '{print $2}' | while read host; do echo -n "$host : "; zkcli -h dlmzk ls /DLM/INSTANCES/$host/MESSAGES | awk -F \' '{print $2}' | grep -v "^$" | wc -l ;done | sort -nk 3
agent1412-dca1_8274 : 0
agent1584-dca1_8274 : 0
agent2149-dca1_8274 : 0
agent3435-dca1_8274 : 0
agent3473-dca1_8274 : 0
agent3564-dca1_8274 : 0
agent3572-dca1_8274 : 0
agent3677-dca1_8274 : 0
agent4070-dca1_8274 : 0
agent4096-dca1_8274 : 0
agent6444-dca1_8274 : 0
compute3045-dca1_8274 : 0
compute3525-dca1_8274 : 0
compute3678-dca1_8274 : 0
compute4239-dca1_8274 : 0
compute4305-dca1_8274 : 0
compute4967-dca1_8274 : 0
compute4980-dca1_8274 : 0
compute6716-dca1_8274 : 0
compute6992-dca1_8274 : 0
{noformat}
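The two listings above can also be cross-referenced mechanically. The following is only a minimal sketch, assuming the `host : count` output of both pipelines has been captured as text; `parse_counts` and `stale_with_messages` are hypothetical helper names and are not part of Helix or zkcli. The sample input is a small excerpt of the listings, not the full data.

```python
def parse_counts(listing):
    """Parse lines of the form 'host : count' into a {host: count} dict.

    Lines that do not end in an integer count (for example the shell
    command itself) are skipped.
    """
    counts = {}
    for line in listing.splitlines():
        host, sep, count = line.partition(" : ")
        if sep and count.strip().isdigit():
            counts[host.strip()] = int(count)
    return counts


def stale_with_messages(all_instances, live_instances):
    """Return (host, count) pairs for non-live instances that still hold
    messages, largest backlog first."""
    stale = {
        host: n
        for host, n in all_instances.items()
        if host not in live_instances and n > 0
    }
    return sorted(stale.items(), key=lambda item: item[1], reverse=True)


# Small excerpt of the listings above, used only as sample input.
instances = parse_counts("""\
agent1412-dca1_8274 : 0
agent1016-dca1_8274 : 0
appdocker538-dca1_8274 : 137
compute1620-dca1_8274 : 512
""")
live = parse_counts("""\
agent1412-dca1_8274 : 0
""")

for host, n in stale_with_messages(instances, live):
    print(f"{host}: {n} pending messages")
```

Run against the full output, this would list exactly the non-live instances whose MESSAGES node is non-empty, which are the ones we suspect of affecting the controller.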
 

Current Version: 0.8.2

StateModel: OnlineOffline
{code:java}
./helix-admin.sh -zkSvr dlmzk --listStateModel DLM OnlineOffline
StateModelDefinition: {
  "id" : "OnlineOffline",
  "mapFields" : {
    "DROPPED.meta" : { "count" : "-1" },
    "OFFLINE.meta" : { "count" : "-1" },
    "OFFLINE.next" : { "DROPPED" : "DROPPED", "ONLINE" : "ONLINE" },
    "ONLINE.meta" : { "count" : "R" },
    "ONLINE.next" : { "DROPPED" : "OFFLINE", "OFFLINE" : "OFFLINE" }
  },
  "listFields" : {
    "STATE_PRIORITY_LIST" : [ "ONLINE", "OFFLINE", "DROPPED" ],
    "STATE_TRANSITION_PRIORITYLIST" : [ "OFFLINE-ONLINE", "ONLINE-OFFLINE", "OFFLINE-DROPPED" ]
  },
  "simpleFields" : { "INITIAL_STATE" : "OFFLINE" }
}
{code}
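For context on the question about dropped resources, here is a rough sketch of how the transition tables in this definition can be read. It assumes that each `<STATE>.next` map gives the next hop from that state toward a target state; that is our reading of the definition, not something we have confirmed in the Helix source. `transition_path` is a hypothetical helper, and only the `.next` entries of the definition are reproduced.

```python
import json

# Only the ".next" tables from the definition above are needed here.
STATE_MODEL = json.loads("""
{
  "id": "OnlineOffline",
  "mapFields": {
    "OFFLINE.next": {"DROPPED": "DROPPED", "ONLINE": "ONLINE"},
    "ONLINE.next": {"DROPPED": "OFFLINE", "OFFLINE": "OFFLINE"}
  },
  "simpleFields": {"INITIAL_STATE": "OFFLINE"}
}
""")


def transition_path(model, start, target):
    """Follow the '<STATE>.next' tables hop by hop from start to target.

    Assumption: model['mapFields']['X.next'][target] is the next state
    to move to from X when the goal is `target`.
    """
    path = [start]
    current = start
    while current != target:
        next_table = model["mapFields"].get(f"{current}.next", {})
        if target not in next_table:
            raise ValueError(f"no route from {current} to {target}")
        current = next_table[target]
        if current in path:
            raise ValueError(f"cycle detected at {current}")
        path.append(current)
    return path


print(transition_path(STATE_MODEL, "ONLINE", "DROPPED"))
```

Under that assumption the path from ONLINE to DROPPED is ONLINE, OFFLINE, DROPPED, so dropping an ONLINE replica would take two transition messages per partition.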
 


