Posted to commits@helix.apache.org by "Craig Murphey (Jira)" <ji...@apache.org> on 2020/03/20 16:14:00 UTC

[jira] [Resolved] (HELIX-822) OnlineOffline cluster stops rebalancing

     [ https://issues.apache.org/jira/browse/HELIX-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Murphey resolved HELIX-822.
---------------------------------
    Resolution: Fixed

We upgraded to the latest version and no longer see the issue.

> OnlineOffline cluster stops rebalancing
> ---------------------------------------
>
>                 Key: HELIX-822
>                 URL: https://issues.apache.org/jira/browse/HELIX-822
>             Project: Apache Helix
>          Issue Type: Bug
>          Components: helix-core
>    Affects Versions: 0.8.x
>            Reporter: Craig Murphey
>            Priority: Major
>         Attachments: Screen Shot 2020-03-05 at 11.28.53 AM.png
>
>
> We recently upgraded our controller to 0.8.4, then downgraded it back to 0.8.2. Since then, some time after a controller is elected master, we have seen our LiveInstanceChangeListener not get called when a live instance changes.
> On the controller, a thread started at controller startup continuously logs the external state, and it does see the instance count decrease.
> At the time the listener notification was expected, we do see a large number of znodes being created and deleted.
> !Screen Shot 2020-03-05 at 11.28.53 AM.png!
> Upon inspecting our instances with helix-admin.sh, we found that we have many more instances than live instances (20 live instances, 60-100 instances). This is because we register each participant by hostname, which can change over time.
> Looking into these instances, we found that many of the non-live instances have many leftover messages.
> We are able to mitigate the issue by restarting the master controller manually.
> How do leftover instances affect overall cluster health? Is it possible that the controller is trying to tell offline instances that their resource is dropped, and that this is preventing the controller from issuing the live instance change event?
> Here's a snapshot of what we saw in zk:
>  
> {noformat}
> So, in DCA, there are a lot of messages in Zookeeper for instances that are not live ->
> $ zkcli -h dlmzk ls /DLM/INSTANCES | awk -F \' '{print $2}' | while read host; do echo -n "$host : "; zkcli -h dlmzk ls /DLM/INSTANCES/$host/MESSAGES | awk -F \' '{print $2}' | grep -v "^$" | wc -l ;done | sort -nk 3
> agent1016-dca1_8274 : 0
> agent1053-dca1_8274 : 0
> agent1100-dca1_8274 : 0
> agent1346-dca1_8274 : 0
> agent1397-dca1_8274 : 0
> agent1406-dca1_8274 : 0
> agent1412-dca1_8274 : 0
> agent1549-dca1_8274 : 0
> agent1558-dca1_8274 : 0
> agent1573-dca1_8274 : 0
> agent1584-dca1_8274 : 0
> agent211-dca1_8274 : 0
> agent2124-dca1_8274 : 0
> agent2148-dca1_8274 : 0
> agent2149-dca1_8274 : 0
> agent2153-dca1_8274 : 0
> agent2184-dca1_8274 : 0
> agent21-dca1_8274 : 0
> agent2287-dca1_8274 : 0
> agent2713-dca1_8274 : 0
> agent2763-dca1_8274 : 0
> agent27-dca1_8274 : 0
> agent2878-dca1_8274 : 0
> agent2900-dca1_8274 : 0
> agent2930-dca1_8274 : 0
> agent31-dca1_8274 : 0
> agent3372-dca1_8274 : 0
> agent3376-dca1_8274 : 0
> agent3435-dca1_8274 : 0
> agent3436-dca1_8274 : 0
> agent3473-dca1_8274 : 0
> agent3543-dca1_8274 : 0
> agent3564-dca1_8274 : 0
> agent3572-dca1_8274 : 0
> agent3601-dca1_8274 : 0
> agent3646-dca1_8274 : 0
> agent3647-dca1_8274 : 0
> agent3648-dca1_8274 : 0
> agent3651-dca1_8274 : 0
> agent3671-dca1_8274 : 0
> agent3677-dca1_8274 : 0
> agent3678-dca1_8274 : 0
> agent3699-dca1_8274 : 0
> agent3714-dca1_8274 : 0
> agent3726-dca1_8274 : 0
> agent3991-dca1_8274 : 0
> agent4070-dca1_8274 : 0
> agent4096-dca1_8274 : 0
> agent4121-dca1_8274 : 0
> agent4545-dca1_8274 : 0
> agent4581-dca1_8274 : 0
> agent4601-dca1_8274 : 0
> agent4612-dca1_8274 : 0
> agent4649-dca1_8274 : 0
> agent4650-dca1_8274 : 0
> agent4651-dca1_8274 : 0
> agent4664-dca1_8274 : 0
> agent4672-dca1_8274 : 0
> agent4678-dca1_8274 : 0
> agent46-dca1_8274 : 0
> agent4702-dca1_8274 : 0
> agent4722-dca1_8274 : 0
> agent4726-dca1_8274 : 0
> agent4729-dca1_8274 : 0
> agent4730-dca1_8274 : 0
> agent5233-dca1_8274 : 0
> agent5261-dca1_8274 : 0
> agent5284-dca1_8274 : 0
> agent63-dca1_8274 : 0
> agent6444-dca1_8274 : 0
> agent79-dca1_8274 : 0
> agent83-dca1_8274 : 0
> agent84-dca1_8274 : 0
> agent90-dca1_8274 : 0
> appdocker1204-dca1_8274 : 0
> appdocker1454-dca1_8274 : 0
> appdocker1858-dca1_8274 : 0
> appdocker1950-dca1_8274 : 0
> appdocker1966-dca1_8274 : 0
> appdocker1970-dca1_8274 : 0
> appdocker1985-dca1_8274 : 0
> appdocker2012-dca1_8274 : 0
> appdocker2046-dca1_8274 : 0
> appdocker255-dca1_8274 : 0
> appdocker30-dca1_8274 : 0
> appdocker507-dca1_8274 : 0
> appdocker568-dca1_8274 : 0
> appdocker580-dca1_8274 : 0
> appdocker61-dca1_8274 : 0
> appdocker661-dca1_8274 : 0
> appdocker693-dca1_8274 : 0
> appdocker77-dca1_8274 : 0
> appdocker791-dca1_8274 : 0
> appdocker874-dca1_8274 : 0
> appdocker909-dca1_8274 : 0
> appdocker949-dca1_8274 : 0
> compute1699-dca1_8274 : 0
> compute2072-dca1_8274 : 0
> compute228-dca1_8274 : 0
> compute2527-dca1_8274 : 0
> compute2541-dca1_8274 : 0
> compute2579-dca1_8274 : 0
> compute2608-dca1_8274 : 0
> compute2792-dca1_8274 : 0
> compute2822-dca1_8274 : 0
> compute2842-dca1_8274 : 0
> compute2849-dca1_8274 : 0
> compute2862-dca1_8274 : 0
> compute2928-dca1_8274 : 0
> compute2937-dca1_8274 : 0
> compute2946-dca1_8274 : 0
> compute295-dca1_8274 : 0
> compute2964-dca1_8274 : 0
> compute2999-dca1_8274 : 0
> compute3026-dca1_8274 : 0
> compute3045-dca1_8274 : 0
> compute3209-dca1_8274 : 0
> compute3217-dca1_8274 : 0
> compute3244-dca1_8274 : 0
> compute3247-dca1_8274 : 0
> compute3363-dca1_8274 : 0
> compute3373-dca1_8274 : 0
> compute3383-dca1_8274 : 0
> compute3385-dca1_8274 : 0
> compute3391-dca1_8274 : 0
> compute3413-dca1_8274 : 0
> compute3449-dca1_8274 : 0
> compute3452-dca1_8274 : 0
> compute3525-dca1_8274 : 0
> compute3526-dca1_8274 : 0
> compute3530-dca1_8274 : 0
> compute3546-dca1_8274 : 0
> compute3571-dca1_8274 : 0
> compute3584-dca1_8274 : 0
> compute3600-dca1_8274 : 0
> compute3621-dca1_8274 : 0
> compute3678-dca1_8274 : 0
> compute3691-dca1_8274 : 0
> compute3695-dca1_8274 : 0
> compute36-dca1_8274 : 0
> compute3750-dca1_8274 : 0
> compute3770-dca1_8274 : 0
> compute3809-dca1_8274 : 0
> compute3846-dca1_8274 : 0
> compute3857-dca1_8274 : 0
> compute3919-dca1_8274 : 0
> compute3985-dca1_8274 : 0
> compute4033-dca1_8274 : 0
> compute4036-dca1_8274 : 0
> compute4103-dca1_8274 : 0
> compute4141-dca1_8274 : 0
> compute4161-dca1_8274 : 0
> compute4191-dca1_8274 : 0
> compute4239-dca1_8274 : 0
> compute42-dca1_8274 : 0
> compute4305-dca1_8274 : 0
> compute4339-dca1_8274 : 0
> compute4396-dca1_8274 : 0
> compute4474-dca1_8274 : 0
> compute4502-dca1_8274 : 0
> compute4532-dca1_8274 : 0
> compute4548-dca1_8274 : 0
> compute4716-dca1_8274 : 0
> compute4764-dca1_8274 : 0
> compute4817-dca1_8274 : 0
> compute4873-dca1_8274 : 0
> compute4887-dca1_8274 : 0
> compute4900-dca1_8274 : 0
> compute4924-dca1_8274 : 0
> compute4962-dca1_8274 : 0
> compute4966-dca1_8274 : 0
> compute4967-dca1_8274 : 0
> compute4980-dca1_8274 : 0
> compute4994-dca1_8274 : 0
> compute4998-dca1_8274 : 0
> compute5303-dca1_8274 : 0
> compute5338-dca1_8274 : 0
> compute5659-dca1_8274 : 0
> compute5661-dca1_8274 : 0
> compute5675-dca1_8274 : 0
> compute5698-dca1_8274 : 0
> compute5710-dca1_8274 : 0
> compute5933-dca1_8274 : 0
> compute5978-dca1_8274 : 0
> compute6011-dca1_8274 : 0
> compute6034-dca1_8274 : 0
> compute6089-dca1_8274 : 0
> compute6269-dca1_8274 : 0
> compute6339-dca1_8274 : 0
> compute6358-dca1_8274 : 0
> compute6366-dca1_8274 : 0
> compute6432-dca1_8274 : 0
> compute6716-dca1_8274 : 0
> compute6717-dca1_8274 : 0
> compute6767-dca1_8274 : 0
> compute6791-dca1_8274 : 0
> compute6825-dca1_8274 : 0
> compute6892-dca1_8274 : 0
> compute68-dca1_8274 : 0
> compute6905-dca1_8274 : 0
> compute6937-dca1_8274 : 0
> compute6992-dca1_8274 : 0
> compute6994-dca1_8274 : 0
> compute7029-dca1_8274 : 0
> compute7179-dca1_8274 : 0
> compute73-dca1_8274 : 0
> compute7582-dca1_8274 : 0
> compute7586-dca1_8274 : 0
> compute7601-dca1_8274 : 0
> compute7614-dca1_8274 : 0
> compute7700-dca1_8274 : 0
> compute7832-dca1_8274 : 0
> compute7837-dca1_8274 : 0
> compute8696-dca1_8274 : 0
> compute8697-dca1_8274 : 0
> compute8786-dca1_8274 : 0
> compute8864-dca1_8274 : 0
> compute8868-dca1_8274 : 0
> mpdocker01-dca1_8274 : 0
> mpdocker02-dca1_8274 : 0
> mpdocker03-dca1_8274 : 0
> mpdocker04-dca1_8274 : 0
> mpdocker05-dca1_8274 : 0
> mpdocker06-dca1_8274 : 0
> mpdocker07-dca1_8274 : 0
> mpdocker08-dca1_8274 : 0
> mpdocker09-dca1_8274 : 0
> agent1601-dca1_8274 : 2
> agent201-dca1_8274 : 2
> agent1415-dca1_8274 : 3
> agent4605-dca1_8274 : 3
> agent5212-dca1_8274 : 3
> agent5236-dca1_8274 : 3
> agent5242-dca1_8274 : 3
> compute4763-dca1_8274 : 3
> compute4916-dca1_8274 : 3
> compute6933-dca1_8274 : 3
> compute6984-dca1_8274 : 3
> compute7713-dca1_8274 : 3
> agent2213-dca1_8274 : 5
> agent3394-dca1_8274 : 5
> agent3618-dca1_8274 : 5
> agent4574-dca1_8274 : 5
> agent4677-dca1_8274 : 5
> agent47-dca1_8274 : 5
> compute2824-dca1_8274 : 5
> compute3640-dca1_8274 : 5
> compute3861-dca1_8274 : 5
> compute7159-dca1_8274 : 5
> compute7600-dca1_8274 : 5
> compute7839-dca1_8274 : 5
> compute2985-dca1_8274 : 6
> compute3615-dca1_8274 : 6
> compute4692-dca1_8274 : 6
> agent2209-dca1_8274 : 8
> agent2214-dca1_8274 : 8
> compute3710-dca1_8274 : 8
> compute6329-dca1_8274 : 8
> agent5265-dca1_8274 : 9
> compute7746-dca1_8274 : 13
> agent5179-dca1_8274 : 14
> agent4548-dca1_8274 : 15
> agent3611-dca1_8274 : 20
> agent3721-dca1_8274 : 23
> compute3764-dca1_8274 : 23
> agent3989-dca1_8274 : 30
> agent4145-dca1_8274 : 51
> compute3781-dca1_8274 : 55
> agent2168-dca1_8274 : 60
> agent5352-dca1_8274 : 68
> agent3533-dca1_8274 : 78
> compute4857-dca1_8274 : 78
> compute2982-dca1_8274 : 110
> agent4552-dca1_8274 : 113
> appdocker1082-dca1_8274 : 135
> appdocker538-dca1_8274 : 137
> compute1620-dca1_8274 : 512
> None of the LIVEINSTANCES have any messages ->
> $ zkcli -h dlmzk ls /DLM/LIVEINSTANCES | awk -F \' '{print $2}' | while read host; do echo -n "$host : "; zkcli -h dlmzk ls /DLM/INSTANCES/$host/MESSAGES | awk -F \' '{print $2}' | grep -v "^$" | wc -l ;done | sort -nk 3
> agent1412-dca1_8274 : 0
> agent1584-dca1_8274 : 0
> agent2149-dca1_8274 : 0
> agent3435-dca1_8274 : 0
> agent3473-dca1_8274 : 0
> agent3564-dca1_8274 : 0
> agent3572-dca1_8274 : 0
> agent3677-dca1_8274 : 0
> agent4070-dca1_8274 : 0
> agent4096-dca1_8274 : 0
> agent6444-dca1_8274 : 0
> compute3045-dca1_8274 : 0
> compute3525-dca1_8274 : 0
> compute3678-dca1_8274 : 0
> compute4239-dca1_8274 : 0
> compute4305-dca1_8274 : 0
> compute4967-dca1_8274 : 0
> compute4980-dca1_8274 : 0
> compute6716-dca1_8274 : 0
> compute6992-dca1_8274 : 0
> {noformat}
>  
> Current Version: 0.8.2
> StateModel: OnlineOffline
> {code:java}
> ./helix-admin.sh -zkSvr dlmzk --listStateModel DLM OnlineOffline
> StateModelDefinition: {
>   "id" : "OnlineOffline",
>   "mapFields" : {
>     "DROPPED.meta" : { "count" : "-1" },
>     "OFFLINE.meta" : { "count" : "-1" },
>     "OFFLINE.next" : { "DROPPED" : "DROPPED", "ONLINE" : "ONLINE" },
>     "ONLINE.meta" : { "count" : "R" },
>     "ONLINE.next" : { "DROPPED" : "OFFLINE", "OFFLINE" : "OFFLINE" }
>   },
>   "listFields" : {
>     "STATE_PRIORITY_LIST" : [ "ONLINE", "OFFLINE", "DROPPED" ],
>     "STATE_TRANSITION_PRIORITYLIST" : [ "OFFLINE-ONLINE", "ONLINE-OFFLINE", "OFFLINE-DROPPED" ]
>   },
>   "simpleFields" : { "INITIAL_STATE" : "OFFLINE" }
> }
> {code}
>  
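For readers unfamiliar with the ".next" maps in the OnlineOffline definition above, the following is a minimal sketch of how they can be read: for a current state and a target state, the map names the next hop. The names NEXT_STATE and transition_path are hypothetical, and this is only an illustration of the table, not Helix's controller code.

```python
# The ".next" maps copied from the OnlineOffline definition above:
# NEXT_STATE[current][target] is the next hop on the way to target.
NEXT_STATE = {
    "OFFLINE": {"DROPPED": "DROPPED", "ONLINE": "ONLINE"},
    "ONLINE": {"DROPPED": "OFFLINE", "OFFLINE": "OFFLINE"},
}


def transition_path(start, target):
    """Follow the next-hop table from start until target is reached."""
    path = [start]
    while path[-1] != target:
        hop = NEXT_STATE.get(path[-1], {}).get(target)
        # Stop if the table has no entry or would send us in a circle.
        if hop is None or hop in path:
            raise ValueError(f"no path from {start} to {target}")
        path.append(hop)
    return path


print(transition_path("ONLINE", "DROPPED"))
```

Read this way, the table routes an ONLINE replica through OFFLINE before it reaches DROPPED, while OFFLINE can go to DROPPED directly.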

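The two zkcli listings in the description can be cross-referenced to show which non-live instances still hold messages. The sketch below assumes the "name : count" output format shown there; parse_counts and stale_with_messages are hypothetical helper names, and the sample rows are a small subset copied from the listings.

```python
def parse_counts(listing):
    """Parse 'name : count' lines into a dict, skipping anything else."""
    counts = {}
    for line in listing.splitlines():
        name, sep, count = line.partition(" : ")
        if sep and count.strip().isdigit():
            counts[name.strip()] = int(count)
    return counts


def stale_with_messages(all_counts, live_names):
    """Return (name, count) for non-live instances with messages, largest first."""
    stale = [(n, c) for n, c in all_counts.items()
             if n not in live_names and c > 0]
    return sorted(stale, key=lambda item: item[1], reverse=True)


# A few rows copied from the /DLM/INSTANCES listing in the description.
sample = """agent1412-dca1_8274 : 0
agent1601-dca1_8274 : 2
compute1620-dca1_8274 : 512
appdocker538-dca1_8274 : 137"""

# agent1412-dca1_8274 also appears in the /DLM/LIVEINSTANCES listing.
live = {"agent1412-dca1_8274"}

for name, count in stale_with_messages(parse_counts(sample), live):
    print(f"{name}: {count} pending messages")
```

Sorting by count puts the instances with the largest backlog first, which may help when deciding which stale registrations to inspect.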


--
This message was sent by Atlassian Jira
(v8.3.4#803005)