Posted to commits@helix.apache.org by "Junkai Xue (Jira)" <ji...@apache.org> on 2021/09/13 07:09:00 UTC

[jira] [Closed] (HELIX-818) State transition callbacks for online -> offline and offline -> dropped are sometimes not received

     [ https://issues.apache.org/jira/browse/HELIX-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junkai Xue closed HELIX-818.
----------------------------
    Resolution: Fixed

Should be fixed for Pinot already.

> State transition callbacks for online -> offline and offline -> dropped are sometimes not received
> --------------------------------------------------------------------------------------------------
>
>                 Key: HELIX-818
>                 URL: https://issues.apache.org/jira/browse/HELIX-818
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Siddharth Teotia
>            Priority: Major
>
> As part of the cluster integration tests in Pinot, we have seen that state transition callbacks are sometimes not received. Each unit test here (https://github.com/apache/incubator-pinot/pull/4498/commits/75c0d7eb76f38fd60497876eb7aa501ae048b05c#diff-30ee437b5c9317721c0d35de40a4f36dR456) rebalances tables and moves segments between servers.
> After the test finishes rebalancing (which also means that the external view has converged to the new ideal state, because we ensure it), we check the stats for the ONLINE to OFFLINE and OFFLINE to DROPPED state transitions. The expectation is that if a segment lost a server as part of the rebalance, that server should have received these 2 transitions. The test registers a custom state model factory with Helix for each fake server it creates.
> For the above 2 state transitions, the factory methods bump stats, and that is what we check in the tests.
> Earlier, when these tests were failing intermittently, it was possibly due to the stat variables not being volatile. The PR linked above attempts to re-enable these tests by changing the stats to atomic ints, since they are bumped by the Helix code that invokes the callbacks.
> Even after this change, for some reason, I have occasionally seen a test fail randomly at either of the 2 state transitions. This happens both in Travis builds and sometimes when running the test locally in the IDE.
> An example failure is here: https://travis-ci.org/apache/incubator-pinot/jobs/569442912
> I am wondering whether there is a potential bug due to which the state transition callbacks are sometimes not invoked. This raises the question of how the external view is getting updated as expected, since our tests check for that too (a server that lost a segment as part of rebalancing is no longer present in the host-state mapping of that segment in the external view). If the callback invocations are sometimes missed, how is it possible for the current state, and subsequently the external view, to get updated in the right manner?
> Thanks for the help.
>  
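For readers unfamiliar with the counter change described in the report, the following is a minimal sketch of transition stats backed by atomic ints. It does not use Helix or Pinot code: the class name TransitionStats, its method names, and the simulate helper are all hypothetical and only mirror the two transitions discussed above. The point it illustrates is that a volatile int makes writes visible across threads but does not make count++ atomic, whereas AtomicInteger.incrementAndGet() is atomic, so no increments are lost when several threads bump the same counter.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the test's stats holder; not Helix or Pinot code.
public class TransitionStats {
    private final AtomicInteger onlineToOffline = new AtomicInteger();
    private final AtomicInteger offlineToDropped = new AtomicInteger();

    // Assumed to be called from whichever thread delivers the ONLINE -> OFFLINE callback.
    public void recordOnlineToOffline() {
        onlineToOffline.incrementAndGet();
    }

    // Assumed to be called from whichever thread delivers the OFFLINE -> DROPPED callback.
    public void recordOfflineToDropped() {
        offlineToDropped.incrementAndGet();
    }

    public int onlineToOfflineCount() {
        return onlineToOffline.get();
    }

    public int offlineToDroppedCount() {
        return offlineToDropped.get();
    }

    // Simulates several callback threads bumping the same stats object concurrently.
    public static TransitionStats simulate(int threads, int perThread) {
        TransitionStats stats = new TransitionStats();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) {
                    stats.recordOnlineToOffline();
                    stats.recordOfflineToDropped();
                }
                done.countDown();
            });
        }
        try {
            // Wait until every simulated callback thread has finished.
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } finally {
            pool.shutdown();
        }
        return stats;
    }

    public static void main(String[] args) {
        TransitionStats stats = simulate(8, 1000);
        // Each counter receives 8 * 1000 atomic increments.
        System.out.println(stats.onlineToOfflineCount() + " " + stats.offlineToDroppedCount());
    }
}
```

This sketch only rules out lost increments in the counters themselves. It says nothing about whether Helix invokes the callbacks, which is the open question in the report.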



--
This message was sent by Atlassian Jira
(v8.3.4#803005)