Posted to issues@ignite.apache.org by "Sergey Chugunov (JIRA)" <ji...@apache.org> on 2018/07/06 09:50:00 UTC

[jira] [Commented] (IGNITE-8131) ZookeeperDiscoverySpiTest#testClientReconnectSessionExpire* tests fail on TC

    [ https://issues.apache.org/jira/browse/IGNITE-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534641#comment-16534641 ] 

Sergey Chugunov commented on IGNITE-8131:
-----------------------------------------

[~garus.d.g],

I reviewed the change and it looks somewhat reasonable to me; the tests look fine as well. But I still have a feeling that we are masking the root cause of the problem rather than fixing it (most likely it is some kind of race, since introducing a delay makes the failure go away).

What makes me think so (again, from analysis of the attached logs) is that in the failure example I don't even see a report of a disconnected event: it looks as if the client was never able to detect that it had disconnected from the topology.
And your analysis doesn't explain the lack of a disconnected event; it talks only about the reconnect process.

Could you please explain the sequence of events, as you understand it, in as much detail as possible? Maybe even with references to the code.

I ask because I see in the logs that in the successful scenario the client detects the connection loss almost immediately and switches its state to Disconnected:
{noformat}
[2018-06-09 20:12:35,312][INFO ][zk-internal.ZookeeperDiscoverySpiTest1-EventThread][ZookeeperClient] ZooKeeper client state changed [prevState=Connected, newState=Disconnected]
{noformat}
And in the failure scenario the client does something different at what is probably a similar moment in time:
{noformat}
[2018-06-09 20:12:45,591][WARN ][zk-internal.ZookeeperDiscoverySpiTest1-EventThread][ZookeeperClient] Failed to execute ZooKeeper operation [err=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /apacheIgnite/n/81b80f1f-744f-47d9-bd8b-5cd17c946376:a7926ce8-3713-4270-a788-1e3e8b000001:81|0000000052, state=Connected]
[2018-06-09 20:12:45,591][WARN ][zk-internal.ZookeeperDiscoverySpiTest1-EventThread][ZookeeperClient] ZooKeeper operation failed, will retry [err=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /apacheIgnite/n/81b80f1f-744f-47d9-bd8b-5cd17c946376:a7926ce8-3713-4270-a788-1e3e8b000001:81|0000000052, retryTimeout=2000, connLossTimeout=2000, path=/apacheIgnite/n/81b80f1f-744f-47d9-bd8b-5cd17c946376:a7926ce8-3713-4270-a788-1e3e8b000001:81|0000000052, remainingWaitTime=2000]
{noformat}
It seems to me that in the failure scenario the client receives ConnectionLoss while executing code that is not ready for this exception and handles it incorrectly.
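
To make this hypothesis concrete, here is a minimal toy model in Java. It is only a sketch under my own assumptions: the class, enum and method names are hypothetical and are not taken from ZookeeperClient or the ZooKeeper API. It shows how a client that learns about the connection loss through an operation callback, and only schedules a retry, could stay in the Connected state and never report a disconnect.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model; NOT the real Ignite ZookeeperClient or ZooKeeper API.
final class ConnectionLossSketch {
    enum State { CONNECTED, DISCONNECTED }

    static final class Client {
        State state = State.CONNECTED;
        final List<String> log = new ArrayList<>();

        // Success path: the session watcher reports the disconnect,
        // so the state change is recorded and can be acted upon.
        void onWatcherDisconnected() {
            if (state == State.CONNECTED) {
                state = State.DISCONNECTED;
                log.add("prevState=CONNECTED, newState=DISCONNECTED");
            }
        }

        // Suspected failure path (assumption): an operation callback sees
        // ConnectionLoss while the state is still CONNECTED and only
        // schedules a retry, without touching the state.
        void onOperationConnectionLoss() {
            log.add("operation failed, will retry, state=" + state);
        }
    }

    public static void main(String[] args) {
        Client ok = new Client();
        ok.onWatcherDisconnected();
        System.out.println(ok.state + " " + ok.log);

        Client bad = new Client();
        bad.onOperationConnectionLoss();
        System.out.println(bad.state + " " + bad.log);
    }
}
```

If the real code behaves like the second path, that would match the failure log above, where the operation fails with state=Connected and no state change is ever reported. Whether it actually does needs to be checked against the code.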

Another idea is that on connection loss the client cannot do the necessary cleanup in ZooKeeper, and when it establishes a new connection to ZK it cannot figure out that it has to generate a disconnected event and make a reconnect attempt.

Thanks.

> ZookeeperDiscoverySpiTest#testClientReconnectSessionExpire* tests fail on TC
> ----------------------------------------------------------------------------
>
>                 Key: IGNITE-8131
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8131
>             Project: Ignite
>          Issue Type: Bug
>          Components: zookeeper
>            Reporter: Sergey Chugunov
>            Assignee: Denis Garus
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain
>             Fix For: 2.7
>
>         Attachments: ZK_client_reconnect_failure.log, ZK_client_reconnect_success.log
>
>
> Two tests always fail on TC with the assertion
> {noformat}
> junit.framework.AssertionFailedError: Failed to wait for disconnect/reconnect event.
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.waitReconnectEvent(ZookeeperDiscoverySpiTest.java:4221)
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.reconnectClientNodes(ZookeeperDiscoverySpiTest.java:4183)
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.clientReconnectSessionExpire(ZookeeperDiscoverySpiTest.java:2231)
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.testClientReconnectSessionExpire1_1(ZookeeperDiscoverySpiTest.java:2206)
> {noformat}
> from the client disconnect/reconnect events check. Evidently the client doesn't generate these events as it is supposed to.
> (TC runs can be found [here|https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_IgniteZooKeeperDiscovery&branch_IgniteTests24Java8=pull%2F3730%2Fhead&tab=buildTypeStatusDiv]).
> It is possible to reproduce the test failure locally as well, but with low probability: one failure per 50 or even 300 successful executions.


