You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Sergey Chugunov (JIRA)" <ji...@apache.org> on 2018/07/17 13:28:00 UTC

[jira] [Comment Edited] (IGNITE-8131) ZookeeperDiscoverySpiTest#testClientReconnectSessionExpire* tests fail on TC

    [ https://issues.apache.org/jira/browse/IGNITE-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546603#comment-16546603 ] 

Sergey Chugunov edited comment on IGNITE-8131 at 7/17/18 1:27 PM:
------------------------------------------------------------------

[~garus.d.g],

I finally got the whole picture around the issue.

As part of execution of *ZookeeperDiscoveryImpl#processNewEvents(byte[] data)* client tries to save some data to ZooKeeper (step 0) in *ZookeeperClient#setData(String path, byte[] data, int ver)* but fails because of *KeeperException$SessionExpiredException*.
Program flow doesn't return to *processNewEvents* and goes to *ZookeeperClient#onZookeeperError* where reconnect closure is scheduled and exception is thrown.

But *processNewEvents* didn't finish some logic crucial for successful reconnect: it hasn't initialized *rtState.evtsData* where topVer for reconnect is obtained from.

Proposed fix tries to avoid this race by postponing saving data to ZooKeeper at step 0 so *rtState.evtsData* will be initialized to the moment when session expiration is detected. To me it is risky because it looks like all acks from client nodes will be postponed for a significant amount of time (one minute by default).

As we need *rtState.evtsData* only to retrieve topVer from it on reconnect, why not save topVer earlier, before going to code which could face session expiration? 
So I opened a [PR|https://github.com/apache/ignite/pull/4366] with a quick fix for the problem, could you take a look at it, please, and share your opinion?
It may be not a perfect place to do such early initialization, but we could change it - the question is more about approach in general.




was (Author: sergey-chugunov):
[~garus.d.g],

I finally got the whole picture around the issue.

As part of execution of *ZookeeperDiscoveryImpl#processNewEvents(byte[] data)* client tries to save some data to ZooKeeper (step 0) in *ZookeeperClient#setData(String path, byte[] data, int ver)* but fails because of *KeeperException$SessionExpiredException*.
Program flow doesn't return to *processNewEvents* but goes to *ZookeeperClient#onZookeeperError* where reconnect closure is scheduled and exception is thrown.

But *processNewEvents* didn't finish some logic crucial for successful reconnect: it hasn't initialized *rtState.evtsData* where topVer for reconnect is obtained from.

Proposed fix tries to avoid this race by postponing saving data to ZooKeeper at step 0 so *rtState.evtsData* will be initialized to the moment when session expiration is detected. To me it is risky because it looks like all acks from client nodes will be postponed for a significant amount of time (one minute by default).

As we need *rtState.evtsData* only to retrieve topVer from it on reconnect, why not save topVer earlier, before going to code which could face session expiration? 
So I opened a [PR|https://github.com/apache/ignite/pull/4366] with a quick fix for the problem, could you take a look at it, please, and share your opinion?
It may be not a perfect place to do such early initialization, but we could change it - the question is more about approach in general.



> ZookeeperDiscoverySpiTest#testClientReconnectSessionExpire* tests fail on TC
> ----------------------------------------------------------------------------
>
>                 Key: IGNITE-8131
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8131
>             Project: Ignite
>          Issue Type: Bug
>          Components: zookeeper
>            Reporter: Sergey Chugunov
>            Assignee: Denis Garus
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain
>             Fix For: 2.7
>
>         Attachments: ZK_client_reconnect_failure.log, ZK_client_reconnect_success.log
>
>
> Two tests always fail on TC with the assertion
> {noformat}
> junit.framework.AssertionFailedError: Failed to wait for disconnect/reconnect event.
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.waitReconnectEvent(ZookeeperDiscoverySpiTest.java:4221)
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.reconnectClientNodes(ZookeeperDiscoverySpiTest.java:4183)
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.clientReconnectSessionExpire(ZookeeperDiscoverySpiTest.java:2231)
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.testClientReconnectSessionExpire1_1(ZookeeperDiscoverySpiTest.java:2206)
> {noformat}
> from client disconnect/reconnect events check. Obviously client doesn't generate these events as it supposed to do.
> (TC runs can be found [here|https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_IgniteZooKeeperDiscovery&branch_IgniteTests24Java8=pull%2F3730%2Fhead&tab=buildTypeStatusDiv]).
> It is possible to reproduce test failure locally as well, but with low probability: one failure for 50 or even 300 successful executions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)