Posted to issues@ignite.apache.org by "Andrew Mashenkov (JIRA)" <ji...@apache.org> on 2019/03/12 13:16:00 UTC
[jira] [Comment Edited] (IGNITE-11460) MVCC: Possible race on coordinator changing on client reconnection.

    [ https://issues.apache.org/jira/browse/IGNITE-11460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790530#comment-16790530 ] 

Andrew Mashenkov edited comment on IGNITE-11460 at 3/12/19 1:15 PM:
--------------------------------------------------------------------

[~NSAmelchev], got it.

It seems the root of the issue is that we process NODE_FAILED events after CLIENT_DISCONNECT happens.
To resolve this, we should ignore all topology change events between onDisconnected() and the next onLocalJoin(), which is what your fix does.

I've found that the kernalContext.clientDisconnected flag is set to 'true' in onDisconnected() and to 'false' in onLocalJoin().
I think we can use this flag and skip all topology change events in the onDiscovery() method via a simple check: "if (ctx.clientDisconnected) return". This fix works for me and makes your test pass.
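
To make the proposed check concrete, here is a minimal sketch of the guard. It is not the real MvccProcessorImpl or GridKernalContext code: CtxSketch, TopologyListenerSketch and the string event names are hypothetical stand-ins, and the flag is toggled directly in the listener for brevity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the kernal context; only the flag matters here. */
class CtxSketch {
    /** Volatile because, in the scenario discussed, different threads read and write it. */
    volatile boolean clientDisconnected;
}

/** Hypothetical listener illustrating the guard; not the real Ignite class. */
class TopologyListenerSketch {
    private final CtxSketch ctx;

    /** Topology events that were actually applied (for illustration only). */
    final List<String> applied = new ArrayList<>();

    TopologyListenerSketch(CtxSketch ctx) {
        this.ctx = ctx;
    }

    /** Client lost the connection: raise the flag. */
    void onDisconnected() {
        ctx.clientDisconnected = true;
    }

    /** Client rejoined the topology: clear the flag. */
    void onLocalJoin() {
        ctx.clientDisconnected = false;
    }

    /** Topology change handler with the proposed guard. */
    void onDiscovery(String evt) {
        if (ctx.clientDisconnected)
            return; // Skip events that arrive between disconnect and the next local join.

        applied.add(evt);
    }
}

class DisconnectGuardDemo {
    public static void main(String[] args) {
        TopologyListenerSketch lsnr = new TopologyListenerSketch(new CtxSketch());

        lsnr.onDiscovery("NODE_JOINED"); // applied
        lsnr.onDisconnected();
        lsnr.onDiscovery("NODE_FAILED"); // skipped by the guard
        lsnr.onLocalJoin();
        lsnr.onDiscovery("NODE_LEFT");   // applied again

        System.out.println(lsnr.applied); // prints [NODE_JOINED, NODE_LEFT]
    }
}
```

The guard only helps if the flag is already visible to the thread running onDiscovery() when the stale event arrives, which is why the sketch marks it volatile.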

If any reordering of these events is possible (e.g. due to events being processed from different threads), then it looks like a bug in discovery.


was (Author: amashenkov):
[~NSAmelchev], got it.

It seems the root of the issue is that we process NODE_FAILED events after CLIENT_DISCONNECT happens.
To resolve this, we should ignore all topology change events between onDisconnected() and the next onLocalJoin(), which is what your fix does.

I've found that the kernalContext.clientDisconnected flag is set to 'true' in onDisconnected() and to 'false' in onLocalJoin().
I think we can use this flag and skip all topology change events in the onDiscovery() method via a simple check: "if (ctx.clientDisconnected) return".

If any reordering of these events is possible (e.g. due to events being processed from different threads), then it looks like a bug in discovery.

> MVCC: Possible race on coordinator changing on client reconnection.
> -------------------------------------------------------------------
>
>                 Key: IGNITE-11460
>                 URL: https://issues.apache.org/jira/browse/IGNITE-11460
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Amelchev Nikita
>            Assignee: Amelchev Nikita
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain
>             Fix For: 2.8
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I found that the wrong coordinator can be set when a client reconnects:
> {noformat}
> assert newCrd.topologyVersion().compareTo(curCrd.topologyVersion()) > 0;
> java.lang.AssertionError
>     at org.apache.ignite.internal.processors.cache.mvcc.MvccProcessorImpl.onCoordinatorChanged(MvccProcessorImpl.java:541)
>     at org.apache.ignite.internal.processors.cache.mvcc.MvccProcessorImpl.onLocalJoin(MvccProcessorImpl.java:416)
>     at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.onDiscovery0(GridDiscoveryManager.java:851)
>     at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.lambda$onDiscovery$0(GridDiscoveryManager.java:601)
>     at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryMessageNotifierWorker.body0(GridDiscoveryManager.java:2681)
>     at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$DiscoveryMessageNotifierWorker.body(GridDiscoveryManager.java:2719)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>     at java.lang.Thread.run(Thread.java:748)
> {noformat}
> I have attached a reproducer in the PR.
> The main reason is that the coordinator can be changed from the discovery event thread after the client has already disconnected (the disconnection is processed in the notifier thread, which changes the coordinator in the onDisconnected method).
> The coordinator can be changed in two places:
> 1. notifier disco thread: onDisconnected method
> 2. event disco thread: onDiscovery listener
> Events can be processed with some delay and override the coordinator that was set in the notifier thread.
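
The interleaving described in the issue can be replayed with a small model. This is a simplified sketch, not Ignite code: CoordinatorRaceSketch, the node ids and the "DISCONNECTED" marker are hypothetical, and the two threads are simulated sequentially with a queue so that the outcome is deterministic.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical, simplified model of the race; not Ignite's MvccProcessorImpl. */
class CoordinatorRaceSketch {
    /** Current coordinator id (a plain string in this sketch). */
    volatile String curCrd = "node-1";

    /** Events waiting to be processed by the simulated discovery event thread. */
    final Queue<Runnable> evtQueue = new ArrayDeque<>();

    /** Simulates the notifier thread handling the disconnect. */
    void onDisconnected() {
        curCrd = "DISCONNECTED";
    }

    /** Simulates a topology event that is queued but not yet processed. */
    void enqueueNodeFailed(String newCrd) {
        evtQueue.add(() -> curCrd = newCrd);
    }

    /** Simulates the event thread catching up on its backlog. */
    void drainEvents() {
        Runnable evt;

        while ((evt = evtQueue.poll()) != null)
            evt.run();
    }

    public static void main(String[] args) {
        CoordinatorRaceSketch s = new CoordinatorRaceSketch();

        s.enqueueNodeFailed("node-2"); // event arrives before the disconnect, processed late
        s.onDisconnected();            // notifier thread resets the coordinator
        s.drainEvents();               // delayed event overrides that value

        System.out.println(s.curCrd);  // prints node-2
    }
}
```

The stale event wins because nothing in the event path checks whether the client has disconnected in the meantime.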



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)