Posted to issues@ignite.apache.org by "Anton Vinogradov (JIRA)" <ji...@apache.org> on 2018/07/12 14:10:00 UTC

[jira] [Commented] (IGNITE-8783) Failover tests periodically cause hanging of the whole Data Structures suite on TC

    [ https://issues.apache.org/jira/browse/IGNITE-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541697#comment-16541697 ] 

Anton Vinogradov commented on IGNITE-8783:
------------------------------------------

I found the reason for the hang.
In {{org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager#createClientLatch}}
you can see the following code:
{noformat}
// There is a final ack for the created latch.
if (pendingAcks.containsKey(latchId)) {
	latch.complete();
	pendingAcks.remove(latchId); // This causes the loss of pending acks when the coordinator failure has not been handled yet (e.g. we are still handling another node's failure).
}
else
	clientLatches.put(latchId, latch);
{noformat}

So I propose replacing this code with simply:

{noformat}
clientLatches.put(latchId, latch);
{noformat}
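
To make the difference between the two variants concrete, here is a minimal, self-contained sketch. It is not the Ignite code: the class {{LatchRegistrySketch}}, its method names, the string latch ids and the use of {{CompletableFuture}} as a stand-in for a latch are all assumptions made for illustration. It only models the two maps named above and shows what each variant does when a final ack has already arrived before the latch is created.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified, hypothetical model of client latch registration; not Ignite's actual classes. */
class LatchRegistrySketch {
    /** Final acks that arrived before the matching latch existed (value = sender, hypothetical). */
    final Map<String, String> pendingAcks = new ConcurrentHashMap<>();

    /** Latches that are still waiting for a final ack. */
    final Map<String, CompletableFuture<Void>> clientLatches = new ConcurrentHashMap<>();

    /** Variant quoted above: an early ack completes the latch and is then dropped. */
    CompletableFuture<Void> createConsumingPendingAck(String latchId) {
        CompletableFuture<Void> latch = new CompletableFuture<>();

        if (pendingAcks.containsKey(latchId)) {
            latch.complete(null);
            pendingAcks.remove(latchId);
        }
        else
            clientLatches.put(latchId, latch);

        return latch;
    }

    /** Proposed variant: always register the latch and leave pending acks untouched. */
    CompletableFuture<Void> createAlwaysRegister(String latchId) {
        CompletableFuture<Void> latch = new CompletableFuture<>();

        clientLatches.put(latchId, latch);

        return latch;
    }

    /** Prints the state of one latch id after creation. */
    void report(String variant, String latchId, CompletableFuture<Void> latch) {
        System.out.println(variant
            + ": completed=" + latch.isDone()
            + ", ackKept=" + pendingAcks.containsKey(latchId)
            + ", registered=" + clientLatches.containsKey(latchId));
    }

    public static void main(String[] args) {
        String latchId = "exchange"; // hypothetical latch id

        // An ack from the old coordinator arrives before the latch is created.
        LatchRegistrySketch original = new LatchRegistrySketch();
        original.pendingAcks.put(latchId, "old-coordinator");
        original.report("original", latchId, original.createConsumingPendingAck(latchId));

        LatchRegistrySketch proposed = new LatchRegistrySketch();
        proposed.pendingAcks.put(latchId, "old-coordinator");
        proposed.report("proposed", latchId, proposed.createAlwaysRegister(latchId));
    }
}
```

In this model the original variant completes the latch immediately and discards the early ack, while the proposed variant keeps the latch registered and incomplete, so it can only be completed by an ack delivered after registration. Whether that matches the real exchange flow depends on acks being resent after a topology change, which is exactly the question below.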

[~Jokser],
Could you please explain the idea behind handling the final message from the old coordinator?
As far as I can see, latches will be recreated on each topology change and acks will be resent.

> Failover tests periodically cause hanging of the whole Data Structures suite on TC
> ----------------------------------------------------------------------------------
>
>                 Key: IGNITE-8783
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8783
>             Project: Ignite
>          Issue Type: Bug
>          Components: data structures
>            Reporter: Ivan Rakov
>            Assignee: Anton Vinogradov
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain
>
> History of suite runs: https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_DataStructures&tab=buildTypeHistoryList&branch_IgniteTests24Java8=%3Cdefault%3E
> The chance of a suite hang is 18% in master (based on the previous 50 runs).
> A hang is always caused by one of the following failover tests:
> {noformat}
> GridCacheReplicatedDataStructuresFailoverSelfTest#testAtomicSequenceConstantTopologyChange
> GridCachePartitionedDataStructuresFailoverSelfTest#testFairReentrantLockConstantTopologyChangeNonFailoverSafe
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)