Posted to issues@ignite.apache.org by "Rohit Kanchan (JIRA)" <ji...@apache.org> on 2017/07/17 21:44:00 UTC

[jira] [Commented] (IGNITE-3212) Servers get stuck with the warning "Failed to wait for initial partition map exchange" during failover test

    [ https://issues.apache.org/jira/browse/IGNITE-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090627#comment-16090627 ] 

Rohit Kanchan commented on IGNITE-3212:
---------------------------------------

We are trying to use embedded Ignite for our Spark Streaming application and are seeing a similar issue: the application gets stuck with a similar stack trace and thread dump. We are using Ignite 2.0. I will try to dig deeper and see if I can find what is causing this.
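
For context, "embedded" here means the Ignite server node runs inside the application JVM rather than as a standalone process, so it takes part in the partition map exchange like any other server. A minimal sketch of such an embedded start, in plain Java without the Spark wiring (all configuration values are illustrative):

{noformat}
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class EmbeddedIgniteExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(false); // a full server node embedded in this JVM

        // Ignition.start() blocks in the main thread until the node has joined,
        // which is where the "initial partition map exchange" wait happens
        // (note the [main] thread in the warning below).
        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("Joined topology, version: " + ignite.cluster().topologyVersion());
        }
    }
}
{noformat}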

> Servers get stuck with the warning "Failed to wait for initial partition map exchange" during failover test
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-3212
>                 URL: https://issues.apache.org/jira/browse/IGNITE-3212
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 1.6
>            Reporter: Ksenia Rybakova
>            Assignee: Ksenia Rybakova
>            Priority: Critical
>             Fix For: 1.9
>
>
> Servers being restarted during the failover test get stuck after some time with the warning "Failed to wait for initial partition map exchange".
> {noformat}
> [08:44:41,303][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=db557f04-43b7-4e28-ae0d-d4dcf4139c89, addrs=
> [10.20.0.222, 127.0.0.1], sockAddrs=[fosters-222/10.20.0.222:47503, /10.20.0.222:47503, /127.0.0.1:47503], discPort=47503, order=44, intOrder=32, lastExchangeTime=1464
> 363880917, loc=false, ver=1.6.0#20160525-sha1:48321a40, isClient=false]
> [08:44:41,304][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Topology snapshot [ver=44, servers=19, clients=1, CPUs=64, heap=160.0GB]
> [08:45:11,455][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=6fae61a7-c1c1-40e5-8ad0-8bf5d6c86eb7, addrs=
> [10.20.0.223, 127.0.0.1], sockAddrs=[fosters-223/10.20.0.223:47503, /10.20.0.223:47503, /127.0.0.1:47503], discPort=47503, order=45, intOrder=33, lastExchangeTime=1464
> 363910999, loc=false, ver=1.6.0#20160525-sha1:48321a40, isClient=false]
> [08:45:11,455][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Topology snapshot [ver=45, servers=20, clients=1, CPUs=64, heap=170.0GB]
> [08:45:19,942][INFO ][ignite-update-notifier-timer][GridUpdateNotifier] Update status is not available.
> [08:46:20,370][WARN ][main][GridCachePartitionExchangeManager] Failed to wait for initial partition map exchange. Possible reasons are:
>   ^-- Transactions in deadlock.
>   ^-- Long running transactions (ignore if this is the case).
>   ^-- Unreleased explicit locks.
> [08:48:30,375][WARN ][main][GridCachePartitionExchangeManager] Still waiting for initial partition map exchange ...
> {noformat}
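> Of the reasons listed in the warning, "unreleased explicit locks" is the one most directly visible in application code: the exchange cannot complete while an explicit cache lock is held. A minimal sketch of the expected lock discipline (cache name and key are illustrative):
> {noformat}
> import java.util.concurrent.locks.Lock;
> import org.apache.ignite.Ignite;
> import org.apache.ignite.IgniteCache;
> import org.apache.ignite.Ignition;
> import org.apache.ignite.cache.CacheAtomicityMode;
> import org.apache.ignite.configuration.CacheConfiguration;
>
> public class ExplicitLockExample {
>     public static void main(String[] args) {
>         try (Ignite ignite = Ignition.start()) {
>             // Explicit locks require a TRANSACTIONAL cache.
>             CacheConfiguration<Integer, String> ccfg =
>                 new CacheConfiguration<Integer, String>("tx-cache")
>                     .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
>             IgniteCache<Integer, String> cache = ignite.getOrCreateCache(ccfg);
>
>             Lock lock = cache.lock(1);
>             lock.lock();
>             try {
>                 cache.put(1, "value");
>             } finally {
>                 lock.unlock(); // a lock that is never released stalls the exchange
>             }
>         }
>     }
> }
> {noformat}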
> "Failed to wait for partition release future" warnings are on other nodes.
> {noformat}
> [08:09:45,822][WARN ][exchange-worker-#82%null%][GridDhtPartitionsExchangeFuture] Failed to wait for partition release future [topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], node=cab5d0e0-7365-4774-8f99-d9f131c5d896]. Dumping pending objects that might be the cause:
> [08:09:45,822][WARN ][exchange-worker-#82%null%][GridCachePartitionExchangeManager] Ready affinity version: AffinityTopologyVersion [topVer=28, minorTopVer=1]
> [08:09:45,826][WARN ][exchange-worker-#82%null%][GridCachePartitionExchangeManager] Last exchange future: GridDhtPartitionsExchangeFuture ...
> {noformat}
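> The partition release future is related to the "long running transactions" reason above: the exchange cannot move to the new topology until in-flight transactions started on the old one complete. A minimal sketch of bounding that wait with an explicit transaction timeout (all values are illustrative):
> {noformat}
> import org.apache.ignite.Ignite;
> import org.apache.ignite.IgniteCache;
> import org.apache.ignite.Ignition;
> import org.apache.ignite.cache.CacheAtomicityMode;
> import org.apache.ignite.configuration.CacheConfiguration;
> import org.apache.ignite.transactions.Transaction;
>
> import static org.apache.ignite.transactions.TransactionConcurrency.PESSIMISTIC;
> import static org.apache.ignite.transactions.TransactionIsolation.REPEATABLE_READ;
>
> public class TxTimeoutExample {
>     public static void main(String[] args) {
>         try (Ignite ignite = Ignition.start()) {
>             IgniteCache<Integer, String> cache = ignite.getOrCreateCache(
>                 new CacheConfiguration<Integer, String>("tx-cache")
>                     .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL));
>
>             // A transaction that exceeds the timeout is rolled back instead of
>             // holding up the partition release future indefinitely.
>             try (Transaction tx = ignite.transactions().txStart(
>                     PESSIMISTIC, REPEATABLE_READ, 5_000 /* timeout, ms */, 1 /* txSize */)) {
>                 cache.put(1, "value");
>                 tx.commit();
>             }
>         }
>     }
> }
> {noformat}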
> Load config:
> - 1 client, 20 servers (5 servers per host)
> - warmup 60
> - duration 66h
> - preload 5M
> - key range 10M
> - operations: PUT, PUT_ALL, GET, GET_ALL, INVOKE, INVOKE_ALL, REMOVE, REMOVE_ALL, PUT_IF_ABSENT, REPLACE
> - backups count 3 (see the cache configuration sketch after this list)
> - 3 servers are restarted every 15 min with a 30 sec step; the pause between stop and start is 5 min
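> A cache configuration matching "backups count 3" would look roughly like this (the cache name is illustrative; only the backup count comes from the setup above):
> {noformat}
> import org.apache.ignite.configuration.CacheConfiguration;
>
> // Sketch: each partition keeps 3 backup copies on other nodes, as in the failover test.
> CacheConfiguration<Integer, byte[]> ccfg = new CacheConfiguration<>("load-test-cache");
> ccfg.setBackups(3);
> {noformat}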



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)