Posted to user@ignite.apache.org by Ray <ra...@cisco.com> on 2018/08/02 09:11:36 UTC

Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

The root cause of this issue was network throttling between the clients and
the servers.

After I moved the clients to run in the same cluster as the servers, the
problem no longer occurred.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

Posted by userx <ga...@gmail.com>.
Hi Pavel

I am encountering the same issue: the server seems to have entered an
infinite loop, and every 10 seconds I see the following message.


2018-12-06 15:49:23,188 WARN
[exchange-worker-#122%5b9b0820-ec94-493c-ae58-bc31aac873c6%] {}
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture
- Unable to await partitions release latch within timeout: ClientLatch
[coordinator=TcpDiscoveryNode [id=a8b0f10f-8ad9-45a4-aab3-a0562fd0d202,
addrs=[10.62.21.54, 10.62.44.22, 10.63.216.22, 127.0.0.1],
sockAddrs=[ueu-ip-lapp0002.coresit.msci.org/10.63.216.22:47500,
ueu-ip-lapp0002.mgmt.msci.org/10.62.44.22:47500, /10.62.21.54:47500,
/127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
lastExchangeTime=1544102518093, loc=false, ver=2.6.0#20180710-sha1:669feacc,
isClient=false], ackSent=true, super=CompletableLatch [id=exchange,
topVer=AffinityTopologyVersion [topVer=24, minorTopVer=1]]]


2018-12-06 15:49:33,189 WARN
[exchange-worker-#122%5b9b0820-ec94-493c-ae58-bc31aac873c6%] {}
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture
- Unable to await partitions release latch within timeout: ClientLatch
[coordinator=TcpDiscoveryNode [id=a8b0f10f-8ad9-45a4-aab3-a0562fd0d202,
addrs=[10.62.21.54, 10.62.44.22, 10.63.216.22, 127.0.0.1],
sockAddrs=[ueu-ip-lapp0002.coresit.msci.org/10.63.216.22:47500,
ueu-ip-lapp0002.mgmt.msci.org/10.62.44.22:47500, /10.62.21.54:47500,
/127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
lastExchangeTime=1544102518093, loc=false, ver=2.6.0#20180710-sha1:669feacc,
isClient=false], ackSent=true, super=CompletableLatch [id=exchange,
topVer=AffinityTopologyVersion [topVer=24, minorTopVer=1]]]


The code in which it seems to be looping is:


if (!localJoinExchange()) {
    try {
        while (true) {
            try {
                releaseLatch.await(waitTimeout, TimeUnit.MILLISECONDS);

                if (log.isInfoEnabled())
                    log.info("Finished waiting for partitions release latch: " + releaseLatch);

                break;
            }
            catch (IgniteFutureTimeoutCheckedException ignored) {
                U.warn(log, "Unable to await partitions release latch within timeout: " + releaseLatch);

                // Try to resend ack.
                releaseLatch.countDown();
            }
        }
    }
    catch (IgniteCheckedException e) {
        U.warn(log, "Stop waiting for partitions release latch: " + e.getMessage());
    }
}
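
To illustrate why such a loop can print the same warning over and over, here
is a minimal sketch using the JDK's java.util.concurrent.CountDownLatch. It
is not Ignite's latch implementation: the class name LatchRetrySketch, the
method awaitWithRetries and the maxAttempts limit are hypothetical, and the
JDK latch reports a timeout by returning false rather than by throwing
IgniteFutureTimeoutCheckedException. The point is only that one warning is
emitted per expired timeout for as long as the latch is not released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchRetrySketch {
    /**
     * Waits on the latch in timeout-sized slices, similar to the quoted loop,
     * but gives up after maxAttempts (the quoted loop has no such limit).
     * Returns the number of timeouts that occurred.
     */
    public static int awaitWithRetries(CountDownLatch latch, long waitTimeoutMs, int maxAttempts)
            throws InterruptedException {
        int timeouts = 0;

        while (timeouts < maxAttempts) {
            // await(...) returns true once the latch count has reached zero.
            if (latch.await(waitTimeoutMs, TimeUnit.MILLISECONDS)) {
                System.out.println("Finished waiting for latch");
                break;
            }

            // Timed out: log a warning and go around again.
            timeouts++;
            System.out.println("Unable to await latch within timeout (attempt " + timeouts + ")");
        }

        return timeouts;
    }

    public static void main(String[] args) throws InterruptedException {
        // Nobody counts this latch down, so every attempt times out.
        CountDownLatch neverReleased = new CountDownLatch(1);
        System.out.println("timeouts: " + awaitWithRetries(neverReleased, 10, 3));

        // This latch is already released, so the first await succeeds.
        CountDownLatch released = new CountDownLatch(1);
        released.countDown();
        System.out.println("timeouts: " + awaitWithRetries(released, 10, 3));
    }
}
```

With an unbounded loop, as in the quoted code, the warning keeps repeating
until every participant has acknowledged or the wait is aborted with an
exception.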

Interestingly, the PME seems to be stuck only at topology version 24,
although the latest topology version is 83 because a new node was added.

For topology version 23, by contrast, there was no such repetition every 10
seconds, so that PME presumably completed without problems.

Can you help me out here?




Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

Posted by Pavel Kovalenko <jo...@gmail.com>.
Hello Ray,

I'm glad that your problem was resolved. I just want to add that at the
beginning of PME we wait for all current client operations to finish; new
operations are frozen until PME ends. After a node finishes all of its
ongoing client operations, it counts down the latch that you see in the
"Unable to await" log message. When all nodes have finished their
operations, the exchange latch completes and PME continues. This latch was
added to ensure data consistency on all nodes during the main PME phase
(partition information exchange, affinity calculation, etc.). If there is
network throttling between client and server, it becomes hard to notify a
client that its data streamer operation has finished, and the latch
completion is slowed down.
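
The latch behaviour described above can be sketched roughly as follows. This
is a simplified, single-process model and not Ignite's actual latch code: the
class ExchangeLatchSketch, its methods and the node names are hypothetical.
Each participant acknowledges once its in-flight operations are done, and the
latch completes only when no participant is still pending, so a single slow
acknowledgement holds up everyone.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

public class ExchangeLatchSketch {
    /** Nodes that have not yet acknowledged (hypothetical model). */
    private final Set<String> pending;

    public ExchangeLatchSketch(Collection<String> participants) {
        this.pending = new LinkedHashSet<>(participants);
    }

    /** Called when a node has finished its ongoing operations. Returns true if the ack was new. */
    public synchronized boolean ack(String nodeId) {
        return pending.remove(nodeId);
    }

    /** The latch completes only when every participant has acknowledged. */
    public synchronized boolean isComplete() {
        return pending.isEmpty();
    }

    /** Snapshot of the nodes the latch is still waiting for. */
    public synchronized Set<String> pendingNodes() {
        return new LinkedHashSet<>(pending);
    }

    public static void main(String[] args) {
        ExchangeLatchSketch latch =
            new ExchangeLatchSketch(Arrays.asList("server-1", "server-2", "client-1"));

        latch.ack("server-1");
        latch.ack("server-2");

        // One node has not acknowledged yet, so the exchange cannot continue.
        System.out.println("complete=" + latch.isComplete() + ", pending=" + latch.pendingNodes());

        latch.ack("client-1");
        System.out.println("complete=" + latch.isComplete());
    }
}
```

In this model, anything that delays one acknowledgement, such as a throttled
link, delays completion for all participants.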
