Posted to user@ignite.apache.org by userx <ga...@gmail.com> on 2020/05/01 08:53:01 UTC

Re: Cluster went down after "Unable to await partitions release latch within timeout" WARN

Hi Pavel,

I am using 2.8 and still getting the same issue. Here is the setup:

19 Ignite servers (S1 to S19), each running with a 16 GB max JVM heap and
with persistence enabled.

96 clients (C1 to C96).

There are 19 machines, with one Ignite server started per machine. The
clients are evenly distributed across the machines.

When C19 tries to create a cache, it gets a timeout exception, since I have
a 5-minute timeout configured. When I looked into the coordinator logs,
within that 5-minute span it logs messages like:


2020-04-24 15:37:09,434 WARN [exchange-worker-#45%S1%] {}
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture
- Unable to await partitions release latch within timeout. Some nodes have
not sent acknowledgement for latch completion. It's possible due to
unfinishined atomic updates, transactions or not released explicit locks on
that nodes. Please check logs for errors on nodes with ids reported in latch
`pendingAcks` collection [latch=ServerLatch [permits=4, pendingAcks=HashSet
[84b8416c-fa06-4544-9ce0-e3dfba41038a, 19bd7744-0ced-4123-a35f-ddf0cf9f55c4,
533af8f9-c0f6-44b6-92d4-658f86ffaca0, 1b31cb25-abbc-4864-88a3-5a4df37a0cf4],
super=CompletableLatch [id=CompletableLatchUid [id=exchange,
topVer=AffinityTopologyVersion [topVer=174, minorTopVer=1]]]]]

And the 4 nodes which have not acknowledged latch completion are
S14, S7, S18 and S4.

I went to look at the logs of S4; they only record the addition of C19 to
the topology and then C19 leaving it after 5 minutes. The only notable
thing is that in the GC log I consistently see "Total time for which
application threads were stopped: 0.0006225 seconds, Stopping threads took:
0.0000887 seconds"

I understand that clients cannot create caches through the coordinator
until all atomic updates and transactions have finished, but is there a way
around this?
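One mitigation I am aware of (not mentioned in this thread, so please verify against your version's documentation) is TransactionConfiguration.setTxTimeoutOnPartitionMapExchange, available since Ignite 2.5: transactions that hold up a partition map exchange longer than the configured timeout are rolled back instead of blocking the release latch indefinitely. A minimal sketch, assuming the public Ignite API:

```java
// Hedged sketch (assumes Ignite 2.5+): roll back transactions that hold up
// a partition map exchange for more than 20 seconds, so the partitions
// release latch is not blocked indefinitely.
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.TransactionConfiguration;

public class PmeTimeoutConfig {
    public static void main(String[] args) {
        TransactionConfiguration txCfg = new TransactionConfiguration();
        txCfg.setTxTimeoutOnPartitionMapExchange(20_000); // milliseconds

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setTransactionConfiguration(txCfg);

        Ignite ignite = Ignition.start(cfg);
    }
}
```

Note this only affects transactions; long-running atomic updates or unreleased explicit locks still have to complete on their own.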

So the question is: is this issue still present in 2.8?

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Cluster went down after "Unable to await partitions release latch within timeout" WARN

Posted by userx <ga...@gmail.com>.
Hi Pavel,

The exchange did finally finish, though it took a while; during that time,
the new client was not able to write to the cache.

So what happened was this:

4 Ignite servers out of the 19 (as you can see from the consistent IDs in
my message above) had not yet sent their acknowledgement to the coordinator
node, possibly because they were still finishing some atomic updates or
transactions. This went on for almost 2 hours. During those 2 hours,
clients tried to activate:

if (ignite == null) {
    // Start this process as a client node using its configuration file.
    Ignition.setClientMode(true);
    String fileName = getRelevantFileName();
    ignite = Ignition.start(fileName);
}
// Activate the cluster (required when persistence is enabled).
ignite.cluster().active(true);

But the activation couldn't happen. For this task we have a timeout of 5
minutes; if activation doesn't complete in time, the client gives up until
the next time it needs to create a cache.
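The give-up-after-a-timeout behaviour described above can be sketched as a bounded wait. Here the real ignite.cluster().active(true) call is replaced by a placeholder activate() method (hypothetical, just to keep the sketch self-contained and runnable):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedActivation {
    // Placeholder for the real call, e.g. ignite.cluster().active(true).
    static void activate() throws Exception {
        Thread.sleep(100); // simulate activation work
    }

    // Run the activation, giving up after the supplied timeout (5 minutes
    // in the scenario above). Returns true if activation completed in time.
    static boolean tryActivate(long timeout, TimeUnit unit)
            throws InterruptedException {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            Future<?> f = exec.submit(() -> { activate(); return null; });
            f.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            return false; // give up; retry on the next cache-creation attempt
        } catch (ExecutionException e) {
            return false; // the activation itself failed
        } finally {
            exec.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(tryActivate(5, TimeUnit.SECONDS));
    }
}
```

If the server side is still waiting on the partitions release latch, this pattern only bounds how long the client blocks; it does not make the exchange finish any sooner.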

So when I talk about clients, they are just individual Java processes.

Re: Cluster went down after "Unable to await partitions release latch within timeout" WARN

Posted by Pavel Kovalenko <jo...@gmail.com>.
Hello,

It isn't entirely clear from your message: did the exchange finally finish,
or were you getting this WARN message the whole time?

On Fri, 1 May 2020 at 12:32, Ilya Kasnacheev <il...@gmail.com> wrote:


Re: Cluster went down after "Unable to await partitions release latch within timeout" WARN

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

This description sounds like a typical hanging Partition Map Exchange, but
you should be able to see that in the logs.
If you don't, you can collect thread dumps from all nodes with jstack and
check them for any stalled operations (or share them with us).

Regards,
-- 
Ilya Kasnacheev


On Fri, 1 May 2020 at 11:53, userx <ga...@gmail.com> wrote:
