You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@ignite.apache.org by Zhenya Stanilovsky <ar...@mail.ru> on 2021/10/06 06:16:57 UTC

Re[2]: What does "First 10 long running cache futures" ?


Ok, seems something goes wrong on node with id=36edbfd5-4feb-417e-b965-bdc34a0a6f4f If you still have a problem, can u send here or directly by me these logs ?


 
>And finally this on the coordinator node....
>
>[14:07:41,282][WARNING][exchange-worker-#42%xxxxxx%][GridDhtPartitionsExchangeFuture] Unable to await partitions release latch within timeout. Some nodes have not sent acknowledgement for latch completion. It's possible due to unfinishined atomic updates, transactions or not released explicit locks on that nodes. Please check logs for errors on nodes with ids reported in latch `pendingAcks` collection [latch=ServerLatch [permits=1, pendingAcks=HashSet [36edbfd5-4feb-417e-b965-bdc34a0a6f4f], super=CompletableLatch [id=CompletableLatchUid [id=exchange, topVer=AffinityTopologyVersion [topVer=103, minorTopVer=0]]]]]  
>On Tue, 5 Oct 2021 at 10:07, John Smith < java.dev.mtl@gmail.com > wrote:
>>And I see this...
>>
>>[14:04:15,150][WARNING][exchange-worker-#43%raange%][GridDhtPartitionsExchangeFuture] Unable to await partitions release latch within timeout. For more details please check coordinator node logs [crdNode=TcpDiscoveryNode [id=36ad785d-e344-43bb-b685-e79557572b54, consistentId=8172e45d-3ff8-4fe4-aeda-e7d30c1e11e2, addrs=ArrayList [127.0.0.1, xxxxxx.65], sockAddrs=HashSet [xxxxxx-0002/xxxxxx.65:47500, / 127.0.0.1:47500 ], discPort=47500, order=1, intOrder=1, lastExchangeTime=1633370987399, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=false]] [latch=ClientLatch [coordinator=TcpDiscoveryNode [id=36ad785d-e344-43bb-b685-e79557572b54, consistentId=8172e45d-3ff8-4fe4-aeda-e7d30c1e11e2, addrs=ArrayList [127.0.0.1, xxxxxx.65], sockAddrs=HashSet [xxxxxx-0002/xxxxxx.65:47500, / 127.0.0.1:47500 ], discPort=47500, order=1, intOrder=1, lastExchangeTime=1633370987399, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=false], ackSent=true, super=CompletableLatch [id=CompletableLatchUid [id=exchange, topVer=AffinityTopologyVersion [topVer=103, minorTopVer=0]]]]]  
>>On Tue, 5 Oct 2021 at 10:02, John Smith < java.dev.mtl@gmail.com > wrote:
>>>Actually to be more clear...
>>>
>>>http://xxxxxx-0001:8080/ignite?cmd=version responds immediately.
>>>
>>>http://xxxxxx-0001:8080/ignite?cmd=size&cacheName=my-cache doesn't respond at all.  
>>>On Tue, 5 Oct 2021 at 09:59, John Smith < java.dev.mtl@gmail.com > wrote:
>>>>Yeah ever since I got this erro for example the REST APi wont return and the request are slower. But when I connect with visor I can get stats I can scan the cache etc...
>>>>
>>>>Is it possible that these async futures/threads are not released?  
>>>>On Tue, 5 Oct 2021 at 04:11, Zhenya Stanilovsky < arzamas123@mail.ru > wrote:
>>>>>Hi, this is just a warning shows that something suspicious observed.
>>>>>There is no simple reply for your question, in common case all these messages are due to cluster (resources or settings)  limitation.
>>>>>Check documentation for tuning performance [1]
>>>>> 
>>>>>[1]  https://ignite.apache.org/docs/latest/perf-and-troubleshooting/general-perf-tips
>>>>>   
>>>>>>Hi, using 2.8.1 I understand the message as in my async TRX is taking longer but is there a way to prevent it?
>>>>>> 
>>>>>>When this happened I was pushing about 50, 000 get/puts per second from my API. 
>>>>> 
>>>>> 
>>>>> 
>>>>>

Re: Re[2]: What does "First 10 long running cache futures" ?

Posted by John Smith <ja...@gmail.com>.

That feels like what happened. But I just rebooted the servers. But I also
force all my clients = true.

On Mon, 1 Nov 2021 at 16:40, Mike Wiesenberg <mi...@gmail.com>
wrote:

> In my experience, seeing the ' First 10 long running cache futures'
> warning (which should probably be logged as Fatal) means the ignite node
> your client is connected to is busted and needs to be rebooted. In
> addition, since ignite clients aren't smart enough to disconnect from a
> node in a busted state and try another one, you need to reboot all clients
> which may be connected to those nodes. All in all, a very problematic
> situation, and the lack of automatic failover in particular has made me
> seriously question if Ignite is production-ready software.
>
> On Wed, Oct 6, 2021 at 8:52 AM John Smith <ja...@gmail.com> wrote:
>
>> Ok. For now I rebooted all nodes... But it's fairly easy to reproduce.
>>
>> On Wed., Oct. 6, 2021, 2:17 a.m. Zhenya Stanilovsky, <ar...@mail.ru>
>> wrote:
>>
>>>
>>> Ok, seems something goes wrong on node with
>>> id=36edbfd5-4feb-417e-b965-bdc34a0a6f4f If you still have a problem, can u
>>> send here or directly by me these logs ?
>>>
>>>
>>>
>>>
>>> And finally this on the coordinator node....
>>>
>>> [14:07:41,282][WARNING][exchange-worker-#42%xxxxxx%][GridDhtPartitionsExchangeFuture]
>>> Unable to await partitions release latch within timeout. Some nodes have
>>> not sent acknowledgement for latch completion. It's possible due to
>>> unfinishined atomic updates, transactions or not released explicit locks on
>>> that nodes. Please check logs for errors on nodes with ids reported in
>>> latch `pendingAcks` collection [latch=ServerLatch [permits=1,
>>> pendingAcks=HashSet [36edbfd5-4feb-417e-b965-bdc34a0a6f4f],
>>> super=CompletableLatch [id=CompletableLatchUid [id=exchange,
>>> topVer=AffinityTopologyVersion [topVer=103, minorTopVer=0]]]]]
>>>
>>> On Tue, 5 Oct 2021 at 10:07, John Smith <java.dev.mtl@gmail.com
>>> <//...@gmail.com>> wrote:
>>>
>>> And I see this...
>>>
>>> [14:04:15,150][WARNING][exchange-worker-#43%raange%][GridDhtPartitionsExchangeFuture]
>>> Unable to await partitions release latch within timeout. For more details
>>> please check coordinator node logs [crdNode=TcpDiscoveryNode
>>> [id=36ad785d-e344-43bb-b685-e79557572b54,
>>> consistentId=8172e45d-3ff8-4fe4-aeda-e7d30c1e11e2, addrs=ArrayList
>>> [127.0.0.1, xxxxxx.65], sockAddrs=HashSet [xxxxxx-0002/xxxxxx.65:47500, /
>>> 127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
>>> lastExchangeTime=1633370987399, loc=false,
>>> ver=2.8.1#20200521-sha1:86422096, isClient=false]] [latch=ClientLatch
>>> [coordinator=TcpDiscoveryNode [id=36ad785d-e344-43bb-b685-e79557572b54,
>>> consistentId=8172e45d-3ff8-4fe4-aeda-e7d30c1e11e2, addrs=ArrayList
>>> [127.0.0.1, xxxxxx.65], sockAddrs=HashSet [xxxxxx-0002/xxxxxx.65:47500, /
>>> 127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
>>> lastExchangeTime=1633370987399, loc=false,
>>> ver=2.8.1#20200521-sha1:86422096, isClient=false], ackSent=true,
>>> super=CompletableLatch [id=CompletableLatchUid [id=exchange,
>>> topVer=AffinityTopologyVersion [topVer=103, minorTopVer=0]]]]]
>>>
>>> On Tue, 5 Oct 2021 at 10:02, John Smith <java.dev.mtl@gmail.com
>>> <//...@gmail.com>> wrote:
>>>
>>> Actually to be more clear...
>>>
>>> http://xxxxxx-0001:8080/ignite?cmd=version responds immediately.
>>>
>>> http://xxxxxx-0001:8080/ignite?cmd=size&cacheName=my-cache doesn't
>>> respond at all.
>>>
>>> On Tue, 5 Oct 2021 at 09:59, John Smith <java.dev.mtl@gmail.com
>>> <//...@gmail.com>> wrote:
>>>
>>> Yeah ever since I got this erro for example the REST APi wont return and
>>> the request are slower. But when I connect with visor I can get stats I can
>>> scan the cache etc...
>>>
>>> Is it possible that these async futures/threads are not released?
>>>
>>> On Tue, 5 Oct 2021 at 04:11, Zhenya Stanilovsky <arzamas123@mail.ru
>>> <//...@mail.ru>> wrote:
>>>
>>> Hi, this is just a warning shows that something suspicious observed.
>>> There is no simple reply for your question, in common case all these
>>> messages are due to cluster (resources or settings)  limitation.
>>> Check documentation for tuning performance [1]
>>>
>>> [1]
>>> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/general-perf-tips
>>>
>>>
>>>
>>> Hi, using 2.8.1 I understand the message as in my async TRX is taking
>>> longer but is there a way to prevent it?
>>>
>>> When this happened I was pushing about 50, 000 get/puts per second from
>>> my API.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>

Re: Re[2]: What does "First 10 long running cache futures" ?

Posted by Mike Wiesenberg <mi...@gmail.com>.

In my experience, seeing the ' First 10 long running cache futures' warning
(which should probably be logged as Fatal) means the ignite node your
client is connected to is busted and needs to be rebooted. In addition,
since ignite clients aren't smart enough to disconnect from a node in a
busted state and try another one, you need to reboot all clients which may
be connected to those nodes. All in all, a very problematic situation, and
the lack of automatic failover in particular has made me seriously question
if Ignite is production-ready software.

On Wed, Oct 6, 2021 at 8:52 AM John Smith <ja...@gmail.com> wrote:

> Ok. For now I rebooted all nodes... But it's fairly easy to reproduce.
>
> On Wed., Oct. 6, 2021, 2:17 a.m. Zhenya Stanilovsky, <ar...@mail.ru>
> wrote:
>
>>
>> Ok, seems something goes wrong on node with
>> id=36edbfd5-4feb-417e-b965-bdc34a0a6f4f If you still have a problem, can u
>> send here or directly by me these logs ?
>>
>>
>>
>>
>> And finally this on the coordinator node....
>>
>> [14:07:41,282][WARNING][exchange-worker-#42%xxxxxx%][GridDhtPartitionsExchangeFuture]
>> Unable to await partitions release latch within timeout. Some nodes have
>> not sent acknowledgement for latch completion. It's possible due to
>> unfinishined atomic updates, transactions or not released explicit locks on
>> that nodes. Please check logs for errors on nodes with ids reported in
>> latch `pendingAcks` collection [latch=ServerLatch [permits=1,
>> pendingAcks=HashSet [36edbfd5-4feb-417e-b965-bdc34a0a6f4f],
>> super=CompletableLatch [id=CompletableLatchUid [id=exchange,
>> topVer=AffinityTopologyVersion [topVer=103, minorTopVer=0]]]]]
>>
>> On Tue, 5 Oct 2021 at 10:07, John Smith <java.dev.mtl@gmail.com
>> <//...@gmail.com>> wrote:
>>
>> And I see this...
>>
>> [14:04:15,150][WARNING][exchange-worker-#43%raange%][GridDhtPartitionsExchangeFuture]
>> Unable to await partitions release latch within timeout. For more details
>> please check coordinator node logs [crdNode=TcpDiscoveryNode
>> [id=36ad785d-e344-43bb-b685-e79557572b54,
>> consistentId=8172e45d-3ff8-4fe4-aeda-e7d30c1e11e2, addrs=ArrayList
>> [127.0.0.1, xxxxxx.65], sockAddrs=HashSet [xxxxxx-0002/xxxxxx.65:47500, /
>> 127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
>> lastExchangeTime=1633370987399, loc=false,
>> ver=2.8.1#20200521-sha1:86422096, isClient=false]] [latch=ClientLatch
>> [coordinator=TcpDiscoveryNode [id=36ad785d-e344-43bb-b685-e79557572b54,
>> consistentId=8172e45d-3ff8-4fe4-aeda-e7d30c1e11e2, addrs=ArrayList
>> [127.0.0.1, xxxxxx.65], sockAddrs=HashSet [xxxxxx-0002/xxxxxx.65:47500, /
>> 127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
>> lastExchangeTime=1633370987399, loc=false,
>> ver=2.8.1#20200521-sha1:86422096, isClient=false], ackSent=true,
>> super=CompletableLatch [id=CompletableLatchUid [id=exchange,
>> topVer=AffinityTopologyVersion [topVer=103, minorTopVer=0]]]]]
>>
>> On Tue, 5 Oct 2021 at 10:02, John Smith <java.dev.mtl@gmail.com
>> <//...@gmail.com>> wrote:
>>
>> Actually to be more clear...
>>
>> http://xxxxxx-0001:8080/ignite?cmd=version responds immediately.
>>
>> http://xxxxxx-0001:8080/ignite?cmd=size&cacheName=my-cache doesn't
>> respond at all.
>>
>> On Tue, 5 Oct 2021 at 09:59, John Smith <java.dev.mtl@gmail.com
>> <//...@gmail.com>> wrote:
>>
>> Yeah ever since I got this erro for example the REST APi wont return and
>> the request are slower. But when I connect with visor I can get stats I can
>> scan the cache etc...
>>
>> Is it possible that these async futures/threads are not released?
>>
>> On Tue, 5 Oct 2021 at 04:11, Zhenya Stanilovsky <arzamas123@mail.ru
>> <//...@mail.ru>> wrote:
>>
>> Hi, this is just a warning shows that something suspicious observed.
>> There is no simple reply for your question, in common case all these
>> messages are due to cluster (resources or settings)  limitation.
>> Check documentation for tuning performance [1]
>>
>> [1]
>> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/general-perf-tips
>>
>>
>>
>> Hi, using 2.8.1 I understand the message as in my async TRX is taking
>> longer but is there a way to prevent it?
>>
>> When this happened I was pushing about 50, 000 get/puts per second from
>> my API.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>

Re: Re[2]: What does "First 10 long running cache futures" ?

Posted by John Smith <ja...@gmail.com>.

Ok. For now I rebooted all nodes... But it's fairly easy to reproduce.

On Wed., Oct. 6, 2021, 2:17 a.m. Zhenya Stanilovsky, <ar...@mail.ru>
wrote:

>
> Ok, seems something goes wrong on node with
> id=36edbfd5-4feb-417e-b965-bdc34a0a6f4f If you still have a problem, can u
> send here or directly by me these logs ?
>
>
>
>
> And finally this on the coordinator node....
>
> [14:07:41,282][WARNING][exchange-worker-#42%xxxxxx%][GridDhtPartitionsExchangeFuture]
> Unable to await partitions release latch within timeout. Some nodes have
> not sent acknowledgement for latch completion. It's possible due to
> unfinishined atomic updates, transactions or not released explicit locks on
> that nodes. Please check logs for errors on nodes with ids reported in
> latch `pendingAcks` collection [latch=ServerLatch [permits=1,
> pendingAcks=HashSet [36edbfd5-4feb-417e-b965-bdc34a0a6f4f],
> super=CompletableLatch [id=CompletableLatchUid [id=exchange,
> topVer=AffinityTopologyVersion [topVer=103, minorTopVer=0]]]]]
>
> On Tue, 5 Oct 2021 at 10:07, John Smith <java.dev.mtl@gmail.com
> <//...@gmail.com>> wrote:
>
> And I see this...
>
> [14:04:15,150][WARNING][exchange-worker-#43%raange%][GridDhtPartitionsExchangeFuture]
> Unable to await partitions release latch within timeout. For more details
> please check coordinator node logs [crdNode=TcpDiscoveryNode
> [id=36ad785d-e344-43bb-b685-e79557572b54,
> consistentId=8172e45d-3ff8-4fe4-aeda-e7d30c1e11e2, addrs=ArrayList
> [127.0.0.1, xxxxxx.65], sockAddrs=HashSet [xxxxxx-0002/xxxxxx.65:47500, /
> 127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
> lastExchangeTime=1633370987399, loc=false,
> ver=2.8.1#20200521-sha1:86422096, isClient=false]] [latch=ClientLatch
> [coordinator=TcpDiscoveryNode [id=36ad785d-e344-43bb-b685-e79557572b54,
> consistentId=8172e45d-3ff8-4fe4-aeda-e7d30c1e11e2, addrs=ArrayList
> [127.0.0.1, xxxxxx.65], sockAddrs=HashSet [xxxxxx-0002/xxxxxx.65:47500, /
> 127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
> lastExchangeTime=1633370987399, loc=false,
> ver=2.8.1#20200521-sha1:86422096, isClient=false], ackSent=true,
> super=CompletableLatch [id=CompletableLatchUid [id=exchange,
> topVer=AffinityTopologyVersion [topVer=103, minorTopVer=0]]]]]
>
> On Tue, 5 Oct 2021 at 10:02, John Smith <java.dev.mtl@gmail.com
> <//...@gmail.com>> wrote:
>
> Actually to be more clear...
>
> http://xxxxxx-0001:8080/ignite?cmd=version responds immediately.
>
> http://xxxxxx-0001:8080/ignite?cmd=size&cacheName=my-cache doesn't
> respond at all.
>
> On Tue, 5 Oct 2021 at 09:59, John Smith <java.dev.mtl@gmail.com
> <//...@gmail.com>> wrote:
>
> Yeah ever since I got this erro for example the REST APi wont return and
> the request are slower. But when I connect with visor I can get stats I can
> scan the cache etc...
>
> Is it possible that these async futures/threads are not released?
>
> On Tue, 5 Oct 2021 at 04:11, Zhenya Stanilovsky <arzamas123@mail.ru
> <//...@mail.ru>> wrote:
>
> Hi, this is just a warning shows that something suspicious observed.
> There is no simple reply for your question, in common case all these
> messages are due to cluster (resources or settings)  limitation.
> Check documentation for tuning performance [1]
>
> [1]
> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/general-perf-tips
>
>
>
> Hi, using 2.8.1 I understand the message as in my async TRX is taking
> longer but is there a way to prevent it?
>
> When this happened I was pushing about 50, 000 get/puts per second from my
> API.
>
>
>
>
>
>
>
>
>
>
>