Posted to user@ignite.apache.org by Sambhaji Sawant <sa...@gmail.com> on 2018/06/04 05:29:35 UTC

Ignite cluster gets stuck when a new node joins or leaves

I have a 3-node cluster with 20+ clients, running in a Spark
context. Initially it works fine, but the cluster randomly becomes
inoperative whenever a new node, i.e. a client, tries to connect to it. I
captured the following logs while it was stuck. If I explicitly restart any
Ignite server, the cluster is released and works fine again. I am using
Ignite 2.4.0; the same issue occurs with Ignite 2.5.0 as well.
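
For reference, a minimal sketch of how a client node like ours joins the
cluster, assuming static IP discovery; the class name and the port range are
illustrative rather than our exact configuration:

    import java.util.Arrays;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

    public class ClientStart {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Join the topology as a client node, not a server.
            cfg.setClientMode(true);

            // Point discovery at one of the server hosts seen in the logs
            // (illustrative address and port range).
            TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
            ipFinder.setAddresses(Arrays.asList("10.13.10.179:47500..47509"));

            TcpDiscoverySpi discovery = new TcpDiscoverySpi();
            discovery.setIpFinder(ipFinder);
            cfg.setDiscoverySpi(discovery);

            Ignite ignite = Ignition.start(cfg);
        }
    }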

Client-side logs

Failed to wait for partition map exchange [topVer=AffinityTopologyVersion
[topVer=44, minorTopVer=0], node=4d885cfd-45ed-43a2-8088-f35c9469797f].
Dumping pending objects that might be the cause:

        GridDhtPartitionsExchangeFuture
[topVer=AffinityTopologyVersion [topVer=44, minorTopVer=0],
evt=NODE_JOINED, evtNode=TcpDiscoveryNode
[id=4d885cfd-45ed-43a2-8088-f35c9469797f, addrs=[0:0:0:0:0:0:0:1%lo,
10.13.10.179, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1%lo:0,
/127.0.0.1:0, hdn6.mstorm.com/10.13.10.179:0], discPort=0, order=44,
intOrder=0, lastExchangeTime=1527651620413, loc=true,
ver=2.4.0#20180305-sha1:aa342270, isClient=true], done=false]

Failed to wait for initial partition map exchange. Possible reasons are:
  ^-- Transactions in deadlock.
  ^-- Long running transactions (ignore if this is the case).
  ^-- Unreleased explicit locks.

Still waiting for initial partition map exchange
[fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent
[evtNode=TcpDiscoveryNode [id=4d885cfd-45ed-43a2-8088-f35c9469797f, addrs=

Server-side logs

Possible starvation in striped pool. Thread name: sys-stripe-0-#1 Queue:
[Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8,
ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtTxPrepareResponse
[nearEvicted=null, futId=869dd4ca361-fe7e167d-4d80-4f57-b004-13359a9f2c11,
miniId=1, super=GridDistributedTxPrepareResponse [txState=null, part=-1,
err=null, super=GridDistributedBaseMessage [ver=GridCacheVersion
[topVer=139084030, order=1527604094903, nodeOrder=1], committedVers=null,
rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=0]]]]]],
Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8,
ordered=false, timeout=0, skipOnTimeout=false,
msg=GridDhtAtomicSingleUpdateRequest [key=KeyCacheObjectImpl [part=984,
val=null, hasValBytes=true], val=BinaryObjectImpl [arr= true, ctx=false,
start=0], prevVal=null, super=GridDhtAtomicAbstractUpdateRequest
[onRes=false, nearNodeId=null, nearFutId=0, flags=]]]],
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$DeferredUpdateTimeout@2735c674,
Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8,
ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtTxPrepareRequest
[nearNodeId=628e3078-17fd-4e49-b9ae-ad94ad97a2f1,
futId=6576e4ca361-6e7cdac2-d5a3-4624-9ad3-b93f25546cc3, miniId=1,
topVer=AffinityTopologyVersion [topVer=20, minorTopVer=0],
invalidateNearEntries={}, nearWrites=null, owned=null,
nearXidVer=GridCacheVersion [topVer=139084030, order=1527604094933,
nodeOrder=2], subjId=628e3078-17fd-4e49-b9ae-ad94ad97a2f1, taskNameHash=0,
preloadKeys=null, super=GridDistributedTxPrepareRequest [threadId=86,
concurrency=OPTIMISTIC, isolation=READ_COMMITTED, writeVer=GridCacheVersion
[topVer=139084030, order=1527604094935, nodeOrder=2], timeout=0,
reads=null, writes=[IgniteTxEntry [key=BinaryObjectImpl [arr= true,
ctx=false, start=0], cacheId=-1755241537, txKey=null, val=[op=UPDATE,
val=BinaryObjectImpl [arr= true, ctx=false, start=0]], prevVal=[op=NOOP,
val=null], oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null,
filters=null, filtersPassed=false, filtersSet=false, entry=null,
prepared=0, locked=false, nodeId=null, locMapped=false, expiryPlc=null,
transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null,
xidVer=null]], dhtVers=null, txSize=0, plc=2, txState=null,
flags=onePhase|last, super=GridDistributedBaseMessage [ver=GridCacheVersion
[topVer=139084030, order=1527604094933, nodeOrder=2], committedVers=null,
rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=0]]]]]],
Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8,
ordered=false, timeout=0, skipOnTimeout=false,
msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=2,
arr=[65774,65775]]]]], Message closure [msg=GridIoMessage [plc=2,
topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0,
skipOnTimeout=false, msg=GridNearAtomicSingleUpdateRequest
[key=KeyCacheObjectImpl [part=1016, val=null, hasValBytes=true],
parent=GridNearAtomicAbstractSingleUpdateRequest [nodeId=null, futId=49328,
topVer=AffinityTopologyVersion [topVer=20, minorTopVer=0],
parent=GridNearAtomicAbstractUpdateRequest [res=null, flags=needRes]]]]],
Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8,
ordered=false, timeout=0, skipOnTimeout=false,
msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=1,
arr=[98591]]]]], Message closure [msg=GridIoMessage [plc=2,
topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0,
skipOnTimeout=false, msg=GridDhtAtomicDeferredUpdateResponse
[futIds=GridLongList [idx=1, arr=[114926]]]]], Message closure
[msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false,
timeout=0, skipOnTimeout=false, msg=GridNearAtomicSingleUpdateRequest
[key=KeyCacheObjectImpl [part=1016, val=null, hasValBytes=true],
parent=GridNearAtomicAbstractSingleUpdateRequest [nodeId=null, futId=32946,
topVer=AffinityTopologyVersion [topVer=20, minorTopVer=0], parent=GridNear

Re: Ignite cluster gets stuck when a new node joins or leaves

Posted by dkarachentsev <dk...@gridgain.com>.
Hi,

Thread dumps look healthy. Please share the full logs from the time you took
those thread dumps, or take new ones (thread dumps + logs).

Thanks!
-Dmitry




Re: Ignite cluster gets stuck when a new node joins or leaves

Posted by Sambhaji Sawant <sa...@gmail.com>.
Attaching the thread dump files from all server nodes; please find them attached.

On Mon, Jun 4, 2018 at 6:32 PM, dkarachentsev <dk...@gridgain.com>
wrote:

> Hi,
>
> It's hard to tell what's going wrong from your question.
> Please attach full logs and thread dumps from all server nodes.
>
> Thanks!
> -Dmitry
>
>
>
>

Re: Ignite cluster gets stuck when a new node joins or leaves

Posted by Andrey Mashenkov <an...@gmail.com>.
Hi,

This is expected if you kill a client node. The grid will wait for
failureDetectionTimeout before dropping the failed node from the topology.
All topology operations are stuck during that time, because the Ignite nodes
keep waiting for an answer from the failed node until the failure is
detected.
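
As a minimal sketch, both timeouts can be tuned on IgniteConfiguration; the
values below are just the Ignite defaults, not a recommendation:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class TimeoutConfig {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Time to wait before a non-responding server node is
            // considered failed (default 10 seconds).
            cfg.setFailureDetectionTimeout(10_000);

            // Separate, longer timeout used to detect failed client nodes
            // (default 30 seconds).
            cfg.setClientFailureDetectionTimeout(30_000);

            Ignite ignite = Ignition.start(cfg);
        }
    }

Lowering these values shortens the window during which topology operations
hang after a node is killed, at the cost of dropping slow but otherwise
healthy nodes more aggressively.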

On Thu, Jun 7, 2018 at 8:22 AM, Sambhaji Sawant <sa...@gmail.com>
wrote:

> The issue occurs when we abnormally stop the Spark Java application that
> has an Ignite client running inside its Spark context. When we kill the
> Spark application, it abnormally stops the Ignite client; when we then
> restart our application and the client tries to reconnect to the Ignite
> cluster, it gets stuck.
>
> On Mon, Jun 4, 2018 at 6:32 PM, dkarachentsev <dk...@gridgain.com>
> wrote:
>
>> Hi,
>>
>> It's hard to tell what's going wrong from your question.
>> Please attach full logs and thread dumps from all server nodes.
>>
>> Thanks!
>> -Dmitry
>>
>>
>>
>>
>
>


-- 
Best regards,
Andrey V. Mashenkov

Re: Ignite cluster gets stuck when a new node joins or leaves

Posted by Sambhaji Sawant <sa...@gmail.com>.
The issue occurs when we abnormally stop the Spark Java application that
has an Ignite client running inside its Spark context. When we kill the
Spark application, it abnormally stops the Ignite client; when we then
restart our application and the client tries to reconnect to the Ignite
cluster, it gets stuck.
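
For illustration only (this is not our actual code, and the config file
name is hypothetical), a graceful shutdown would close the client before
the JVM exits, e.g. via a shutdown hook, so the servers see a normal leave
instead of waiting out failureDetectionTimeout:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class GracefulClientShutdown {
        public static void main(String[] args) {
            // "client-config.xml" is a hypothetical Spring configuration path.
            Ignite ignite = Ignition.start("client-config.xml");

            // Close the client cleanly when the JVM shuts down.
            Runtime.getRuntime().addShutdownHook(new Thread(ignite::close));
        }
    }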

On Mon, Jun 4, 2018 at 6:32 PM, dkarachentsev <dk...@gridgain.com>
wrote:

> Hi,
>
> It's hard to tell what's going wrong from your question.
> Please attach full logs and thread dumps from all server nodes.
>
> Thanks!
> -Dmitry
>
>
>
>

Re: Ignite cluster gets stuck when a new node joins or leaves

Posted by dkarachentsev <dk...@gridgain.com>.
Hi,

It's hard to tell what's going wrong from your question.
Please attach full logs and thread dumps from all server nodes.

Thanks!
-Dmitry


