Posted to user@ignite.apache.org by "aaron@tophold.com" <aa...@tophold.com> on 2018/01/16 02:35:00 UTC

Nodes can not join the cluster after reboot

Hi All, 

We have an Ignite cluster running about 20+ nodes. To guard against JVM memory issues, we schedule a reboot of those nodes in the middle of the night.

But in order to keep the service available, we reboot them one by one, e.g. nodes A, B, C, D with a 5-minute delay between them; when we do so, the rebooted nodes can never join the cluster again.

Eventually the entire cluster stops working, waiting forever for nodes to join the topology; we need to kill everything and restart from scratch, which sounds incredible.

I am not sure whether anyone has met this issue before, or whether we made some mistake; attached is the Ignite log.


Thanks for your time!

Regards
Aaron


Aaron.Kuai

Re: Re: Nodes can not join the cluster after reboot

Posted by Evgenii Zhuravlev <e....@gmail.com>.
I'd recommend setting it on all nodes.

Evgenii

2018-05-24 12:19 GMT+03:00 arunkjn <ar...@gmail.com>:

> Hi Evgenii,
>
> I am experiencing the same issue with grid affinity topology.
>
> We have a topology where we have 3 data nodes, 2 service nodes and several
> client nodes in the cluster. The issue occurs randomly sometimes when
> deploying the nodes.
>
> I wanted to know that the suggested solution of setting
> -DIGNITE_EXCHANGE_COMPATIBILITY_VER_1=true, does it have to be done on all
> the nodes? or just service nodes?
>
> Regards,
> Arun
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>

Re: Re: Nodes can not join the cluster after reboot

Posted by arunkjn <ar...@gmail.com>.
Hi Evgenii,

I am experiencing the same issue with grid affinity topology.

We have a topology with 3 data nodes, 2 service nodes and several
client nodes in the cluster. The issue occurs intermittently when
deploying the nodes.

Regarding the suggested solution of setting
-DIGNITE_EXCHANGE_COMPATIBILITY_VER_1=true: does it have to be done on all
the nodes, or just the service nodes?

Regards,
Arun




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Re: Nodes can not join the cluster after reboot

Posted by Evgenii Zhuravlev <e....@gmail.com>.
Hi Aaron,

Most probably this issue won't be fixed in 2.4, since it's not in progress
yet.

As a workaround, setting the -DIGNITE_EXCHANGE_COMPATIBILITY_VER_1=true JVM
argument seems to fix this problem.
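
For completeness, a minimal sketch of applying it programmatically; this is equivalent to passing the -D flag on the JVM command line, it must be set on every node before the node starts, and the configuration path below is only a placeholder:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class NodeStartupWithWorkaround {
    public static void main(String[] args) {
        // Same effect as starting the JVM with -DIGNITE_EXCHANGE_COMPATIBILITY_VER_1=true;
        // must be set before Ignition.start() and on every node in the cluster.
        System.setProperty("IGNITE_EXCHANGE_COMPATIBILITY_VER_1", "true");

        Ignite ignite = Ignition.start("config/ignite.xml"); // placeholder config path
        System.out.println("Started node " + ignite.cluster().localNode().id());
    }
}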

Evgenii

2018-01-23 14:47 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:

> Hi Evgenii,
>
>
> Now that error gone , but another exception thrown:
>
>
> [ERROR] 2018-01-23 04:11:58.849 [srvc-deploy-#146] [ig]
> GridServiceProcessor - Error when executing service: null
>
> java.lang.IllegalStateException: Getting affinity for topology version
> earlier than affinity is calculated [locNode=TcpDiscoveryNode
> [id=a732a474-cd9a-4ae6-8366-0d058f1f80a8, addrs=[10.30.91.134], sockA
>
> ddrs=[fx2/10.30.91.134:47500], discPort=47500, order=15, intOrder=15,
> lastExchangeTime=1516680718809, loc=true, ver=2.3.0#20171028-sha1:8add7fd5,
> isClient=false], grp=ignite-sys-cache, topVer=AffinityTo
>
> pologyVersion [topVer=18, minorTopVer=0], head=AffinityTopologyVersion
> [topVer=19, minorTopVer=0], history=[AffinityTopologyVersion [topVer=15,
> minorTopVer=0], AffinityTopologyVersion [topVer=16, minorT
>
> opVer=0], AffinityTopologyVersion [topVer=17, minorTopVer=0],
> AffinityTopologyVersion [topVer=19, minorTopVer=0]]]
>
> at org.apache.ignite.internal.processors.affinity.
> GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:514)
> ~[ignite-core-2.3.0.jar!/:2.3.0]
>
> at org.apache.ignite.internal.processors.affinity.
> GridAffinityAssignmentCache.nodes(GridAffinityAssignmentCache.java:419)
> ~[ignite-core-2.3.0.jar!/:2.3.0]
>
> at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.
> nodesByPartition(GridCacheAffinityManager.java:220)
> ~[ignite-core-2.3.0.jar!/:2.3.0]
>
> at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.
> primaryByPartition(GridCacheAffinityManager.java:256)
> ~[ignite-core-2.3.0.jar!/:2.3.0]
>
> at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.
> primaryByKey(GridCacheAffinityManager.java:247)
> ~[ignite-core-2.3.0.jar!/:2.3.0]
>
> at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.
> primaryByKey(GridCacheAffinityManager.java:271)
> ~[ignite-core-2.3.0.jar!/:2.3.0]
>
> at org.apache.ignite.internal.processors.service.GridServiceProcessor$
> TopologyListener$1.run0(GridServiceProcessor.java:1771)
> ~[ignite-core-2.3.0.jar!/:2.3.0]
>
> at org.apache.ignite.internal.processors.service.GridServiceProcessor$
> DepRunnable.run(GridServiceProcessor.java:1958)
> [ignite-core-2.3.0.jar!/:2.3.0]
>
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [?:1.8.0_131]
>
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [?:1.8.0_131]
>
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
>
>
>
> Notice there is a similar JIRA:  https://issues.apache.org/
> jira/browse/IGNITE-7366
> <https://issues.apache.org/jira/browse/IGNITE-7366?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel>
>
> Not sure whether this issue confirm and any plan to fix it in 2.4?
>
>
>
> Regards
> Aaron
> ------------------------------
> Aaron.Kuai
>
>
> *From:* Evgenii Zhuravlev <e....@gmail.com>
> *Date:* 2018-01-19 16:36
> *To:* user <us...@ignite.apache.org>
> *Subject:* Re: Re: Nodes can not join the cluster after reboot
> Please let us know if this helped you
>
> Evgenii
>
> 2018-01-19 11:35 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:
>
>> Hi Evgenii,
>>
>> I trying to remove this part and use the @LoggerResource;  will have a
>> try!  thanks for your time.
>>
>>
>> Regards
>> Aaron
>> ------------------------------
>> Aaron.Kuai
>>
>>
>> *From:* Evgenii Zhuravlev <e....@gmail.com>
>> *Date:* 2018-01-19 16:28
>> *To:* user <us...@ignite.apache.org>
>> *Subject:* Re: Re: Nodes can not join the cluster after reboot
>> Aaron,
>>
>> This Service instance after creating will be serialized and deserialized
>> on the target nodes. So, the field of Logger will be serialized too and I
>> don't think that it will be properly serialized with the possibility of
>> deserialization since it holds context internally. It's not recommended to
>> use such kinds of fields in Service, you should use
>>
>> @LoggerResource
>> private IgniteLogger log;
>>
>> instead. I'm not sure if it's the root cause, but it's definitely could cause some problems.
>>
>> Evgenii
>>
>>
>> 2018-01-19 4:51 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:
>>
>>> HI Evgenii,,
>>>
>>> Sure, thanks for your time!  this service work as a delegate and all
>>> request will route to a bean in our spring context.
>>>
>>> Thanks again!
>>>
>>> Regards
>>> Aaron
>>> ------------------------------
>>> Aaron.Kuai
>>>
>>>
>>> *From:* Evgenii Zhuravlev <e....@gmail.com>
>>> *Date:* 2018-01-18 21:59
>>> *To:* user <us...@ignite.apache.org>
>>> *Subject:* Re: Re: Nodes can not join the cluster after reboot
>>> Aaron, could you share code of com.tophold.trade.ignite.service.CommandRemoteService
>>> ?
>>>
>>> Thanks,
>>> Evgenii
>>>
>>> 2018-01-18 16:43 GMT+03:00 Evgenii Zhuravlev <e....@gmail.com>:
>>>
>>>> Hi Aaron,
>>>>
>>>> I think that the main problem is here:
>>>>
>>>> GridServiceProcessor - Error when executing service: null
>>>>
>>>> diagnostic - Pending transactions:
>>>> [WARN ] 2018-01-17 10:55:19.632 [exchange-worker-#97%PortfolioEventIgnite%]
>>>> [ig] diagnostic - >>> [txVer=AffinityTopologyVersion [topVer=15,
>>>> minorTopVer=0], exchWait=true, tx=GridDhtTxRemote
>>>> [nearNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb,
>>>> rmtFutId=14d5c930161-e4bd34f6-8b10-40b7-8f30-d243ec91c3f1,
>>>> nearXidVer=GridCacheVersion [topVer=127664000, order=1516193727313,
>>>> nodeOrder=1], storeWriteThrough=false, super=GridDistributedTxRemoteAdapter
>>>> [explicitVers=null, started=true, commitAllowed=0,
>>>> txState=IgniteTxRemoteSingleStateImpl [entry=IgniteTxEntry
>>>> [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
>>>> [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
>>>> cacheId=-2100569601, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=72,
>>>> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command],
>>>> hasValBytes=true], cacheId=-2100569601], val=[op=UPDATE,
>>>> val=CacheObjectImpl [val=GridServiceAssignments
>>>> [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=15,
>>>> cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade
>>>> .ignite.service.CommandRemoteService, svcCls=,
>>>> nodeFilterCls=CommandServiceNodeFilter], assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}],
>>>> hasValBytes=true]], prevVal=[op=NOOP, val=null], oldVal=[op=NOOP,
>>>> val=null], entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1,
>>>> conflictVer=null, explicitVer=null, dhtVer=null, filters=[],
>>>> filtersPassed=false, filtersSet=false, entry=GridDhtCacheEntry [rdrs=[],
>>>> part=72, super=GridDistributedCacheEntry [super=GridCacheMapEntry
>>>> [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
>>>> [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
>>>> val=CacheObjectImpl [val=GridServiceAssignments
>>>> [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=13,
>>>> cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade
>>>> .ignite.service.CommandRemoteService, svcCls=,
>>>> nodeFilterCls=CommandServiceNodeFilter], assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}],
>>>> hasValBytes=true], startVer=1516183996434, ver=GridCacheVersion
>>>> [topVer=127663998, order=1516184119343, nodeOrder=10], hash=-1440463172,
>>>> extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=null,
>>>> rmts=[GridCacheMvccCandidate [nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00,
>>>> ver=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10],
>>>> threadId=585, id=82, topVer=AffinityTopologyVersion [topVer=-1,
>>>> minorTopVer=0], reentry=null, otherNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb,
>>>> otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null,
>>>> serOrder=null, key=KeyCacheObjectImpl [part=72,
>>>> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command],
>>>> hasValBytes=true], masks=local=0|owner=0|ready=0|
>>>> reentry=0|used=0|tx=1|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0,
>>>> prevVer=null, nextVer=null]]]], flags=2]]], prepared=1, locked=false,
>>>> nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
>>>> flags=0, partUpdateCntr=0, serReadVer=null, xidVer=null]],
>>>> super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=127664000,
>>>> order=1516193727420, nodeOrder=10], writeVer=GridCacheVersion
>>>> [topVer=127664000, order=1516193727421, nodeOrder=10], implicit=false,
>>>> loc=false, threadId=585, startTime=1516186483489,
>>>> nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, startVer=GridCacheVersion
>>>> [topVer=127664000, order=1516193739547, nodeOrder=5], endVer=null,
>>>> isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=0,
>>>> sysInvalidate=false, sys=true, plc=5, commitVer=null, finalizing=NONE,
>>>> invalidParts=null, state=PREPARED, timedOut=false,
>>>> topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0],
>>>> duration=36138ms, onePhaseCommit=false]]]]
>>>>
>>>> You have the pending transaction in logs related to the service
>>>> deployment. Most possible that your service threw NPE in init(or other)
>>>> method and wasn't deployed. Could you check if it's possible that your
>>>> service will throw NPE?
>>>>
>>>> Evgenii
>>>>
>>>>
>>>> 2018-01-17 15:40 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:
>>>>
>>>>> Hi Evgenii,
>>>>>
>>>>> What's more interesting If we reboot them in very shut time like one
>>>>> hour,  from our monitor log we can find
>>>>>
>>>>> such like NODE_LEFT and NODE_JOIN events, every thing move smoothly .
>>>>>
>>>>> But if after several hours, problem below sure will happen if you try
>>>>> to reboot any node from cluster.
>>>>>
>>>>>
>>>>> Regards
>>>>> Aaron
>>>>> ------------------------------
>>>>> Aaron.Kuai
>>>>>
>>>>> *From:* aaron@tophold.com
>>>>> *Date:* 2018-01-17 20:05
>>>>> *To:* user <us...@ignite.apache.org>
>>>>> *Subject:* Re: Re: Nodes can not join the cluster after reboot
>>>>> hi Evgenii,
>>>>>
>>>>> Thanks!  We collect some logs, one is the server which is reboot,
>>>>> another two are two servers exist,  one client only nodes.  after reboot:
>>>>>
>>>>> 1. the reboot node never be totally brought up, waiting for ever.
>>>>> 2. other server nodes after get notification the reboot node down,
>>>>> soon hang up there also.
>>>>> 3. the pure client node, only call a remote service on the reboot
>>>>> node, also hang up there
>>>>>
>>>>> At around 2018-01-17 10:54  we reboot the node. From the log we can
>>>>> find:
>>>>>
>>>>> [WARN ] 2018-01-17 10:54:43.277 [sys-#471] [ig] ExchangeDisc
>>>>> overyEvents - All server nodes for the following caches have
>>>>>  left the cluster: 'PortfolioCommandService_SVC_CO_DUM_CACHE
>>>>> ', 'PortfolioSnapshotGenericDomainEventEntry', 'PortfolioGen
>>>>> ericDomainEventEntry'
>>>>>
>>>>> Soon a ERROR log(Seem the only ERROR level log):
>>>>>
>>>>> [ERROR] 2018-01-17 10:54:43.280 [srvc-deploy-#143] [ig] Grid
>>>>> ServiceProcessor - Error when executing service: null java.l
>>>>> ang.IllegalStateException: Getting affinity for topology ver
>>>>> sion earlier than affinity is calculated
>>>>>
>>>>> Then a lot WARN of
>>>>>
>>>>> "Failed to wait for partition release future........................."
>>>>>
>>>>> Then this forever loop there, from the diagnose nothing seem suspicious,
>>>>>  All node eventually output very similar.
>>>>>
>>>>> [WARN ] 2018-01-17 10:55:19.608 [exchange-worker-#97] [ig] d
>>>>> iagnostic - Pending explicit locks:
>>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] d
>>>>> iagnostic - Pending cache futures:
>>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] d
>>>>> iagnostic - Pending atomic cache futures:
>>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] d
>>>>> iagnostic - Pending data streamer futures:
>>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] d
>>>>> iagnostic - Pending transaction deadlock detection futures:
>>>>>
>>>>> Some of our environment:
>>>>>
>>>>> 1. we open the peer class loading flag, but in fact we use fat jar
>>>>> every class is shared.
>>>>> 2. some nodes deploy service, we use them as RPC way.
>>>>> 3. most cache in fact is LOCAL, only when must we make them shared
>>>>> 4. use JDBC to persist important caches
>>>>> 5. TcpDiscoveryJdbcIpFinder as the finder
>>>>>
>>>>> All others configuration is according to the stand.
>>>>>
>>>>> Thanks for your time!
>>>>>
>>>>> Regards
>>>>> Aaron
>>>>> ------------------------------
>>>>> Aaron.Kuai
>>>>>
>>>>>
>>>>> *From:* Evgenii Zhuravlev <e....@gmail.com>
>>>>> *Date:* 2018-01-16 20:32
>>>>> *To:* user <us...@ignite.apache.org>
>>>>> *Subject:* Re: Nodes can not join the cluster after reboot
>>>>> Hi,
>>>>>
>>>>> Most possible that on the of the nodes you have hanged
>>>>> transaction/future/lock or even a deadlock, that's why new nodes can't join
>>>>> cluster - they can't perform exchange due to pending operation. Please
>>>>> share full logs from all nodes with thread dumps, it will help to find a
>>>>> root cause.
>>>>>
>>>>> Evgenii
>>>>>
>>>>> 2018-01-16 5:35 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> We have a ignite cluster running about 20+ nodes,   for any case JVM
>>>>>> memory issue we schedule reboot those nodes at middle night.
>>>>>>
>>>>>> but in order to keep the service supplied, we reboot them one by one
>>>>>> like A,B,C,D nodes we reboot them at 5 mins delay; but if we doing so, the
>>>>>> reboot nodes can never join to the cluster again.
>>>>>>
>>>>>> Eventually the entire cluster can not work any more forever waiting
>>>>>> for joining to the topology; we need to kill all and reboot from started,
>>>>>> this sound incredible.
>>>>>>
>>>>>> I not sure whether any more meet this issue before, or any mistake we
>>>>>> may make, attached is the ignite log.
>>>>>>
>>>>>>
>>>>>> Thanks for your time!
>>>>>>
>>>>>> Regards
>>>>>> Aaron
>>>>>> ------------------------------
>>>>>> Aaron.Kuai
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Re: Nodes can not join the cluster after reboot

Posted by "aaron@tophold.com" <aa...@tophold.com>.
Hi Evgenii, 


Now that error is gone, but another exception is thrown:


[ERROR] 2018-01-23 04:11:58.849 [srvc-deploy-#146] [ig] GridServiceProcessor - Error when executing service: null
java.lang.IllegalStateException: Getting affinity for topology version earlier than affinity is calculated [locNode=TcpDiscoveryNode [id=a732a474-cd9a-4ae6-8366-0d058f1f80a8, addrs=[10.30.91.134], sockA
ddrs=[fx2/10.30.91.134:47500], discPort=47500, order=15, intOrder=15, lastExchangeTime=1516680718809, loc=true, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], grp=ignite-sys-cache, topVer=AffinityTo
pologyVersion [topVer=18, minorTopVer=0], head=AffinityTopologyVersion [topVer=19, minorTopVer=0], history=[AffinityTopologyVersion [topVer=15, minorTopVer=0], AffinityTopologyVersion [topVer=16, minorT
opVer=0], AffinityTopologyVersion [topVer=17, minorTopVer=0], AffinityTopologyVersion [topVer=19, minorTopVer=0]]]
at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:514) ~[ignite-core-2.3.0.jar!/:2.3.0]
at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.nodes(GridAffinityAssignmentCache.java:419) ~[ignite-core-2.3.0.jar!/:2.3.0]
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.nodesByPartition(GridCacheAffinityManager.java:220) ~[ignite-core-2.3.0.jar!/:2.3.0]
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByPartition(GridCacheAffinityManager.java:256) ~[ignite-core-2.3.0.jar!/:2.3.0]
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByKey(GridCacheAffinityManager.java:247) ~[ignite-core-2.3.0.jar!/:2.3.0]
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByKey(GridCacheAffinityManager.java:271) ~[ignite-core-2.3.0.jar!/:2.3.0]
at org.apache.ignite.internal.processors.service.GridServiceProcessor$TopologyListener$1.run0(GridServiceProcessor.java:1771) ~[ignite-core-2.3.0.jar!/:2.3.0]
at org.apache.ignite.internal.processors.service.GridServiceProcessor$DepRunnable.run(GridServiceProcessor.java:1958) [ignite-core-2.3.0.jar!/:2.3.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]


I notice there is a similar JIRA:  https://issues.apache.org/jira/browse/IGNITE-7366

Not sure whether this issue is confirmed, and is there any plan to fix it in 2.4?



Regards
Aaron


Aaron.Kuai
 
From: Evgenii Zhuravlev
Date: 2018-01-19 16:36
To: user
Subject: Re: Re: Nodes can not join the cluster after reboot
Please let us know if this helped you

Evgenii

2018-01-19 11:35 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:
Hi Evgenii, 

I am trying to remove this part and use @LoggerResource; I will have a try!  Thanks for your time.


Regards
Aaron


Aaron.Kuai
 
From: Evgenii Zhuravlev
Date: 2018-01-19 16:28
To: user
Subject: Re: Re: Nodes can not join the cluster after reboot
Aaron, 

This Service instance, after creation, will be serialized and deserialized on the target nodes. So the Logger field will be serialized too, and I don't think it can be properly serialized and then deserialized, since it holds context internally. It's not recommended to use such fields in a Service; you should use
@LoggerResource
private IgniteLogger log;
instead. I'm not sure if it's the root cause, but it could definitely cause some problems.

Evgenii

2018-01-19 4:51 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:
Hi Evgenii,

Sure, thanks for your time!  This service works as a delegate and all requests are routed to a bean in our Spring context.

Thanks again!
 
Regards
Aaron


Aaron.Kuai
 
From: Evgenii Zhuravlev
Date: 2018-01-18 21:59
To: user
Subject: Re: Re: Nodes can not join the cluster after reboot
Aaron, could you share code of com.tophold.trade.ignite.service.CommandRemoteService ? 

Thanks,
Evgenii

2018-01-18 16:43 GMT+03:00 Evgenii Zhuravlev <e....@gmail.com>:
Hi Aaron,

I think that the main problem is here: 

GridServiceProcessor - Error when executing service: null

diagnostic - Pending transactions:
[WARN ] 2018-01-17 10:55:19.632 [exchange-worker-#97%PortfolioEventIgnite%] [ig] diagnostic - >>> [txVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], exchWait=true, tx=GridDhtTxRemote [nearNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, rmtFutId=14d5c930161-e4bd34f6-8b10-40b7-8f30-d243ec91c3f1, nearXidVer=GridCacheVersion [topVer=127664000, order=1516193727313, nodeOrder=1], storeWriteThrough=false, super=GridDistributedTxRemoteAdapter [explicitVers=null, started=true, commitAllowed=0, txState=IgniteTxRemoteSingleStateImpl [entry=IgniteTxEntry [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], cacheId=-2100569601, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], cacheId=-2100569601], val=[op=UPDATE, val=CacheObjectImpl [val=GridServiceAssignments [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=15, cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService, svcCls=, nodeFilterCls=CommandServiceNodeFilter], assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true]], prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=[], filtersPassed=false, filtersSet=false, entry=GridDhtCacheEntry [rdrs=[], part=72, super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], val=CacheObjectImpl [val=GridServiceAssignments [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=13, cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService, svcCls=, nodeFilterCls=CommandServiceNodeFilter], assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true], startVer=1516183996434, ver=GridCacheVersion [topVer=127663998, order=1516184119343, nodeOrder=10], hash=-1440463172, extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=null, rmts=[GridCacheMvccCandidate [nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, ver=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10], threadId=585, id=82, topVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], reentry=null, otherNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], masks=local=0|owner=0|ready=0|reentry=0|used=0|tx=1|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]]]], flags=2]]], prepared=1, locked=false, nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=null]], super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10], writeVer=GridCacheVersion [topVer=127664000, order=1516193727421, nodeOrder=10], implicit=false, loc=false, threadId=585, startTime=1516186483489, nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, startVer=GridCacheVersion [topVer=127664000, order=1516193739547, nodeOrder=5], endVer=null, isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=0, sysInvalidate=false, sys=true, plc=5, commitVer=null, finalizing=NONE, invalidParts=null, state=PREPARED, timedOut=false, 
topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], duration=36138ms, onePhaseCommit=false]]]]

You have a pending transaction in the logs related to the service deployment. Most likely your service threw an NPE in the init (or another) method and wasn't deployed. Could you check whether it's possible that your service throws an NPE?

Evgenii


2018-01-17 15:40 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:
Hi Evgenii, 

What's more interesting: if we reboot them within a very short time, like one hour, from our monitor log we can see

the NODE_LEFT and NODE_JOIN events, and everything moves smoothly.

But after several hours, the problem below is sure to happen if you try to reboot any node in the cluster.


Regards
Aaron


Aaron.Kuai

From: aaron@tophold.com
Date: 2018-01-17 20:05
To: user
Subject: Re: Re: Nodes can not join the cluster after reboot
hi Evgenii, 

Thanks!  We collected some logs: one is from the server that was rebooted, two are from existing servers, and one is from a client-only node.  After the reboot:

1. the rebooted node is never fully brought up; it waits forever.
2. the other server nodes, after getting notified that the rebooted node went down, soon hang there as well.
3. the pure client node, which only calls a remote service on the rebooted node, also hangs there

At around 2018-01-17 10:54 we rebooted the node. From the log we can find:

[WARN ] 2018-01-17 10:54:43.277 [sys-#471] [ig] ExchangeDiscoveryEvents - All server nodes for the following caches have left the cluster: 'PortfolioCommandService_SVC_CO_DUM_CACHE', 'PortfolioSnapshotGenericDomainEventEntry', 'PortfolioGenericDomainEventEntry' 

Soon an ERROR log (it seems to be the only ERROR-level log):

[ERROR] 2018-01-17 10:54:43.280 [srvc-deploy-#143] [ig] GridServiceProcessor - Error when executing service: null java.lang.IllegalStateException: Getting affinity for topology version earlier than affinity is calculated

Then a lot of WARN messages of

"Failed to wait for partition release future........................."

Then it loops here forever; from the diagnostics nothing seems suspicious, and all nodes eventually output something very similar.

[WARN ] 2018-01-17 10:55:19.608 [exchange-worker-#97] [ig] diagnostic - Pending explicit locks:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending cache futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending atomic cache futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending data streamer futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending transaction deadlock detection futures:

Some details of our environment:

1. we enable the peer class loading flag, but in fact we use a fat jar, so every class is shared.
2. some nodes deploy services, which we use as an RPC mechanism.
3. most caches are in fact LOCAL; we only make them shared when we must.
4. we use JDBC to persist important caches.
5. TcpDiscoveryJdbcIpFinder is used as the IP finder.

All other configuration follows the standard defaults.
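
For reference, a minimal sketch of that discovery and peer-class-loading setup (the data source here is just a placeholder, not our actual connection details):

import javax.sql.DataSource;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.jdbc.TcpDiscoveryJdbcIpFinder;

public class NodeStartup {
    public static Ignite start(DataSource dataSource) {
        // JDBC-backed IP finder: every node registers its address in a shared database table.
        TcpDiscoveryJdbcIpFinder ipFinder = new TcpDiscoveryJdbcIpFinder();
        ipFinder.setDataSource(dataSource);

        TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
        discoverySpi.setIpFinder(ipFinder);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discoverySpi);
        // Peer class loading is enabled, although with a fat jar every class is already on the classpath.
        cfg.setPeerClassLoadingEnabled(true);

        return Ignition.start(cfg);
    }
}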

Thanks for your time!

Regards
Aaron


Aaron.Kuai
 
From: Evgenii Zhuravlev
Date: 2018-01-16 20:32
To: user
Subject: Re: Nodes can not join the cluster after reboot
Hi,

Most possible that on the of the nodes you have hanged transaction/future/lock or even a deadlock, that's why new nodes can't join cluster - they can't perform exchange due to pending operation. Please share full logs from all nodes with thread dumps, it will help to find a root cause.

Evgenii

2018-01-16 5:35 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:
Hi All, 

We have an Ignite cluster running about 20+ nodes. To guard against JVM memory issues, we schedule a reboot of those nodes in the middle of the night.

But in order to keep the service available, we reboot them one by one, e.g. nodes A, B, C, D with a 5-minute delay between them; when we do so, the rebooted nodes can never join the cluster again.

Eventually the entire cluster stops working, waiting forever for nodes to join the topology; we need to kill everything and restart from scratch, which sounds incredible.

I am not sure whether anyone has met this issue before, or whether we made some mistake; attached is the Ignite log.


Thanks for your time!

Regards
Aaron


Aaron.Kuai






Re: Re: Nodes can not join the cluster after reboot

Posted by Evgenii Zhuravlev <e....@gmail.com>.
Please let us know if this helped you

Evgenii

2018-01-19 11:35 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:

> Hi Evgenii,
>
> I trying to remove this part and use the @LoggerResource;  will have a
> try!  thanks for your time.
>
>
> Regards
> Aaron
> ------------------------------
> Aaron.Kuai
>
>
> *From:* Evgenii Zhuravlev <e....@gmail.com>
> *Date:* 2018-01-19 16:28
> *To:* user <us...@ignite.apache.org>
> *Subject:* Re: Re: Nodes can not join the cluster after reboot
> Aaron,
>
> This Service instance after creating will be serialized and deserialized
> on the target nodes. So, the field of Logger will be serialized too and I
> don't think that it will be properly serialized with the possibility of
> deserialization since it holds context internally. It's not recommended to
> use such kinds of fields in Service, you should use
>
> @LoggerResource
> private IgniteLogger log;
>
> instead. I'm not sure if it's the root cause, but it's definitely could cause some problems.
>
> Evgenii
>
>
> 2018-01-19 4:51 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:
>
>> HI Evgenii,,
>>
>> Sure, thanks for your time!  this service work as a delegate and all
>> request will route to a bean in our spring context.
>>
>> Thanks again!
>>
>> Regards
>> Aaron
>> ------------------------------
>> Aaron.Kuai
>>
>>
>> *From:* Evgenii Zhuravlev <e....@gmail.com>
>> *Date:* 2018-01-18 21:59
>> *To:* user <us...@ignite.apache.org>
>> *Subject:* Re: Re: Nodes can not join the cluster after reboot
>> Aaron, could you share code of com.tophold.trade.ignite.service.CommandRemoteService
>> ?
>>
>> Thanks,
>> Evgenii
>>
>> 2018-01-18 16:43 GMT+03:00 Evgenii Zhuravlev <e....@gmail.com>:
>>
>>> Hi Aaron,
>>>
>>> I think that the main problem is here:
>>>
>>> GridServiceProcessor - Error when executing service: null
>>>
>>> diagnostic - Pending transactions:
>>> [WARN ] 2018-01-17 10:55:19.632 [exchange-worker-#97%PortfolioEventIgnite%]
>>> [ig] diagnostic - >>> [txVer=AffinityTopologyVersion [topVer=15,
>>> minorTopVer=0], exchWait=true, tx=GridDhtTxRemote
>>> [nearNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb,
>>> rmtFutId=14d5c930161-e4bd34f6-8b10-40b7-8f30-d243ec91c3f1,
>>> nearXidVer=GridCacheVersion [topVer=127664000, order=1516193727313,
>>> nodeOrder=1], storeWriteThrough=false, super=GridDistributedTxRemoteAdapter
>>> [explicitVers=null, started=true, commitAllowed=0,
>>> txState=IgniteTxRemoteSingleStateImpl [entry=IgniteTxEntry
>>> [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
>>> [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
>>> cacheId=-2100569601, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=72,
>>> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command],
>>> hasValBytes=true], cacheId=-2100569601], val=[op=UPDATE,
>>> val=CacheObjectImpl [val=GridServiceAssignments
>>> [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=15,
>>> cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade
>>> .ignite.service.CommandRemoteService, svcCls=,
>>> nodeFilterCls=CommandServiceNodeFilter], assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}],
>>> hasValBytes=true]], prevVal=[op=NOOP, val=null], oldVal=[op=NOOP,
>>> val=null], entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1,
>>> conflictVer=null, explicitVer=null, dhtVer=null, filters=[],
>>> filtersPassed=false, filtersSet=false, entry=GridDhtCacheEntry [rdrs=[],
>>> part=72, super=GridDistributedCacheEntry [super=GridCacheMapEntry
>>> [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
>>> [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
>>> val=CacheObjectImpl [val=GridServiceAssignments
>>> [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=13,
>>> cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade
>>> .ignite.service.CommandRemoteService, svcCls=,
>>> nodeFilterCls=CommandServiceNodeFilter], assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}],
>>> hasValBytes=true], startVer=1516183996434, ver=GridCacheVersion
>>> [topVer=127663998, order=1516184119343, nodeOrder=10], hash=-1440463172,
>>> extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=null,
>>> rmts=[GridCacheMvccCandidate [nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00,
>>> ver=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10],
>>> threadId=585, id=82, topVer=AffinityTopologyVersion [topVer=-1,
>>> minorTopVer=0], reentry=null, otherNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb,
>>> otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null,
>>> serOrder=null, key=KeyCacheObjectImpl [part=72,
>>> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command],
>>> hasValBytes=true], masks=local=0|owner=0|ready=0|
>>> reentry=0|used=0|tx=1|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0,
>>> prevVer=null, nextVer=null]]]], flags=2]]], prepared=1, locked=false,
>>> nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
>>> flags=0, partUpdateCntr=0, serReadVer=null, xidVer=null]],
>>> super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=127664000,
>>> order=1516193727420, nodeOrder=10], writeVer=GridCacheVersion
>>> [topVer=127664000, order=1516193727421, nodeOrder=10], implicit=false,
>>> loc=false, threadId=585, startTime=1516186483489,
>>> nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, startVer=GridCacheVersion
>>> [topVer=127664000, order=1516193739547, nodeOrder=5], endVer=null,
>>> isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=0,
>>> sysInvalidate=false, sys=true, plc=5, commitVer=null, finalizing=NONE,
>>> invalidParts=null, state=PREPARED, timedOut=false,
>>> topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0],
>>> duration=36138ms, onePhaseCommit=false]]]]
>>>
>>> You have the pending transaction in logs related to the service
>>> deployment. Most possible that your service threw NPE in init(or other)
>>> method and wasn't deployed. Could you check if it's possible that your
>>> service will throw NPE?
>>>
>>> Evgenii
>>>
>>>
>>> 2018-01-17 15:40 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:
>>>
>>>> Hi Evgenii,
>>>>
>>>> What's more interesting If we reboot them in very shut time like one
>>>> hour,  from our monitor log we can find
>>>>
>>>> such like NODE_LEFT and NODE_JOIN events, every thing move smoothly .
>>>>
>>>> But if after several hours, problem below sure will happen if you try
>>>> to reboot any node from cluster.
>>>>
>>>>
>>>> Regards
>>>> Aaron
>>>> ------------------------------
>>>> Aaron.Kuai
>>>>
>>>> *From:* aaron@tophold.com
>>>> *Date:* 2018-01-17 20:05
>>>> *To:* user <us...@ignite.apache.org>
>>>> *Subject:* Re: Re: Nodes can not join the cluster after reboot
>>>> hi Evgenii,
>>>>
>>>> Thanks!  We collect some logs, one is the server which is reboot,
>>>> another two are two servers exist,  one client only nodes.  after reboot:
>>>>
>>>> 1. the reboot node never be totally brought up, waiting for ever.
>>>> 2. other server nodes after get notification the reboot node down, soon
>>>> hang up there also.
>>>> 3. the pure client node, only call a remote service on the reboot node,
>>>> also hang up there
>>>>
>>>> At around 2018-01-17 10:54  we reboot the node. From the log we can
>>>> find:
>>>>
>>>> [WARN ] 2018-01-17 10:54:43.277 [sys-#471] [ig] ExchangeDisc
>>>> overyEvents - All server nodes for the following caches have
>>>>  left the cluster: 'PortfolioCommandService_SVC_CO_DUM_
>>>> CACHE', 'PortfolioSnapshotGenericDomainEventEntry', 'Portfol
>>>> ioGenericDomainEventEntry'
>>>>
>>>> Soon a ERROR log(Seem the only ERROR level log):
>>>>
>>>> [ERROR] 2018-01-17 10:54:43.280 [srvc-deploy-#143] [ig] Grid
>>>> ServiceProcessor - Error when executing service: null java.l
>>>> ang.IllegalStateException: Getting affinity for topology ver
>>>> sion earlier than affinity is calculated
>>>>
>>>> Then a lot WARN of
>>>>
>>>> "Failed to wait for partition release future........................."
>>>>
>>>> Then this forever loop there, from the diagnose nothing seem suspicious,
>>>>  All node eventually output very similar.
>>>>
>>>> [WARN ] 2018-01-17 10:55:19.608 [exchange-worker-#97] [ig] d
>>>> iagnostic - Pending explicit locks:
>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] d
>>>> iagnostic - Pending cache futures:
>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] d
>>>> iagnostic - Pending atomic cache futures:
>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] d
>>>> iagnostic - Pending data streamer futures:
>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] d
>>>> iagnostic - Pending transaction deadlock detection futures:
>>>>
>>>> Some of our environment:
>>>>
>>>> 1. we open the peer class loading flag, but in fact we use fat jar
>>>> every class is shared.
>>>> 2. some nodes deploy service, we use them as RPC way.
>>>> 3. most cache in fact is LOCAL, only when must we make them shared
>>>> 4. use JDBC to persist important caches
>>>> 5. TcpDiscoveryJdbcIpFinder as the finder
>>>>
>>>> All others configuration is according to the stand.
>>>>
>>>> Thanks for your time!
>>>>
>>>> Regards
>>>> Aaron
>>>> ------------------------------
>>>> Aaron.Kuai
>>>>
>>>>
>>>> *From:* Evgenii Zhuravlev <e....@gmail.com>
>>>> *Date:* 2018-01-16 20:32
>>>> *To:* user <us...@ignite.apache.org>
>>>> *Subject:* Re: Nodes can not join the cluster after reboot
>>>> Hi,
>>>>
>>>> Most possible that on the of the nodes you have hanged
>>>> transaction/future/lock or even a deadlock, that's why new nodes can't join
>>>> cluster - they can't perform exchange due to pending operation. Please
>>>> share full logs from all nodes with thread dumps, it will help to find a
>>>> root cause.
>>>>
>>>> Evgenii
>>>>
>>>> 2018-01-16 5:35 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We have a ignite cluster running about 20+ nodes,   for any case JVM
>>>>> memory issue we schedule reboot those nodes at middle night.
>>>>>
>>>>> but in order to keep the service supplied, we reboot them one by one
>>>>> like A,B,C,D nodes we reboot them at 5 mins delay; but if we doing so, the
>>>>> reboot nodes can never join to the cluster again.
>>>>>
>>>>> Eventually the entire cluster can not work any more forever waiting
>>>>> for joining to the topology; we need to kill all and reboot from started,
>>>>> this sound incredible.
>>>>>
>>>>> I not sure whether any more meet this issue before, or any mistake we
>>>>> may make, attached is the ignite log.
>>>>>
>>>>>
>>>>> Thanks for your time!
>>>>>
>>>>> Regards
>>>>> Aaron
>>>>> ------------------------------
>>>>> Aaron.Kuai
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Re: Nodes can not join the cluster after reboot

Posted by "aaron@tophold.com" <aa...@tophold.com>.
Hi Evgenii, 

I am trying to remove this part and use @LoggerResource; I will have a try!  Thanks for your time.


Regards
Aaron


Aaron.Kuai
 
From: Evgenii Zhuravlev
Date: 2018-01-19 16:28
To: user
Subject: Re: Re: Nodes can not join the cluster after reboot
Aaron, 

This Service instance, after creation, will be serialized and deserialized on the target nodes. So the Logger field will be serialized too, and I don't think it can be properly serialized and then deserialized, since it holds context internally. It's not recommended to use such fields in a Service; you should use
@LoggerResource
private IgniteLogger log;
instead. I'm not sure if it's the root cause, but it could definitely cause some problems.

Evgenii

2018-01-19 4:51 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:
Hi Evgenii,

Sure, thanks for your time!  This service works as a delegate and all requests are routed to a bean in our Spring context.

Thanks again!
 
Regards
Aaron


Aaron.Kuai
 
From: Evgenii Zhuravlev
Date: 2018-01-18 21:59
To: user
Subject: Re: Re: Nodes can not join the cluster after reboot
Aaron, could you share code of com.tophold.trade.ignite.service.CommandRemoteService ? 

Thanks,
Evgenii

2018-01-18 16:43 GMT+03:00 Evgenii Zhuravlev <e....@gmail.com>:
Hi Aaron,

I think that the main problem is here: 

GridServiceProcessor - Error when executing service: null

diagnostic - Pending transactions:
[WARN ] 2018-01-17 10:55:19.632 [exchange-worker-#97%PortfolioEventIgnite%] [ig] diagnostic - >>> [txVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], exchWait=true, tx=GridDhtTxRemote [nearNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, rmtFutId=14d5c930161-e4bd34f6-8b10-40b7-8f30-d243ec91c3f1, nearXidVer=GridCacheVersion [topVer=127664000, order=1516193727313, nodeOrder=1], storeWriteThrough=false, super=GridDistributedTxRemoteAdapter [explicitVers=null, started=true, commitAllowed=0, txState=IgniteTxRemoteSingleStateImpl [entry=IgniteTxEntry [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], cacheId=-2100569601, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], cacheId=-2100569601], val=[op=UPDATE, val=CacheObjectImpl [val=GridServiceAssignments [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=15, cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService, svcCls=, nodeFilterCls=CommandServiceNodeFilter], assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true]], prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=[], filtersPassed=false, filtersSet=false, entry=GridDhtCacheEntry [rdrs=[], part=72, super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], val=CacheObjectImpl [val=GridServiceAssignments [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=13, cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService, svcCls=, nodeFilterCls=CommandServiceNodeFilter], assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true], startVer=1516183996434, ver=GridCacheVersion [topVer=127663998, order=1516184119343, nodeOrder=10], hash=-1440463172, extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=null, rmts=[GridCacheMvccCandidate [nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, ver=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10], threadId=585, id=82, topVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], reentry=null, otherNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], masks=local=0|owner=0|ready=0|reentry=0|used=0|tx=1|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]]]], flags=2]]], prepared=1, locked=false, nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=null]], super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10], writeVer=GridCacheVersion [topVer=127664000, order=1516193727421, nodeOrder=10], implicit=false, loc=false, threadId=585, startTime=1516186483489, nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, startVer=GridCacheVersion [topVer=127664000, order=1516193739547, nodeOrder=5], endVer=null, isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=0, sysInvalidate=false, sys=true, plc=5, commitVer=null, finalizing=NONE, invalidParts=null, state=PREPARED, timedOut=false, 
topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], duration=36138ms, onePhaseCommit=false]]]]

You have a pending transaction in the logs related to the service deployment. Most likely your service threw an NPE in the init (or another) method and wasn't deployed. Could you check whether it's possible that your service throws an NPE?

Evgenii


2018-01-17 15:40 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:
Hi Evgenii, 

What's more interesting: if we reboot them within a very short time, like one hour, from our monitor log we can see

the NODE_LEFT and NODE_JOIN events, and everything moves smoothly.

But after several hours, the problem below is sure to happen if you try to reboot any node in the cluster.


Regards
Aaron


Aaron.Kuai

From: aaron@tophold.com
Date: 2018-01-17 20:05
To: user
Subject: Re: Re: Nodes can not join the cluster after reboot
hi Evgenii, 

Thanks!  We collected some logs: one is from the server that was rebooted, two are from existing servers, and one is from a client-only node.  After the reboot:

1. the rebooted node is never fully brought up; it waits forever.
2. the other server nodes, after getting notified that the rebooted node went down, soon hang there as well.
3. the pure client node, which only calls a remote service on the rebooted node, also hangs there

At around 2018-01-17 10:54 we rebooted the node. From the log we can find:

[WARN ] 2018-01-17 10:54:43.277 [sys-#471] [ig] ExchangeDiscoveryEvents - All server nodes for the following caches have left the cluster: 'PortfolioCommandService_SVC_CO_DUM_CACHE', 'PortfolioSnapshotGenericDomainEventEntry', 'PortfolioGenericDomainEventEntry' 

Soon an ERROR log (it seems to be the only ERROR-level log):

[ERROR] 2018-01-17 10:54:43.280 [srvc-deploy-#143] [ig] GridServiceProcessor - Error when executing service: null java.lang.IllegalStateException: Getting affinity for topology version earlier than affinity is calculated

Then a lot of WARN messages of

"Failed to wait for partition release future........................."

Then it loops here forever; from the diagnostics nothing seems suspicious, and all nodes eventually output something very similar.

[WARN ] 2018-01-17 10:55:19.608 [exchange-worker-#97] [ig] diagnostic - Pending explicit locks:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending cache futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending atomic cache futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending data streamer futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending transaction deadlock detection futures:

Some details of our environment:

1. we enable the peer class loading flag, but in fact we use a fat jar, so every class is shared.
2. some nodes deploy services, which we use as an RPC mechanism.
3. most caches are in fact LOCAL; we only make them shared when we must.
4. we use JDBC to persist important caches.
5. TcpDiscoveryJdbcIpFinder is used as the IP finder.

All other configuration follows the standard defaults.

Thanks for your time!

Regards
Aaron


Aaron.Kuai
 
From: Evgenii Zhuravlev
Date: 2018-01-16 20:32
To: user
Subject: Re: Nodes can not join the cluster after reboot
Hi,

Most possible that on the of the nodes you have hanged transaction/future/lock or even a deadlock, that's why new nodes can't join cluster - they can't perform exchange due to pending operation. Please share full logs from all nodes with thread dumps, it will help to find a root cause.

Evgenii

2018-01-16 5:35 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:
Hi All, 

We have an Ignite cluster running about 20+ nodes. To guard against JVM memory issues, we schedule a reboot of those nodes in the middle of the night.

But in order to keep the service available, we reboot them one by one, e.g. nodes A, B, C, D with a 5-minute delay between them; when we do so, the rebooted nodes can never join the cluster again.

Eventually the entire cluster stops working, waiting forever for nodes to join the topology; we need to kill everything and restart from scratch, which sounds incredible.

I am not sure whether anyone has met this issue before, or whether we made some mistake; attached is the Ignite log.


Thanks for your time!

Regards
Aaron


Aaron.Kuai





Re: Re: Nodes can not join the cluster after reboot

Posted by Evgenii Zhuravlev <e....@gmail.com>.
Aaron,

This Service instance, after creation, will be serialized and deserialized on
the target nodes. So the Logger field will be serialized too, and I
don't think it can be properly serialized and then deserialized,
since it holds context internally. It's not recommended to
use such fields in a Service; you should use

@LoggerResource
private IgniteLogger log;

instead. I'm not sure if it's the root cause, but it could definitely
cause some problems.
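
For illustration, a minimal sketch of a Service that relies only on the injected logger (the class name and log messages here are made up, not your actual service):

import org.apache.ignite.IgniteLogger;
import org.apache.ignite.resources.LoggerResource;
import org.apache.ignite.services.Service;
import org.apache.ignite.services.ServiceContext;

public class ExampleRemoteService implements Service {
    // Injected by Ignite on the node where the service is deployed,
    // so nothing logger-related has to survive serialization.
    @LoggerResource
    private IgniteLogger log;

    @Override public void init(ServiceContext ctx) {
        log.info("Service " + ctx.name() + " initialized");
    }

    @Override public void execute(ServiceContext ctx) {
        log.info("Service " + ctx.name() + " executing");
    }

    @Override public void cancel(ServiceContext ctx) {
        log.info("Service " + ctx.name() + " cancelled");
    }
}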

Evgenii


2018-01-19 4:51 GMT+03:00 aaron@tophold.com <aa...@tophold.com>:

> HI Evgenii,,
>
> Sure, thanks for your time!  this service work as a delegate and all
> request will route to a bean in our spring context.
>
> Thanks again!
>
> Regards
> Aaron
> ------------------------------
> Aaron.Kuai
>
>
> *From:* Evgenii Zhuravlev <e....@gmail.com>
> *Date:* 2018-01-18 21:59
> *To:* user <us...@ignite.apache.org>
> *Subject:* Re: Re: Nodes can not join the cluster after reboot
> Aaron, could you share code of com.tophold.trade.ignite.service.CommandRemoteService
> ?
>
> Thanks,
> Evgenii
>
> 2018-01-18 16:43 GMT+03:00 Evgenii Zhuravlev <e....@gmail.com>:
>
>> Hi Aaron,
>>
>> I think that the main problem is here:
>>
>> GridServiceProcessor - Error when executing service: null
>>
>> diagnostic - Pending transactions:
>> [WARN ] 2018-01-17 10:55:19.632 [exchange-worker-#97%PortfolioEventIgnite%]
>> [ig] diagnostic - >>> [txVer=AffinityTopologyVersion [topVer=15,
>> minorTopVer=0], exchWait=true, tx=GridDhtTxRemote
>> [nearNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb,
>> rmtFutId=14d5c930161-e4bd34f6-8b10-40b7-8f30-d243ec91c3f1,
>> nearXidVer=GridCacheVersion [topVer=127664000, order=1516193727313,
>> nodeOrder=1], storeWriteThrough=false, super=GridDistributedTxRemoteAdapter
>> [explicitVers=null, started=true, commitAllowed=0,
>> txState=IgniteTxRemoteSingleStateImpl [entry=IgniteTxEntry
>> [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
>> [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
>> cacheId=-2100569601, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=72,
>> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command],
>> hasValBytes=true], cacheId=-2100569601], val=[op=UPDATE,
>> val=CacheObjectImpl [val=GridServiceAssignments
>> [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=15,
>> cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade
>> .ignite.service.CommandRemoteService, svcCls=,
>> nodeFilterCls=CommandServiceNodeFilter], assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}],
>> hasValBytes=true]], prevVal=[op=NOOP, val=null], oldVal=[op=NOOP,
>> val=null], entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1,
>> conflictVer=null, explicitVer=null, dhtVer=null, filters=[],
>> filtersPassed=false, filtersSet=false, entry=GridDhtCacheEntry [rdrs=[],
>> part=72, super=GridDistributedCacheEntry [super=GridCacheMapEntry
>> [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
>> [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
>> val=CacheObjectImpl [val=GridServiceAssignments
>> [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=13,
>> cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade
>> .ignite.service.CommandRemoteService, svcCls=,
>> nodeFilterCls=CommandServiceNodeFilter], assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}],
>> hasValBytes=true], startVer=1516183996434, ver=GridCacheVersion
>> [topVer=127663998, order=1516184119343, nodeOrder=10], hash=-1440463172,
>> extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=null,
>> rmts=[GridCacheMvccCandidate [nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00,
>> ver=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10],
>> threadId=585, id=82, topVer=AffinityTopologyVersion [topVer=-1,
>> minorTopVer=0], reentry=null, otherNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb,
>> otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null,
>> serOrder=null, key=KeyCacheObjectImpl [part=72,
>> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command],
>> hasValBytes=true], masks=local=0|owner=0|ready=0|
>> reentry=0|used=0|tx=1|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0,
>> prevVer=null, nextVer=null]]]], flags=2]]], prepared=1, locked=false,
>> nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
>> flags=0, partUpdateCntr=0, serReadVer=null, xidVer=null]],
>> super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=127664000,
>> order=1516193727420, nodeOrder=10], writeVer=GridCacheVersion
>> [topVer=127664000, order=1516193727421, nodeOrder=10], implicit=false,
>> loc=false, threadId=585, startTime=1516186483489,
>> nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, startVer=GridCacheVersion
>> [topVer=127664000, order=1516193739547, nodeOrder=5], endVer=null,
>> isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=0,
>> sysInvalidate=false, sys=true, plc=5, commitVer=null, finalizing=NONE,
>> invalidParts=null, state=PREPARED, timedOut=false,
>> topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0],
>> duration=36138ms, onePhaseCommit=false]]]]
>>
>> You have the pending transaction in logs related to the service
>> deployment. Most possible that your service threw NPE in init(or other)
>> method and wasn't deployed. Could you check if it's possible that your
>> service will throw NPE?
>>
>> Evgenii

Re: Re: Nodes can not join the cluster after reboot

Posted by "aaron@tophold.com" <aa...@tophold.com>.
Hi Evgenii,

Sure, and thanks for your time! This service works as a thin delegate: every request it receives is routed to a bean in our Spring context, roughly following the pattern sketched below.
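
A minimal sketch of that pattern (the interface and the bean name "commandService" here are placeholders, not our exact code):

import org.apache.ignite.resources.SpringResource;
import org.apache.ignite.services.Service;
import org.apache.ignite.services.ServiceContext;

// Hypothetical business interface exposed to remote callers.
interface CommandService {
    String execute(String command);
}

// The Ignite service itself only forwards calls to the Spring bean.
public class CommandRemoteServiceSketch implements Service, CommandService {
    // Injected from the Spring context the node was started with
    // (assumes the node is started together with that Spring application context).
    @SpringResource(resourceName = "commandService")
    private transient CommandService delegate;

    @Override public void init(ServiceContext ctx) {
        // Intentionally lightweight: nothing that can fail here.
    }

    @Override public void execute(ServiceContext ctx) {
        // Pure request/response service, no background loop needed.
    }

    @Override public void cancel(ServiceContext ctx) {
        // Nothing to clean up in this sketch.
    }

    @Override public String execute(String command) {
        return delegate.execute(command);   // route the request to the Spring bean
    }
}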

Thanks again!
 
Regards
Aaron


Aaron.Kuai
 

Re: Re: Nodes can not join the cluster after reboot

Posted by Evgenii Zhuravlev <e....@gmail.com>.
Aaron, could you share the code of com.tophold.trade.ignite.service.CommandRemoteService?

Thanks,
Evgenii


Re: Re: Nodes can not join the cluster after reboot

Posted by Evgenii Zhuravlev <e....@gmail.com>.
Hi Aaron,

I think that the main problem is here:

GridServiceProcessor - Error when executing service: null

diagnostic - Pending transactions:
[WARN ] 2018-01-17 10:55:19.632 [exchange-worker-#97%PortfolioEventIgnite%]
[ig] diagnostic - >>> [txVer=AffinityTopologyVersion [topVer=15,
minorTopVer=0], exchWait=true, tx=GridDhtTxRemote
[nearNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb,
rmtFutId=14d5c930161-e4bd34f6-8b10-40b7-8f30-d243ec91c3f1,
nearXidVer=GridCacheVersion [topVer=127664000, order=1516193727313,
nodeOrder=1], storeWriteThrough=false, super=GridDistributedTxRemoteAdapter
[explicitVers=null, started=true, commitAllowed=0,
txState=IgniteTxRemoteSingleStateImpl [entry=IgniteTxEntry
[key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
[name=CRS_com_tophold_trade_product_command], hasValBytes=true],
cacheId=-2100569601, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=72,
val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command],
hasValBytes=true], cacheId=-2100569601], val=[op=UPDATE,
val=CacheObjectImpl [val=GridServiceAssignments
[nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=15,
cfg=LazyServiceConfiguration
[srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService,
svcCls=, nodeFilterCls=CommandServiceNodeFilter],
assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true]],
prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null],
entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null,
explicitVer=null, dhtVer=null, filters=[], filtersPassed=false,
filtersSet=false, entry=GridDhtCacheEntry [rdrs=[], part=72,
super=GridDistributedCacheEntry [super=GridCacheMapEntry
[key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
[name=CRS_com_tophold_trade_product_command], hasValBytes=true],
val=CacheObjectImpl [val=GridServiceAssignments
[nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=13,
cfg=LazyServiceConfiguration
[srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService,
svcCls=, nodeFilterCls=CommandServiceNodeFilter],
assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true],
startVer=1516183996434, ver=GridCacheVersion [topVer=127663998,
order=1516184119343, nodeOrder=10], hash=-1440463172,
extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=null,
rmts=[GridCacheMvccCandidate [nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00,
ver=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10],
threadId=585, id=82, topVer=AffinityTopologyVersion [topVer=-1,
minorTopVer=0], reentry=null,
otherNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, otherVer=null,
mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null,
key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
[name=CRS_com_tophold_trade_product_command], hasValBytes=true],
masks=local=0|owner=0|ready=0|reentry=0|used=0|tx=1|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0,
prevVer=null, nextVer=null]]]], flags=2]]], prepared=1, locked=false,
nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
flags=0, partUpdateCntr=0, serReadVer=null, xidVer=null]],
super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=127664000,
order=1516193727420, nodeOrder=10], writeVer=GridCacheVersion
[topVer=127664000, order=1516193727421, nodeOrder=10], implicit=false,
loc=false, threadId=585, startTime=1516186483489,
nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, startVer=GridCacheVersion
[topVer=127664000, order=1516193739547, nodeOrder=5], endVer=null,
isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=0,
sysInvalidate=false, sys=true, plc=5, commitVer=null, finalizing=NONE,
invalidParts=null, state=PREPARED, timedOut=false,
topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0],
duration=36138ms, onePhaseCommit=false]]]]

You have a pending transaction in the logs related to service deployment. Most likely your service threw an NPE in init() (or another lifecycle method) and was never deployed. Could you check whether it is possible for your service to throw an NPE there?
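
Just to illustrate - this is only a sketch, and the resource name and field type are assumptions rather than your actual code - an explicit check in init() would at least turn a silent NPE into a clearly logged deployment failure:

import org.apache.ignite.resources.SpringResource;
import org.apache.ignite.services.Service;
import org.apache.ignite.services.ServiceContext;

public class SafeInitService implements Service {
    // Hypothetical dependency; if it ends up null for any reason,
    // touching it during init() is exactly the kind of NPE described above.
    @SpringResource(resourceName = "commandService")
    private transient Object delegate;

    @Override public void init(ServiceContext ctx) {
        if (delegate == null)
            throw new IllegalStateException(
                "commandService bean is not available, refusing to deploy " + ctx.name());
        // ... the rest of the initialization goes here ...
    }

    @Override public void execute(ServiceContext ctx) {
        // Request/response service, nothing to run in the background.
    }

    @Override public void cancel(ServiceContext ctx) {
        // Nothing to clean up.
    }
}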

Evgenii



Re: Re: Nodes can not join the cluster after reboot

Posted by "aaron@tophold.com" <aa...@tophold.com>.
Hi Evgenii, 

What's more interesting is that if we reboot the nodes within a short window, say one hour, our monitoring log shows the expected NODE_LEFT and NODE_JOIN events and everything proceeds smoothly; that monitoring is essentially the event listener sketched below.

But if we wait several hours, the problem described in my previous message is certain to happen as soon as we try to reboot any node in the cluster.
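
For reference, the listener looks roughly like this (a simplified sketch, not our exact monitoring code; it assumes EVT_NODE_JOINED / EVT_NODE_LEFT are enabled via setIncludeEventTypes() in the node configuration):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

public class TopologyMonitor {
    public static void main(String[] args) {
        // Assumes a default configuration on the classpath.
        Ignite ignite = Ignition.start();

        IgnitePredicate<DiscoveryEvent> lsnr = evt -> {
            // Prints e.g. "NODE_JOINED: <node id>" or "NODE_LEFT: <node id>".
            System.out.println(evt.name() + ": " + evt.eventNode().id());
            return true;   // keep listening
        };

        ignite.events().localListen(lsnr,
            EventType.EVT_NODE_JOINED, EventType.EVT_NODE_LEFT, EventType.EVT_NODE_FAILED);
    }
}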


Regards
Aaron


Aaron.Kuai



Re: Re: Nodes can not join the cluster after reboot

Posted by "aaron@tophold.com" <aa...@tophold.com>.
Hi Evgenii,

Thanks! We collected some logs: one from the server that was rebooted, two from servers that stayed up, and one from a client-only node. After the reboot:

1. The rebooted node never comes up completely; it waits forever.
2. The other server nodes hang soon after they are notified that the rebooted node went down.
3. The pure client node, which only calls a remote service hosted on the rebooted node, hangs as well (a proxy-call sketch with a timeout follows this list).
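
For completeness, the client invokes the service through a proxy roughly like this (the service name and interface are placeholders, and it assumes an Ignite version whose IgniteServices offers the serviceProxy overload with an explicit timeout, which at least turns an endless hang into a failed call):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class ClientCall {
    // Placeholder for the business interface implemented by the deployed service.
    interface CommandService {
        String execute(String command);
    }

    public static void main(String[] args) {
        // Assumes a client-mode configuration file; not our real path.
        Ignite ignite = Ignition.start("client-config.xml");

        // Non-sticky proxy to a service deployed under a placeholder name;
        // the 10 second timeout is illustrative only.
        CommandService svc = ignite.services().serviceProxy(
            "PortfolioCommandService", CommandService.class, false, 10_000);

        System.out.println(svc.execute("ping"));
    }
}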

At around 2018-01-17 10:54 we rebooted the node. In the log we can find:

[WARN ] 2018-01-17 10:54:43.277 [sys-#471] [ig] ExchangeDiscoveryEvents - All server nodes for the following caches have left the cluster: 'PortfolioCommandService_SVC_CO_DUM_CACHE', 'PortfolioSnapshotGenericDomainEventEntry', 'PortfolioGenericDomainEventEntry' 

Soon after, an ERROR entry appears (seemingly the only ERROR-level log):

[ERROR] 2018-01-17 10:54:43.280 [srvc-deploy-#143] [ig] GridServiceProcessor - Error when executing service: null java.lang.IllegalStateException: Getting affinity for topology version earlier than affinity is calculated

Then there are many WARN entries of

"Failed to wait for partition release future........................."

Then it loops like this forever. Nothing in the diagnostics looks suspicious, and all nodes eventually produce very similar output.

[WARN ] 2018-01-17 10:55:19.608 [exchange-worker-#97] [ig] diagnostic - Pending explicit locks:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending cache futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending atomic cache futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending data streamer futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending transaction deadlock detection futures:

Some details of our environment (a configuration sketch for points 1 and 5 follows below):

1. Peer class loading is enabled, although in practice we deploy a fat jar, so every class is already shared.
2. Some nodes deploy Ignite services, which we use as an RPC mechanism.
3. Most caches are actually LOCAL; we only make them shared when we must.
4. Important caches are persisted via JDBC.
5. TcpDiscoveryJdbcIpFinder is used as the IP finder.

Everything else in the configuration follows the standard defaults.
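
To make points 1 and 5 concrete, the node startup is roughly along these lines (a simplified sketch; the data source is a placeholder and the cache/service configuration is omitted):

import javax.sql.DataSource;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.jdbc.TcpDiscoveryJdbcIpFinder;

public class NodeStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Point 1: peer class loading is on, even though the fat jar already
        // ships every class to every node.
        cfg.setPeerClassLoadingEnabled(true);

        // Point 5: nodes discover each other through a shared database table.
        TcpDiscoveryJdbcIpFinder ipFinder = new TcpDiscoveryJdbcIpFinder();
        ipFinder.setDataSource(createDataSource());   // placeholder data source

        TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
        discoverySpi.setIpFinder(ipFinder);
        cfg.setDiscoverySpi(discoverySpi);

        Ignite ignite = Ignition.start(cfg);
        System.out.println("Started node " + ignite.cluster().localNode().id());
    }

    private static DataSource createDataSource() {
        // Placeholder: in reality this is the JDBC pool shared by all nodes.
        throw new UnsupportedOperationException("plug in your JDBC DataSource here");
    }
}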

Thanks for your time!

Regards
Aaron


Aaron.Kuai
 


Re: Nodes can not join the cluster after reboot

Posted by Evgenii Zhuravlev <e....@gmail.com>.
Hi,

Most likely one of the nodes has a hung transaction, future, or lock, or even a deadlock; that's why new nodes can't join the cluster - they can't complete the partition map exchange while an operation is pending. Please share full logs from all nodes together with thread dumps; that will help to find the root cause. A small sketch for spotting long-running local transactions follows below.
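
If it helps while you collect those, something along these lines, run on each server node, can reveal transactions that never complete (a sketch; it assumes your Ignite version exposes IgniteTransactions.localActiveTransactions()):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.transactions.Transaction;

public class TxDump {
    public static void main(String[] args) {
        // Assumes a default configuration on the classpath.
        Ignite ignite = Ignition.start();

        // Prints every transaction originated on this node that is still active;
        // anything that stays PREPARED for minutes is a likely culprit.
        for (Transaction tx : ignite.transactions().localActiveTransactions()) {
            System.out.printf("xid=%s state=%s started=%tT isolation=%s%n",
                tx.xid(), tx.state(), tx.startTime(), tx.isolation());
        }
    }
}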

Evgenii
