Posted to user@ignite.apache.org by yfernando <yo...@tudor.com> on 2016/09/16 09:44:09 UTC

Re: One failing node stalling the whole cluster

Hi Denis,

We have been able to reproduce this situation where a node failure freezes
the entire grid.

Please find the full thread dumps of the 5 nodes that are locked up.

The memoryMode of the caches is OFFHEAP_TIERED.
The cacheMode is PARTITIONED.
The atomicityMode is TRANSACTIONAL.

We have also seen ALL the clients freeze during a FULL GC occurring on ANY
single node.

Please let us know if you require any more information.

grid-tp1-dev-11220-201609141523318.txt
<http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp1-dev-11220-201609141523318.txt>  
grid-tp1-dev-11223-201609141523318.txt
<http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp1-dev-11223-201609141523318.txt>  
grid-tp3-dev-11220-201609141523318.txt
<http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp3-dev-11220-201609141523318.txt>  
grid-tp3-dev-11221-201609141523318.txt
<http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp3-dev-11221-201609141523318.txt>  
grid-tp4-dev-11220-201609141523318.txt
<http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp4-dev-11220-201609141523318.txt>  




--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p7791.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: One failing node stalling the whole cluster

Posted by Denis Magda <dm...@apache.org>.
Vladimir Ozerov, as far as I recall, you recently faced an issue where a slow client affected the performance of the whole cluster. Please chime in on this discussion and confirm whether the symptoms are the same. Most likely you have already created a JIRA ticket for this issue.

> Also it's not clear why the nodes would require to GC because all the caches
> are held off-heap and we have  a 10G heap running G1GC.

Your logic can fill up the Java heap with temporary objects. Every time you get a value from a cache, a copy of it is moved from off-heap to the Java heap and stays there for as long as your code holds a reference to it. The same applies to all kinds of queries. You need to take heap dumps and/or heap histograms to see what kinds of objects are in your heap.
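
To illustrate the point about copies, here is a minimal sketch in plain Java. It is not Ignite code: the class OffHeapCopySketch and its put/get methods are hypothetical names, and a direct ByteBuffer merely stands in for off-heap storage. It only shows that every read materializes a new on-heap object, which then stays on the heap until the caller drops its reference.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for an off-heap store; NOT how Ignite is implemented.
final class OffHeapCopySketch {
    // A direct buffer is allocated outside the Java heap.
    private final ByteBuffer offHeap = ByteBuffer.allocateDirect(1024);
    private int length;

    // Stores a single value off-heap (values over 1024 bytes would overflow this toy buffer).
    void put(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        offHeap.clear();
        offHeap.put(bytes);
        length = bytes.length;
    }

    // Every read copies the bytes back onto the heap and builds a new object.
    String get() {
        byte[] copy = new byte[length];
        ByteBuffer view = offHeap.duplicate();
        view.position(0);
        view.get(copy, 0, length);
        return new String(copy, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        OffHeapCopySketch store = new OffHeapCopySketch();
        store.put("BatchIdKey-value");
        String first = store.get();
        String second = store.get();
        // Equal content, but two separate heap objects for the GC to clean up later.
        System.out.println("equal content: " + first.equals(second));
        System.out.println("same object:   " + (first == second));
    }
}
```

A read-heavy workload therefore produces heap garbage at roughly the read rate, even though the cached data itself is held off-heap.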

—
Denis

> On Sep 16, 2016, at 6:47 AM, yfernando <yo...@tudor.com> wrote:
> 
> Thanks for your reply Anmol. Do you know if there is a bug logged against
> this which we can track?
> 
> Also it's not clear why the nodes would require to GC because all the caches
> are held off-heap and we have  a 10G heap running G1GC.
> 
> 
> 
> --
> View this message in context: http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p7799.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: One failing node stalling the whole cluster

Posted by DLopez <d....@gmail.com>.
FWIW, in our case, GC was not the problem with Ignite. The heap issue
was already diagnosed, well known and unrelated. The problem was that the
slowdown in one node was causing all the other nodes in the grid to
basically lock up when reading from the cache, without suffering any GC issue.
A simple problem in one node should not be able to bring your grid
down, especially when talking about replicated caches.

GC issues, unrelated to Ignite, will probably appear associated with the
problem because it's one of the most common ways a Java app slows down
without dying completely.

My 2c
D.


2016-09-16 18:19 GMT+02:00 Ignitebie [via Apache Ignite Users] <
ml-node+s70518n7807h37@n6.nabble.com>:

> That would be topic for discussion on how off heap actually work. My
> understanding  is to start with object creation will happen on heap (YG)
> and then moved to Old or Off heap.
>
> If allocation of object creation (I believe after that only you will
> associating them in a map (key.value cache), is quit high, say larger than
> YG size , per sec, then application will be engaged in GC triggered due to
> not being able to allocation.
>
>
> Do you see GC trigerred for YG and messages such as Allocation Failure.
> Have you enabled GC logging.
>
>
>
> On Fri, Sep 16, 2016 at 2:47 PM, yfernando <[hidden email]> wrote:
>
>> Thanks for your reply Anmol. Do you know if there is a bug logged against
>> this which we can track?
>>
>> Also it's not clear why the nodes would require to GC because all the
>> caches
>> are held off-heap and we have  a 10G heap running G1GC.
>>
>>
>>
>> --
>> View this message in context: http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p7799.html
>> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
>>




--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p7808.html

Re: One failing node stalling the whole cluster

Posted by vkulichenko <va...@gmail.com>.
Hi Sparkle,

Please properly subscribe to the mailing list so that the community can
receive email notifications for your messages. To subscribe, send an empty
email to user-subscribe@ignite.apache.org and follow the simple instructions
in the reply.


sparkle_j wrote
> In one of our tests, we noticed that Ignite's TcpCommunicationSpi object
> is growing and retaining over 70% of heap memory. Not sure why. Is this a
> known issue..? Please let us know if this is already addressed.
> 
> We are testing with Ignite1.5, 10GB heap, with moderate amount of data on
> start up and just connect and disconnect dummy clients to grid. This is
> causing the heap to grow over time and eventually Full GC and node is
> being killed by Ignite. This is also causing cluster wide instability.
> Please see HeapDump screenshots attached.
> 
> TcpCommSpi.png
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7879/TcpCommSpi.png>  
> 
> dom-tree-1.png
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7879/dom-tree-1.png>  
> 
> dom-tree.png
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7879/dom-tree.png>  
Can you take the full heap dump and upload it somewhere?

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p7887.html

Re: One failing node stalling the whole cluster

Posted by Anmol Rattan <an...@gmail.com>.
That would be a topic for a discussion on how off-heap actually works. My
understanding is that objects are first created on the heap (young
generation, YG) and then moved to the old generation or off-heap.

If the object allocation rate is quite high, say larger than the YG size
per second (I believe the objects are only associated with a map, i.e. the
key-value cache, after they have been created), then the application will be
kept busy with GCs triggered by not being able to allocate.


Do you see GCs triggered for the YG, with messages such as "Allocation
Failure"? Have you enabled GC logging?
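
As a hedged aside on the GC logging question: on a Java 8 JVM, GC logging is usually switched on with flags such as -XX:+PrintGCDetails and -Xloggc:<file>. The same basic counters can also be read in-process through the standard java.lang.management API. The sketch below uses only JDK classes; the class name GcStatsSketch and its helper methods are made up for this example.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative helper (hypothetical name) that reads GC counters from the running JVM.
final class GcStatsSketch {
    // Total number of collections across all collectors; -1 ("undefined") values are skipped.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Approximate accumulated collection time in milliseconds across all collectors.
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // One line per collector, e.g. the young and old generation collectors of G1.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                + ": count=" + gc.getCollectionCount()
                + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total collections: " + totalCollections());
        System.out.println("total GC time (ms): " + totalCollectionMillis());
    }
}
```

Sampling these counters periodically and comparing the deltas gives a rough view of how often the young generation is collected and how much time GC takes, without having to parse GC logs.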



On Fri, Sep 16, 2016 at 2:47 PM, yfernando <yo...@tudor.com> wrote:

> Thanks for your reply Anmol. Do you know if there is a bug logged against
> this which we can track?
>
> Also it's not clear why the nodes would require to GC because all the
> caches
> are held off-heap and we have  a 10G heap running G1GC.
>
>
>
> --
> View this message in context: http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p7799.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
>

Re: One failing node stalling the whole cluster

Posted by yfernando <yo...@tudor.com>.
Thanks for your reply Anmol. Do you know if there is a bug logged against
this which we can track?

Also, it's not clear why the nodes would need to GC, because all the caches
are held off-heap and we have a 10G heap running G1GC.



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p7799.html

Re: One failing node stalling the whole cluster

Posted by Anmol Rattan <an...@gmail.com>.
That is a known error, at least in 1.6. I am not sure a fix for it is even
in 1.7. As for GC pauses, if there actually are any, it is worth considering
JVM tuning and looking at the allocation and promotion rates.

In our case, we had to increase the young generation to 8GB of space to
cope.

However, a slow client definitely hangs the whole grid, even if there is no
GC. A chicken-and-egg problem results: if you increase the timeout, the grid
hangs for a longer time.

If you reduce the timeout, clients/nodes will leave the grid early and may
even go into segmentation, and segmentation policy handling via starting the
Ignite bean only works if you start the process with the Ignite script. If
the process has been started some other way, e.g. by a custom script, it is
not supported.

Thanks & Regards
Anmol Rattan
+91 9538901262


On Fri, Sep 16, 2016 at 10:44 AM, yfernando <yo...@tudor.com>
wrote:

> Hi Denis,
>
> We have been able to reproduce this situation where a node failure freezes
> the entire grid.
>
> Please find the full thread dumps of the 5 nodes that are locked up.
>
> The memoryMode of the caches are configured to be OFFHEAP_TIERED
> The cacheMode is PARTITIONED
> The atomicityMode is TRANSACTIONAL
>
> We have also seen ALL the clients freeze during a FULL GC occurring on ANY
> single node.
>
> Please let us know if you require any more information.
>
> grid-tp1-dev-11220-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp1-dev-11220-201609141523318.txt>
> grid-tp1-dev-11223-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp1-dev-11223-201609141523318.txt>
> grid-tp3-dev-11220-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp3-dev-11220-201609141523318.txt>
> grid-tp3-dev-11221-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp3-dev-11221-201609141523318.txt>
> grid-tp4-dev-11220-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp4-dev-11220-201609141523318.txt>
>
>
>
>
> --
> View this message in context: http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p7791.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
>

Re: One failing node stalling the whole cluster

Posted by Andrey Mashenkov <an...@gmail.com>.
Hi,

Ignite does not have 1.6.x versions.

It seems you would do better to ask GridGain support.

On Wed, Feb 8, 2017 at 6:05 AM, ght230 <gh...@163.com> wrote:

> I have met the same problem.
>
> It seems that IGNITE-4003 will be fixed in version 2.0.
>
> I am using version 1.6.8 now.
>
> I want to know whether the patch about IGNITE-4003  will be merged in
> Vesion
> 1.6.x?
>
>
>
> --
> View this message in context: http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p10496.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
>



-- 
Best regards,
Andrey V. Mashenkov

Re: One failing node stalling the whole cluster

Posted by Denis Magda <dm...@apache.org>.
Hi,

Presently, the Apache Ignite community releases only 1.x versions. The version you’re referring to is presumably used by another product built on top of Ignite.

—
Denis

> On Feb 7, 2017, at 7:05 PM, ght230 <gh...@163.com> wrote:
> 
> I have met the same problem.
> 
> It seems that IGNITE-4003 will be fixed in version 2.0.
> 
> I am using version 1.6.8 now.
> 
> I want to know whether the patch about IGNITE-4003  will be merged in Vesion
> 1.6.x?
> 
> 
> 
> --
> View this message in context: http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p10496.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: One failing node stalling the whole cluster

Posted by ght230 <gh...@163.com>.
I have run into the same problem.

It seems that IGNITE-4003 will be fixed in version 2.0.

I am using version 1.6.8 now.

I want to know whether the patch for IGNITE-4003 will be merged into version
1.6.x?



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p10496.html

Re: One failing node stalling the whole cluster

Posted by Denis Magda <dm...@gridgain.com>.
The correct link is the following:
https://issues.apache.org/jira/browse/IGNITE-4003

> On Sep 29, 2016, at 9:31 AM, Denis Magda <dm...@gridgain.com> wrote:
> 
> Good news to everyone. Looks like we could get to the bottom of this issue
> https://ggsystems.atlassian.net/browse/IGN-5958
> 
> Hope it will be fixed soon.
> 
> —
> Denis
> 
>> On Sep 16, 2016, at 9:38 AM, yfernando <yo...@tudor.com> wrote:
>> 
>> Unfortunately, I am unable to send the full log files, but they contain the
>> following exceptions:
>> 
>> [14 Sep 2016 11:14:30.290 EDT] [pub-#16%DataGridServer-Development%] ERROR
>> 11223 (OrderHolderSaveRunnable.java:273) exception ocurred while generating
>> Trade Order for Order: OrderKey [traderId=5207, orderId=16084348]
>> javax.cache.CacheException: class
>> org.apache.ignite.transactions.TransactionTimeoutException: Failed to
>> acquire lock within provided timeout for transaction [timeout=5000,
>> tx=GridNearTxLocal [ma
>> ppings=IgniteTxMappingsImpl [], nearLocallyMapped=false,
>> colocatedLocallyMapped=false, needCheckBackup=null, hasRemoteLocks=false,
>> mappings=IgniteTxMappingsImpl [], super=GridDhtTxLocalAdapter [
>> nearOnOriginatingNode=false, nearNodes=[], dhtNodes=[], explicitLock=false,
>> super=IgniteTxLocalAdapter [completedBase=null, sndTransformedVals=false,
>> depEnabled=false, txState=IgniteTxStateImpl
>> [activeCacheIds=GridLongList [idx=1, arr=[1633849959]], txMap={IgniteTxKey
>> [key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys],
>> hasValBytes=true], cacheId=1633849959]=IgniteTxEntry [key=Ke
>> yCacheObjectImpl [val=BatchIdKey [privDb=trim_sys], hasValBytes=true],
>> cacheId=1633849959, txKey=IgniteTxKey [key=KeyCacheObjectImpl
>> [val=BatchIdKey [privDb=trim_sys], hasValBytes=true], cacheId
>> =1633849959], val=[op=READ, val=null], prevVal=[op=NOOP, val=null],
>> entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null,
>> explicitVer=null, dhtVer=null, filters=null, filters
>> Passed=false, filtersSet=true, entry=GridDhtDetachedCacheEntry
>> [super=GridDistributedCacheEntry [super=GridCacheMapEntry
>> [key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys], hasValBytes=tr
>> ue], val=null, startVer=1473869129773, ver=GridCacheVersion
>> [topVer=85333522, nodeOrderDrId=10, globalTime=1473859812640,
>> order=1473869129773], hash=1508409679, extras=null, flags=0]]], prepared
>> =false, locked=false, nodeId=3cd37805-46a7-4287-875e-9cbd0cf03fad,
>> locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0,
>> partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion [
>> topVer=85333522, nodeOrderDrId=10, globalTime=1473859812640,
>> order=1473869129772]]}], super=IgniteTxAdapter [xidVer=GridCacheVersion
>> [topVer=85333522, nodeOrderDrId=10, globalTime=1473859812640,
>> order=1473869129772], writeVer=null, implicit=false, loc=true, threadId=50,
>> startTime=1473859812630, nodeId=6f7a39ba-c520-435e-9480-a42ecf0d9a58,
>> startVer=GridCacheVersion [topVer=85333522, nod
>> eOrderDrId=10, globalTime=1473859812640, order=1473869129772], endVer=null,
>> isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=5000,
>> sysInvalidate=false, sys=false, plc=2, commitVer=nul
>> l, finalizing=NONE, preparing=false, invalidParts=null,
>> state=MARKED_ROLLBACK, timedOut=false, topVer=AffinityTopologyVersion
>> [topVer=101, minorTopVer=0], duration=5007ms, onePhaseCommit=false],
>> size=1]]]]
>>        at
>> org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1618)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.processors.cache.IgniteCacheProxy.cacheException(IgniteCacheProxy.java:1841)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.processors.cache.IgniteCacheProxy.get(IgniteCacheProxy.java:871)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> com.somecompany.grid.server.tradegen.BatchIdHelper.getListOfIds(BatchIdHelper.java:69)
>> ~[data-grid-server-ignite-3.0-SNAPSHOT.jar:3.0-SNAPSHOT]
>>        at
>> com.somecompany.grid.server.tradegen.TradeGenerator.generateUniqueTradeId64(TradeGenerator.java:47)
>> ~[data-grid-server-ignite-3.0-SNAPSHOT.jar:3.0-SNAPSHOT]
>>        at
>> com.somecompany.grid.server.tradegen.TradeGenerator.allocateTradesFromFills(TradeGenerator.java:158)
>> ~[data-grid-server-ignite-3.0-SNAPSHOT.jar:3.0-SNAPSHOT]
>>        at
>> com.somecompany.grid.server.tradegen.OrderHolderSaveRunnable.run(OrderHolderSaveRunnable.java:271)
>> ~[data-grid-server-ignite-3.0-SNAPSHOT.jar:3.0-SNAPSHOT]
>>        at
>> org.apache.ignite.internal.processors.closure.GridClosureProcessor$C4.execute(GridClosureProcessor.java:1879)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:509)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6397)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:503)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:456)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.processors.job.GridJobProcessor.processJobExecuteRequest(GridJobProcessor.java:1166)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:1770)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:821)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.managers.communication.GridIoManager.access$1600(GridIoManager.java:103)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.managers.communication.GridIoManager$5.run(GridIoManager.java:784)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> [?:1.8.0_60]
>>        at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> [?:1.8.0_60]
>>        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_60]
>> Caused by: org.apache.ignite.transactions.TransactionTimeoutException:
>> Failed to acquire lock within provided timeout for transaction
>> [timeout=5000, tx=GridNearTxLocal [mappings=IgniteTxMappings
>> Impl [], nearLocallyMapped=false, colocatedLocallyMapped=false,
>> needCheckBackup=null, hasRemoteLocks=false, mappings=IgniteTxMappingsImpl
>> [], super=GridDhtTxLocalAdapter [nearOnOriginatingNode=f
>> alse, nearNodes=[], dhtNodes=[], explicitLock=false,
>> super=IgniteTxLocalAdapter [completedBase=null, sndTransformedVals=false,
>> depEnabled=false, txState=IgniteTxStateImpl [activeCacheIds=GridLon
>> gList [idx=1, arr=[1633849959]], txMap={IgniteTxKey [key=KeyCacheObjectImpl
>> [val=BatchIdKey [privDb=trim_sys], hasValBytes=true],
>> cacheId=1633849959]=IgniteTxEntry [key=KeyCacheObjectImpl [val=B
>> atchIdKey [privDb=trim_sys], hasValBytes=true], cacheId=1633849959,
>> txKey=IgniteTxKey [key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys],
>> hasValBytes=true], cacheId=1633849959], val=[op=R
>> EAD, val=null], prevVal=[op=NOOP, val=null], entryProcessorsCol=null,
>> ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null,
>> dhtVer=null, filters=null, filtersPassed=false, filtersSe
>> t=true, entry=GridDhtDetachedCacheEntry [super=GridDistributedCacheEntry
>> [super=GridCacheMapEntry [key=KeyCacheObjectImpl [val=BatchIdKey
>> [privDb=trim_sys], hasValBytes=true], val=null, startVer
>> =1473869129773, ver=GridCacheVersion [topVer=85333522, nodeOrderDrId=10,
>> globalTime=1473859812640, order=1473869129773], hash=1508409679,
>> extras=null, flags=0]]], prepared=false, locked=false, n
>> odeId=3cd37805-46a7-4287-875e-9cbd0cf03fad, locMapped=false, expiryPlc=null,
>> transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null,
>> xidVer=GridCacheVersion [topVer=85333522, nodeOr
>> derDrId=10, globalTime=1473859812640, order=1473869129772]]}],
>> super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=85333522,
>> nodeOrderDrId=10, globalTime=1473859812640, order=1473869129772],
>> writeVer=null, implicit=false, loc=true, threadId=50,
>> startTime=1473859812630, nodeId=6f7a39ba-c520-435e-9480-a42ecf0d9a58,
>> startVer=GridCacheVersion [topVer=85333522, nodeOrderDrId=10, globalTi
>> me=1473859812640, order=1473869129772], endVer=null,
>> isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=5000,
>> sysInvalidate=false, sys=false, plc=2, commitVer=null, finalizing=NONE, pre
>> paring=false, invalidParts=null, state=MARKED_ROLLBACK, timedOut=false,
>> topVer=AffinityTopologyVersion [topVer=101, minorTopVer=0], duration=5007ms,
>> onePhaseCommit=false], size=1]]]]
>>        at
>> org.apache.ignite.internal.util.IgniteUtils$12.apply(IgniteUtils.java:791)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.util.IgniteUtils$12.apply(IgniteUtils.java:789)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        ... 21 more
>> Caused by:
>> org.apache.ignite.internal.transactions.IgniteTxTimeoutCheckedException:
>> Failed to acquire lock within provided timeout for transaction
>> [timeout=5000, tx=GridNearTxLocal [mappings=Ign
>> iteTxMappingsImpl [], nearLocallyMapped=false, colocatedLocallyMapped=false,
>> needCheckBackup=null, hasRemoteLocks=false, mappings=IgniteTxMappingsImpl
>> [], super=GridDhtTxLocalAdapter [nearOnOrig
>> inatingNode=false, nearNodes=[], dhtNodes=[], explicitLock=false,
>> super=IgniteTxLocalAdapter [completedBase=null, sndTransformedVals=false,
>> depEnabled=false, txState=IgniteTxStateImpl [activeCac
>> heIds=GridLongList [idx=1, arr=[1633849959]], txMap={IgniteTxKey
>> [key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys],
>> hasValBytes=true], cacheId=1633849959]=IgniteTxEntry [key=KeyCacheObje
>> ctImpl [val=BatchIdKey [privDb=trim_sys], hasValBytes=true],
>> cacheId=1633849959, txKey=IgniteTxKey [key=KeyCacheObjectImpl
>> [val=BatchIdKey [privDb=trim_sys], hasValBytes=true], cacheId=163384995
>> 9], val=[op=READ, val=null], prevVal=[op=NOOP, val=null],
>> entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null,
>> explicitVer=null, dhtVer=null, filters=null, filtersPassed=fal
>> se, filtersSet=true, entry=GridDhtDetachedCacheEntry
>> [super=GridDistributedCacheEntry [super=GridCacheMapEntry
>> [key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys],
>> hasValBytes=true], val=n
>> ull, startVer=1473869129773, ver=GridCacheVersion [topVer=85333522,
>> nodeOrderDrId=10, globalTime=1473859812640, order=1473869129773],
>> hash=1508409679, extras=null, flags=0]]], prepared=false, lo
>> cked=false, nodeId=3cd37805-46a7-4287-875e-9cbd0cf03fad, locMapped=false,
>> expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0,
>> serReadVer=null, xidVer=GridCacheVersion [topVer=853
>> 33522, nodeOrderDrId=10, globalTime=1473859812640, order=1473869129772]]}],
>> super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=85333522,
>> nodeOrderDrId=10, globalTime=1473859812640, order=147
>> 3869129772], writeVer=null, implicit=false, loc=true, threadId=50,
>> startTime=1473859812630, nodeId=6f7a39ba-c520-435e-9480-a42ecf0d9a58,
>> startVer=GridCacheVersion [topVer=85333522, nodeOrderDrId
>> =10, globalTime=1473859812640, order=1473869129772], endVer=null,
>> isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=5000,
>> sysInvalidate=false, sys=false, plc=2, commitVer=null, finaliz
>> ing=NONE, preparing=false, invalidParts=null, state=MARKED_ROLLBACK,
>> timedOut=false, topVer=AffinityTopologyVersion [topVer=101, minorTopVer=0],
>> duration=5007ms, onePhaseCommit=false], size=1]]]
>> ]
>>        at
>> org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure2.apply(IgniteTxLocalAdapter.java:4023)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure2.apply(IgniteTxLocalAdapter.java:4010)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.util.future.GridEmbeddedFuture$3.applyx(GridEmbeddedFuture.java:158)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:297)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:290)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:262)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListeners(GridFutureAdapter.java:250)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:380)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:346)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.onComplete(GridDhtColocatedLockFuture.java:535)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.access$1100(GridDhtColocatedLockFuture.java:78)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture$LockTimeoutObject.onTimeout(GridDhtColocatedLockFuture.java:1291)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:159)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>>        at
>> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>> 
>> 
>> 
>> 
>> --
>> View this message in context: http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p7809.html
>> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
> 
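
The TransactionTimeoutException in the quoted trace is reported by a PESSIMISTIC transaction that failed to acquire a lock within its timeout (timeout=5000 in the trace). As a rough analogy only, using java.util.concurrent rather than Ignite, the sketch below shows a waiter timing out because the current lock holder is stalled. LockTimeoutSketch and acquireWhileHolderStalls are hypothetical names, and Thread.sleep is an assumed stand-in for a long GC pause or a slow node.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Analogy only: a local lock plays the role of a distributed key lock.
final class LockTimeoutSketch {
    // Returns true if the caller gets the lock within timeoutMillis while
    // another thread holds it and stalls for holdMillis.
    static boolean acquireWhileHolderStalls(long holdMillis, long timeoutMillis) {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch held = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            lock.lock();
            try {
                held.countDown();
                Thread.sleep(holdMillis); // stand-in for a GC pause or a slow node
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        });
        holder.start();

        try {
            held.await(); // make sure the holder owns the lock before we try
            boolean acquired = lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.unlock();
            }
            holder.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        // Holder stalls longer than the waiter's timeout: the waiter gives up.
        System.out.println("short timeout acquired: " + acquireWhileHolderStalls(500, 50));
        // Holder releases well within the waiter's timeout: the waiter succeeds.
        System.out.println("long timeout acquired:  " + acquireWhileHolderStalls(50, 5000));
    }
}
```

The waiter itself is healthy in both runs; whether it succeeds depends only on how long the holder stalls relative to the timeout, which mirrors the trade-off discussed earlier in the thread.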


Re: One failing node stalling the whole cluster

Posted by Denis Magda <dm...@gridgain.com>.
Good news, everyone. It looks like we were able to get to the bottom of this issue:
https://ggsystems.atlassian.net/browse/IGN-5958

Hope it will be fixed soon.

—
Denis

> On Sep 16, 2016, at 9:38 AM, yfernando <yo...@tudor.com> wrote:
> 
> Unfortunately, I am unable to send the full log files, but they contain the
> following exceptions:
> 
> [14 Sep 2016 11:14:30.290 EDT] [pub-#16%DataGridServer-Development%] ERROR
> 11223 (OrderHolderSaveRunnable.java:273) exception ocurred while generating
> Trade Order for Order: OrderKey [traderId=5207, orderId=16084348]
> javax.cache.CacheException: class
> org.apache.ignite.transactions.TransactionTimeoutException: Failed to
> acquire lock within provided timeout for transaction [timeout=5000,
> tx=GridNearTxLocal [ma
> ppings=IgniteTxMappingsImpl [], nearLocallyMapped=false,
> colocatedLocallyMapped=false, needCheckBackup=null, hasRemoteLocks=false,
> mappings=IgniteTxMappingsImpl [], super=GridDhtTxLocalAdapter [
> nearOnOriginatingNode=false, nearNodes=[], dhtNodes=[], explicitLock=false,
> super=IgniteTxLocalAdapter [completedBase=null, sndTransformedVals=false,
> depEnabled=false, txState=IgniteTxStateImpl
> [activeCacheIds=GridLongList [idx=1, arr=[1633849959]], txMap={IgniteTxKey
> [key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys],
> hasValBytes=true], cacheId=1633849959]=IgniteTxEntry [key=Ke
> yCacheObjectImpl [val=BatchIdKey [privDb=trim_sys], hasValBytes=true],
> cacheId=1633849959, txKey=IgniteTxKey [key=KeyCacheObjectImpl
> [val=BatchIdKey [privDb=trim_sys], hasValBytes=true], cacheId
> =1633849959], val=[op=READ, val=null], prevVal=[op=NOOP, val=null],
> entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null,
> explicitVer=null, dhtVer=null, filters=null, filters
> Passed=false, filtersSet=true, entry=GridDhtDetachedCacheEntry
> [super=GridDistributedCacheEntry [super=GridCacheMapEntry
> [key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys], hasValBytes=tr
> ue], val=null, startVer=1473869129773, ver=GridCacheVersion
> [topVer=85333522, nodeOrderDrId=10, globalTime=1473859812640,
> order=1473869129773], hash=1508409679, extras=null, flags=0]]], prepared
> =false, locked=false, nodeId=3cd37805-46a7-4287-875e-9cbd0cf03fad,
> locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0,
> partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion [
> topVer=85333522, nodeOrderDrId=10, globalTime=1473859812640,
> order=1473869129772]]}], super=IgniteTxAdapter [xidVer=GridCacheVersion
> [topVer=85333522, nodeOrderDrId=10, globalTime=1473859812640,
> order=1473869129772], writeVer=null, implicit=false, loc=true, threadId=50,
> startTime=1473859812630, nodeId=6f7a39ba-c520-435e-9480-a42ecf0d9a58,
> startVer=GridCacheVersion [topVer=85333522, nod
> eOrderDrId=10, globalTime=1473859812640, order=1473869129772], endVer=null,
> isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=5000,
> sysInvalidate=false, sys=false, plc=2, commitVer=nul
> l, finalizing=NONE, preparing=false, invalidParts=null,
> state=MARKED_ROLLBACK, timedOut=false, topVer=AffinityTopologyVersion
> [topVer=101, minorTopVer=0], duration=5007ms, onePhaseCommit=false],
> size=1]]]]
>        at
> org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1618)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.processors.cache.IgniteCacheProxy.cacheException(IgniteCacheProxy.java:1841)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.processors.cache.IgniteCacheProxy.get(IgniteCacheProxy.java:871)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> com.somecompany.grid.server.tradegen.BatchIdHelper.getListOfIds(BatchIdHelper.java:69)
> ~[data-grid-server-ignite-3.0-SNAPSHOT.jar:3.0-SNAPSHOT]
>        at
> com.somecompany.grid.server.tradegen.TradeGenerator.generateUniqueTradeId64(TradeGenerator.java:47)
> ~[data-grid-server-ignite-3.0-SNAPSHOT.jar:3.0-SNAPSHOT]
>        at
> com.somecompany.grid.server.tradegen.TradeGenerator.allocateTradesFromFills(TradeGenerator.java:158)
> ~[data-grid-server-ignite-3.0-SNAPSHOT.jar:3.0-SNAPSHOT]
>        at
> com.somecompany.grid.server.tradegen.OrderHolderSaveRunnable.run(OrderHolderSaveRunnable.java:271)
> ~[data-grid-server-ignite-3.0-SNAPSHOT.jar:3.0-SNAPSHOT]
>        at
> org.apache.ignite.internal.processors.closure.GridClosureProcessor$C4.execute(GridClosureProcessor.java:1879)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:509)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6397)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:503)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:456)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.processors.job.GridJobProcessor.processJobExecuteRequest(GridJobProcessor.java:1166)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:1770)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:821)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.managers.communication.GridIoManager.access$1600(GridIoManager.java:103)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.managers.communication.GridIoManager$5.run(GridIoManager.java:784)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [?:1.8.0_60]
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [?:1.8.0_60]
>        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_60]
> Caused by: org.apache.ignite.transactions.TransactionTimeoutException:
> Failed to acquire lock within provided timeout for transaction
> [timeout=5000, tx=GridNearTxLocal [mappings=IgniteTxMappings
> Impl [], nearLocallyMapped=false, colocatedLocallyMapped=false,
> needCheckBackup=null, hasRemoteLocks=false, mappings=IgniteTxMappingsImpl
> [], super=GridDhtTxLocalAdapter [nearOnOriginatingNode=f
> alse, nearNodes=[], dhtNodes=[], explicitLock=false,
> super=IgniteTxLocalAdapter [completedBase=null, sndTransformedVals=false,
> depEnabled=false, txState=IgniteTxStateImpl [activeCacheIds=GridLon
> gList [idx=1, arr=[1633849959]], txMap={IgniteTxKey [key=KeyCacheObjectImpl
> [val=BatchIdKey [privDb=trim_sys], hasValBytes=true],
> cacheId=1633849959]=IgniteTxEntry [key=KeyCacheObjectImpl [val=B
> atchIdKey [privDb=trim_sys], hasValBytes=true], cacheId=1633849959,
> txKey=IgniteTxKey [key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys],
> hasValBytes=true], cacheId=1633849959], val=[op=R
> EAD, val=null], prevVal=[op=NOOP, val=null], entryProcessorsCol=null,
> ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null,
> dhtVer=null, filters=null, filtersPassed=false, filtersSe
> t=true, entry=GridDhtDetachedCacheEntry [super=GridDistributedCacheEntry
> [super=GridCacheMapEntry [key=KeyCacheObjectImpl [val=BatchIdKey
> [privDb=trim_sys], hasValBytes=true], val=null, startVer
> =1473869129773, ver=GridCacheVersion [topVer=85333522, nodeOrderDrId=10,
> globalTime=1473859812640, order=1473869129773], hash=1508409679,
> extras=null, flags=0]]], prepared=false, locked=false, n
> odeId=3cd37805-46a7-4287-875e-9cbd0cf03fad, locMapped=false, expiryPlc=null,
> transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null,
> xidVer=GridCacheVersion [topVer=85333522, nodeOr
> derDrId=10, globalTime=1473859812640, order=1473869129772]]}],
> super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=85333522,
> nodeOrderDrId=10, globalTime=1473859812640, order=1473869129772],
> writeVer=null, implicit=false, loc=true, threadId=50,
> startTime=1473859812630, nodeId=6f7a39ba-c520-435e-9480-a42ecf0d9a58,
> startVer=GridCacheVersion [topVer=85333522, nodeOrderDrId=10, globalTi
> me=1473859812640, order=1473869129772], endVer=null,
> isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=5000,
> sysInvalidate=false, sys=false, plc=2, commitVer=null, finalizing=NONE, pre
> paring=false, invalidParts=null, state=MARKED_ROLLBACK, timedOut=false,
> topVer=AffinityTopologyVersion [topVer=101, minorTopVer=0], duration=5007ms,
> onePhaseCommit=false], size=1]]]]
>        at
> org.apache.ignite.internal.util.IgniteUtils$12.apply(IgniteUtils.java:791)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.util.IgniteUtils$12.apply(IgniteUtils.java:789)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        ... 21 more
> Caused by:
> org.apache.ignite.internal.transactions.IgniteTxTimeoutCheckedException:
> Failed to acquire lock within provided timeout for transaction
> [timeout=5000, tx=GridNearTxLocal [mappings=Ign
> iteTxMappingsImpl [], nearLocallyMapped=false, colocatedLocallyMapped=false,
> needCheckBackup=null, hasRemoteLocks=false, mappings=IgniteTxMappingsImpl
> [], super=GridDhtTxLocalAdapter [nearOnOrig
> inatingNode=false, nearNodes=[], dhtNodes=[], explicitLock=false,
> super=IgniteTxLocalAdapter [completedBase=null, sndTransformedVals=false,
> depEnabled=false, txState=IgniteTxStateImpl [activeCac
> heIds=GridLongList [idx=1, arr=[1633849959]], txMap={IgniteTxKey
> [key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys],
> hasValBytes=true], cacheId=1633849959]=IgniteTxEntry [key=KeyCacheObje
> ctImpl [val=BatchIdKey [privDb=trim_sys], hasValBytes=true],
> cacheId=1633849959, txKey=IgniteTxKey [key=KeyCacheObjectImpl
> [val=BatchIdKey [privDb=trim_sys], hasValBytes=true], cacheId=163384995
> 9], val=[op=READ, val=null], prevVal=[op=NOOP, val=null],
> entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null,
> explicitVer=null, dhtVer=null, filters=null, filtersPassed=fal
> se, filtersSet=true, entry=GridDhtDetachedCacheEntry
> [super=GridDistributedCacheEntry [super=GridCacheMapEntry
> [key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys],
> hasValBytes=true], val=n
> ull, startVer=1473869129773, ver=GridCacheVersion [topVer=85333522,
> nodeOrderDrId=10, globalTime=1473859812640, order=1473869129773],
> hash=1508409679, extras=null, flags=0]]], prepared=false, lo
> cked=false, nodeId=3cd37805-46a7-4287-875e-9cbd0cf03fad, locMapped=false,
> expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0,
> serReadVer=null, xidVer=GridCacheVersion [topVer=853
> 33522, nodeOrderDrId=10, globalTime=1473859812640, order=1473869129772]]}],
> super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=85333522,
> nodeOrderDrId=10, globalTime=1473859812640, order=147
> 3869129772], writeVer=null, implicit=false, loc=true, threadId=50,
> startTime=1473859812630, nodeId=6f7a39ba-c520-435e-9480-a42ecf0d9a58,
> startVer=GridCacheVersion [topVer=85333522, nodeOrderDrId
> =10, globalTime=1473859812640, order=1473869129772], endVer=null,
> isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=5000,
> sysInvalidate=false, sys=false, plc=2, commitVer=null, finaliz
> ing=NONE, preparing=false, invalidParts=null, state=MARKED_ROLLBACK,
> timedOut=false, topVer=AffinityTopologyVersion [topVer=101, minorTopVer=0],
> duration=5007ms, onePhaseCommit=false], size=1]]]
> ]
>        at
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure2.apply(IgniteTxLocalAdapter.java:4023)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure2.apply(IgniteTxLocalAdapter.java:4010)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.util.future.GridEmbeddedFuture$3.applyx(GridEmbeddedFuture.java:158)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:297)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:290)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:262)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListeners(GridFutureAdapter.java:250)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:380)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:346)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.onComplete(GridDhtColocatedLockFuture.java:535)
> ~[ignite-core-1.5.0.final.jar:1.5.0.fi
> nal]
>        at
> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.access$1100(GridDhtColocatedLockFuture.java:78)
> ~[ignite-core-1.5.0.final.jar:1.5.0.fi
> nal]
>        at
> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture$LockTimeoutObject.onTimeout(GridDhtColocatedLockFuture.java:1291)
> ~[ignite-core-1.5.0.
> final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:159)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]
>        at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
> ~[ignite-core-1.5.0.final.jar:1.5.0.final]


Re: One failing node stalling the whole cluster

Posted by yfernando <yo...@tudor.com>.
Unfortunately I am unable to send the full log files, but they contain the
following exceptions:

[14 Sep 2016 11:14:30.290 EDT] [pub-#16%DataGridServer-Development%] ERROR 11223 (OrderHolderSaveRunnable.java:273) exception ocurred while generating Trade Order for Order: OrderKey [traderId=5207, orderId=16084348]
javax.cache.CacheException: class org.apache.ignite.transactions.TransactionTimeoutException:
Failed to acquire lock within provided timeout for transaction [timeout=5000,
tx=GridNearTxLocal [mappings=IgniteTxMappingsImpl [], nearLocallyMapped=false,
colocatedLocallyMapped=false, needCheckBackup=null, hasRemoteLocks=false,
mappings=IgniteTxMappingsImpl [], super=GridDhtTxLocalAdapter [nearOnOriginatingNode=false,
nearNodes=[], dhtNodes=[], explicitLock=false, super=IgniteTxLocalAdapter [completedBase=null,
sndTransformedVals=false, depEnabled=false, txState=IgniteTxStateImpl
[activeCacheIds=GridLongList [idx=1, arr=[1633849959]], txMap={IgniteTxKey
[key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys], hasValBytes=true],
cacheId=1633849959]=IgniteTxEntry [key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys],
hasValBytes=true], cacheId=1633849959, txKey=IgniteTxKey [key=KeyCacheObjectImpl
[val=BatchIdKey [privDb=trim_sys], hasValBytes=true], cacheId=1633849959], val=[op=READ,
val=null], prevVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=null,
filtersPassed=false, filtersSet=true, entry=GridDhtDetachedCacheEntry
[super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl
[val=BatchIdKey [privDb=trim_sys], hasValBytes=true], val=null, startVer=1473869129773,
ver=GridCacheVersion [topVer=85333522, nodeOrderDrId=10, globalTime=1473859812640,
order=1473869129773], hash=1508409679, extras=null, flags=0]]], prepared=false, locked=false,
nodeId=3cd37805-46a7-4287-875e-9cbd0cf03fad, locMapped=false, expiryPlc=null,
transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion
[topVer=85333522, nodeOrderDrId=10, globalTime=1473859812640, order=1473869129772]]}],
super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=85333522, nodeOrderDrId=10,
globalTime=1473859812640, order=1473869129772], writeVer=null, implicit=false, loc=true,
threadId=50, startTime=1473859812630, nodeId=6f7a39ba-c520-435e-9480-a42ecf0d9a58,
startVer=GridCacheVersion [topVer=85333522, nodeOrderDrId=10, globalTime=1473859812640,
order=1473869129772], endVer=null, isolation=REPEATABLE_READ, concurrency=PESSIMISTIC,
timeout=5000, sysInvalidate=false, sys=false, plc=2, commitVer=null, finalizing=NONE,
preparing=false, invalidParts=null, state=MARKED_ROLLBACK, timedOut=false,
topVer=AffinityTopologyVersion [topVer=101, minorTopVer=0], duration=5007ms,
onePhaseCommit=false], size=1]]]]
        at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1618) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.processors.cache.IgniteCacheProxy.cacheException(IgniteCacheProxy.java:1841) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.processors.cache.IgniteCacheProxy.get(IgniteCacheProxy.java:871) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at com.somecompany.grid.server.tradegen.BatchIdHelper.getListOfIds(BatchIdHelper.java:69) ~[data-grid-server-ignite-3.0-SNAPSHOT.jar:3.0-SNAPSHOT]
        at com.somecompany.grid.server.tradegen.TradeGenerator.generateUniqueTradeId64(TradeGenerator.java:47) ~[data-grid-server-ignite-3.0-SNAPSHOT.jar:3.0-SNAPSHOT]
        at com.somecompany.grid.server.tradegen.TradeGenerator.allocateTradesFromFills(TradeGenerator.java:158) ~[data-grid-server-ignite-3.0-SNAPSHOT.jar:3.0-SNAPSHOT]
        at com.somecompany.grid.server.tradegen.OrderHolderSaveRunnable.run(OrderHolderSaveRunnable.java:271) ~[data-grid-server-ignite-3.0-SNAPSHOT.jar:3.0-SNAPSHOT]
        at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C4.execute(GridClosureProcessor.java:1879) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:509) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6397) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:503) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:456) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.processors.job.GridJobProcessor.processJobExecuteRequest(GridJobProcessor.java:1166) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:1770) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:821) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.managers.communication.GridIoManager.access$1600(GridIoManager.java:103) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.managers.communication.GridIoManager$5.run(GridIoManager.java:784) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_60]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_60]
Caused by: org.apache.ignite.transactions.TransactionTimeoutException:
Failed to acquire lock within provided timeout for transaction [timeout=5000,
tx=GridNearTxLocal [mappings=IgniteTxMappingsImpl [], nearLocallyMapped=false,
colocatedLocallyMapped=false, needCheckBackup=null, hasRemoteLocks=false,
mappings=IgniteTxMappingsImpl [], super=GridDhtTxLocalAdapter [nearOnOriginatingNode=false,
nearNodes=[], dhtNodes=[], explicitLock=false, super=IgniteTxLocalAdapter [completedBase=null,
sndTransformedVals=false, depEnabled=false, txState=IgniteTxStateImpl
[activeCacheIds=GridLongList [idx=1, arr=[1633849959]], txMap={IgniteTxKey
[key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys], hasValBytes=true],
cacheId=1633849959]=IgniteTxEntry [key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys],
hasValBytes=true], cacheId=1633849959, txKey=IgniteTxKey [key=KeyCacheObjectImpl
[val=BatchIdKey [privDb=trim_sys], hasValBytes=true], cacheId=1633849959], val=[op=READ,
val=null], prevVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=null,
filtersPassed=false, filtersSet=true, entry=GridDhtDetachedCacheEntry
[super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl
[val=BatchIdKey [privDb=trim_sys], hasValBytes=true], val=null, startVer=1473869129773,
ver=GridCacheVersion [topVer=85333522, nodeOrderDrId=10, globalTime=1473859812640,
order=1473869129773], hash=1508409679, extras=null, flags=0]]], prepared=false, locked=false,
nodeId=3cd37805-46a7-4287-875e-9cbd0cf03fad, locMapped=false, expiryPlc=null,
transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion
[topVer=85333522, nodeOrderDrId=10, globalTime=1473859812640, order=1473869129772]]}],
super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=85333522, nodeOrderDrId=10,
globalTime=1473859812640, order=1473869129772], writeVer=null, implicit=false, loc=true,
threadId=50, startTime=1473859812630, nodeId=6f7a39ba-c520-435e-9480-a42ecf0d9a58,
startVer=GridCacheVersion [topVer=85333522, nodeOrderDrId=10, globalTime=1473859812640,
order=1473869129772], endVer=null, isolation=REPEATABLE_READ, concurrency=PESSIMISTIC,
timeout=5000, sysInvalidate=false, sys=false, plc=2, commitVer=null, finalizing=NONE,
preparing=false, invalidParts=null, state=MARKED_ROLLBACK, timedOut=false,
topVer=AffinityTopologyVersion [topVer=101, minorTopVer=0], duration=5007ms,
onePhaseCommit=false], size=1]]]]
        at org.apache.ignite.internal.util.IgniteUtils$12.apply(IgniteUtils.java:791) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.util.IgniteUtils$12.apply(IgniteUtils.java:789) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        ... 21 more
Caused by: org.apache.ignite.internal.transactions.IgniteTxTimeoutCheckedException:
Failed to acquire lock within provided timeout for transaction [timeout=5000,
tx=GridNearTxLocal [mappings=IgniteTxMappingsImpl [], nearLocallyMapped=false,
colocatedLocallyMapped=false, needCheckBackup=null, hasRemoteLocks=false,
mappings=IgniteTxMappingsImpl [], super=GridDhtTxLocalAdapter [nearOnOriginatingNode=false,
nearNodes=[], dhtNodes=[], explicitLock=false, super=IgniteTxLocalAdapter [completedBase=null,
sndTransformedVals=false, depEnabled=false, txState=IgniteTxStateImpl
[activeCacheIds=GridLongList [idx=1, arr=[1633849959]], txMap={IgniteTxKey
[key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys], hasValBytes=true],
cacheId=1633849959]=IgniteTxEntry [key=KeyCacheObjectImpl [val=BatchIdKey [privDb=trim_sys],
hasValBytes=true], cacheId=1633849959, txKey=IgniteTxKey [key=KeyCacheObjectImpl
[val=BatchIdKey [privDb=trim_sys], hasValBytes=true], cacheId=1633849959], val=[op=READ,
val=null], prevVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=null,
filtersPassed=false, filtersSet=true, entry=GridDhtDetachedCacheEntry
[super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl
[val=BatchIdKey [privDb=trim_sys], hasValBytes=true], val=null, startVer=1473869129773,
ver=GridCacheVersion [topVer=85333522, nodeOrderDrId=10, globalTime=1473859812640,
order=1473869129773], hash=1508409679, extras=null, flags=0]]], prepared=false, locked=false,
nodeId=3cd37805-46a7-4287-875e-9cbd0cf03fad, locMapped=false, expiryPlc=null,
transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion
[topVer=85333522, nodeOrderDrId=10, globalTime=1473859812640, order=1473869129772]]}],
super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=85333522, nodeOrderDrId=10,
globalTime=1473859812640, order=1473869129772], writeVer=null, implicit=false, loc=true,
threadId=50, startTime=1473859812630, nodeId=6f7a39ba-c520-435e-9480-a42ecf0d9a58,
startVer=GridCacheVersion [topVer=85333522, nodeOrderDrId=10, globalTime=1473859812640,
order=1473869129772], endVer=null, isolation=REPEATABLE_READ, concurrency=PESSIMISTIC,
timeout=5000, sysInvalidate=false, sys=false, plc=2, commitVer=null, finalizing=NONE,
preparing=false, invalidParts=null, state=MARKED_ROLLBACK, timedOut=false,
topVer=AffinityTopologyVersion [topVer=101, minorTopVer=0], duration=5007ms,
onePhaseCommit=false], size=1]]]]
        at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure2.apply(IgniteTxLocalAdapter.java:4023) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure2.apply(IgniteTxLocalAdapter.java:4010) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.util.future.GridEmbeddedFuture$3.applyx(GridEmbeddedFuture.java:158) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:297) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:290) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:262) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListeners(GridFutureAdapter.java:250) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:380) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:346) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.onComplete(GridDhtColocatedLockFuture.java:535) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.access$1100(GridDhtColocatedLockFuture.java:78) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture$LockTimeoutObject.onTimeout(GridDhtColocatedLockFuture.java:1291) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:159) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) ~[ignite-core-1.5.0.final.jar:1.5.0.final]
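
For anyone skimming a trace like the one above, here is a minimal sketch in plain Java (no Ignite dependency) that pulls the few fields that matter out of the exception message: the configured timeout, the measured duration, the transaction state, and the concurrency and isolation modes. The class name TxTimeoutFields and its methods are made up for illustration and are not part of Ignite, and the sample string in main is only an abbreviated excerpt of the message above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts a few key=value fields from a tx timeout message. */
final class TxTimeoutFields {
    private static final String[] FIELDS =
        {"timeout", "duration", "state", "concurrency", "isolation"};

    /** Returns the first value found for each field, in FIELDS order; absent fields are skipped. */
    static Map<String, String> extract(String message) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String field : FIELDS) {
            // \b keeps "timeout" from matching inside a longer identifier;
            // the value runs up to the next comma, closing bracket or whitespace.
            Matcher m = Pattern.compile("\\b" + field + "=([^,\\]\\s]+)").matcher(message);
            if (m.find()) {
                out.put(field, m.group(1));
            }
        }
        return out;
    }

    /** True when the reported duration (e.g. "5007ms") is at least the configured timeout. */
    static boolean exceededTimeout(Map<String, String> fields) {
        // Assumes both fields are present; a real tool would validate this first.
        long timeout = Long.parseLong(fields.get("timeout"));
        long duration = Long.parseLong(fields.get("duration").replace("ms", ""));
        return duration >= timeout;
    }

    public static void main(String[] args) {
        // Abbreviated excerpt of the exception message, not the full dump.
        String msg = "Failed to acquire lock within provided timeout for transaction "
            + "[timeout=5000, tx=GridNearTxLocal [isolation=REPEATABLE_READ, "
            + "concurrency=PESSIMISTIC, timeout=5000, state=MARKED_ROLLBACK, "
            + "timedOut=false, duration=5007ms, onePhaseCommit=false]]";
        Map<String, String> fields = extract(msg);
        System.out.println(fields);
        System.out.println("exceeded timeout: " + exceededTimeout(fields));
    }
}
```

Read this way, the trace says that a PESSIMISTIC / REPEATABLE_READ transaction waited 5007ms for a lock against a 5000ms timeout and was then marked for rollback.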





Re: One failing node stalling the whole cluster

Posted by Denis Magda <dm...@apache.org>.
It’s the topology change that affects the transaction rollback, not vice versa. I would wait for the more experienced Apache committers who maintain the caching components to chime in. They should be able to get to the bottom of this.

In the meantime, please attach the logs from all the nodes (both the servers and the clients running on your application side). I hope you have preserved them.

—
Denis

> On Sep 16, 2016, at 9:09 AM, yfernando <yo...@tudor.com> wrote:
> 
> No, the node that failed was a server node.
> 
> About the rollback, yes indeed. The few times the grid has hung, we have
> seen a similar lock on rollback. Why would a transaction rollback impact the
> topology?
> 
> This thread dump was taken at least 10 minutes after the node died, so in an
> ideal world the grid should have recovered from the topology change caused
> by the node going down.


Re: One failing node stalling the whole cluster

Posted by yfernando <yo...@tudor.com>.
No, the node that failed was a server node.

About the rollback, yes indeed. The few times the grid has hung, we have
seen a similar lock on rollback. Why would a transaction rollback impact the
topology?

This thread dump was taken at least 10 minutes after the node died, so in an
ideal world the grid should have recovered from the topology change caused
by the node going down.






Re: One failing node stalling the whole cluster

Posted by Denis Magda <dm...@apache.org>.
Is the node that experienced the long GC pause and eventually failed a client node? This is important to know.

From the thread dumps I see that some of the nodes are unable to roll back transactions:

"pub-#1%DataGridServer-Development%" Id=35 in WAITING on lock=org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxFinishFuture@3e4dcc0c
  at sun.misc.Unsafe.park(Native Method)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
  at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:155)
  at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:115)
  at org.apache.ignite.internal.processors.cache.transactions.TransactionProxyImpl.rollback(TransactionProxyImpl.java:296)
  at com.somecompany.grid.server.tradegen.BatchIdHelper.getListOfIds(BatchIdHelper.java:84)
  at com.somecompany.grid.server.tradegen.TradeGenerator.generateUniqueTradeId64(TradeGenerator.java:47)
  at com.somecompany.grid.server.tradegen.TradeGenerator.allocateTradesFromFills(TradeGenerator.java:158)

while the others are waiting for an affinity topology change to complete, which, in my understanding, prevents the first nodes from rolling back their transactions successfully:

ock=org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache$AffinityReadyFuture@14a5b2c7
  at sun.misc.Unsafe.park(Native Method)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
  at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:157)
  at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:115)
  at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.awaitTopologyVersion(GridAffinityAssignmentCache.java:477)
  at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:435)
  at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.primaryPartitions(GridAffinityAssignmentCache.java:399)
  at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryPartitions(GridCacheAffinityManager.java:366)
  at org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor.reservePartitions(GridMapQueryExecutor.java:316)
  at org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor.onQueryRequest(GridMapQueryExecutor.java:428)
  at org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor.onMessage(GridMapQueryExecutor.java:184)
  at org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor$2.onMessage(GridMapQueryExecutor.java:159)
  at org.apache.ignite.internal.managers.communication.GridIoManager$ArrayListener.onMessage(GridIoManager.java:1821)
  at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:821)
  at org.apache.ignite.internal.managers.communication.GridIoManager.access$1600(GridIoManager.java:103)
  at org.apache.ignite.internal.managers.communication.GridIoManager$5.run(GridIoManager.java:784)
  at 
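
To compare dumps from several nodes quickly, something like the following could tally which futures the parked threads are waiting on. It is a minimal sketch in plain Java that assumes thread headers look like the first one quoted above ("name" Id=N in WAITING on lock=Class@hash). The class name ThreadDumpWaits is made up for illustration, and the "worker-1" thread name in the sample is hypothetical, since the header of the second dump was cut off.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: groups threads in a text thread dump by the lock class they wait on. */
final class ThreadDumpWaits {
    // Matches header lines such as:
    //   "pub-#1%Grid%" Id=35 in WAITING on lock=some.pkg.SomeFuture@3e4dcc0c
    private static final Pattern HEADER = Pattern.compile(
        "^\"([^\"]+)\" Id=\\d+ in (?:TIMED_)?WAITING on lock=([\\w.$]+)@[0-9a-f]+");

    /** Returns lock class name -> names of the threads parked on it, sorted by class name. */
    static Map<String, List<String>> waitersByLockClass(String dump) {
        Map<String, List<String>> result = new TreeMap<>();
        for (String line : dump.split("\\R")) {
            Matcher m = HEADER.matcher(line.trim());
            if (m.find()) {
                result.computeIfAbsent(m.group(2), k -> new ArrayList<>()).add(m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String dump = String.join("\n",
            "\"pub-#1%DataGridServer-Development%\" Id=35 in WAITING on lock="
                + "org.apache.ignite.internal.processors.cache.distributed.near."
                + "GridNearTxFinishFuture@3e4dcc0c",
            "  at sun.misc.Unsafe.park(Native Method)",
            // Hypothetical thread name and Id: the real header was truncated in the dump.
            "\"worker-1\" Id=99 in WAITING on lock="
                + "org.apache.ignite.internal.processors.affinity."
                + "GridAffinityAssignmentCache$AffinityReadyFuture@14a5b2c7",
            "  at sun.misc.Unsafe.park(Native Method)");
        waitersByLockClass(dump).forEach((lock, threads) ->
            System.out.println(threads.size() + " thread(s) waiting on " + lock + ": " + threads));
    }
}
```

Running it over each node's dump would show at a glance which nodes are stuck on GridNearTxFinishFuture (rollback) and which on AffinityReadyFuture (topology).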

—
Denis

> On Sep 16, 2016, at 2:44 AM, yfernando <yo...@tudor.com> wrote:
> 
> Hi Denis,
> 
> We have been able to reproduce this situation where a node failure freezes
> the entire grid.
> 
> Please find the full thread dumps of the 5 nodes that are locked up.
> 
> The memoryMode of the caches are configured to be OFFHEAP_TIERED
> The cacheMode is PARTITIONED
> The atomicityMode is TRANSACTIONAL
> 
> We have also seen ALL the clients freeze during a FULL GC occurring on ANY
> single node.
> 
> Please let us know if you require any more information.
> 
> grid-tp1-dev-11220-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp1-dev-11220-201609141523318.txt>  
> grid-tp1-dev-11223-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp1-dev-11223-201609141523318.txt>  
> grid-tp3-dev-11220-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp3-dev-11220-201609141523318.txt>  
> grid-tp3-dev-11221-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp3-dev-11221-201609141523318.txt>  
> grid-tp4-dev-11220-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp4-dev-11220-201609141523318.txt>  