You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@ignite.apache.org by mmyoussef <mu...@teradata.com> on 2018/05/15 16:33:02 UTC

Ignite hang on a complex SQL join occurring on 2 server nodes

The scenario is 
1- Data is loaded from cassandra to a lookup cache -> replicated cache
2- Data is fetched from kafka input topic to a map
3- the map is inserted - using putall method - to another cache ->
partitioned cache 
4- a complex join is done between both caches
5- The result set is processed and inserted into layer 2 cache
6- Then processed again to a layer 3 cache then to kafka output topic

the jar is compiled and run on 2 different nodes. 

some records are fetched, processing runs on both nodes -results of the
first batch are not accurate- and then hangs and keeps logging this message
?!

10/05/2018 16:39:48  INFO [tcp-disco-stats-printer-#4%cluster-VEON%] 
TcpDiscoverySpi: Discovery SPI statistics [statistics=TcpDiscoveryStatistics 
[joinStartedTs=1525962688220, joinFinishedTs=1525962706508, 
crdSinceTs=1525962706498, joinedNodesCnt=1, failedNodesCnt=0,
leftNodesCnt=0, 
ackTimeoutsCnt=0, sockTimeoutsCnt=0, rcvdMsgs= 
{TcpDiscoveryMetricsUpdateMessage=474, TcpDiscoveryNodeAddedMessage=1, 
TcpDiscoveryNodeAddFinishedMessage=1, TcpDiscoveryJoinRequestMessage=2, 
TcpDiscoveryDiscardMessage=35, TcpDiscoveryConnectionCheckMessage=706, 
TcpDiscoveryCustomEventMessage=39}, procMsgs= 
{TcpDiscoveryMetricsUpdateMessage=715, TcpDiscoveryNodeAddedMessage=1, 
TcpDiscoveryNodeAddFinishedMessage=1, TcpDiscoveryJoinRequestMessage=2, 
TcpDiscoveryCustomEventMessage=53, TcpDiscoveryDiscardMessage=76}, 
avgMsgsSndTimes={TcpDiscoveryMetricsUpdateMessage=0, 
TcpDiscoveryNodeAddedMessage=10, TcpDiscoveryNodeAddFinishedMessage=0, 
TcpDiscoveryJoinRequestMessage=10, TcpDiscoveryConnectionCheckMessage=0, 
TcpDiscoveryDiscardMessage=0, TcpDiscoveryCustomEventMessage=0}, 
maxMsgsSndTimes={TcpDiscoveryMetricsUpdateMessage=10, 
TcpDiscoveryNodeAddedMessage=10, TcpDiscoveryNodeAddFinishedMessage=0, 
TcpDiscoveryJoinRequestMessage=10, TcpDiscoveryConnectionCheckMessage=42, 
TcpDiscoveryDiscardMessage=0, TcpDiscoveryCustomEventMessage=0}, sentMsgs= 
{TcpDiscoveryMetricsUpdateMessage=474, TcpDiscoveryNodeAddedMessage=1, 
TcpDiscoveryNodeAddFinishedMessage=1, TcpDiscoveryJoinRequestMessage=1, 
TcpDiscoveryConnectionCheckMessage=706, TcpDiscoveryDiscardMessage=35, 
TcpDiscoveryCustomEventMessage=33}, avgMsgsAckTimes= 
{TcpDiscoveryMetricsUpdateMessage=0, TcpDiscoveryNodeAddedMessage=10, 
TcpDiscoveryNodeAddFinishedMessage=0, TcpDiscoveryJoinRequestMessage=10, 
TcpDiscoveryConnectionCheckMessage=0, TcpDiscoveryDiscardMessage=0, 
TcpDiscoveryCustomEventMessage=0}, maxMsgsAckTimes= 
{TcpDiscoveryMetricsUpdateMessage=10, TcpDiscoveryNodeAddedMessage=10, 
TcpDiscoveryNodeAddFinishedMessage=0, TcpDiscoveryJoinRequestMessage=10, 
TcpDiscoveryConnectionCheckMessage=42, TcpDiscoveryDiscardMessage=0, 
TcpDiscoveryCustomEventMessage=0}, avgMsgQueueTime=0, maxMsgQueueTime=11, 
ringMsgsSent=35, avgRingMsgTime=0, maxRingMsgTime=121, 
maxRingTimeMsgCls=TcpDiscoveryNodeAddedMessage, avgMsgProcTime=0, 
maxMsgProcTime=91, maxProcTimeMsgCls=TcpDiscoveryJoinRequestMessage, 
sockReadersCreated=3, sockReadersRmv=2, avgSrvSockInitTime=3, 
maxSrvSockInitTime=10, clientSockCreatedCnt=2, avgClientSockInitTime=15, 
maxClientSockInitTime=21, pendingMsgsRegistered=35, pendingMsgsDiscarded=0], 
spiState=CONNECTED, coord=TcpDiscoveryNode [id=8a03f4a3-a631-42e7-bdf2- 
b062e1e977b6, addrs=[0:0:0:0:0:0:0:1, 10.19.1.14, 127.0.0.1, 172.16.44.139, 
172.23.48.1], sockAddrs=[WEGMM250132-O5F.TD.TERADATA.COM/10.19.1.14:47501, 
/172.16.44.139:47501, /172.23.48.1:47501, /0:0:0:0:0:0:0:1:47501, 
/127.0.0.1:47501], discPort=47501, order=1, intOrder=1, 
lastExchangeTime=1525962688210, loc=true, ver=2.0.0#20170430-sha1:d4eef3c6, 
isClient=false], next=TcpDiscoveryNode [id=ee98d2ff-40fe-4981-8a59- 
1b12b478bc11, addrs=[0:0:0:0:0:0:0:1, 10.19.1.14, 127.0.0.1, 172.16.44.139, 
172.23.48.1], sockAddrs=[WEGMM250132-O5F.TD.TERADATA.COM/10.19.1.14:47500, 
/172.23.48.1:47500, /0:0:0:0:0:0:0:1:47500, /127.0.0.1:47500, 
/172.16.44.139:47500], discPort=47500, order=2, intOrder=2, 
lastExchangeTime=1525962714266, loc=false, ver=2.0.0#20170430-sha1:d4eef3c6, 
isClient=false], intOrder=1, topSize=2, leavingNodesSize=0, 
failedNodesSize=0, 
joiningNodesSize=0, pendingCustomMsgs=0, msgWorker.queue.size=0, clients=0, 
clientWorkers=0, lastUpdate=05/10/2018 16:39:47, heapFree=5727M, 
heapTotal=5888M]



some times I get this message too : 
[GridCachePartitionExchangeManager] Failed to wait for initial partition map
exchange. Possible reasons are: 
  ^-- Transactions in deadlock. 
  ^-- Long running transactions (ignore if this is the case). 
  ^-- Unreleased explicit locks. 

How can I solve this in your opinions?



--
Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/

Re: Ignite hang on a complex SQL join occurring on 2 server nodes

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

* The message that you provided most likely means there's a prolonged
deadlock or some other severe problem that made your cluster inoperational.
* You are using putAll which is known to cause deadlocks if used in
parallel and maps in question are not sorted.
* Is the map that you're passing to putAll sorted? (e.g. TreeMap)

Regards,

-- 
Ilya Kasnacheev

2018-05-15 19:33 GMT+03:00 mmyoussef <mu...@teradata.com>:

> The scenario is
> 1- Data is loaded from cassandra to a lookup cache -> replicated cache
> 2- Data is fetched from kafka input topic to a map
> 3- the map is inserted - using putall method - to another cache ->
> partitioned cache
> 4- a complex join is done between both caches
> 5- The result set is processed and inserted into layer 2 cache
> 6- Then processed again to a layer 3 cache then to kafka output topic
>
> the jar is compiled and run on 2 different nodes.
>
> some records are fetched, processing runs on both nodes -results of the
> first batch are not accurate- and then hangs and keeps logging this message
> ?!
>
> 10/05/2018 16:39:48  INFO [tcp-disco-stats-printer-#4%cluster-VEON%]
> TcpDiscoverySpi: Discovery SPI statistics [statistics=TcpDiscoveryStatistics
>
> [joinStartedTs=1525962688220, joinFinishedTs=1525962706508,
> crdSinceTs=1525962706498, joinedNodesCnt=1, failedNodesCnt=0,
> leftNodesCnt=0,
> ackTimeoutsCnt=0, sockTimeoutsCnt=0, rcvdMsgs=
> {TcpDiscoveryMetricsUpdateMessage=474, TcpDiscoveryNodeAddedMessage=1,
> TcpDiscoveryNodeAddFinishedMessage=1, TcpDiscoveryJoinRequestMessage=2,
> TcpDiscoveryDiscardMessage=35, TcpDiscoveryConnectionCheckMessage=706,
> TcpDiscoveryCustomEventMessage=39}, procMsgs=
> {TcpDiscoveryMetricsUpdateMessage=715, TcpDiscoveryNodeAddedMessage=1,
> TcpDiscoveryNodeAddFinishedMessage=1, TcpDiscoveryJoinRequestMessage=2,
> TcpDiscoveryCustomEventMessage=53, TcpDiscoveryDiscardMessage=76},
> avgMsgsSndTimes={TcpDiscoveryMetricsUpdateMessage=0,
> TcpDiscoveryNodeAddedMessage=10, TcpDiscoveryNodeAddFinishedMessage=0,
> TcpDiscoveryJoinRequestMessage=10, TcpDiscoveryConnectionCheckMessage=0,
> TcpDiscoveryDiscardMessage=0, TcpDiscoveryCustomEventMessage=0},
> maxMsgsSndTimes={TcpDiscoveryMetricsUpdateMessage=10,
> TcpDiscoveryNodeAddedMessage=10, TcpDiscoveryNodeAddFinishedMessage=0,
> TcpDiscoveryJoinRequestMessage=10, TcpDiscoveryConnectionCheckMessage=42,
> TcpDiscoveryDiscardMessage=0, TcpDiscoveryCustomEventMessage=0},
> sentMsgs=
> {TcpDiscoveryMetricsUpdateMessage=474, TcpDiscoveryNodeAddedMessage=1,
> TcpDiscoveryNodeAddFinishedMessage=1, TcpDiscoveryJoinRequestMessage=1,
> TcpDiscoveryConnectionCheckMessage=706, TcpDiscoveryDiscardMessage=35,
> TcpDiscoveryCustomEventMessage=33}, avgMsgsAckTimes=
> {TcpDiscoveryMetricsUpdateMessage=0, TcpDiscoveryNodeAddedMessage=10,
> TcpDiscoveryNodeAddFinishedMessage=0, TcpDiscoveryJoinRequestMessage=10,
> TcpDiscoveryConnectionCheckMessage=0, TcpDiscoveryDiscardMessage=0,
> TcpDiscoveryCustomEventMessage=0}, maxMsgsAckTimes=
> {TcpDiscoveryMetricsUpdateMessage=10, TcpDiscoveryNodeAddedMessage=10,
> TcpDiscoveryNodeAddFinishedMessage=0, TcpDiscoveryJoinRequestMessage=10,
> TcpDiscoveryConnectionCheckMessage=42, TcpDiscoveryDiscardMessage=0,
> TcpDiscoveryCustomEventMessage=0}, avgMsgQueueTime=0, maxMsgQueueTime=11,
> ringMsgsSent=35, avgRingMsgTime=0, maxRingMsgTime=121,
> maxRingTimeMsgCls=TcpDiscoveryNodeAddedMessage, avgMsgProcTime=0,
> maxMsgProcTime=91, maxProcTimeMsgCls=TcpDiscoveryJoinRequestMessage,
> sockReadersCreated=3, sockReadersRmv=2, avgSrvSockInitTime=3,
> maxSrvSockInitTime=10, clientSockCreatedCnt=2, avgClientSockInitTime=15,
> maxClientSockInitTime=21, pendingMsgsRegistered=35,
> pendingMsgsDiscarded=0],
> spiState=CONNECTED, coord=TcpDiscoveryNode [id=8a03f4a3-a631-42e7-bdf2-
> b062e1e977b6, addrs=[0:0:0:0:0:0:0:1, 10.19.1.14, 127.0.0.1,
> 172.16.44.139,
> 172.23.48.1], sockAddrs=[WEGMM250132-O5F.TD.TERADATA.COM/10.19.1.14:47501,
>
> /172.16.44.139:47501, /172.23.48.1:47501, /0:0:0:0:0:0:0:1:47501,
> /127.0.0.1:47501], discPort=47501, order=1, intOrder=1,
> lastExchangeTime=1525962688210, loc=true, ver=2.0.0#20170430-sha1:d4eef3c6,
>
> isClient=false], next=TcpDiscoveryNode [id=ee98d2ff-40fe-4981-8a59-
> 1b12b478bc11, addrs=[0:0:0:0:0:0:0:1, 10.19.1.14, 127.0.0.1,
> 172.16.44.139,
> 172.23.48.1], sockAddrs=[WEGMM250132-O5F.TD.TERADATA.COM/10.19.1.14:47500,
>
> /172.23.48.1:47500, /0:0:0:0:0:0:0:1:47500, /127.0.0.1:47500,
> /172.16.44.139:47500], discPort=47500, order=2, intOrder=2,
> lastExchangeTime=1525962714266, loc=false, ver=2.0.0#20170430-sha1:d4eef3c6,
>
> isClient=false], intOrder=1, topSize=2, leavingNodesSize=0,
> failedNodesSize=0,
> joiningNodesSize=0, pendingCustomMsgs=0, msgWorker.queue.size=0,
> clients=0,
> clientWorkers=0, lastUpdate=05/10/2018 16:39:47, heapFree=5727M,
> heapTotal=5888M]
>
>
>
> some times I get this message too :
> [GridCachePartitionExchangeManager] Failed to wait for initial partition
> map
> exchange. Possible reasons are:
>   ^-- Transactions in deadlock.
>   ^-- Long running transactions (ignore if this is the case).
>   ^-- Unreleased explicit locks.
>
> How can I solve this in your opinions?
>
>
>
> --
> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
>