Posted to user@ignite.apache.org by jpmoore40 <jo...@ca-cib.com> on 2017/05/02 14:17:37 UTC

Nodes hanging when accessing queue

Hi,

I've been having a couple of issues with a shared IgniteQueue which seem to
occur sporadically. I have a single publisher process which pushes data onto
a queue, and several worker processes that pull the data from the queue and
perform some processing. However, these nodes are part of a larger grid
containing other nodes that submit and process IgniteCallables. I partition
these using an environment variable clusterName=xxx so the IgniteCallables
are only computed on a particular cluster group. This seemed like the best
way of doing things as I am using the TcpDiscoveryJdbcIpFinder and didn't
want to set up a different database for each discrete grid.
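
To make the setup a bit more concrete, here is a rough sketch of the kind of
configuration I'm describing - the attribute handling, data source, element
type and callable are simplified placeholders rather than my actual code:

import java.util.Collections;
import javax.sql.DataSource;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteQueue;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterGroup;
import org.apache.ignite.configuration.CollectionConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.jdbc.TcpDiscoveryJdbcIpFinder;

public class PublisherSketch {
    public static Ignite startNode(DataSource ds) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Tag the node so compute jobs can be routed to one logical cluster.
        cfg.setUserAttributes(Collections.singletonMap("clusterName", "xxx"));

        // Single shared database for discovery across all logical grids.
        TcpDiscoveryJdbcIpFinder ipFinder = new TcpDiscoveryJdbcIpFinder();
        ipFinder.setDataSource(ds);

        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        discoSpi.setIpFinder(ipFinder);
        cfg.setDiscoverySpi(discoSpi);

        return Ignition.start(cfg);
    }

    public static void publish(Ignite ignite, String payload) {
        // Workers call ignite.queue() with the same name and take() from it.
        CollectionConfiguration colCfg = new CollectionConfiguration();
        IgniteQueue<String> queue = ignite.queue("cross-fx-curves", 0, colCfg);
        queue.put(payload);
    }

    public static void runOnCluster(Ignite ignite) {
        // Restrict an IgniteCallable to the nodes carrying the matching attribute.
        ClusterGroup grp = ignite.cluster().forAttribute("clusterName", "xxx");
        String result = ignite.compute(grp).call(() -> "done");
        System.out.println(result);
    }
}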

Several times I have found that the publishers and workers accessing the
IgniteQueue stop processing and there seem to be two separate problems
occurring.

The first was that I would get an exception such as the following when a
node was stopped:

java.lang.IllegalStateException: Cache has been stopped: datastructures_0
      at
org.apache.ignite.internal.processors.cache.GridCacheGateway.checkStatus(GridCacheGateway.java:85)
      at
org.apache.ignite.internal.processors.cache.GridCacheGateway.enter(GridCacheGateway.java:68)
      at
org.apache.ignite.internal.processors.datastructures.GridCacheQueueProxy.contains(GridCacheQueueProxy.java:160)
      at
my.com.CrossFxCurvePubService.addToQueue(CrossFxCurvePubService.java:267)
      ...

I think I solved this (i.e. it hasn't happened since) by ensuring that the
CollectionConfiguration was initialised with backups, though if anyone can
confirm that, it would be helpful.
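
For reference, roughly what I mean by initialising the CollectionConfiguration
with backups - the backup count of 1 and the element type are just examples,
not my real settings:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteQueue;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CollectionConfiguration;

public final class QueueConfigSketch {
    static IgniteQueue<Object> backedUpQueue(Ignite ignite) {
        CollectionConfiguration colCfg = new CollectionConfiguration();
        colCfg.setBackups(1);                       // keep one backup of each queue partition
        colCfg.setCacheMode(CacheMode.PARTITIONED); // the default, shown for clarity

        // 0 = unbounded; the queue is created on first call and reused afterwards.
        return ignite.queue("cross-fx-curves", 0, colCfg);
    }
}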

However the second problem (which also causes the queue publisher and
workers to stop processing) is accompanied by repeated blocks of messages
such as the following:

2017-04-28 14:08:05,468 WARN  [grid-nio-worker-2-#11%null%] java.JavaLogger
(JavaLogger.java:278) - Failed to process selector key (will close):
GridSelectorNioSessionImpl [selectorIdx=2, queueSize=0,
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
recovery=GridNioRecoveryDescriptor [acked=9, resendCnt=0, rcvCnt=8,
sentCnt=9, reserved=true, lastAck=8, nodeLeft=false, node=TcpDiscoveryNode
[id=c91ce074-964e-4497-ac77-a3828b301ed3, addrs=[0:0:0:0:0:0:0:1,
10.127.197.150, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:0,
/10.127.197.150:0, /127.0.0.1:0], discPort=0, order=161, intOrder=84,
lastExchangeTime=1493384810687, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe,
isClient=true], connected=true, connectCnt=1, queueLimit=5120,
reserveCnt=1], super=GridNioSessionImpl [locAddr=/10.127.246.164:60985,
rmtAddr=/10.127.197.150:47100, createTime=1493384812272, closeTime=0,
bytesSent=73469, bytesRcvd=1053, sndSchedTime=1493384869270,
lastSndTime=1493384831058, lastRcvTime=1493384869270, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@1b4d47c, directMode=true],
GridConnectionBytesVerifyFilter], accepted=false]]

2017-04-28 14:08:05,470 WARN  [grid-nio-worker-2-#11%null%] java.JavaLogger
(JavaLogger.java:278) - Closing NIO session because of unhandled exception
[cls=class o.a.i.i.util.nio.GridNioException, msg=An existing connection was
forcibly closed by the remote host]

2017-04-28 14:08:14,279 WARN  [disco-event-worker-#20%null%] java.JavaLogger
(JavaLogger.java:278) - Node FAILED: TcpDiscoveryNode
[id=c91ce074-964e-4497-ac77-a3828b301ed3, addrs=[0:0:0:0:0:0:0:1,
10.127.197.150, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:0,
/10.127.197.150:0, /127.0.0.1:0], discPort=0, order=161, intOrder=84,
lastExchangeTime=1493384810687, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe,
isClient=true]
2017-04-28 14:08:14,287 INFO  [disco-event-worker-#20%null%] java.JavaLogger
(JavaLogger.java:273) - Topology snapshot [ver=162, servers=6, clients=0,
CPUs=24, heap=3.5GB]
2017-04-28 14:08:14,295 INFO  [exchange-worker-#24%null%] java.JavaLogger
(JavaLogger.java:273) - Skipping rebalancing (nothing scheduled)
[top=AffinityTopologyVersion [topVer=162, minorTopVer=0], evt=NODE_FAILED,
node=c91ce074-964e-4497-ac77-a3828b301ed3]

2017-04-28 14:08:35,853 WARN  [grid-timeout-worker-#7%null%] java.JavaLogger
(JavaLogger.java:278) - Found long running cache future
[startTime=14:06:52.182, curTime=14:08:35.828,
fut=GridPartitionedSingleGetFuture [topVer=AffinityTopologyVersion
[topVer=161, minorTopVer=0], key=UserKeyCacheObjectImpl
[val=GridCacheQueueItemKey
[queueId=9c0396aab51-f5c26da7-4123-4ba7-aa40-857ccd042342,
queueName=cross-fx-curves, idx=519195], hasValBytes=true], readThrough=true,
forcePrimary=false, futId=c8ca4a5cb51-f5c26da7-4123-4ba7-aa40-857ccd042342,
trackable=true, subjId=efe9e46d-6dbd-4ca1-b7fb-7ace46d37571, taskName=null,
deserializeBinary=true, skipVals=false, expiryPlc=null, canRemap=true,
needVer=false, keepCacheObjects=false, node=TcpDiscoveryNode
[id=1bbf12c4-74b7-490d-b3fc-d8b3ef713ac0, addrs=[0:0:0:0:0:0:0:1,
10.127.246.74, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:47500,
/127.0.0.1:47500, LNAPLN343PRD.ldn.emea.cib/10.127.246.74:47500],
discPort=47500, order=159, intOrder=83, lastExchangeTime=1493213394382,
loc=false, ver=1.8.0#20161205-sha1:9ca40dbe, isClient=false]]]

When this happens the only solution seems to be to restart all nodes on the
grid. The key to this, I think, is the "long running cache future", as this is
accessing my queue (named cross-fx-curves), but I've no idea what this is
doing or why it should be stuck. These events do seem to coincide with
restarts of nodes on the grid, but I have been unable to reproduce the issue
- I have tried killing each of the nodes individually with no impact on the
rest of the grid.

Can anyone provide any feedback on what could be causing this and how best
to rectify it?

Thanks 
Jon




Re: Nodes hanging when accessing queue

Posted by jpmoore40 <jo...@ca-cib.com>.
That's fine, thanks Andrew - I was just curious. I will use preferIPv4.

Thanks
Jon




Re: Nodes hanging when accessing queue

Posted by Andrey Mashenkov <an...@gmail.com>.
Hi,

There is a known issue: sometimes one of the nodes uses IPv6 for communication,
which results in the whole grid hanging with a series of similar messages in
the logs. Yes, it should be applied to both client and server nodes. There is
a commented-out line with the preferIPv4 option in ignite.sh.

Would you please share the full logs if preferIPv4 does not help?
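
For illustration only (the class name and config path below are placeholders);
the command-line flag is the usual way, the programmatic variant is just a
fallback:

import org.apache.ignite.Ignition;

public class NodeStartup {
    public static void main(String[] args) {
        // Usual approach: add the flag to JVM_OPTS (the commented line in
        // ignite.sh):  -Djava.net.preferIPv4Stack=true
        // The programmatic variant below only takes effect if it runs before
        // any java.net classes are initialised in this JVM.
        System.setProperty("java.net.preferIPv4Stack", "true");

        Ignition.start("config/my-ignite.xml"); // config path is a placeholder
    }
}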


On Fri, May 5, 2017 at 10:55 AM, jpmoore40 <jo...@ca-cib.com>
wrote:

> Thanks for the reply Andrew - I will look into this.
>
> Out of interest what in the logs gave away the IPv6? I can't spot it. Also
> I
> presume this applies to nodes both in client and non-client mode, as I use
> a
> GUI client to interrogate the current grid state in client mode?
>



-- 
Best regards,
Andrey V. Mashenkov

Re: Nodes hanging when accessing queue

Posted by jpmoore40 <jo...@ca-cib.com>.
Thanks for the reply Andrew - I will look into this.

Out of interest, what in the logs gave away the IPv6? I can't spot it. Also I
presume this applies to nodes in both client and non-client mode, as I use a
GUI client to interrogate the current grid state in client mode?






Re: Nodes hanging when accessing queue

Posted by Andrey Mashenkov <an...@gmail.com>.
Hi,

1. Yes, adding backups for the datastructures cache (via AtomicConfiguration)
should help.

2. It seems one or more of your nodes registers with an IPv6 address. Try
using a JVM option on all nodes: either -Djava.net.preferIPv4Stack=true or
-Djava.net.preferIPv6Stack=true.
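
A rough sketch of point 1 (the backup count of 1 is just an example, not a
recommendation for your workload):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.AtomicConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class BackedUpDataStructures {
    public static Ignite start() {
        AtomicConfiguration atomicCfg = new AtomicConfiguration();
        atomicCfg.setBackups(1);                       // illustrative backup count
        atomicCfg.setCacheMode(CacheMode.PARTITIONED);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setAtomicConfiguration(atomicCfg);

        return Ignition.start(cfg);
    }
}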


-- 
Best regards,
Andrey V. Mashenkov