Posted to user@ignite.apache.org by Juan Barani <Ju...@unitedplanet.com> on 2017/07/10 09:34:07 UTC

TcpDiscoverySpi worker thread failed with assertion error

Hi everyone,

We are testing a new application running Ignite 2.0 with 3 nodes on the Azure platform.

We left the cluster running for a couple of days without load, and one of the nodes randomly stopped with the following stack trace:

INFO  2017-07-07T23:07:28,475 - org.apache.ignite.internal.IgniteKernal%IntrexxDataGrid[grid-timeout-worker-#15%IntrexxDataGrid%]
  FreeList [name=IntrexxDataGrid, buckets=256, dataPages=3, reusePages=0]
INFO  2017-07-07T23:07:28,475 - org.apache.ignite.internal.IgniteKernal%IntrexxDataGrid[grid-timeout-worker-#15%IntrexxDataGrid%]
  FreeList [name=IntrexxDataGrid, buckets=256, dataPages=9, reusePages=0]
ERROR 2017-07-07T23:08:10,632 - org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi[tcp-disco-msg-worker-#2%IntrexxDataGrid%]
  TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node in order to prevent cluster wide instability.
java.lang.AssertionError: -21
    at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryStatistics.onMessageSent(TcpDiscoveryStatistics.java:317) ~[ignite-core-2.0.0.jar:2.0.0]
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.sendMessageAcrossRing(ServerImpl.java:2980) ~[ignite-core-2.0.0.jar:2.0.0]
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMetricsUpdateMessage(ServerImpl.java:5001) ~[ignite-core-2.0.0.jar:2.0.0]
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2562) ~[ignite-core-2.0.0.jar:2.0.0]
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2364) ~[ignite-core-2.0.0.jar:2.0.0]
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6460) ~[ignite-core-2.0.0.jar:2.0.0]
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2450) [ignite-core-2.0.0.jar:2.0.0]
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) [ignite-core-2.0.0.jar:2.0.0]
ERROR 2017-07-07T23:08:10,661 - org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi[tcp-disco-msg-worker-#2%IntrexxDataGrid%]
  Runtime error caught during grid runnable execution: IgniteSpiThread [name=tcp-disco-msg-worker-#2%IntrexxDataGrid%]
java.lang.AssertionError: -21
    at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryStatistics.onMessageSent(TcpDiscoveryStatistics.java:317) ~[ignite-core-2.0.0.jar:2.0.0]
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.sendMessageAcrossRing(ServerImpl.java:2980) ~[ignite-core-2.0.0.jar:2.0.0]
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMetricsUpdateMessage(ServerImpl.java:5001) ~[ignite-core-2.0.0.jar:2.0.0]
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2562) ~[ignite-core-2.0.0.jar:2.0.0]
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2364) ~[ignite-core-2.0.0.jar:2.0.0]
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6460) ~[ignite-core-2.0.0.jar:2.0.0]
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2450) ~[ignite-core-2.0.0.jar:2.0.0]
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

I researched online, and it seems the thread fails when there is a long GC pause (>10 s) or when memory is running low; in those cases an InterruptedException also appears.

However, that does not seem to be the case here, since we have enough memory and no load whatsoever. What strikes me most is the assertion error:

  TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node in order to prevent cluster wide instability.
java.lang.AssertionError: -21

This is an assertion in the method onMessageSent:

assert time >= 0 : time;

which in turn is reached from this call (ServerImpl.java:2980):

spi.stats.onMessageSent(msg, tstamp0 - tstamp);

tstamp0 and tstamp are both values returned from U.currentTimeMillis(), which, if I understand it correctly, should return an ever-increasing number (sampled at roughly 10-second intervals).

Hence I cannot figure out why the asserted value is negative.
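For what it's worth, the failure mode can be reproduced in isolation. Here is a minimal sketch (my own illustration, not Ignite's actual code) of how a "later sample minus earlier sample" subtraction on the wall clock goes negative when the clock is stepped backwards (for example by NTP) between the two samples:

```java
// Minimal sketch (not Ignite's code): elapsed time computed from two
// wall-clock samples goes negative if the OS clock is stepped back
// between the samples. The subtraction mirrors the tstamp0 - tstamp
// pattern at the Ignite call site.
public class WallClockSkew {

    /** Same shape as the call site: second sample minus first sample. */
    static long elapsedMillis(long tstamp, long tstamp0) {
        return tstamp0 - tstamp;
    }

    public static void main(String[] args) {
        long tstamp  = 1_499_468_890_632L; // first System.currentTimeMillis() sample
        long tstamp0 = tstamp - 21;        // second sample after a -21 ms clock step

        // With assertions enabled, "assert time >= 0 : time" would fail here.
        System.out.println(elapsedMillis(tstamp, tstamp0)); // prints -21
    }
}
```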

Can someone give us a hint what we should do?

Thanks!

Juan





RE: TcpDiscoverySpi worker thread failed with assertion error

Posted by Juan Barani <Ju...@unitedplanet.com>.
Hi Slava,

Thank you very much for your answer.

Juan


From: Вячеслав Коптилин [mailto:slava.koptilin@gmail.com]
Sent: Monday, July 10, 2017 12:21 PM
To: user@ignite.apache.org
Subject: Re: TcpDiscoverySpi worker thread failed with assertion error

Hi Juan,

U.currentTimeMillis(), which is based on System.currentTimeMillis(), is not monotonic,
so there is no guarantee that returned values will always be greater than all previous calls.

please see the following ticket: https://issues.apache.org/jira/browse/IGNITE-5562

Thanks,
Slava.



Re: TcpDiscoverySpi worker thread failed with assertion error

Posted by Вячеслав Коптилин <sl...@gmail.com>.
Hi Juan,

U.currentTimeMillis(), which is based on System.currentTimeMillis(), is not monotonic, so there is no guarantee that returned values will always be greater than all previous calls.

Please see the following ticket: https://issues.apache.org/jira/browse/IGNITE-5562

Thanks,
Slava.
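The usual fix pattern for this class of bug is to measure durations with the monotonic System.nanoTime() instead of the wall clock. A sketch of that pattern (my own illustration, not the actual IGNITE-5562 patch):

```java
// Sketch of the common fix: durations are taken from System.nanoTime(),
// which is monotonic within a JVM, so NTP steps cannot produce a
// negative elapsed time.
import java.util.concurrent.TimeUnit;

public class MonotonicElapsed {

    /** Times a task using the monotonic clock; the result is never negative. */
    public static long timeMillis(Runnable task) {
        long start = System.nanoTime();              // unaffected by wall-clock steps
        task.run();
        long elapsedNanos = System.nanoTime() - start;
        return TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
    }

    public static void main(String[] args) {
        long ms = timeMillis(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i; // some work to time
        });
        System.out.println("elapsed ms: " + ms);
    }
}
```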


RE: TcpDiscoverySpi worker thread failed with assertion error

Posted by Juan Barani <Ju...@unitedplanet.com>.
Hi,

One more thing: if we disable the assertions, would there be any direct impact besides the metrics?
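For reference, assertions are controlled per JVM with the -ea/-da switches, and can be scoped to a package and its subpackages. A hedged configuration sketch (the jar name and main class below are placeholders, not from this thread):

```shell
# Hypothetical launch command: keep assertions enabled everywhere
# except the Ignite discovery package ("..." means package and subpackages).
java -ea \
     -da:org.apache.ignite.spi.discovery... \
     -cp your-app.jar:ignite-core-2.0.0.jar \
     com.example.Main
```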

Thanks

Juan


From: Juan Barani [mailto:Juan.Barani@unitedplanet.com]
Sent: Monday, July 10, 2017 11:44 AM
To: user@ignite.apache.org
Subject: RE: TcpDiscoverySpi worker thread failed with assertion error

Just to add something else to the scenario: can it happen that, due to NTP adjustments on the servers, System.currentTimeMillis() returns a smaller value than before? Could it be that this problem would not appear if System.nanoTime() were used?

Thanks!

Juan




Re: TcpDiscoverySpi worker thread failed with assertion error

Posted by Johannes Lichtenberger <jo...@unitedplanet.com>.
On 10.07.2017 11:44, Juan Barani wrote:
>
> Just to add something else to the scenario, can it happen that due to 
> NTP adjustments in the servers then System.currentTimeMillis() returns 
> a smaller value than before? May it be that using System.nanoTime() 
> this problem does not appear?
>
I guess nanoTime() should indeed be used (it is not related to system or wall-clock time within a VM):

    "Returns the current value of the running Java Virtual Machine's
    high-resolution time source, in nanoseconds.

    This method can only be used to measure elapsed time and is not
    related to any other notion of system or wall-clock time. The value
    returned represents nanoseconds since some fixed but arbitrary
    /origin/ time (perhaps in the future, so values may be negative).
    The same origin is used by all invocations of this method in an
    instance of a Java virtual machine; other virtual machine instances
    are likely to use a different origin.

    This method provides nanosecond precision, but not necessarily
    nanosecond resolution (that is, how frequently the value changes) -
    no guarantees are made except that the resolution is at least as
    good as that of |currentTimeMillis()|
    <http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#currentTimeMillis%28%29>.

    Differences in successive calls that span greater than approximately
    292 years (2^63 nanoseconds) will not correctly compute elapsed time
    due to numerical overflow.

    The values returned by this method become meaningful only when the
    difference between two such values, obtained within the same
    instance of a Java virtual machine, is computed."
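A small sketch of the "differences only" semantics described above (my own illustration): a raw nanoTime() reading is meaningless on its own, but the difference of two readings from the same JVM is a valid elapsed duration.

```java
// Sketch: System.nanoTime() has an arbitrary origin (a raw value may even
// be negative); only the difference of two samples taken in the same JVM
// is meaningful, and that difference is a non-negative elapsed duration.
public class NanoTimeDiff {
    public static void main(String[] args) throws InterruptedException {
        long t0 = System.nanoTime();
        Thread.sleep(10);                        // ~10 ms of "work"
        long t1 = System.nanoTime();
        long elapsedMs = (t1 - t0) / 1_000_000;  // nanoseconds -> milliseconds
        System.out.println("elapsed ms: " + elapsedMs);
    }
}
```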


RE: TcpDiscoverySpi worker thread failed with assertion error

Posted by Juan Barani <Ju...@unitedplanet.com>.
Just to add something else to the scenario: can it happen that, due to NTP adjustments on the servers, System.currentTimeMillis() returns a smaller value than before? Could it be that this problem would not appear if System.nanoTime() were used?

Thanks!

Juan

