
Posted to user@ignite.apache.org by tarunk <ta...@gmail.com> on 2021/05/21 12:47:52 UTC

Nodes taking 5 minutes to reattempt joining grid when failed to join

Hi Team,

We are seeing quite a long delay before a node reattempts to join the grid
after it fails to join on the first attempt.
From the logs it appears to retry after 5 minutes; in our case it failed
3 times and then joined after 15 minutes. We have set networkTimeout
to 10000 but are not sure whether anything else was changed from the
defaults. Can you please suggest whether we can reduce this retry interval?

Below is the error we see in the stack trace from
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi

Node has not been connected to topology and will repeat join process. Check
remote nodes logs for possible error messages. Note that large topology may
require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout'
configuration property if getting this message on the starting nodes
[networkTimeout=10000]

The screenshot below suggests 3 attempts at ~5 minute intervals, with the
node finally joining the grid after ~15 minutes.
<http://apache-ignite-users.70518.x6.nabble.com/file/t2364/ignite-error.jpg> 

Thanks
Tarun



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Nodes taking 5 minutes to reattempt joining grid when failed to join

Posted by andrei <ae...@gmail.com>.

Hi,

Do you use SSL?

If yes, then Ignite will keep trying to create an SSL connection until the 
ExponentialBackoffTimeoutStrategy.totalTimeout is exceeded.

During these attempts, other threads can be blocked because 
communication is not able to send the message over SSL.
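
To illustrate why retrying until a total timeout is exceeded can add up to a long delay, here is a minimal generic sketch of exponential backoff bounded by a total budget. It is not Ignite's ExponentialBackoffTimeoutStrategy implementation; the class name, method name and the numbers used are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not Ignite's ExponentialBackoffTimeoutStrategy. */
public class BackoffSketch {
    /**
     * Returns the per-attempt timeouts (in ms) that fit into totalTimeoutMs,
     * doubling from startMs and capping each single attempt at maxMs.
     */
    public static List<Long> attemptTimeouts(long startMs, long maxMs, long totalTimeoutMs) {
        if (startMs <= 0 || maxMs < startMs) {
            throw new IllegalArgumentException("expected 0 < startMs <= maxMs");
        }
        List<Long> timeouts = new ArrayList<>();
        long current = startMs;
        long spent = 0;
        while (spent < totalTimeoutMs) {
            // The last attempt is shortened so the total budget is never exceeded.
            long next = Math.min(current, totalTimeoutMs - spent);
            timeouts.add(next);
            spent += next;
            // Double the timeout for the following attempt, up to the cap.
            current = Math.min(current * 2, maxMs);
        }
        return timeouts;
    }

    public static void main(String[] args) {
        // Assumed example values, not Ignite defaults.
        System.out.println(attemptTimeouts(500, 5_000, 20_000));
    }
}
```

With a scheme like this, a connection that never succeeds consumes the whole total timeout before the caller gives up, however short the first attempt was.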

I have seen the same behavior before, when TLS 1.3 was used and the hosts 
were running different versions of Java.

The problem was related to the following JDK issue:

https://bugs.openjdk.java.net/browse/JDK-8208526

The recommendations are:

1) Set the correct TLS version for every node via 
-Djdk.tls.client.protocols=TLSv1.2 and -Dhttps.protocols=TLSv1.2
2) Check the versions of Java used. They should be the same for every 
node (servers and clients)

Please note that if you do not set the TLS version explicitly via 
properties, then a default version will be used. For some Java 
versions, it may be TLS 1.3.
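
One way to see what a given node would use by default is to print the Java version, the two properties mentioned above, and the protocols enabled on the JVM's default SSLContext. This is only a small diagnostic sketch; the class and method names are made up for the example. It assumes the node relies on JVM defaults, so if Ignite is configured with its own SSL context factory the protocols actually used may differ.

```java
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

/** Hypothetical diagnostic helper; run it with the same JVM and -D flags as the node. */
public class TlsDefaultsCheck {
    /** Returns the value of a system property, or "(not set)" if it is absent. */
    public static String propertyOrUnset(String name) {
        String value = System.getProperty(name);
        return value == null ? "(not set)" : value;
    }

    /** Protocols enabled by default for sockets created from the JVM's default SSLContext. */
    public static String[] defaultProtocols() {
        try {
            SSLParameters params = SSLContext.getDefault().getDefaultSSLParameters();
            return params.getProtocols();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("Default SSLContext is not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + propertyOrUnset("java.version"));
        System.out.println("jdk.tls.client.protocols = " + propertyOrUnset("jdk.tls.client.protocols"));
        System.out.println("https.protocols = " + propertyOrUnset("https.protocols"));
        System.out.println("default protocols = " + Arrays.toString(defaultProtocols()));
    }
}
```

Running this on every server and client host and comparing the output should show quickly whether the Java versions or the default TLS protocols differ between nodes.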

Your case looks very similar to the situation described above. It 
is possible that some client was running with a newer version of Java, 
while your server node was started with the expected one.

BR,
Andrei
