Posted to user@ignite.apache.org by bintisepaha <bi...@tudor.com> on 2016/04/21 20:16:28 UTC

Client fails to connect - joinTimeout vs networkTimeout

We are running into production issues with some clients unable to connect to
the grid (16 server nodes running on Linux). The error is:

Caused by: class org.apache.ignite.spi.IgniteSpiException: Join process
timed out, did not receive response for join request (consider increasing
'joinTimeout' configuration property) [joinTimeout=5000, sock=null]
       at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1334)
       at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)


DiscoverySpi has these settings
                <property name="joinTimeout" value="5000"/>
                <property name="ackTimeout" value="5000"/>
                <property name="maxAckTimeout" value="30000"/>            
                <property name="reconnectCount" value="5"/>

When the clients got this error, we tried increasing the timeout to
30 seconds and even 50 seconds, but new client connections from some Windows
machines still would not go through. We read
http://apache-ignite-users.70518.x6.nabble.com/Help-with-tuning-for-larger-clusters-td1692.html
and got rid of joinTimeout and started using networkTimeout instead. It seems
to be working this way so far (we have not yet pushed it to production).

When we specify joinTimeout along with networkTimeout, we still cannot
connect.

Question 1) What is the difference between these two settings, joinTimeout
and networkTimeout?
Question 2) Without a joinTimeout in the test environment, if the cluster is
down the client hangs forever (because joinTimeout is infinite). How do we
make sure that the client still proceeds even if it could not connect to the
cluster? We need clients to proceed in the testing environment even if the
grid is down.
Question 3) Both clients and servers are using TcpDiscoverySpi - is that
right? Should we be using TcpCommunicationSpi instead?

Thanks,
Binti




--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Client-fails-to-connect-joinTimeout-vs-networkTimeout-tp4419.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: Client fails to connect - joinTimeout vs networkTimeout

Posted by vkulichenko <va...@gmail.com>.
Caches always use affinity; it defines how the data is distributed across
nodes. If you don't explicitly provide it in the configuration,
RendezvousAffinityFunction will be used with excludeNeighbors=false. So if
you want to enable this feature, you have to specify it in the
configuration.

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Client-fails-to-connect-joinTimeout-vs-networkTimeout-tp4419p4709.html

Re: Client fails to connect - joinTimeout vs networkTimeout

Posted by bintisepaha <bi...@tudor.com>.
Val, thanks a lot. Will this also work if the caches do not use affinity?
We are trying not to use affinity because our data is very skewed.



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Client-fails-to-connect-joinTimeout-vs-networkTimeout-tp4419p4706.html

Re: Client fails to connect - joinTimeout vs networkTimeout

Posted by vkulichenko <va...@gmail.com>.
Hi Binti,

Yes, you should set 'excludeNeighbors' property on the affinity function:

<property name="affinity">
    <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
        <property name="excludeNeighbors" value="true"/>
    </bean>
</property>

-Val
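
A fuller sketch of where the property above might sit, assuming the Ignite 1.x
Spring XML format used in this thread. The cache name "orders" and the single
backup are illustrative assumptions, not taken from the thread. The
excludeNeighbors flag only matters when backups are configured, since its
purpose is to keep nodes on the same host from being backups of each other.

```xml
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <!-- Hypothetical cache name, for illustration only. -->
    <property name="name" value="orders"/>
    <!-- One backup copy per partition (assumed value). -->
    <property name="backups" value="1"/>
    <property name="affinity">
        <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
            <!-- Do not place primary and backup copies on same-host nodes. -->
            <property name="excludeNeighbors" value="true"/>
        </bean>
    </property>
</bean>
```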



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Client-fails-to-connect-joinTimeout-vs-networkTimeout-tp4419p4675.html

Re: Client fails to connect - joinTimeout vs networkTimeout

Posted by bintisepaha <bi...@tudor.com>.
Is there a way to configure the backup node on a different physical host in
such a scenario? I do not want the primary and backup on the same host in
case that host crashes.

Thanks,
Binti



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Client-fails-to-connect-joinTimeout-vs-networkTimeout-tp4419p4661.html

Re: Client fails to connect - joinTimeout vs networkTimeout

Posted by vkulichenko <va...@gmail.com>.
This is not a bad design, but in most cases it is just not needed and only
complicates the deployment. If it gives you a performance improvement,
then this is absolutely fine.

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Client-fails-to-connect-joinTimeout-vs-networkTimeout-tp4419p4636.html

Re: Client fails to connect - joinTimeout vs networkTimeout

Posted by bintisepaha <bi...@tudor.com>.
Thanks Val, we will look at the JVM tuning and see what fits our environment.
That seems really helpful.

Another question you had asked earlier was why we have more than one JVM
node on one physical machine. Is this a bad design? We are just trying to
start up the grid by loading records from the database in parallel, and with
many nodes it helps us load the cluster caches faster. It also helps us keep
smaller and reasonable heap sizes instead of one large heap.

Do you have any different recommendations?



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Client-fails-to-connect-joinTimeout-vs-networkTimeout-tp4419p4623.html

Re: Client fails to connect - joinTimeout vs networkTimeout

Posted by vkulichenko <va...@gmail.com>.
Hi,

You can limit the size of the allocated off-heap memory for a cache by
setting the CacheConfiguration.offHeapMaxMemory property. If it's exceeded,
entries will start to be evicted from the cache on an LRU basis. If you have
a persistence store, they will then be loaded from there. Otherwise, you lose
the data.

Refer to this documentation chapter for some performance and tuning tips:
https://apacheignite.readme.io/docs/preparing-for-production

-Val
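
A sketch of the off-heap limit described above, assuming the Ignite 1.x
CacheConfiguration properties memoryMode and offHeapMaxMemory (later major
versions changed the memory configuration, so check the version in use). The
cache name is hypothetical, and the 10 GB figure is the one mentioned
elsewhere in this thread; the value is given in bytes.

```xml
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <!-- Hypothetical cache name, for illustration only. -->
    <property name="name" value="orders"/>
    <!-- Store entries off-heap (Ignite 1.x memory mode). -->
    <property name="memoryMode" value="OFFHEAP_TIERED"/>
    <!-- 10 GB expressed in bytes: 10 * 1024 * 1024 * 1024. -->
    <property name="offHeapMaxMemory" value="10737418240"/>
</bean>
```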



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Client-fails-to-connect-joinTimeout-vs-networkTimeout-tp4419p4524.html

Re: Client fails to connect - joinTimeout vs networkTimeout

Posted by bintisepaha <bi...@tudor.com>.
Val, we are going to try 10 GB off-heap storage with 4 GB heap sizes this
week and next and will report back on how that looks. A follow-up question:
what happens when data does not fit off-heap and we have defined affinity?
Where does it go - on heap, or do we start seeing inconsistent results? We do
not have swap space configured yet, or any eviction policy.

Another question I asked earlier was whether you guys have any specific GC
tuning recommendations.



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Client-fails-to-connect-joinTimeout-vs-networkTimeout-tp4419p4516.html

Re: Client fails to connect - joinTimeout vs networkTimeout

Posted by vkulichenko <va...@gmail.com>.
Binti,

It sounds like clients were not able to connect due to instability on the
server, and increasing networkTimeout gave them a better chance to join, but
most likely they could not work properly anyway. As I said, most likely this
was caused by memory issues.

Is there any particular reason you're starting several nodes on each box? I
would recommend taking a look at off-heap memory [1]. It will allow you to
have only 4 nodes with small heap sizes (e.g., 4GB per node) and store all
the data off-heap. In many cases this makes memory issues easier to solve.

[1] https://apacheignite.readme.io/docs/off-heap-memory

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Client-fails-to-connect-joinTimeout-vs-networkTimeout-tp4419p4469.html

Re: Client fails to connect - joinTimeout vs networkTimeout

Posted by bintisepaha <bi...@tudor.com>.
2. I see your point, but setting joinTimeout looks like a good solution. Does
it work for you? 
joinTimeout was working earlier with 5 seconds; for some clients we had to
raise it. But eventually some clients could not connect at all with any
joinTimeout. We had to remove joinTimeout and add networkTimeout, and that
worked. But reading the API docs we could not understand why one works and
not the other.

We have not yet tried making the networkTimeout change in production.

Is there any way we can ping the cluster without Ignition.start and see if it
is up? Then we won't need the joinTimeout.

We will look into memory usage and get back to you. Are there any GC
recommendations for large clusters? 16 nodes, 12 GB each, on 4 Linux hosts.

Thanks,
Binti



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Client-fails-to-connect-joinTimeout-vs-networkTimeout-tp4419p4459.html

Re: Client fails to connect - joinTimeout vs networkTimeout

Posted by vkulichenko <va...@gmail.com>.
1. Failure detection timeout doesn't affect the join process, so this is
correct behavior.
2. I see your point, but setting joinTimeout looks like a good solution.
Does it work for you?
3. Yes, that's OK, and in most cases servers and clients should have the
same discovery configuration. TcpCommunicationSpi is used, but implicitly,
with all defaults.

As for the instability over time, I would check whether you have any memory
issues. If the memory consumption grows, you will have longer GC pauses that
can cause issues.

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Client-fails-to-connect-joinTimeout-vs-networkTimeout-tp4419p4437.html

Re: Client fails to connect - joinTimeout vs networkTimeout

Posted by bintisepaha <bi...@tudor.com>.
Val, thanks for the quick response.

1. I tried removing all the legacy settings and just using a failure
detection timeout of 5 seconds, and the client still hangs forever if the
cluster is down.
2. Our clients are existing Java apps that save orders to the database and
now to the grid too. If the grid is down, we don't want clients to stop
working; we want them to report an error and proceed. Without specifying
joinTimeout, this is not possible.
3. Is it OK if both server and client use the same TcpDiscoverySpi? I am not
using TcpCommunicationSpi anywhere.

Also, the clients were originally able to connect to the grid fine, but over
time new connections have started to fail, seemingly at random. So it looks
like the grid becomes unstable over time.

Thanks,
Binti
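
A sketch of the client-side setup described in point 2 above, assuming Spring
XML configuration as used elsewhere in this thread. With a finite joinTimeout
(in milliseconds), a client start attempt against a cluster that is down
should fail with an exception like the one in the original post rather than
block indefinitely, so the application can catch it and carry on. The
10-second value is an illustrative assumption.

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Start this node as a client rather than a server. -->
    <property name="clientMode" value="true"/>
    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
            <!-- Give up joining after 10 s (assumed value); 0 waits forever. -->
            <property name="joinTimeout" value="10000"/>
            <!-- ipFinder omitted here; it would match the server nodes. -->
        </bean>
    </property>
</bean>
```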



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Client-fails-to-connect-joinTimeout-vs-networkTimeout-tp4419p4432.html

Re: Client fails to connect - joinTimeout vs networkTimeout

Posted by vkulichenko <va...@gmail.com>.
Hi Binti,

1. networkTimeout, ackTimeout and maxAckTimeout are legacy settings and I
would recommend using the failure detection timeout instead [1]. This is a
global setting that defines the period of time during which a node can be
inaccessible before being considered failed. joinTimeout, on the other hand,
defines how long a node waits for discovery during startup. If discovery
doesn't happen, there can be different reasons. I would check server logs
and network load first.
2. What do you expect from the client in this case? A client can't actually
work without servers; even if it starts, it will throw an exception on any
operation.
3. TcpDiscoverySpi and TcpCommunicationSpi are different components and are
used by any node (client or server) in parallel for different purposes. The
failure detection timeout is properly applied to both of them.

[1] https://apacheignite.readme.io/docs/cluster-config#failure-detection-timeout

-Val
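
A sketch of the failure detection timeout mentioned in point 1 above,
assuming an Ignite version that exposes the failureDetectionTimeout property
on IgniteConfiguration. The 10-second value is illustrative. If the
TcpDiscoverySpi documentation is read correctly, explicitly setting
ackTimeout, maxAckTimeout, socketTimeout or reconnectCount on the SPI causes
the failure detection timeout to be ignored, so none of them are set here.

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Global failure detection timeout in milliseconds (assumed value). -->
    <property name="failureDetectionTimeout" value="10000"/>
    <!-- No legacy discovery timeouts are configured alongside it. -->
</bean>
```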



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Client-fails-to-connect-joinTimeout-vs-networkTimeout-tp4419p4431.html