Posted to user@ignite.apache.org by Maxim Volkomorov <22...@gmail.com> on 2021/03/11 20:43:56 UTC

slow node discovering on Kubernetes

Hi Ignite team,

We're running an Ignite cluster with 10 server nodes on Kubernetes.
With an empty Ignite configuration we can't start more than 5 nodes in a
reasonable time.

Trying to deploy 10 nodes with our empty config leads to weird discovery
problems caused by "IgniteSpiException: Node with the same ID was found".

After increasing AckTimeout to 10000 and switching to G1GC, the cluster
started, but startup still takes a long time and we still get
TcpDiscoverySpi errors.
I checked connections and port availability between the nodes and found
nothing suspicious.

I've attached a log from the deployment of one of the 10 nodes.
Our Ignite config is default-config.xml with only
TcpDiscoveryKubernetesIpFinder configured.
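
For readers following along, here is a minimal sketch of what such a
configuration might look like as a Spring XML file. It is not the attached
config. It assumes the IP finder's default settings match the cluster (no
namespace or service name is set) and uses the ackTimeout value mentioned
above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd">
    <bean class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <!-- Ack timeout in milliseconds (the value tried in this thread). -->
                <property name="ackTimeout" value="10000"/>
                <property name="ipFinder">
                    <!-- Assumption: finder defaults are used; no properties are set here. -->
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder"/>
                </property>
            </bean>
        </property>
    </bean>
</beans>
```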

Could you take a look and give us some suggestions on how to reduce the
deploy time?

Thanks,
Maxim

RE: slow node discovering on Kubernetes

Posted by mvolkomorov <22...@gmail.com>.
> avoid stretching between multiple availability zones and some persistence
> tuning, like disabling MMAP for WAL (IGNITE_WAL_MMAP=false)

We don't use persistence.

> Do you mean that switching to GKE made it work, or was that the initial
> setup and nothing has changed since?

To clarify the details: after an unsuccessful deployment on our corporate
VMware and OpenShift, we created a VM on Google Cloud for comparison and
started an identically configured Kubernetes cluster in less than 5
minutes.

I found many "duplicate message" entries in the debug logs:
[2021-04-30 11:31:27,075][DEBUG][tcp-disco-msg-worker-[]-#2%datanode%-#36%datanode%][TcpDiscoverySpi]
Ignoring duplicate message: TcpDiscoveryCustomEventMessage [msg=null,
super=TcpDiscoveryAbstractMessage
[sndNodeId=7c2bc47c-a33f-4a32-9d1d-75f9f7c4ba45,
id=78f73e12971-92f0b0a3-1eae-4505-b08b-52b6f7f60780, verifierNodeId=null,
topVer=0, pendingIdx=0, failedNodes=null, isClient=false]]

Is there any benchmark that could reveal possible freezes in internode
networking?
I have attached two archives with thread dumps and discovery debug logs.
Could you look at the logs?

thread-dumps-10-nodes.7z
<http://apache-ignite-users.70518.x6.nabble.com/file/t2921/thread-dumps-10-nodes.7z>  
discovery-debug-logs.7z
<http://apache-ignite-users.70518.x6.nabble.com/file/t2921/discovery-debug-logs.7z>  





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

RE: slow node discovering on Kubernetes

Posted by Alexandr Shapkin <le...@gmail.com>.
Hi Maxim,

Not really. I haven’t used the flannel plugin, and I’m not sure about network
recommendations either.

Nothing beyond the basics, like avoiding stretching between multiple
availability zones and some persistence tuning, like disabling MMAP for WAL
(IGNITE_WAL_MMAP=false).

> The problem is still present. We deployed the same 10 nodes on Google
> Kubernetes and got a normal deploy time.

Do you mean that switching to GKE made it work, or was that the initial
setup and nothing has changed since?

> For now we have not defined any limits or requests; our Ignite is the only
> deployment on the Kubernetes cluster.

I was asking because, if I remember correctly, there can be issues when
pods have insufficient resources, but unfortunately I have no concrete
numbers. Default GKE instances should work fine with a default cluster,
anyway.

> Do you have any idea how to estimate network or hardware performance to find
> possible bottlenecks?

I think enabling DEBUG logs for discovery might help to see what’s really
happening in the grid.

<category name="org.apache.ignite.spi.discovery">
    <level value="DEBUG"/>
</category>
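
The snippet above uses log4j 1.x syntax. If the deployment happens to use
Log4j 2 instead (an assumption; the thread does not say which logger is
configured), a roughly equivalent entry would go inside the <Loggers>
section of the Log4j 2 configuration file:

```xml
<!-- Assumption: this sits inside the existing <Loggers> element of a Log4j 2 config. -->
<!-- With no AppenderRef of its own, the logger passes events to the root logger's appenders. -->
<Logger name="org.apache.ignite.spi.discovery" level="DEBUG"/>
```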

By the way, is it a persistent cluster or a pure in-memory one?





From: mvolkomorov <2201416@gmail.com>
Sent: Friday, April 9, 2021 10:26 AM
To: user@ignite.apache.org
Subject: RE: slow node discovering on Kubernetes




RE: slow node discovering on Kubernetes

Posted by mvolkomorov <22...@gmail.com>.
Hello, Alexandr!

The problem is still present. We deployed the same 10 nodes on Google
Kubernetes and got a normal deploy time.
For now we have not defined any limits or requests; our Ignite is the only
deployment on the Kubernetes cluster.
We use the flannel network plugin (host-gw). Are there any recommendations
for the network plugin?
Do you have any idea how to estimate network or hardware performance to find
possible bottlenecks?




RE: slow node discovering on Kubernetes

Posted by Alexandr Shapkin <le...@gmail.com>.
Hello Maxim,

Could you please share the current state of this issue? Have you managed to
resolve it, or does it still exist?

How are your pods configured in terms of resource usage?



From: Maxim Volkomorov <2201416@gmail.com>
Sent: Thursday, March 11, 2021 11:44 PM
To: user@ignite.apache.org
Subject: slow node discovering on Kubernetes


