Posted to user@ignite.apache.org by keinproblem <no...@qq.com> on 2017/05/02 19:20:27 UTC

Volatile Kubernetes Node Discovery

Dear Apache Ignite Users Community,

This may be a well-known problem, although the currently available
information does not provide enough help for solving this issue.

Inside my service I'm using an IgniteCache in /Replicated/ mode from Ignite
1.9.
Some replicas of this service run inside Kubernetes in the form of Pods (1
Container/Pod).
I'm using the
/org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder/
for the node discovery.
As I understand it, each Pod makes an API call to the Kubernetes API
and retrieves the list of currently available nodes. This works properly,
even though the Pod's own IP is also retrieved, which produces a
somewhat harmless warning.
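Conceptually, the address lookup each node performs can be sketched as follows (a minimal illustration, not Ignite's actual implementation; the class, helper, and IP addresses are hypothetical):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Minimal sketch of the discovery address lookup: the Kubernetes endpoints
// list contains every pod IP, including the caller's own, which the
// discovery SPI simply skips over (hence the harmless warning).
public class DiscoveryCandidates {
    static List<String> remoteCandidates(List<String> endpointIps, String ownIp) {
        return endpointIps.stream()
                .filter(ip -> !ip.equals(ownIp)) // own IP is filtered, not an error
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(remoteCandidates(
                Arrays.asList("172.17.0.4", "172.17.0.7", "172.17.0.9"),
                "172.17.0.4")); // [172.17.0.7, 172.17.0.9]
    }
}
```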

Here is how I obtain my /IgniteCache/ and the /IgniteConfiguration/ used:

    public IgniteCache<String, MyCacheObject> getCacheInstance() {
        // Type parameters must match the cache's key/value types
        final CacheConfiguration<String, MyCacheObject> cacheConfiguration =
                new CacheConfiguration<>();
        cacheConfiguration.setName("MyObjectCache");
        return ignite.getOrCreateCache(cacheConfiguration);
    }

    public static IgniteConfiguration getDefaultIgniteConfiguration() {
        final IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setGridLogger(new Slf4jLogger(log));
        cfg.setClientMode(false);

        final TcpDiscoveryKubernetesIpFinder kubernetesPodIpFinder =
                new TcpDiscoveryKubernetesIpFinder();
        kubernetesPodIpFinder.setServiceName(SystemDataProvider.getServiceNameEnv);

        final TcpDiscoverySpi tcpDiscoverySpi = new TcpDiscoverySpi();
        tcpDiscoverySpi.setIpFinder(kubernetesPodIpFinder);
        tcpDiscoverySpi.setLocalPort(47500); // static port, to reduce potential failure causes

        cfg.setFailureDetectionTimeout(90000);
        cfg.setDiscoverySpi(tcpDiscoverySpi);
        return cfg;
    }



The initial node will start up properly every time.

In most cases, roughly the 3rd node trying to connect will fail and is
restarted by Kubernetes after some time. Sometimes this node succeeds in
connecting to the cluster after a few restarts, but more commonly the
nodes keep restarting forever.

But the major issue is that when a new node fails to connect to the cluster,
the cluster seems to become unstable: the number of nodes increases for a
very short time, then drops to the previous count or even lower.
I am not sure whether these are the newly connecting nodes losing the
connection immediately again, or the previously connected nodes losing
their connections.


I also deployed the bare Ignite Docker image with a configuration for the
/TcpDiscoveryKubernetesIpFinder/ as described here:
https://apacheignite.readme.io/docs/kubernetes-deployment
Even with this minimal setup, I've experienced the same behavior.
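For reference, the minimal Spring XML configuration used with the bare Docker image is roughly of this shape (a sketch only; the authoritative version is on the linked docs page, and properties such as the service name are omitted here):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
            <property name="ipFinder">
                <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder"/>
            </property>
        </bean>
    </property>
</bean>
```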

There is no load on the Ignite Nodes and the network usage is very low.

Using another Kubernetes instance on different infrastructure showed the same
results, hence I assume this to be an Ignite-related issue.

I also tried increasing the specific timeouts such as /ackTimeout/,
/sockTimeout/, etc.

Using the /TcpDiscoveryVmIpFinder/, with all the endpoints obtained via DNS,
did not help either; the behavior was the same as described above.
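For the /TcpDiscoveryVmIpFinder/ variant, the address list has to be supplied explicitly. A minimal sketch of building the entries that would be passed to TcpDiscoveryVmIpFinder.setAddresses(...) (the pod IPs and the helper class are hypothetical; in practice the IPs came from DNS):

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

// Builds "ip:port" entries for TcpDiscoveryVmIpFinder.setAddresses(...),
// pinning the same static discovery port (47500) used in the Java
// configuration above. The pod IPs are illustrative.
public class VmIpFinderAddresses {
    static List<String> toDiscoveryAddresses(Collection<String> podIps, int discoveryPort) {
        return podIps.stream()
                .map(ip -> ip + ":" + discoveryPort)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(toDiscoveryAddresses(
                Arrays.asList("172.17.0.4", "172.17.0.7"), 47500));
        // [172.17.0.4:47500, 172.17.0.7:47500]
    }
}
```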

Please find attached a log file providing information on WARN level. Please
let me know if DEBUG level is desired.



Kind regards and thanks in advance,
keinproblem



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Volatile-Kubernetes-Node-Discovery-tp12357.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: Volatile Kubernetes Node Discovery

Posted by keinproblem <no...@qq.com>.
Providing the previously promised log file: ignite_dmp.txt
<http://apache-ignite-users.70518.x6.nabble.com/file/n12358/ignite_dmp.txt>
- Please excuse the DP



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Volatile-Kubernetes-Node-Discovery-tp12357p12358.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: Volatile Kubernetes Node Discovery

Posted by vkulichenko <va...@gmail.com>.
Hi Stephen,

Please properly subscribe to the mailing list so that the community can
receive email notifications for your messages. To subscribe, send empty
email to user-subscribe@ignite.apache.org and follow simple instructions in
the reply.


macdonagh wrote
> Are there logs which will show the Kubernetes IP finder doing its job?
> 
> I am running a similar scenario but I do not get as far.
> I expected to see traces from TcpDiscoveryKubernetesIpFinder in the Ignite
> log but I do not.
> 
> Instead I see a warning
> 
> [16:47:22,838][WARNING][main][TcpDiscoveryMulticastIpFinder]
> TcpDiscoveryMulticastIpFinder has no pre-configured addresses (it is
> recommended in production to specify at least one address in
> TcpDiscoveryMulticastIpFinder.getAddresses() configuration property)
> 
> even though my configuration file only contains the KubernetesIPFinder
> bean class.

You most likely misconfigured something. There is no way you will see logs
from TcpDiscoveryMulticastIpFinder if TcpDiscoveryKubernetesIpFinder is
configured. Check that the correct configuration file is used.

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Volatile-Kubernetes-Node-Discovery-tp12357p14418.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: Volatile Kubernetes Node Discovery

Posted by Denis Magda <dm...@apache.org>.
Hi,

This seems to be a networking-related issue. The Kubernetes IP finder did its job pretty well: all the nodes could share their assigned IP addresses with each other and form a cluster by connecting via port 47500, properly defined in your config.

But right after that the nodes attempted to communicate directly with each other relying on the communication SPI [1], and this is where they failed.

By default the communication SPI binds to port 47100 (this happened according to the logs), but one node could not get through to the other node via this port:

    Connection info [in=false, rmtAddr=/172.17.0.7:47100, locAddr=/172.17.0.4:47022, msgsSent=2, msgsAckedByRmt=0, descIdHash=899022737, msgsRcvd=0, lastAcked=0, descIdHash=899022737, bytesRcvd=0, bytesRcvd0=0, bytesSent=201, bytesSent0=0, opQueueSize=0, msgWriter=DirectMessageWriter [state=DirectMessageState [pos=0, stack=[StateItem [stream=DirectByteBufferStreamImplV2 [buf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], baseOff=140562938260976, arrOff=-1, tmpArrOff=0, tmpArrBytes=0, msgTypeDone=false, msg=null, mapIt=null, it=null, arrPos=-1, keyDone=false, readSize=-1, readItems=0, prim=0, primShift=0, uuidState=0, uuidMost=0, uuidLeast=0, uuidLocId=0, lastFinished=true], state=0, hdrWritten=false], StateItem [stream=DirectByteBufferStreamImplV2 [buf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], baseOff=140562938260976, arrOff=-1, tmpArrOff=0, tmpArrBytes=0, msgTypeDone=false, msg=null, mapIt=null, it=null, arrPos=-1, keyDone=false, readSize=-1, readItems=0, prim=0, primShift=0, uuidState=0, uuidMost=0, uuidLeast=0, uuidLocId=0, lastFinished=true], state=0, hdrWritten=false], StateItem [stream=DirectByteBufferStreamImplV2 [buf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], baseOff=140562938260976, arrOff=-1, tmpArrOff=0, tmpArrBytes=0, msgTypeDone=false, msg=null, mapIt=null, it=null, arrPos=-1, keyDone=false, readSize=-1, readItems=0, prim=0, primShift=0, uuidState=0, uuidMost=0, uuidLeast=0, uuidLocId=0, lastFinished=true], state=0, hdrWritten=false], null, null, null, null, null, null, null]]], msgReader=null]

As you can see, the node tried to connect to /172.17.0.7:47100 from its own address 172.17.0.4:47022. Make sure that these port ranges are not blocked by firewalls.
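When checking firewalls or Kubernetes NetworkPolicies, it helps to enumerate every inbound TCP port a pod must accept. A sketch assuming the setup above (discovery pinned to 47500, communication starting at the default 47100; the local port range of 100 is an assumption matching Ignite's default):

```java
import java.util.ArrayList;
import java.util.List;

// Enumerates the inbound TCP ports each pod must allow: the pinned
// discovery port plus the whole communication port range. A range size
// of 100 is Ignite's default and is an assumption here.
public class RequiredPorts {
    static List<Integer> requiredPorts(int discoveryPort, int commPort, int commPortRange) {
        List<Integer> ports = new ArrayList<>();
        ports.add(discoveryPort);                        // TcpDiscoverySpi (ring)
        for (int p = commPort; p <= commPort + commPortRange; p++) {
            ports.add(p);                                // TcpCommunicationSpi
        }
        return ports;
    }

    public static void main(String[] args) {
        List<Integer> ports = requiredPorts(47500, 47100, 100);
        System.out.println(ports.size()); // 102: one discovery + 101 communication ports
    }
}
```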

 
[1] https://apacheignite.readme.io/docs/network-config
  
—
Denis

> On May 3, 2017, at 12:04 AM, keinproblem <no...@qq.com> wrote:
> 
> Hi Denis,
> 
> the whole cluster is running in Kubernetes.
> So basically I just have connections between my pods.
> 
> Kind regards,
> keinproblem


Re: Volatile Kubernetes Node Discovery

Posted by keinproblem <no...@qq.com>.
Denis Magda-2 wrote
> Do you mean that a part of the cluster is running outside of Kubernetes?
> If it’s so this might be an issue because containerized Ignite nodes can’t
> get through the network and reach your nodes that are outside.
> 
> —
> Denis

Hi Denis,

the whole cluster is running in Kubernetes.
So basically I just have connections between my pods.

Kind regards,
keinproblem



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Volatile-Kubernetes-Node-Discovery-tp12357p12373.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: Volatile Kubernetes Node Discovery

Posted by Denis Magda <dm...@apache.org>.
> Inside my service I'm using a IgniteCache in /Replicated/ mode from Ignite
> 1.9.
> Some replicas of this service run inside Kubernetes in form of Pods (1
> Container/Pod).
> I'm using the 
> /org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder/
> for the Node Discovery.

Do you mean that a part of the cluster is running outside of Kubernetes? If so, this might be an issue because containerized Ignite nodes can’t get through the network to reach your nodes that are outside.

—
Denis

> On May 2, 2017, at 12:20 PM, keinproblem <no...@qq.com> wrote:
> 
> [snip]