Posted to user@ignite.apache.org by "privacyfirst@codesandnotes.be" <pr...@codesandnotes.be> on 2021/10/28 13:15:49 UTC

Reliability of Apache Ignite as a Multi-Tier Storage

Dear,

One of the "Core Features" listed by ignite.apache.org is the capability
of Ignite to be a Multi-Tier Storage. However, unless I have
misunderstood something, I am worried that this storage is not reliable...

I currently have an application that uses an Ignite cluster as a DB. The
cluster contains two nodes at the moment: the second node backs up the
first. Each Ignite node is on a VPS server at OVH.

Lately OVH had a series of issues which apparently brought down the
communication between those VPS servers. The consequence was that the
Ignite nodes couldn't talk to each other and therefore split, each node
upgrading to a new Baseline Topology and each one seeing the other node
as being offline.

Restarting the nodes would result in an error on one of them:
Caused by: class org.apache.ignite.spi.IgniteSpiException:
BaselineTopology of joining node (72fbc939-bf09-42cb-a7e4-12896046cfc0)
is not compatible with BaselineTopology in the cluster. Branching
history of cluster BlT ([1060612220]) doesn't contain branching point
hash of joining node BlT (173037243). Consider cleaning persistent
storage of the node and adding it to the cluster again.
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.checkFailedError(TcpDiscoverySpi.java:2052)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:1197)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:472)
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2154)
        at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:278)

At this point, the only option I found was to destroy my backup node
(deleting the "/work" folder), restart it and add it back to the cluster.

Obviously this becomes a real problem when scaling up and having one's
data distributed among multiple Ignite nodes. If a major network issue
occurs (such as days of network outage) Ignite nodes might (will?) end
up in the same state as my backup node in my example, therefore losing
data.

So, is my theory above correct or have I misunderstood Ignite's
capabilities as a Multi-Tier storage solution?
Is a cluster node able to re-join a cluster it's been disconnected from
for a significant amount of time?
And if a node is disconnected and starts giving a "BaselineTopology of
joining node is not compatible with BaselineTopology in the cluster"
then how can I recover my data?

Thanks for your help,

Diego

Re: Reliability of Apache Ignite as a Multi-Tier Storage

Posted by Stephen Darlington <st...@gridgain.com>.

As far as I know, “multi-tier” refers to the ability to store some data in memory, the rest on disk.

You experienced “split brain,” which is a difficult problem to solve in any distributed system. From your description my guess is that you’ve enabled baseline auto-adjust, which is generally not a good idea when you have persistence turned on.
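
To illustrate why a two-node cluster is so exposed to this, here is a minimal sketch of a majority ("quorum") check in Java. It is not Ignite code and not a description of how Ignite handles segmentation. The class `QuorumCheck`, the method `hasQuorum` and the node names are invented for the example. The idea is that a partition should only carry on as "the cluster" if it can see a strict majority of the nodes it expects.

```java
import java.util.Set;

public class QuorumCheck {
    /** Returns true when the visible nodes form a strict majority of the expected cluster size. */
    static boolean hasQuorum(Set<String> visibleNodes, int expectedClusterSize) {
        return visibleNodes.size() > expectedClusterSize / 2;
    }

    public static void main(String[] args) {
        // Two-node cluster after a network split: each side sees only itself.
        System.out.println(hasQuorum(Set.of("node-a"), 2));           // false
        // Three-node cluster split 2/1: only the larger side has a majority.
        System.out.println(hasQuorum(Set.of("node-a", "node-b"), 3)); // true
        System.out.println(hasQuorum(Set.of("node-c"), 3));           // false
    }
}
```

With two nodes, neither side of a split ever sees a majority, so neither can safely decide on its own that it is the surviving half.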

With a catastrophic network failure like that, I would expect that you would need to restart some nodes. With a correctly configured cluster, you shouldn’t need to “destroy” a node.
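
For reference, a rough sketch of what turning off baseline auto-adjust might look like from Java. It assumes Ignite 2.9 or later with native persistence enabled on the default data region. The class name `PersistentNodeStartup` is hypothetical, and discovery settings are left at their defaults. This alone does not protect against split brain. It only stops the baseline topology from changing on its own when nodes drop out.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class PersistentNodeStartup {
    public static void main(String[] args) {
        // Enable native persistence for the default data region.
        DataStorageConfiguration storage = new DataStorageConfiguration();
        storage.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(storage);

        // Start a server node; it keeps running after main() returns.
        Ignite ignite = Ignition.start(cfg);

        // A cluster with persistence starts inactive and must be activated once.
        ignite.cluster().state(ClusterState.ACTIVE);

        // Keep the baseline topology under manual control.
        ignite.cluster().baselineAutoAdjustEnabled(false);

        System.out.println("Baseline auto-adjust enabled: "
            + ignite.cluster().isBaselineAutoAdjustEnabled());
    }
}
```

With auto-adjust off, a node that loses contact with its peer should keep the existing baseline rather than branching to a new one, so any baseline change becomes a deliberate administrative step.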
