Posted to dev@ignite.apache.org by Alexei Scherbakov <al...@gmail.com> on 2020/05/06 08:45:03 UTC

[DISCUSS] Data loss handling improvements

Folks,

I've almost finished a patch bringing some improvements to the data loss
handling code, and I wish to discuss proposed changes with the community
before submitting.

*The issue*

During the grid's lifetime, it's possible to get into a situation where some
data nodes have failed or were mistakenly stopped. If the number of stopped
nodes exceeds a certain threshold, which depends on the configured backups
count, data loss will occur. For example, a grid having one backup (meaning
at least two copies of each data partition exist at the same time) can
tolerate only one node loss at a time. Generally, data loss is guaranteed to
occur if backups + 1 or more nodes have failed simultaneously, assuming the
default affinity function.

For in-memory caches, the data is lost forever. For persistent caches, the
data is not physically lost and becomes accessible again after the failed
nodes are returned to the topology.

Possible data loss should be taken into consideration while designing an
application.



*Consider an example: money is transferred from one deposit to another, and
all nodes holding data for one of the deposits are gone. In such a case, the
transaction temporarily cannot be completed until the cluster is recovered
from the data loss state. Ignoring this can cause data inconsistency.*

It is necessary to have an API telling us whether an operation is safe to
complete from the perspective of data loss.

Such an API has existed for some time [1] [2] [3]. In short, a grid can be
configured to switch caches to a partial availability mode if data loss is
detected.

Let's give two definitions according to the Javadoc for
*PartitionLossPolicy*:

·   *Safe* (data loss handling) *policy* - cache operations are only
available for non-lost partitions (PartitionLossPolicy != IGNORE).

·   *Unsafe policy* - cache operations are always possible
(PartitionLossPolicy = IGNORE). If the unsafe policy is configured, lost
partitions are automatically re-created on the remaining nodes if needed, or
immediately owned if the last supplier has left during rebalancing.
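
For reference, here is a minimal sketch of how this API [1] [2] [3] is used
from application code. Treat it only as an illustration under my own
assumptions: the cache name "accounts", the backups count, and the bootstrap
code are arbitrary and are not part of the patch.

import java.util.Collection;
import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class PartitionLossPolicyExample {
    public static void main(String[] args) {
        // Safe policy [1]: operations on lost partitions fail instead of
        // silently reading or writing an incomplete data set.
        CacheConfiguration<Integer, Long> accountsCfg =
            new CacheConfiguration<Integer, Long>("accounts")
                .setBackups(1)
                .setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);

        Ignite ignite = Ignition.start(new IgniteConfiguration()
            .setCacheConfiguration(accountsCfg));

        IgniteCache<Integer, Long> accounts = ignite.cache("accounts");

        // [3] Which partitions are currently marked as LOST for this cache.
        Collection<Integer> lost = accounts.lostPartitions();

        if (!lost.isEmpty()) {
            // Recover the data first (return the failed nodes or adjust the
            // baseline), then acknowledge the loss [2] to resume full
            // availability.
            ignite.resetLostPartitions(Collections.singleton("accounts"));
        }
    }
}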

*What needs to be fixed*

1. In the current implementation, the default loss policy is unsafe, even
for persistent caches. This can result in unintentional data loss and broken
business invariants.

2. Node restarts in a persistent grid with detected data loss will cause
automatic resetting of the LOST state after the restart, even if the safe
policy is configured. This can result in data loss or partition desync if not
all nodes are returned to the topology, or if they are returned in the wrong
order.


*An example: a grid has three nodes and one backup. The grid is under load.
First node2 leaves, and soon afterwards node3 leaves. If node2 is returned to
the topology first, it will have stale data for some keys. The most recent
data is on node3, which is not in the topology yet. Because the lost state
was reset, all caches are fully available and will most probably become
inconsistent even in safe mode.*

3. Depending on the cluster configuration, the configured loss policy
doesn't provide the guarantees described in the Javadoc [4]. In particular,
the unsafe policy (IGNORE) cannot be guaranteed if the baseline is fixed (not
automatically readjusted when a node leaves), because partitions do not get
reassigned automatically on topology change and no nodes exist to fulfill a
read/write request. The same applies to READ_ONLY_ALL and READ_WRITE_ALL.

4. Calling resetLostPartitions doesn't guarantee full cache operation
availability if the topology doesn't have at least one owner for each lost
partition.

The ultimate goal of the patch is to fix API inconsistencies and fix the
most crucial bugs related to data loss handling.

*The planned changes are:*

1. The safe policy is used by default, except for in-memory grids with
baseline auto-adjust [5] enabled with a zero timeout [6]. In the latter case,
the unsafe policy is used by default. This protects from unintentional data
loss.

2. The lost state is never reset when grid nodes are restarted (even on a
full cluster restart). This makes real data loss impossible in persistent
grids if the recovery instruction is followed.

3. The lost state is impossible to reset if the topology doesn't have at
least one owner for each lost partition. If nodes are physically dead, they
should be removed from the baseline before calling resetLostPartitions (a
sketch of this sequence follows the list).

4. READ_WRITE_ALL and READ_ONLY_ALL are subject to deprecation because their
guarantees cannot be fulfilled unless the full baseline is present.

5. Any operation that failed due to data loss contains
CacheInvalidStateException as the root cause.
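
The recovery sequence implied by change 3 could look roughly like the sketch
below. It uses the existing public baseline API, the cache name is arbitrary,
and the exact procedure will be described in the wiki tutorial, so treat this
as an outline rather than the final recipe.

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class LostPartitionsRecoverySketch {
    public static void main(String[] args) {
        // Assumes an Ignite node is already started in this JVM.
        Ignite ignite = Ignition.ignite();

        // Step 1: shrink the baseline to the server nodes that are actually
        // alive, so the physically dead ex-owners leave the baseline.
        ignite.cluster().setBaselineTopology(ignite.cluster().forServers().nodes());

        // Step 2: once every lost partition has at least one owner in the
        // topology again, acknowledge the loss and resume full availability.
        ignite.resetLostPartitions(Collections.singleton("accounts"));
    }
}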

In addition to the code fixes, I plan to write a tutorial for safe data loss
recovery in persistent mode in the Ignite wiki.

Any comments on the proposed changes are welcome.

[1]
org.apache.ignite.configuration.CacheConfiguration#setPartitionLossPolicy(PartitionLossPolicy
partLossPlc)
[2] org.apache.ignite.Ignite#resetLostPartitions(caches)
[3] org.apache.ignite.IgniteCache#lostPartitions
[4] https://issues.apache.org/jira/browse/IGNITE-10041
[5] org.apache.ignite.IgniteCluster#baselineAutoAdjustEnabled(boolean)
[6] org.apache.ignite.IgniteCluster#baselineAutoAdjustTimeout(long)

Re: [DISCUSS] Data loss handling improvements

Posted by Ivan Pavlukhin <vo...@gmail.com>.
I have not had a chance to read this thread carefully, but the following
discussion sounds very similar and might be useful somehow [1].

[1] http://apache-ignite-developers.2346864.n4.nabble.com/Partition-Loss-Policies-issues-td37304.html

Best regards,
Ivan Pavlukhin


Re: [DISCUSS] Data loss handling improvements

Posted by Alexei Scherbakov <al...@gmail.com>.
Yes, it will work this way.

On Thu, May 7, 2020 at 10:43 AM Anton Vinogradov <av...@apache.org> wrote:

> Seems I got the vision, thanks.
> There should be only 2 ways to reset a lost partition: either gain an owner
> from a resurrected node first, or remove the ex-owner from the baseline (the
> partition will be reassigned).
> And we should make a decision for every lost partition before calling the
> reset.


-- 

Best regards,
Alexei Scherbakov

Re: [DISCUSS] Data loss handling improvements

Posted by Anton Vinogradov <av...@apache.org>.
Seems I got the vision, thanks.
There should be only 2 ways to reset a lost partition: either gain an owner
from a resurrected node first, or remove the ex-owner from the baseline (the
partition will be reassigned).
And we should make a decision for every lost partition before calling the
reset.


Re: [DISCUSS] Data loss handling improvements

Posted by Alexei Scherbakov <al...@gmail.com>.
On Wed, May 6, 2020 at 12:54 PM Anton Vinogradov <av...@apache.org> wrote:

> Alexei,
>
> 1,2,4,5 - looks good to me, no objections here.
>
> >> 3. Lost state is impossible to reset if a topology doesn't have at least
> >> one owner for each lost partition.
>
> Do you mean that, according to your example, where
> >> a node2 has left, soon a node3 has left. If the node2 is returned to
> >> the topology first, it would have stale data for some keys.
> do we have to have node2 in the cluster to be able to reset "lost" for
> node2's data?
>

Not sure if I understand the question, but I'll try to answer using an example.
Assume 3 nodes n1, n2, n3, 1 backup, persistence enabled, and partition p
owned by n2 and n3.
1. Topology is activated.
2. cache.put(p, 0) // n2 and n3 have p->0, updateCounter=1
3. n2 has failed.
4. cache.put(p, 1) // n3 has p->1, updateCounter=2
5. n3 has failed, partition loss has happened.
6. n2 joins the topology; it has stale data (p->0).

We actually have 2 issues:
7. cache.put(p, 2) will succeed: n2 has p->2, n3 has p->0, the data has
diverged and will not be adjusted by counter-based rebalancing if n3 later
joins the topology.
or
8. n3 joins the topology; it has the actual data (p->1), but rebalancing will
not work because the joining node has the highest counter (it can only be a
demander in this scenario).

In both cases rebalancing by counters will not work, causing data divergence
between the copies.
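
To make the sequence above easier to follow, here is a rough sketch of it in
code. It is only an illustration: the instance names, the cache name "c" and
the key are arbitrary, the key is assumed to map to the partition p owned by
n2 and n3, and it is not a test from the patch.

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class LostStateScenarioSketch {
    static IgniteConfiguration cfg(String name) {
        // Persistence enabled, 1 backup, safe loss policy.
        return new IgniteConfiguration()
            .setIgniteInstanceName(name)
            .setDataStorageConfiguration(new DataStorageConfiguration()
                .setDefaultDataRegionConfiguration(
                    new DataRegionConfiguration().setPersistenceEnabled(true)))
            .setCacheConfiguration(new CacheConfiguration<Integer, Integer>("c")
                .setBackups(1)
                .setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE));
    }

    public static void main(String[] args) {
        Ignite n1 = Ignition.start(cfg("n1"));
        Ignition.start(cfg("n2"));
        Ignition.start(cfg("n3"));

        n1.cluster().active(true);   // 1. topology is activated, baseline = n1, n2, n3

        int key = 0;                 // assumed to land in partition p owned by n2 and n3

        n1.cache("c").put(key, 0);   // 2. n2 and n3 store key->0, updateCounter=1

        Ignition.stop("n2", true);   // 3. n2 has failed

        n1.cache("c").put(key, 1);   // 4. n3 stores key->1, updateCounter=2

        Ignition.stop("n3", true);   // 5. n3 has failed, partition p is now lost

        Ignition.start(cfg("n2"));   // 6. n2 rejoins the topology with stale data (key->0)
    }
}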


>
> >> at least one owner for each lost partition.
> What is the reason to have owners for all lost partitions when we want to
> reset only some (the available ones)?
>

It was never possible to reset only a subset of lost partitions. The reason
is to keep the guarantee of the resetLostPartitions method: all cache
operations are resumed, and the data is correct.


> Will it be possible to perform operations on non-lost partitions when the
> cluster has at least one lost partition?
>

Yes, it will be.




-- 

Best regards,
Alexei Scherbakov

Re: [DISCUSS] Data loss handling improvements

Posted by Anton Vinogradov <av...@apache.org>.
Alexei,

1,2,4,5 - looks good to me, no objections here.

>> 3. Lost state is impossible to reset if a topology doesn't have at least
>> one owner for each lost partition.

Do you mean that, according to your example, where
>> a node2 has left, soon a node3 has left. If the node2 is returned to
>> the topology first, it would have stale data for some keys.
do we have to have node2 in the cluster to be able to reset "lost" for node2's data?

>> at least one owner for each lost partition.
What is the reason to have owners for all lost partitions when we want to
reset only some (the available ones)?
Will it be possible to perform operations on non-lost partitions when the
cluster has at least one lost partition?
