Posted to dev@ignite.apache.org by Anton Vinogradov <av...@apache.org> on 2019/09/18 07:53:13 UTC

Non-blocking PME Phase One (Node fail)

Igniters,

Recently we had a discussion devoted to non-blocking PME.
We agreed that the most important case is blocking on node failure, and it
can be split into:

1) The latency of operations on affected partitions is increased by the node
failure detection duration.
So, some operations may be frozen for 10+ seconds on real clusters, just
waiting for a response from the failed primary.
In other words, some operations will be blocked even before the blocking PME
starts.

The good news here is that a bigger cluster decreases the percentage of
blocked operations.

The bad news is that these operations may block non-affected operations in
- customer code (single-thread/striped pool usage)
- multikey operations (tx1 locked A and waits for the failed B,
non-affected tx2 waits for A)
- striped pools inside AI (when some task waits for tx.op() in a sync way and
the striped thread is busy)
- etc.

It seems that, thanks to StopNodeFailureHandler (if configured), we already
send the node-left event before the node stops, to minimize the waiting period.
So only the cases that cause a hang without a stop are the problem now.

Anyway, some additional research is required here, and it would be nice if
someone were willing to help.

2) Some optimizations may speed up the node-left case (eliminate the blocking
of upcoming operations).
A full list can be found in the presentation [1].
The list contains 8 optimizations, but I propose to implement some at phase one
and the rest at phase two.
Assuming that a real production deployment has Baseline enabled, we are able to
gain a speed-up by implementing the following:

#1 Switch on the node_fail/node_left event locally instead of starting a real
PME (Local switch).
Since BLT is enabled, we are always able to switch to the new-affinity primaries
(no need to preload partitions).
In case we're not able to switch to the new-affinity primaries (all of them
missing or BLT disabled), we'll just start a regular PME.
The new-primary calculation can be performed locally or by the coordinator
(e.g. attached to the node_fail message).
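
For illustration, here is a minimal sketch of this decision in plain Java (all
class and method names below are hypothetical, not the real Ignite internals):

import java.util.List;
import java.util.UUID;

/** Hypothetical sketch of the "Local switch" decision on a node_fail/node_left event. */
public class LocalSwitchSketch {
    /** @return true if we can promote existing backups locally, false if a regular PME is required. */
    static boolean tryLocalSwitch(boolean baselineEnabled, List<UUID> newAffinityPrimaries, List<UUID> aliveOwners) {
        if (!baselineEnabled)
            return false; // BLT disabled: fall back to a regular PME.

        // All new-affinity primaries must already own their partitions (no preloading needed).
        if (!aliveOwners.containsAll(newAffinityPrimaries))
            return false; // Some new primaries are missing: regular PME.

        // Otherwise switch to the new-affinity primaries without a distributed exchange.
        return true;
    }

    public static void main(String[] args) {
        UUID a = UUID.randomUUID(), b = UUID.randomUUID();
        System.out.println(tryLocalSwitch(true, List.of(a), List.of(a, b)));  // true: local switch
        System.out.println(tryLocalSwitch(false, List.of(a), List.of(a, b))); // false: regular PME
    }
}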

#2 We should not wait for the completion of any already started operations (since
they are not related to the failed primary's partitions).
The only problem is recovery, which may cause update-counter duplication
in case of an unsynced HWM.

#2.1 We may wait only for recovery completion (Micro-blocking switch).
We just block all upcoming operations (at this phase) during the recovery by
incrementing the topology version.
In other words, it will be some kind of PME with waiting, but it will
wait for the recovery (fast) instead of for current operations to finish (long).
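
To make the intended behavior concrete, here is a minimal sketch in plain Java
(the names and the future-based shape are assumptions, not the actual exchange code):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of the "Micro-blocking switch": new operations wait for recovery, not for running ops. */
public class MicroBlockingSwitchSketch {
    private final AtomicLong topVer = new AtomicLong();
    private volatile CompletableFuture<Void> recoveryFut = CompletableFuture.completedFuture(null);

    /** Called on node failure: bump the topology version and start recovery of the failed primaries. */
    void onNodeFailed(Runnable recoverFailedPrimaries) {
        topVer.incrementAndGet();
        recoveryFut = CompletableFuture.runAsync(recoverFailedPrimaries);
        // Note: already started operations are NOT awaited here.
    }

    /** New operations map to the latest topology and only wait for recovery (fast). */
    long beginOperation() {
        recoveryFut.join();
        return topVer.get();
    }
}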

#2.2 Recovery, theoretically, can be async.
We have to solve the unsynced HWM issue (to avoid concurrent usage of the same
counters) to make it happen.
We may just increment the HWM by IGNITE_MAX_COMPLETED_TX_COUNT at the new primary
and continue recovery in an async way.
Currently, IGNITE_MAX_COMPLETED_TX_COUNT specifies the number of committed
transactions we expect between "the first backup committed tx1" and "the
last backup committed the same tx1".
I propose to use it to also specify the number of prepared transactions we
expect between "the first backup prepared tx1" and "the last backup
prepared the same tx1".
Both cases look pretty similar.
In this case, we are able to make the switch fully non-blocking, with async
recovery; a toy illustration is sketched below.
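
Here is a toy illustration of the proposed HWM reservation (the constant name is
the real system property, but the code around it is an assumption, not the actual
update-counter implementation; the fallback value is illustrative only):

import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: reserve a counter range on the new primary so async recovery can't clash with new updates. */
public class HwmReservationSketch {
    /** Placeholder; the real value comes from the IGNITE_MAX_COMPLETED_TX_COUNT system property. */
    static final long MAX_COMPLETED_TX_COUNT = Long.getLong("IGNITE_MAX_COMPLETED_TX_COUNT", 262_144L);

    private final AtomicLong hwm = new AtomicLong();

    /** On becoming primary: jump the HWM forward so recovered (prepared) txs reuse counters below the new HWM. */
    long reserveRecoveryRange() {
        return hwm.addAndGet(MAX_COMPLETED_TX_COUNT);
    }

    /** Regular updates continue from the reserved HWM, so they never collide with counters used by recovery. */
    long nextUpdateCounter() {
        return hwm.incrementAndGet();
    }
}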
Thoughts?

So, I'm going to implement both improvements under the "Lightweight version of
partitions map exchange" issue [2] if no one minds.

[1]
https://docs.google.com/presentation/d/1Ay7OZk_iiJwBCcA8KFOlw6CRmKPXkkyxCXy_JNg4b0Q/edit?usp=sharing
[2] https://issues.apache.org/jira/browse/IGNITE-9913

Re: Non-blocking PME Phase One (Node fail)

Posted by Alexei Scherbakov <al...@gmail.com>.
Anton,

Thanks for the effort on reducing PME blocking time.
I'm looking at it.

-- 

Best regards,
Alexei Scherbakov

Re: Non-blocking PME Phase One (Node fail)

Posted by Anton Vinogradov <av...@apache.org>.
Igniters,

The solution is ready to be reviewed.
TeamCity is green:
https://issues.apache.org/jira/browse/IGNITE-9913?focusedCommentId=16988566&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16988566
Issue: https://issues.apache.org/jira/browse/IGNITE-9913
Relevant PR: https://github.com/apache/ignite/pull/7069

In brief: the solution avoids both the PME and the waiting for running transactions
on node left.

What was done/changed:
The Late Affinity Assignment semantics were extended from "guarantee to switch when
ideal primaries are rebalanced" to "guarantee to switch when all expected
owners are rebalanced".
This extension provides the ability to perform a PME-free switch on baseline
node left when the cluster is fully rebalanced, avoiding a lot of corner cases
possible on a partially rebalanced cluster.
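
A minimal sketch of that extended guarantee (types and method names are assumed
for illustration, not the actual implementation):

import java.util.Map;
import java.util.Set;
import java.util.UUID;

/** Hypothetical sketch: late affinity assignment switches only when ALL expected owners own their partitions. */
public class RebalancedGuaranteeSketch {
    /**
     * @param expectedOwners partition -> node ids that the (baseline) affinity expects to own it.
     * @param actualOwners   partition -> node ids that currently own it (rebalance finished).
     * @return true if the cluster is fully rebalanced and a PME-free switch is safe.
     */
    static boolean fullyRebalanced(Map<Integer, Set<UUID>> expectedOwners, Map<Integer, Set<UUID>> actualOwners) {
        for (Map.Entry<Integer, Set<UUID>> e : expectedOwners.entrySet()) {
            Set<UUID> actual = actualOwners.getOrDefault(e.getKey(), Set.of());
            if (!actual.containsAll(e.getValue()))
                return false; // Some expected owner has not finished rebalancing yet.
        }
        return true;
    }
}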

The PME-free switch does not block, cancel, or wait for current operations.
It just recovers the failed primaries, and once cluster-wide recovery is finished
it completes the exchange future (allowing new operations).
In other words, new-topology operations are now allowed immediately after
cluster-wide recovery finishes (which is fast), not after previous-topology
operations finish + recovery + PME, as before.
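
In pseudo-Java the switch boils down to something like the following (a sketch
under the assumptions above; the real exchange future API differs):

import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch of the PME-free switch: finish the exchange right after cluster-wide recovery. */
public class PmeFreeSwitchSketch {
    static CompletableFuture<Void> onBaselineNodeLeft(List<Runnable> recoverFailedPrimaryPartitions,
                                                      CompletableFuture<Void> exchangeFuture) {
        // 1. Recover transactions on the partitions whose primary failed;
        //    already running operations on other partitions are not awaited or cancelled.
        CompletableFuture<?>[] recoveries = recoverFailedPrimaryPartitions.stream()
            .map(CompletableFuture::runAsync)
            .toArray(CompletableFuture[]::new);

        // 2. Once cluster-wide recovery is finished, complete the exchange future,
        //    which allows new-topology operations to start.
        return CompletableFuture.allOf(recoveries)
            .thenRun(() -> exchangeFuture.complete(null));
    }
}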

Limitations:
The optimization works only when the baseline is set (alive primaries require no
relocation; recovery is required only for the failed primaries).

BTW, the PME-free code itself seems to be small and simple; most of the changes are
related to the "rebalanced cluster" guarantee.

It would be nice if someone could check the code and tests.
Test coverage proposals are welcome!

P.S. Here's the benchmark used to check that we have the performance improvements
listed above:
https://github.com/anton-vinogradov/ignite/blob/426f0de9185ac2938dadb7bd2cbfe488233fe7d6/modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/Bench.java

P.P.S. We also checked the same case in a real environment and found no
latency degradation.


Re: Non-blocking PME Phase One (Node fail)

Posted by Anton Vinogradov <av...@apache.org>.
Ivan,

A useful profile can be produced only with a reproducible benchmark.
We're working on such a benchmark to analyze the duration and measure the
improvement.

Currently, only the typical overall duration from production servers is known.


Re: Non-blocking PME Phase One (Node fail)

Posted by Павлухин Иван <vo...@gmail.com>.
Anton, folks,

Out of curiosity: do we have any measurements of PME execution in an
environment similar to a real-world one? In my mind, some kind of
"profile" showing the durations of the different PME phases, with an indication
of where we are "blocked", would be ideal.


-- 
Best regards,
Ivan Pavlukhin

Re: Non-blocking PME Phase One (Node fail)

Posted by Anton Vinogradov <av...@apache.org>.
Alexey,

>> Can you describe the complete protocol changes first
The current goal is to find a better way.
I had at least 5 scenarios discarded because corner cases were found in them
(thanks to Alexey Scherbakov, Aleksei Plekhanov and Nikita Amelchev).
That's why I explained what we are able to improve and why I think it works.

>> we need to remove this property, not add new logic that relies on it.
Agreed.

>> How are you going to synchronize this?
Thanks for providing this case; it seems it discards the #1 + #2.2 combination, while
#2.1 is still possible with some optimizations.

"Zombie eating transactions" case can be theoretically solved, I think.
As I said before we may perform "Local switch" in case affinity was not
changed (except failed mode miss) other cases require regular PME.
In this case, new-primary is an ex-backup and we expect that old-primary
will try to update new-primary as a backup.
New primary will handle operations as a backup until it notified it's a
primary now.
Operations from ex-primary will be discarded or remapped once new-primary
notified it became the primary.
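
A hypothetical sketch of that rule (purely illustrative, not the actual
message-handling code):

import java.util.UUID;

/** Hypothetical sketch: how a new primary could treat updates still arriving from the failed ex-primary. */
public class ExPrimaryUpdateSketch {
    enum Action { APPLY_AS_BACKUP, DISCARD_OR_REMAP }

    /** Set once this node learns (via the switch) that it is the primary now. */
    private volatile boolean primaryNotified;

    Action onBackupUpdateRequest(UUID senderNodeId, UUID exPrimaryId) {
        // Until notified, behave exactly like a backup, so updates from the ex-primary stay consistent.
        if (!primaryNotified)
            return Action.APPLY_AS_BACKUP;

        // After the switch, updates still coming from the zombie ex-primary are discarded or remapped.
        return senderNodeId.equals(exPrimaryId) ? Action.DISCARD_OR_REMAP : Action.APPLY_AS_BACKUP;
    }

    void onPrimaryNotified() {
        primaryNotified = true;
    }
}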

Discarding is a big improvement,
remapping is a huge improvement,
but there is no 100% guarantee that the ex-primary will try to update the new
primary as a backup.

There are a lot of corner cases here.
So, it seems a minimized sync is the better solution.

Finally, according to your and Alexey Scherbakov's comments, the better
option is just to improve PME so that it waits for less, at least for now.
It seems we have to wait for (or cancel - I vote for this option; any
objections?) the operations related to the failed primaries and wait for
recovery to finish (which is fast).
In case the affinity was changed and a backup-primary switch (not related to the
failed primaries) is required between the owners, or even a rebalance is required,
we should just ignore this and perform only a "Recovery PME".
A regular PME can happen later (if necessary); it can even be delayed (if
necessary).

Sounds good?


Re: Non-blocking PME Phase One (Node fail)

Posted by Alexey Goncharuk <al...@gmail.com>.
Anton,

I looked through the presentation and the ticket but did not find any new
protocol description that you are going to implement. Can you describe the
complete protocol changes first (to save time both for you and during the
later review)?

Some questions that I have in mind:
 * It looks like, for the "Local Switch" optimization, you assume that a node
failure happens immediately for the whole cluster. This is not true - some
nodes may "see" node A's failure, while others still consider it alive.
Moreover, node A may not yet know that it is about to be failed and may still
process requests. This may happen, for example, due to a long GC pause on
node A. In this case, node A resumes its execution and proceeds to work as
a primary (it has not received a segmentation event yet), and node B also has
not received the A FAILED event yet. Node C received the event, ran the
"Local switch" and assumed the primary role; node D also received the A
FAILED event and switched to the new primary. Transactions from nodes B and
D will be processed simultaneously on different primaries. How are you
going to synchronize this?
 * IGNITE_MAX_COMPLETED_TX_COUNT is fragile, and we need to remove this
property, not add new logic that relies on it. There is no way a user can
calculate this property or adjust it at runtime. If a user decreases this
property below a safe value, we will get inconsistent update counters and
cluster desync. Besides, IGNITE_MAX_COMPLETED_TX_COUNT is quite a large
value and will push the HWM forward very quickly, much faster than during
regular updates (you will have to increment it for each partition; see the
rough numbers sketched below).
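
To put rough, purely assumed numbers on this concern (both values below are
illustrations, not defaults taken from the code):

/** Back-of-the-envelope sketch of how fast the proposal would push update counters forward. */
public class HwmGrowthSketch {
    public static void main(String[] args) {
        long maxCompletedTxCnt = 262_144L; // illustrative value of IGNITE_MAX_COMPLETED_TX_COUNT
        int partitionsPerNode = 1_024;     // assumed number of partitions owned by the new primary

        // A single node-failure switch would advance the counters by this much in total,
        // versus +1 per update during regular operation.
        System.out.println("Total counter jump per switch: " + maxCompletedTxCnt * (long) partitionsPerNode);
    }
}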
