Posted to dev@ignite.apache.org by Anton Vinogradov <av...@apache.org> on 2019/12/05 13:14:43 UTC

Re: Non-blocking PME Phase One (Node fail)

Igniters,

The solution is ready to be reviewed.
TeamCity is green:
https://issues.apache.org/jira/browse/IGNITE-9913?focusedCommentId=16988566&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16988566
Issue: https://issues.apache.org/jira/browse/IGNITE-9913
Relevant PR: https://github.com/apache/ignite/pull/7069

In brief: the solution avoids a full PME and avoids waiting for running
transactions when a node leaves.

What was done/changed:
The Late Affinity Assignment semantic was extended: from "guarantee to
switch when the ideal primaries are rebalanced" to "guarantee to switch
when all expected owners are rebalanced".
This extension makes it possible to perform a PME-free switch on baseline
node left when the cluster is fully rebalanced, avoiding many corner cases
possible on a partially rebalanced cluster.
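
To make the change concrete, here is a minimal sketch of the switch
condition before and after the extension. Plain Java, illustrative names
only - these are not actual Ignite internals:

import java.util.Set;
import java.util.UUID;

class LateAffinitySwitchCondition {
    /** Old semantic: switch once the ideal primaries are rebalanced. */
    static boolean oldCondition(Set<UUID> idealPrimaries, Set<UUID> rebalanced) {
        return rebalanced.containsAll(idealPrimaries);
    }

    /**
     * Extended semantic: switch only once ALL expected owners are rebalanced,
     * i.e. the cluster is fully rebalanced - the state that makes the
     * PME-free switch on baseline node left safe.
     */
    static boolean newCondition(Set<UUID> expectedOwners, Set<UUID> rebalanced) {
        return rebalanced.containsAll(expectedOwners);
    }
}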

The PME-free switch does not block, cancel, or wait for current operations.
It just recovers the failed primaries and, once cluster-wide recovery has
finished, completes the exchange future (which allows new operations).
In other words, new-topology operations are now allowed immediately after
cluster-wide recovery finishes (which is fast), not after previous-topology
operations finish + recovery + PME, as it was before.
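
A rough sketch of that flow follows (hypothetical names, not the actual
exchange code; the stubs stand in for the real recovery and exchange
machinery):

import java.util.UUID;
import java.util.concurrent.CompletableFuture;

class PmeFreeSwitchSketch {
    /** Called on every node when a baseline node fails or leaves. */
    CompletableFuture<Void> onBaselineNodeLeft(UUID failedNode) {
        // 1. Recover only the transactions whose primary was the failed node.
        //    In-flight operations on other primaries are neither blocked,
        //    cancelled nor awaited.
        CompletableFuture<Void> localRecovery = recoverFailedPrimaries(failedNode);

        // 2. Once every node reports recovery completion (fast), complete the
        //    exchange future, which unblocks new-topology operations.
        return localRecovery
            .thenCompose(v -> awaitClusterWideRecovery())
            .thenRun(this::completeExchangeFuture);
    }

    // Stubs: the real logic lives in the exchange / tx recovery code.
    CompletableFuture<Void> recoverFailedPrimaries(UUID node) {
        return CompletableFuture.completedFuture(null);
    }

    CompletableFuture<Void> awaitClusterWideRecovery() {
        return CompletableFuture.completedFuture(null);
    }

    void completeExchangeFuture() {
        // Allow new-topology operations.
    }
}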

Limitations:
The optimization works only when the baseline is set (alive primaries
require no relocation; only the failed primaries require recovery).
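
For reference, a minimal sketch of activating the cluster and setting the
baseline so the optimization can apply (public API as far as I remember it;
instance names and configuration are arbitrary):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class BaselineSetup {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start(
            new IgniteConfiguration().setIgniteInstanceName("srv"))) {
            // Activation fixes the baseline topology (persistent clusters).
            ignite.cluster().active(true);

            // After a topology change, the baseline can be reset explicitly
            // to the current topology version.
            ignite.cluster().setBaselineTopology(ignite.cluster().topologyVersion());
        }
    }
}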

BTW, the PME-free code itself is small and simple; most of the changes are
related to the "rebalanced cluster guarantee".

It would be nice if someone could check the code and tests.
Test coverage proposals are welcome!

P.S. Here's the benchmark used to check that we get the performance
improvements listed above:
https://github.com/anton-vinogradov/ignite/blob/426f0de9185ac2938dadb7bd2cbfe488233fe7d6/modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/Bench.java
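
(That Bench.java is the reference; below is just a rough sketch of the kind
of measurement such a benchmark performs - time cache operations from a
client while one server node is stopped. Names, cache settings and
iteration counts are arbitrary, and the real benchmark presumably also
configures backups and the baseline.)

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class NodeLeftLatencySketch {
    public static void main(String[] args) {
        Ignition.start(new IgniteConfiguration().setIgniteInstanceName("srv1"));
        Ignition.start(new IgniteConfiguration().setIgniteInstanceName("srv2"));

        Ignite client = Ignition.start(new IgniteConfiguration()
            .setIgniteInstanceName("client")
            .setClientMode(true));

        IgniteCache<Integer, Integer> cache = client.getOrCreateCache("bench");

        // Stop one server in the background and keep writing from the client,
        // tracking the worst observed put() latency around the switch.
        new Thread(() -> Ignition.stop("srv2", true)).start();

        long maxNanos = 0;

        for (int i = 0; i < 10_000; i++) {
            long start = System.nanoTime();

            cache.put(i, i);

            maxNanos = Math.max(maxNanos, System.nanoTime() - start);
        }

        System.out.println("Max put() latency, ms: " + maxNanos / 1_000_000);

        Ignition.stopAll(true);
    }
}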

P.P.S. We also checked the same case in a real environment and found no
latency degradation.

On Thu, Sep 19, 2019 at 2:18 PM Anton Vinogradov <av...@apache.org> wrote:

> Ivan,
>
> A useful profile can be produced only by a reproducible benchmark.
> We're working on such a benchmark to analyze the duration and measure the
> improvement.
>
> Currently, only the typical overall duration from prod servers is known.
>
> On Thu, Sep 19, 2019 at 12:09 PM Павлухин Иван <vo...@gmail.com>
> wrote:
>
>> Anton, folks,
>>
>> Out of curiosity. Do we have some measurements of PME execution in an
>> environment similar to some real-world one? In my mind some kind of
>> "profile" showing durations of different PME phases with an indication
>> where we are "blocked" would be ideal.
>>
>> Wed, Sep 18, 2019 at 13:27, Anton Vinogradov <av...@apache.org>:
>> >
>> > Alexey,
>> >
>> > >> Can you describe the complete protocol changes first
>> > The current goal is to find a better way.
>> > I had at least 5 scenarios discarded because of finding corner cases
>> > (Thanks to Alexey Scherbakov, Aleksei Plekhanov and Nikita Amelchev).
>> > That's why I explained what we are able to improve and why I think it works.
>> >
>> > >> we need to remove this property, not add new logic that relies on it.
>> > Agree.
>> >
>> > >> How are you going to synchronize this?
>> > Thanks for providing this case; it seems it discards the #1 + #2.2 case,
>> > and #2.1 is still possible with some optimizations.
>> >
>> > "Zombie eating transactions" case can be theoretically solved, I think.
>> > As I said before we may perform "Local switch" in case affinity was not
>> > changed (except failed mode miss) other cases require regular PME.
>> > In this case, new-primary is an ex-backup and we expect that old-primary
>> > will try to update new-primary as a backup.
>> > New primary will handle operations as a backup until it notified it's a
>> > primary now.
>> > Operations from ex-primary will be discarded or remapped once
>> new-primary
>> > notified it became the primary.
>> >
>> > Discarding is a big improvement,
>> > remapping is a huge improvement,
>> > but there is no 100% guarantee that the ex-primary will try to update the
>> > new-primary as a backup.
>> >
>> > There are a lot of corner cases here.
>> > So, it seems a minimized sync is the better solution.
>> >
>> > Finally, according to your and Alexey Scherbakov's comments, the better
>> > option is just to improve PME to wait for less, at least for now.
>> > It seems we have to wait for (or cancel, I vote for this option - any
>> > objections?) operations related to the failed primaries and wait for
>> > recovery to finish (which is fast).
>> > In case the affinity was changed and a backup-primary switch (not related
>> > to the failed primaries) is required between the owners, or even a
>> > rebalance is required, we should just ignore this and perform only a
>> > "Recovery PME".
>> > A regular PME should happen later (if necessary); it can even be delayed
>> > (if necessary).
>> >
>> > Sounds good?
>> >
>> > On Wed, Sep 18, 2019 at 11:46 AM Alexey Goncharuk <
>> > alexey.goncharuk@gmail.com> wrote:
>> >
>> > > Anton,
>> > >
>> > > I looked through the presentation and the ticket but did not find any
>> new
>> > > protocol description that you are going to implement. Can you
>> describe the
>> > > complete protocol changes first (to save time for both you and during
>> the
>> > > later review)?
>> > >
>> > > Some questions that I have in mind:
>> > >  * It looks like for the "Local Switch" optimization you assume that
>> > > node
>> > > failure happens immediately for the whole cluster. This is not true -
>> some
>> > > nodes may "see" a node A failure, while others still consider it
>> alive.
>> > > Moreover, node A may not know yet that it is about to be failed and
>> process
>> > > requests correctly. This may happen, for example, due to a long GC
>> pause on
>> > > node A. In this case, node A resumes its execution and proceeds to
>> > > work as
>> > > a primary (it has not received a segmentation event yet), node B also
>> did
>> > > not receive the A FAILED event yet. Node C received the event, ran the
>> > > "Local switch" and assumed a primary role, node D also received the A
>> > > FAILED event and switched to the new primary. Transactions from nodes
>> B and
>> > > D will be processed simultaneously on different primaries. How are you
>> > > going to synchronize this?
>> > >  * IGNITE_MAX_COMPLETED_TX_COUNT is fragile and we need to remove this
>> > > property, not add new logic that relies on it. There is no way a user
>> can
>> > > calculate this property or adjust it in runtime. If a user decreases
>> this
>> > > property below a safe value, we will get inconsistent update counters
>> and
>> > > cluster desync. Besides, IGNITE_MAX_COMPLETED_TX_COUNT is quite a
>> large
>> > > value and will push HWM forward very quickly, much faster than during
>> > > regular updates (you will have to increment it for each partition)
>> > >
>> > > Wed, Sep 18, 2019 at 10:53, Anton Vinogradov <av...@apache.org>:
>> > >
>> > > > Igniters,
>> > > >
>> > > > Recently we had a discussion devoted to the non-blocking PME.
>> > > > We agreed that the most important case is blocking on node failure,
>> > > > and it can be split into:
>> > > >
>> > > > 1) The latency of operations on affected partitions will be increased
>> > > > by the node failure detection duration.
>> > > > So, some operations may be frozen for 10+ seconds on real clusters,
>> > > > just waiting for a failed primary's response.
>> > > > In other words, some operations will be blocked even before the
>> > > > blocking PME starts.
>> > > >
>> > > > The good news here is that a bigger cluster decreases the percentage
>> > > > of blocked operations.
>> > > >
>> > > > The bad news is that these operations may block non-affected
>> > > > operations in:
>> > > > - customer code (single-thread/striped pool usage)
>> > > > - multikey operations (tx1 locked A and waits for failed B,
>> > > > non-affected tx2 waits for A)
>> > > > - striped pools inside AI (when some task waits for tx.op() in a sync
>> > > > way and the striped thread is busy)
>> > > > - etc.
>> > > >
>> > > > It seems we already always send the node-left event before the node
>> > > > stops, thanks to StopNodeFailureHandler (if configured), to minimize
>> > > > the waiting period.
>> > > > So, only the cases that cause a hang without a stop are a problem now.
>> > > >
>> > > > Anyway, some additional research is required here, and it would be
>> > > > nice if someone is willing to help.
>> > > >
>> > > > 2) Some optimizations may speed up the node-left case (eliminate
>> > > > blocking of upcoming operations).
>> > > > A full list can be found in the presentation [1].
>> > > > The list contains 8 optimizations, but I propose to implement some in
>> > > > phase one and the rest in phase two.
>> > > > Assuming that a real production deployment has Baseline enabled, we
>> > > > are able to gain a speed-up by implementing the following:
>> > > >
>> > > > #1 Switch on the node_fail/node_left event locally instead of
>> > > > starting a real PME (Local switch).
>> > > > Since BLT is enabled, we are always able to switch to the new-affinity
>> > > > primaries (no need to preload partitions).
>> > > > In case we're not able to switch to the new-affinity primaries (all
>> > > > missed or BLT disabled), we'll just start a regular PME.
>> > > > The new-primary calculation can be performed locally or by the
>> > > > coordinator (e.g. attached to the node_fail message).
>> > > >
>> > > > #2 We should not wait for the completion of any already started
>> > > > operations (since they are not related to the failed primary
>> > > > partitions).
>> > > > The only problem is recovery, which may cause update-counter
>> > > > duplications in case of an unsynced HWM.
>> > > >
>> > > > #2.1 We may wait only for recovery completion (Micro-blocking
>> switch).
>> > > > Just block (all at this phase) upcoming operations during the
>> recovery by
>> > > > incrementing the topology version.
>> > > > So in other words, it will be some kind of PME with waiting, but it
>> will
>> > > > wait for recovery (fast) instead of finishing current operations
>> (long).
>> > > >
>> > > > #2.2 Recovery, theoretically, can be async.
>> > > > We have to solve the unsynced HWM issue (to avoid concurrent usage
>> > > > of the same counters) to make it happen.
>> > > > We may just increment HWM with IGNITE_MAX_COMPLETED_TX_COUNT at
>> > > new-primary
>> > > > and continue recovery in an async way.
>> > > > Currently, IGNITE_MAX_COMPLETED_TX_COUNT specifies the number of
>> > > committed
>> > > > transactions we expect between "the first backup committed tx1" and
>> "the
>> > > > last backup committed the same tx1".
>> > > > I propose to use it to specify the number of prepared transactions
>> we
>> > > > expect between "the first backup prepared tx1" and "the last backup
>> > > > prepared the same tx1".
>> > > > Both cases look pretty similar.
>> > > > In this case, we are able to make the switch fully non-blocking,
>> > > > with async recovery.
>> > > > Thoughts?
>> > > >
>> > > > So, I'm going to implement both improvements in the "Lightweight
>> > > > version of partitions map exchange" issue [2] if no one minds.
>> > > >
>> > > > [1]
>> > > >
>> > > >
>> > >
>> https://docs.google.com/presentation/d/1Ay7OZk_iiJwBCcA8KFOlw6CRmKPXkkyxCXy_JNg4b0Q/edit?usp=sharing
>> > > > [2] https://issues.apache.org/jira/browse/IGNITE-9913
>> > > >
>> > >
>>
>>
>>
>> --
>> Best regards,
>> Ivan Pavlukhin
>>
>

Re: Non-blocking PME Phase One (Node fail)

Posted by Alexei Scherbakov <al...@gmail.com>.
Anton,

Thanks for your effort on reducing PME blocking time.
I'm looking at it.



-- 

Best regards,
Alexei Scherbakov