Posted to dev@ignite.apache.org by Ilya Lantukh <il...@gridgain.com> on 2018/09/13 12:58:08 UTC

The future of Affinity / Topology concepts and possible PME optimizations.

Igniters,

As most of you know, Ignite has a concept of AffinityTopologyVersion, which
is associated with the set of nodes currently present in the topology and
with the global cluster state (active/inactive, baseline topology, started
caches). Modifying either of them involves a process called Partition Map
Exchange (PME) and results in a new AffinityTopologyVersion. During PME all
new cache and compute grid operations are globally "frozen", which might
lead to cache downtimes of unpredictable duration.
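
For context, the combined version is essentially an ordered (major, minor)
pair. Below is a simplified sketch of how it behaves; the real class lives
in Ignite internals, and the comments reflect my reading of it:

final class AffinityTopologyVersion
    implements Comparable<AffinityTopologyVersion> {
    /** Major part: incremented on every discovery event (node join/leave). */
    private final long topVer;

    /** Minor part: incremented on "custom" events within the same topology,
     * e.g. a cache start or an affinity change after rebalance. */
    private final int minorTopVer;

    AffinityTopologyVersion(long topVer, int minorTopVer) {
        this.topVer = topVer;
        this.minorTopVer = minorTopVer;
    }

    @Override public int compareTo(AffinityTopologyVersion other) {
        int cmp = Long.compare(topVer, other.topVer);
        return cmp != 0 ? cmp : Integer.compare(minorTopVer, other.minorTopVer);
    }
}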

However, our recent changes (esp. the introduction of Baseline Topology)
caused me to re-think these concepts. Currently there are many cases when
we trigger PME even though it isn't necessary. For example, adding or
removing a client node, or a server node that is not in the BLT, should
never cause partition map modifications. Such events modify the *topology*,
but *affinity* is unaffected. On the other hand, there are events that
affect only *affinity* - the most straightforward example is the
CacheAffinityChange event, which is triggered after rebalance finishes to
assign new primary/backup nodes. So the term *AffinityTopologyVersion* now
looks weird - it tries to "merge" two entities that aren't always related.
To me it makes sense to introduce separate *AffinityVersion* and
*TopologyVersion*, review all events that currently modify
AffinityTopologyVersion, and split them into 3 categories: those that
modify only AffinityVersion, only TopologyVersion, or both. This will allow
us to process such events using different mechanics and avoid redundant
steps, and also to reconsider the mapping of operations - some will be
mapped to topology, others to affinity.
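
To make the categorization concrete, here is a rough sketch. Every name
below is tentative and invented for illustration, not existing API, and the
classification follows my reading of the cases listed next:

final class VersionChangeClassifier {
    /** Illustrative subset of events that currently trigger PME. */
    enum ExchangeEvent {
        CLIENT_NODE_JOIN, CLIENT_NODE_LEAVE,
        NON_BLT_SERVER_JOIN, NON_BLT_SERVER_LEAVE,
        BLT_SERVER_JOIN, BLT_SERVER_LEAVE,
        CACHE_START, CACHE_STOP,
        AFFINITY_CHANGE_AFTER_REBALANCE
    }

    enum Scope { TOPOLOGY_ONLY, AFFINITY_ONLY, BOTH }

    static Scope scopeOf(ExchangeEvent evt) {
        switch (evt) {
            // Node set changes, but partition maps stay intact.
            case CLIENT_NODE_JOIN:
            case CLIENT_NODE_LEAVE:
            case NON_BLT_SERVER_JOIN:
            case NON_BLT_SERVER_LEAVE:
                return Scope.TOPOLOGY_ONLY;

            // Node set is unchanged, but assignments change.
            case AFFINITY_CHANGE_AFTER_REBALANCE:
            case CACHE_START:
            case CACHE_STOP:
                return Scope.AFFINITY_ONLY;

            // Baseline server join/leave affects both.
            default:
                return Scope.BOTH;
        }
    }
}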

Here is my view on how the different event types could theoretically be
optimized:
1. Client node start / stop: as stated above, no PME is needed; ticket
https://issues.apache.org/jira/browse/IGNITE-9558 is already in progress.
2. Server node start / stop not from baseline: should be similar to the
previous case, since nodes outside of the baseline cannot be partition
owners.
3. Start of a node in baseline: both affinity and topology versions should
be incremented, but it might be possible to optimize PME for such a case
and avoid a cluster-wide freeze. Partition assignments for such a node are
already calculated, so we can simply put them all into MOVING state (see
the sketch after this list). However, it might take significant effort to
avoid race conditions and redesign our architecture.
4. Cache start / stop: starting or stopping one cache doesn't modify
partition maps for other caches. It should be possible to change this
procedure to skip PME and perform all necessary actions (compute affinity,
start/stop cache contexts on each node) in the background, but it looks
like a very complex modification too.
5. Rebalance finish: it seems possible to design a "lightweight" PME for
this case as well. If there were no node failures (and if there were, PME
should be triggered and rebalance should be cancelled anyway), all
partition states are already known by the coordinator. Furthermore, no new
MOVING or OWNING node is introduced for any partition, so all previous
mappings should still be valid.
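
As a rough illustration of case 3 above, the following hypothetical handler
marks every partition assigned to the joining baseline node as MOVING
without a cluster-wide freeze; all names here are made up, and
MOVING/OWNING only mirror Ignite's partition state names:

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class BaselineJoinSketch {
    enum PartitionState { MOVING, OWNING }

    /** Affinity for the joining baseline node is already calculated, so we
     * can seed all of its assigned partitions as MOVING and let rebalance
     * promote them to OWNING, instead of freezing the whole cluster. */
    static Map<Integer, PartitionState> initialStates(Set<Integer> assignedParts) {
        Map<Integer, PartitionState> states = new HashMap<>(assignedParts.size());

        for (int partId : assignedParts)
            states.put(partId, PartitionState.MOVING);

        return states;
    }
}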

For the latter, more complex cases it might be necessary to introduce an
"is compatible" relationship between affinity versions. An operation needs
to be remapped only if the new version isn't compatible with the previous
one.
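
A minimal sketch of that relationship; again, every name here is invented
and nothing below is existing Ignite API:

interface AffinityHistory {
    /** @return True if partition assignments are identical on both versions. */
    boolean isCompatible(long mappedAffVer, long curAffVer);
}

final class RemapPolicy {
    /** An operation mapped on an older affinity version is remapped only if
     * the assignments actually changed in between. */
    static boolean needsRemap(AffinityHistory hist, long mappedAffVer, long curAffVer) {
        return mappedAffVer != curAffVer && !hist.isCompatible(mappedAffVer, curAffVer);
    }
}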

Please share your thoughts.

-- 
Best regards,
Ilya

Re: The future of Affinity / Topology concepts and possible PME optimizations.

Posted by Anton Vinogradov <av...@apache.org>.
Ilya,

I like your idea.
Let's create an IEP and Jira issues.
I will be glad to take part in this journey once we have discussed every
optimization in detail.



Re: The future of Affinity / Topology concepts and possible PME optimizations.

Posted by Dmitriy Setrakyan <ds...@apache.org>.
Ilya,

Thanks for the detailed explanation. Everything you suggested makes sense.
Needless to say, PME's effect on the grid should be minimized. Let's start
with the simpler and less risky changes first.

D.


Re: The future of Affinity / Topology concepts and possible PME optimizations.

Posted by Dmitriy Setrakyan <ds...@apache.org>.
Ilya, I would suggest that you discuss everything here on the dev@ list,
including suggestions on how to split the work.

D.


Re: The future of Affinity / Topology concepts and possible PME optimizations.

Posted by Ilya Lantukh <il...@gridgain.com>.
Thanks for the feedback!

I agree that we should start with the simplest optimizations, but it seems
that decoupling affinity/topology versions is necessary before we can make
any progress here, and this is a rather complex change all over the code.

If anyone wants to help, please contact me privately and we will discuss
how this work can be split.

Denis Magda, do you think we should create an IEP for these optimizations?

-- 
Best regards,
Ilya

Re: The future of Affinity / Topology concepts and possible PME optimizations.

Posted by Maxim Muzafarov <ma...@gmail.com>.
Ilya,


> 3. Start of a node in baseline: both affinity and topology versions
should be incremented, but it might be possible to optimize PME for such a
case and avoid a cluster-wide freeze. Partition assignments for such a node
are already calculated, so we can simply put them all into MOVING state.
However, it might take significant effort to avoid race conditions and
redesign our architecture.

As you mentioned, all assignments are already calculated. So, as another
suggestion, we could introduce a new `intermediate` state for such joining
nodes. While in this state, a node recovers all data from its local
storage, preloads all missed partition data from the cluster (probably as
of some point in time), and creates and preloads the missed in-memory and
persistent caches. Only after this recovery would such a node fire the
discovery join event, and only then would the affinity/topology version be
incremented. I think this approach can help reduce the subsequent rebalance
time.
WDYT?
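
Roughly, the lifecycle I have in mind looks like this (just a sketch, the
names are invented):

enum JoiningNodeState {
    RECOVERING, // restoring data from the node's local persistent storage
    PRELOADING, // pulling missed partition data and caches from the cluster
    JOINED      // discovery join event fired; affinity/topology version bumped
}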



-- 
--
Maxim Muzafarov

Re: The future of Affinity / Topology concepts and possible PME optimizations.

Posted by Alexey Goncharuk <al...@gmail.com>.
Ilya,

This is a great idea, but before we can ultimately decouple the affinity
version from the topology version, we need to fix a few things with
baseline topology first. Currently the in-memory caches are not using the
baseline topology. We are going to fix this as a part of IEP-4 Phase II
(baseline auto-adjust). Once fixed, we can safely assume that
out-of-baseline node does not affect affinity distribution.

Agree with Dmitriy that we should start with simpler optimizations first.
