Posted to dev@ignite.apache.org by Maxim Muzafarov <ma...@gmail.com> on 2018/08/03 06:44:04 UTC

Exchange gets stuck while a node is restoring state from WAL

Hi Igniters,


I'm working on bug [1] and have some questions about the final
implementation. I've probably already found answers to some of them,
but I want to be sure. Please help me clarify the details.


The key problem here is that we read the WAL and restore the memory
state of a newly joined node inside PME. Reading the WAL can consume
a huge amount of time, so the whole cluster gets stuck waiting for
that single node.
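
To make it concrete, the startup ordering I'd like to end up with looks
roughly like this (an illustrative sketch only; these are not the real
internal method names):

    // Illustrative ordering, not Ignite's actual internals. The point is
    // that WAL-based recovery is purely local, so it can complete before
    // the node joins the ring and before any PME starts.
    void startPersistentNode() {
        readMetastore();       // read local metastorage, no cluster needed
        restoreBinaryMemory(); // replay checkpoint pages + WAL delta records
        joinRing();            // only now become visible to the cluster
        initiateExchange();    // PME no longer waits on a long WAL replay
    }

    // Empty stubs so the sketch stands alone; each maps to a real phase.
    void readMetastore()       { /* metastorage read-only start */ }
    void restoreBinaryMemory() { /* binary (physical) recovery */ }
    void joinRing()            { /* discovery join */ }
    void initiateExchange()    { /* first exchange future */ }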


1) Is it correct that readMetastore() happens after the node starts
but before the node is included into the ring?

2) Once onDone() has been called for the LocalJoinFuture on the local
node, can we proceed with initiating PME on the local node?

3) After reading the checkpoint and restoring memory for a newly joined
node, how and when do we update obsolete partition update counters?
During historical rebalance, right?

4) Should we run restoreMemory for a newly joined node before PME is
initiated on the other nodes in the cluster?

5) In our final solution, should readMetastore and restoreMemory be
performed in one step for a newly joined node?


[1] https://issues.apache.org/jira/browse/IGNITE-7196
-- 
--
Maxim Muzafarov

Re: Exchange gets stuck while a node is restoring state from WAL

Posted by Maxim Muzafarov <ma...@gmail.com>.
Pavel, Dmitry G.,

Thank you for the help and for reviewing this improvement!

Igniters,

I'd like to finalize this topic with the following changes [1], `Update
Grid(Ignite)CacheDatabaseSharedManager according to accepted coding
guidelines`. I consciously excluded them from the original PR so as not
to overcomplicate the review.

Please consider including these changes in the master branch too:
 - createDataRegionConfiguration has an unobvious name – it creates the
metastore data region config
 - rename startMemoryPolicies to startDataRegion, or remove it
 - fix the code style of output message formatting
 - the register/unregister methods for the MBean can be simplified (see
the sketch after this list)
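
For illustration, the MBean register/unregister pair could collapse to
two small symmetric helpers along these lines (a sketch using plain JMX
only; the field, method names and ObjectName format here are made up for
the example, they are not the exact Ignite helper API):

    import java.lang.management.ManagementFactory;
    import javax.management.JMException;
    import javax.management.ObjectName;

    // One field plus two symmetric helpers instead of duplicated
    // registration logic spread across the manager.
    private ObjectName dbMBeanName;

    private void registerDbMBean(Object mbean, String name) throws JMException {
        dbMBeanName = new ObjectName("org.apache.ignite:group=DataStorage,name=" + name);
        ManagementFactory.getPlatformMBeanServer().registerMBean(mbean, dbMBeanName);
    }

    private void unregisterDbMBean() throws JMException {
        if (dbMBeanName != null)
            ManagementFactory.getPlatformMBeanServer().unregisterMBean(dbMBeanName);
    }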

I've prepared everything required for the review.

[1] https://issues.apache.org/jira/browse/IGNITE-10033
[2] https://reviews.ignite.apache.org/ignite/review/IGNT-CR-927
[3] https://github.com/apache/ignite/pull/5156
[4]
https://ci.ignite.apache.org/viewLog.html?buildId=2184711&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_RunAll



On Tue, 18 Sep 2018 at 18:02 Maxim Muzafarov <ma...@gmail.com> wrote:

> Igniters,
>
> I would like this issue to be part of release 2.7. There is not much time
> left, considering that I will have to address comments after the review.
>
> Will anyone help with the review?
>
>
> PR: https://github.com/apache/ignite/pull/4520/files
> Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
> JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
>
> On Thu, 13 Sep 2018 at 11:27 Maxim Muzafarov <ma...@gmail.com> wrote:
>
>> Igniters,
>>
>> I need your help with the review of this patch. In general, it will help
>> to reduce PME duration. I've moved binary memory recovery to the moment
>> of node startup, as Pavel and I discussed earlier in this topic.
>>
>> Changes are relatively small (+299 −95) and they are ready. I've left a
>> comment in JIRA with high-level implementation details. Hope this will
>> help with the review [1].
>>
>> Please, take a look.
>> PR: https://github.com/apache/ignite/pull/4520/files
>> Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
>> JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
>> TC Run All: [2]
>>
>>
>> [1]
>> https://issues.apache.org/jira/browse/IGNITE-7196?focusedCommentId=16613175&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16613175
>> [2]
>> https://ci.ignite.apache.org/viewLog.html?buildId=1835822&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_RunAll
>>
>>
>> On Tue, 28 Aug 2018 at 17:59 Pavel Kovalenko <jo...@gmail.com> wrote:
>>
>>> Hello Maxim,
>>>
>>> I think you're going in the right direction, but why do you perform
>>> binary recovery only in the case when the node is in the baseline?
>>> I think that a node can also perform binary recovery before the first
>>> exchange is started in the case of:
>>> 1) First activation
>>> 2) Baseline change
>>>
>>> In all of these cases, you need just the locally available
>>> CacheDescriptors to restore the binary state from the WAL.
>>> I would also like to see a test where Ignite is stopped in the middle of
>>> a checkpoint and binary recovery succeeds before PME in that case as well.
>>>
>>>
>>> On Wed, 22 Aug 2018 at 15:31, Pavel Kovalenko <jo...@gmail.com>:
>>>
>>> > Hello Maxim,
>>> >
>>> > Thank you for your work. I will review your changes within the next
>>> > several days.
>>> >
>>> > 2018-08-22 11:12 GMT+03:00 Maxim Muzafarov <ma...@gmail.com>:
>>> >
>>> >> Pavel,
>>> >>
>>> >> Thank you for such a detailed introduction to the root of the problem
>>> >> in ticket IGNITE-7196.
>>> >> As you mentioned before, we can divide this ticket into two sub-tasks.
>>> >> So, I will file a new issue for `Logical consistency recovery`. I think
>>> >> it's a more convenient way to review and merge each of these high-level
>>> >> steps (PRs).
>>> >>
>>> >> For now, let's focus on `Physical (binary) consistency recovery` as the
>>> >> most critical part of this issue and synchronize our vision, and that
>>> >> of other community members, on the implementation details of Step 1.
>>> >> From this point on, everything I say concerns binary recovery only.
>>> >>
>>> >>
>>> >> I think PR is almost ready (I need to check TC carefully).
>>> >> Can you look at my changes? Am I going the right way?
>>> >>
>>> >> PR: https://github.com/apache/ignite/pull/4520
>>> >> Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
>>> >> JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
>>> >>
>>> >>
>>> >> These are the milestones I've tried to address:
>>> >> 1. On first-time cluster activation, no changes are required. We should
>>> >> keep this behavior.
>>> >>
>>> >> 2. Call `startMemoryPolicies` earlier if needed.
>>> >> In the master branch, data regions currently start when the onActivate
>>> >> method is called. If we are trying to restore the binary state when a
>>> >> node reconnects to the cluster, we should start the regions earlier.
>>> >>
>>> >> 3. A `suspendLogging` method is added to switch off handlers.
>>> >> This keeps PDS cleanup fast when the joining node is not in the BLT (or
>>> >> belongs to another BLT). Memory restore and metastorage initialization
>>> >> require `WALWriter` and `FileWriteHandle` to be started; they can lock
>>> >> directories/files and block cleanups.
>>> >>
>>> >> 4. An `isBaselineNode` check before resuming logging.
>>> >> The same as point 3. The local node gets discovery data and evaluates
>>> >> #isLocalNodeNotInBaseline. If the node is not in the current baseline,
>>> >> it cleans up the local PDS, and we should reset the state of the
>>> >> metastorage instance previously initialized at startup.
>>> >>
>>> >> 5. A `cacheProcessorStarted` callback is added to restore memory at
>>> >> startup. We can restore memory only after we have info about all
>>> >> `CacheGroupDescriptor`s. To achieve this, we should do the restore at
>>> >> the moment `GridCacheProcessor` has started.
>>> >>
>>> >> 6. Move the `fsync` for `MemoryRecoveryRecord` to node startup.
>>> >> We currently fsync it inside a running PME. I suppose it is used for
>>> >> recovery actions and/or for controlling cluster state outside the
>>> >> Ignite project (the `nodeStart` and `nodeStartedPointers` methods are
>>> >> not used in the Ignite project code). I've moved them to node startup,
>>> >> but maybe we should dump the timestamp only at the time of node
>>> >> activation. Not sure.
>>> >>
>>> >> 7. A `testBinaryRecoverBeforeExchange` test is added to check the
>>> >> restore before PME.
>>> >> I'm not sure we should repeat the WAL-reading logic as it was done in
>>> >> the `testApplyDeltaRecords` test for checking the memory restore. I
>>> >> suppose the minimally necessary condition is to check the
>>> >> `lastRestored` and `checkpointHistory` states of the database manager
>>> >> (they should be set before the node joins the cluster).
>>> >>
>>> >> I will create a separate ticket and refactor these minor issues after
>>> >> we finish with 7196:
>>> >> - simplify the registerMBeans methods for
>>> >> `Ignite\GridCacheDatabaseSharedManager`
>>> >> - add missing @Override annotations
>>> >> - fix abbreviations according to IdeaAbbrPlugin
>>> >> - the createDataRegionConfiguration method can be removed and simplified
>>> >> - rename startMemoryPolicies to startDataRegion
>>> >> - fix the code style of output message formatting
>>> >> - fix javadoc formatting
>>> >> - make inner classes private
>>> >> - add missing logging for cleanupCheckpointDirectory,
>>> >> cleanupCheckpointDirectory, cleanupWalDirectories
>>> >>
>>> >> On Fri, 3 Aug 2018 at 18:20 Pavel Kovalenko <jo...@gmail.com>
>>> wrote:
>>> >>
>>> >> > Maxim,
>>> >> >
>>> >> > In addition to my answer, I propose to divide the whole task into 2
>>> >> > steps:
>>> >> > 1) Physical consistency recovery before PME.
>>> >> > 2) Logical consistency recovery before PME.
>>> >> >
>>> >> > The first step can be done before the discovery manager starts, as
>>> >> > it needs no information about caches; it operates only with pages'
>>> >> > binary data. At this step, we can do points a) and b) from answer 3)
>>> >> > and fsync the state of the page memory.
>>> >> > The second step is much harder to implement, as it requires major
>>> >> > changes in the Discovery SPI. The discovery node-join event is
>>> >> > triggered on the coordinator after a node successfully joins the
>>> >> > cluster. To avoid this, we should introduce a new node state and fire
>>> >> > the discovery event only after logical recovery has finished. I doubt
>>> >> > that logical recovery consumes much of the exchange time, but
>>> >> > physical recovery does, as it performs an fsync on disk after restore.
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> > 2018-08-03 17:46 GMT+03:00 Pavel Kovalenko <jo...@gmail.com>:
>>> >> >
>>> >> > > Hello Maxim,
>>> >> > >
>>> >> > > 1) Yes, the Discovery Manager starts after GridCacheProcessor,
>>> >> > > which starts GridCacheDatabaseSharedManager, which invokes
>>> >> > > readMetastorage on start.
>>> >> > >
>>> >> > > 2) Before we complete the local join future, we create an Exchange
>>> >> > > future for the local node join and add it to the ExchangeManager.
>>> >> > > So, when the local join future completes, the first PME on that
>>> >> > > node should already be running, or at least the first Exchange
>>> >> > > future will be in the Exchange worker queue.
>>> >> > >
>>> >> > > 3) I don't exactly understand what you mean by "update obsolete
>>> >> > > partition counter", but the process is the following:
>>> >> > >     a) We restore the page memory state (which contains partition
>>> >> > > update counters) from the last checkpoint.
>>> >> > >     b) We repair the page memory state using physical delta records
>>> >> > > from the WAL.
>>> >> > >     c) We apply logical records since the last checkpoint-finish
>>> >> > > marker. During this iteration, if we meet a DataRecord, we increment
>>> >> > > the update counter on the appropriate partition. NOTE: this phase
>>> >> > > currently happens during exchange.
>>> >> > >     After that, partition exchange starts and information about
>>> >> > > actual update counters is collected from the FullMessage. If
>>> >> > > partition counters are outdated, rebalance will start after PME ends.
>>> >> > >
>>> >> > > 4) I think yes, because everything you need is the cache
>>> >> > > descriptors of the caches currently started on the grid. Some
>>> >> > > information can be retrieved from the static configuration.
>>> >> > > Information about dynamically started caches can be retrieved from
>>> >> > > the Discovery Data Bag received when a node joins the ring. See
>>> >> > > org.apache.ignite.internal.processors.cache.ClusterCachesInfo#
>>> >> > > onGridDataReceived
>>> >> > >
>>> >> > > 5) I think the main problem is that the metastorage and the page
>>> >> > > memory share one WAL. But if this phase happens before PME is
>>> >> > > started on all cluster nodes, I think it's acceptable that they use
>>> >> > > 2 WAL iterations.
>>> >> > >
>>> >> > > 2018-08-03 10:28 GMT+03:00 Nikolay Izhikov <ni...@apache.org>:
>>> >> > >
>>> >> > >> Hello, Maxim.
>>> >> > >>
>>> >> > >> > 1) Is it correct that readMetastore() happens after the node
>>> >> > >> > starts but before the node is included into the ring?
>>> >> > >>
>>> >> > >> I think yes.
>>> >> > >> You can have some kind of metainformation required on node join.
>>> >> > >>
>>> >> > >> > 5) In our final solution, should readMetastore and
>>> >> > >> > restoreMemory be performed in one step for a newly joined node?
>>> >> > >>
>>> >> > >> I think, no.
>>> >> > >>
>>> >> > >> Meta information can be required to perform the memory restore.
>>> >> > >> So we have to restore the meta information in the first step and
>>> >> > >> restore the whole memory in a second step.
>>> >> > >>
>>> >> > >> On Fri, 03/08/2018 at 09:44 +0300, Maxim Muzafarov wrote:
>>> >> > >> > Hi Igniters,
>>> >> > >> >
>>> >> > >> >
>>> >> > >> > I'm working on bug [1] and have some questions about the final
>>> >> > >> > implementation. I've probably already found answers to some of
>>> >> > >> > them, but I want to be sure. Please help me clarify the details.
>>> >> > >> >
>>> >> > >> >
>>> >> > >> > The key problem here is that we read the WAL and restore the
>>> >> > >> > memory state of a newly joined node inside PME. Reading the WAL
>>> >> > >> > can consume a huge amount of time, so the whole cluster gets
>>> >> > >> > stuck waiting for that single node.
>>> >> > >> >
>>> >> > >> >
>>> >> > >> > 1) Is it correct that readMetastore() happens after the node
>>> >> > >> > starts but before the node is included into the ring?
>>> >> > >> >
>>> >> > >> > 2) Once onDone() has been called for the LocalJoinFuture on the
>>> >> > >> > local node, can we proceed with initiating PME on the local node?
>>> >> > >> >
>>> >> > >> > 3) After reading the checkpoint and restoring memory for a newly
>>> >> > >> > joined node, how and when do we update obsolete partition update
>>> >> > >> > counters? During historical rebalance, right?
>>> >> > >> >
>>> >> > >> > 4) Should we run restoreMemory for a newly joined node before
>>> >> > >> > PME is initiated on the other nodes in the cluster?
>>> >> > >> >
>>> >> > >> > 5) In our final solution, should readMetastore and restoreMemory
>>> >> > >> > be performed in one step for a newly joined node?
>>> >> > >> >
>>> >> > >> >
>>> >> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-7196
>>> >> > >>
>>> >> > >
>>> >> > >
>>> >> >
>>> >> --
>>> >> --
>>> >> Maxim Muzafarov
>>> >>
>>> >
>>> >
>>>
>> --
>> --
>> Maxim Muzafarov
>>
> --
> --
> Maxim Muzafarov
>
-- 
--
Maxim Muzafarov

Re: Exchange gets stuck while a node is restoring state from WAL

Posted by Maxim Muzafarov <ma...@gmail.com>.
Igniters,

I would like this issue to be part of release 2.7. There is not much time
left, considering that I will have to address comments after the review.

Will anyone help with the review?

PR: https://github.com/apache/ignite/pull/4520/files
Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
JIRA: https://issues.apache.org/jira/browse/IGNITE-7196

-- 
--
Maxim Muzafarov

Re: Exchange gets stuck while a node is restoring state from WAL

Posted by Maxim Muzafarov <ma...@gmail.com>.
Igniters,

I need your help with the review of this patch. In general, it will help to
reduce PME duration. I've moved binary memory recovery to the moment of node
startup, as Pavel and I discussed earlier in this topic.

Changes are relatively small (+299 −95) and they are ready. I've left a
comment in JIRA with high-level implementation details. Hope this will help
with the review [1].

Please, take a look.
PR: https://github.com/apache/ignite/pull/4520/files
Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
TC Run All: [2]


[1]
https://issues.apache.org/jira/browse/IGNITE-7196?focusedCommentId=16613175&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16613175
[2]
https://ci.ignite.apache.org/viewLog.html?buildId=1835822&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_RunAll


-- 
--
Maxim Muzafarov

Re: Exchange gets stuck while a node is restoring state from WAL

Posted by Pavel Kovalenko <jo...@gmail.com>.
Hello Maxim,

I think you're going in the right direction, but why do you perform binary
recovery only in the case when the node is in the baseline?
I think that a node can also perform binary recovery before the first
exchange is started in the case of:
1) First activation
2) Baseline change

In all of these cases, you need just the locally available CacheDescriptors
to restore the binary state from the WAL.
I would also like to see a test where Ignite is stopped in the middle of a
checkpoint and binary recovery succeeds before PME in that case as well; a
rough outline is sketched below.
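
A rough outline of that test scenario, using the public API only (the cache
name, sizes and helper method are illustrative; stopping the node exactly in
the middle of a checkpoint needs test-harness support, e.g. a file IO factory
that fails after N writes, so that part is only marked with a comment):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    static IgniteConfiguration persistentCfg() {
        return new IgniteConfiguration()
            .setDataStorageConfiguration(new DataStorageConfiguration()
                .setDefaultDataRegionConfiguration(
                    new DataRegionConfiguration().setPersistenceEnabled(true)));
    }

    static void checkpointRecoveryScenario() {
        Ignite ignite = Ignition.start(persistentCfg());
        ignite.cluster().active(true);

        IgniteCache<Integer, Integer> cache = ignite.getOrCreateCache("test");
        for (int i = 0; i < 10_000; i++)
            cache.put(i, i);

        // TODO: trigger a checkpoint and stop the node before it completes
        // (needs a failing file IO factory or a similar test hook).
        Ignition.stop(ignite.name(), true);

        // On restart, binary recovery must finish before the node joins
        // the ring, i.e. before the first exchange is initiated.
        ignite = Ignition.start(persistentCfg());
        ignite.cluster().active(true);
    }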


Re: Exchange gets stuck while a node is restoring state from WAL

Posted by Pavel Kovalenko <jo...@gmail.com>.
Hello Maxim,

Thank you for your work. I will review your changes within the next several
days.

Re: Exchange stucks while node restoring state from WAL

Posted by Maxim Muzafarov <ma...@gmail.com>.
Pavel,

Thank you for such a detailed introduction to the root of the problem in
ticket IGNITE-7196. As you mentioned before, we can divide this ticket into
two sub-tasks, so I will file a new issue for `Logical consistency recovery`.
I think that is a more convenient way to review and merge each of these
high-level steps (PRs).

For now, let's focus on `Physical (binary) consistency recovery` as the most
critical part of this issue and synchronize our vision, and that of the other
community members, on the implementation details of Step 1. From this point
on, everything I say concerns binary recovery only.


I think the PR is almost ready (I need to check TC carefully).
Can you look at my changes? Am I going in the right direction?

PR: https://github.com/apache/ignite/pull/4520
Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
JIRA: https://issues.apache.org/jira/browse/IGNITE-7196


These are the milestones I've tried to address:
1. On first-time cluster activation, no changes are required. We should keep
this behavior.

2. Call `startMemoryPolicies` earlier if needed.
In the master branch, data regions currently start when the onActivate method
is called. If we want to restore the binary state when a node reconnects to
the cluster, we should start the regions earlier.

3. Added a `suspendLogging` method to switch off the handlers.
This keeps PDS cleanup fast when the joining node is not in the BLT (or
belongs to another BLT). Memory restore and metastorage initialization
require `WALWriter` and `FileWriteHandle` to be started, and they can lock
directories\files and block the cleanup.

4. Added an `isBaselineNode` check before resuming logging.
Same reasoning as point 3. The local node receives discovery data and
determines #isLocalNodeNotInBaseline. If the node is not in the current
baseline, it cleans up the local PDS, and we should reset the state of the
metastorage instance that was initialized earlier at startup.

5. Added a `cacheProcessorStarted` callback to restore memory at startup.
We can restore memory only after we have info about all
`CacheGroupDescriptor`s. To achieve this, we should do the restore at the
moment `GridCacheProcessor` has started (see the sketch after this list).

6. Moved the `fsync` for `MemoryRecoveryRecord` to node startup.
We currently fsync it inside a running PME. I suppose it is used for recovery
actions and\or for controlling the cluster state outside the Ignite project
(the `nodeStart` and `nodeStartedPointers` methods are not used in the Ignite
project code). I've moved it to node startup, but maybe we should dump the
timestamp only at the time of node activation. Not sure.

7. Added a `testBinaryRecoverBeforeExchange` test to check that the restore
happens before PME. I'm not sure we should repeat the WAL reading logic from
the `testApplyDeltaRecords` test to verify the memory restore. I suppose the
minimal necessary condition is to check the `lastRestored` and
`checkpointHistory` states of the database manager (they should be set
before the node joins the cluster).
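
To make the intended ordering of milestones 3-6 concrete, here is a rough
compilable Java sketch. Every method name below is an illustrative
placeholder, not the real internal API; the actual logic lives in
GridCacheDatabaseSharedManager and the WAL manager:

    // Sketch only: all methods are hypothetical stand-ins.
    public class StartupRecoverySketch {
        /** Runs once GridCacheProcessor has started and all cache group
         * descriptors are known (milestone 5). */
        void onCacheProcessorStarted() {
            suspendLogging();               // milestone 3: no WAL handles are
                                            // open, so PDS cleanup is not blocked
            if (!isLocalNodeInBaseline()) {
                cleanupPersistenceFiles();  // milestone 4: node is not in the
                resetMetastorage();         // current BLT, drop the stale state
            }
            else {
                restoreBinaryMemory();      // replay checkpoint pages and
                                            // physical delta records before PME
                fsyncMemoryRecoveryMarker();// milestone 6: marker is fsync'ed
                                            // at startup, not inside a running PME
            }
            resumeLogging();                // WAL is writable again, the node
                                            // can proceed to join the ring
        }

        // Stubs so the sketch compiles.
        void suspendLogging() {}
        void resumeLogging() {}
        boolean isLocalNodeInBaseline() { return true; }
        void cleanupPersistenceFiles() {}
        void resetMetastorage() {}
        void restoreBinaryMemory() {}
        void fsyncMemoryRecoveryMarker() {}
    }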

I will create a separate ticket and refactor these minor issues after we
finish with 7196:
- simplify the registerMBeans methods for `Ignite\GridCacheDatabaseSharedManager`
- add @Override annotations where missing
- fix abbreviations according to IdeaAbbrPlugin
- remove or simplify the createDataRegionConfiguration method
- rename startMemoryPolicies to startDataRegion
- fix the code style of the output message formatting
- fix javadoc formatting
- make inner classes private
- add missing logging for cleanupCheckpointDirectory and cleanupWalDirectories

On Fri, 3 Aug 2018 at 18:20 Pavel Kovalenko <jo...@gmail.com> wrote:

> Maxim,
>
> In the addition to my answer, I propose to divide the whole task into 2
> steps:
> 1) Physical consistency recovery before PME.
> 2) Logical consistency recovery before PME.
>
> The first step can be done before the discovery manager start, as it no
> needs any information about caches, it operates only with pages binary
> data. On this step, we can do points a) and b) from answer 3) and fsync the
> state of the page memory.
> The second step is much harder to implement as it requires major changes in
> Discovery SPI. Discovery node join event is triggered on coordinator after
> node successfully joins to the cluster. To avoid it, we should introduce a
> new state of node and fire discovery event only after logical recovery has
> finished. I doubt that logical recovery consumes a lot of time of exchange,
> but physical recovery does, as it performs fsync on disk after restore.
>
>
>
>
> 2018-08-03 17:46 GMT+03:00 Pavel Kovalenko <jo...@gmail.com>:
>
> > Hello Maxim,
> >
> > 1) Yes, Discovery Manager is starting after GridCacheProcessor, which
> > starts GridCacheDatabaseSharedManager which invokes readMetastorage on
> > start.
> >
> > 2) Before we complete the local join future, we create and add Exchange
> > future on local node join to ExchangeManager. So, when local join future
> > completes, first PME on that node should be already run or at least there
> > will be first Exchange future in Exchange worker queue.
> >
> > 3) I don't exactly understand what do you mean saying "update obsolete
> > partition counter", but the process is the following:
> >     a) We restore page memory state (which contain partition update
> > counters) from the last checkpoint.
> >     b) We repair page memory state using physical delta records from WAL.
> >     c) We apply logical records since last checkpoint finish marker.
> > During this iteration, if we meet DataRecord we increment update counter on
> > appropriate partition. NOTE: This phase currently happens during exchange.
> >     After that partition exchange starts and information about actual
> > update counters will be collected from FullMessage. If partition counters
> > are outdated, rebalance will start after PME end.
> >
> > 4) I think yes because everything that you need is cache descriptors of
> > currently started caches on the grid. Some information can be retrieved
> > from the static configuration. Information about dynamically started caches
> > can be retrieved from Discovery Data Bag received when a node joins to
> > ring. See org.apache.ignite.internal.processors.cache.ClusterCachesInfo#onGridDataReceived
> >
> > 5) I think the main problem is that metastorage and page memory share one
> > WAL. But If this phase will happen before PME is started on all cluster
> > nodes, I think it's acceptable that they will use 2 WAL iterations.
> >
> > 2018-08-03 10:28 GMT+03:00 Nikolay Izhikov <ni...@apache.org>:
> >
> >> Hello, Maxim.
> >>
> >> > 1) Is it correct that readMetastore() happens after node starts but
> >> > before including node into the ring?
> >>
> >> I think yes.
> >> You can have some kind of metainformation required on node join.
> >>
> >> > 5) Does in our final solution for new joined node readMetastore and
> >> > restoreMemory should be performed in one step?
> >>
> >> I think, no.
> >>
> >> Meta Information can be required to perform restore memory.
> >> So we have to restore metainformation in first step and restore whole
> >> memory as a second step.
> >>
> >> On Fri, 03/08/2018 at 09:44 +0300, Maxim Muzafarov wrote:
> >> > Hi Igniters,
> >> >
> >> >
> >> > I'm working on bug [1] and have some questions about the final
> >> > implementation. Probably, I've already found answers on some of
> >> > them but I want to be sure. Please, help me to clarify details.
> >> >
> >> >
> >> > The key problem here is that we are reading WAL and restoring
> >> > memory state of new joined node inside PME. Reading WAL can
> >> > consume huge amount of time, so the whole cluster stucks and
> >> > waits for the single node.
> >> >
> >> >
> >> > 1) Is it correct that readMetastore() happens after node starts
> >> > but before including node into the ring?
> >> >
> >> > 2) Is after onDone() method called for LocalJoinFuture on local
> >> > node happend we can proceed with initiating PME on local node?
> >> >
> >> > 3) After reading checkpoint and restore memory for new joined
> >> > node how and when we are updating obsolete partitions update
> >> > counter? At historical rebalance, right?
> >> >
> >> > 4) Should we restoreMemory for new joined node before PME
> >> > initiates on the other nodes in cluster?
> >> >
> >> > 5) Does in our final solution for new joined node readMetastore
> >> > and restoreMemory should be performed in one step?
> >> >
> >> >
> >> > [1] https://issues.apache.org/jira/browse/IGNITE-7196
> >>
> >
> >
>
-- 
--
Maxim Muzafarov

Re: Exchange stucks while node restoring state from WAL

Posted by Pavel Kovalenko <jo...@gmail.com>.
Maxim,

In addition to my answer, I propose to divide the whole task into 2 steps:
1) Physical consistency recovery before PME.
2) Logical consistency recovery before PME.

The first step can be done before the discovery manager starts, as it needs
no information about caches; it operates only on the pages' binary data. In
this step, we can do points a) and b) from answer 3) and fsync the state of
the page memory.
The second step is much harder to implement, as it requires major changes in
the Discovery SPI. The discovery node-join event is triggered on the
coordinator after a node successfully joins the cluster. To avoid this, we
should introduce a new node state and fire the discovery event only after
logical recovery has finished. I doubt that logical recovery consumes much of
the exchange time, but physical recovery does, as it performs an fsync on
disk after the restore.
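
A minimal Java sketch of this split; the phase boundaries follow the
proposal above, while every method name here is a hypothetical placeholder:

    // Sketch only, not the real internal API.
    public class TwoPhaseRecoverySketch {
        /** Phase 1: can run before the discovery manager starts, since it
         * touches only raw page binary data and needs no cache descriptors. */
        void physicalRecovery() {
            restorePagesFromLastCheckpoint();   // point a) from answer 3
            applyPhysicalDeltaRecords();        // point b) from answer 3
            fsyncPageMemory();                  // the expensive disk sync now
                                                // happens outside any exchange
        }

        /** Phase 2: needs cache information and, with the current Discovery
         * SPI, can only run after the node join event, hence the proposed
         * new node state. */
        void logicalRecovery() {
            applyLogicalRecordsSinceCheckpoint();   // point c) from answer 3
        }

        // Stubs so the sketch compiles.
        void restorePagesFromLastCheckpoint() {}
        void applyPhysicalDeltaRecords() {}
        void fsyncPageMemory() {}
        void applyLogicalRecordsSinceCheckpoint() {}
    }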




2018-08-03 17:46 GMT+03:00 Pavel Kovalenko <jo...@gmail.com>:

> Hello Maxim,
>
> 1) Yes, Discovery Manager is starting after GridCacheProcessor, which
> starts GridCacheDatabaseSharedManager which invokes readMetastorage on
> start.
>
> 2) Before we complete the local join future, we create and add Exchange
> future on local node join to ExchangeManager. So, when local join future
> completes, first PME on that node should be already run or at least there
> will be first Exchange future in Exchange worker queue.
>
> 3) I don't exactly understand what do you mean saying "update obsolete
> partition counter", but the process is the following:
>     a) We restore page memory state (which contain partition update
> counters) from the last checkpoint.
>     b) We repair page memory state using physical delta records from WAL.
>     c) We apply logical records since last checkpoint finish marker.
> During this iteration, if we meet DataRecord we increment update counter on
> appropriate partition. NOTE: This phase currently happens during exchange.
>     After that partition exchange starts and information about actual
> update counters will be collected from FullMessage. If partition counters
> are outdated, rebalance will start after PME end.
>
> 4) I think yes because everything that you need is cache descriptors of
> currently started caches on the grid. Some information can be retrieved
> from the static configuration. Information about dynamically started caches
> can be retrieved from Discovery Data Bag received when a node joins to
> ring. See org.apache.ignite.internal.processors.cache.ClusterCachesInfo#
> onGridDataReceived
>
> 5) I think the main problem is that metastorage and page memory share one
> WAL. But If this phase will happen before PME is started on all cluster
> nodes, I think it's acceptable that they will use 2 WAL iterations.
>
> 2018-08-03 10:28 GMT+03:00 Nikolay Izhikov <ni...@apache.org>:
>
>> Hello, Maxim.
>>
>> > 1) Is it correct that readMetastore() happens after node starts but
>> > before including node into the ring?
>>
>> I think yes.
>> You can have some kind of metainformation required on node join.
>>
>> > 5) Does in our final solution for new joined node readMetastore and
>> > restoreMemory should be performed in one step?
>>
>> I think, no.
>>
>> Meta Information can be required to perform restore memory.
>> So we have to restore metainformation in first step and restore whole
>> memory as a second step.
>>
>> On Fri, 03/08/2018 at 09:44 +0300, Maxim Muzafarov wrote:
>> > Hi Igniters,
>> >
>> >
>> > I'm working on bug [1] and have some questions about the final
>> > implementation. Probably, I've already found answers on some of
>> > them but I want to be sure. Please, help me to clarify details.
>> >
>> >
>> > The key problem here is that we are reading WAL and restoring
>> > memory state of new joined node inside PME. Reading WAL can
>> > consume huge amount of time, so the whole cluster stucks and
>> > waits for the single node.
>> >
>> >
>> > 1) Is it correct that readMetastore() happens after node starts
>> > but before including node into the ring?
>> >
>> > 2) Is after onDone() method called for LocalJoinFuture on local
>> > node happend we can proceed with initiating PME on local node?
>> >
>> > 3) After reading checkpoint and restore memory for new joined
>> > node how and when we are updating obsolete partitions update
>> > counter? At historical rebalance, right?
>> >
>> > 4) Should we restoreMemory for new joined node before PME
>> > initiates on the other nodes in cluster?
>> >
>> > 5) Does in our final solution for new joined node readMetastore
>> > and restoreMemory should be performed in one step?
>> >
>> >
>> > [1] https://issues.apache.org/jira/browse/IGNITE-7196
>>
>
>

Re: Exchange stucks while node restoring state from WAL

Posted by Pavel Kovalenko <jo...@gmail.com>.
Hello Maxim,

1) Yes, the Discovery Manager starts after GridCacheProcessor, which starts
GridCacheDatabaseSharedManager, which in turn invokes readMetastorage on
start.

2) Before we complete the local join future, we create an Exchange future for
the local node join and add it to the ExchangeManager. So, when the local
join future completes, the first PME on that node should already be running,
or at least the first Exchange future will be in the Exchange worker queue.

3) I don't exactly understand what you mean by "update obsolete partition
counter", but the process is the following:
    a) We restore the page memory state (which contains the partition update
counters) from the last checkpoint.
    b) We repair the page memory state using physical delta records from the
WAL.
    c) We apply logical records starting from the last checkpoint-finish
marker. During this iteration, whenever we meet a DataRecord we increment the
update counter on the appropriate partition. NOTE: this phase currently
happens during the exchange.
    After that the partition exchange starts, and information about the
actual update counters is collected from the FullMessage. If the partition
counters are outdated, a rebalance will start after the PME ends.

4) I think yes, because all you need is the cache descriptors of the caches
currently started on the grid. Some information can be retrieved from the
static configuration. Information about dynamically started caches can be
retrieved from the Discovery Data Bag received when a node joins the ring.
See org.apache.ignite.internal.processors.cache.ClusterCachesInfo#onGridDataReceived

5) I think the main problem is that the metastorage and the page memory share
one WAL. But if this phase happens before the PME has started on all cluster
nodes, I think it's acceptable that they use 2 WAL iterations.
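
To illustrate step c), a minimal Java sketch of the update counter
bookkeeping, assuming a simplified WAL iterator; the DataRecord/DataEntry
shapes below only loosely mirror the real record classes:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LogicalReplaySketch {
        /** Marker for any WAL record. */
        interface WalRecord {}

        static class DataEntry {
            final int partId;
            DataEntry(int partId) { this.partId = partId; }
        }

        static class DataRecord implements WalRecord {
            final List<DataEntry> entries;
            DataRecord(List<DataEntry> entries) { this.entries = entries; }
        }

        /** Per-partition update counters restored during replay. */
        final Map<Integer, Long> updateCounters = new HashMap<>();

        /**
         * Replays logical records written after the last checkpoint-finish
         * marker. Each DataEntry bumps its partition's update counter; the
         * counters are later compared against those from the FullMessage to
         * decide whether a rebalance is needed.
         */
        void applyLogicalRecords(Iterable<WalRecord> walSinceCheckpoint) {
            for (WalRecord rec : walSinceCheckpoint) {
                if (rec instanceof DataRecord) {
                    for (DataEntry e : ((DataRecord)rec).entries)
                        updateCounters.merge(e.partId, 1L, Long::sum);
                }
            }
        }
    }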

2018-08-03 10:28 GMT+03:00 Nikolay Izhikov <ni...@apache.org>:

> Hello, Maxim.
>
> > 1) Is it correct that readMetastore() happens after node starts but
> > before including node into the ring?
>
> I think yes.
> You can have some kind of metainformation required on node join.
>
> > 5) Does in our final solution for new joined node readMetastore and
> > restoreMemory should be performed in one step?
>
> I think, no.
>
> Meta Information can be required to perform restore memory.
> So we have to restore metainformation in first step and restore whole
> memory as a second step.
>
> On Fri, 03/08/2018 at 09:44 +0300, Maxim Muzafarov wrote:
> > Hi Igniters,
> >
> >
> > I'm working on bug [1] and have some questions about the final
> > implementation. Probably, I've already found answers on some of
> > them but I want to be sure. Please, help me to clarify details.
> >
> >
> > The key problem here is that we are reading WAL and restoring
> > memory state of new joined node inside PME. Reading WAL can
> > consume huge amount of time, so the whole cluster stucks and
> > waits for the single node.
> >
> >
> > 1) Is it correct that readMetastore() happens after node starts
> > but before including node into the ring?
> >
> > 2) Is after onDone() method called for LocalJoinFuture on local
> > node happend we can proceed with initiating PME on local node?
> >
> > 3) After reading checkpoint and restore memory for new joined
> > node how and when we are updating obsolete partitions update
> > counter? At historical rebalance, right?
> >
> > 4) Should we restoreMemory for new joined node before PME
> > initiates on the other nodes in cluster?
> >
> > 5) Does in our final solution for new joined node readMetastore
> > and restoreMemory should be performed in one step?
> >
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-7196
>

Re: Exchange stucks while node restoring state from WAL

Posted by Nikolay Izhikov <ni...@apache.org>.
Hello, Maxim.

> 1) Is it correct that readMetastore() happens after node starts but before including node into the ring?

I think yes.
Some kind of meta-information can be required on node join.

> 5) Does in our final solution for new joined node readMetastore and restoreMemory should be performed in one step?

I think no.

Meta-information can be required to perform the memory restore.
So we have to restore the meta-information in the first step and restore the whole memory in the second step.
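
A tiny Java sketch of that ordering, with hypothetical
readMetastore()/restoreMemory() helpers standing in for the real ones:

    // Sketch only: both helpers are hypothetical placeholders.
    public class TwoStepRestoreSketch {
        /** Placeholder for whatever metadata the restore depends on. */
        static class MetaInfo {}

        void onNodeStart() {
            MetaInfo meta = readMetastore(); // step 1: metadata first, it may
                                             // already be needed on node join
            restoreMemory(meta);             // step 2: memory restore uses it
        }

        MetaInfo readMetastore() { return new MetaInfo(); }

        void restoreMemory(MetaInfo meta) { /* no-op in this sketch */ }
    }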

On Fri, 03/08/2018 at 09:44 +0300, Maxim Muzafarov wrote:
> Hi Igniters,
> 
> 
> I'm working on bug [1] and have some questions about the final
> implementation. Probably, I've already found answers on some of
> them but I want to be sure. Please, help me to clarify details.
> 
> 
> The key problem here is that we are reading WAL and restoring
> memory state of new joined node inside PME. Reading WAL can
> consume huge amount of time, so the whole cluster stucks and
> waits for the single node.
> 
> 
> 1) Is it correct that readMetastore() happens after node starts
> but before including node into the ring?
> 
> 2) Is after onDone() method called for LocalJoinFuture on local
> node happend we can proceed with initiating PME on local node?
> 
> 3) After reading checkpoint and restore memory for new joined
> node how and when we are updating obsolete partitions update
> counter? At historical rebalance, right?
> 
> 4) Should we restoreMemory for new joined node before PME
> initiates on the other nodes in cluster?
> 
> 5) Does in our final solution for new joined node readMetastore
> and restoreMemory should be performed in one step?
> 
> 
> [1] https://issues.apache.org/jira/browse/IGNITE-7196