Posted to dev@ignite.apache.org by Maxim Muzafarov <ma...@gmail.com> on 2020/04/07 19:10:15 UTC

Re: [DISCUSSION] Hot cache backup

Igniters,


I'd like to get back to the discussion of a snapshot operation for
Apache Ignite persistent cache groups, and I propose my changes below.
I have prepared everything so that the discussion can be as meaningful
and specific as possible:

- IEP-43: Cluster snapshot [1]
- The Jira task IGNITE-11073 [2]
- PR with described changes, Patch Available [4]

Changes are ready for review.


Here are a few implementation details and my thoughts:

1. Snapshot restore is assumed to be manual as a first step. The
process will be described on our documentation pages, but it is
possible to start a node right from the snapshot directory since the
directory structure is preserved (check
`testConsistentClusterSnapshotUnderLoad` in the PR). We also have some
options for how the restore process should look:
- fully manual snapshot restore (will be documented)
- ansible or shell scripts for restore
- Java API for restore (I doubt we should go this way).

2. The snapshot `create` procedure creates a snapshot of all
persistent caches available on the cluster (see limitations [1]).

3. The snapshot `create` procedure is available through the Java API
and JMX (control.sh support may be implemented later).

Java API:
IgniteFuture<Void> fut = ignite.snapshot()
    .createSnapshot(name);

JMX:
SnapshotMXBean mxBean = getMBean(ignite.name());
mxBean.createSnapshot(name);

4. The DistributedProcess [3] is used to perform the cluster-wide
snapshot procedure, so we've avoided a lot of boilerplate code here.

5. The design document [1] also contains an internal API for creating
a consistent local snapshot of the requested cache groups and
transferring it to another node using the FileTransmission protocol
[6]. This is one of the parts of IEP-28 [5] for cluster rebalancing via
partition files and an important part for understanding the whole
design.

Java API:
public IgniteInternalFuture<Void> createRemoteSnapshot(
    UUID rmtNodeId,
    Map<Integer, Set<Integer>> parts,
    BiConsumer<File, GroupPartitionId> partConsumer);


Please share your thoughts and take a look at my changes [4].


[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-43%3A+Cluster+snapshots
[2] https://issues.apache.org/jira/browse/IGNITE-11073
[3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/util/distributed/DistributedProcess.java#L49
[4] https://github.com/apache/ignite/pull/7607
[5] https://cwiki.apache.org/confluence/display/IGNITE/IEP-28%3A+Cluster+peer-2-peer+balancing#IEP-28:Clusterpeer-2-peerbalancing-Filetransferbetweennodes
[6] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/communication/TransmissionHandler.java#L42


On Thu, 28 Feb 2019 at 14:43, Dmitriy Pavlov <dp...@apache.org> wrote:
>
> Hi Maxim,
>
> I agree with Denis and I have just one concern here.
>
> Apache Ignite has quite a long history (it started even before Apache), and
> now it has a huge number of features. Some of these features
> - are developed and well known by community members,
> - some were contributed a long time ago and nobody develops them,
> - and, in some rare cases, nobody in the community knows how they work
> or how to change them.
>
> Such features may attract users, but a bug in one of them may ruin the
> impression of the product. Even worse, nobody can help to solve it, and
> only the users themselves may be encouraged to contribute a fix.
>
> My concern here is that such a big feature should have a number of
> interested contributors who can support it in case others lose interest.
> I will be happy if 3-5 members come and say: yes, I will do a review / I
> will help with further changes.
>
> Just to be clear, I'm not against it, and I'll never cast -1 for it, but it
> would be more comfortable to develop this feature with the understanding
> that this work will not be useless.
>
> Sincerely,
> Dmitriy Pavlov
>
> On Wed, 27 Feb 2019 at 23:36, Denis Magda <dm...@apache.org> wrote:
>
> > Maxim,
> >
> > GridGain has this exact feature available for Ignite native persistence
> > deployments. It's not as easy as it might seem from the enablement
> > perspective. It took us many years to make it production ready,
> > involving many engineers. If the rest of the community wants to create
> > something similar and available in open source then please take this
> > estimate into consideration.
> >
> > -
> > Denis
> >
> >
> > On Wed, Feb 27, 2019 at 8:53 AM Maxim Muzafarov <ma...@gmail.com>
> > wrote:
> >
> > > Igniters,
> > >
> > > Some of the stores with which Apache Ignite is often compared have
> > > a feature called Snapshots [1] [2]. This feature provides an
> > > eventually consistent view of stored data for different purposes (e.g.
> > > moving data between environments, saving a backup of data for a
> > > later restore procedure, and so on). Apache Ignite has all the
> > > opportunities and machinery to provide cache and/or data region
> > > snapshots out of the box but still doesn't have them.
> > >
> > > This issue derives from IEP-28 [5], which I'm currently working on
> > > (partially described in the section [6]). I would like to solve this
> > > issue too and make Apache Ignite more attractive to use in a
> > > production environment. I haven't investigated in-memory caches
> > > yet, but for caches with persistence enabled, we can do it
> > > without any performance impact on cache operations (some additional IO
> > > operations are needed to copy cache data to the backup store; a
> > > copy-on-write technique is used here). We just need to use our
> > > DiscoverySpi, PME and Checkpointer process the right way.
> > >
> > > As a first step, we can store all backup data locally on each cache
> > > affinity node. For instance, the `backup\snapshotId\cache0`
> > > folder will be created and all `cache0` partitions will be stored
> > > there on each local node for the snapshot process with id
> > > `snapshotId`. In the future, we can teach nodes to upload snapshotted
> > > partitions to a single remote node or to the cloud.
> > >
> > > --
> > >
> > > High-level process overview
> > >
> > > A new snapshot process is managed via DiscoverySpi and
> > > CommunicationSpi messages.
> > >
> > > 1. The initiator sends a request to the cluster (DiscoveryMessage).
> > > 2. When a node receives the message, it initiates PME.
> > > 3. The node begins the checkpoint process (holding the write lock for a
> > > short time).
> > > 4. The node starts to track any write attempts to the partition being
> > > snapshotted and places a copy of the original pages into a temp file.
> > > 5. The node merges the partition file with the corresponding delta.
> > > 6. When the node finishes the backup process, it sends an ack message
> > > with the saved partitions to the initiator (or an error response).
> > > 7. When all ack messages are received, the backup is finished.
> > >
> > > The only problem here is that when the request message arrives at a
> > > particular node while a checkpoint is running, PME will be blocked until
> > > the checkpoint ends. This is not good. But hopefully, it will be fixed
> > > here [4].
> > >
> > > --
> > >
> > > Probable API
> > >
> > > From the cache perspective:
> > >
> > > IgniteFuture<IgniteSnapshot> snapshotFut =
> > >     ignite.cache("default")
> > >         .snapshotter()
> > >         .create("mySnapshotId");
> > >
> > > IgniteSnapshot cacheSnapshot = snapshotFut.get();
> > >
> > > IgniteCache<K, V> copiedCache =
> > >     ignite.createCache("CopyCache")
> > >         .withConfiguration(defaultCache.getConfiguration())
> > >         .loadFromSnapshot(cacheSnapshot.id());
> > >
> > > From the command line perspective:
> > >
> > > control.sh --snapshot take cache0,cache1,cache2
> > >
> > > --
> > >
> > > WDYT?
> > > Will it be a useful feature for the Apache Ignite?
> > >
> > >
> > > [1] https://geode.apache.org/docs/guide/10/managing/cache_snapshots/chapter_overview.html
> > > [2] https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsBackupTakesSnapshot.html
> > > [3] http://apache-ignite-developers.2346864.n4.nabble.com/Data-Snapshots-in-Ignite-td4183.html
> > > [4] https://issues.apache.org/jira/browse/IGNITE-10508
> > > [5] https://cwiki.apache.org/confluence/display/IGNITE/IEP-28%3A+Cluster+peer-2-peer+balancing
> > > [6] https://cwiki.apache.org/confluence/display/IGNITE/IEP-28%3A+Cluster+peer-2-peer+balancing#IEP-28:Clusterpeer-2-peerbalancing-Checkpointer
> > >
> >
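
For readers following the quoted 2019 process overview, steps 4 and 5
(track writes, keep a copy of the original pages, merge the partition
file with the delta) can be illustrated with a toy, in-memory model.
This is only a sketch under stated assumptions: the class and method
names below are hypothetical and are not Ignite APIs, pages are plain
byte arrays instead of files, and the real implementation in the PR
may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of one partition that supports a copy-on-write snapshot (hypothetical, not Ignite code). */
class CopyOnWriteSnapshotSketch {
    /** Live partition pages. */
    private final byte[][] pages;

    /** Original content of pages that were overwritten while a snapshot was in progress. */
    private final Map<Integer, byte[]> delta = new HashMap<>();

    private boolean snapshotInProgress;

    CopyOnWriteSnapshotSketch(byte[][] initialPages) {
        pages = new byte[initialPages.length][];
        for (int i = 0; i < initialPages.length; i++)
            pages[i] = initialPages[i].clone();
    }

    /** Corresponds to the point right after the checkpoint: start tracking writes. */
    synchronized void beginSnapshot() {
        delta.clear();
        snapshotInProgress = true;
    }

    /** A regular page write. The first write to a page during a snapshot saves the original page. */
    synchronized void writePage(int idx, byte[] data) {
        if (snapshotInProgress)
            delta.putIfAbsent(idx, pages[idx]);

        pages[idx] = data.clone();
    }

    /** Copies the partition and overlays the saved originals, giving the state as of beginSnapshot(). */
    synchronized byte[][] finishSnapshot() {
        byte[][] copy = new byte[pages.length][];
        for (int i = 0; i < pages.length; i++)
            copy[i] = delta.getOrDefault(i, pages[i]).clone();

        delta.clear();
        snapshotInProgress = false;
        return copy;
    }

    synchronized byte[] readPage(int idx) {
        return pages[idx].clone();
    }

    public static void main(String[] args) {
        CopyOnWriteSnapshotSketch part = new CopyOnWriteSnapshotSketch(new byte[][] {{1}, {2}, {3}});

        part.beginSnapshot();
        part.writePage(1, new byte[] {42});   // concurrent update while the snapshot is running
        byte[][] snapshot = part.finishSnapshot();

        System.out.println("snapshot page 1 = " + snapshot[1][0]);
        System.out.println("live page 1     = " + part.readPage(1)[0]);
    }
}
```

The point of the design is that cache writes are never blocked for the
duration of the copy: a writer pays only for saving one original page
the first time it touches that page during the snapshot.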

Re: [DISCUSSION] Hot cache backup

Posted by Surkov <al...@bk.ru.INVALID>.

That's cool. 
I'm waiting for this thing.




Re: [DISCUSSION] Hot cache backup

Posted by Denis Magda <dm...@apache.org>.

👍

https://twitter.com/ApacheIgnite/status/1256249943846576129

-
Denis


On Fri, May 1, 2020 at 2:05 AM Maxim Muzafarov <mm...@apache.org> wrote:

> Folks,
>
>
> I've merged the changes.
> Thanks everyone for the help.
>
> Here are a few tasks that I'm going to complete too.
>
> https://issues.apache.org/jira/browse/IGNITE-12968
> https://issues.apache.org/jira/browse/IGNITE-12967
> https://issues.apache.org/jira/browse/IGNITE-12961
>
> On Wed, 29 Apr 2020 at 21:21, Denis Magda <dm...@apache.org> wrote:
> >
> > Maxim,
> >
> > Ok, let's follow your plan. Ping me once the docs are ready. It's
> > definitely not a blocker for merging the feature into the master. We can
> > always adjust the implementation in the master before a public release.
> >
> > Btw, could you please fill in the "readiness estimated data" column in
> > the roadmap draft? I've added the snapshots feature to the table earlier:
> > https://cwiki.apache.org/confluence/display/IGNITE/Apache+Ignite+Roadmap
> >
> > -
> > Denis
> >
> >
> > On Wed, Apr 29, 2020 at 9:17 AM Maxim Muzafarov <mm...@apache.org>
> wrote:
> >
> > > Denis,
> > >
> > > No, I don't. I'm planning to work on documentation pages right after
> > > we'll finish with the source code changes. I will be very grateful if
> > > you will help with the review of the documentation.
> > >
> > > Currently, the approach is very straightforward and simple and I doubt
> > > we can change anything from the user's standpoint:
> > > 1. The single method for creating snapshots of the whole persisted
> > > cluster caches - createSnapshot(name);
> > > 2. Users can change the location of the base snapshot directory to any
> > > he likes (absolute path or relative path can be used, available from
> > > IgniteConfiguration);
> > > 3. The created snapshot will have the same directory structure as the
> > > Ignite instances have;
> > > 4. Users will be able to start Ignite instances right from snapshot
> > > directory and all will work fine for them (with respect to consistent
> > > nodeId).
> > >
> > > On Wed, 29 Apr 2020 at 19:06, Denis Magda <dm...@apache.org> wrote:
> > > >
> > > > Hi Maxim,
> > > >
> > > > Do you have a draft of docs in any form explaining how the feature is
> > > > supposed to be used (snapshots creation, restore procedure,
> > > > setting/changing snapshots location, etc. - essential operations for
> > > > such capabilities)? I can help with the review from the user standpoint
> > > > and might advise usability improvements.
> > > >
> > > > -
> > > > Denis
> > > >
> > > >
> > > > On Wed, Apr 29, 2020 at 8:57 AM Maxim Muzafarov <mm...@apache.org>
> > > wrote:
> > > >
> > > > > Folks,
> > > > >
> > > > >
> > > > > I'm going to merge this issue [1] on the 1-st day of May.
> > > > > If you still have any questions or PR improvement suggestions,
> > > > > please let me know.
> > > > >
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > >
> > > > > On Mon, 27 Apr 2020 at 18:27, Maxim Muzafarov <mm...@apache.org>
> > > wrote:
> > > > > >
> > > > > > Alexey,
> > > > > >
> > > > > >
> > > > > > From my point of view, the feature is fully self-sufficient and
> > > > > > ready for a release (with a small caveat):
> > > > > > - administrators will be able to create snapshots without writing
> > > > > > Java code;
> > > > > > - developers will be able to create snapshots through the Java API.
> > > > > >
> > > > > > I will complete the documentation pages for the create and restore
> > > > > > procedures, with examples, before this feature is released to our
> > > > > > end users.
> > > > > >
> > > > > > All other features mentioned in this list [1] add convenience for
> > > > > > users but are not mandatory. I'll try to finish these tasks from the
> > > > > > list [1] prior to release:
> > > > > > - support snapshot creation from a client node
> > > > > > - add starting snapshot via control.sh
> > > > > >
> > > > > > Are there any details I've missed?
> > > > > >
> > > > > >
> > > > > > [1] https://github.com/apache/ignite/pull/7607#issuecomment-618964647
> > > > > >
> > > > > > On Mon, 27 Apr 2020 at 18:12, Alexey Goncharuk
> > > > > > <al...@gmail.com> wrote:
> > > > > > >
> > > > > > > Maxim,
> > > > > > >
> > > > > > > I saw the list of the tickets you want to work on in the PR, it
> > > > > > > looks nice. I was wondering, what part of that list are you
> > > > > > > planning to implement before the feature is released to end users?
> > > > > > > For example, I agree with Slava that we should implement a
> > > > > > > command-line utility part for snapshots before the release,
> > > > > > > however I think it's better to do it in a separate ticket.
> > > > > > >
> > > > > > > I know we do not have a strict policy regarding big feature
> > > > > > > development in the community, so perhaps it's a good time to
> > > > > > > discuss this? If we are ok with merging separate tickets to
> > > > > > > master, how do we ensure a complete feature is released to the
> > > > > > > public? If not, should we create a feature branch and wait for all
> > > > > > > related tickets to be merged there? Will be glad to discuss this
> > > > > > > in a separate thread if needed.
> > > > > > >
> > > > > > > On Mon, 27 Apr 2020 at 14:38, Maxim Muzafarov <mmuzaf@apache.org> wrote:
> > > > > > >
> > > > > > > > Folks,
> > > > > > > >
> > > > > > > >
> > > > > > > > Are there any cases left which we need to discuss?
> > > > > > > >
> > > > > > > > Do you have any questions?
> > > > > > > > I'm ready to provide all the details you need for the review.
> > > > > > > >
> > > > > > > > Who else wants to take a look at my changes [1] [2]?
> > > > > > > >
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > > > > > [2] https://github.com/apache/ignite/pull/7607
> > > > > > > >
> > > > > > > > On Fri, 24 Apr 2020 at 15:01, Maxim Muzafarov <mmuzaf@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > Alexey,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I've addressed all your comments, please take a look at the PR [1].
> > > > > > > > > Additional tests were added.
> > > > > > > > > Additional comments with further steps were added.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > [1] https://github.com/apache/ignite/pull/7607
> > > > > > > > > [2] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > > > > > >
> > > > > > > > > On Tue, 21 Apr 2020 at 09:53, Alexey Goncharuk
> > > > > > > > > <al...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Maxim,
> > > > > > > > > >
> > > > > > > > > > I've left my comments in the PR.
> > > > > > > > > >
> > > > > > > > > > On Mon, 20 Apr 2020 at 12:52, Maxim Muzafarov <mmuzaf@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > > > Alex P,
> > > > > > > > > > > Thank you for the great sophisticated review.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Alexey G,
> > > > > > > > > > > Will you take a look at my changes[1]?
> > > > > > > > > > > The fresh TC.Bot visa attached.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > > > > > > > >
> > > > > > > > > > > On Mon, 20 Apr 2020 at 11:54, Alex Plehanov <plehanov.alex@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Maxim, I've reviewed your PR and it looks good to me. Good job!
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, 10 Apr 2020 at 19:43, Alexey Goncharuk <alexey.goncharuk@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Maxim,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for raising this PR. I will do a review during next week.
> > > > > > > > > > > > >
> > > > > > > > > > > > > --AG
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > >
> > > > >
> > >
>

Re: [DISCUSSION] Hot cache backup

Posted by Maxim Muzafarov <mm...@apache.org>.

Folks,


I've merged the changes.
Thanks everyone for the help.

Here are a few tasks that I'm going to complete too.

https://issues.apache.org/jira/browse/IGNITE-12968
https://issues.apache.org/jira/browse/IGNITE-12967
https://issues.apache.org/jira/browse/IGNITE-12961

On Wed, 29 Apr 2020 at 21:21, Denis Magda <dm...@apache.org> wrote:
>
> Maxim,
>
> Ok, let's follow your plan. Ping me once the docs are ready. It's
> definitely not a blocker for merging the feature into the master. We can
> always adjust the implementation in the master before a public release.
>
> Btw, could you please fill in the "readiness estimated data" column in the
> roadmap draft? I've added the snapshots feature to the table earlier:
> https://cwiki.apache.org/confluence/display/IGNITE/Apache+Ignite+Roadmap
>
> -
> Denis
>
>
> On Wed, Apr 29, 2020 at 9:17 AM Maxim Muzafarov <mm...@apache.org> wrote:
>
> > Denis,
> >
> > No, I don't. I'm planning to work on documentation pages right after
> > we'll finish with the source code changes. I will be very grateful if
> > you will help with the review of the documentation.
> >
> > Currently, the approach is very straightforward and simple and I doubt
> > we can change anything from the user's standpoint:
> > 1. The single method for creating snapshots of the whole persisted
> > cluster caches - createSnapshot(name);
> > 2. Users can change the location of the base snapshot directory to any
> > he likes (absolute path or relative path can be used, available from
> > IgniteConfiguration);
> > 3. The created snapshot will have the same directory structure as the
> > Ignite instances have;
> > 4. Users will be able to start Ignite instances right from snapshot
> > directory and all will work fine for them (with respect to consistent
> > nodeId).
> >
> > On Wed, 29 Apr 2020 at 19:06, Denis Magda <dm...@apache.org> wrote:
> > >
> > > Hi Maxim,
> > >
> > > Do you have a draft of docs in any form explaining how the feature is
> > > supposed to be used (snapshots creation, restore procedure,
> > > setting/changing snapshots location, etc. - essential operations for such
> > > capabilities)? I can help with the review from the user standpoint and
> > > might advise usability improvements.
> > >
> > > -
> > > Denis
> > >
> > >
> > > On Wed, Apr 29, 2020 at 8:57 AM Maxim Muzafarov <mm...@apache.org>
> > wrote:
> > >
> > > > Folks,
> > > >
> > > >
> > > > I'm going to merge this issue [1] on the 1-st day of May.
> > > > If you still have any questions or PR improvement suggestions, please
> > > > let me know.
> > > >
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > >
> > > > On Mon, 27 Apr 2020 at 18:27, Maxim Muzafarov <mm...@apache.org>
> > wrote:
> > > > >
> > > > > Alexey,
> > > > >
> > > > >
> > > > > From my point of view, the feature is fully self-sufficient and
> > > > > ready for a release (with a small caveat):
> > > > > - administrators will be able to create snapshots without writing
> > > > > Java code;
> > > > > - developers will be able to create snapshots through the Java API.
> > > > >
> > > > > I will complete the documentation pages for the create and restore
> > > > > procedures, with examples, before this feature is released to our
> > > > > end users.
> > > > >
> > > > > All other features mentioned in this list [1] add convenience for
> > > > > users but are not mandatory. I'll try to finish these tasks from the
> > > > > list [1] prior to release:
> > > > > - support snapshot creation from a client node
> > > > > - add starting snapshot via control.sh
> > > > >
> > > > > Are there any details I've missed?
> > > > >
> > > > >
> > > > > [1] https://github.com/apache/ignite/pull/7607#issuecomment-618964647
> > > > >
> > > > > On Mon, 27 Apr 2020 at 18:12, Alexey Goncharuk
> > > > > <al...@gmail.com> wrote:
> > > > > >
> > > > > > Maxim,
> > > > > >
> > > > > > I saw the list of the tickets you want to work on in the PR, it
> > > > > > looks nice. I was wondering, what part of that list are you
> > > > > > planning to implement before the feature is released to end users?
> > > > > > For example, I agree with Slava that we should implement a
> > > > > > command-line utility part for snapshots before the release,
> > > > > > however I think it's better to do it in a separate ticket.
> > > > > >
> > > > > > I know we do not have a strict policy regarding big feature
> > > > > > development in the community, so perhaps it's a good time to
> > > > > > discuss this? If we are ok with merging separate tickets to
> > > > > > master, how do we ensure a complete feature is released to the
> > > > > > public? If not, should we create a feature branch and wait for all
> > > > > > related tickets to be merged there? Will be glad to discuss this
> > > > > > in a separate thread if needed.
> > > > > >
> > > > > > On Mon, 27 Apr 2020 at 14:38, Maxim Muzafarov <mm...@apache.org> wrote:
> > > > > >
> > > > > > > Folks,
> > > > > > >
> > > > > > >
> > > > > > > Are there any cases left which we need to discuss?
> > > > > > >
> > > > > > > Do you have any questions?
> > > > > > > I'm ready to provide all the details you need for the review.
> > > > > > >
> > > > > > > Who else wants to take a look at my changes [1] [2]?
> > > > > > >
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > > > > [2] https://github.com/apache/ignite/pull/7607
> > > > > > >
> > > > > > > On Fri, 24 Apr 2020 at 15:01, Maxim Muzafarov <mmuzaf@apache.org> wrote:
> > > > > > > >
> > > > > > > > Alexey,
> > > > > > > >
> > > > > > > >
> > > > > > > > I've addressed all your comments, please take a look at the PR [1].
> > > > > > > > Additional tests were added.
> > > > > > > > Additional comments with further steps were added.
> > > > > > > >
> > > > > > > >
> > > > > > > > [1] https://github.com/apache/ignite/pull/7607
> > > > > > > > [2] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > > > > >
> > > > > > > > On Tue, 21 Apr 2020 at 09:53, Alexey Goncharuk
> > > > > > > > <al...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Maxim,
> > > > > > > > >
> > > > > > > > > I've left my comments in the PR.
> > > > > > > > >
> > > > > > > > > On Mon, 20 Apr 2020 at 12:52, Maxim Muzafarov <mmuzaf@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > Alex P,
> > > > > > > > > > Thank you for the great sophisticated review.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Alexey G,
> > > > > > > > > > Will you take a look at my changes[1]?
> > > > > > > > > > The fresh TC.Bot visa attached.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > > > > > > >
> > > > > > > > > > On Mon, 20 Apr 2020 at 11:54, Alex Plehanov <plehanov.alex@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Maxim, I've reviewed your PR and it looks good to me. Good job!
> > > > > > > > > > >
> > > > > > > > > > > On Fri, 10 Apr 2020 at 19:43, Alexey Goncharuk <alexey.goncharuk@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Maxim,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for raising this PR. I will do a review during next week.
> > > > > > > > > > > >
> > > > > > > > > > > > --AG
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > >
> > > >
> >

Re: [DISCUSSION] Hot cache backup

Posted by Denis Magda <dm...@apache.org>.

Maxim,

Ok, let's follow your plan. Ping me once the docs are ready. It's
definitely not a blocker for merging the feature into the master. We can
always adjust the implementation in the master before a public release.

Btw, could you please fill in the "readiness estimated data" column in the
roadmap draft? I've added the snapshots feature to the table earlier:
https://cwiki.apache.org/confluence/display/IGNITE/Apache+Ignite+Roadmap

-
Denis


On Wed, Apr 29, 2020 at 9:17 AM Maxim Muzafarov <mm...@apache.org> wrote:

> Denis,
>
> No, I don't. I'm planning to work on documentation pages right after
> we'll finish with the source code changes. I will be very grateful if
> you will help with the review of the documentation.
>
> Currently, the approach is very straightforward and simple and I doubt
> we can change anything from the user's standpoint:
> 1. The single method for creating snapshots of the whole persisted
> cluster caches - createSnapshot(name);
> 2. Users can change the location of the base snapshot directory to any
> he likes (absolute path or relative path can be used, available from
> IgniteConfiguration);
> 3. The created snapshot will have the same directory structure as the
> Ignite instances have;
> 4. Users will be able to start Ignite instances right from snapshot
> directory and all will work fine for them (with respect to consistent
> nodeId).
>
> On Wed, 29 Apr 2020 at 19:06, Denis Magda <dm...@apache.org> wrote:
> >
> > Hi Maxim,
> >
> > Do you have a draft of docs in any form explaining how the feature is
> > supposed to be used (snapshots creation, restore procedure,
> > setting/changing snapshots location, etc. - essential operations for such
> > capabilities)? I can help with the review from the user standpoint and
> > might advise usability improvements.
> >
> > -
> > Denis
> >
> >
> > On Wed, Apr 29, 2020 at 8:57 AM Maxim Muzafarov <mm...@apache.org>
> wrote:
> >
> > > Folks,
> > >
> > >
> > > I'm going to merge this issue [1] on the 1-st day of May.
> > > If you still have any questions or PR improvement suggestions, please
> > > let me know.
> > >
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > >
> > > On Mon, 27 Apr 2020 at 18:27, Maxim Muzafarov <mm...@apache.org>
> wrote:
> > > >
> > > > Alexey,
> > > >
> > > >
> > > > From my point of view, the feature is fully self-sufficient and
> > > > ready for a release (with a small caveat):
> > > > - administrators will be able to create snapshots without writing
> > > > Java code;
> > > > - developers will be able to create snapshots through the Java API.
> > > >
> > > > I will complete the documentation pages for the create and restore
> > > > procedures, with examples, before this feature is released to our
> > > > end users.
> > > >
> > > > All other features mentioned in this list [1] add convenience for
> > > > users but are not mandatory. I'll try to finish these tasks from the
> > > > list [1] prior to release:
> > > > - support snapshot creation from a client node
> > > > - add starting snapshot via control.sh
> > > >
> > > > Are there any details I've missed?
> > > >
> > > >
> > > > [1] https://github.com/apache/ignite/pull/7607#issuecomment-618964647
> > > >
> > > > On Mon, 27 Apr 2020 at 18:12, Alexey Goncharuk
> > > > <al...@gmail.com> wrote:
> > > > >
> > > > > Maxim,
> > > > >
> > > > > I saw the list of the tickets you want to work on in the PR, it
> > > > > looks nice. I was wondering, what part of that list are you
> > > > > planning to implement before the feature is released to end users?
> > > > > For example, I agree with Slava that we should implement a
> > > > > command-line utility part for snapshots before the release,
> > > > > however I think it's better to do it in a separate ticket.
> > > > >
> > > > > I know we do not have a strict policy regarding big feature
> > > > > development in the community, so perhaps it's a good time to
> > > > > discuss this? If we are ok with merging separate tickets to
> > > > > master, how do we ensure a complete feature is released to the
> > > > > public? If not, should we create a feature branch and wait for all
> > > > > related tickets to be merged there? Will be glad to discuss this
> > > > > in a separate thread if needed.
> > > > >
> > > > > On Mon, 27 Apr 2020 at 14:38, Maxim Muzafarov <mm...@apache.org> wrote:
> > > > >
> > > > > > Folks,
> > > > > >
> > > > > >
> > > > > > Are there any cases left which we need to discuss?
> > > > > >
> > > > > > Do you have any questions?
> > > > > > I'm ready to provide all the details you need for the review.
> > > > > >
> > > > > > Who else wants to take a look at my changes [1] [2]?
> > > > > >
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > > > [2] https://github.com/apache/ignite/pull/7607
> > > > > >
> > > > > > On Fri, 24 Apr 2020 at 15:01, Maxim Muzafarov <mmuzaf@apache.org> wrote:
> > > > > > >
> > > > > > > Alexey,
> > > > > > >
> > > > > > >
> > > > > > > I've addressed all your comments, please take a look at the PR [1].
> > > > > > > Additional tests were added.
> > > > > > > Additional comments with further steps were added.
> > > > > > >
> > > > > > >
> > > > > > > [1] https://github.com/apache/ignite/pull/7607
> > > > > > > [2] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > > > >
> > > > > > > On Tue, 21 Apr 2020 at 09:53, Alexey Goncharuk
> > > > > > > <al...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Maxim,
> > > > > > > >
> > > > > > > > I've left my comments in the PR.
> > > > > > > >
> > > > > > > > On Mon, 20 Apr 2020 at 12:52, Maxim Muzafarov <mmuzaf@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Alex P,
> > > > > > > > > Thank you for the great sophisticated review.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Alexey G,
> > > > > > > > > Will you take a look at my changes[1]?
> > > > > > > > > The fresh TC.Bot visa attached.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > > > > > >
> > > > > > > > > On Mon, 20 Apr 2020 at 11:54, Alex Plehanov <plehanov.alex@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Maxim, I've reviewed your PR and it looks good to me. Good job!
> > > > > > > > > >
> > > > > > > > > > On Fri, 10 Apr 2020 at 19:43, Alexey Goncharuk <alexey.goncharuk@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Maxim,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for raising this PR. I will do a review during next week.
> > > > > > > > > > >
> > > > > > > > > > > --AG
> > > > > > > > > > >
> > > > > > > > >
> > > > > >
> > >
>

Re: [DISCUSSION] Hot cache backup

Posted by Maxim Muzafarov <mm...@apache.org>.

Denis,

No, I don't. I'm planning to work on documentation pages right after
we'll finish with the source code changes. I will be very grateful if
you will help with the review of the documentation.

Currently, the approach is very straightforward and simple and I doubt
we can change anything from the user's standpoint:
1. There is a single method for creating a snapshot of all persistent
caches in the cluster - createSnapshot(name);
2. Users can change the location of the base snapshot directory to any
path they like (an absolute or a relative path can be used; it is
available from IgniteConfiguration);
3. The created snapshot will have the same directory structure as the
Ignite instances have;
4. Users will be able to start Ignite instances right from the snapshot
directory and everything will work fine for them (with respect to the
consistent nodeId).
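
For illustration only, here is a minimal sketch of how point 2 could
behave: an absolute snapshot path is used as-is, while a relative one is
resolved against the node work directory. The class SnapshotPathSketch,
its methods and the example paths are hypothetical and are not part of
the Ignite code base. Resolving relative paths against the work
directory is an assumption of this sketch and should be checked against
the final documentation.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for illustration; not part of Ignite. */
final class SnapshotPathSketch {
    /**
     * Resolves the configured base snapshot directory.
     * Assumption: a relative path is resolved against the node work directory.
     */
    static Path resolveSnapshotRoot(Path workDir, String configured) {
        Path p = Paths.get(configured);

        return (p.isAbsolute() ? p : workDir.resolve(p)).normalize();
    }

    /** Directory of a single named snapshot under the base directory. */
    static Path snapshotDir(Path workDir, String configured, String snapshotName) {
        return resolveSnapshotRoot(workDir, configured).resolve(snapshotName);
    }

    public static void main(String[] args) {
        Path workDir = Paths.get("/opt/ignite/work");

        // Relative location: ends up under the work directory.
        System.out.println(snapshotDir(workDir, "snapshots", "backup_2020_04_29"));

        // Absolute location: the work directory is ignored.
        System.out.println(snapshotDir(workDir, "/mnt/backups", "backup_2020_04_29"));
    }
}
```

Since the snapshot keeps the node directory layout, a directory computed
this way is what a node would be started from during a manual restore.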

On Wed, 29 Apr 2020 at 19:06, Denis Magda <dm...@apache.org> wrote:
>
> Hi Maxim,
>
> Do you have a draft of docs in any form explaining how the feature is
> supposed to be used (snapshots creation, restore procedure,
> setting/changing snapshots location, etc. - essential operations for such
> capabilities)? I can help with the review from the user standpoint and
> might advise usability improvements.
>
> -
> Denis
>
>
> On Wed, Apr 29, 2020 at 8:57 AM Maxim Muzafarov <mm...@apache.org> wrote:
>
> > Folks,
> >
> >
> > I'm going to merge this issue [1] on the 1-st day of May.
> > If you still have any questions or PR improvement suggestions, please
> > let me know.
> >
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> >
> > On Mon, 27 Apr 2020 at 18:27, Maxim Muzafarov <mm...@apache.org> wrote:
> > >
> > > Alexey,
> > >
> > >
> > > From my point of view, the feature is fully self-sufficient and ready
> > > for a release (with a small caveat):
> > > - administrators will be able to create snapshots without writing
> > java-code;
> > > - developers will be able to create snapshots through java API;
> > >
> > > The documentation pages for creating and restoring procedures with
> > > examples will be completed by me prior to release this feature for our
> > > end-users.
> > >
> > > All other features mentioned in this list [1] adds convenience for
> > > users but not mandatory. I'll try to finish these tasks from the list
> > > [1] prior to release:
> > > - support snapshot creation from a client node
> > > - add starting snapshot via control.sh
> > >
> > > Are there any details I've missed?
> > >
> > >
> > > [1] https://github.com/apache/ignite/pull/7607#issuecomment-618964647
> > >
> > > On Mon, 27 Apr 2020 at 18:12, Alexey Goncharuk
> > > <al...@gmail.com> wrote:
> > > >
> > > > Maxim,
> > > >
> > > > I saw the list of the tickets you want to work on in the PR, it looks
> > nice.
> > > > I was wondering, what part of that list are you planning to implement
> > > > before the feature is released to end users? For example, I agree with
> > > > Slava that we should implement a command-line utility part for
> > snapshots
> > > > before the release, however I think it's better to do it in a separate
> > > > ticket.
> > > >
> > > > I know we do not have a strict policy regarding big features
> > development in
> > > > the community, so perhaps it's a good time to discuss this? If we are
> > ok
> > > > with merging separate tickets to master, how we ensure a complete
> > feature
> > > > is released to public? If not, should we create a feature branch and
> > wait
> > > > for all related tickets to be merged there? Will be glad to discuss
> > this in
> > > > a separate thread if needed.
> > > >
> > > > Mon, 27 Apr 2020 at 14:38, Maxim Muzafarov <mm...@apache.org>:
> > > >
> > > > > Folks,
> > > > >
> > > > >
> > > > > Are there any cases left which we need to discuss?
> > > > >
> > > > > Do you have any questions?
> > > > > I'm ready to provide all the details you need for the review.
> > > > >
> > > > > Who else what to take a look at my changes [1] [2]?
> > > > >
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > > [2] https://github.com/apache/ignite/pull/7607
> > > > >
> > > > > On Fri, 24 Apr 2020 at 15:01, Maxim Muzafarov <mm...@apache.org>
> > wrote:
> > > > > >
> > > > > > Alexey,
> > > > > >
> > > > > >
> > > > > > I've addressed all your comments, please, take a look at the PR
> > [1].
> > > > > > Additional tests were added.
> > > > > > Additional comments with further steps were added.
> > > > > >
> > > > > >
> > > > > > [1] https://github.com/apache/ignite/pull/7607
> > > > > > [2] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > > >
> > > > > > On Tue, 21 Apr 2020 at 09:53, Alexey Goncharuk
> > > > > > <al...@gmail.com> wrote:
> > > > > > >
> > > > > > > Maxim,
> > > > > > >
> > > > > > > I've left my comments in the PR.
> > > > > > >
> > > > > > > Mon, 20 Apr 2020 at 12:52, Maxim Muzafarov <mmuzaf@apache.org>:
> > > > > > >
> > > > > > > > Alex P,
> > > > > > > > Thank you for the great sophisticated review.
> > > > > > > >
> > > > > > > >
> > > > > > > > Alexey G,
> > > > > > > > Will you take a look at my changes[1]?
> > > > > > > > The fresh TC.Bot visa attached.
> > > > > > > >
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > > > > >
> > > > > > > > On Mon, 20 Apr 2020 at 11:54, Alex Plehanov <
> > plehanov.alex@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Maxim, I've reviewed your PR and it looks good to me. Good
> > job!
> > > > > > > > >
> > > > > > > > > Fri, 10 Apr 2020 at 19:43, Alexey Goncharuk <alexey.goncharuk@gmail.com>:
> > > > > > > > >
> > > > > > > > > > Maxim,
> > > > > > > > > >
> > > > > > > > > > Thanks for raising this PR. I will do a review during next
> > week.
> > > > > > > > > >
> > > > > > > > > > --AG
> > > > > > > > > >
> > > > > > > >
> > > > >
> >

Re: [DISCUSSION] Hot cache backup

Posted by Denis Magda <dm...@apache.org>.

Hi Maxim,

Do you have a draft of docs in any form explaining how the feature is
supposed to be used (snapshots creation, restore procedure,
setting/changing snapshots location, etc. - essential operations for such
capabilities)? I can help with the review from the user standpoint and
might advise usability improvements.

-
Denis


On Wed, Apr 29, 2020 at 8:57 AM Maxim Muzafarov <mm...@apache.org> wrote:

> Folks,
>
>
> I'm going to merge this issue [1] on the 1-st day of May.
> If you still have any questions or PR improvement suggestions, please
> let me know.
>
>
> [1] https://issues.apache.org/jira/browse/IGNITE-11073
>
> On Mon, 27 Apr 2020 at 18:27, Maxim Muzafarov <mm...@apache.org> wrote:
> >
> > Alexey,
> >
> >
> > From my point of view, the feature is fully self-sufficient and ready
> > for a release (with a small caveat):
> > - administrators will be able to create snapshots without writing
> java-code;
> > - developers will be able to create snapshots through java API;
> >
> > The documentation pages for creating and restoring procedures with
> > examples will be completed by me prior to release this feature for our
> > end-users.
> >
> > All other features mentioned in this list [1] adds convenience for
> > users but not mandatory. I'll try to finish these tasks from the list
> > [1] prior to release:
> > - support snapshot creation from a client node
> > - add starting snapshot via control.sh
> >
> > Are there any details I've missed?
> >
> >
> > [1] https://github.com/apache/ignite/pull/7607#issuecomment-618964647
> >
> > On Mon, 27 Apr 2020 at 18:12, Alexey Goncharuk
> > <al...@gmail.com> wrote:
> > >
> > > Maxim,
> > >
> > > I saw the list of the tickets you want to work on in the PR, it looks
> nice.
> > > I was wondering, what part of that list are you planning to implement
> > > before the feature is released to end users? For example, I agree with
> > > Slava that we should implement a command-line utility part for
> snapshots
> > > before the release, however I think it's better to do it in a separate
> > > ticket.
> > >
> > > I know we do not have a strict policy regarding big features
> development in
> > > the community, so perhaps it's a good time to discuss this? If we are
> ok
> > > with merging separate tickets to master, how we ensure a complete
> feature
> > > is released to public? If not, should we create a feature branch and
> wait
> > > for all related tickets to be merged there? Will be glad to discuss
> this in
> > > a separate thread if needed.
> > >
> > > Mon, 27 Apr 2020 at 14:38, Maxim Muzafarov <mm...@apache.org>:
> > >
> > > > Folks,
> > > >
> > > >
> > > > Are there any cases left which we need to discuss?
> > > >
> > > > Do you have any questions?
> > > > I'm ready to provide all the details you need for the review.
> > > >
> > > > Who else what to take a look at my changes [1] [2]?
> > > >
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > [2] https://github.com/apache/ignite/pull/7607
> > > >
> > > > On Fri, 24 Apr 2020 at 15:01, Maxim Muzafarov <mm...@apache.org>
> wrote:
> > > > >
> > > > > Alexey,
> > > > >
> > > > >
> > > > > I've addressed all your comments, please, take a look at the PR
> [1].
> > > > > Additional tests were added.
> > > > > Additional comments with further steps were added.
> > > > >
> > > > >
> > > > > [1] https://github.com/apache/ignite/pull/7607
> > > > > [2] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > >
> > > > > On Tue, 21 Apr 2020 at 09:53, Alexey Goncharuk
> > > > > <al...@gmail.com> wrote:
> > > > > >
> > > > > > Maxim,
> > > > > >
> > > > > > I've left my comments in the PR.
> > > > > >
> > > > > > Mon, 20 Apr 2020 at 12:52, Maxim Muzafarov <mmuzaf@apache.org>:
> > > > > >
> > > > > > > Alex P,
> > > > > > > Thank you for the great sophisticated review.
> > > > > > >
> > > > > > >
> > > > > > > Alexey G,
> > > > > > > Will you take a look at my changes[1]?
> > > > > > > The fresh TC.Bot visa attached.
> > > > > > >
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > > > >
> > > > > > > On Mon, 20 Apr 2020 at 11:54, Alex Plehanov <
> plehanov.alex@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Maxim, I've reviewed your PR and it looks good to me. Good
> job!
> > > > > > > >
> > > > > > > > Fri, 10 Apr 2020 at 19:43, Alexey Goncharuk <alexey.goncharuk@gmail.com>:
> > > > > > > >
> > > > > > > > > Maxim,
> > > > > > > > >
> > > > > > > > > Thanks for raising this PR. I will do a review during next
> week.
> > > > > > > > >
> > > > > > > > > --AG
> > > > > > > > >
> > > > > > >
> > > >
>

Re: [DISCUSSION] Hot cache backup

Posted by Maxim Muzafarov <mm...@apache.org>.

Folks,


I'm going to merge this issue [1] on the 1st of May.
If you still have any questions or PR improvement suggestions, please
let me know.


[1] https://issues.apache.org/jira/browse/IGNITE-11073

On Mon, 27 Apr 2020 at 18:27, Maxim Muzafarov <mm...@apache.org> wrote:
>
> Alexey,
>
>
> From my point of view, the feature is fully self-sufficient and ready
> for a release (with a small caveat):
> - administrators will be able to create snapshots without writing java-code;
> - developers will be able to create snapshots through java API;
>
> The documentation pages for creating and restoring procedures with
> examples will be completed by me prior to release this feature for our
> end-users.
>
> All other features mentioned in this list [1] adds convenience for
> users but not mandatory. I'll try to finish these tasks from the list
> [1] prior to release:
> - support snapshot creation from a client node
> - add starting snapshot via control.sh
>
> Are there any details I've missed?
>
>
> [1] https://github.com/apache/ignite/pull/7607#issuecomment-618964647
>
> On Mon, 27 Apr 2020 at 18:12, Alexey Goncharuk
> <al...@gmail.com> wrote:
> >
> > Maxim,
> >
> > I saw the list of the tickets you want to work on in the PR, it looks nice.
> > I was wondering, what part of that list are you planning to implement
> > before the feature is released to end users? For example, I agree with
> > Slava that we should implement a command-line utility part for snapshots
> > before the release, however I think it's better to do it in a separate
> > ticket.
> >
> > I know we do not have a strict policy regarding big features development in
> > the community, so perhaps it's a good time to discuss this? If we are ok
> > with merging separate tickets to master, how we ensure a complete feature
> > is released to public? If not, should we create a feature branch and wait
> > for all related tickets to be merged there? Will be glad to discuss this in
> > a separate thread if needed.
> >
> > Mon, 27 Apr 2020 at 14:38, Maxim Muzafarov <mm...@apache.org>:
> >
> > > Folks,
> > >
> > >
> > > Are there any cases left which we need to discuss?
> > >
> > > Do you have any questions?
> > > I'm ready to provide all the details you need for the review.
> > >
> > > Who else what to take a look at my changes [1] [2]?
> > >
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > [2] https://github.com/apache/ignite/pull/7607
> > >
> > > On Fri, 24 Apr 2020 at 15:01, Maxim Muzafarov <mm...@apache.org> wrote:
> > > >
> > > > Alexey,
> > > >
> > > >
> > > > I've addressed all your comments, please, take a look at the PR [1].
> > > > Additional tests were added.
> > > > Additional comments with further steps were added.
> > > >
> > > >
> > > > [1] https://github.com/apache/ignite/pull/7607
> > > > [2] https://issues.apache.org/jira/browse/IGNITE-11073
> > > >
> > > > On Tue, 21 Apr 2020 at 09:53, Alexey Goncharuk
> > > > <al...@gmail.com> wrote:
> > > > >
> > > > > Maxim,
> > > > >
> > > > > I've left my comments in the PR.
> > > > >
> > > > > Mon, 20 Apr 2020 at 12:52, Maxim Muzafarov <mm...@apache.org>:
> > > > >
> > > > > > Alex P,
> > > > > > Thank you for the great sophisticated review.
> > > > > >
> > > > > >
> > > > > > Alexey G,
> > > > > > Will you take a look at my changes[1]?
> > > > > > The fresh TC.Bot visa attached.
> > > > > >
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > > >
> > > > > > On Mon, 20 Apr 2020 at 11:54, Alex Plehanov <plehanov.alex@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > Maxim, I've reviewed your PR and it looks good to me. Good job!
> > > > > > >
> > > > > > > Fri, 10 Apr 2020 at 19:43, Alexey Goncharuk <alexey.goncharuk@gmail.com>:
> > > > > > >
> > > > > > > > Maxim,
> > > > > > > >
> > > > > > > > Thanks for raising this PR. I will do a review during next week.
> > > > > > > >
> > > > > > > > --AG
> > > > > > > >
> > > > > >
> > >

Re: [DISCUSSION] Hot cache backup

Posted by Maxim Muzafarov <mm...@apache.org>.

Alexey,


From my point of view, the feature is fully self-sufficient and ready
for a release (with a small caveat):
- administrators will be able to create snapshots without writing Java code;
- developers will be able to create snapshots through the Java API.

I will complete the documentation pages for the create and restore
procedures, with examples, before this feature is released to our
end users.

All other features mentioned in the list [1] add convenience for
users but are not mandatory. I'll try to finish these tasks from the
list [1] prior to the release:
- support snapshot creation from a client node
- add starting a snapshot via control.sh

Are there any details I've missed?


[1] https://github.com/apache/ignite/pull/7607#issuecomment-618964647

On Mon, 27 Apr 2020 at 18:12, Alexey Goncharuk
<al...@gmail.com> wrote:
>
> Maxim,
>
> I saw the list of the tickets you want to work on in the PR, it looks nice.
> I was wondering, what part of that list are you planning to implement
> before the feature is released to end users? For example, I agree with
> Slava that we should implement a command-line utility part for snapshots
> before the release, however I think it's better to do it in a separate
> ticket.
>
> I know we do not have a strict policy regarding big features development in
> the community, so perhaps it's a good time to discuss this? If we are ok
> with merging separate tickets to master, how we ensure a complete feature
> is released to public? If not, should we create a feature branch and wait
> for all related tickets to be merged there? Will be glad to discuss this in
> a separate thread if needed.
>
> Mon, 27 Apr 2020 at 14:38, Maxim Muzafarov <mm...@apache.org>:
>
> > Folks,
> >
> >
> > Are there any cases left which we need to discuss?
> >
> > Do you have any questions?
> > I'm ready to provide all the details you need for the review.
> >
> > Who else what to take a look at my changes [1] [2]?
> >
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > [2] https://github.com/apache/ignite/pull/7607
> >
> > On Fri, 24 Apr 2020 at 15:01, Maxim Muzafarov <mm...@apache.org> wrote:
> > >
> > > Alexey,
> > >
> > >
> > > I've addressed all your comments, please, take a look at the PR [1].
> > > Additional tests were added.
> > > Additional comments with further steps were added.
> > >
> > >
> > > [1] https://github.com/apache/ignite/pull/7607
> > > [2] https://issues.apache.org/jira/browse/IGNITE-11073
> > >
> > > On Tue, 21 Apr 2020 at 09:53, Alexey Goncharuk
> > > <al...@gmail.com> wrote:
> > > >
> > > > Maxim,
> > > >
> > > > I've left my comments in the PR.
> > > >
> > > > Mon, 20 Apr 2020 at 12:52, Maxim Muzafarov <mm...@apache.org>:
> > > >
> > > > > Alex P,
> > > > > Thank you for the great sophisticated review.
> > > > >
> > > > >
> > > > > Alexey G,
> > > > > Will you take a look at my changes[1]?
> > > > > The fresh TC.Bot visa attached.
> > > > >
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > >
> > > > > On Mon, 20 Apr 2020 at 11:54, Alex Plehanov <plehanov.alex@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > Maxim, I've reviewed your PR and it looks good to me. Good job!
> > > > > >
> > > > > > Fri, 10 Apr 2020 at 19:43, Alexey Goncharuk <alexey.goncharuk@gmail.com>:
> > > > > >
> > > > > > > Maxim,
> > > > > > >
> > > > > > > Thanks for raising this PR. I will do a review during next week.
> > > > > > >
> > > > > > > --AG
> > > > > > >
> > > > >
> >

Re: [DISCUSSION] Hot cache backup

Posted by Alexey Goncharuk <al...@gmail.com>.

Maxim,

I saw the list of the tickets you want to work on in the PR, it looks nice.
I was wondering, what part of that list are you planning to implement
before the feature is released to end users? For example, I agree with
Slava that we should implement a command-line utility part for snapshots
before the release, however I think it's better to do it in a separate
ticket.

I know we do not have a strict policy regarding the development of big
features in the community, so perhaps it's a good time to discuss this?
If we are ok with merging separate tickets to master, how do we ensure
a complete feature is released to the public? If not, should we create
a feature branch and wait for all related tickets to be merged there?
I will be glad to discuss this in a separate thread if needed.

Mon, 27 Apr 2020 at 14:38, Maxim Muzafarov <mm...@apache.org>:

> Folks,
>
>
> Are there any cases left which we need to discuss?
>
> Do you have any questions?
> I'm ready to provide all the details you need for the review.
>
> Who else what to take a look at my changes [1] [2]?
>
>
> [1] https://issues.apache.org/jira/browse/IGNITE-11073
> [2] https://github.com/apache/ignite/pull/7607
>
> On Fri, 24 Apr 2020 at 15:01, Maxim Muzafarov <mm...@apache.org> wrote:
> >
> > Alexey,
> >
> >
> > I've addressed all your comments, please, take a look at the PR [1].
> > Additional tests were added.
> > Additional comments with further steps were added.
> >
> >
> > [1] https://github.com/apache/ignite/pull/7607
> > [2] https://issues.apache.org/jira/browse/IGNITE-11073
> >
> > On Tue, 21 Apr 2020 at 09:53, Alexey Goncharuk
> > <al...@gmail.com> wrote:
> > >
> > > Maxim,
> > >
> > > I've left my comments in the PR.
> > >
> > > Mon, 20 Apr 2020 at 12:52, Maxim Muzafarov <mm...@apache.org>:
> > >
> > > > Alex P,
> > > > Thank you for the great sophisticated review.
> > > >
> > > >
> > > > Alexey G,
> > > > Will you take a look at my changes[1]?
> > > > The fresh TC.Bot visa attached.
> > > >
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > > >
> > > > On Mon, 20 Apr 2020 at 11:54, Alex Plehanov <plehanov.alex@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > Maxim, I've reviewed your PR and it looks good to me. Good job!
> > > > >
> > > > > Fri, 10 Apr 2020 at 19:43, Alexey Goncharuk <alexey.goncharuk@gmail.com>:
> > > > >
> > > > > > Maxim,
> > > > > >
> > > > > > Thanks for raising this PR. I will do a review during next week.
> > > > > >
> > > > > > --AG
> > > > > >
> > > >
>

Re: [DISCUSSION] Hot cache backup

Posted by Maxim Muzafarov <mm...@apache.org>.

Folks,


Are there any cases left which we need to discuss?

Do you have any questions?
I'm ready to provide all the details you need for the review.

Who else wants to take a look at my changes [1] [2]?


[1] https://issues.apache.org/jira/browse/IGNITE-11073
[2] https://github.com/apache/ignite/pull/7607

On Fri, 24 Apr 2020 at 15:01, Maxim Muzafarov <mm...@apache.org> wrote:
>
> Alexey,
>
>
> I've addressed all your comments, please, take a look at the PR [1].
> Additional tests were added.
> Additional comments with further steps were added.
>
>
> [1] https://github.com/apache/ignite/pull/7607
> [2] https://issues.apache.org/jira/browse/IGNITE-11073
>
> On Tue, 21 Apr 2020 at 09:53, Alexey Goncharuk
> <al...@gmail.com> wrote:
> >
> > Maxim,
> >
> > I've left my comments in the PR.
> >
> > Mon, 20 Apr 2020 at 12:52, Maxim Muzafarov <mm...@apache.org>:
> >
> > > Alex P,
> > > Thank you for the great sophisticated review.
> > >
> > >
> > > Alexey G,
> > > Will you take a look at my changes[1]?
> > > The fresh TC.Bot visa attached.
> > >
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> > >
> > > On Mon, 20 Apr 2020 at 11:54, Alex Plehanov <pl...@gmail.com>
> > > wrote:
> > > >
> > > > Maxim, I've reviewed your PR and it looks good to me. Good job!
> > > >
> > > > Fri, 10 Apr 2020 at 19:43, Alexey Goncharuk <alexey.goncharuk@gmail.com>:
> > > >
> > > > > Maxim,
> > > > >
> > > > > Thanks for raising this PR. I will do a review during next week.
> > > > >
> > > > > --AG
> > > > >
> > >

Re: [DISCUSSION] Hot cache backup

Posted by Maxim Muzafarov <mm...@apache.org>.

Alexey,


I've addressed all your comments; please take a look at the PR [1].
Additional tests were added.
Additional comments with further steps were added.


[1] https://github.com/apache/ignite/pull/7607
[2] https://issues.apache.org/jira/browse/IGNITE-11073

On Tue, 21 Apr 2020 at 09:53, Alexey Goncharuk
<al...@gmail.com> wrote:
>
> Maxim,
>
> I've left my comments in the PR.
>
> Mon, 20 Apr 2020 at 12:52, Maxim Muzafarov <mm...@apache.org>:
>
> > Alex P,
> > Thank you for the great sophisticated review.
> >
> >
> > Alexey G,
> > Will you take a look at my changes[1]?
> > The fresh TC.Bot visa attached.
> >
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-11073
> >
> > On Mon, 20 Apr 2020 at 11:54, Alex Plehanov <pl...@gmail.com>
> > wrote:
> > >
> > > Maxim, I've reviewed your PR and it looks good to me. Good job!
> > >
> > > Fri, 10 Apr 2020 at 19:43, Alexey Goncharuk <alexey.goncharuk@gmail.com>:
> > >
> > > > Maxim,
> > > >
> > > > Thanks for raising this PR. I will do a review during next week.
> > > >
> > > > --AG
> > > >
> >

Re: [DISCUSSION] Hot cache backup

Posted by Alexey Goncharuk <al...@gmail.com>.

Maxim,

I've left my comments in the PR.

Mon, 20 Apr 2020 at 12:52, Maxim Muzafarov <mm...@apache.org>:

> Alex P,
> Thank you for the great sophisticated review.
>
>
> Alexey G,
> Will you take a look at my changes[1]?
> The fresh TC.Bot visa attached.
>
>
> [1] https://issues.apache.org/jira/browse/IGNITE-11073
>
> On Mon, 20 Apr 2020 at 11:54, Alex Plehanov <pl...@gmail.com>
> wrote:
> >
> > Maxim, I've reviewed your PR and it looks good to me. Good job!
> >
> > Fri, 10 Apr 2020 at 19:43, Alexey Goncharuk <alexey.goncharuk@gmail.com>:
> >
> > > Maxim,
> > >
> > > Thanks for raising this PR. I will do a review during next week.
> > >
> > > --AG
> > >
>

Re: [DISCUSSION] Hot cache backup

Posted by Maxim Muzafarov <mm...@apache.org>.

Alex P,
Thank you for the great, thorough review.


Alexey G,
Will you take a look at my changes [1]?
A fresh TC.Bot visa is attached.


[1] https://issues.apache.org/jira/browse/IGNITE-11073

On Mon, 20 Apr 2020 at 11:54, Alex Plehanov <pl...@gmail.com> wrote:
>
> Maxim, I've reviewed your PR and it looks good to me. Good job!
>
> Fri, 10 Apr 2020 at 19:43, Alexey Goncharuk <al...@gmail.com>:
>
> > Maxim,
> >
> > Thanks for raising this PR. I will do a review during next week.
> >
> > --AG
> >

Re: [DISCUSSION] Hot cache backup

Posted by Alex Plehanov <pl...@gmail.com>.

Maxim, I've reviewed your PR and it looks good to me. Good job!

Fri, 10 Apr 2020 at 19:43, Alexey Goncharuk <al...@gmail.com>:

> Maxim,
>
> Thanks for raising this PR. I will do a review during next week.
>
> --AG
>

Re: [DISCUSSION] Hot cache backup

Posted by Alexey Goncharuk <al...@gmail.com>.

Maxim,

Thanks for raising this PR. I will do a review during next week.

--AG

Re: [DISCUSSION] Hot cache backup

Posted by Maxim Muzafarov <mm...@apache.org>.

Andrey,


> What about primary/backup node data consistency.

Primary and backup partitions must be fully consistent in a snapshot,
so no additional recovery procedures are required. When we restore a
snapshot on the same topology, everything will work right out of the
box - no WAL needed.

This is achieved by triggering PME [1]. Doing this, we get a point in
time when all started transactions are finished (on backups too) and
new ones are blocked on a new topology version. That's the point in
time when the snapshot operation starts. This is also a weak point of
the current solution, since the process blocks all cluster
transactions for a while. See [2].
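
To make the idea of such a point in time more concrete, here is a
single-JVM analogy. It is not the PME implementation and none of the
names below exist in Ignite: a fair read-write lock stands in for the
exchange, two counters stand in for a primary and a backup copy, and
snapshotPoint() waits for in-flight "transactions" to finish while
holding back new ones.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical single-JVM analogy; not Ignite code. */
final class SnapshotBarrierSketch {
    /** Fair lock: once a snapshot waits for the write lock, new transactions queue behind it. */
    private final ReentrantReadWriteLock gate = new ReentrantReadWriteLock(true);

    /** Stand-ins for a primary and a backup copy of the same data. */
    private final AtomicLong primary = new AtomicLong();
    private final AtomicLong backup = new AtomicLong();

    /** A "transaction": updates both copies while holding the shared lock. */
    void transaction() {
        gate.readLock().lock();

        try {
            primary.incrementAndGet();
            backup.incrementAndGet();
        }
        finally {
            gate.readLock().unlock();
        }
    }

    /** Waits for in-flight transactions, blocks new ones, and reads both copies. */
    long[] snapshotPoint() {
        gate.writeLock().lock();

        try {
            return new long[] {primary.get(), backup.get()};
        }
        finally {
            gate.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SnapshotBarrierSketch s = new SnapshotBarrierSketch();
        Thread[] writers = new Thread[4];

        for (int i = 0; i < writers.length; i++) {
            writers[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++)
                    s.transaction();
            });

            writers[i].start();
        }

        // Taken while writers may still be running; both values always match.
        long[] cut = s.snapshotPoint();

        for (Thread t : writers)
            t.join();

        System.out.println("primary=" + cut[0] + ", backup=" + cut[1]);
    }
}
```

The two values read in snapshotPoint() are always equal, which mirrors
why no recovery is needed after restore. The cost is the same as in the
real solution: while the write lock is held, every transaction waits.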

> I can't quite picture how persistence rebalancing works

The WAL rebalance will not happen. A full rebalance will be used when
restoring a snapshot on a different topology. For now, only restoring
on the same cluster topology (same baseline) will work fine; other
cases must be explicitly tested, but in theory they will work too.

> Did you analyze alternative snapshot solutions based on WAL?

Do you mean taking snapshots from the cluster without blocking
transactions (without PME)? It's not a trivial task from my point of
view. Currently, I have no design for it which can cover all corner
cases.


[1] https://cwiki.apache.org/confluence/display/IGNITE/%28Partition+Map%29+Exchange+-+under+the+hood
[2] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/GridDhtPartitionsExchangeFuture.java#L1524

On Thu, 9 Apr 2020 at 00:52, Andrey Dolmatov <it...@gmail.com> wrote:
>
> I would like to understand your solution more deeply. I hope that my
> questions are interesting not only to me:
>
>    - What about primary/backup node data consistency? I found that [1]
>    Cassandra uses eventually consistent backups, so some backup data could
>    be missing from a snapshot. If I apply a snapshot, would Ignite detect
>    this and rebalance data to backup nodes?
>    - I can't quite picture how persistence rebalancing works, but according
>    to [2] it uses WAL logs. A snapshot doesn't contain WAL data, correct?
>    Did you analyze alternative snapshot solutions based on WAL?
>
> [1]
> https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsAboutSnapshots.html
> [2]
> https://cwiki.apache.org/confluence/display/IGNITE/Persistent+Store+Architecture#PersistentStoreArchitecture-Rebalancing
>
> Wed, 8 Apr 2020 at 18:22, Maxim Muzafarov <mm...@apache.org>:
>
> > Andrey,
> >
> >
> > Thanks for your questions, I've also clarified some details on the
> > IEP-43 [1] page according to them.
> >
> > > Does snapshot contain only primary data or backup partitions or both?
> >
> > A snapshot contains a full copy of persistence data on each local
> > node. This means all primary, backup partitions and the SQL index file
> > available on the local node are copied to snapshot.
> >
> > > Could I create snapshot from m-node cluster and apply it to n-node
> > cluster (n<>m)?
> >
> > Currently, the restore procedure is fully manual, but it is possible
> > to restore on a different topology in general. There are a few options
> > here:
> > - m == n, the easiest and fastest way
> > - m < n, the cluster will start and the rebalance will happen (see
> > testClusterSnapshotWithRebalancing in the PR). If some SQL indexes
> > exist, it may take quite a long time to complete.
> > - m > n, the hardest case. For instance, if backups > 1 you can start
> > a cluster and remove nodes one by one from the baseline. I think this
> > case should be covered by additional recovery scripts which will be
> > developed further.
> >
> > > Should a data node have extra space on the persistent store to
> > > create a snapshot? Or, from another point of view, would the size of
> > > the temporary file be equal to the size of all data on a cluster node?
> >
> > If a cluster has no load, you only need enough free space to store
> > the snapshot, which is almost equal to the node `db` directory size.
> >
> > If a cluster is under load, it needs some extra space to store
> > intermediate snapshot results. The amount of such space depends on how
> > fast cache partition files are copied to the snapshot directory (e.g.
> > if disks are slow). The maximum size of the temporary file for each
> > partition is equal to the size of the corresponding partition file.
> > So, in the worst case you need x3 extra disk size. But according to my
> > measurements, assuming SSDs are used and each partition is 300MB, it
> > requires no more than 1-3% extra space for a cluster under high load.
> >
> > > What is the resulting snapshot: a single file or a collection of
> > > files (one for every data node)?
> >
> > Check the example of the snapshot directory structure on the IEP-43
> > page [1]; this is what a completed snapshot will look like.
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-43%3A+Cluster+snapshots#IEP-43:Clustersnapshots-Restoresnapshot(manually)
> >
> > On Wed, 8 Apr 2020 at 17:18, Andrey Dolmatov <it...@gmail.com> wrote:
> > >
> > > Hi, Maxim!
> > > It is a very useful feature, great job!
> > >
> > > But could you explain some aspects to me?
> > >
> > >    - Does a snapshot contain only primary data, only backup
> > >    partitions, or both?
> > >    - Could I create a snapshot from an m-node cluster and apply it to
> > >    an n-node cluster (n<>m)?
> > >    - Should a data node have extra space on the persistent store to
> > >    create a snapshot? Or, from another point of view, would the size
> > >    of the temporary file be equal to the size of all data on a
> > >    cluster node?
> > >    - What is the resulting snapshot: a single file or a collection of
> > >    files (one for every data node)?
> > >
> > > I apologize for my questions, but I am really interested in this feature.
> > >
> > >
> > > Tue, 7 Apr 2020 at 22:10, Maxim Muzafarov <ma...@gmail.com>:
> > >
> > > > Igniters,
> > > >
> > > >
> > > > I'd like to get back to the discussion of a snapshot operation for
> > > > Apache Ignite persistent cache groups, and I propose my changes
> > > > below. I have prepared everything so that the discussion is as
> > > > meaningful and specific as possible:
> > > >
> > > > - IEP-43: Cluster snapshot [1]
> > > > - The Jira task IGNITE-11073 [2]
> > > > - PR with described changes, Patch Available [4]
> > > >
> > > > Changes are ready for review.
> > > >
> > > >
> > > > Here are a few implementation details and my thoughts:
> > > >
> > > > 1. Snapshot restore is assumed to be manual at the first step. The
> > > > process will be described on our documentation pages, but it is
> > > > possible to start a node right from the snapshot directory since the
> > > > directory structure is preserved (check
> > > > `testConsistentClusterSnapshotUnderLoad` in the PR). We also have some
> > > > options here for what the restore process should look like:
> > > > - fully manual snapshot restore (will be documented)
> > > > - ansible or shell scripts for restore
> > > > - Java API for restore (I doubt we should go this way).
> > > >
> > > > 2. The snapshot `create` procedure creates a snapshot of all
> > > > persistent caches available on the cluster (see limitations [1]).
> > > >
> > > > 3. The snapshot `create` procedure is available through Java API and
> > > > JMX (control.sh may be implemented later).
> > > >
> > > > Java API:
> > > > IgniteFuture<Void> fut = ignite.snapshot()
> > > > .createSnapshot(name);
> > > >
> > > > JMX:
> > > > SnapshotMXBean mxBean = getMBean(ignite.name());
> > > > mxBean.createSnapshot(name);
> > > >
> > > > 4. The Distributed Process [3] is used to perform a cluster-wide
> > > > snapshot procedure, so we've avoided a lot of boilerplate code here.
> > > >
> > > > 5. The design document [1] also contains an internal API for creating
> > > > a consistent local snapshot of requested cache groups and transferring
> > > > it to another node using the FileTransmission protocol [6]. This is one
> > > > of the parts of IEP-28 [5] for cluster rebalancing via partition files
> > > > and an important part for understanding the whole design.
> > > >
> > > > Java API:
> > > > public IgniteInternalFuture<Void> createRemoteSnapshot(
> > > >     UUID rmtNodeId,
> > > >     Map<Integer, Set<Integer>> parts,
> > > >     BiConsumer<File, GroupPartitionId> partConsumer);
> > > >
> > > >
> > > > Please, share your thoughts and take a loot at my changes [4].
> > > >
> > > >
> > > > [1]
> > > >
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-43%3A+Cluster+snapshots
> > > > [2] https://issues.apache.org/jira/browse/IGNITE-11073
> > > > [3]
> > > >
> > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/util/distributed/DistributedProcess.java#L49
> > > > [4] https://github.com/apache/ignite/pull/7607
> > > > [5]
> > > >
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-28%3A+Cluster+peer-2-peer+balancing#IEP-28:Clusterpeer-2-peerbalancing-Filetransferbetweennodes
> > > > [6]
> > > >
> > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/communication/TransmissionHandler.java#L42
> > > >
> > > >
> > > > On Thu, 28 Feb 2019 at 14:43, Dmitriy Pavlov <dp...@apache.org>
> > wrote:
> > > > >
> > > > > Hi Maxim,
> > > > >
> > > > > I agree with Denis and I have just one concern here.
> > > > >
> > > > > Apache Ignite has quite a long story (started even before Apache),
> > and
> > > > now
> > > > > it has a way too huge number of features. Some of these features
> > > > > - are developed and well known by community members,
> > > > > - some of them were contributed a long time ago and nobody develops
> > it,
> > > > > - and, actually, in some rare cases, nobody in the community knows
> > how it
> > > > > works and how to change it.
> > > > >
> > > > > Such features may attract users, but a bug in it may ruin impression
> > > > about
> > > > > the product. Even worse, nobody can help to solve it, and only user
> > > > himself
> > > > > or herself may be encouraged to contribute a fix.
> > > > >
> > > > > And my concern here, such a big feature should have a number of
> > > > interested
> > > > > contributors, who can support it in case if others lost interest. I
> > will
> > > > be
> > > > > happy if 3-5 members will come and say, yes, I will do a review/I
> > will
> > > > help
> > > > > with further changes.
> > > > >
> > > > > Just to be clear, I'm not against it, and I'll never cast -1 for it,
> > but
> > > > it
> > > > > would be more comfortable to develop this feature with understanding
> > that
> > > > > this work will not be useless.
> > > > >
> > > > > Sincerely,
> > > > > Dmitriy Pavlov
> > > > >
> > > > > ср, 27 февр. 2019 г. в 23:36, Denis Magda <dm...@apache.org>:
> > > > >
> > > > > > Maxim,
> > > > > >
> > > > > > GridGain has this exact feature available for Ignite native
> > persistence
> > > > > > deployments. It's not as easy as it might have been seen from the
> > > > > > enablement perspective. Took us many years to make it production
> > ready,
> > > > > > involving many engineers. If the rest of the community wants to
> > create
> > > > > > something similar and available in open source then please take
> > this
> > > > > > estimate into consideration.
> > > > > >
> > > > > > -
> > > > > > Denis
> > > > > >
> > > > > >
> > > > > > On Wed, Feb 27, 2019 at 8:53 AM Maxim Muzafarov <
> > maxmuzaf@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Igniters,
> > > > > > >
> > > > > > > Some of the stores with which the Apache Ignite is often
> > compared has
> > > > > > > a feature called Snapshots [1] [2]. This feature provides an
> > > > > > > eventually consistent view on stored data for different purposes
> > > > (e.g.
> > > > > > > moving data between environments, saving a backup of data for the
> > > > > > > further restore procedure and so on). The Apache Ignite has all
> > > > > > > opportunities and machinery to provide cache and\or data region
> > > > > > > snapshots out of the box but still don't have them.
> > > > > > >
> > > > > > > This issue derives from IEP-28 [5] on which I'm currently
> > working on
> > > > > > > (partially described in the section [6]). I would like to solve
> > this
> > > > > > > issue too and make Apache Ignite more attractive to use on a
> > > > > > > production environment. I've haven't investigated in-memory type
> > > > > > > caches yet, but for caches with enabled persistence, we can do it
> > > > > > > without any performance impact on cache operations (some
> > additional
> > > > IO
> > > > > > > operations are needed to copy cache data to backup store, copy on
> > > > > > > write technique is used here). We just need to use our
> > DiscoverySpi,
> > > > > > > PME and Checkpointer process the right way.
> > > > > > >
> > > > > > > For the first step, we can store all backup data on each of cache
> > > > > > > affinity node locally. For instance, the
> > `backup\snapshotId\cache0`
> > > > > > > folder will be created and all `cache0` partitions will be stored
> > > > > > > there for each local node for the snapshot process with id
> > > > > > > `snapshotId`. In future, we can teach nodes to upload snapshotted
> > > > > > > partitions to the one remote node or cloud.
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > High-level process overview
> > > > > > >
> > > > > > > A new snapshot process is managed via DiscoverySpi and
> > > > > > > CommunicationSpi messages.
> > > > > > >
> > > > > > > 1. The initiator sends a request to the cluster
> > (DiscoveryMessage).
> > > > > > > 2. When the node receives a message it initiates PME.
> > > > > > > 3. The node begins checkpoint process (holding write lock a short
> > > > time)
> > > > > > > 4. The node starts to track any write attempts to the
> > snapshotting
> > > > > > > partition and places the copy of original pages to the temp file.
> > > > > > > 5. The node performs merge the partition file with the
> > corresponding
> > > > > > delta.
> > > > > > > 6. When the node finishes the backup process it sends ack message
> > > > with
> > > > > > > saved partitions to the initiator (or the error response).
> > > > > > > 7. When all ack messages received the backup is finished.
> > > > > > >
> > > > > > > The only problem here is that when the request message arrives
> > at the
> > > > > > > particular node during running checkpoint PME will be locked
> > until it
> > > > > > > ends. This is not good. But hopefully, it will be fixed here [4].
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > Probable API
> > > > > > >
> > > > > > > From the cache perspective:
> > > > > > >
> > > > > > > IgniteFuture<IgniteSnapshot> snapshotFut =
> > > > > > >     ignite.cache("default")
> > > > > > >         .shapshotter()
> > > > > > >         .create("myShapshotId");
> > > > > > >
> > > > > > > IgniteSnapshot cacheSnapshot = snapshotFut.get();
> > > > > > >
> > > > > > > IgniteCache<K, V> copiedCache =
> > > > > > >     ignite.createCache("CopyCache")
> > > > > > >         .withConfiguration(defaultCache.getConfiguration())
> > > > > > >         .loadFromSnapshot(cacheSnapshot.id());
> > > > > > >
> > > > > > > From the command line perspective:
> > > > > > >
> > > > > > > control.sh --snapshot take cache0,cache1,cache2
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > WDYT?
> > > > > > > Will it be a useful feature for the Apache Ignite?
> > > > > > >
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > >
> > > >
> > https://geode.apache.org/docs/guide/10/managing/cache_snapshots/chapter_overview.html
> > > > > > > [2]
> > > > > > >
> > > > > >
> > > >
> > https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsBackupTakesSnapshot.html
> > > > > > > [3]
> > > > > > >
> > > > > >
> > > >
> > http://apache-ignite-developers.2346864.n4.nabble.com/Data-Snapshots-in-Ignite-td4183.html
> > > > > > > [4] https://issues.apache.org/jira/browse/IGNITE-10508
> > > > > > > [5]
> > > > > > >
> > > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-28%3A+Cluster+peer-2-peer+balancing
> > > > > > > [6]
> > > > > > >
> > > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-28%3A+Cluster+peer-2-peer+balancing#IEP-28:Clusterpeer-2-peerbalancing-Checkpointer
> > > > > > >
> > > > > >
> > > >
> >

Re: [DISCUSSION] Hot cache backup

Posted by Andrey Dolmatov <it...@gmail.com>.

I would like to understand your solution in more depth. I hope my questions
are of interest to others as well:

   - What about data consistency between primary and backup nodes? I found
   that Cassandra uses eventually consistent backups [1], so some backup data
   could be missing from a snapshot. If I apply a snapshot, would Ignite
   detect this and rebalance data to the backup nodes?
   - I can't quite picture how persistence rebalancing works, but according
   to [2] it uses WAL logs. A snapshot doesn't contain WAL data, correct? Did
   you analyze alternative snapshot solutions based on the WAL?

[1]
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsAboutSnapshots.html
[2]
https://cwiki.apache.org/confluence/display/IGNITE/Persistent+Store+Architecture#PersistentStoreArchitecture-Rebalancing


Re: [DISCUSSION] Hot cache backup

Posted by Maxim Muzafarov <mm...@apache.org>.

Andrey,


Thanks for your questions. I've also clarified some details on the
IEP-43 [1] page in response to them.

> Does snapshot contain only primary data or backup partitions or both?

A snapshot contains a full copy of the persistent data on each local
node. This means that all primary and backup partitions, as well as the
SQL index file, available on the local node are copied to the snapshot.

> Could I create snapshot from m-node cluster and apply it to n-node cluster (n<>m)?

Currently, the restore procedure is fully manual, but in general it is
possible to restore on a different topology. There are a few options
here:
- m == n, the easiest and fastest way.
- m < n, the cluster will start and rebalancing will happen (see
testClusterSnapshotWithRebalancing in the PR). If SQL indexes exist,
it may take quite a long time to complete.
- m > n, the hardest case. For instance, if backups > 1 you can start
a cluster and remove nodes one by one from the baseline. I think this case
should be covered by additional recovery scripts, which will be
developed later.
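
To make the three cases above concrete, here is a minimal Java sketch that maps the node counts to a restore approach. The class, enum, and method names are hypothetical and exist only for illustration; they are not part of the Ignite API or the PR, and the restore itself remains a manual procedure.

```java
/** Hypothetical helper: picks a manual restore approach from the node counts. */
final class SnapshotRestorePlan {
    /** The three cases from the list above (m = snapshot nodes, n = target nodes). */
    enum Plan {
        /** m == n: start the nodes directly from the snapshot directories. */
        START_AS_IS,
        /** m < n: start the cluster and let rebalancing populate the extra nodes. */
        START_AND_REBALANCE,
        /** m > n: start the cluster, then remove nodes one by one from the baseline. */
        SHRINK_BASELINE
    }

    /** Chooses the approach for a snapshot taken on m nodes and restored on n nodes. */
    static Plan choose(int snapshotNodes, int targetNodes) {
        if (snapshotNodes <= 0 || targetNodes <= 0)
            throw new IllegalArgumentException("Node counts must be positive");

        if (snapshotNodes == targetNodes)
            return Plan.START_AS_IS;

        return snapshotNodes < targetNodes ? Plan.START_AND_REBALANCE : Plan.SHRINK_BASELINE;
    }

    public static void main(String[] args) {
        System.out.println(choose(4, 4));
        System.out.println(choose(4, 6));
        System.out.println(choose(4, 2));
    }
}
```

The sketch only names the approach; the m > n case still depends on having enough backups so that no partition is lost while the baseline shrinks.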

> Should a data node have extra space on the persistent store to create a snapshot? Or, from another point of view, would the size of the temporary file be equal to the size of all data on the cluster node?

If a cluster has no load, you only need free space to store the
snapshot, which is almost equal to the size of the node's `db` directory.

If a cluster is under load, it needs some extra space to store
intermediate snapshot results. The amount of such space depends on how
fast the cache partition files are copied to the snapshot directory (which
matters if disks are slow). The maximum size of the temporary file for each
partition is equal to the size of the corresponding partition file. So, in
the worst case you need x3 extra disk space. But according to my
measurements, assuming an SSD is used and each partition is 300MB, it
requires no more than 1-3% extra for a cluster under high load.
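
As a rough illustration of the figures above, the following Java sketch estimates the space for the snapshot copy and for the temporary files. All names are hypothetical, and treating the 1-3% measurement as a percentage of the node's partition data is an assumption made for this example, not something stated in the IEP.

```java
/** Hypothetical helper for the disk space figures discussed above. */
final class SnapshotDiskEstimate {
    /** Sums the sizes of the given partition files. */
    static long totalBytes(long[] partitionBytes) {
        long sum = 0;
        for (long size : partitionBytes)
            sum += size;
        return sum;
    }

    /** Space for the snapshot copy itself: roughly the size of the node's `db` directory. */
    static long snapshotCopyBytes(long dbDirBytes) {
        return dbDirBytes;
    }

    /** Upper bound for temporary files: one per partition, each at most the partition's size. */
    static long worstCaseTempBytes(long[] partitionBytes) {
        return totalBytes(partitionBytes);
    }

    /** Temporary space for an assumed overhead percentage of the data size (e.g. 3 for 3%). */
    static long tempBytesAtOverhead(long dataBytes, int overheadPercent) {
        return dataBytes * overheadPercent / 100;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] parts = {300 * mb, 300 * mb, 300 * mb, 300 * mb};
        long data = totalBytes(parts);

        System.out.println("snapshot copy, MB:   " + snapshotCopyBytes(data) / mb);
        System.out.println("temp worst case, MB: " + worstCaseTempBytes(parts) / mb);
        System.out.println("temp at 3%, MB:      " + tempBytesAtOverhead(data, 3) / mb);
    }
}
```

The gap between the worst-case bound and the measured overhead comes from how quickly partition files are copied: the sooner a partition is copied, the fewer pages need to be preserved in its temporary file.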

> What is the resulting snapshot, a single file or a collection of files (one
> for every data node)?

Check the example of the snapshot directory structure on the IEP-43
page [1]; this is what a completed snapshot will look like.

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-43%3A+Cluster+snapshots#IEP-43:Clustersnapshots-Restoresnapshot(manually)

On Wed, 8 Apr 2020 at 17:18, Andrey Dolmatov <it...@gmail.com> wrote:
>
> Hi, Maxim!
> It is a very useful feature, great job!
>
> But could you explain some aspects to me?
>
>    - Does a snapshot contain only primary data, or backup partitions, or both?
>    - Could I create a snapshot from an m-node cluster and apply it to an n-node
>    cluster (n<>m)?
>    - Should a data node have extra space on the persistent store to create
>    a snapshot? Or, from another point of view, would the size of the temporary
>    file be equal to the size of all data on the cluster node?
>    - What is the resulting snapshot, a single file or a collection of files
>    (one for every data node)?
>
> I apologize for my questions, but I am really interested in this feature.
>
>
> On Tue, 7 Apr 2020 at 22:10, Maxim Muzafarov <ma...@gmail.com> wrote:
>
> > Igniters,
> >
> >
> > I'd like to get back to the discussion of a snapshot operation for Apache
> > Ignite for persistent cache groups, and I propose my changes below. I
> > have prepared everything so that the discussion is as meaningful and
> > specific as possible:
> >
> > - IEP-43: Cluster snapshot [1]
> > - The Jira task IGNITE-11073 [2]
> > - PR with described changes, Patch Available [4]
> >
> > Changes are ready for review.
> >
> >
> > Here are a few implementation details and my thoughts:
> >
> > 1. Snapshot restore is assumed to be manual as a first step. The
> > process will be described on our documentation pages, but it is
> > possible to start a node right from the snapshot directory since the
> > directory structure is preserved (check
> > `testConsistentClusterSnapshotUnderLoad` in the PR). We also have some
> > options for how the restore process should look:
> > - fully manual snapshot restore (will be documented)
> > - Ansible or shell scripts for restore
> > - Java API for restore (I doubt we should go this way).
> >
> > 2. The snapshot `create` procedure creates a snapshot of all
> > persistent caches available on the cluster (see limitations [1]).
> >
> > 3. The snapshot `create` procedure is available through the Java API and
> > JMX (control.sh support may be implemented later).
> >
> > Java API:
> > IgniteFuture<Void> fut = ignite.snapshot()
> >     .createSnapshot(name);
> >
> > JMX:
> > SnapshotMXBean mxBean = getMBean(ignite.name());
> > mxBean.createSnapshot(name);
> >
> > 4. The DistributedProcess [3] is used to perform a cluster-wide
> > snapshot procedure, so we've avoided a lot of boilerplate code here.
> >
> > 5. The design document [1] also contains an internal API for creating
> > a consistent local snapshot of requested cache groups and transferring it
> > to another node using the FileTransmission protocol [6]. This is one
> > of the parts of IEP-28 [5] for cluster rebalancing via partition files
> > and an important part for understanding the whole design.
> >
> > Java API:
> > public IgniteInternalFuture<Void> createRemoteSnapshot(
> >     UUID rmtNodeId,
> >     Map<Integer, Set<Integer>> parts,
> >     BiConsumer<File, GroupPartitionId> partConsumer);
> >
> >
> > Please share your thoughts and take a look at my changes [4].
> >
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-43%3A+Cluster+snapshots
> > [2] https://issues.apache.org/jira/browse/IGNITE-11073
> > [3]
> > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/util/distributed/DistributedProcess.java#L49
> > [4] https://github.com/apache/ignite/pull/7607
> > [5]
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-28%3A+Cluster+peer-2-peer+balancing#IEP-28:Clusterpeer-2-peerbalancing-Filetransferbetweennodes
> > [6]
> > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/communication/TransmissionHandler.java#L42
> >
> >
> > On Thu, 28 Feb 2019 at 14:43, Dmitriy Pavlov <dp...@apache.org> wrote:
> > >
> > > Hi Maxim,
> > >
> > > I agree with Denis and I have just one concern here.
> > >
> > > Apache Ignite has quite a long history (it started even before Apache), and
> > > now it has far too many features. Some of these features
> > > - are developed and well known by community members,
> > > - some of them were contributed a long time ago and nobody develops them,
> > > - and, actually, in some rare cases, nobody in the community knows how they
> > > work and how to change them.
> > >
> > > Such features may attract users, but a bug in one of them may ruin the
> > > impression of the product. Even worse, nobody can help to solve it, and only
> > > the user himself or herself may be encouraged to contribute a fix.
> > >
> > > And my concern here is that such a big feature should have a number of
> > > interested contributors who can support it in case others lose interest. I
> > > will be happy if 3-5 members come and say: yes, I will do a review / I will
> > > help with further changes.
> > >
> > > Just to be clear, I'm not against it, and I'll never cast -1 for it, but it
> > > would be more comfortable to develop this feature with the understanding that
> > > this work will not be useless.
> > >
> > > Sincerely,
> > > Dmitriy Pavlov
> > >
> > > On Wed, 27 Feb 2019 at 23:36, Denis Magda <dm...@apache.org> wrote:
> > >
> > > > Maxim,
> > > >
> > > > GridGain has this exact feature available for Ignite native persistence
> > > > deployments. It's not as easy as it might seem from the
> > > > enablement perspective. It took us many years to make it production ready,
> > > > involving many engineers. If the rest of the community wants to create
> > > > something similar and available in open source, then please take this
> > > > estimate into consideration.
> > > >
> > > > -
> > > > Denis
> > > >
> > > >
> > > > On Wed, Feb 27, 2019 at 8:53 AM Maxim Muzafarov <ma...@gmail.com>
> > > > wrote:
> > > >
> > > > > Igniters,
> > > > >
> > > > > > Some of the stores with which Apache Ignite is often compared have
> > > > > > a feature called Snapshots [1] [2]. This feature provides an
> > > > > > eventually consistent view of stored data for different purposes
> > > > > > (e.g. moving data between environments, saving a backup of data for
> > > > > > a later restore procedure, and so on). Apache Ignite has all the
> > > > > > opportunities and machinery to provide cache and/or data region
> > > > > > snapshots out of the box but still doesn't have them.
> > > > > >
> > > > > > This issue derives from IEP-28 [5], on which I'm currently working
> > > > > > (partially described in the section [6]). I would like to solve this
> > > > > > issue too and make Apache Ignite more attractive to use in a
> > > > > > production environment. I haven't investigated in-memory
> > > > > > caches yet, but for caches with persistence enabled, we can do it
> > > > > > without any performance impact on cache operations (some additional
> > > > > > IO operations are needed to copy cache data to the backup store; a
> > > > > > copy-on-write technique is used here). We just need to use our
> > > > > > DiscoverySpi, PME and Checkpointer process the right way.
> > > > >
> > > > > > As a first step, we can store all backup data locally on each cache
> > > > > > affinity node. For instance, the `backup\snapshotId\cache0`
> > > > > > folder will be created and all `cache0` partitions will be stored
> > > > > > there on each local node for the snapshot process with id
> > > > > > `snapshotId`. In the future, we can teach nodes to upload snapshotted
> > > > > > partitions to a single remote node or to the cloud.
> > > > >
> > > > > --
> > > > >
> > > > > High-level process overview
> > > > >
> > > > > > A new snapshot process is managed via DiscoverySpi and
> > > > > > CommunicationSpi messages.
> > > > > >
> > > > > > 1. The initiator sends a request to the cluster (DiscoveryMessage).
> > > > > > 2. When a node receives the message, it initiates PME.
> > > > > > 3. The node begins the checkpoint process (holding the write lock for
> > > > > > a short time).
> > > > > > 4. The node starts to track any write attempts to the partition being
> > > > > > snapshotted and places a copy of the original pages into a temp file.
> > > > > > 5. The node merges the partition file with the corresponding delta.
> > > > > > 6. When the node finishes the backup process, it sends an ack message
> > > > > > with the saved partitions to the initiator (or an error response).
> > > > > > 7. When all ack messages are received, the backup is finished.
> > > > > >
> > > > > > The only problem here is that when the request message arrives at a
> > > > > > particular node while a checkpoint is running, PME will be blocked
> > > > > > until the checkpoint ends. This is not good, but hopefully it will be
> > > > > > fixed here [4].
> > > > >
> > > > > --
> > > > >
> > > > > Probable API
> > > > >
> > > > > From the cache perspective:
> > > > >
> > > > > > IgniteFuture<IgniteSnapshot> snapshotFut =
> > > > > >     ignite.cache("default")
> > > > > >         .snapshotter()
> > > > > >         .create("mySnapshotId");
> > > > > >
> > > > > > IgniteSnapshot cacheSnapshot = snapshotFut.get();
> > > > > >
> > > > > > IgniteCache<K, V> copiedCache =
> > > > > >     ignite.createCache("CopyCache")
> > > > > >         .withConfiguration(defaultCache.getConfiguration())
> > > > > >         .loadFromSnapshot(cacheSnapshot.id());
> > > > >
> > > > > From the command line perspective:
> > > > >
> > > > > control.sh --snapshot take cache0,cache1,cache2
> > > > >
> > > > > --
> > > > >
> > > > > WDYT?
> > > > > Will it be a useful feature for the Apache Ignite?
> > > > >
> > > > >
> > > > > [1]
> > > > >
> > > >
> > https://geode.apache.org/docs/guide/10/managing/cache_snapshots/chapter_overview.html
> > > > > [2]
> > > > >
> > > >
> > https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsBackupTakesSnapshot.html
> > > > > [3]
> > > > >
> > > >
> > http://apache-ignite-developers.2346864.n4.nabble.com/Data-Snapshots-in-Ignite-td4183.html
> > > > > [4] https://issues.apache.org/jira/browse/IGNITE-10508
> > > > > [5]
> > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-28%3A+Cluster+peer-2-peer+balancing
> > > > > [6]
> > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-28%3A+Cluster+peer-2-peer+balancing#IEP-28:Clusterpeer-2-peerbalancing-Checkpointer
> > > > >
> > > >
> >
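
Steps 3 to 5 of the process overview quoted above (start tracking writes at the checkpoint, save the original pages to a temp file, then merge the partition file with that delta) can be illustrated with a toy model. This is only a minimal sketch under stated assumptions, not Ignite code: the class `CopyOnWriteSnapshotSketch`, its methods, and the representation of a partition as a `long[]` of pages are all hypothetical names invented for the illustration, and the in-memory map merely stands in for the temp (delta) file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Toy model of copy-on-write snapshotting. Hypothetical, not an Ignite class. */
public class CopyOnWriteSnapshotSketch {
    /** Live partition "file": one long value per page (assumed representation). */
    private final long[] pages;

    /** Stand-in for the temp (delta) file: page index -> content at the snapshot point. */
    private final Map<Integer, Long> delta = new HashMap<>();

    /** True while a snapshot is in progress and writes must be tracked. */
    private boolean tracking;

    public CopyOnWriteSnapshotSketch(long[] initialPages) {
        pages = initialPages.clone();
    }

    /** Marks the snapshot point (the checkpoint) and starts tracking writes. */
    public void beginSnapshot() {
        delta.clear();
        tracking = true;
    }

    /** Regular page write. While tracking, the first overwrite of a page saves its original. */
    public void write(int pageIdx, long data) {
        if (tracking)
            delta.putIfAbsent(pageIdx, pages[pageIdx]);

        pages[pageIdx] = data;
    }

    /** Copies the (possibly modified) partition, then merges the saved originals on top. */
    public long[] finishSnapshot() {
        long[] copy = pages.clone();

        // Restore every page that changed after the snapshot point.
        delta.forEach((idx, original) -> copy[idx] = original);

        tracking = false;
        delta.clear();

        return copy;
    }

    /** Returns a copy of the live pages. */
    public long[] livePages() {
        return pages.clone();
    }

    public static void main(String[] args) {
        CopyOnWriteSnapshotSketch part = new CopyOnWriteSnapshotSketch(new long[] {10, 20, 30});

        part.beginSnapshot();
        part.write(1, 21);
        part.write(1, 22); // A second write must not replace the saved original.
        part.write(2, 31);

        System.out.println("snapshot = " + Arrays.toString(part.finishSnapshot())); // [10, 20, 30]
        System.out.println("live     = " + Arrays.toString(part.livePages()));      // [10, 22, 31]
    }
}
```

The point of the design is that cache writes are never blocked for the duration of the copy: only the first overwrite of each page pays the extra IO of saving the original, which matches the "some additional IO operations" remark in the quoted proposal.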

Re: [DISCUSSION] Hot cache backup

Posted by Andrey Dolmatov <it...@gmail.com>.

Hi, Maxim!
It is a very useful feature, great job!

Could you explain some aspects to me?

   - Does a snapshot contain only primary partitions, only backup
   partitions, or both?
   - Could I create a snapshot from an m-node cluster and apply it to an
   n-node cluster (n <> m)?
   - Does a data node need extra space on the persistent store to create a
   snapshot? Or, put another way, would the size of the temporary file be
   equal to the size of all data on the cluster node?
   - What is the resulting snapshot: a single file or a collection of files
   (one for every data node)?

I apologize for all the questions, but I am really interested in this feature.


On Tue, Apr 7, 2020 at 22:10, Maxim Muzafarov <ma...@gmail.com> wrote:

> Igniters,
>
>
> I'd like to get back to the discussion of a snapshot operation for Apache
> Ignite persistent cache groups, and I propose my changes below. I
> have prepared everything so that the discussion is as meaningful and
> specific as possible:
>
> - IEP-43: Cluster snapshot [1]
> - The Jira task IGNITE-11073 [2]
> - PR with described changes, Patch Available [4]
>
> Changes are ready for review.
>
>
> Here are a few implementation details and my thoughts:
>
> 1. Snapshot restore is assumed to be manual at the first step. The
> process will be described on our documentation pages, but it is
> possible to start a node right from the snapshot directory since the
> directory structure is preserved (check
> `testConsistentClusterSnapshotUnderLoad` in the PR). We also have some
> options here for what the restore process should look like:
> - fully manual snapshot restore (will be documented)
> - ansible or shell scripts for restore
> - Java API for restore (I doubt we should go this way).
>
> 2. The snapshot `create` procedure creates a snapshot of all
> persistent caches available on the cluster (see limitations [1]).
>
> 3. The snapshot `create` procedure is available through the Java API and
> JMX (control.sh support may be implemented later).
>
> Java API:
> IgniteFuture<Void> fut = ignite.snapshot()
>     .createSnapshot(name);
>
> JMX:
> SnapshotMXBean mxBean = getMBean(ignite.name());
> mxBean.createSnapshot(name);
>
> 4. The Distributed Process [3] is used to perform a cluster-wide
> snapshot procedure, so we've avoided a lot of boilerplate code here.
>
> 5. The design document [1] also contains an internal API for creating
> a consistent local snapshot of requested cache groups and transferring it
> to another node using the FileTransmission protocol [6]. This is one
> of the parts of IEP-28 [5] for cluster rebalancing via partition files
> and an important part for understanding the whole design.
>
> Java API:
> public IgniteInternalFuture<Void> createRemoteSnapshot(
>     UUID rmtNodeId,
>     Map<Integer, Set<Integer>> parts,
>     BiConsumer<File, GroupPartitionId> partConsumer);
>
>
> Please share your thoughts and take a look at my changes [4].
>
>
> [1]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-43%3A+Cluster+snapshots
> [2] https://issues.apache.org/jira/browse/IGNITE-11073
> [3]
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/util/distributed/DistributedProcess.java#L49
> [4] https://github.com/apache/ignite/pull/7607
> [5]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-28%3A+Cluster+peer-2-peer+balancing#IEP-28:Clusterpeer-2-peerbalancing-Filetransferbetweennodes
> [6]
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/communication/TransmissionHandler.java#L42
>

Re: [DISCUSSION] Hot cache backup

Posted by Nikolay Izhikov <ni...@apache.org>.

Hello, Maxim.

Great to see such an important feature in Ignite!
Please let me know if you need any help with the review.

> On Apr 7, 2020, at 22:10, Maxim Muzafarov <ma...@gmail.com> wrote:
> 
> Igniters,
> 
> 
> I'd like to get back to the discussion of a snapshot operation for Apache
> Ignite persistent cache groups, and I propose my changes below. I
> have prepared everything so that the discussion is as meaningful and
> specific as possible:
> 
> - IEP-43: Cluster snapshot [1]
> - The Jira task IGNITE-11073 [2]
> - PR with described changes, Patch Available [4]
> 
> Changes are ready for review.
> 
> 
> Here are a few implementation details and my thoughts:
> 
> 1. Snapshot restore is assumed to be manual at the first step. The
> process will be described on our documentation pages, but it is
> possible to start a node right from the snapshot directory since the
> directory structure is preserved (check
> `testConsistentClusterSnapshotUnderLoad` in the PR). We also have some
> options here for what the restore process should look like:
> - fully manual snapshot restore (will be documented)
> - ansible or shell scripts for restore
> - Java API for restore (I doubt we should go this way).
> 
> 2. The snapshot `create` procedure creates a snapshot of all
> persistent caches available on the cluster (see limitations [1]).
> 
> 3. The snapshot `create` procedure is available through the Java API and
> JMX (control.sh support may be implemented later).
> 
> Java API:
> IgniteFuture<Void> fut = ignite.snapshot()
>     .createSnapshot(name);
> 
> JMX:
> SnapshotMXBean mxBean = getMBean(ignite.name());
> mxBean.createSnapshot(name);
> 
> 4. The Distributed Process [3] is used to perform a cluster-wide
> snapshot procedure, so we've avoided a lot of boilerplate code here.
> 
> 5. The design document [1] also contains an internal API for creating
> a consistent local snapshot of requested cache groups and transferring it
> to another node using the FileTransmission protocol [6]. This is one
> of the parts of IEP-28 [5] for cluster rebalancing via partition files
> and an important part for understanding the whole design.
> 
> Java API:
> public IgniteInternalFuture<Void> createRemoteSnapshot(
>    UUID rmtNodeId,
>    Map<Integer, Set<Integer>> parts,
>    BiConsumer<File, GroupPartitionId> partConsumer);
> 
> 
> Please share your thoughts and take a look at my changes [4].
> 
> 
> [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-43%3A+Cluster+snapshots
> [2] https://issues.apache.org/jira/browse/IGNITE-11073
> [3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/util/distributed/DistributedProcess.java#L49
> [4] https://github.com/apache/ignite/pull/7607
> [5] https://cwiki.apache.org/confluence/display/IGNITE/IEP-28%3A+Cluster+peer-2-peer+balancing#IEP-28:Clusterpeer-2-peerbalancing-Filetransferbetweennodes
> [6] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/communication/TransmissionHandler.java#L42
> 