
Posted to dev@ignite.apache.org by Raul Kripalani <ra...@apache.org> on 2015/10/21 11:06:25 UTC

Data Snapshots in Ignite

Hey guys,

LevelDB has a feature called Snapshots, which provides a consistent
read-only view of the DB at a given point in time, against which queries
can be executed.

To my knowledge, this functionality doesn't exist in the world of open
source In-Memory Computing. Ignite could be an innovator here.

Ignite Snapshots would allow queries, distributed closures, map-reduce
jobs, etc. It could be useful for Spark RDDs to avoid data shift while the
computation is taking place (not sure if there's already some form of
snapshotting, though). Same for IGFS.

Example usage:

    IgniteCacheSnapshot snapshot = ignite.cache("mycache").snapshots().create();

    // all three queries are executed against a view of the cache at the
    // point in time where it was snapshotted
    snapshot.query("select ...");
    snapshot.query("select ...");
    snapshot.query("select ...");

In fact, it would be awesome to be able to logically save this snapshot
with a name so that later jobs, queries, etc. can run on top of it, e.g.:

    IgniteCacheSnapshot snapshot = ignite.cache("mycache").snapshots().create("abc");

    // ...
    // in another module of a distributed system, or in another thread in
    // parallel, use the saved snapshot
    IgniteCacheSnapshot snapshot = ignite.cache("mycache").snapshots().get("abc");
    ....

Named snapshotting can be dangerous due to data retention, e.g. imagine
keeping a snapshot for 2 weeks! So we should force the user to specify a
TTL:

    IgniteCacheSnapshot snapshot = ignite.cache("mycache").snapshots().create("abc", 2, TimeUnit.HOURS);

Such functionality would allow for "reporting checkpoints" and "time
travel", for example, where you want users to be able to query the data as
it stood 1 hour ago, 2 hours ago, etc.
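
To make the named-snapshot-with-TTL idea a bit more concrete, here is a
minimal, self-contained sketch in plain Java. Everything in it is
hypothetical: CacheSnapshot and SnapshotRegistry are illustration-only names
(nothing like them exists in Ignite today), a plain Map stands in for the
cache, and the snapshot is a naive full copy rather than anything efficient.
It only shows the intended semantics: a snapshot does not see later writes,
and a named snapshot disappears once its mandatory TTL has elapsed.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the proposed IgniteCacheSnapshot: a frozen copy of the data. */
final class CacheSnapshot<K, V> {
    private final Map<K, V> data;
    private final long expiresAtNanos;

    CacheSnapshot(Map<K, V> source, long expiresAtNanos) {
        // Naive full copy; it illustrates the semantics, not an efficient design.
        this.data = Collections.unmodifiableMap(new HashMap<>(source));
        this.expiresAtNanos = expiresAtNanos;
    }

    V get(K key) { return data.get(key); }

    boolean isExpired(long nowNanos) { return nowNanos - expiresAtNanos >= 0; }
}

/** Hypothetical registry of named snapshots; a TTL is mandatory on creation. */
final class SnapshotRegistry<K, V> {
    private final Map<String, CacheSnapshot<K, V>> byName = new ConcurrentHashMap<>();
    private final LongSupplier clock; // nanosecond clock, injectable so the demo is deterministic

    SnapshotRegistry(LongSupplier clock) { this.clock = clock; }

    CacheSnapshot<K, V> create(String name, Map<K, V> cache, long ttl, TimeUnit unit) {
        if (ttl <= 0) throw new IllegalArgumentException("TTL must be positive");
        CacheSnapshot<K, V> s = new CacheSnapshot<>(cache, clock.getAsLong() + unit.toNanos(ttl));
        byName.put(name, s);
        return s;
    }

    /** Returns the named snapshot, or empty if it never existed or its TTL has elapsed. */
    Optional<CacheSnapshot<K, V>> get(String name) {
        CacheSnapshot<K, V> s = byName.get(name);
        if (s == null) return Optional.empty();
        if (s.isExpired(clock.getAsLong())) {
            byName.remove(name, s); // lazily drop the expired snapshot
            return Optional.empty();
        }
        return Optional.of(s);
    }
}

final class SnapshotTtlDemo {
    public static void main(String[] args) {
        long[] now = {0L}; // fake clock
        SnapshotRegistry<String, Integer> registry = new SnapshotRegistry<>(() -> now[0]);
        Map<String, Integer> cache = new HashMap<>();
        cache.put("a", 1);

        registry.create("abc", cache, 2, TimeUnit.HOURS);
        cache.put("a", 2); // later update, invisible to the snapshot

        System.out.println(registry.get("abc").get().get("a")); // prints 1
        now[0] = TimeUnit.HOURS.toNanos(3);                     // three hours later
        System.out.println(registry.get("abc").isPresent());    // prints false
    }
}
```

Expiry is checked lazily on lookup here; a real implementation would
presumably also need a background reaper so that expired snapshots release
their memory even if nobody asks for them again.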

What do you think?

P.S.: We do have some form of snapshotting in the Compute checkpointing
functionality – but my proposal is to generalise the notion.

Regards,

*Raúl Kripalani*
PMC & Committer @ Apache Ignite, Apache Camel | Integration, Big Data and
Messaging Engineer
http://about.me/raulkripalani | http://www.linkedin.com/in/raulkripalani
http://blog.raulkr.net | twitter: @raulvk

Re: Data Snapshots in Ignite

Posted by Dmitriy Setrakyan <ds...@apache.org>.

On Mon, Oct 26, 2015 at 5:31 AM, Raul Kripalani <ra...@apache.org> wrote:

> Hi,
>
> Thanks all for chiming in. It seems like this feature could be of interest
> to the user community, so I've opened a ticket to continue maturing the
> idea there:
>
> https://issues.apache.org/jira/browse/IGNITE-1789
>
> We may need to create a Wiki page later to collaborate around specifics and
> design.
>

Thanks Raul. I agree that a Wiki page may be in order. I have responded in
the ticket. Take a look and see if you agree with my thinking.


> Regards,
>
> *Raúl Kripalani*
> PMC & Committer @ Apache Ignite, Apache Camel | Integration, Big Data and
> Messaging Engineer
> http://about.me/raulkripalani | http://www.linkedin.com/in/raulkripalani
> http://blog.raulkr.net | twitter: @raulvk
>

Re: Data Snapshots in Ignite

Posted by Raul Kripalani <ra...@apache.org>.

Hi,

Thanks all for chiming in. It seems like this feature could be of interest
to the user community, so I've opened a ticket to continue maturing the
idea there:

https://issues.apache.org/jira/browse/IGNITE-1789

We may need to create a Wiki page later to collaborate around specifics and
design.

Regards,

*Raúl Kripalani*
PMC & Committer @ Apache Ignite, Apache Camel | Integration, Big Data and
Messaging Engineer
http://about.me/raulkripalani | http://www.linkedin.com/in/raulkripalani
http://blog.raulkr.net | twitter: @raulvk


Re: Data Snapshots in Ignite

Posted by Alexey Kuznetsov <ak...@gridgain.com>.

>> Not sure how CQEngine will help for snapshotting of Ignite. Can you
explain?

There was a discussion noting that this library has various types of indexes
that can only be created on read-only data.

But it seems that my note indeed has little value for this thread.

-- 
Alexey Kuznetsov
GridGain Systems
www.gridgain.com

Re: Data Snapshots in Ignite

Posted by Dmitriy Setrakyan <ds...@apache.org>.

Not sure how CQEngine will help for snapshotting of Ignite. Can you explain?


Re: Data Snapshots in Ignite

Posted by Alexey Kuznetsov <ak...@gridgain.com>.

Some time ago we already discussed the snapshot idea while discussing the
"cqengine" library [1].

Maybe this library could be useful in snapshot mode.

[1] Sources: https://github.com/npgall/cqengine
     Docs: https://code.google.com/p/cqengine/

-- 
Alexey Kuznetsov
GridGain Systems
www.gridgain.com

Re: Data Snapshots in Ignite

Posted by Konstantin Boudnik <co...@apache.org>.

On Wed, Oct 21, 2015 at 05:58PM, Dmitriy Setrakyan wrote:
> On Wed, Oct 21, 2015 at 4:48 PM, Konstantin Boudnik <co...@apache.org> wrote:
> 
> > I like it quite a bit, as well! A ticket would make the most sense as well,
> > so there will be a single place to collect the design docs (if needed), etc.
> >
> > On Wed, Oct 21, 2015 at 04:45PM, Dmitriy Setrakyan wrote:
> > > I also really like the idea. One potential use case is fraud analysis in
> > > financial institutions. It rarely makes sense to perform such analysis
> > > on a live system; rather, a snapshot of some data needs to be taken and
> > > analyzed offline.
> > >
> > > I think snapshots should be saved to disk, so users could load them for
> > > analysis on a totally different cluster.
> >
> > I think disk persistence should be optional, not mandatory.
> >
> 
> I would actually prefer to support disk-only snapshots. I think it will be
> difficult (double-the-work) to support both, in-memory and disk formats.
> Also, storing snapshots in-memory would require extra memory (a lot of it)
> for something that gets saved mainly for historic purposes or offline
> analysis.

Ah, good points! Agree.

Re: Data Snapshots in Ignite

Posted by Raul Kripalani <ra...@apache.org>.

Hey Dmitriy,

Actually, there are so many possibilities around snapshotting that we're
thinking about what I feel are two distinct functionalities ;-)

While persistent snapshotting is indeed useful, what you describe is a
mechanism somewhere in the spectrum between archiving and backups, right? I
think this may be a nice to have, but not a priority. Reason being that
Ignite would typically be part of a Lambda Architecture where
recent/actionable data is in cache + storage, and historical data (entire
dataset) only in storage. So the data ingestion layer (e.g. glued by Kafka)
would take care of feeding the data into both a persistent store (e.g.
Cassandra) indexed by time and into Ignite. I believe most users already
have some degree of persistence backing Ignite, in order to allow them to
recover from an integral Ignite disaster, right?

What I had in mind is a functionality that Ignite currently lacks (unless
I'm mistaken): the possibility of executing multiple read-only actions
against a consistent view of (paused) cache data. If I understand
correctly, there's currently no way to tell Ignite: "hey! I want to launch
3 compute jobs, one after another, each taking 5 minutes, against an
*identical* set of data, i.e. against a snapshot of data; I don't want
these jobs to see any data changes even if they occur in the underlying
cache during this time".
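
A toy illustration of the semantics I mean, in plain Java. FrozenView is a
made-up name (not an Ignite API), a HashMap stands in for the cache, and the
"snapshot" is simply a full copy, which is exactly what a real design would
want to avoid. It borrows the try-with-resources shape Sergi suggested in
this thread, and shows three jobs that run one after another and all see
identical data, even though the underlying cache changes in between:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical read-only view of a cache, frozen at creation time. Not an Ignite API. */
final class FrozenView<K, V> implements AutoCloseable {
    private Map<K, V> data;

    FrozenView(Map<K, V> live) {
        // Naive full copy; it only illustrates the semantics, not an efficient design.
        this.data = Collections.unmodifiableMap(new HashMap<>(live));
    }

    /** Runs a read-only job against the frozen data and returns its result. */
    <R> R run(Function<Map<K, V>, R> job) {
        if (data == null) throw new IllegalStateException("snapshot is closed");
        return job.apply(data);
    }

    @Override
    public void close() { data = null; } // release the retained entries
}

final class FrozenViewDemo {
    public static void main(String[] args) {
        Map<String, Integer> cache = new HashMap<>();
        cache.put("x", 10);
        cache.put("y", 20);

        try (FrozenView<String, Integer> snapshot = new FrozenView<>(cache)) {
            Integer first = snapshot.run(FrozenViewDemo::sum);
            cache.put("x", 1000);  // update to the live cache between jobs
            Integer second = snapshot.run(FrozenViewDemo::sum);
            cache.remove("y");     // removal from the live cache between jobs
            Integer third = snapshot.run(FrozenViewDemo::sum);
            System.out.println(first + " " + second + " " + third); // prints 30 30 30
        }
    }

    /** Example job: sums all values visible in the given view. */
    static int sum(Map<String, Integer> m) {
        int total = 0;
        for (int v : m.values()) total += v;
        return total;
    }
}
```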

This type of snapshot would be short-lived, hence persisting the entire
snapshot is questionable. But retaining entries throughout the snapshot's
lifespan can also be dangerous due to memory constraints. So... how would
we solve this dilemma? Ideas:

* Move only evicted / outdated entries that are still active in the scope
of a snapshot to a persistent medium. We would need an indexing mechanism
that addresses the location of the data item (e.g. memory or offset N in
persistent file X). As data changes in the underlying cache, Ignite would
keep filling up a disk file with the previous state of the updated /
evicted items as they stood within a snapshot.

* Keep the snapshot only in memory and allow the user to specify a policy
on how to handle memory exhaustion while snapshots are active:

    * Cancel and discard the snapshot when memory usage reaches a certain
threshold. Interrupt any jobs / queries, etc. that were running and return
an exception.
    * Throttle cache operations while the snapshot is active and memory is
getting full (reliant on a threshold).
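
A rough sketch of the in-memory variant with the "cancel and discard" policy,
again in plain Java with made-up names (CowCache and CowSnapshot are not
Ignite classes). Instead of copying the cache, the snapshot only keeps the
previous value of each entry that gets overwritten while it is active.
As a simplification, the memory threshold is modelled as a cap on the number
of preserved entries, only one snapshot can be active at a time, and removals
are not handled:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cache that keeps pre-update values alive for one active snapshot. */
final class CowCache<K, V> {
    private final Map<K, V> live = new HashMap<>();
    private CowSnapshot<K, V> active; // a single snapshot at a time, for brevity

    synchronized void put(K key, V value) {
        // Before overwriting, hand the previous value (null if absent) to the snapshot.
        if (active != null) active.preserve(key, live.get(key));
        live.put(key, value);
    }

    synchronized V get(K key) { return live.get(key); }

    /** maxRetained stands in for the memory threshold: a cap on preserved entries. */
    synchronized CowSnapshot<K, V> snapshot(int maxRetained) {
        active = new CowSnapshot<>(this, maxRetained);
        return active;
    }

    synchronized void release(CowSnapshot<K, V> s) {
        if (active == s) active = null;
    }
}

final class CowSnapshot<K, V> implements AutoCloseable {
    private final CowCache<K, V> cache;
    private final int maxRetained;
    private final Map<K, V> before = new HashMap<>(); // values as they stood at snapshot time
    private boolean discarded;

    CowSnapshot(CowCache<K, V> cache, int maxRetained) {
        this.cache = cache;
        this.maxRetained = maxRetained;
    }

    /** Called by the cache, under its lock, just before a key is overwritten. */
    void preserve(K key, V oldValue) {
        if (discarded || before.containsKey(key)) return; // only the first change matters
        if (before.size() >= maxRetained) {               // "cancel and discard" policy
            discarded = true;
            before.clear();
            return;
        }
        before.put(key, oldValue);
    }

    V get(K key) {
        synchronized (cache) {
            if (discarded) throw new IllegalStateException("snapshot is no longer available");
            // Changed keys are served from the preserved copy, untouched keys from the live cache.
            return before.containsKey(key) ? before.get(key) : cache.get(key);
        }
    }

    @Override
    public void close() {
        synchronized (cache) {
            cache.release(this);
            before.clear();
            discarded = true;
        }
    }
}

final class CowSnapshotDemo {
    public static void main(String[] args) {
        CowCache<String, Integer> cache = new CowCache<>();
        cache.put("a", 1);
        try (CowSnapshot<String, Integer> s = cache.snapshot(1)) {
            cache.put("a", 2);
            System.out.println(s.get("a") + " vs live " + cache.get("a")); // prints 1 vs live 2
            cache.put("b", 3); // a second preserved entry would exceed the cap of 1
            try {
                s.get("a");
            } catch (IllegalStateException e) {
                System.out.println(e.getMessage()); // prints snapshot is no longer available
            }
        }
    }
}
```

The disk-backed idea would replace the in-memory "before" map with an index
pointing into a file, and the throttling policy would block in put() instead
of discarding the snapshot; neither is shown here.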

To me, the TTL is important if we're retaining entries in memory...

Regards,

*Raúl Kripalani*
PMC & Committer @ Apache Ignite, Apache Camel | Integration, Big Data and
Messaging Engineer
http://about.me/raulkripalani | http://www.linkedin.com/in/raulkripalani
http://blog.raulkr.net | twitter: @raulvk


Re: Data Snapshots in Ignite

Posted by Dmitriy Setrakyan <ds...@apache.org>.

On Wed, Oct 21, 2015 at 4:48 PM, Konstantin Boudnik <co...@apache.org> wrote:

> I like it quite a bit, as well! A ticket would make the most sense as well,
> so there will be a single place to collect the design docs (if needed), etc.
>
> On Wed, Oct 21, 2015 at 04:45PM, Dmitriy Setrakyan wrote:
> > I also really like the idea. One potential use case is fraud analysis in
> > financial institutions. It rarely makes sense to perform such analysis
> > on a live system; rather, a snapshot of some data needs to be taken and
> > analyzed offline.
> >
> > I think snapshots should be saved to disk, so users could load them for
> > analysis on a totally different cluster.
>
> I think disk persistence should be optional, not mandatory.
>

I would actually prefer to support disk-only snapshots. I think it will be
difficult (double-the-work) to support both, in-memory and disk formats.
Also, storing snapshots in-memory would require extra memory (a lot of it)
for something that gets saved mainly for historic purposes or offline
analysis.


Re: Data Snapshots in Ignite

Posted by Konstantin Boudnik <co...@apache.org>.

I like it quite a bit, as well! A ticket would make the most sense as well, so
there will be a single place to collect the design docs (if needed), etc.

On Wed, Oct 21, 2015 at 04:45PM, Dmitriy Setrakyan wrote:
> I also really like the idea. One potential use case is fraud analysis in
> financial institutions. It rarely makes sense to perform such analysis on a
> live system; rather, a snapshot of some data needs to be taken and
> analyzed offline.
> 
> I think snapshots should be saved to disk, so users could load them for
> analysis on a totally different cluster.

I think disk persistence should be optional, not mandatory.

Cos

> Raul, if you don’t mind, can you file a ticket and see if anyone in the
> community wants to pick it up?
> 
> D.
> 
> On Wed, Oct 21, 2015 at 5:51 AM, Sergi Vladykin <se...@gmail.com>
> wrote:
> 
> > Raul,
> >
> > Actually SQL indexes are already snapshotable. I'm not sure if it makes
> > sense to make the whole cache (with full cache API support) snapshotable,
> > but I like your idea about running multiple SQL statements against the
> > same snapshot.
> >
> > Also I don't think that it is a good idea to keep snapshots for a long
> > time, so I'd prefer to have a typical AutoCloseable API like:
> >
> > try (Snapshot s = ...) {
> >     s.query(...);
> >     s.query(...);
> >     s.query(...);
> > }
> >
> > Though I'm not sure when we will be able to get down to this.
> >
> > Sergi
> >

Re: Data Snapshots in Ignite

Posted by Dmitriy Setrakyan <ds...@apache.org>.

I also really like the idea. One potential use case is fraud analysis in
financial institutions. It rarely makes sense to perform such analysis on a
live system; rather, a snapshot of the data needs to be taken and analyzed
offline.

I think snapshots should be saved to disk, so users could load them for
analysis on a totally different cluster.
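
To make the "save to disk, load elsewhere" idea concrete, here is a minimal
sketch. It assumes a snapshot can be represented as a plain serializable map
and uses standard Java serialization only; SnapshotFileSketch, save, load and
roundTrip are hypothetical names and are not part of any Ignite API.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: persists a point-in-time copy of some data to a file. */
final class SnapshotFileSketch {
    /** Writes a defensive copy of the given entries to the file. */
    static void save(Map<String, String> snapshot, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(new HashMap<>(snapshot));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the entries back, e.g. on a node of a different cluster. */
    @SuppressWarnings("unchecked")
    static Map<String, String> load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (Map<String, String>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Snapshot classes not on the classpath", e);
        }
    }

    /** Saves to a temporary file, loads it back, and cleans up. */
    static Map<String, String> roundTrip(Map<String, String> snapshot) {
        try {
            Path file = Files.createTempFile("snapshot-", ".bin");
            try {
                save(snapshot, file);
                return load(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> data = new HashMap<>();
        data.put("account-1", "flagged");
        System.out.println(roundTrip(data)); // prints {account-1=flagged}
    }
}
```

A real implementation would also have to deal with partitioned data and with
class definitions that change between saving and loading.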

Raul, if you don’t mind, can you file a ticket and see if anyone in the
community wants to pick it up?

D.

Re: Data Snapshots in Ignite

Posted by Raul Kripalani <ra...@apache.org>.

Hey Sergi,

I like your idea of AutoCloseable snapshots!

Regards,

*Raúl Kripalani*
PMC & Committer @ Apache Ignite, Apache Camel | Integration, Big Data and
Messaging Engineer
http://about.me/raulkripalani | http://www.linkedin.com/in/raulkripalani
http://blog.raulkr.net | twitter: @raulvk

Re: Data Snapshots in Ignite

Posted by Sergi Vladykin <se...@gmail.com>.

Raul,

Actually, SQL indexes are already snapshottable. I'm not sure it makes sense
to make the whole cache (with full cache API support) snapshottable, but I
like your idea of running multiple SQL statements against the same snapshot.

Also, I don't think it is a good idea to keep snapshots for a long time, so
I'd prefer a typical AutoCloseable API like:

try (Snapshot s = ...) {
    s.query(...);
    s.query(...);
    s.query(...);
}
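
A slightly fuller sketch of the same try-with-resources pattern, runnable on
its own. It assumes a tiny copy-on-write store standing in for a cache;
SnapshotSketch, Store and Snapshot are hypothetical names, not existing
Ignite types, and get(...) stands in for query(...).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; none of these types exist in Ignite. */
final class SnapshotSketch {
    /** Copy-on-write store: every write publishes a new immutable map. */
    static final class Store<K, V> {
        private volatile Map<K, V> current = Collections.emptyMap();
        private final AtomicInteger openSnapshots = new AtomicInteger();

        synchronized void put(K key, V val) {
            Map<K, V> copy = new HashMap<>(current);
            copy.put(key, val);
            current = Collections.unmodifiableMap(copy);
        }

        V get(K key) {
            return current.get(key);
        }

        /** Pins the map that is current right now; later writes do not affect it. */
        Snapshot<K, V> snapshot() {
            openSnapshots.incrementAndGet();
            return new Snapshot<>(current, openSnapshots);
        }

        int openSnapshots() {
            return openSnapshots.get();
        }
    }

    /** Read-only view that releases its pinned data when closed. */
    static final class Snapshot<K, V> implements AutoCloseable {
        private Map<K, V> view;
        private final AtomicInteger counter;

        Snapshot(Map<K, V> view, AtomicInteger counter) {
            this.view = view;
            this.counter = counter;
        }

        V get(K key) {
            if (view == null)
                throw new IllegalStateException("Snapshot is closed");
            return view.get(key);
        }

        @Override public void close() {
            if (view != null) {
                view = null; // drop the reference so the old data can be collected
                counter.decrementAndGet();
            }
        }
    }

    public static void main(String[] args) {
        Store<String, Integer> store = new Store<>();
        store.put("a", 1);

        try (Snapshot<String, Integer> s = store.snapshot()) {
            store.put("a", 2);                    // concurrent-style update
            System.out.println(s.get("a"));       // prints 1 (snapshot view)
            System.out.println(store.get("a"));   // prints 2 (live view)
        }

        System.out.println(store.openSnapshots()); // prints 0
    }
}
```

The point of the pattern is that the snapshot's lifetime is bounded by the
block, so retained data is released without needing a TTL.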

Though I'm not sure when we will be able to get down to this.

Sergi

Re: Data Snapshots in Ignite

Posted by M G <en...@gmail.com>.

I have a specific use case for snapshots that my current client would want
to make use of, so it may be helpful if I share it with you.

At the start of day we run a batch load from a reference data system and
then run a set of Start-Of-Day (SOD) reports. Those reports must run on a
consistent view of the data: no updates from external sources are
permitted whilst the reports are running. However, once the SOD reports
have run, we want to receive updates from our databases and apply
them to the cache. Here is the use case: we want to keep the original SOD
data available, snapshotted, so that we can re-run reports if they fail, or
compare what was used in the SOD reports with what those reports now
produce.

At present we are going to build an extra layer around my Ignite-based
library that provides this snapshot functionality.
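
A rough sketch of what such an extra layer could look like, assuming the data
fits in a plain map and that nothing is being updated at the moment the copy
is taken (as in the SOD case). SodSnapshotLayer, freeze, frozen and
totalReport are hypothetical names used only for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a layer that keeps named, frozen copies of live data. */
final class SodSnapshotLayer {
    private final Map<String, Double> live = new ConcurrentHashMap<>();
    private final Map<String, Map<String, Double>> named = new ConcurrentHashMap<>();

    /** Applies an update to the live data only. */
    void update(String key, double value) {
        live.put(key, value);
    }

    /** Copies the live data under a name; call this while updates are paused. */
    void freeze(String name) {
        named.put(name, Collections.unmodifiableMap(new HashMap<>(live)));
    }

    /** Returns the frozen copy taken under the given name. */
    Map<String, Double> frozen(String name) {
        Map<String, Double> copy = named.get(name);
        if (copy == null)
            throw new IllegalArgumentException("No snapshot named " + name);
        return copy;
    }

    /** Read-only view of the live data. */
    Map<String, Double> live() {
        return Collections.unmodifiableMap(live);
    }

    /** Stand-in for a report: the sum of all values. */
    static double totalReport(Map<String, Double> data) {
        double sum = 0;
        for (double v : data.values())
            sum += v;
        return sum;
    }

    public static void main(String[] args) {
        SodSnapshotLayer layer = new SodSnapshotLayer();
        layer.update("trade-1", 100.0);   // start-of-day batch load
        layer.update("trade-2", 50.0);
        layer.freeze("SOD");

        layer.update("trade-1", 120.0);   // intraday update

        System.out.println(totalReport(layer.frozen("SOD"))); // prints 150.0
        System.out.println(totalReport(layer.live()));        // prints 170.0
    }
}
```

Re-running a failed report against frozen("SOD") gives the same input as the
original run, and comparing it with live() shows what has changed since.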

Re: Data Snapshots in Ignite

Posted by Raul Kripalani <ra...@apache.org>.

Hey Andrey,

I think I answered some of your questions in my response to Dmitriy [1].
Could you please have a look and tell me if it answers your questions?

N.B.: My idea is based around the typical use case for LevelDb Snapshots,
but we might create something entirely different in Ignite if the community
wants to.

[1]
http://apache-ignite-developers.2346864.n4.nabble.com/Data-Snapshots-in-Ignite-tp4183p4220.html

*Raúl Kripalani*
PMC & Committer @ Apache Ignite, Apache Camel | Integration, Big Data and
Messaging Engineer
http://about.me/raulkripalani | http://www.linkedin.com/in/raulkripalani
http://blog.raulkr.net | twitter: @raulvk

RE: Data Snapshots in Ignite

Posted by Andrey Kornev <an...@hotmail.com>.

Hello,

Just a few questions.

1) It's not clear from the proposed API how to capture/retrieve a consistent snapshot of multiple caches. If my query involves a join, I'd like to ensure consistency across all join participants.
2) Implementation-wise, is the snapshot just a physical copy of all cache entries and their indexes, or is some other mechanism being considered?
3) Isolation: is the snapshot isolated with respect to concurrent modifications?
4) Serialization: what are my options to ensure that I can still read the data from old snapshots as my key/value class definitions change over time?

I feel I do not quite understand the specific use case this feature is expected to apply to. Why is keeping a snapshot for 2 weeks unimaginable, but 1 or 2 hours is OK?

Also, I think forcing people to set a TTL on a snapshot is pointless: it will be abused by setting it to an unreasonably large value, just in case.

Thanks
Andrey
