Posted to oak-dev@jackrabbit.apache.org by Jörg Hoh <jh...@googlemail.com> on 2012/03/10 23:32:55 UTC

Online backup

Hi,

I already brought up this topic with Thomas Müller, but he asked me to
discuss it directly here on oak-dev. So here it goes.

We should have the possibility to create a backup during normal operation of
the repository, without shutting down the repository and without major
impact on read or write performance: a true online backup. A bonus would be
if this backup facility could also produce a diff against the latest
backup (incremental backup).
Besides the scalability topics already discussed, I see this as a major
pain point in projects.

cheers,
Jörg

Re: Online backup

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Mon, Mar 12, 2012 at 12:05 PM, Marcel Reutegger <mr...@adobe.com> wrote:
>> 2) Assuming we get the clustering architecture right (which we
>> should), it should be possible to start a new read-only node to an
>> existing cluster, wait for it to synchronize all existing content from
>> the rest of the cluster, and finally stop this backup node. The result
>> should be a complete, runnable copy of the repository.
>
> this however also assumes that each cluster node has a complete copy
> of the repository. I'd rather be in favor of a cluster solution that
> distributes the repository across multiple machines for improved
> scalability.

Right, sharding is an added complexity.

Once a repository reaches the size where sharding is required, the
traditional approach of backing up the repository to a single backup
server or tape no longer works. In such cases the backup itself
probably also needs to be sharded, which is likely best achieved by
starting a full set of shards instead of just a single node for the
backup.
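
To make the sharded-backup idea a bit more concrete, here is a minimal sketch. It is not Oak code: the class ShardedBackup, its methods, and the choice of partitioning by a hash of the path are all assumptions made for illustration, since how Oak would shard content is still open. It only shows that a snapshot can be split into disjoint partial backups (one per backup shard) and merged back without loss.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: splits a snapshot (path -> value) across backup shards. */
final class ShardedBackup {

    /** Picks the backup shard for a path; floorMod keeps the index non-negative. */
    static int shardFor(String path, int shardCount) {
        return Math.floorMod(path.hashCode(), shardCount);
    }

    /** Splits one snapshot into shardCount disjoint partial backups. */
    static List<Map<String, String>> split(Map<String, String> snapshot, int shardCount) {
        List<Map<String, String>> shards = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            shards.add(new TreeMap<String, String>());
        }
        for (Map.Entry<String, String> e : snapshot.entrySet()) {
            shards.get(shardFor(e.getKey(), shardCount)).put(e.getKey(), e.getValue());
        }
        return shards;
    }

    /** Restores the full content by merging all partial backups. */
    static Map<String, String> merge(List<Map<String, String>> shards) {
        Map<String, String> all = new TreeMap<>();
        for (Map<String, String> shard : shards) {
            all.putAll(shard);
        }
        return all;
    }
}
```

Each partial backup could then be written by a separate backup node, so no single server or tape has to hold the whole repository.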

BR,

Jukka Zitting

Re: Online backup

Posted by Felix Meschberger <fm...@adobe.com>.

Hi,

On 12.03.2012 at 14:01, Jukka Zitting wrote:

> Hi,
> 
> On Mon, Mar 12, 2012 at 1:46 PM, Felix Meschberger <fm...@adobe.com> wrote:
>> But however the repository implements sharding etc., the user of the
>> JCR API does not have to care about this implementation detail...
> 
> Right, but a backup tool is not a normal JCR client.

Ok, that makes a difference of course, thanks.

Regards
Felix

> 
>> As such, a backup solution should well be possible on top of the JCR API
>> regardless of how the internal implementation distributes data etc., right?
> 
> Why? The JCR API does not cover all the functionality needed to fully
> recover a repository from a backup (and offers no bulk-read or
> incremental update feature needed for fast backups), so a backup
> client in any case needs to use a lower level API.
> 
> Whether and how such an API exposes features like sharding is an issue
> we still need to sort out. It's good to remember that use cases like
> backup can be significantly affected by the design we select.
> 
> BR,
> 
> Jukka Zitting

Re: Online backup

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Mon, Mar 12, 2012 at 1:46 PM, Felix Meschberger <fm...@adobe.com> wrote:
> But however the repository implements sharding etc., the user of the
> JCR API does not have to care about this implementation detail...

Right, but a backup tool is not a normal JCR client.

> As such, a backup solution should well be possible on top of the JCR API
> regardless of how the internal implementation distributes data etc., right?

Why? The JCR API does not cover all the functionality needed to fully
recover a repository from a backup (and offers no bulk-read or
incremental update feature needed for fast backups), so a backup
client in any case needs to use a lower level API.

Whether and how such an API exposes features like sharding is an issue
we still need to sort out. It's good to remember that use cases like
backup can be significantly affected by the design we select.

BR,

Jukka Zitting

Re: Online backup

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>As such, a backup solution should well be possible on top of the JCR API
>regardless of how the internal implementation distributes data etc.,
>right?

It depends on what features are supported by the storage layer.

Regards
Thomas

Re: Online backup

Posted by Felix Meschberger <fm...@adobe.com>.

Hi,

On 12.03.2012 at 12:05, Marcel Reutegger wrote:

> Hi,
> 
>> 2) Assuming we get the clustering architecture right (which we
>> should), it should be possible to start a new read-only node to an
>> existing cluster, wait for it to synchronize all existing content from
>> the rest of the cluster, and finally stop this backup node. The result
>> should be a complete, runnable copy of the repository.
> 
> this however also assumes that each cluster node has a complete copy
> of the repository. I'd rather be in favor of a cluster solution that
> distributes the repository across multiple machines for improved 
> scalability.

But however the repository implements sharding etc., the user of the JCR API does not have to care about this implementation detail...

As such, a backup solution should well be possible on top of the JCR API regardless of how the internal implementation distributes data etc., right?

Regards
Felix

RE: Online backup

Posted by Marcel Reutegger <mr...@adobe.com>.

Hi,

> 2) Assuming we get the clustering architecture right (which we
> should), it should be possible to start a new read-only node to an
> existing cluster, wait for it to synchronize all existing content from
> the rest of the cluster, and finally stop this backup node. The result
> should be a complete, runnable copy of the repository.

this however also assumes that each cluster node has a complete copy
of the repository. I'd rather be in favor of a cluster solution that
distributes the repository across multiple machines for improved 
scalability.

regards
 marcel

Re: Online backup

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Sat, Mar 10, 2012 at 11:32 PM, Jörg Hoh <jh...@googlemail.com> wrote:
> We should have the possibility to create a backup during normal operation of
> the repository, without shutting down the repository and without major
> impact on read or write performance: a true online backup.

With the Oak architecture as currently envisioned, there are at least
three alternative ways to achieve this:

1) The MVCC model gives us a stable snapshot of the repository state
at any given time, so a backup client should be able to export a
snapshot of the entire repository without interfering (except for the
extra IO overhead and potential cache impact) with normal repository
use.

2) Assuming we get the clustering architecture right (which we
should), it should be possible to start a new read-only node to an
existing cluster, wait for it to synchronize all existing content from
the rest of the cluster, and finally stop this backup node. The result
should be a complete, runnable copy of the repository.

3) Since the Oak architecture builds on immutable data, most
persistence models will likely employ an append-only approach with
garbage-collection to clean up unused space. With a little coordination
from the garbage collector, it should be possible to also get a stable
snapshot of the entire repository with native backup tools of the
underlying persistence mechanism.
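
As a rough illustration of approach 1, here is a toy sketch of an MVCC-style store. It is not Oak code: RevisionStore, SnapshotBackup, and all method names are hypothetical, and copying the full map on every commit is a simplification (a real implementation would presumably share unchanged structure between revisions). The point is only that a backup client which pins one revision keeps seeing that state while commits continue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical toy store: every commit produces a new immutable revision. */
final class RevisionStore {
    // The list index is the revision number; published entries are never modified.
    private final List<Map<String, String>> revisions = new ArrayList<>();

    RevisionStore() {
        // Revision 0 is the empty repository.
        revisions.add(Collections.unmodifiableMap(new TreeMap<String, String>()));
    }

    /** Returns the number of the newest revision. */
    synchronized int head() {
        return revisions.size() - 1;
    }

    /** Copies the head, applies one change (null value = delete), publishes a new revision. */
    synchronized int commit(String path, String value) {
        Map<String, String> next = new TreeMap<>(revisions.get(head()));
        if (value == null) {
            next.remove(path);
        } else {
            next.put(path, value);
        }
        revisions.add(Collections.unmodifiableMap(next));
        return head();
    }

    /** Returns the immutable content of the given revision. */
    synchronized Map<String, String> read(int revision) {
        return revisions.get(revision);
    }
}

final class SnapshotBackup {
    /** Exports the content of one pinned revision; later commits are not visible in it. */
    static Map<String, String> export(RevisionStore store, int revision) {
        return new TreeMap<>(store.read(revision));
    }
}
```

A backup client would call head() once, remember the returned revision, and export from that revision only; writers are never blocked because they only ever add new revisions.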

> A bonus would be if this backup facility could also produce
> a diff against the latest backup (incremental backup).

I believe this should be doable with all the above approaches.
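
For illustration, an incremental backup can be thought of as the difference between two stable snapshots. The sketch below is an assumption-laden toy, not an Oak API: snapshots are modelled as plain path-to-value maps with non-null values, and SnapshotDelta, IncrementalBackup, diff, and restore are names invented here.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical record of what changed between two snapshots. */
final class SnapshotDelta {
    final Map<String, String> upserts = new TreeMap<>(); // added or changed paths
    final Set<String> removals = new TreeSet<>();        // paths deleted since the last backup
}

final class IncrementalBackup {

    /** Computes the changes that turn the previous snapshot into the current one. */
    static SnapshotDelta diff(Map<String, String> previous, Map<String, String> current) {
        SnapshotDelta delta = new SnapshotDelta();
        for (Map.Entry<String, String> e : current.entrySet()) {
            // Values are assumed non-null; a missing key in previous yields null here.
            if (!e.getValue().equals(previous.get(e.getKey()))) {
                delta.upserts.put(e.getKey(), e.getValue());
            }
        }
        for (String path : previous.keySet()) {
            if (!current.containsKey(path)) {
                delta.removals.add(path);
            }
        }
        return delta;
    }

    /** Rebuilds the current snapshot from the previous full backup plus the delta. */
    static Map<String, String> restore(Map<String, String> previous, SnapshotDelta delta) {
        Map<String, String> result = new TreeMap<>(previous);
        result.keySet().removeAll(delta.removals);
        result.putAll(delta.upserts);
        return result;
    }
}
```

Only the delta needs to be stored for each incremental run; restoring means applying the deltas in order on top of the last full backup.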

BR,

Jukka Zitting