Posted to user@accumulo.apache.org by mi...@zapatatechnology.com on 2017/10/03 20:10:00 UTC

Backup and Recovery

Please forgive the newbie question. What options are there for backup and
recovery of Accumulo data?

 

Ideally I would like something that would replicate to S3 in real time.



RE: Backup and Recovery

Posted by mi...@zapatatechnology.com.
What’s the name of the utility?

 

From: Christopher [mailto:ctubbsii@apache.org] 
Sent: Tuesday, October 3, 2017 2:01 PM
To: user@accumulo.apache.org
Subject: Re: Backup and Recovery

 

Oh, sorry, no. That's not the case. I did not mean to mislead. You also need to back up the metadata from ZooKeeper for a complete backup. We have a utility for that, which I believe is mentioned in the documentation. If not, that's a documentation bug and we should add it. (Sorry, unable to check at the moment, but please file a bug if you can't find it.)

On Tue, Oct 3, 2017 at 4:47 PM <mikewestman@zapatatechnology.com> wrote:

So if I backup the HDFS I have a backup of accumulo? There isn’t any other data that I’d need to grab?

 

From: Christopher [mailto:ctubbsii@apache.org] 
Sent: Tuesday, October 3, 2017 1:41 PM
To: user@accumulo.apache.org
Subject: Re: Backup and Recovery

 

Hi Mike. This is a great question. Accumulo has several options for backup.

Accumulo is backed by HDFS for persisting its data on disk. It may be possible to use S3 directly at this layer. I'm not sure what the current state is for doing something like this, but a brief Googling for "HDFS on S3" shows a few historical projects which may still be active and mature.

Accumulo also has a replication feature to automatically mirror live ingest to a pluggable external receiver, which could be a backup service you've written to store data in S3. Recovery would depend on how you store the data in S3. You could also implement an ingest system which stores data to a backup as well as to Accumulo, to handle both live and bulk ingest.

Accumulo also has an "exporttable" feature, which exports the metadata for a table, along with a list of files in HDFS for you to back up to S3 (or another file system). Recovery involves using the "importtable" feature which recreates the metadata, and bulk importing the files after you've moved them from your backup location back onto HDFS.

This is just a rough outline of 3 possible solutions. I don't know which (if any) would match your requirements best. There may be many other solutions as well.

On Tue, Oct 3, 2017 at 4:10 PM <mikewestman@zapatatechnology.com> wrote:

Please forgive the newbie question. What options are there for backup and recovery of accumulo data?

 

Ideally I would like something that would replicate to S3 in realtime.

 



Re: Backup and Recovery

Posted by Christopher <ct...@apache.org>.
Oh, sorry, no. That's not the case. I did not mean to mislead. You also
need to back up the metadata from ZooKeeper for a complete backup. We have
a utility for that, which I believe is mentioned in the documentation. If
not, that's a documentation bug and we should add it. (Sorry, unable to
check at the moment, but please file a bug if you can't find it.)
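
Purely as an illustration of what that ZooKeeper metadata covers (this is not the Accumulo-provided utility; the connect string and root path below are assumptions), a minimal recursive dump of everything under /accumulo with the plain ZooKeeper Java client might look like:

    import java.util.List;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    // Illustrative only: recursively print every znode under /accumulo so the
    // output can be captured to a file. Not the utility mentioned above.
    public class ZkMetadataDump {
      public static void main(String[] args) throws Exception {
        // connect string and root path are assumptions; adjust for your cluster
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> {});
        try {
          dump(zk, "/accumulo");
        } finally {
          zk.close();
        }
      }

      private static void dump(ZooKeeper zk, String path)
          throws KeeperException, InterruptedException {
        byte[] data = zk.getData(path, false, null);
        System.out.println(path + " = " + (data == null ? "" : new String(data)));
        for (String child : zk.getChildren(path, false)) {
          dump(zk, path + "/" + child);
        }
      }
    }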

On Tue, Oct 3, 2017 at 4:47 PM <mi...@zapatatechnology.com> wrote:

> So if I backup the HDFS I have a backup of accumulo? There isn’t any other
> data that I’d need to grab?
>
>
>
> *From:* Christopher [mailto:ctubbsii@apache.org]
> *Sent:* Tuesday, October 3, 2017 1:41 PM
> *To:* user@accumulo.apache.org
> *Subject:* Re: Backup and Recovery
>
>
>
> Hi Mike. This is a great question. Accumulo has several options for backup.
>
> Accumulo is backed by HDFS for persisting its data on disk. It may be
> possible to use S3 directly at this layer. I'm not sure what the current
> state is for doing something like this, but a brief Googling for "HDFS on
> S3" shows a few historical projects which may still be active and mature.
>
> Accumulo also has a replication feature to automatically mirror live
> ingest to a pluggable external receiver, which could be a backup service
> you've written to store data in S3. Recovery would depend on how you store
> the data in S3. You could also implement an ingest system which stores data
> to a backup as well as to Accumulo, to handle both live and bulk ingest.
>
> Accumulo also has an "exporttable" feature, which exports the metadata for
> a table, along with a list of files in HDFS for you to back up to S3 (or
> another file system). Recovery involves using the "importtable" feature
> which recreates the metadata, and bulk importing the files after you've
> moved them from your backup location back onto HDFS.
>
> This is just a rough outline of 3 possible solutions. I don't know which
> (if any) would match your requirements best. There may be many other
> solutions as well.
>
> On Tue, Oct 3, 2017 at 4:10 PM <mi...@zapatatechnology.com> wrote:
>
> Please forgive the newbie question. What options are there for backup and
> recovery of accumulo data?
>
>
>
> Ideally I would like something that would replicate to S3 in realtime.
>
>
>
>

RE: Backup and Recovery

Posted by mi...@zapatatechnology.com.
So if I back up HDFS, I have a backup of Accumulo? There isn’t any other data that I’d need to grab?

 

From: Christopher [mailto:ctubbsii@apache.org] 
Sent: Tuesday, October 3, 2017 1:41 PM
To: user@accumulo.apache.org
Subject: Re: Backup and Recovery

 

Hi Mike. This is a great question. Accumulo has several options for backup.

Accumulo is backed by HDFS for persisting its data on disk. It may be possible to use S3 directly at this layer. I'm not sure what the current state is for doing something like this, but a brief Googling for "HDFS on S3" shows a few historical projects which may still be active and mature.

Accumulo also has a replication feature to automatically mirror live ingest to a pluggable external receiver, which could be a backup service you've written to store data in S3. Recovery would depend on how you store the data in S3. You could also implement an ingest system which stores data to a backup as well as to Accumulo, to handle both live and bulk ingest.

Accumulo also has an "exporttable" feature, which exports the metadata for a table, along with a list of files in HDFS for you to back up to S3 (or another file system). Recovery involves using the "importtable" feature which recreates the metadata, and bulk importing the files after you've moved them from your backup location back onto HDFS.

This is just a rough outline of 3 possible solutions. I don't know which (if any) would match your requirements best. There may be many other solutions as well.

On Tue, Oct 3, 2017 at 4:10 PM <mikewestman@zapatatechnology.com> wrote:

Please forgive the newbie question. What options are there for backup and recovery of accumulo data?

 

Ideally I would like something that would replicate to S3 in realtime.

 



Re: Backup and Recovery

Posted by Josh Elser <el...@apache.org>.
The s3a Hadoop FileSystem isn't robust enough to meet the requirements 
Accumulo has for guaranteeing no data loss around its write-ahead logs.

You can use Accumulo's table export feature ("exporttable") to get an 
immutable "picture" of a table. The expectation is that you would use 
DistCp to copy the files referenced by this snapshot to some other "cold" storage.

The downside of this approach is that each snapshot is a full copy. 
There is no such thing as an incremental snapshot.

Hypothetically, you could build some additional logic which would 
prevent re-copying a file to your cold-storage (all Accumulo files are 
immutable, thus if Snapshot1 already referenced fileA, you wouldn't need 
to re-copy fileA if Snapshot2 also references it). This is left as an 
exercise to the user :)
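
A rough sketch of that exercise, assuming you track already-copied file names in a simple set (the class, method, and path handling are made up for illustration; only the Hadoop FileSystem/FileUtil calls are real API):

    import java.util.List;
    import java.util.Set;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    // Illustrative only: copy to cold storage just the files a new export
    // references that a previous export did not. Accumulo's files are
    // immutable, so a name already copied never needs to be copied again.
    public class IncrementalCopy {
      public static void copyNewFiles(List<Path> referencedFiles, Set<String> alreadyCopied,
          Path backupDir, Configuration conf) throws Exception {
        FileSystem srcFs = FileSystem.get(conf);
        FileSystem dstFs = backupDir.getFileSystem(conf);
        for (Path file : referencedFiles) {
          if (alreadyCopied.contains(file.getName())) {
            continue; // already in cold storage; skip the re-copy
          }
          FileUtil.copy(srcFs, file, dstFs, new Path(backupDir, file.getName()), false, conf);
          alreadyCopied.add(file.getName());
        }
      }
    }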

On 10/3/17 4:40 PM, Christopher wrote:
> Hi Mike. This is a great question. Accumulo has several options for backup.
> 
> Accumulo is backed by HDFS for persisting its data on disk. It may be 
> possible to use S3 directly at this layer. I'm not sure what the current 
> state is for doing something like this, but a brief Googling for "HDFS 
> on S3" shows a few historical projects which may still be active and mature.
> 
> Accumulo also has a replication feature to automatically mirror live 
> ingest to a pluggable external receiver, which could be a backup service 
> you've written to store data in S3. Recovery would depend on how you 
> store the data in S3. You could also implement an ingest system which 
> stores data to a backup as well as to Accumulo, to handle both live and 
> bulk ingest.
> 
> Accumulo also has an "exporttable" feature, which exports the metadata 
> for a table, along with a list of files in HDFS for you to back up to S3 
> (or another file system). Recovery involves using the "importtable" 
> feature which recreates the metadata, and bulk importing the files after 
> you've moved them from your backup location back onto HDFS.
> 
> This is just a rough outline of 3 possible solutions. I don't know which 
> (if any) would match your requirements best. There may be many other 
> solutions as well.
> 
> On Tue, Oct 3, 2017 at 4:10 PM <mikewestman@zapatatechnology.com> wrote:
> 
>     Please forgive the newbie question. What options are there for
>     backup and recovery of accumulo data?
> 
>     Ideally I would like something that would replicate to S3 in
>     realtime.
> 
> 

Re: Backup and Recovery

Posted by Christopher <ct...@apache.org>.
Hi Mike. This is a great question. Accumulo has several options for backup.

Accumulo is backed by HDFS for persisting its data on disk. It may be
possible to use S3 directly at this layer. I'm not sure what the current
state is for doing something like this, but a brief Googling for "HDFS on
S3" shows a few historical projects which may still be active and mature.

Accumulo also has a replication feature to automatically mirror live ingest
to a pluggable external receiver, which could be a backup service you've
written to store data in S3. Recovery would depend on how you store the
data in S3. You could also implement an ingest system which stores data to
a backup as well as to Accumulo, to handle both live and bulk ingest.
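
As a sketch of that last idea (the BackupWriter interface below is a made-up placeholder for whatever client you'd use to push a copy of the data to S3; only the Accumulo BatchWriter calls are real API):

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;

    // Illustrative only: each record is written to the live Accumulo table
    // and, in the same loop, handed to a backup sink of your own.
    public class DualWriteIngest {
      /** Placeholder for whatever client you use to push data to S3. */
      public interface BackupWriter {
        void write(String row, String family, String qualifier, byte[] value) throws Exception;
      }

      public static void ingest(Connector conn, String table, BackupWriter backup,
          Iterable<String[]> records) throws Exception {
        BatchWriter writer = conn.createBatchWriter(table, new BatchWriterConfig());
        try {
          for (String[] r : records) { // r = {row, family, qualifier, value}
            Mutation m = new Mutation(new Text(r[0]));
            m.put(new Text(r[1]), new Text(r[2]), new Value(r[3].getBytes()));
            writer.addMutation(m);                           // live write to Accumulo
            backup.write(r[0], r[1], r[2], r[3].getBytes()); // mirrored write to the backup
          }
        } finally {
          writer.close();
        }
      }
    }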

Accumulo also has an "exporttable" feature, which exports the metadata for
a table, along with a list of files in HDFS for you to back up to S3 (or
another file system). Recovery involves using the "importtable" feature
which recreates the metadata, and bulk importing the files after you've
moved them from your backup location back onto HDFS.
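
For example, a minimal sketch of that cycle through the Java API (table and directory names are placeholders; moving the files to S3 and back, e.g. with DistCp, happens outside Accumulo and is only noted in comments):

    import org.apache.accumulo.core.client.Connector;

    // Illustrative only: export a table's metadata plus file listing, then
    // recreate it later from a directory containing that metadata and the files.
    public class ExportImportBackup {
      public static void export(Connector conn, String table, String exportDir) throws Exception {
        conn.tableOperations().offline(table, true);       // table must be offline to export
        conn.tableOperations().exportTable(table, exportDir);
        // exportDir now holds the table metadata plus a listing of the RFiles
        // to copy (e.g. via DistCp) to S3 or another file system.
      }

      public static void restore(Connector conn, String table, String importDir) throws Exception {
        // importDir must already contain the exported metadata and the RFiles,
        // copied back from your backup location onto HDFS.
        conn.tableOperations().importTable(table, importDir);
      }
    }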

This is just a rough outline of 3 possible solutions. I don't know which
(if any) would match your requirements best. There may be many other
solutions as well.

On Tue, Oct 3, 2017 at 4:10 PM <mi...@zapatatechnology.com> wrote:

> Please forgive the newbie question. What options are there for backup and
> recovery of accumulo data?
>
>
>
> Ideally I would like something that would replicate to S3 in realtime.
>
>