Posted to dev@zookeeper.apache.org by Sergey Maslyakov <ev...@gmail.com> on 2013/07/05 23:26:31 UTC

Efficient backup and a reasonable restore of an ensemble

Hi!

I'm facing a problem that has been raised by multiple people, but none of
the discussion threads seem to provide a good answer. I dug into the
Zookeeper source code trying to come up with some possible approaches, and
I would like to get your input on them.

Initial conditions:

* I have an ensemble of five Zookeeper servers running v3.4.5 code.
* The size of a committed snapshot file is in the vicinity of 1GB.
* There are about 80 clients connected to the ensemble.
* Clients are heavily read-biased, i.e., they mostly read and rarely write.
I would say less than 0.1% of queries modify the data.

Problem statement:

* Under certain conditions, I may need to revert the data stored in the
ensemble to an earlier state. For example, one of the clients may ruin the
application-level data integrity, and I need to perform disaster recovery.

Things look nice and easy if I'm dealing with a single Zookeeper server. A
file-level copy of the data and dataLog directories should allow me to
recover later by stopping Zookeeper, swapping the corrupted data and
dataLog directories with a backup, and firing Zookeeper back up.

Now, the ensemble deployment and the leader election algorithm in the
quorum make things much more difficult. In order to restore from a single
file-level backup, I need to take the whole ensemble down, wipe out data
and dataLog directories on all servers, replace these directories with
backed up content on one of the servers, bring this server up first, and
then bring up the rest of the ensemble. This [somewhat] guarantees that the
populated Zookeeper server becomes a member of a majority and populates the
ensemble. This approach works, but it is very involved and thus prone to
human error.

Based on a study of the Zookeeper source code, I am considering the
following alternatives, and I seek advice from the Zookeeper development
community as to which approach looks more promising, or whether there is a
better way.

Approach #1:

Develop a complementary pair of utilities for export and import of the
data. Both utilities will act as Zookeeper clients and use the existing
API. The "export" utility will recursively retrieve data and store it in a
file; the "import" utility will first purge all data from the ensemble and
then reload it from the file (see the sketch below).

This approach seems to be the simplest and there are similar tools
developed already. For example, the Guano Project:
https://github.com/d2fn/guano

I don't like two things about it:
* Poor performance, even for a backup of a data store of my size.
* Possible data consistency issues due to concurrent access by the export
utility as well as other "normal" clients.
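
For illustration, a minimal sketch of what the export side might look like
with the standard Java client. The class name and the line-oriented
path/base64 file format are made up for the example, and error handling is
omitted; it also inherits the consistency caveat above, since the tree can
change while the walk is in progress.

    import java.io.PrintWriter;
    import javax.xml.bind.DatatypeConverter;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class TreeExporter {
        private final ZooKeeper zk;
        private final PrintWriter out;

        public TreeExporter(ZooKeeper zk, PrintWriter out) {
            this.zk = zk;
            this.out = out;
        }

        // Depth-first walk; each znode is written as "path<TAB>base64(data)".
        public void export(String path) throws Exception {
            Stat stat = new Stat();
            byte[] data = zk.getData(path, false, stat);
            out.println(path + "\t" + DatatypeConverter.printBase64Binary(
                    data == null ? new byte[0] : data));
            for (String child : zk.getChildren(path, false)) {
                export("/".equals(path) ? "/" + child : path + "/" + child);
            }
        }
    }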

Approach #2:

Add another four-letter command that would force rolling the transaction
log and creating a snapshot. The result of this command would be a new
snapshot.XXXX file on disk, and the name of the file could be reported
back to the client as a response to the four-letter command. This way, I
would know which snapshot file to grab for a possible future restore. But
restoring from a snapshot file is almost as involved as the error-prone
sequence described above.
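
Client-side, such a command would be invoked the same way as the existing
four-letter words ("ruok", "stat", etc.): open a socket to the client port,
write the four letters, and read the reply. A hedged sketch, assuming a
hypothetical "snap" command that replies with the snapshot path:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.Socket;

    public class SnapCommand {
        // Sends the (hypothetical) "snap" four-letter word and returns the
        // server's one-line reply, e.g. the path of the new snapshot file.
        public static String send(String host, int port) throws IOException {
            Socket sock = new Socket(host, port);
            try {
                sock.getOutputStream().write("snap".getBytes());
                sock.getOutputStream().flush();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(sock.getInputStream()));
                return in.readLine();
            } finally {
                sock.close();
            }
        }
    }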

Approach #3:

Come up with a way to temporarily add a new Zookeeper server into a live
ensemble, one that would take over (how?) the leader role and push the
snapshot it holds out to all ensemble members upon restore. This approach
could be difficult and error-prone to implement because it would require
hacking the existing election algorithm to designate a leader.

So, which of these approaches do you think works best for an ensemble with
a database size of about 1GB?


Any advice will be highly appreciated!
/Sergey

Re: Efficient backup and a reasonable restore of an ensemble

Posted by jack ma <ja...@gmail.com>.
Does anyone have answers to Sergey's questions?

I want to make sure I fully understand the procedures for zookeeper
backup and disaster recovery:

For the backup procedure on a zookeeper ensemble:
(1) Log in to any host whose state is "Serving"
           Question:
                  Do I have to log in to the leader node, or is any node ok?
(2) Copy the latest snapshot file and transaction log from the version-2
directory (see the sketch below).
           Question:
                  How do we make sure we do not copy corrupt files if the
snapshot/transaction log is in the middle of an update? Do we have to shut
down the node to make the copy?
                  Besides the transaction log and snapshot, do we have to
copy other files, such as the epoch files?
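
For what it's worth, a hedged sketch of step (2): in the version-2
directory, snapshots are named snapshot.<hex-zxid> and transaction logs
log.<hex-zxid> (the zxid the log starts at), so a backup needs the newest
snapshot, the log that starts at or just before that snapshot's zxid, and
every later log. The class below is illustrative only; note that the newest
log may still be in the middle of being written, which is exactly the
corruption concern raised above.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    public class BackupPicker {
        // Parses the hex zxid suffix of "snapshot.<hex>" or "log.<hex>".
        static long zxidOf(File f) {
            String name = f.getName();
            return Long.parseLong(name.substring(name.indexOf('.') + 1), 16);
        }

        public static List<File> filesToCopy(File version2) {
            File snap = null;
            for (File f : version2.listFiles()) {
                if (f.getName().startsWith("snapshot.")
                        && (snap == null || zxidOf(f) > zxidOf(snap))) {
                    snap = f; // newest snapshot wins
                }
            }
            List<File> result = new ArrayList<File>();
            if (snap == null) {
                return result; // nothing to back up yet
            }
            result.add(snap);
            // The log covering the snapshot starts at or just before its zxid.
            long coveringStart = Long.MIN_VALUE;
            for (File f : version2.listFiles()) {
                if (f.getName().startsWith("log.")
                        && zxidOf(f) <= zxidOf(snap)) {
                    coveringStart = Math.max(coveringStart, zxidOf(f));
                }
            }
            for (File f : version2.listFiles()) {
                if (f.getName().startsWith("log.")
                        && zxidOf(f) >= coveringStart) {
                    result.add(f); // covering log plus all later logs
                }
            }
            return result;
        }
    }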

For the disaster recovery procedure on a zookeeper ensemble:
(1) recreate the machines for the zookeeper ensemble
(2) copy the snapshot/transaction log we backed up into the zookeeper
dataDir/version-2 and dataLogDir/version-2.
           Question:
                 Do we have to copy the epoch files?
                 Do we have to copy the backed-up snapshot/transaction log to
all the zookeeper nodes, or just the first node we start?

Appreciate your time and help.
Jack


On Mon, Jul 8, 2013 at 9:25 PM, Sergey Maslyakov <ev...@gmail.com> wrote:

> [snip: the quoted thread is reproduced in full in the messages that
> follow in this archive]

Re: Efficient backup and a reasonable restore of an ensemble

Posted by Sergey Maslyakov <ev...@gmail.com>.
These are interesting points, Thawan. I'd like to make sure that I get them
right.

1. Are you saying that a snapshot file may not be sufficient to restore
Zookeeper to a consistent state? Does it always require a transaction log
file, or is the log only required to get to the most current state? I was
hoping that a snapshot is self-sufficient for a restore to a recent, but
not necessarily the most current, state. Was I wrong?

2. Do you suggest that the same pair of snapshot and transaction log
needs to be copied to all servers before they are brought online? Then what
about the "epoch" files? Do they need to be purged, preserved, or should
the same ones be propagated through the whole ensemble?


On Mon, Jul 8, 2013 at 7:53 PM, Thawan Kooburat <th...@fb.com> wrote:

> Just saw that this is the corresponding use case to the question posted
> in the dev list.
>
> In order to restore the data to a given point in time correctly, you need
> both the snapshot and the txnlog. This is because a zookeeper snapshot is
> fuzzy, and a snapshot alone may not represent a valid state of the server
> if there are in-flight requests.
>
> The 4lw command should cause the server to roll the log and take a
> snapshot, similar to the periodic snapshotting operation. Your backup
> script needs to grab the snapshot and the corresponding txnlog file from
> the data dir.
>
> To restore, just shut down all hosts, clear the data dir, copy over the
> snapshot and txnlog, and restart them.
>
>
> --
> Thawan Kooburat
>
>
>
>
>
> On 7/8/13 3:28 PM, "Sergey Maslyakov" <ev...@gmail.com> wrote:
>
> >Thank you for your response, Flavio. I apologize; I did not provide a
> >clear explanation of the use case.
> >
> >This backup/restore is not intended to be tied to any write event;
> >instead, it is expected to run as a periodic (daily?) cron job on one of
> >the servers, which is not guaranteed to be the leader of the ensemble.
> >There is no expectation that all recent changes are committed and
> >persisted to disk. The system can sustain the loss of several hours'
> >worth of recent changes in the event of a restore.
> >
> >As for finding the leader dynamically and performing the backup on it,
> >this approach could be more difficult, as the leader can change from
> >time to time and I still need to fetch the file to store it in my
> >designated backup location. Taking the backup on one server and picking
> >it up from a local file system looks less error-prone. Even if I went
> >the fancy route and had Zookeeper send me the serialized DataTree in
> >response to the 4lw, this approach would involve a lot of moving parts.
> >
> >I have already made a PoC for a new 4lw that invokes takeSnapshot() and
> >returns an absolute path to the snapshot it drops on disk. I have
> >already protected takeSnapshot() from concurrent invocation, which is
> >likely to corrupt the snapshot file on disk. This approach works, but
> >I'm thinking of taking it one step further by providing the desired path
> >name as an argument to my new 4lw, having the Zookeeper server drop the
> >snapshot into the specified file, and reporting success/failure back.
> >This way I can avoid cluttering the data directory and interfering with
> >what Zookeeper finds when it scans the data directory.
> >
> >The approach of having an additional server take the leadership and
> >populate the ensemble is just a theory. I don't see a clean way of
> >making a quorum member the leader of the quorum. Am I overlooking
> >something simple?
> >
> >In backup and restore of an ensemble, the biggest unknown for me remains
> >populating the ensemble with the desired data. I can think of two ways:
> >
> >1. Clear out all servers by stopping them, purge the version-2
> >directories, restore a snapshot file on one server that will be brought
> >up first, and then bring up the rest of the ensemble. This way I
> >somewhat force the first server to be the leader because it has data and
> >it will be the only member of a quorum with data, due to the way I start
> >the ensemble. This looks like a hack, though.
> >
> >2. Clear out the ensemble and reload it with a dedicated client using
> >the provided Zookeeper API.
> >
> >With the approach of backing up an actual snapshot file, option #1
> >appears to be more practical.
> >
> >I wish I could start the ensemble with a designated leader that would
> >bootstrap the ensemble with data and then the ensemble would go about
> >its normal business...
> >
> >
> >
> >On Mon, Jul 8, 2013 at 4:30 PM, Flavio Junqueira
> ><fp...@yahoo.com>wrote:
> >
> >> One bit that is still confusing to me in your use case is whether you
> >> need to take a snapshot right after some event in your application.
> >> Even if you're able to tell ZooKeeper to take a snapshot, there is no
> >> guarantee that it will happen at the exact point you want if update
> >> operations keep coming.
> >>
> >> If you use your four-letter word approach, then would you search for
> >> the leader, or would you simply take a snapshot at any server? If it
> >> has to go through the leader so that you make sure to have the most
> >> recent committed state, then it might not be a bad idea to have an api
> >> call that tells the leader to take a snapshot at some directory of
> >> your choice. Informing you of the name of the snapshot file so that
> >> you can copy it sounds like an option, but perhaps it is not as
> >> convenient.
> >>
> >> The approach of adding another server is not very clear. How do you
> >> force it to be the leader? Keep in mind that if it crashes, then it
> >> will lose leadership.
> >>
> >> -Flavio
> >>
> >> On Jul 8, 2013, at 8:34 AM, Sergey Maslyakov <ev...@gmail.com> wrote:
> >>
> >> > It looks like the "dev" mailing list is rather inactive. Over the
> >> > past few days I only saw several automated emails from JIRA, and
> >> > this is pretty much it. Contrary to this, the "user" mailing list
> >> > seems to be more alive and more populated.
> >> >
> >> > With this in mind, please allow me to cross-post here the message I
> >> > sent to the "dev" list a few days ago.
> >> >
> >> >
> >> > Regards,
> >> > /Sergey
> >> >
> >> > === forwarded message begins here ===
> >> >
> >> > [snip: the forwarded original post is identical to the message at
> >> > the top of this archive]
> >>
> >>
>
>

Re: Efficient backup and a reasonable restore of an ensemble

Posted by Ted Dunning <te...@gmail.com>.
Sergey

It isn't that bad. The deal is that a snapshot takes time to write to disk. During this time, updates are still allowed to the contents of memory. All such updates are logged, however, so if you have the transaction log from the moment before the snap starts until some moment after the snap completes, you can load the snapshot and then replay the log to get a moment-in-time snapshot as of the final transaction that you have applied.

This works because all of the logged transactions are idempotent. If they are applied to a part of the snapshot that already recorded their effect, there is no problem.

If you want, you can even do the replay in a side process after the snapshot is complete so that you don't have to carry around the transaction log.
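
To make the idempotence point concrete, here is a toy sketch (not
Zookeeper's actual classes) of replaying a log over a fuzzy snapshot: each
transaction sets an absolute value rather than applying a delta, so
re-applying a transaction whose effect the snapshot already captured leaves
the tree unchanged.

    import java.util.HashMap;
    import java.util.Map;

    class FuzzyReplay {
        static class Txn {
            long zxid;
            String path;
            byte[] data;
        }

        // Applying the same Txn twice leaves the tree in the same state,
        // because the txn carries the resulting value, not an increment.
        static void apply(Map<String, byte[]> tree, Txn t) {
            tree.put(t.path, t.data);
        }

        // The snapshot may already reflect some of these transactions; that
        // is harmless, which is what makes a fuzzy snapshot restorable.
        static Map<String, byte[]> restore(Map<String, byte[]> fuzzySnapshot,
                                           Iterable<Txn> logSinceBeforeSnap) {
            Map<String, byte[]> tree =
                    new HashMap<String, byte[]>(fuzzySnapshot);
            for (Txn t : logSinceBeforeSnap) {
                apply(tree, t);
            }
            return tree;
        }
    }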

Sent from my iPhone

On Jul 8, 2013, at 21:42, Sergey Maslyakov <ev...@gmail.com> wrote:

> Kishore,
> 
> This sounds like a very elaborate tool. I was trying to find a simple
> approach, but what Thawan said about "fuzzy snapshots" makes me a little
> afraid that there is no simple solution.
> 
> 
> On Mon, Jul 8, 2013 at 11:05 PM, kishore g <g....@gmail.com> wrote:
> 
>> Agree, we already have such a tool. In fact, we use it to reconstruct
>> the sequence of events that led to a failure, and we actually restore
>> the system to a previous stable point and replay the events.
>> Unfortunately, this is tied closely to Helix, but it should be easy to
>> make it a generic tool.
>> 
>> Sergey, is this something that would be useful in your case?
>> 
>> Thanks,
>> Kishore G
>> 
>> 
>> On Mon, Jul 8, 2013 at 8:09 PM, Thawan Kooburat <th...@fb.com> wrote:
>> 
>>> On the restore part, I think having a separate utility to manipulate
>>> the data/snap dir (by truncating the log/removing snapshots up to a
>>> given zxid) would be easier than modifying the server.
>>> 
>>> 
>>> --
>>> Thawan Kooburat
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 7/8/13 6:34 PM, "kishore g" <g....@gmail.com> wrote:
>>> 
>>>>I think what we are looking at is point-in-time restore functionality.
>>>>How about adding a feature that says go back to a specific
>>>>zxid/timestamp? This way, before making any change to zookeeper, simply
>>>>note down the timestamp/zxid on the leader. If things go wrong after
>>>>making changes, bring down the zookeepers and provide an additional
>>>>zxid/timestamp parameter while restarting. The server can go to that
>>>>exact point and make it current. The followers can be started blank.
>>>> 
>>>> [snip: the rest of the quoted thread duplicates messages that appear
>>>> in full elsewhere in this archive]

RE: Efficient backup and a reasonable restore of an ensemble

Posted by Flavio Junqueira <fp...@yahoo.com>.
Heh, nothing to be sorry about. Thanks for the feedback and for raising
these points, Kishore.

-Flavio

-----Original Message-----
From: kishore g [mailto:g.kishore@gmail.com] 
Sent: 09 July 2013 19:01
To: user@zookeeper.apache.org
Subject: Re: Efficient backup and a reasonable restore of an ensemble

Sorry Flavio, I mixed two things in my previous email. When I said
checkpoint A, I meant just saving the last committed transaction id (no
snapshot would be taken). When we need to restore, we simply run the tool
to bring the data directory back to that particular zxid (we truncate the
txn log after that zxid; see the sketch below). We can then restart the
server and we should get back to that particular point.
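
A hedged sketch of that truncation step, assuming the 3.4.x server-side
persistence classes; the tool name and arguments are made up, it would be
run against each server's directories while the ensemble is down, and
snapshot files taken after the checkpointed zxid would also have to be
removed by hand (not shown):

    import java.io.File;
    import org.apache.zookeeper.server.persistence.FileTxnSnapLog;

    public class RollbackTool {
        public static void main(String[] args) throws Exception {
            File dataLogDir = new File(args[0]); // parent of version-2 log dir
            File snapDir = new File(args[1]);    // parent of version-2 snap dir
            long zxid = Long.decode(args[2]);    // the checkpointed zxid

            FileTxnSnapLog snapLog = new FileTxnSnapLog(dataLogDir, snapDir);
            // Drops every logged transaction newer than the given zxid, so a
            // restarted server replays history only up to the checkpoint.
            if (!snapLog.truncateLog(zxid)) {
                System.err.println("Truncation at zxid " + zxid + " failed");
            }
        }
    }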


As for the second part, about fuzzy snapshots: I was just trying to explain
to Sergey that a snapshot is not really fuzzy if he knows for sure that
there are no updates while it is being taken. This really depends on the
use case; for example, if all writes happen via a manually run tool, then
the snapshot should not be fuzzy.





On Tue, Jul 9, 2013 at 9:02 AM, Sergey Maslyakov <ev...@gmail.com> wrote:

> I think I am having difficulty understanding the "fuzzy" concept.
> Let's say I started to serialize the DataTree into a snapshot file and it
> took 30 seconds. During these 30 seconds, the server saw 5 transactions
> that updated the data. Does this mean that the snapshot that I get on
> disk at the end of the 30-second interval will have some of these 5
> transactions? Or will it have none? Or will it have all of them? Or will
> it be inconsistent and unreadable by Zookeeper?
>
> Please help me better understand the behavior behind the "fuzzy" term.
>
> For my use case, I am perfectly fine if I get a snapshot with none of
> these 5 transactions, considering that I will pick them up next time I
> take a snapshot.
>
>
> /Sergey
>
>
> On Tue, Jul 9, 2013 at 12:08 AM, kishore g <g....@gmail.com> wrote:
>
> > It's not really elaborate; it is very similar to what zookeeper does
> > when it starts up. It first reads the latest snapshot file and then the
> > transaction logs, and applies each and every transaction. What I am
> > suggesting is that instead of applying all transactions, it stops at a
> > transaction I provide.
> >
> > Having this tool will actually simplify your task; you can go back to
> > any point in time. Think of something like this:
> >
> > checkpoint A // this can store the last zxid or timestamp from the leader
> > Make changes to zk
> > // if things fail
> > stop zks
> > rollback A // run this on each zk; brings the cluster back to its
> > previous state
> > start zks // any order should be fine
> >
> >
> > Also keep in mind that a snapshot is fuzzy only if there are writes
> > happening while the snapshot is taken. If you are sure no writes will
> > happen while you are taking the snapshot, then you are good. Experts,
> > please correct me if this is incorrect.
> >
> > thanks,
> > Kishore G
> >
> >
> > On Mon, Jul 8, 2013 at 9:42 PM, Sergey Maslyakov <ev...@gmail.com>
> > wrote:
> >
> > > Kishore,
> > >
> > > This sounds like a very elaborate tool. I was trying to find a
> simplistic
> > > approach but what Thawan said about "fuzzy snapshots" makes me a 
> > > little afraid that there is no simple solution.
> > >
> > >
> > > On Mon, Jul 8, 2013 at 11:05 PM, kishore g <g....@gmail.com>
> wrote:
> > >
> > > > Agree, we already have such a tool. In fact we use it to 
> > > > reconstruct
> > the
> > > > sequence of events that led to a failure and actually restore 
> > > > the
> > system
> > > to
> > > > a previous stable point and replay the events. Unfortunately 
> > > > this is
> > tied
> > > > closely with Helix but it should be easy to make this a generic
tool.
> > > >
> > > > Sergey is this something that will be useful in your case.
> > > >
> > > > Thanks,
> > > > Kishore G
> > > >
> > > >
> > > > On Mon, Jul 8, 2013 at 8:09 PM, Thawan Kooburat <th...@fb.com>
> wrote:
> > > >
> > > > > On restore part, I think having a separate utility to 
> > > > > manipulate
> the
> > > > > data/snap dir (by truncating the log/removing snapshot to a 
> > > > > given
> > zxid)
> > > > > would be easier than modifying the server.
> > > > >
> > > > >
> > > > > --
> > > > > Thawan Kooburat
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 7/8/13 6:34 PM, "kishore g" <g....@gmail.com> wrote:
> > > > >
> > > > > >I think what we are looking at is a  point in time restore
> > > > functionality.
> > > > > >How about adding a feature that says go back to a specific
> > > > zxid/timestamp.
> > > > > >This way before doing any change to zookeeper simply note 
> > > > > >down the timestamp/zxid on leader. If things go wrong after 
> > > > > >making changes,
> > > bring
> > > > > >down zookeepers and provide additional parameter of a
> zxid/timestamp
> > > > while
> > > > > >restarting. The server can go the exact point and make it
current.
> > The
> > > > > >followers can be started blank.
> > > > > >
> > > > > >
> > > > > >
> > > > > >On Mon, Jul 8, 2013 at 5:53 PM, Thawan Kooburat 
> > > > > ><th...@fb.com>
> > > wrote:
> > > > > >
> > > > > >> Just saw that  this is the corresponding use case to the
> question
> > > > posted
> > > > > >> in dev list.
> > > > > >>
> > > > > >> In order to restore the data to a given point in time 
> > > > > >> correctly,
> > you
> > > > > >>need
> > > > > >> both snapshot and txnlog. This is because zookeeper 
> > > > > >>snapshot is
> > > fuzzy
> > > > > >>and
> > > > > >> snapshot alone may not represent a valid state of the 
> > > > > >>server if
> > > there
> > > > > >>are
> > > > > >> in-flight requests.
> > > > > >>
> > > > > >> The 4wl command should cause the server to roll the log and
> take a
> > > > > >> snapshot similar to periodic snapshotting operation. Your 
> > > > > >> backup
> > > > script
> > > > > >> need grap the snapshot and corresponding txnlog file from 
> > > > > >> the
> data
> > > > dir.
> > > > > >>
> > > > > >> To restore, just shutdown all hosts, clear the data dir, 
> > > > > >> copy
> over
> > > the
> > > > > >> snapshot and txnlog, and restart them.
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Thawan Kooburat
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On 7/8/13 3:28 PM, "Sergey Maslyakov" <ev...@gmail.com>
> wrote:
> > > > > >>
> > > > > >> >Thank you for your response, Flavio. I apologize, I did 
> > > > > >> >not
> > > provide a
> > > > > >> >clear
> > > > > >> >explanation of the use case.
> > > > > >> >
> > > > > >> >This backup/restore is not intended to be tied to any 
> > > > > >> >write
> > event,
> > > > > >> >instead,
> > > > > >> >it is expected to run as a periodic (daily?) cron job on 
> > > > > >> >one of
> > the
> > > > > >> >servers, which is not guaranteed to be the leader of the
> > ensemble.
> > > > > >>There
> > > > > >> >is
> > > > > >> >no expectation that all recent changes are committed and
> > persisted
> > > to
> > > > > >> >disk.
> > > > > >> >The system can sustain the loss of several hours worth of
> recent
> > > > > >>changes
> > > > > >> >in
> > > > > >> >the event of restore.
> > > > > >> >
> > > > > >> >As for finding the leader dynamically and performing 
> > > > > >> >backup on
> > it,
> > > > this
> > > > > >> >approach could be more difficult as the leader can change 
> > > > > >> >time
> to
> > > > time
> > > > > >>and
> > > > > >> >I still need to fetch the file to store it in my 
> > > > > >> >designated
> > backup
> > > > > >> >location. Taking backup on one server and picking it up 
> > > > > >> >from a
> > > local
> > > > > >>file
> > > > > >> >system looks less error-prone. Even if I went the fancy 
> > > > > >> >route
> and
> > > had
> > > > > >> >Zookeeper send me the serialized DataTree in response to 
> > > > > >> >the
> 4wl,
> > > > this
> > > > > >> >approach would involve a lot of moving parts.
> > > > > >> >
> > > > > >> >I have already made a PoC for a new 4wl that invokes
> > takeSnapshot()
> > > > and
> > > > > >> >returns an absolute path to the snapshot it drops on disk. 
> > > > > >> >I
> have
> > > > > >>already
> > > > > >> >protected takeSnapshot() from concurrent invocation, which 
> > > > > >> >is
> > > likely
> > > > to
> > > > > >> >corrupt the snapshot file on disk. This approach works but 
> > > > > >> >I'm
> > > > > >>thinking to
> > > > > >> >take it one step further by providing the desired path 
> > > > > >> >name as
> an
> > > > > >>argument
> > > > > >> >to my new 4lw and to have Zookeeper server drop the 
> > > > > >> >snapshot
> into
> > > the
> > > > > >> >specified file and report success/failure back. This way I 
> > > > > >> >can
> > > avoid
> > > > > >> >cluttering the data directory and interfering with what
> Zookeeper
> > > > finds
> > > > > >> >when it scans the data directory.
> > > > > >> >
> > > > > >> >Approach with having an additional server that would take 
> > > > > >> >the
> > > > > >>leadership
> > > > > >> >and populate the ensemble is just a theory. I don't see a 
> > > > > >> >clean
> > way
> > > > of
> > > > > >> >making a quorum member the leader of the quorum. Am I
> overlooking
> > > > > >> >something
> > > > > >> >simple?
> > > > > >> >
> > > > > >> >In backup and restore of an ensemble the biggest unknown 
> > > > > >> >for me
> > > > remains
> > > > > >> >populating the ensemble with desired data. I can think of 
> > > > > >> >two
> > ways:
> > > > > >> >
> > > > > >> >1. Clear out all servers by stopping them, purge version-2
> > > > directories,
> > > > > >> >restore a snapshot file on one server that will be brought
> first,
> > > and
> > > > > >>then
> > > > > >> >bring up the rest of the ensemble. This way I somewhat 
> > > > > >> >force
> the
> > > > first
> > > > > >> >server to be the leader because it has data and it will be 
> > > > > >> >the
> > only
> > > > > >>member
> > > > > >> >of a quorum with data, provided to the way I start the
> ensemble.
> > > This
> > > > > >> >looks
> > > > > >> >like a hack, though.
> > > > > >> >
> > > > > >> >2. Clear out the ensemble and reload it with a dedicated 
> > > > > >> >client
> > > using
> > > > > >>the
> > > > > >> >provided Zookeeper API.
> > > > > >> >
> > > > > >> >With the approach of backing up an actual snapshot file, 
> > > > > >> >option
> > #1
> > > > > >>appears
> > > > > >> >to be more practical.
> > > > > >> >
> > > > > >> >I wish I could start the ensemble with a designate leader 
> > > > > >> >that
> > > would
> > > > > >> >bootstrap the ensemble with data and then the ensemble 
> > > > > >> >would go
> > > into
> > > > > >>its
> > > > > >> >normal business...
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >On Mon, Jul 8, 2013 at 4:30 PM, Flavio Junqueira
> > > > > >> ><fp...@yahoo.com>wrote:
> > > > > >> >
> > > > > >> >> One bit that is still a bit confusing to me in your use 
> > > > > >> >> case
> is
> > > if
> > > > > >>you
> > > > > >> >> need to take a snapshot right after some event in your
> > > application.
> > > > > >> >>Even if
> > > > > >> >> you're able to tell ZooKeeper to take a snapshot, there 
> > > > > >> >>is no
> > > > > >>guarantee
> > > > > >> >> that it will happen at the exact point you want it if 
> > > > > >> >> update
> > > > > >>operations
> > > > > >> >> keep coming.
> > > > > >> >>
> > > > > >> >> If you use your four-letter word approach, then would 
> > > > > >> >> you
> > search
> > > > for
> > > > > >>the
> > > > > >> >> leader or would you simply take a snapshot at any 
> > > > > >> >> server? If
> it
> > > has
> > > > > >>to
> > > > > >> >>go
> > > > > >> >> through the leader so that you make sure to have the 
> > > > > >> >>most
> > recent
> > > > > >> >>committed
> > > > > >> >> state, then it might not be a bad idea to have an api 
> > > > > >> >>call
> that
> > > > tells
> > > > > >> >>the
> > > > > >> >> leader to take a snapshot at some directory of your choice.
> > > > Informing
> > > > > >> >>you
> > > > > >> >> the name of the snapshot file so that you can copy 
> > > > > >> >>sounds
> like
> > an
> > > > > >> >>option,
> > > > > >> >> but perhaps it is not as convenient.
> > > > > >> >>
> > > > > >> >> The approach of adding another server is not very clear. 
> > > > > >> >> How
> do
> > > you
> > > > > >> >>force
> > > > > >> >> it to be the leader? Keep in mind that if it crashes, 
> > > > > >> >>then it
> > > will
> > > > > >>lose
> > > > > >> >> leadership.
> > > > > >> >>
> > > > > >> >> -Flavio


Re: Efficient backup and a reasonable restore of an ensemble

Posted by kishore g <g....@gmail.com>.
Sorry Flavio, I mixed two things in my previous email. When I said
"checkpoint A", I meant just saving the last committed transaction id (no
snapshot would be taken). When we need to restore, we simply run the tool
to bring the data directory back to that particular zxid (truncating the
txn log after that zxid). We can then restart the server and we should get
back to that particular point.
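
For illustration, here is a rough sketch of what such a rollback tool could
look like, built on ZooKeeper 3.4.x's internal persistence classes
(FileSnap, FileTxnLog, DataTree). These are private server APIs, so the
names and signatures below should be treated as approximate, and the sketch
assumes the newest snapshot on disk precedes the target zxid:

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.zookeeper.server.DataTree;
    import org.apache.zookeeper.server.persistence.FileSnap;
    import org.apache.zookeeper.server.persistence.FileTxnLog;
    import org.apache.zookeeper.server.persistence.TxnLog.TxnIterator;
    import org.apache.zookeeper.txn.TxnHeader;

    // Rebuilds the in-memory tree as of targetZxid: load the latest
    // snapshot, then replay the txn log up to (and including) targetZxid.
    public class RollbackTo {
        public static void main(String[] args) throws Exception {
            File snapDir = new File(args[0]);
            File logDir = new File(args[1]);
            long targetZxid = Long.parseLong(args[2]);

            DataTree tree = new DataTree();
            Map<Long, Integer> sessions = new HashMap<Long, Integer>();
            long snapZxid = new FileSnap(snapDir).deserialize(tree, sessions);

            TxnIterator it = new FileTxnLog(logDir).read(snapZxid + 1);
            while (it != null && it.getHeader() != null) {
                TxnHeader hdr = it.getHeader();
                if (hdr.getZxid() > targetZxid) {
                    break; // stop at the checkpointed transaction
                }
                tree.processTxn(hdr, it.getTxn());
                if (!it.next()) {
                    break;
                }
            }
            // A real tool would now write 'tree' back out as a snapshot
            // and/or truncate the log so the server starts at targetZxid.
            System.out.println("Tree restored to zxid " + targetZxid);
        }
    }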


On the second part, about fuzzy snapshots: I was just trying to explain to
Sergey that a snapshot is not really fuzzy if he knows for sure that there
are no updates while it is being taken. This really depends on the use
case; for example, if all writes happen via a manually run tool, the
snapshot should not be fuzzy.






Re: Efficient backup and a reasonable restore of an ensemble

Posted by Ted Dunning <te...@gmail.com>.
The snapshot will include any, all, or none of those 5 updates.  But the
logs from that time *will* include all 5.

Thus, if you apply that part of the log to the snapshot, for each of those
5 log entries you will either overwrite the latest value (with no effect)
or update the previous value to the latest value.  There are 32 (2^5)
possible alternatives here, but they all lead to the same final state, and
that is exactly a moment-in-time snapshot as of the final transaction.

The "fuzzy" term refers to the fact that you can't say exactly what time
the original snapshot corresponds to.  In your example, the data in the
snapshot represents a combination of states from anywhere in the 30-second
window that it took to write the snapshot.
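
To make this concrete, here is a minimal standalone sketch (plain Java, not
ZooKeeper code) that enumerates all 32 possible fuzzy snapshots of those 5
updates and checks that replaying the log drives every one of them to the
same final state:

    import java.util.HashMap;
    import java.util.Map;

    public class FuzzyReplayDemo {
        public static void main(String[] args) {
            // Five logged updates, each setting a different znode to "new".
            // A fuzzy snapshot may have caught any subset: 2^5 = 32 cases.
            for (int mask = 0; mask < 32; mask++) {
                Map<String, String> tree = new HashMap<String, String>();
                for (int i = 0; i < 5; i++) {
                    tree.put("/n" + i, ((mask >> i) & 1) != 0 ? "new" : "old");
                }
                // Replay the full log over the snapshot: each setData either
                // overwrites "old" or rewrites "new" with no effect.
                for (int i = 0; i < 5; i++) {
                    tree.put("/n" + i, "new");
                }
                for (int i = 0; i < 5; i++) {
                    if (!"new".equals(tree.get("/n" + i))) {
                        throw new AssertionError("diverged at mask " + mask);
                    }
                }
            }
            System.out.println("All 32 fuzzy snapshots converge after replay.");
        }
    }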




RE: Efficient backup and a reasonable restore of an ensemble

Posted by Flavio Junqueira <fp...@yahoo.com>.
The snapshot might have some of those transactions; it depends on when the
server reads the znode affected by each transaction. Say you have txn T
that sets the data of /a. When generating the snapshot, if the server
serializes /a before T is committed, then the snapshot will not include T.
Otherwise, it includes T.
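
As a toy illustration of that race (hypothetical standalone code, not
ZooKeeper's actual serializer), the value of /a captured below depends only
on whether the commit of T lands before or after the snapshot walk reaches
/a:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class SnapshotRaceDemo {
        public static void main(String[] args) {
            List<String> order = Arrays.asList("/a", "/b", "/c");
            // Try every point at which txn T (a setData on /a) could commit
            // relative to the serialization walk.
            for (int commitStep = 0; commitStep <= order.size(); commitStep++) {
                Map<String, String> live = new HashMap<String, String>();
                for (String z : order) {
                    live.put(z, "old");
                }
                Map<String, String> snap = new LinkedHashMap<String, String>();
                int step = 0;
                for (String z : order) {
                    if (step++ == commitStep) {
                        live.put("/a", "T"); // T commits at this point
                    }
                    snap.put(z, live.get(z)); // znode is serialized
                }
                System.out.println("T commits at step " + commitStep
                        + " -> snapshot sees /a = " + snap.get("/a"));
            }
        }
    }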

-Flavio



Re: Efficient backup and a reasonable restore of an ensemble

Posted by Sergey Maslyakov <ev...@gmail.com>.
I think I am having difficulty understanding the "fuzzy" concept. Let's
say I start to serialize the DataTree into a snapshot file and it takes 30
seconds. During these 30 seconds, the server sees 5 transactions that
update the data. Does this mean that the snapshot I get on disk at the end
of the 30-second interval will contain some of these 5 transactions? Or
none of them? Or all of them? Or will it be inconsistent and unreadable by
Zookeeper?

Please help me better understand the behavior behind the "fuzzy" term.

For my use case, I am perfectly fine if I get a snapshot with none of these
5 transactions, considering that I will pick them up next time I take a
snapshot.


/Sergey
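
For illustration, a toy sketch of the "fuzzy" behavior in plain Python (not
ZooKeeper code; every name here is made up for the example). The tree is
serialized one node at a time while writes keep landing, so the snapshot on
disk may reflect some, none, or all of the concurrent transactions; a
consistent state is recovered by replaying the transaction log over it with
idempotent application:

tree = {"/a": 1, "/b": 1, "/c": 1}   # stand-in for the DataTree
txn_log = []                         # (zxid, path, value), logged before applying
last_zxid = 0

def write(path, value):
    global last_zxid
    last_zxid += 1
    txn_log.append((last_zxid, path, value))
    tree[path] = value

snapshot = {}
for i, path in enumerate(list(tree)):
    snapshot[path] = tree[path]      # copy one node at a time...
    if i == 0:
        write("/b", 99)              # ...while a concurrent write lands

# Depending on copy order, snapshot["/b"] may hold the old or the new value,
# so the snapshot alone may not match any single point in time. Replaying
# the whole log over it restores consistency; re-applying a transaction that
# the snapshot already reflects is harmless because "set" is idempotent.
restored = dict(snapshot)
for zxid, path, value in txn_log:
    restored[path] = value
assert restored == tree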



Re: Efficient backup and a reasonable restore of an ensemble

Posted by Flavio Junqueira <fp...@yahoo.com>.
On Jul 9, 2013, at 7:08 AM, kishore g <g....@gmail.com> wrote:

> Also keep in mind that snapshot is fuzzy only if there are writes happening
> while taking snapshot. If you are sure no writes will happen when you are
> taking the snapshot then you are good. Experts, please correct me if this
> is incorrect.


If there are no concurrent writes, then the snapshot will contain all
zxids up to the one in the file name, and that one will be the last. The
problem is making sure that there are no concurrent updates... Would you
tell all the clients to stop first? Keep trying until you get no
concurrent updates? It sounds difficult, right?

-Flavio
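
A minimal sketch of the file-selection step such a backup script would
need, assuming the standard data-directory layout (snapshots named
snapshot.<zxid in hex>, transaction logs named log.<zxid in hex>, both
under version-2). pick_backup_files() and its "newest log starting at or
before the snapshot zxid" rule are assumptions of this sketch, not code
from the thread:

import os

def pick_backup_files(version2_dir):
    names = os.listdir(version2_dir)
    snaps = sorted((int(n.split(".", 1)[1], 16), n)
                   for n in names if n.startswith("snapshot."))
    logs = sorted((int(n.split(".", 1)[1], 16), n)
                  for n in names if n.startswith("log."))
    if not snaps:
        return []
    snap_zxid, snap_name = snaps[-1]             # newest snapshot
    # A fuzzy snapshot also needs the txnlog that was active when it
    # started, so keep the newest log beginning at or before the snapshot
    # zxid, plus every log that begins after it.
    keep = [name for zxid, name in logs if zxid > snap_zxid]
    older = [name for zxid, name in logs if zxid <= snap_zxid]
    if older:
        keep.insert(0, older[-1])
    return [snap_name] + keep

Copying the returned files off the box is then an ordinary file-level
backup of the kind discussed earlier in the thread.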



Re: Efficient backup and a reasonable restore of an ensemble

Posted by kishore g <g....@gmail.com>.
It's not really elaborate; it is very similar to what zookeeper does when
it starts up. It first reads the latest snapshot file and then the
transaction logs, applying each and every transaction. What I am suggesting
is that instead of applying all transactions, it stops at a transaction I
provide.

Having this tool will actually simplify your task: you can go back to any
point in time. Think of something like this.

checkpoint A  // stores the last zxid or timestamp from the leader
make changes to zk
// if things fail:
stop zks
rollback A    // run this on each zk; brings the cluster back to its
              // previous state
start zks     // any order should be fine


Also keep in mind that a snapshot is fuzzy only if there are writes
happening while it is being taken. If you are sure no writes will happen
while you are taking the snapshot, then you are good. Experts, please
correct me if this is incorrect.

thanks,
Kishore G
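
A toy illustration of the stop-at-a-checkpoint replay described above, in
plain Python with the same simplified (zxid, path, value) records as the
earlier sketch; rollback() is a made-up stand-in, since a real tool would
have to parse ZooKeeper's binary snapshot and txnlog formats:

def rollback(snapshot, txn_log, checkpoint_zxid):
    state = dict(snapshot)
    for zxid, path, value in sorted(txn_log):
        if zxid > checkpoint_zxid:
            break                    # drop everything after the checkpoint
        state[path] = value          # idempotent re-apply, as in recovery
    return state

snapshot = {"/a": 1}
txn_log = [(1, "/a", 2), (2, "/b", 3), (3, "/a", 0)]   # zxid 3 is the bad change
print(rollback(snapshot, txn_log, checkpoint_zxid=2))  # {'/a': 2, '/b': 3}

Run against each server's version-2 directory (with real snapshot/txnlog
parsing in place of the toy records), this would correspond to the
"rollback A" step in the procedure above.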



Re: Efficient backup and a reasonable restore of an ensemble

Posted by Sergey Maslyakov <ev...@gmail.com>.
Kishore,

This sounds like a very elaborate tool. I was trying to find a simple
approach, but what Thawan said about "fuzzy snapshots" makes me a little
afraid that there is no simple solution.


On Mon, Jul 8, 2013 at 11:05 PM, kishore g <g....@gmail.com> wrote:

> Agree, we already have such a tool. In fact we use it to reconstruct the
> sequence of events that led to a failure and actually restore the system to
> a previous stable point and replay the events. Unfortunately this is tied
> closely with Helix but it should be easy to make this a generic tool.
>
> Sergey is this something that will be useful in your case.
>
> Thanks,
> Kishore G
>
>
> On Mon, Jul 8, 2013 at 8:09 PM, Thawan Kooburat <th...@fb.com> wrote:
>
> > On restore part, I think having a separate utility to manipulate the
> > data/snap dir (by truncating the log/removing snapshot to a given zxid)
> > would be easier than modifying the server.
> >
> >
> > --
> > Thawan Kooburat
> >
> >
> >
> >
> >
> > On 7/8/13 6:34 PM, "kishore g" <g....@gmail.com> wrote:
> >
> > >I think what we are looking at is a  point in time restore
> functionality.
> > >How about adding a feature that says go back to a specific
> zxid/timestamp.
> > >This way before doing any change to zookeeper simply note down the
> > >timestamp/zxid on leader. If things go wrong after making changes, bring
> > >down zookeepers and provide additional parameter of a zxid/timestamp
> while
> > >restarting. The server can go the exact point and make it current. The
> > >followers can be started blank.
> > >
> > >
> > >
> > >On Mon, Jul 8, 2013 at 5:53 PM, Thawan Kooburat <th...@fb.com> wrote:
> > >
> > >> Just saw that  this is the corresponding use case to the question
> posted
> > >> in dev list.
> > >>
> > >> In order to restore the data to a given point in time correctly, you
> > >>need
> > >> both snapshot and txnlog. This is because zookeeper snapshot is fuzzy
> > >>and
> > >> snapshot alone may not represent a valid state of the server if there
> > >>are
> > >> in-flight requests.
> > >>
> > >> The 4wl command should cause the server to roll the log and take a
> > >> snapshot similar to periodic snapshotting operation. Your backup
> script
> > >> need grap the snapshot and corresponding txnlog file from the data
> dir.
> > >>
> > >> To restore, just shutdown all hosts, clear the data dir, copy over the
> > >> snapshot and txnlog, and restart them.
> > >>
> > >>
> > >> --
> > >> Thawan Kooburat
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On 7/8/13 3:28 PM, "Sergey Maslyakov" <ev...@gmail.com> wrote:
> > >>
> > >> >Thank you for your response, Flavio. I apologize, I did not provide a
> > >> >clear
> > >> >explanation of the use case.
> > >> >
> > >> >This backup/restore is not intended to be tied to any write event,
> > >> >instead,
> > >> >it is expected to run as a periodic (daily?) cron job on one of the
> > >> >servers, which is not guaranteed to be the leader of the ensemble.
> > >>There
> > >> >is
> > >> >no expectation that all recent changes are committed and persisted to
> > >> >disk.
> > >> >The system can sustain the loss of several hours worth of recent
> > >>changes
> > >> >in
> > >> >the event of restore.
> > >> >
> > >> >As for finding the leader dynamically and performing backup on it,
> this
> > >> >approach could be more difficult as the leader can change time to
> time
> > >>and
> > >> >I still need to fetch the file to store it in my designated backup
> > >> >location. Taking backup on one server and picking it up from a local
> > >>file
> > >> >system looks less error-prone. Even if I went the fancy route and had
> > >> >Zookeeper send me the serialized DataTree in response to the 4wl,
> this
> > >> >approach would involve a lot of moving parts.
> > >> >
> > >> >I have already made a PoC for a new 4wl that invokes takeSnapshot()
> and
> > >> >returns an absolute path to the snapshot it drops on disk. I have
> > >>already
> > >> >protected takeSnapshot() from concurrent invocation, which is likely
> to
> > >> >corrupt the snapshot file on disk. This approach works but I'm
> > >>thinking to
> > >> >take it one step further by providing the desired path name as an
> > >>argument
> > >> >to my new 4lw and to have Zookeeper server drop the snapshot into the
> > >> >specified file and report success/failure back. This way I can avoid
> > >> >cluttering the data directory and interfering with what Zookeeper
> finds
> > >> >when it scans the data directory.
> > >> >
> > >> >Approach with having an additional server that would take the
> > >>leadership
> > >> >and populate the ensemble is just a theory. I don't see a clean way
> of
> > >> >making a quorum member the leader of the quorum. Am I overlooking
> > >> >something
> > >> >simple?
> > >> >
> > >> >In backup and restore of an ensemble the biggest unknown for me
> remains
> > >> >populating the ensemble with desired data. I can think of two ways:
> > >> >
> > >> >1. Clear out all servers by stopping them, purge version-2
> directories,
> > >> >restore a snapshot file on one server that will be brought first, and
> > >>then
> > >> >bring up the rest of the ensemble. This way I somewhat force the
> first
> > >> >server to be the leader because it has data and it will be the only
> > >>member
> > >> >of a quorum with data, provided to the way I start the ensemble. This
> > >> >looks
> > >> >like a hack, though.
> > >> >
> > >> >2. Clear out the ensemble and reload it with a dedicated client using
> > >>the
> > >> >provided Zookeeper API.
> > >> >
> > >> >With the approach of backing up an actual snapshot file, option #1
> > >>appears
> > >> >to be more practical.
> > >> >
> > >> >I wish I could start the ensemble with a designate leader that would
> > >> >bootstrap the ensemble with data and then the ensemble would go into
> > >>its
> > >> >normal business...
> > >> >[...]

Re: Efficient backup and a reasonable restore of an ensemble

Posted by kishore g <g....@gmail.com>.
Agreed, we already have such a tool. In fact, we use it to reconstruct the
sequence of events that led to a failure, restore the system to a previous
stable point, and replay the events. Unfortunately it is tied closely to
Helix, but it should be easy to make it a generic tool.

Sergey, is this something that would be useful in your case?
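
For reference, the read side of such a generic tool can be as small as a
recursive walk with the plain client API. A minimal sketch (hypothetical
class name; it only prints each path and data size, and, like any live
export, it is not a transactionally consistent image):

    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Sketch of a generic "export" walk: visits every znode and reports
    // its path and data size. The tree can change while the walk runs,
    // so the output is not a consistent point-in-time image.
    public class ZkExportSketch {
        static void dump(ZooKeeper zk, String path) throws Exception {
            Stat stat = new Stat();
            byte[] data = zk.getData(path, false, stat);
            System.out.println(path + " " + (data == null ? 0 : data.length) + " bytes");
            for (String child : zk.getChildren(path, false)) {
                dump(zk, "/".equals(path) ? "/" + child : path + "/" + child);
            }
        }

        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, null);
            try {
                dump(zk, "/");
            } finally {
                zk.close();
            }
        }
    }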

Thanks,
Kishore G


On Mon, Jul 8, 2013 at 8:09 PM, Thawan Kooburat <th...@fb.com> wrote:

> On the restore part, I think having a separate utility to manipulate the
> data/snapshot directories (by truncating the log and removing snapshots
> back to a given zxid) would be easier than modifying the server.
>
>
> --
> Thawan Kooburat
>
> [...]

Re: Efficient backup and a reasonable restore of an ensemble

Posted by Sergey Maslyakov <ev...@gmail.com>.
Thawan,

Restoring a single server is not a big deal, but restoring data in an
ensemble gets tricky when there is no mechanism to "initialize" the
ensemble's data store to a certain state. What I have come up with so far
is just a couple of work-arounds that trick Zookeeper's logic into
following a path that I cannot directly enforce.


On Mon, Jul 8, 2013 at 10:09 PM, Thawan Kooburat <th...@fb.com> wrote:

> On the restore part, I think having a separate utility to manipulate the
> data/snapshot directories (by truncating the log and removing snapshots
> back to a given zxid) would be easier than modifying the server.
>
>
> --
> Thawan Kooburat
>
> [...]

Re: Efficient backup and a reasonable restore of an ensemble

Posted by Sergey Maslyakov <ev...@gmail.com>.
Sounds like a long transaction (or undo) log, which may impact performance.


On Mon, Jul 8, 2013 at 8:34 PM, kishore g <g....@gmail.com> wrote:

> I think what we are looking at is point-in-time restore functionality. How
> about adding a feature that says go back to a specific zxid/timestamp?
> This way, before making any change to Zookeeper, you simply note down the
> timestamp/zxid on the leader. If things go wrong after making changes,
> bring down the Zookeeper servers and provide an additional zxid/timestamp
> parameter while restarting. The server can go back to that exact point and
> make it current. The followers can be started blank.
>
> [...]

Re: Efficient backup and a reasonable restore of an ensemble

Posted by Thawan Kooburat <th...@fb.com>.
On the restore part, I think having a separate utility to manipulate the
data/snapshot directories (by truncating the log and removing snapshots
back to a given zxid) would be easier than modifying the server.
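
A minimal sketch of what that utility could look like (hypothetical class
name; it relies only on the standard snapshot.<hex-zxid> and log.<hex-zxid>
file naming, deletes whole files that start past the target zxid, and
leaves out truncating the one log file that actually spans the target):

    import java.io.File;

    // Sketch: delete snapshots and txnlogs whose names show they start
    // after the target zxid, so a restarted server replays only up to
    // (roughly) that point. The log file containing the target zxid would
    // additionally need to be truncated; that part is not shown.
    public class TrimDataDirSketch {
        static void trim(File versionDir, long targetZxid) {
            File[] files = versionDir.listFiles();
            if (files == null) {
                return;
            }
            for (File f : files) {
                String name = f.getName();
                long zxid;
                if (name.startsWith("snapshot.")) {
                    zxid = Long.parseLong(name.substring("snapshot.".length()), 16);
                } else if (name.startsWith("log.")) {
                    zxid = Long.parseLong(name.substring("log.".length()), 16);
                } else {
                    continue;  // skip unrelated files
                }
                if (zxid > targetZxid) {
                    System.out.println("removing " + f);
                    f.delete();
                }
            }
        }

        public static void main(String[] args) {
            trim(new File(args[0]), Long.parseLong(args[1], 16));
        }
    }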


-- 
Thawan Kooburat





On 7/8/13 6:34 PM, "kishore g" <g....@gmail.com> wrote:

>I think what we are looking at is point-in-time restore functionality. How
>about adding a feature that says go back to a specific zxid/timestamp?
>This way, before making any change to Zookeeper, you simply note down the
>timestamp/zxid on the leader. If things go wrong after making changes,
>bring down the Zookeeper servers and provide an additional zxid/timestamp
>parameter while restarting. The server can go back to that exact point and
>make it current. The followers can be started blank.
>
>[...]


Re: Efficient backup and a reasonable restore of an ensemble

Posted by kishore g <g....@gmail.com>.
I think what we are looking at is point-in-time restore functionality. How
about adding a feature that says go back to a specific zxid/timestamp?
This way, before making any change to Zookeeper, you simply note down the
timestamp/zxid on the leader. If things go wrong after making changes,
bring down the Zookeeper servers and provide an additional zxid/timestamp
parameter while restarting. The server can go back to that exact point and
make it current. The followers can be started blank.
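
Noting down the zxid is cheap, since the existing "stat" four-letter word
already reports it. A minimal sketch (hypothetical class name, host and
port hard-coded for illustration):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.Socket;

    // Sketch: read the current zxid from a server via the "stat"
    // four-letter word, whose output contains a "Zxid: 0x..." line.
    public class RecordZxidSketch {
        static String currentZxid(String host, int port) throws Exception {
            Socket s = new Socket(host, port);
            try {
                s.getOutputStream().write("stat".getBytes());
                s.getOutputStream().flush();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(s.getInputStream()));
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("Zxid:")) {
                        return line.substring("Zxid:".length()).trim();
                    }
                }
                return null;
            } finally {
                s.close();
            }
        }

        public static void main(String[] args) throws Exception {
            System.out.println(currentZxid("localhost", 2181));
        }
    }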



On Mon, Jul 8, 2013 at 5:53 PM, Thawan Kooburat <th...@fb.com> wrote:

> Just saw that this is the corresponding use case for the question posted
> in the dev list.
>
> In order to restore the data to a given point in time correctly, you need
> both the snapshot and the txnlog. This is because a Zookeeper snapshot is
> fuzzy, and a snapshot alone may not represent a valid state of the server
> if there are in-flight requests.
>
> The 4lw command should cause the server to roll the log and take a
> snapshot, similar to the periodic snapshotting operation. Your backup
> script needs to grab the snapshot and the corresponding txnlog file from
> the data dir.
>
> To restore, just shut down all hosts, clear the data dir, copy over the
> snapshot and txnlog, and restart them.
>
>
> --
> Thawan Kooburat
>
>
>
>
>
> On 7/8/13 3:28 PM, "Sergey Maslyakov" <ev...@gmail.com> wrote:
>
> >Thank you for your response, Flavio. I apologize, I did not provide a
> >clear
> >explanation of the use case.
> >
> >This backup/restore is not intended to be tied to any write event,
> >instead,
> >it is expected to run as a periodic (daily?) cron job on one of the
> >servers, which is not guaranteed to be the leader of the ensemble. There
> >is
> >no expectation that all recent changes are committed and persisted to
> >disk.
> >The system can sustain the loss of several hours worth of recent changes
> >in
> >the event of restore.
> >
> >As for finding the leader dynamically and performing backup on it, this
> >approach could be more difficult as the leader can change time to time and
> >I still need to fetch the file to store it in my designated backup
> >location. Taking backup on one server and picking it up from a local file
> >system looks less error-prone. Even if I went the fancy route and had
> >Zookeeper send me the serialized DataTree in response to the 4wl, this
> >approach would involve a lot of moving parts.
> >
> >I have already made a PoC for a new 4wl that invokes takeSnapshot() and
> >returns an absolute path to the snapshot it drops on disk. I have already
> >protected takeSnapshot() from concurrent invocation, which is likely to
> >corrupt the snapshot file on disk. This approach works but I'm thinking to
> >take it one step further by providing the desired path name as an argument
> >to my new 4lw and to have Zookeeper server drop the snapshot into the
> >specified file and report success/failure back. This way I can avoid
> >cluttering the data directory and interfering with what Zookeeper finds
> >when it scans the data directory.
> >
> >Approach with having an additional server that would take the leadership
> >and populate the ensemble is just a theory. I don't see a clean way of
> >making a quorum member the leader of the quorum. Am I overlooking
> >something
> >simple?
> >
> >In backup and restore of an ensemble the biggest unknown for me remains
> >populating the ensemble with desired data. I can think of two ways:
> >
> >1. Clear out all servers by stopping them, purge version-2 directories,
> >restore a snapshot file on one server that will be brought first, and then
> >bring up the rest of the ensemble. This way I somewhat force the first
> >server to be the leader because it has data and it will be the only member
> >of a quorum with data, provided to the way I start the ensemble. This
> >looks
> >like a hack, though.
> >
> >2. Clear out the ensemble and reload it with a dedicated client using the
> >provided Zookeeper API.
> >
> >With the approach of backing up an actual snapshot file, option #1 appears
> >to be more practical.
> >
> >I wish I could start the ensemble with a designated leader that would
> >bootstrap the ensemble with data, after which the ensemble would go about its
> >normal business...
> >
> >
> >
> >On Mon, Jul 8, 2013 at 4:30 PM, Flavio Junqueira
> ><fp...@yahoo.com>wrote:
> >
> >> One thing that is still a bit confusing to me in your use case is whether you
> >> need to take a snapshot right after some event in your application.
> >>Even if
> >> you're able to tell ZooKeeper to take a snapshot, there is no guarantee
> >> that it will happen at the exact point you want it if update operations
> >> keep coming.
> >>
> >> If you use your four-letter word approach, then would you search for the
> >> leader or would you simply take a snapshot at any server? If it has to
> >>go
> >> through the leader so that you make sure to have the most recent committed
> >> state, then it might not be a bad idea to have an API call that tells the
> >> leader to take a snapshot at some directory of your choice. Informing you
> >> of the name of the snapshot file so that you can copy it sounds like an
> >> option, but perhaps it is not as convenient.
> >>
> >> The approach of adding another server is not very clear. How do you
> >>force
> >> it to be the leader? Keep in mind that if it crashes, then it will lose
> >> leadership.
> >>
> >> -Flavio
> >>
> >> On Jul 8, 2013, at 8:34 AM, Sergey Maslyakov <ev...@gmail.com> wrote:
> >>
> >> > It looks like the "dev" mailing list is rather inactive. Over the past
> >> few
> >> > days I only saw several automated emails from JIRA and this is pretty
> >> much
> >> > it. In contrast, the "user" mailing list seems to be more alive and
> >> > more populated.
> >> >
> >> > With this in mind, please allow me to cross-post here the message I
> >>sent
> >> > into the "dev" list a few days ago.
> >> >
> >> >
> >> > Regards,
> >> > /Sergey
> >> >
> >> > === forwarded message begins here ===
> >> >
> >> > Hi!
> >> >
> >> > I'm facing the problem that has been raised by multiple people but
> >>none
> >> of
> >> > the discussion threads seem to provide a good answer. I dug in
> >>Zookeeper
> >> > source code trying to come up with some possible approaches and I
> >>would
> >> > like to get your inputs on those.
> >> >
> >> > Initial conditions:
> >> >
> >> > * I have an ensemble of five Zookeeper servers running v3.4.5 code.
> >> > * The size of a committed snapshot file is in the vicinity of 1GB.
> >> > * There are about 80 clients connected to the ensemble.
> >> > * Clients are heavily read-biased, i.e., they mostly read and rarely
> >> > write. I would say less than 0.1% of queries modify the data.
> >> >
> >> > Problem statement:
> >> >
> >> > * Under certain conditions, I may need to revert the data stored in
> >>the
> >> > ensemble to an earlier state. For example, one of the clients may ruin
> >> the
> >> > application-level data integrity and I need to perform a disaster
> >> recovery.
> >> >
> >> > Things look nice and easy if I'm dealing with a single Zookeeper
> >>server.
> >> A
> >> > file-level copy of the data and dataLog directories should allow me to
> >> > recover later by stopping Zookeeper, swapping the corrupted data and
> >> > dataLog directories with a backup, and firing Zookeeper back up.
> >> >
> >> > Now, the ensemble deployment and the leader election algorithm in the
> >> > quorum make things much more difficult. In order to restore from a
> >>single
> >> > file-level backup, I need to take the whole ensemble down, wipe out
> >>data
> >> > and dataLog directories on all servers, replace these directories with
> >> > backed up content on one of the servers, bring this server up first,
> >>and
> >> > then bring up the rest of the ensemble. This [somewhat] guarantees
> >>that
> >> the
> >> > populated Zookeeper server becomes a member of a majority and
> >>populates
> >> the
> >> > ensemble. This approach works but it is very involved and, thus,
> >> > error-prone due to human error.
> >> >
> >> > Based on a study of Zookeeper source code, I am considering the
> >> > following alternatives, and I seek advice from the Zookeeper development
> >> > community as to which approach looks more promising or if there is a
> >> > better way.
> >> >
> >> > Approach #1:
> >> >
> >> > Develop a complementary pair of utilities for export and import of the
> >> > data. Both utilities will act as Zookeeper clients and use the
> >>existing
> >> > API. The "export" utility will recursively retrieve data and store it
> >>in
> >> a
> >> > file. The "import" utility will first purge all data from the ensemble
> >> and
> >> > then reload it from the file.
> >> >
> >> > This approach seems to be the simplest and there are similar tools
> >> > developed already. For example, the Guano Project:
> >> > https://github.com/d2fn/guano
> >> >
> >> > I don't like two things about it:
> >> > * Poor performance even for a backup of a data store of my size.
> >> > * Possible data consistency issues due to concurrent access by the
> >>export
> >> > utility as well as other "normal" clients.
> >> >
> >> > Approach #2:
> >> >
> >> > Add another four-letter command that would force rolling up the
> >> > transactions and creating a snapshot. The result of this command would
> >> be a
> >> > new snapshot.XXXX file on disk and the name of the file could be
> >>reported
> >> > back to the client as a response to the four-letter command. This way, I
> >> > would know which snapshot file to grab for a possible future restore. But
> >> > restoring from a snapshot file is almost as involved as the error-prone
> >> > sequence described in the "Initial conditions" above.
> >> >
> >> > Approach #3:
> >> >
> >> > Come up with a way to temporarily add a new Zookeeper server into a
> >> > live ensemble that would take over (how?) the leader role and push out
> >> > the
> >> > snapshot that it has into all ensemble members upon restore. This
> >> approach
> >> > could be difficult and error-prone to implement because it will
> >>require
> >> > hacking the existing election algorithm to designate a leader.
> >> >
> >> > So, which of the approaches do you think works best for an ensemble
> >>and
> >> for
> >> > the database size of about 1GB?
> >> >
> >> >
> >> > Any advice will be highly appreciated!
> >> > /Sergey
> >>
> >>
>
>

Re: Efficient backup and a reasonable restore of an ensemble

Posted by Thawan Kooburat <th...@fb.com>.
Just saw that this is the use case corresponding to the question posted
in the dev list.

In order to restore the data to a given point in time correctly, you need
both the snapshot and the txnlog. This is because the Zookeeper snapshot is
fuzzy, and a snapshot alone may not represent a valid state of the server if
there are in-flight requests.

The 4lw command should cause the server to roll the log and take a
snapshot, similar to the periodic snapshotting operation. Your backup script
needs to grab the snapshot and the corresponding txnlog file from the data dir.
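
For illustration, the backup grab could be a small Java program along these
lines. This is only a hedged sketch: the directory paths are made-up
examples, and the log selection leans on the snapshot.<hex zxid> /
log.<hex zxid> file-naming convention (keep the last log that starts at or
before the snapshot's zxid, plus anything newer).

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;

    // Sketch of the backup step: find the newest snapshot in the data dir
    // and copy it out together with the txnlogs that can replay past it.
    public class GrabBackup {
        public static void main(String[] args) throws IOException {
            Path dataDir = Paths.get("/var/lib/zookeeper/version-2");   // assumption
            Path backupDir = Files.createDirectories(Paths.get("/backup/zookeeper"));

            long newestSnapZxid = -1;
            Path newestSnap = null;
            List<Path> logs = new ArrayList<>();
            try (DirectoryStream<Path> files = Files.newDirectoryStream(dataDir)) {
                for (Path f : files) {
                    String name = f.getFileName().toString();
                    if (name.startsWith("snapshot.")) {
                        long zxid = Long.parseLong(name.substring(9), 16);
                        if (zxid > newestSnapZxid) {
                            newestSnapZxid = zxid;
                            newestSnap = f;
                        }
                    } else if (name.startsWith("log.")) {
                        logs.add(f);
                    }
                }
            }
            if (newestSnap == null) throw new IllegalStateException("no snapshot found");
            Files.copy(newestSnap, backupDir.resolve(newestSnap.getFileName()),
                       StandardCopyOption.REPLACE_EXISTING);

            // Keep every txnlog that may hold transactions at or after the
            // snapshot's zxid: the last log starting at or before it, plus
            // all later ones (if none is older, all logs are kept).
            long cutoff = -1;
            for (Path log : logs) {
                long start = Long.parseLong(log.getFileName().toString().substring(4), 16);
                if (start <= newestSnapZxid && start > cutoff) cutoff = start;
            }
            for (Path log : logs) {
                long start = Long.parseLong(log.getFileName().toString().substring(4), 16);
                if (start >= cutoff) {
                    Files.copy(log, backupDir.resolve(log.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }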

To restore, just shut down all hosts, clear the data dir, copy over the
snapshot and txnlog, and restart them.
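
Per host, that file shuffle is nothing more than the following hedged sketch
(run with the server stopped; the directory locations are illustrative
assumptions):

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.stream.Stream;

    // Sketch of the per-host restore step: with the server down, clear the
    // version-2 directory and copy the backed-up snapshot/txnlog pair back in.
    public class RestoreDataDir {
        public static void main(String[] args) throws IOException {
            Path dataDir = Paths.get("/var/lib/zookeeper/version-2");   // assumption
            Path backupDir = Paths.get("/backup/zookeeper/2013-07-08"); // assumption

            // Wipe the existing snapshots and logs (server must be down).
            try (Stream<Path> files = Files.list(dataDir)) {
                for (Path f : (Iterable<Path>) files::iterator) {
                    Files.delete(f);
                }
            }

            // Copy back the chosen snapshot and its corresponding txnlogs.
            try (Stream<Path> files = Files.list(backupDir)) {
                for (Path f : (Iterable<Path>) files::iterator) {
                    String name = f.getFileName().toString();
                    if (name.startsWith("snapshot.") || name.startsWith("log.")) {
                        Files.copy(f, dataDir.resolve(name),
                                   StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            }
        }
    }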


-- 
Thawan Kooburat





On 7/8/13 3:28 PM, "Sergey Maslyakov" <ev...@gmail.com> wrote:

>Thank you for your response, Flavio. I apologize; I did not provide a clear
>explanation of the use case.
>
>This backup/restore is not intended to be tied to any write event; instead,
>it is expected to run as a periodic (daily?) cron job on one of the
>servers, which is not guaranteed to be the leader of the ensemble. There
>is
>no expectation that all recent changes are committed and persisted to
>disk.
>The system can sustain the loss of several hours worth of recent changes
>in
>the event of restore.
>
>As for finding the leader dynamically and performing the backup on it, this
>approach could be more difficult, as the leader can change from time to time and
>I still need to fetch the file to store it in my designated backup
>location. Taking backup on one server and picking it up from a local file
>system looks less error-prone. Even if I went the fancy route and had
>Zookeeper send me the serialized DataTree in response to the 4lw, this
>approach would involve a lot of moving parts.
>
>I have already made a PoC for a new 4lw that invokes takeSnapshot() and
>returns an absolute path to the snapshot it drops on disk. I have already
>protected takeSnapshot() from concurrent invocation, which is likely to
>corrupt the snapshot file on disk. This approach works, but I'm thinking of
>taking it one step further by providing the desired path name as an argument
>to my new 4lw and to have the Zookeeper server drop the snapshot into the
>specified file and report success/failure back. This way I can avoid
>cluttering the data directory and interfering with what Zookeeper finds
>when it scans the data directory.
>
>The approach of having an additional server take the leadership
>and populate the ensemble is just a theory. I don't see a clean way of
>making a quorum member the leader of the quorum. Am I overlooking
>something
>simple?
>
>In backing up and restoring an ensemble, the biggest unknown for me remains
>populating the ensemble with the desired data. I can think of two ways:
>
>1. Clear out all servers by stopping them, purge version-2 directories,
>restore a snapshot file on one server that will be brought up first, and then
>bring up the rest of the ensemble. This way I somewhat force the first
>server to be the leader because it has data and it will be the only member
>of a quorum with data, owing to the way I start the ensemble. This looks
>like a hack, though.
>
>2. Clear out the ensemble and reload it with a dedicated client using the
>provided Zookeeper API.
>
>With the approach of backing up an actual snapshot file, option #1 appears
>to be more practical.
>
>I wish I could start the ensemble with a designated leader that would
>bootstrap the ensemble with data, after which the ensemble would go about its
>normal business...
>
>
>
>On Mon, Jul 8, 2013 at 4:30 PM, Flavio Junqueira
><fp...@yahoo.com>wrote:
>
>> One thing that is still a bit confusing to me in your use case is whether you
>> need to take a snapshot right after some event in your application.
>>Even if
>> you're able to tell ZooKeeper to take a snapshot, there is no guarantee
>> that it will happen at the exact point you want it if update operations
>> keep coming.
>>
>> If you use your four-letter word approach, then would you search for the
>> leader or would you simply take a snapshot at any server? If it has to
>>go
>> through the leader so that you make sure to have the most recent committed
>> state, then it might not be a bad idea to have an API call that tells the
>> leader to take a snapshot at some directory of your choice. Informing you
>> of the name of the snapshot file so that you can copy it sounds like an
>> option, but perhaps it is not as convenient.
>>
>> The approach of adding another server is not very clear. How do you
>>force
>> it to be the leader? Keep in mind that if it crashes, then it will lose
>> leadership.
>>
>> -Flavio
>>
>> On Jul 8, 2013, at 8:34 AM, Sergey Maslyakov <ev...@gmail.com> wrote:
>>
>> > It looks like the "dev" mailing list is rather inactive. Over the past
>> few
>> > days I only saw several automated emails from JIRA and this is pretty
>> much
>> > it. In contrast, the "user" mailing list seems to be more alive and
>> > more populated.
>> >
>> > With this in mind, please allow me to cross-post here the message I
>>sent
>> > into the "dev" list a few days ago.
>> >
>> >
>> > Regards,
>> > /Sergey
>> >
>> > === forwarded message begins here ===
>> >
>> > Hi!
>> >
>> > I'm facing the problem that has been raised by multiple people but
>>none
>> of
>> > the discussion threads seem to provide a good answer. I dug in
>>Zookeeper
>> > source code trying to come up with some possible approaches and I
>>would
>> > like to get your inputs on those.
>> >
>> > Initial conditions:
>> >
>> > * I have an ensemble of five Zookeeper servers running v3.4.5 code.
>> > * The size of a committed snapshot file is in the vicinity of 1GB.
>> > * There are about 80 clients connected to the ensemble.
>> > * Clients are heavily read-biased, i.e., they mostly read and rarely
>> > write. I would say less than 0.1% of queries modify the data.
>> >
>> > Problem statement:
>> >
>> > * Under certain conditions, I may need to revert the data stored in
>>the
>> > ensemble to an earlier state. For example, one of the clients may ruin
>> the
>> > application-level data integrity and I need to perform a disaster
>> recovery.
>> >
>> > Things look nice and easy if I'm dealing with a single Zookeeper
>>server.
>> A
>> > file-level copy of the data and dataLog directories should allow me to
>> > recover later by stopping Zookeeper, swapping the corrupted data and
>> > dataLog directories with a backup, and firing Zookeeper back up.
>> >
>> > Now, the ensemble deployment and the leader election algorithm in the
>> > quorum make things much more difficult. In order to restore from a
>>single
>> > file-level backup, I need to take the whole ensemble down, wipe out
>>data
>> > and dataLog directories on all servers, replace these directories with
>> > backed up content on one of the servers, bring this server up first,
>>and
>> > then bring up the rest of the ensemble. This [somewhat] guarantees
>>that
>> the
>> > populated Zookeeper server becomes a member of a majority and
>>populates
>> the
>> > ensemble. This approach works but it is very involved and, thus,
>> > error-prone due to human error.
>> >
>> > Based on a study of Zookeeper source code, I am considering the
>> > following alternatives, and I seek advice from the Zookeeper development
>> > community as to which approach looks more promising or if there is a
>> > better way.
>> >
>> > Approach #1:
>> >
>> > Develop a complementary pair of utilities for export and import of the
>> > data. Both utilities will act as Zookeeper clients and use the
>>existing
>> > API. The "export" utility will recursively retrieve data and store it
>>in
>> a
>> > file. The "import" utility will first purge all data from the ensemble
>> and
>> > then reload it from the file.
>> >
>> > This approach seems to be the simplest and there are similar tools
>> > developed already. For example, the Guano Project:
>> > https://github.com/d2fn/guano
>> >
>> > I don't like two things about it:
>> > * Poor performance even for a backup of a data store of my size.
>> > * Possible data consistency issues due to concurrent access by the
>>export
>> > utility as well as other "normal" clients.
>> >
>> > Approach #2:
>> >
>> > Add another four-letter command that would force rolling up the
>> > transactions and creating a snapshot. The result of this command would
>> be a
>> > new snapshot.XXXX file on disk and the name of the file could be
>>reported
>> > back to the client as a response to the four-letter command. This way, I
>> > would know which snapshot file to grab for a possible future restore. But
>> > restoring from a snapshot file is almost as involved as the error-prone
>> > sequence described in the "Initial conditions" above.
>> >
>> > Approach #3:
>> >
>> > Come up with a way to temporarily add a new Zookeeper server into a
>> > live ensemble that would take over (how?) the leader role and push out
>> > the
>> > snapshot that it has into all ensemble members upon restore. This
>> approach
>> > could be difficult and error-prone to implement because it will
>>require
>> > hacking the existing election algorithm to designate a leader.
>> >
>> > So, which of the approaches do you think works best for an ensemble
>>and
>> for
>> > the database size of about 1GB?
>> >
>> >
>> > Any advice will be highly appreciated!
>> > /Sergey
>>
>>


Re: Efficient backup and a reasonable restore of an ensemble

Posted by Sergey Maslyakov <ev...@gmail.com>.
Thank you for your response, Flavio. I apologize; I did not provide a clear
explanation of the use case.

This backup/restore is not intended to be tied to any write event; instead,
it is expected to run as a periodic (daily?) cron job on one of the
servers, which is not guaranteed to be the leader of the ensemble. There is
no expectation that all recent changes are committed and persisted to disk.
The system can sustain the loss of several hours worth of recent changes in
the event of restore.

As for finding the leader dynamically and performing the backup on it, this
approach could be more difficult, as the leader can change from time to time and
I still need to fetch the file to store it in my designated backup
location. Taking backup on one server and picking it up from a local file
system looks less error-prone. Even if I went the fancy route and had
Zookeeper send me the serialized DataTree in response to the 4lw, this
approach would involve a lot of moving parts.

I have already made a PoC for a new 4lw that invokes takeSnapshot() and
returns an absolute path to the snapshot it drops on disk. I have already
protected takeSnapshot() from concurrent invocation, which is likely to
corrupt the snapshot file on disk. This approach works, but I'm thinking of
taking it one step further by providing the desired path name as an argument
to my new 4lw and to have the Zookeeper server drop the snapshot into the
specified file and report success/failure back. This way I can avoid
cluttering the data directory and interfering with what Zookeeper finds
when it scans the data directory.
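
As a rough sketch of the guard described above: SnapshotService is a
hypothetical stand-in for the real ZooKeeperServer plumbing, not the actual
3.4 integration point, and the four-letter-word dispatch itself is omitted.

    import java.io.File;
    import java.io.IOException;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical seam standing in for the server's snapshot machinery.
    interface SnapshotService {
        File takeSnapshot() throws IOException; // returns the snapshot file written
    }

    class GuardedSnapshotCommand {
        private final SnapshotService server;
        private final ReentrantLock snapshotLock = new ReentrantLock();

        GuardedSnapshotCommand(SnapshotService server) {
            this.server = server;
        }

        /** Handles the 4lw: serializes snapshot requests and reports the path. */
        String handle() {
            if (!snapshotLock.tryLock()) {
                return "ERROR: snapshot already in progress";
            }
            try {
                return server.takeSnapshot().getAbsolutePath();
            } catch (IOException e) {
                return "ERROR: " + e.getMessage();
            } finally {
                snapshotLock.unlock();
            }
        }
    }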

The approach of having an additional server take the leadership
and populate the ensemble is just a theory. I don't see a clean way of
making a quorum member the leader of the quorum. Am I overlooking something
simple?

In backing up and restoring an ensemble, the biggest unknown for me remains
populating the ensemble with the desired data. I can think of two ways:

1. Clear out all servers by stopping them, purge version-2 directories,
restore a snapshot file on one server that will be brought up first, and then
bring up the rest of the ensemble. This way I somewhat force the first
server to be the leader because it has data and it will be the only member
of a quorum with data, owing to the way I start the ensemble. This looks
like a hack, though.

2. Clear out the ensemble and reload it with a dedicated client using the
provided Zookeeper API.

With the approach of backing up an actual snapshot file, option #1 appears
to be more practical.

I wish I could start the ensemble with a designated leader that would
bootstrap the ensemble with data, after which the ensemble would go about its
normal business...



On Mon, Jul 8, 2013 at 4:30 PM, Flavio Junqueira <fp...@yahoo.com>wrote:

> One thing that is still a bit confusing to me in your use case is whether you
> need to take a snapshot right after some event in your application. Even if
> you're able to tell ZooKeeper to take a snapshot, there is no guarantee
> that it will happen at the exact point you want it if update operations
> keep coming.
>
> If you use your four-letter word approach, then would you search for the
> leader or would you simply take a snapshot at any server? If it has to go
> through the leader so that you make sure to have the most recent committed
> state, then it might not be a bad idea to have an API call that tells the
> leader to take a snapshot at some directory of your choice. Informing you of
> the name of the snapshot file so that you can copy it sounds like an option,
> but perhaps it is not as convenient.
>
> The approach of adding another server is not very clear. How do you force
> it to be the leader? Keep in mind that if it crashes, then it will lose
> leadership.
>
> -Flavio
>
> On Jul 8, 2013, at 8:34 AM, Sergey Maslyakov <ev...@gmail.com> wrote:
>
> > It looks like the "dev" mailing list is rather inactive. Over the past
> few
> > days I only saw several automated emails from JIRA and this is pretty
> much
> > it. In contrast, the "user" mailing list seems to be more alive and
> > more populated.
> >
> > With this in mind, please allow me to cross-post here the message I sent
> > into the "dev" list a few days ago.
> >
> >
> > Regards,
> > /Sergey
> >
> > === forwarded message begins here ===
> >
> > Hi!
> >
> > I'm facing the problem that has been raised by multiple people but none
> of
> > the discussion threads seem to provide a good answer. I dug in Zookeeper
> > source code trying to come up with some possible approaches and I would
> > like to get your inputs on those.
> >
> > Initial conditions:
> >
> > * I have an ensemble of five Zookeeper servers running v3.4.5 code.
> > * The size of a committed snapshot file is in the vicinity of 1GB.
> > * There are about 80 clients connected to the ensemble.
> > * Clients are heavily read-biased, i.e., they mostly read and rarely
> > write. I would say less than 0.1% of queries modify the data.
> >
> > Problem statement:
> >
> > * Under certain conditions, I may need to revert the data stored in the
> > ensemble to an earlier state. For example, one of the clients may ruin
> the
> > application-level data integrity and I need to perform a disaster
> recovery.
> >
> > Things look nice and easy if I'm dealing with a single Zookeeper server.
> A
> > file-level copy of the data and dataLog directories should allow me to
> > recover later by stopping Zookeeper, swapping the corrupted data and
> > dataLog directories with a backup, and firing Zookeeper back up.
> >
> > Now, the ensemble deployment and the leader election algorithm in the
> > quorum make things much more difficult. In order to restore from a single
> > file-level backup, I need to take the whole ensemble down, wipe out data
> > and dataLog directories on all servers, replace these directories with
> > backed up content on one of the servers, bring this server up first, and
> > then bring up the rest of the ensemble. This [somewhat] guarantees that
> the
> > populated Zookeeper server becomes a member of a majority and populates
> the
> > ensemble. This approach works but it is very involved and, thus,
> > error-prone due to human error.
> >
> > Based on a study of Zookeeper source code, I am considering the following
> > alternatives, and I seek advice from the Zookeeper development community
> > as to which approach looks more promising or if there is a better way.
> >
> > Approach #1:
> >
> > Develop a complementary pair of utilities for export and import of the
> > data. Both utilities will act as Zookeeper clients and use the existing
> > API. The "export" utility will recursively retrieve data and store it in
> a
> > file. The "import" utility will first purge all data from the ensemble
> and
> > then reload it from the file.
> >
> > This approach seems to be the simplest and there are similar tools
> > developed already. For example, the Guano Project:
> > https://github.com/d2fn/guano
> >
> > I don't like two things about it:
> > * Poor performance even for a backup of a data store of my size.
> > * Possible data consistency issues due to concurrent access by the export
> > utility as well as other "normal" clients.
> >
> > Approach #2:
> >
> > Add another four-letter command that would force rolling up the
> > transactions and creating a snapshot. The result of this command would
> be a
> > new snapshot.XXXX file on disk and the name of the file could be reported
> > back to the client as a response to the four-letter command. This way, I
> > would know which snapshot file to grab for a possible future restore. But
> > restoring from a snapshot file is almost as involved as the error-prone
> > sequence described in the "Initial conditions" above.
> >
> > Approach #3:
> >
> > Come up with a way to temporarily add a new Zookeeper server into a live
> > ensemble that would take over (how?) the leader role and push out the
> > snapshot that it has into all ensemble members upon restore. This
> approach
> > could be difficult and error-prone to implement because it will require
> > hacking the existing election algorithm to designate a leader.
> >
> > So, which of the approaches do you think works best for an ensemble and
> for
> > the database size of about 1GB?
> >
> >
> > Any advice will be highly appreciated!
> > /Sergey
>
>

Re: Efficient backup and a reasonable restore of an ensemble

Posted by Flavio Junqueira <fp...@yahoo.com>.
One thing that is still a bit confusing to me in your use case is whether you need to take a snapshot right after some event in your application. Even if you're able to tell ZooKeeper to take a snapshot, there is no guarantee that it will happen at the exact point you want it if update operations keep coming.

If you use your four-letter word approach, then would you search for the leader or would you simply take a snapshot at any server? If it has to go through the leader so that you make sure to have the most recent committed state, then it might not be a bad idea to have an API call that tells the leader to take a snapshot at some directory of your choice. Informing you of the name of the snapshot file so that you can copy it sounds like an option, but perhaps it is not as convenient.

The approach of adding another server is not very clear. How do you force it to be the leader? Keep in mind that if it crashes, then it will lose leadership.

-Flavio 

On Jul 8, 2013, at 8:34 AM, Sergey Maslyakov <ev...@gmail.com> wrote:

> It looks like the "dev" mailing list is rather inactive. Over the past few
> days I only saw several automated emails from JIRA and this is pretty much
> it. In contrast, the "user" mailing list seems to be more alive and
> more populated.
> 
> With this in mind, please allow me to cross-post here the message I sent
> into the "dev" list a few days ago.
> 
> 
> Regards,
> /Sergey
> 
> === forwarded message begins here ===
> 
> Hi!
> 
> I'm facing the problem that has been raised by multiple people but none of
> the discussion threads seem to provide a good answer. I dug in Zookeeper
> source code trying to come up with some possible approaches and I would
> like to get your inputs on those.
> 
> Initial conditions:
> 
> * I have an ensemble of five Zookeeper servers running v3.4.5 code.
> * The size of a committed snapshot file is in the vicinity of 1GB.
> * There are about 80 clients connected to the ensemble.
> * Clients are heavily read-biased, i.e., they mostly read and rarely write. I
> would say less than 0.1% of queries modify the data.
> 
> Problem statement:
> 
> * Under certain conditions, I may need to revert the data stored in the
> ensemble to an earlier state. For example, one of the clients may ruin the
> application-level data integrity and I need to perform a disaster recovery.
> 
> Things look nice and easy if I'm dealing with a single Zookeeper server. A
> file-level copy of the data and dataLog directories should allow me to
> recover later by stopping Zookeeper, swapping the corrupted data and
> dataLog directories with a backup, and firing Zookeeper back up.
> 
> Now, the ensemble deployment and the leader election algorithm in the
> quorum make things much more difficult. In order to restore from a single
> file-level backup, I need to take the whole ensemble down, wipe out data
> and dataLog directories on all servers, replace these directories with
> backed up content on one of the servers, bring this server up first, and
> then bring up the rest of the ensemble. This [somewhat] guarantees that the
> populated Zookeeper server becomes a member of a majority and populates the
> ensemble. This approach works but it is very involved and, thus,
> error-prone due to human error.
> 
> Based on a study of Zookeeper source code, I am considering the following
> alternatives, and I seek advice from the Zookeeper development community as
> to which approach looks more promising or if there is a better way.
> 
> Approach #1:
> 
> Develop a complementary pair of utilities for export and import of the
> data. Both utilities will act as Zookeeper clients and use the existing
> API. The "export" utility will recursively retrieve data and store it in a
> file. The "import" utility will first purge all data from the ensemble and
> then reload it from the file.
> 
> This approach seems to be the simplest and there are similar tools
> developed already. For example, the Guano Project:
> https://github.com/d2fn/guano
> 
> I don't like two things about it:
> * Poor performance even for a backup of a data store of my size.
> * Possible data consistency issues due to concurrent access by the export
> utility as well as other "normal" clients.
> 
> Approach #2:
> 
> Add another four-letter command that would force rolling up the
> transactions and creating a snapshot. The result of this command would be a
> new snapshot.XXXX file on disk and the name of the file could be reported
> back to the client as a response to the four-letter command. This way, I
> would know which snapshot file to grab for a possible future restore. But
> restoring from a snapshot file is almost as involved as the error-prone
> sequence described in the "Initial conditions" above.
> 
> Approach #3:
> 
> Come up with a way to temporarily add a new Zookeeper server into a live
> ensemble that would take over (how?) the leader role and push out the
> snapshot that it has into all ensemble members upon restore. This approach
> could be difficult and error-prone to implement because it will require
> hacking the existing election algorithm to designate a leader.
> 
> So, which of the approaches do you think works best for an ensemble and for
> the database size of about 1GB?
> 
> 
> Any advice will be highly appreciated!
> /Sergey


Fwd: Efficient backup and a reasonable restore of an ensemble

Posted by Sergey Maslyakov <ev...@gmail.com>.
It looks like the "dev" mailing list is rather inactive. Over the past few
days I only saw several automated emails from JIRA and this is pretty much
it. In contrast, the "user" mailing list seems to be more alive and
more populated.

With this in mind, please allow me to cross-post here the message I sent
into the "dev" list a few days ago.


Regards,
/Sergey

=== forwarded message begins here ===

Hi!

I'm facing the problem that has been raised by multiple people but none of
the discussion threads seem to provide a good answer. I dug in Zookeeper
source code trying to come up with some possible approaches and I would
like to get your inputs on those.

Initial conditions:

* I have an ensemble of five Zookeeper servers running v3.4.5 code.
* The size of a committed snapshot file is in the vicinity of 1GB.
* There are about 80 clients connected to the ensemble.
* Clients are heavily read-biased, i.e., they mostly read and rarely write. I
would say less than 0.1% of queries modify the data.

Problem statement:

* Under certain conditions, I may need to revert the data stored in the
ensemble to an earlier state. For example, one of the clients may ruin the
application-level data integrity and I need to perform a disaster recovery.

Things look nice and easy if I'm dealing with a single Zookeeper server. A
file-level copy of the data and dataLog directories should allow me to
recover later by stopping Zookeeper, swapping the corrupted data and
dataLog directories with a backup, and firing Zookeeper back up.

Now, the ensemble deployment and the leader election algorithm in the
quorum make things much more difficult. In order to restore from a single
file-level backup, I need to take the whole ensemble down, wipe out data
and dataLog directories on all servers, replace these directories with
backed up content on one of the servers, bring this server up first, and
then bring up the rest of the ensemble. This [somewhat] guarantees that the
populated Zookeeper server becomes a member of a majority and populates the
ensemble. This approach works but it is very involved and, thus,
error-prone due to human error.

Based on a study of Zookeeper source code, I am considering the following
alternatives, and I seek advice from the Zookeeper development community as
to which approach looks more promising or if there is a better way.

Approach #1:

Develop a complementary pair of utilities for export and import of the
data. Both utilities will act as Zookeeper clients and use the existing
API. The "export" utility will recursively retrieve data and store it in a
file. The "import" utility will first purge all data from the ensemble and
then reload it from the file.

This approach seems to be the simplest and there are similar tools
developed already. For example, the Guano Project:
https://github.com/d2fn/guano

I don't like two things about it:
* Poor performance even for a backup of a data store of my size.
* Possible data consistency issues due to concurrent access by the export
utility as well as other "normal" clients.
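
To make the shape of such a tool concrete, here is a minimal, hedged sketch
of the export half against the stock Java client API. The connection string
and dump format are assumptions, and it deliberately ignores ACLs and the
consistency problem above (nodes changing mid-walk):

    import java.io.PrintWriter;
    import java.util.Base64;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    // Recursive dump: one line per znode, "path<TAB>base64(data)".
    public class ZkExport {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
            try (PrintWriter out = new PrintWriter("zk-export.txt")) {
                dump(zk, "/", out);
            } finally {
                zk.close();
            }
        }

        static void dump(ZooKeeper zk, String path, PrintWriter out)
                throws KeeperException, InterruptedException {
            byte[] data = zk.getData(path, false, null);
            out.println(path + "\t"
                    + (data == null ? "" : Base64.getEncoder().encodeToString(data)));
            for (String child : zk.getChildren(path, false)) {
                String childPath = path.equals("/") ? "/" + child : path + "/" + child;
                dump(zk, childPath, out);
            }
        }
    }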

Approach #2:

Add another four-letter command that would force rolling up the
transactions and creating a snapshot. The result of this command would be a
new snapshot.XXXX file on disk and the name of the file could be reported
back to the client as a response to the four-letter command. This way, I
would know which snapshot file to grab for a possible future restore. But
restoring from a snapshot file is almost as involved as the error-prone
sequence described in the "Initial conditions" above.
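
For what it's worth, driving such a command from a backup script would be
trivial, since four-letter words are plain text on the client port. A hedged
sketch follows, where "snap" is the proposed new word rather than an existing
ZooKeeper command:

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    // Send the hypothetical "snap" 4lw and print the server's reply,
    // which would carry the name of the snapshot file just written.
    public class SnapCommandClient {
        public static void main(String[] args) throws Exception {
            try (Socket sock = new Socket("localhost", 2181)) {
                OutputStream out = sock.getOutputStream();
                out.write("snap".getBytes(StandardCharsets.US_ASCII));
                out.flush();

                InputStream in = sock.getInputStream();
                byte[] buf = new byte[4096];
                int n = in.read(buf);
                System.out.println(n > 0
                        ? new String(buf, 0, n, StandardCharsets.US_ASCII)
                        : "(no response)");
            }
        }
    }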

Approach #3:

Come up with a way to temporarily add a new Zookeeper server into a live
ensemble that would take over (how?) the leader role and push out the
snapshot that it has into all ensemble members upon restore. This approach
could be difficult and error-prone to implement because it will require
hacking the existing election algorithm to designate a leader.

So, which of the approaches do you think works best for an ensemble and for
the database size of about 1GB?


Any advice will be highly appreciated!
/Sergey