You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@zookeeper.apache.org by iv...@apache.org on 2012/12/17 15:42:24 UTC

svn commit: r1422952 - in /zookeeper/bookkeeper/trunk: CHANGES.txt doc/bookieRecovery.textile doc/bookkeeperConfig.textile

Author: ivank
Date: Mon Dec 17 14:42:23 2012
New Revision: 1422952

URL: http://svn.apache.org/viewvc?rev=1422952&view=rev
Log:
BOOKKEEPER-375: Document about Auto replication service in BK (umamahesh via ivank)

Modified:
    zookeeper/bookkeeper/trunk/CHANGES.txt
    zookeeper/bookkeeper/trunk/doc/bookieRecovery.textile
    zookeeper/bookkeeper/trunk/doc/bookkeeperConfig.textile

Modified: zookeeper/bookkeeper/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/zookeeper/bookkeeper/trunk/CHANGES.txt?rev=1422952&r1=1422951&r2=1422952&view=diff
==============================================================================
--- zookeeper/bookkeeper/trunk/CHANGES.txt (original)
+++ zookeeper/bookkeeper/trunk/CHANGES.txt Mon Dec 17 14:42:23 2012
@@ -258,6 +258,8 @@ Trunk (unreleased changes)
 
         BOOKKEEPER-511: BookieShell is very noisy (ivank via sijie)
 
+        BOOKKEEPER-375: Document about Auto replication service in BK (umamahesh via ivank)
+
       hedwig-server:
 
         BOOKKEEPER-250: Need a ledger manager like interface to manage metadata operations in Hedwig (sijie via ivank)

Modified: zookeeper/bookkeeper/trunk/doc/bookieRecovery.textile
URL: http://svn.apache.org/viewvc/zookeeper/bookkeeper/trunk/doc/bookieRecovery.textile?rev=1422952&r1=1422951&r2=1422952&view=diff
==============================================================================
--- zookeeper/bookkeeper/trunk/doc/bookieRecovery.textile (original)
+++ zookeeper/bookkeeper/trunk/doc/bookieRecovery.textile Mon Dec 17 14:42:23 2012
@@ -15,27 +15,65 @@ Notice:    Licensed to the Apache Softwa
            KIND, either express or implied.  See the License for the
            specific language governing permissions and limitations
            under the License.
+h1. Bookie Ledger Recovery
 
-h1. Bookie Recovery
+p. When a Bookie crashes, any ledgers with data on that Bookie become underreplicated. There are two options for bringing the ledgers back to full replication, Autorecovery and Manual Bookie Recovery.
 
-p. When a bookie crashes, any ledgers with entries on the bookie potentially become under-replicated. For this reason, we provide a recovery tool which will ensure that all ledgers which had entries on the bookie are fully replicated. At the moment, this is not an automatic process. The administrator must run this tool manually when he sees that the bookie has died. 
+h2. Autorecovery
+
+p. Autorecovery runs as a daemon alongside the Bookie daemon on each Bookie. Autorecovery detects when a bookie in the cluster has become unavailable, and rereplicates all the ledgers which were on that bookie, so that those ledgers are brough back to full replication. See the "Admin Guide":./bookkeeperConfig.html for instructions on how to start autorecovery.
+
+h2. Manual Bookie Recovery
+
+p. If autorecovery is not enabled, it is possible for the adminisatrator to manually rereplicate the data from the failed bookie.
 
 To run recovery, with zk1.example.com as the zookeeper ensemble, and 192.168.1.10 as the failed bookie, do the following:
 
 @bookkeeper-server/bin/bookkeeper org.apache.bookkeeper.tools.BookKeeperTools zk1.example.com:2181 192.168.1.10:3181@
 
-It is necessary to specify the host and port portion of failed bookie, as this is how it identifies itself to zookeeper. It is possible to specify a third argument, which is the bookie to replicate to. If this is omitted, as in our example, a random bookie is chosen for each ledger fragment. A ledger fragment is a continuous sequence of entries in a bookie, which share the same ensemble. 
+It is necessary to specify the host and port portion of failed bookie, as this is how it identifies itself to zookeeper. It is possible to specify a third argument, which is the bookie to replicate to. If this is omitted, as in our example, a random bookie is chosen for each ledger segment. A ledger segment is a continuous sequence of entries in a bookie, which share the same ensemble.
+
+h2. AutoRecovery Internals
+
+Auto-Recovery has two components:
+
+* *Auditor*, a singleton node which watches for bookie failure, and creates rereplication tasks for the ledgers on failed bookies.
+* *ReplicationWorker*, runs on each Bookie, takes rereplication tasks and executes them.
+
+Both the components run as threads in the the *AutoRecoveryMain* process. The *AutoRecoveryMain* process runs on each Bookie in the cluster. All recovery nodes will participate in leader election to decide which node becomes the auditor. Those which fail to become the auditor will watch the elected auditor, and will run election again if they see that it has failed.
+
+h3. Auditor
+
+The auditor watches the the list of bookies registered with ZooKeeper in the cluster. A Bookie registers with ZooKeeper during startup. If the bookie crashes or is killed, the bookie's registration disappears. The auditor is notified of changes in the registered bookies list.
+
+When the auditor sees that a bookie has disappeared from the list, it immediately scans the complete ledger list to find ledgers which have stored data on the failed bookie. Once it has a list of ledgers which need to be rereplicated, it will publish a rereplication task for each ledger under the /underreplicated/ znode in ZooKeeeper.
+
+h3. ReplicationWorker
+
+Each replication worker watches for tasks being published in the /underreplicated/ znode. When a new task appears, it will try to get a lock on it. If it cannot acquire the lock, it tries the next entry. The locks are implemented using ZooKeeper ephemeral znodes.
+
+The replication worker will scan through the rereplication task's ledger for segments of which its local bookie is not a member. When it finds segments matching this criteria it will replicate the entries of that segment to the local bookie.  If, after this process, the ledger is fully replicated, the ledgers entry under /underreplicated/ is deleted, and the lock is released. If there is a problem replicating, or there are still segments in the ledger which are still underreplicated (due to the local bookie already being part of the ensemble for the segment), then the lock is simply released.
+
+If the replication worker finds a segment which needs rereplication, but does not have a defined endpoint (i.e. the final segment of a ledger currently being written to), it will wait for a grace period before attempting rereplication. If the segment needing rereplciation still does not have a defined endpoint, the ledger is fenced and rereplication then takes place. This avoids the case where a client is writing to a ledger, and one of the bookies goes down, but the client has not written an entry to that bookie before rereplication takes place. The client could continue writing to the old segment, even though the ensemble for the segment had changed. This could lead to data loss. Fencing prevents this scenario from happening. In the normal case, the client will try to write to the failed bookie within the grace period, and will have started a new segment before rereplication starts. See the "Admin Guide":./bookkeeperConfig.html for how to configure this grace period.
+
+h2. The Rereplication process
+
+The ledger rereplication process is as follows.
+
+# The client goes through all ledger segments in the ledger, selecting those which contain the failed bookie;
+# A recovery process is initiated for each ledger segment in this list;
+## The client selects a bookie to which all entries in the ledger segment will be replicated; In the case of autorecovery, this will always be the local bookie;
+## the client reads entries that belong to the ledger segment from other bookies in the ensemble and writes them to the selected bookie;
+## Once all entries have been replicated, the zookeeper metadata for the segment is updated to reflect the new ensemble;
+## The segment is marked as fully replicated in the recovery tool;
+# Once all ledger segments are marked as fully replicated, the ledger is marked as fully replicated.
+
+h2. The Manual Bookie Recovery process
 
-The recovery process is as follows.
+The manual bookie recovery process is as follows.
 
 # The client reads the metadata of active ledgers from zookeeper;
-# From this, the ledgers which contain fragments using the failed bookie in their ensemble are selected;
+# From this, the ledgers which contain segments using the failed bookie in their ensemble are selected;
 # A recovery process is initiated for each ledger in this list;
-## The client goes through all ledger fragments in the ledger, selecting those which contain the failed bookie;
-## A recovery process is initiated for each ledger fragment in this list;
-### The client selects a bookie to which all entries in the ledger fragment will be replicated;
-### the client reads entries that belong to the ledger fragment from other bookies in the ensemble and writes them to the selected bookie;
-### Once all entries have been replicated, the zookeeper metadata for the fragment is updated to reflect the new ensemble;
-### The fragment is marked as fully replicated in the recovery tool;
-## Once all ledger fragments are marked as fully replicated, the ledger is marked as fully replicated;
-# Once all ledgers are marked as fully replicated, bookie recovery is finished.
\ No newline at end of file
+## The Ledger rereplication process is run for each ledger;
+# Once all ledgers are marked as fully replicated, bookie recovery is finished.

Modified: zookeeper/bookkeeper/trunk/doc/bookkeeperConfig.textile
URL: http://svn.apache.org/viewvc/zookeeper/bookkeeper/trunk/doc/bookkeeperConfig.textile?rev=1422952&r1=1422951&r2=1422952&view=diff
==============================================================================
--- zookeeper/bookkeeper/trunk/doc/bookkeeperConfig.textile (original)
+++ zookeeper/bookkeeper/trunk/doc/bookkeeperConfig.textile Mon Dec 17 14:42:23 2012
@@ -115,6 +115,31 @@ The mechanism to prevent the bookie from
 # A strict subset of the ledger devices (among multiple ledger devices) has been replaced, consequently making the content of the replaced devices unavailable;
 # A strict subset of the ledger directories has been accidentally deleted.
 
+h2. Running Autorecovery nodes
+
+To run autorecovery nodes, we execute the following command in every Bookie node:
+ @bookkeeper-server/bin/bookkeeper autorecovery@
+
+Configuration parameters for autorecovery can be set in *bookkeeper-server/conf/bk_server.conf*.
+
+Important parameters are:
+
+* @rereplicationEntryBatchSize@ specifies the number of entries which a replication will rereplicate in parallel. The default value is 10. A larger value for this parameter will increase the speed at which autorecovery occurs but will increate the memory requirement of the autorecovery process, and create more load on the cluster.
+
+* @openLedgerRereplicationGracePeriod@, is the amount of time, in milliseconds, which a recovery worker will wait before recovering a ledger segment which has no defined ended, i.e. the client is still writing to that segment. If the client is still active, it should detect the bookie failure, and start writing to a new ledger segment, and a new ensemble, which doesn't include the failed bookie. Creating new ledger segment will define the end of the previous segment. If, after the grace period, the ledger segment's end has not been defined, we assume the writing client has crashed. The ledger is fenced and the client is blocked from writing any more entries to the ledger. The default value is 30000ms.
+
+h3. Disabling Autorecovery during maintenance
+
+It is useful to disable autorecovery during maintenance, for example, to avoid a Bookie's data being unnecessarily rereplicated when it is only being taken down for a short period to update the software, or change the configuration.
+
+To disable autorecovery, run:
+@bookkeeper-server/bin/bookkeeper shell autorecovery -disable@
+
+To reenable, run:
+@bookkeeper-server/bin/bookkeeper shell autorecovery -enable@
+
+Autorecovery enable/disable only needs to be run once for the whole cluster, and not individually on each Bookie in the cluster.
+
 h2. Setting up a test ensemble
 
 Sometimes it is useful to run a ensemble of bookies on your local machine for testing. We provide a utility for doing this. It will set up N bookies, and a zookeeper instance locally. The data on these bookies and of the zookeeper instance are not persisted over restarts, so obviously this should never be used in a production environment. To run a test ensemble of 10 bookies, do the following: