Posted to user@zookeeper.apache.org by amit jaiswal <am...@yahoo.com> on 2010/10/22 10:12:53 UTC

Recovery from a complete node failure in BookKeeper

Hi,

There is a class, BookKeeperTools, that has methods for complete recovery of a
node. Recovering a dead bookie involves first updating zk with the replacement
bookie and then replicating the necessary ledger entries. So, if the recovery
process or the target bookie dies before the entries have actually been copied,
the data can be left in an inconsistent state.
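
To make the window concrete, here is a minimal, self-contained Java sketch of
the ordering described above. It does not use the BookKeeper or ZooKeeper APIs:
the class RecoveryOrderSketch and all of its methods are hypothetical stand-ins,
and it assumes (for simplicity) that every bookie in the ensemble stores every
entry. It only shows that if the metadata is switched first and the copy is
interrupted, the metadata names a bookie that does not yet hold the data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model, not BookKeeper code: "ensemble" plays the role of the
// ledger metadata kept in zk, "stored" the entries each bookie really holds.
public class RecoveryOrderSketch {
    private final List<String> ensemble = new ArrayList<>();
    private final Map<String, Set<Long>> stored = new HashMap<>();
    private final long entryCount;

    public RecoveryOrderSketch(List<String> bookies, long entryCount) {
        this.entryCount = entryCount;
        for (String bookie : bookies) {
            ensemble.add(bookie);
            Set<Long> entries = new HashSet<>();
            for (long id = 0; id < entryCount; id++) entries.add(id);
            stored.put(bookie, entries);
        }
    }

    // Step 1: point the metadata at the replacement. Step 2: copy entries.
    // crashAfter >= 0 simulates the recovery dying after that many copies.
    public void recoverMetadataFirst(String dead, String replacement, long crashAfter) {
        int slot = ensemble.indexOf(dead);
        if (slot < 0) throw new IllegalArgumentException("unknown bookie: " + dead);
        ensemble.set(slot, replacement);          // metadata updated first
        stored.remove(dead);                      // the dead bookie's data is gone
        stored.put(replacement, new HashSet<>()); // replacement starts empty
        copyMissing(replacement, crashAfter);
    }

    // Copies entries the target lacks; adding the id stands in for reading
    // the entry from a surviving replica. A negative crashAfter never crashes.
    public void copyMissing(String target, long crashAfter) {
        Set<Long> targetEntries = stored.get(target);
        long copied = 0;
        for (long id = 0; id < entryCount; id++) {
            if (targetEntries.contains(id)) continue;
            if (crashAfter >= 0 && copied == crashAfter) return; // simulated crash
            targetEntries.add(id);
            copied++;
        }
    }

    public long missingEntries(String bookie) {
        return entryCount - stored.getOrDefault(bookie, Collections.emptySet()).size();
    }

    // True only if every bookie named in the metadata holds every entry.
    public boolean isConsistent() {
        for (String bookie : ensemble) {
            if (missingEntries(bookie) != 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        RecoveryOrderSketch ledger =
            new RecoveryOrderSketch(Arrays.asList("b1", "b2", "b3"), 10);
        ledger.recoverMetadataFirst("b2", "b4", 4); // dies part-way through the copy
        System.out.println("consistent after crash: " + ledger.isConsistent());
        System.out.println("entries b4 still lacks: " + ledger.missingEntries("b4"));
        ledger.copyMissing("b4", -1);               // re-running the copy closes the gap
        System.out.println("consistent after retry: " + ledger.isConsistent());
    }
}
```

In this toy model the gap can be closed by re-running the copy, but only
because the sketch can tell which entries are missing; whether the real tool
can resume in the same way is exactly what I am asking about.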

Copying the data can take time, which widens the window during which a node can
potentially fail. Is this an issue that needs to be addressed?

Also, this tool has to be triggered manually to recover a node. Are there any
plans for automatic node recovery (similar to Hadoop HDFS), in which, if a
machine goes down, some background process re-replicates the data to maintain
the replication factor (quorum)?

-regards
Amit