You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by Milind Bhandarkar <mi...@yahoo-inc.com> on 2006/09/21 00:07:13 UTC

Proposal for replicating namenode state and transaction logs

Please comment on following proposal.

Proposal for Replication of DFS Namespace Images and Transaction Logs

Currently, when the namenode starts, it reads the namespace image  
from dfs.name.dir and from that, initializes the namespace data  
structures. If the transaction log exists, it merges the transaction  
logs with the in-memory namespace, and writes out the merged  
namespace image. It then reinitializes the transaction log file.

As namespace modifications occur, these modifications are logged into  
the transaction log file. The transaction log file is never flushed,  
and is closed only if the namenode shuts down normally. In case of  
error or forced shutdown of namenode, the last few buffered (but not  
yet written to the disk) transactions may get lost. In addition if  
the namespace image file is corrupted or is accidentally deleted, or  
if the disk holding that image file crashes, there is no way to  
recover the state of DFS.

This proposal suggests a design for replication the DFS namespace  
image as well as transaction logs, so that even in case of a  
catastrophic failure, the DFS state can almost always be recovered.

We suggest a two-pronged approach. First, allow multiple copies of  
image and transaction log on different volumes of namenode. Secondly,  
have backup "read-only" namenodes, that would allow continuous  
functioning of DFS even in case of namenode failure.

We propose that dfs.name.dir configuration parameter be allowed to  
have a comma-separated list of different locations within the  
namenode where DFS image and logs would be replicated. This allows  
for a disk failure to not hinder recoverability of DFS state. Each  
time the image file is updated, as well as each time a transaction is  
logged, it is written to all the locations specified in dfs.name.dir.  
The list of locations in dfs.name.dir could include all local disks  
of the namenode as well as NFS-mounted drives, thus providing a  
remote backup of DFS state. If the NFS-mounted drive is RAIDed, this  
itself provides the reliability required.

Currently, the transaction log file is always kept open in write- 
mode. Thus in case of the namenode failure, or forcibly shuttting  
down namenode can cause the last few transactions that have been  
buffered in memory to get lost. The number of transactions lost will  
depend on the buffer-size. We propose that the DFS administrator  
control this parameter. Configuration will include a parameter  
"dfs.namenode.edits.buffer" to specify number of transactions upon  
which the transaction log will be closed (thus flushing all the  
buffered transactions to disk), and reopened in append-mode.

In order to determine which image and log files are the snapshot of  
the latest state, these files should indicate a positive 4-byte  
"generation number". This can be achieved without even having to  
modify the image and transaction log file format. The filename can  
contain the generation number. Each time the namenode restarts, the  
generation number of both the image file as well as transaction log  
is incremented to reflect this. Upon startup, the namenode scans all  
the locations in dfs.name.dir to determine which location contains  
the latest image and corresponding logs according to the generation  
number, and loads the latest image and log (from possibly different  
locations). If in case the sizes of the transaction logs with the  
same name do not match, one with the larger size is chosen.

Second proposal (which can be in addition to the first multiple- 
volumes proposal) suggests having multiple backup namenodes. These  
backup name nodes are started on different machines with an  
additional command-line parameter "-backup" to the namenode.

The backup namenode functions in approximately the same way as the  
namenode in safe mode (i.e. read-only), except that upon startup, it  
connects to the main namenode specified in "fs.default.name",  
supplies the current generation of its image and transaction log and  
asks for the latest FSimage and transaction log, stores them on the  
disk locations in "dfs.name.dir", and accordingly also modifies its  
internal namesystem data structures. The backup name nodes do not  
listen to blockreports or heartbeats from datanodes. Their sole task  
is to keep a backup of DFS state. When the main namenode fails, any  
of these backup namenodes can be restarted by DFS administrator in  
normal mode, and DFS can continue functioning.

Later, the backup namenode can also be allowed to entertain read-only  
requests from DFS clients, thus making DFS more performant and scalable.