You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by Milind Bhandarkar <mi...@yahoo-inc.com> on 2006/09/21 00:07:13 UTC
Proposal for replicating namenode state and transaction logs
Please comment on following proposal.
Proposal for Replication of DFS Namespace Images and Transaction Logs
Currently, when the namenode starts, it reads the namespace image
from dfs.name.dir and from that, initializes the namespace data
structures. If the transaction log exists, it merges the transaction
logs with the in-memory namespace, and writes out the merged
namespace image. It then reinitializes the transaction log file.
As namespace modifications occur, these modifications are logged into
the transaction log file. The transaction log file is never flushed,
and is closed only if the namenode shuts down normally. In case of
error or forced shutdown of namenode, the last few buffered (but not
yet written to the disk) transactions may get lost. In addition if
the namespace image file is corrupted or is accidentally deleted, or
if the disk holding that image file crashes, there is no way to
recover the state of DFS.
This proposal suggests a design for replication the DFS namespace
image as well as transaction logs, so that even in case of a
catastrophic failure, the DFS state can almost always be recovered.
We suggest a two-pronged approach. First, allow multiple copies of
image and transaction log on different volumes of namenode. Secondly,
have backup "read-only" namenodes, that would allow continuous
functioning of DFS even in case of namenode failure.
We propose that dfs.name.dir configuration parameter be allowed to
have a comma-separated list of different locations within the
namenode where DFS image and logs would be replicated. This allows
for a disk failure to not hinder recoverability of DFS state. Each
time the image file is updated, as well as each time a transaction is
logged, it is written to all the locations specified in dfs.name.dir.
The list of locations in dfs.name.dir could include all local disks
of the namenode as well as NFS-mounted drives, thus providing a
remote backup of DFS state. If the NFS-mounted drive is RAIDed, this
itself provides the reliability required.
Currently, the transaction log file is always kept open in write-
mode. Thus in case of the namenode failure, or forcibly shuttting
down namenode can cause the last few transactions that have been
buffered in memory to get lost. The number of transactions lost will
depend on the buffer-size. We propose that the DFS administrator
control this parameter. Configuration will include a parameter
"dfs.namenode.edits.buffer" to specify number of transactions upon
which the transaction log will be closed (thus flushing all the
buffered transactions to disk), and reopened in append-mode.
In order to determine which image and log files are the snapshot of
the latest state, these files should indicate a positive 4-byte
"generation number". This can be achieved without even having to
modify the image and transaction log file format. The filename can
contain the generation number. Each time the namenode restarts, the
generation number of both the image file as well as transaction log
is incremented to reflect this. Upon startup, the namenode scans all
the locations in dfs.name.dir to determine which location contains
the latest image and corresponding logs according to the generation
number, and loads the latest image and log (from possibly different
locations). If in case the sizes of the transaction logs with the
same name do not match, one with the larger size is chosen.
Second proposal (which can be in addition to the first multiple-
volumes proposal) suggests having multiple backup namenodes. These
backup name nodes are started on different machines with an
additional command-line parameter "-backup" to the namenode.
The backup namenode functions in approximately the same way as the
namenode in safe mode (i.e. read-only), except that upon startup, it
connects to the main namenode specified in "fs.default.name",
supplies the current generation of its image and transaction log and
asks for the latest FSimage and transaction log, stores them on the
disk locations in "dfs.name.dir", and accordingly also modifies its
internal namesystem data structures. The backup name nodes do not
listen to blockreports or heartbeats from datanodes. Their sole task
is to keep a backup of DFS state. When the main namenode fails, any
of these backup namenodes can be restarted by DFS administrator in
normal mode, and DFS can continue functioning.
Later, the backup namenode can also be allowed to entertain read-only
requests from DFS clients, thus making DFS more performant and scalable.