Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/09/17 17:57:51 UTC

[Hadoop Wiki] Update of "NameNodeFailover" by SomeOtherAccount

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "NameNodeFailover" page has been changed by SomeOtherAccount.
http://wiki.apache.org/hadoop/NameNodeFailover?action=diff&rev1=5&rev2=7

--------------------------------------------------

- The name node is a critical resource for the cluster because data nodes don't know enough about the blocks that they contain to coherently answer requests for anything but the block contents.  This isn't generally a serious problem because single machines are typically fairly reliable (it is only with a large cluster that we expect daily or hourly failures).
+ === Introduction ===
+ As of 0.20, Hadoop does not support automatic recovery in the case of a NameNode failure.  This is a well-known single point of failure in Hadoop.  The good news is that as you add more nodes to a grid, the statistical likelihood that any given failure will be the NameNode's goes down.
  
- That said, there is a secondary name node that talks to the primary name node on a regular basis in order to keep track of the files in the system.  It does this by copying the fsimage and editlog files from the primary name node.
+ Experience at Yahoo! shows that NameNodes are more likely to fail due to misconfiguration, network issues, and bad behavior amongst clients than due to actual hardware problems.  Out of fifteen grids over a three-year period, only three NameNode failures were related to hardware problems.
  
- If the name node dies, the simplest procedure is to simply use DNS to rename the primary and secondary name nodes.  The secondary name node will serve as primary name node as long as nodes request meta-data from it.  Once you get your old primary back up, you should reconfigure it to be the secondary name node and you will be back in full operation.
+ === Configuring Hadoop for Failover ===
+ There are some preliminary steps that must be in place prior to performing a NameNode recovery.  The most important is the dfs.name.dir property, which configures the NameNode to write its metadata to more than one directory.  A typical configuration might look something like this:
  
- Note that the secondary name node only copies information every few minutes.  For a more up-to-date recovery, you can make the name node log transactions to multiple directories, including one networked mounted one.  You can then copy the fsimage and fsedit files from that networked directory and have a recovery that is up to the second.
+   <property>
+     <name>dfs.name.dir</name>
+     <value>/export/hadoop/namedir,/remote/export/hadoop/namedir</value>
+   </property>
  
- Questions I still have include:
+ The first directory is local and the second is an NFS-mounted directory.  The NameNode will write to both locations, keeping the HDFS metadata in sync.  Storing the metadata off-machine this way ensures there is something left to recover from if the NameNode host is lost.  During startup, the NameNode will pick the most recent of the two directories and then sync both so that they contain the same data.
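+
+ To sanity-check that the two copies really are in sync, a quick look at the contents of both directories is usually enough.  The sketch below assumes the example paths from the configuration above and the usual 0.20 layout of a current/ subdirectory holding fsimage, edits, fstime and VERSION; adjust for your own installation:
+
+   # compare the metadata files in the local and the NFS copy of dfs.name.dir
+   ls -l /export/hadoop/namedir/current /remote/export/hadoop/namedir/current
+   # fsimage only changes at checkpoint time, so the two copies should match exactly
+   md5sum /export/hadoop/namedir/current/fsimage /remote/export/hadoop/namedir/current/fsimage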
  
-  * what do you have to do to the old primary to make it be a secondary?
+ Once the NameNode is configured to write to two or more directories, we have a working backup of the metadata.  In the more common failure scenarios, this data can be used to bring the dead NameNode back from the grave.
  
-  * can you have more than one secondary name node (for off-site backup purposes)?
+ '''When a Failure Occurs'''
  
-  * are there plans for distributing the name node function?  
+ Now the recovery steps (a rough shell sketch of these steps follows the list):
  
- === Answer ===
- Secondary Namenode does not have function to be a failover mechanism.  It is a helping process to the namenode.  It is not of help if the namenode fails.  The name is possibly misleading.
+  1. Just to be safe, make a copy of the metadata on the remote NFS mount before doing anything else.
+  1. Pick a target machine on the same network.
+  1. Change the IP address of that machine to match the NameNode's IP address.  Using an interface alias to move the address works as well.  If this is not an option, be prepared to restart the entire grid to avoid hitting https://issues.apache.org/jira/browse/HADOOP-3988 .
+  1. Install Hadoop just as it was installed on the original NameNode.
+  1. Do '''not''' format this node!
+  1. Mount the remote NFS directory in the same location.
+  1. Startup the NameNode.
+  1. The NameNode should start up, replay the edits file and update the image, and block reports should start coming in from the DataNodes.
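+
+ Put together, the steps above might look roughly like the following from a shell on the replacement machine.  This is only a sketch: the paths come from the example configuration earlier on this page, while the interface name, IP address and NFS server below are placeholders that must be replaced with values for your own environment:
+
+   # step 1: preserve a copy of the metadata on the NFS mount before touching anything
+   cp -rp /remote/export/hadoop/namedir /remote/export/hadoop/namedir.backup
+
+   # steps 2-3: bring the dead NameNode's IP address up on this machine
+   #            (eth0:0 and 10.1.2.3 are placeholders)
+   ifconfig eth0:0 10.1.2.3 netmask 255.255.255.0 up
+
+   # step 6: mount the shared metadata directory at the same location as before
+   #         (nfsserver is a placeholder for your NFS server)
+   mount nfsserver:/export/hadoop/namedir /remote/export/hadoop/namedir
+
+   # steps 5 and 7: do NOT format; just start the NameNode daemon and let it replay the edits
+   bin/hadoop-daemon.sh start namenode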
  
+ At this point, your NameNode should be up.
- In order to provide redundancy for data protection in case of namenode failure the best way is to store the namenode metadata on a different machine.  Hadoop has an option to have multiple namenode directories and the recommended option is to have one of the namenode directories on an NFS share.  However you have to make sure the NFS locking will not cause problems and it is NOT recommended to change this on a live system because it can corrupt namenode data.  Another option is to simply copy namenode metadata to another machine.
- --Ankur Sethi
  
- '''Question'''
+ '''Other Ideas'''
  
+ There are some other ideas to help with NameNode recovery:
- Why not keep the fsimage and editlog in the DFS (somehow that they could be located by data nodes without the name node)?
- Then when then name node fails, by an election mechanism, a data node becomes the new name node. 
- --Cosmin Lehene
-  
  
+  1. Keep in mind that the SecondaryNameNode and/or the CheckpointNode also keeps an older copy of the NameNode metadata.  If you haven't done the preliminary work above, you might still be able to recover using the data on those systems (see the sketch after this list).  Just note that it will only be as fresh as the last checkpoint, so you will likely experience some data loss.
+  1. Instead of using NFS on Linux, it may be worthwhile looking into DRBD.  A few sites are using this with great success.
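+
+ For the SecondaryNameNode / CheckpointNode route, one possible approach on a 0.20-style installation (check the documentation for your release) is to start a freshly installed, '''unformatted''' NameNode with the -importCheckpoint option.  This loads the most recent checkpoint from the directory named by fs.checkpoint.dir and saves it into dfs.name.dir, provided dfs.name.dir does not already contain an image:
+
+   # dfs.name.dir must be empty; fs.checkpoint.dir must point at the secondary's checkpoint data
+   bin/hadoop namenode -importCheckpoint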
+