Posted to mapreduce-user@hadoop.apache.org by Alexandru Pacurar <Al...@PropertyShark.com> on 2014/12/05 10:24:41 UTC

Question about namenode HA

Hello,

I'm trying to configure HA for the HDFS namenode with QJM, following the instructions from here: http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html.

My setup is as follows: Ubuntu 12.04.5 LTS on all the nodes, Hadoop 2.4.1 installed, two namenodes (QJM processes also run on these), and one machine for a third QJM.

Initially we didn't have HA, so this is a migration from a non-HA cluster to an HA-enabled one.

For the migration I:

* added all the necessary configuration specified in the link above

* stopped the non-HA cluster

* started the three QJMs

* started my first namenode (the one that was the only namenode in the non-HA setup) with the new configs

* on my second namenode, ran hdfs namenode -bootstrapStandby, which copied the fsimage and went OK

* also on my second namenode, ran hdfs namenode -initializeSharedEdits, which initialized all three of my QJMs

* then started the second namenode

After this I started to have problems. Both namenodes were in standby and logged the following WARN:
"2014-12-04 13:35:56,074 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category JOURNAL is not supported in state standby"

After half an hour of this, I thought I could just transition one of them to active, since the warning suggested that this should solve the problem. So I ran hdfs haadmin -transitionToActive node1, but this gave me the following fatal error, which I haven't been able to figure out:

2014-12-04 14:16:55,835 FATAL org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error encountered while tailing edits. Shutting down standby NN.
java.io.IOException: There appears to be a gap in the edit log. We expected txid 1, but got txid 1542903

Now if I try to restart the second namenode, it gives me the same error, and if I try to restart my other node, which is still running, I get the same error.

The thing is that before configuring HA, my dfs.data.dir contained only one edits file, edits_inprogress_0000000000000000001, so it should start at txid 1. After I initialized the shared edits, it jumped to edits_0000000000001542903-0000000000001542904.
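
To make the gap concrete, here is a minimal Python sketch that reads the transaction ID ranges out of edit log file names like the two quoted above and reports where they are not contiguous. It only assumes the naming pattern visible in those names (edits_<first>-<last> for finalized segments, edits_inprogress_<first> for the open one). The function names parse_segments and find_gaps are made up for illustration, and this is not how the namenode itself performs the check.

```python
import re

# Naming pattern taken from the file names quoted in this post (an assumption,
# not a reference to Hadoop internals):
#   finalized segment: edits_<first_txid>-<last_txid>
#   open segment:      edits_inprogress_<first_txid>
FINALIZED = re.compile(r"^edits_(\d+)-(\d+)$")
IN_PROGRESS = re.compile(r"^edits_inprogress_(\d+)$")


def parse_segments(names):
    """Return (first_txid, last_txid or None) tuples sorted by first_txid."""
    segments = []
    for name in names:
        m = FINALIZED.match(name)
        if m:
            segments.append((int(m.group(1)), int(m.group(2))))
            continue
        m = IN_PROGRESS.match(name)
        if m:
            # The last txid of an in-progress segment is not in the file name.
            segments.append((int(m.group(1)), None))
    return sorted(segments, key=lambda seg: seg[0])


def find_gaps(segments, expected_start=1):
    """Return (expected_txid, found_txid) pairs where segments are not contiguous."""
    gaps = []
    expected = expected_start
    for first, last in segments:
        if expected is not None and first != expected:
            gaps.append((expected, first))
        # After an in-progress segment the next expected txid is unknown.
        expected = None if last is None else last + 1
    return gaps


# The only segment I see after initializing the shared edits:
shared = ["edits_0000000000001542903-0000000000001542904"]
print(find_gaps(parse_segments(shared)))  # [(1, 1542903)]
```

The reported pair matches the "expected txid 1, but got txid 1542903" message in the fatal error above.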

Could anyone shed some light on this issue for me?

Thank you,
Alex