Posted to common-user@hadoop.apache.org by phil young <ph...@gmail.com> on 2010/10/26 01:12:53 UTC

Namenode corruption: need help quickly please

Wow. I could use help quickly...

My name node is reporting a null BV. All the data nodes report the same
Build Version.
We were not upgrading the DFS, but we did stop and restart after adding a jar
to $HADOOP_HOME/lib.
So, we think we understand the cause.
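For reference, the build version each daemon reports is the Subversion
revision printed by "hadoop version", so a quick way to compare nodes,
assuming a stock 0.20.2 layout, is:

   $HADOOP_HOME/bin/hadoop version   # run on the namenode and on a datanode,
                                     # then compare the "-r <revision>" lines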

Web searching shows a number of people have had this issue, but I don't see
a response to the plea here for advice on repairing the problem:
http://old.nabble.com/namenode-failure-td20199395.html


-- Here's a log from a DataNode
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = hdp12n.tripadvisor.com/192.168.33.231
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
2010-10-25 18:23:07,081 FATAL
org.apache.hadoop.hdfs.server.datanode.DataNode: Incompatible build
versions: namenode BV = ; datanode BV = 911707
2010-10-25 18:23:07,186 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException:
Incompatible build versions: namenode BV = ; datanode BV = 911707
at
org.apache.hadoop.hdfs.server.datanode.DataNode.handshake(DataNode.java:436)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:275)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)

2010-10-25 18:23:07,187 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at
hdp12n.tripadvisor.com/192.168.33.231

-- This is from the NameNode

2010-10-25 18:38:58,760 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
ugi=root,root,bin,daemon,sys,adm,disk,wheel ip=/192.168.33.230
cmd=listStatus src=/disk1/hadoop-root/mapred/system dst=null perm=null
2010-10-25 18:38:58,764 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 9 on 54310, call delete(/disk1/hadoop-root/mapred/system, true) from
192.168.33.230:44574: error:
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete
/disk1/hadoop-root/mapred/system. Name node is in safe mode.
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990.
Safe mode will be turned off automatically.
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete
/disk1/hadoop-root/mapred/system. Name node is in safe mode.
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990.
Safe mode will be turned off automatically.
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:1700)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:1680)
at org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:517)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
2010-10-25 18:39:08,770 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
ugi=root,root,bin,daemon,sys,adm,disk,wheel ip=/192.168.33.230
cmd=listStatus src=/disk1/hadoop-root/mapred/system dst=null perm=null
2010-10-25 18:39:08,774 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 54310, call delete(/disk1/hadoop-root/mapred/system, true) from
192.168.33.230:44574: error:
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete
/disk1/hadoop-root/mapred/system. Name node is in safe mode.
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990.
Safe mode will be turned off automatically.
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete
/disk1/hadoop-root/mapred/system. Name node is in safe mode.
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990.
Safe mode will be turned off automatically.
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:1700)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:1680)
at org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:517)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
2010-10-25 18:39:09,504 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at
hdp11an.tripadvisor.com/192.168.33.228
************************************************************/

[root@hdp11an current]# ls -l /disk1/hadoop-root/
total 4
drwxr-xr-x 4 root root 4096 Oct 22 10:11 hadoop-unjar1448904914513586870
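Note that the SafeModeExceptions above are a side effect rather than a second
problem: every datanode fails the build-version handshake, so none register,
no blocks get reported (hence the 0.0000 ratio), and the namenode never
leaves safe mode. Its state can be inspected with the standard dfsadmin
commands:

   bin/hadoop dfsadmin -safemode get     # report whether safe mode is on
   bin/hadoop dfsadmin -safemode leave   # force it off (only sensible once
                                         # the datanodes can register again)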

Re: Namenode corruption: need help quickly please

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Oct 25, 2010, at 5:30 PM, phil young wrote:
> We and others have seen the following error. Apparently it occurs when
> there's some change resulting in a difference in the "build" versions. This
> is not DFS corruption, but it may appear to be, because the master and task
> tracker processes start fine while the DataNodes report the following error:

	It actually *can* lead to a "sort of" DFS corruption if the NN and DN versions are too far apart and a block conversion was required, e.g., 0.21 vs. 0.20. In that scenario you can only roll forward, not back. If you really wanted the older version, you are a bit SOL, as there is no rolling back without a snapshot taken beforehand.
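For reference, the snapshot in question is the one taken by the HDFS upgrade
path; a sketch of the usual 0.20-era sequence with the stock scripts:

   bin/start-dfs.sh -upgrade             # start the new version; the namenode
                                         # snapshots the old storage first
   bin/start-dfs.sh -rollback            # after stopping DFS: revert to the
                                         # pre-upgrade snapshot if needed
   bin/hadoop dfsadmin -finalizeUpgrade  # discard the snapshot once satisfied;
                                         # after this there is no rolling back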



Re: Namenode corruption: need help quickly please

Posted by phil young <ph...@gmail.com>.
In the interests of helping others, here's some details on what happened to
us and how we recovered....


Incompatible Build Versions (between the NameNode and the DataNodes)


We and others have seen the following error. Apparently it occurs when
there's some change resulting in a difference in the "build" versions. This
is not DFS corruption, but it may appear to be, because the master and task
tracker processes start fine while the DataNodes report the following error:

2010-10-25 18:35:38,470 FATAL
org.apache.hadoop.hdfs.server.datanode.DataNode: Incompatible build
versions: namenode BV = ; datanode BV = xxxxx

This was caused by running "ant package" on the master: the rebuilt jars
carry a different build revision than the stock 0.20.2 jars on the slaves
(here an empty one, presumably because the rebuilt tree had no svn metadata
for the build to record), so the handshake fails.

To recover, we restored /hadoop on the master using the following steps (a
condensed command listing follows the steps):

   1. Stop the cluster (somewhat violently)
      1. Normal shutdown
         1. stop-all.sh
      2. Find and kill lingering processes
         1. mon_jps  # an alias in ~/.bash_profile that runs jps on all slaves
         2. kill -9 each running Java process
      3. Remove pid files
         1. ls -ltr /tmp/*pid
         2. rm -f /tmp/*pid  # on each slave
   2. Restore /hadoop on the master from a slave
      1. cd /usr/local/hadoop
      2. mv hadoop-0.20.2 hadoop-0.20.2.MOVED
      3. Restore hadoop-0.20.2 from a tarball generated on a slave
   3. Restore the original "conf" folder for the master (since it's not the
      same as on the slaves)
      1. cd hadoop-0.20.2
      2. mv ./conf ./conf.MOVED
      3. cp -r ../hadoop-0.20.2.MOVED/conf ./
   4. Start the cluster
      1. start-all.sh
      2. test_hadoop  # an alias in ~/.bash_profile that runs a test
         map-reduce job
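The same steps as a condensed command listing (mon_jps and test_hadoop are
our local aliases, and the tarball path below is only an example):

   stop-all.sh                               # 1. normal shutdown
   mon_jps                                   #    list surviving JVMs on the slaves
   kill -9 <pid>                             #    for each lingering Java process
   rm -f /tmp/*pid                           #    on each slave
   cd /usr/local/hadoop                      # 2. restore /hadoop on the master
   mv hadoop-0.20.2 hadoop-0.20.2.MOVED
   tar xzf /tmp/hadoop-0.20.2-slave.tar.gz   #    tarball generated on a slave (example path)
   cd hadoop-0.20.2                          # 3. put the master's own conf back
   mv ./conf ./conf.MOVED
   cp -r ../hadoop-0.20.2.MOVED/conf ./
   start-all.sh                              # 4. start the cluster
   test_hadoop                               #    run a test map-reduce job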

On Mon, Oct 25, 2010 at 8:00 PM, Brian Bockelman <bb...@cse.unl.edu> wrote:

>
> On Oct 25, 2010, at 6:35 PM, phil young wrote:
>
> > I had also assumed that some other jar or configuration file had been
> > changed, but reviewing the timestamps on the files did not reveal the
> > problem.
> > On the assumption that something I was not seeing did in fact change,
> > I renamed my $HADOOP_HOME directory and replaced it with one from a slave.
> > I then restored $HADOOP_HOME/conf from the original/renamed directory,
> > and voila - we're back in business.
> >
>
> Glad to hear this.
>
> > Brian, thanks very much for your help.
> > It took literally more time for me to write the original email (5 minutes)
> > than to get a reply which indicated a way to solve the problem, and
> > another 5 minutes to solve it.
> > That says a lot about the user group. I don't think I would have reached
> > a human being in 5 minutes for the tech support for most products.
> > I'll make sure to monitor this list more closely so I can pay it forward ;)
> >
>
> No problem.  There are lots of good people on this list, and I certainly
> have done the "oh crap, I put my neck on the line for this new Hadoop thing
> and now it's broke" email.
>
> Brian

Re: Namenode corruption: need help quickly please

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Oct 25, 2010, at 6:35 PM, phil young wrote:

> I had also assumed that some other jar or configuration file had been
> changed, but reviewing the timestamps on the files did not reveal the
> problem.
> On the assumption that something I was not seeing did in fact change,
> I renamed my $HADOOP_HOME directory and replaced it with one from a slave.
> I then restored $HADOOP_HOME/conf from the original/renamed directory, and
> voila - we're back in business.
> 

Glad to hear this.

> Brian, thanks very much for your help.
> It took literally more time for me to write the original email (5 minutes)
> than to get a reply which indicated a way to solve the problem, and another
> 5 minutes to solve it.
> That says a lot about the user group. I don't think I would have reached a
> human being in 5 minutes for the tech support for most products.
> I'll make sure to monitor this list more closely so I can pay it forward ;)
> 

No problem.  There are lots of good people on this list, and I certainly have done the "oh crap, I put my neck on the line for this new Hadoop thing and now it's broke" email.

Brian

Re: Namenode corruption: need help quickly please

Posted by phil young <ph...@gmail.com>.
I had also assumed that some other jar or configuration file had been
changed, but reviewing the timestamps on the files did not reveal the
problem.
On the assumption that something I was not seeing did in fact change,
I renamed my $HADOOP_HOME directory and replaced it with one from a slave.
I then restored $HADOOP_HOME/conf from the original/renamed directory, and
voila - we're back in business.

Brian, thanks very much for your help.
It took literally more time for me to write the original email (5 minutes)
than to get a reply which indicated a way to solve the problem, and another
5 minutes to solve it.
That says a lot about the user group. I don't think I would have reached a
human being in 5 minutes for the tech support for most products.
I'll make sure to monitor this list more closely so I can pay it forward ;)

Thanks,
-Phil

On Mon, Oct 25, 2010 at 7:16 PM, Brian Bockelman <bb...@cse.unl.edu> wrote:

> Hi Phil,
>
> Typically, this is due to running inconsistent versions of Hadoop,
> right?
>
> I would compare the output of 'md5sum' on the NN versus the DN for the
> various Hadoop JARs.
>
> If you do "jar tf" on the jar you added to $HADOOP_HOME/lib, did it
> inadvertently add another implementation of the NN classes?
>
> Brian

Re: Namenode corruption: need help quickly please

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hi Phil,

Typically, this is due to running inconsistent versions of Hadoop, right?

I would compare the output of 'md5sum' on the NN versus the DN for the various Hadoop JARs.

If you do "jar tf" on the jar you added to $HADOOP_HOME/lib, did it inadvertently add another implementation of the NN classes?
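Something along these lines (the core jar glob assumes a stock 0.20.2 tree,
and <added>.jar stands in for whichever jar you dropped into lib):

   md5sum $HADOOP_HOME/hadoop-*-core.jar   # run on the NN and a DN; compare sums
   jar tf $HADOOP_HOME/lib/<added>.jar | grep hdfs/server/namenode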

Brian
