Posted to common-user@hadoop.apache.org by Andreas Kostyrka <an...@kostyrka.org> on 2008/09/05 13:30:34 UTC

critical name node problem

Hi!

My namenode has run out of space, and now I'm getting the following:

08/09/05 09:23:22 WARN dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /data_v1/2008/06/26/12/pub1-access-2008-06-26-11_52_07.log.gz because it does not exist
08/09/05 09:23:22 INFO ipc.Server: Stopping server on 9000
08/09/05 09:23:22 ERROR dfs.NameNode: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
        at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
        at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:441)
        at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:766)
        at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:640)
        at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:223)
        at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
        at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274)
        at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:255)
        at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133)
        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)
        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)
        at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
        at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)

08/09/05 09:23:22 INFO dfs.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ec2-67-202-42-251.compute-1.amazonaws.com/10.251.39.196

hadoop-0.17.1 btw.

What do I do now?

Andreas

Re: critical name node problem

Posted by Steve Loughran <st...@apache.org>.
Allen Wittenauer wrote:
> 
> 
> On 9/5/08 5:53 AM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:
> 
>> Another idea would be a tool or namenode startup mode that would make it
>> ignore EOFExceptions to recover as much of the edits as possible.
> 
>     We clearly need to change the "how to configure" docs to make sure
> people put at least two directories on two different storage systems for
> dfs.name.dir. This problem seems to happen quite often, and having two+
> dirs helps protect against it.
> 
>     We recently had one of the disks on one of our copies go bad.  The
> system kept going just fine until we had a chance to reconfig the name node.
> 
>     That said, I've just filed HADOOP-4080 to help alert admins in these
> situations.
> 


that and HADOOP-4081.

Apache Axis has this production/development switch; in development mode it 
sends stack traces over the wire and is generally more forgiving. By 
default it assumes you are in production rather than development, so you 
have to explicitly flip the switch to get the slightly reduced security.

Hadoop could have something similar, where if the production flag is 
set, the cluster would simply refuse to come up if it felt the 
configuration wasn't robust enough.

Re: critical name node problem

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.


On 9/5/08 5:53 AM, "Andreas Kostyrka" <an...@kostyrka.org> wrote:

> Another idea would be a tool or namenode startup mode that would make it
> ignore EOFExceptions to recover as much of the edits as possible.

    We clearly need to change the "how to configure" docs to make sure
people put at least two directories on two different storage systems for
dfs.name.dir. This problem seems to happen quite often, and having two+
dirs helps protect against it.
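
    Concretely, dfs.name.dir takes a comma-delimited list of directories,
and the name node replicates the image and edits into every one of them,
so something like this in hadoop-site.xml does it. The two paths here are
just examples (say, one local disk plus an NFS mount):

  <property>
    <name>dfs.name.dir</name>
    <value>/disk1/hadoop/dfs/name,/mnt/nfs/hadoop/dfs/name</value>
  </property>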

    We recently had one of the disks on one of our copies go bad.  The
system kept going just fine until we had a chance to reconfig the name node.

    That said, I've just filed HADOOP-4080 to help alert admins in these
situations.


Re: critical name node problem

Posted by Andreas Kostyrka <an...@kostyrka.org>.
Ok, googling around a little, the solution seems to be either to delete the 
edits file, which in my case would be non-cool (24MB worth of edits in 
there), or to truncate it correctly.
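
First I made a backup copy of the edits file to cut prefixes from (that is 
the edits.org in the script below); the path assumes the stock layout under 
dfs.name.dir, so adjust it to wherever yours points:

cd /path/to/dfs.name.dir/current  # your dfs.name.dir
cp edits edits.org                # untouched copy of the broken edits log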

So I used the following script to figure out how much data needs to be 
dropped:

#!/bin/bash
# Try successively shorter prefixes of the saved edits log until the
# namenode manages to load one. 25497570 was the size of my edits file;
# start from the size of yours.
LEN=25497570

while true
do
   # overwrite the live edits file with the first $LEN bytes of the backup
   dd if=edits.org of=edits bs=$LEN count=1
   time hadoop namenode
   # a failed edits load makes the namenode exit with 255; anything else
   # means this prefix was loadable
   if [[ $? -ne 255 ]]
   then
      echo "$LEN seems to have worked."
      exit 0
   fi
   LEN=$(expr $LEN - 1)
done
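
(When the load succeeds the namenode just stays up in the foreground, so 
you stop it by hand; any exit status other than the 255 the failed load 
produces counts as success here. And since it backs off a single byte per 
round, restarting the namenode each time, it can take a while.)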

Guess something like this might make sense to add to 
http://wiki.apache.org/hadoop/TroubleShooting
Not everyone will be able to figure out how to get rid of the "last" 
incomplete record.

Another idea would be a tool or namenode startup mode that would make it 
ignore EOFExceptions to recover as much of the edits as possible.

Andreas

On Friday 05 September 2008 13:30:34 Andreas Kostyrka wrote:
> Hi!
>
> My namenode has run out of space, and now I'm getting the following:
>
> 08/09/05 09:23:22 WARN dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /data_v1/2008/06/26/12/pub1-access-2008-06-26-11_52_07.log.gz because it does not exist
> 08/09/05 09:23:22 INFO ipc.Server: Stopping server on 9000
> 08/09/05 09:23:22 ERROR dfs.NameNode: java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>         at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
>         at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
>         at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:441)
>         at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:766)
>         at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:640)
>         at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:223)
>         at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
>         at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274)
>         at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:255)
>         at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133)
>         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)
>         at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)
>         at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
>         at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)
>
> 08/09/05 09:23:22 INFO dfs.NameNode: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at ec2-67-202-42-251.compute-1.amazonaws.com/10.251.39.196
>
> hadoop-0.17.1 btw.
>
> What do I do now?
>
> Andreas