Posted to common-user@hadoop.apache.org by Bill Graham <bi...@gmail.com> on 2010/02/02 22:54:58 UTC

NN fails to start with LeaseManager errors

Hi,

This morning the namenode of my hadoop cluster shut itself down after the
logs/ directory had filled itself with job configs, log files and all the
other fun things hadoop leaves there. It had been running for a few months.
I deleted all of the job configs and attempt log directories and tried to
restart the namenode, but it failed with many LeaseManager errors.

Does anyone know what needs to be done to fix this and get the namenode back
up?

Here's what the logs report. I'm using Cloudera's 0.18.3 distro.

STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = my-host-name.com/10.15.137.204
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.18.3-2
STARTUP_MSG:   build =  -r ; compiled by 'httpd' on Fri Jun 12 15:27:43 PDT
2009
************************************************************/
2010-02-02 13:38:31,199 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
Initializing RPC Metrics with hostName=NameNode, port=9000
2010-02-02 13:38:31,208 INFO org.apache.hadoop.dfs.NameNode: Namenode up at:
my-host-name.com/10.15.137.204:9000
2010-02-02 13:38:31,212 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=NameNode, sessionId=null
2010-02-02 13:38:31,218 INFO org.apache.hadoop.dfs.NameNodeMetrics:
Initializing NameNodeMeterics using context
object:org.apache.hadoop.metrics.spi.NullContext
2010-02-02 13:38:31,318 INFO org.apache.hadoop.fs.FSNamesystem:
fsOwner=app,app
2010-02-02 13:38:31,319 INFO org.apache.hadoop.fs.FSNamesystem:
supergroup=supergroup
2010-02-02 13:38:31,319 INFO org.apache.hadoop.fs.FSNamesystem:
isPermissionEnabled=true
2010-02-02 13:38:31,329 INFO org.apache.hadoop.dfs.FSNamesystemMetrics:
Initializing FSNamesystemMeterics using context
object:org.apache.hadoop.metrics.spi.NullContext
2010-02-02 13:38:31,331 INFO org.apache.hadoop.fs.FSNamesystem: Registered
FSNamesystemStatusMBean
2010-02-02 13:38:31,375 INFO org.apache.hadoop.dfs.Storage: Number of files
= 248675
2010-02-02 13:38:36,932 INFO org.apache.hadoop.dfs.Storage: Number of files
under construction = 2
2010-02-02 13:38:37,008 INFO org.apache.hadoop.dfs.Storage: Image file of
size 42924164 loaded in 5 seconds.
2010-02-02 13:38:37,020 ERROR org.apache.hadoop.dfs.LeaseManager:
/path/on/hdfs/_logs/history/my-host-name.com_1261508934685_job_200912221108_15967_conf.xml
not found in lease.paths
(=[/path/on/hdfs/_logs/history/my-host-name.com_1261508934685_job_200912221108_15967_app_MyJobName_20100202_10_59])

[[ a bunch more errors like the one above ]]

2010-02-02 13:38:37,076 ERROR org.apache.hadoop.fs.FSNamesystem:
FSNamesystem initialization failed.
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:585)
        at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:846)
        at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:675)
        at
org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:289)
        at
org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
        at
org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
        at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:273)
        at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:193)
        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:179)
        at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
        at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)
2010-02-02 13:38:37,077 INFO org.apache.hadoop.ipc.Server: Stopping server
on 9000
2010-02-02 13:38:37,081 ERROR org.apache.hadoop.dfs.NameNode:
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:585)
        at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:846)
        at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:675)
        at
org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:289)
        at
org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
        at
org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
        at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:273)
        at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:193)
        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:179)
        at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
        at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)

2010-02-02 13:38:37,082 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at my-host-name.com/10.15.137.204
************************************************************/

thanks,
Bill

Re: NN fails to start with LeaseManager errors

Posted by Bill Graham <bi...@gmail.com>.
I was able to fix this by restoring my namenode from the last checkpoint of
the secondary namenode. Searching the list, I saw that others have struggled
with this issue, so I'll share my steps.

I did it by following Tom White's excellent instructions in Hadoop - The
Definitive Guide:

1. Stopped the secondary namenode. (The namenode was already stopped.)
2. Moved my namenode directory (configured as dfs.name.dir) aside.
3. Started the namenode with the -importCheckpoint option, like so:

bin/hadoop-daemon.sh start namenode -importCheckpoint
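
For anyone following along, the three steps above look roughly like this as a
shell session. The paths are illustrative only (substitute your own
dfs.name.dir and HADOOP_HOME), and you should verify against your own config
before running anything:

```shell
#!/bin/sh
# Restore the namenode from the secondary namenode's checkpoint (Hadoop 0.18.x).
# Illustrative paths; adjust to your dfs.name.dir / fs.checkpoint.dir settings.
cd /path/to/hadoop            # hypothetical HADOOP_HOME

# 1. Stop the secondary namenode so it doesn't checkpoint mid-restore.
#    (The namenode itself is already down.)
bin/hadoop-daemon.sh stop secondarynamenode

# 2. Move the corrupt namenode directory (dfs.name.dir) aside rather than
#    deleting it, in case you need it later.
mv /path/to/dfs/name /path/to/dfs/name.corrupt

# 3. Start the namenode from the secondary's checkpoint. -importCheckpoint
#    reads the image from fs.checkpoint.dir and writes a fresh copy into
#    dfs.name.dir (which should be empty or absent).
bin/hadoop-daemon.sh start namenode -importCheckpoint
```

Note that anything written to HDFS after the secondary's last checkpoint is
lost with this approach, so it's a last resort when the edits log itself is
corrupt.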



On Tue, Feb 2, 2010 at 1:54 PM, Bill Graham <bi...@gmail.com> wrote:

> Hi,
>
> This morning the namenode of my hadoop cluster shut itself down after the
> logs/ directory had filled itself with job configs, log files and all the
> other fun things hadoop leaves there. It had been running for a few months.
> I deleted all of the job configs and attempt log directories and tried to
> restart the namenode, but it failed with many LeaseManager errors.
>
> Does anyone know what needs to be done to fix this and get the namenode
> back up?
>
> Here's what the logs report. I'm using Cloudera's 0.18.3 distro.
>
> [original log output snipped]
>
> thanks,
> Bill
>