Posted to common-user@hadoop.apache.org by Guy Doulberg <gu...@conduit.com> on 2011/12/15 10:16:33 UTC

NameNode - didn't persist the edit log

Hi guys,

We recently had the following problem on our production cluster:

The filesystem containing the editlog and fsimage had no free inodes.
As a result, the namenode wasn't able to obtain an inode for the
fsimage and editlog after a checkpoint was reached, even though the
previous files were freed.
Unfortunately, we had no monitoring on the number of free inodes, so it
happened that the namenode ran in this state for a few hours.

We noticed this failure on the namenode's DFS status page.

But the namenode didn't enter safe-mode, so all the writes that were
made couldn't be persisted to the editlog.


After discovering the problem we freed inodes, and the file-system
seemed to be okay again. We tried to force the namenode to persist the
editlog, with no success.

Eventually, we restarted the namenode, which of course caused us to
lose all the data that was written to HDFS during those few hours
(fortunately we have a backup of the recent writes, so we restored the
data from there).

This situation raises some severe concerns:
1. How come the namenode identified a failure in persisting its editlog
and didn't enter safe-mode? (The exception was logged only at WARN
severity and not as CRITICAL.)
2. How come, after we freed inodes, we couldn't get the namenode to
persist? Maybe there should be a CLI command that enables us to force
the namenode to persist its editlog.

Do you know of a JIRA opened for these issues, or should I open one?

Thanks Guy
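
A minimal sketch of the kind of free-inode check that was missing here, assuming the namenode metadata lives under /srv/hadoop/hdfs/name and /srv/hadoop/hdfs/edit (the directories that show up in the log excerpt later in the thread); the paths, the threshold, and the Nagios-style exit codes are placeholders to adapt:

#!/usr/bin/env python
# Free-inode check for the filesystems holding the NameNode metadata.
# The paths below match the directories mentioned later in this thread;
# both they and the threshold are placeholder assumptions.
import os
import sys

METADATA_DIRS = ["/srv/hadoop/hdfs/name", "/srv/hadoop/hdfs/edit"]
MIN_FREE_INODE_RATIO = 0.05  # alert when fewer than 5% of inodes remain

def free_inode_ratio(path):
    st = os.statvfs(path)
    if st.f_files == 0:  # filesystem reports no inode limit
        return 1.0
    return st.f_ffree / float(st.f_files)

def main():
    critical = False
    for d in METADATA_DIRS:
        ratio = free_inode_ratio(d)
        status = "CRITICAL" if ratio < MIN_FREE_INODE_RATIO else "OK"
        print("%s: %.1f%% inodes free on %s" % (status, ratio * 100, d))
        critical = critical or status == "CRITICAL"
    sys.exit(2 if critical else 0)  # Nagios-style exit code

if __name__ == "__main__":
    main()

Run from cron or a monitoring agent, a check along these lines would have flagged the exhaustion well before the namenode started dropping edits.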



Re: NameNode - didn't persist the edit log

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Guy,

Eli has been looking into these issues and it looks like you found a
nasty bug. You can follow these JIRAs to track resolution: HDFS-2701,
HDFS-2702, HDFS-2703. I think in particular HDFS-2703 is the one that
bit you here.

-Todd

On Thu, Dec 15, 2011 at 2:06 AM, Guy Doulberg <gu...@conduit.com> wrote:
> Hi Todd, you are right, I should be more specific:
> 1. From the namenode log:
> 2011-12-11 08:57:23,245 WARN org.apache.hadoop.hdfs.server.common.Storage: rollEdidLog: removing storage /srv/hadoop/hdfs/edit
> 2011-12-11 08:57:23,311 WARN org.apache.hadoop.hdfs.server.common.Storage: incrementCheckpointTime failed on /srv/hadoop/hdfs/name;type=IMAGE
> 2011-12-11 08:57:23,316 WARN org.mortbay.log: /getimage:
> java.io.IOException: GetImage failed. java.lang.NullPointerException
>       at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:83)
>       at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:78)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>       at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:78)
>       at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>       at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>       at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>       at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:829)
>       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>       at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>       at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>       at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>       at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>       at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>       at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>       at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>       at org.mortbay.jetty.Server.handle(Server.java:326)
>       at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>       at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>       at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>       at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>       at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>       at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>       at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
>
>
> 2. Embarrassingly, no, we had only one; now we have two, plus periodic backups :(
> 3. Yes
> 4. hadoop version
> Hadoop 0.20.2-cdh3u2
> Subversion
> file:///tmp/nightly_2011-10-13_20-02-02_3/hadoop-0.20-0.20.2+923.142-1~lucid
> -r 95a824e4005b2a94fe1c11f1ef9db4c672ba43cb
> Compiled by root on Thu Oct 13 21:52:18 PDT 2011
> From source with checksum 644e5db6c59d45bca96cec7f220dda51
>
> Thanks, Guy
>
>
> On Thu 15 Dec 2011 11:39:26 AM IST, Todd Lipcon wrote:
>>
>> Hi Guy,
>>
>> Several questions come to mind here:
>> - What was the exact WARN level message you saw?
>> - Did you have multiple dfs.name.dirs configured as recommended by
>> most setup guides?
>> - Did you try entering safemode and then running saveNamespace to
>> persist the image before shutting down the NN? This would have saved
>> your data.
>> - What exact version of HDFS were you running?
>>
>> This is certainly not expected behavior... all of the places where an
>> edit log fails have a check against there being 0 edit logs remaining
>> and should issue a FATAL level message followed by a System.exit(-1).
>>
>> -Todd
>>
>> On Thu, Dec 15, 2011 at 1:16 AM, Guy Doulberg<gu...@conduit.com>
>>  wrote:
>>>
>>> Hi guys,
>>>
>>> We recently had the following problem on our production cluster:
>>>
>>> The filesystem containing the editlog and fsimage had no free inodes.
>>>  As a result, the namenode wasn't able to obtain an inode for the fsimage
>>> and editlog after a checkpoint was reached, even though the previous files
>>> were freed.
>>>  Unfortunately, we had no monitoring on the number of free inodes, so it
>>> happened that the namenode ran in this state for a few hours.
>>>
>>> We noticed this failure on the namenode's DFS status page.
>>>
>>> But the namenode didn't enter safe-mode, so all the writes that were made
>>> couldn't be persisted to the editlog.
>>>
>>>
>>> After discovering the problem we freed inodes, and the file-system seemed
>>> to be okay again. We tried to force the namenode to persist the editlog,
>>> with no success.
>>>
>>> Eventually, we restarted the namenode, which of course caused us to lose
>>> all the data that was written to HDFS during those few hours (fortunately
>>> we have a backup of the recent writes, so we restored the data from there).
>>>
>>> This situation raises some severe concerns:
>>> 1. How come the namenode identified a failure in persisting its editlog
>>> and didn't enter safe-mode? (The exception was logged only at WARN
>>> severity and not as CRITICAL.)
>>> 2. How come, after we freed inodes, we couldn't get the namenode to
>>> persist? Maybe there should be a CLI command that enables us to force the
>>> namenode to persist its editlog.
>>>
>>> Do you know of a JIRA opened for these issues, or should I open one?
>>>
>>> Thanks Guy
>>>
>>>
>>
>>
>>
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: NameNode - didn't persist the edit log

Posted by Guy Doulberg <gu...@conduit.com>.
Hi Todd,
You are right, I should be more specific:
1. From the namenode log:
2011-12-11 08:57:23,245 WARN org.apache.hadoop.hdfs.server.common.Storage: rollEdidLog: removing storage /srv/hadoop/hdfs/edit
2011-12-11 08:57:23,311 WARN org.apache.hadoop.hdfs.server.common.Storage: incrementCheckpointTime failed on /srv/hadoop/hdfs/name;type=IMAGE
2011-12-11 08:57:23,316 WARN org.mortbay.log: /getimage:
java.io.IOException: GetImage failed. java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:83)
        at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:78)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
        at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:78)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
        at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:829)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)


2. Embarrassingly, no, we had only one; now we have two, plus periodic backups :(
3. Yes
4. hadoop version
Hadoop 0.20.2-cdh3u2
Subversion 
file:///tmp/nightly_2011-10-13_20-02-02_3/hadoop-0.20-0.20.2+923.142-1~lucid 
-r 95a824e4005b2a94fe1c11f1ef9db4c672ba43cb
Compiled by root on Thu Oct 13 21:52:18 PDT 2011
From source with checksum 644e5db6c59d45bca96cec7f220dda51

Thanks, 
Guy
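
Guy's answer 2 above (a single dfs.name.dir, since fixed) is the setting Todd asked about. A rough sanity-check sketch for it follows, assuming the configuration lives at /etc/hadoop/conf/hdfs-site.xml; the path and the same-device heuristic are illustrative assumptions, not an official tool:

#!/usr/bin/env python
# Rough sanity check for NameNode metadata redundancy: warn if
# hdfs-site.xml lists fewer than two dfs.name.dir entries, or if all
# entries live on the same device. Only explicitly configured values
# are inspected; the config path below is an assumption.
import os
import sys
import xml.etree.ElementTree as ET

HDFS_SITE = "/etc/hadoop/conf/hdfs-site.xml"

def name_dirs(conf_path):
    root = ET.parse(conf_path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == "dfs.name.dir":
            value = prop.findtext("value") or ""
            return [p.strip() for p in value.split(",") if p.strip()]
    return []

def main():
    dirs = name_dirs(HDFS_SITE)
    if len(dirs) < 2:
        print("WARNING: only %d dfs.name.dir entries configured: %s" % (len(dirs), dirs))
        sys.exit(1)
    devices = set(os.stat(d).st_dev for d in dirs)  # directories must exist locally
    if len(devices) < 2:
        print("WARNING: all dfs.name.dir entries sit on the same device")
        sys.exit(1)
    print("OK: %d name dirs across %d devices" % (len(dirs), len(devices)))

if __name__ == "__main__":
    main()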

On Thu 15 Dec 2011 11:39:26 AM IST, Todd Lipcon wrote:
> Hi Guy,
>
> Several questions come to mind here:
> - What was the exact WARN level message you saw?
> - Did you have multiple dfs.name.dirs configured as recommended by
> most setup guides?
> - Did you try entering safemode and then running saveNamespace to
> persist the image before shutting down the NN? This would have saved
> your data.
> - What exact version of HDFS were you running?
>
> This is certainly not expected behavior... all of the places where an
> edit log fails have a check against there being 0 edit logs remaining
> and should issue a FATAL level message followed by a System.exit(-1).
>
> -Todd
>
> On Thu, Dec 15, 2011 at 1:16 AM, Guy Doulberg<gu...@conduit.com>  wrote:
>> Hi guys,
>>
>> We recently had the following problem on our production cluster:
>>
>> The filesystem containing the editlog and fsimage had no free inodes.
>>   As a result, the namenode wasn't able to obtain an inode for the fsimage and
>>   editlog after a checkpoint was reached, even though the previous files were
>> freed.
>>   Unfortunately, we had no monitoring on the number of free inodes, so it
>> happened that the namenode ran in this state for a few hours.
>>
>> We noticed this failure on the namenode's DFS status page.
>>
>> But the namenode didn't enter safe-mode, so all the writes that were made
>> couldn't be persisted to the editlog.
>>
>>
>> After discovering the problem we freed inodes, and the file-system seemed to
>> be okay again. We tried to force the namenode to persist the editlog, with no
>> success.
>>
>> Eventually, we restarted the namenode, which of course caused us to lose all
>> the data that was written to HDFS during those few hours (fortunately we
>> have a backup of the recent writes, so we restored the data from there).
>>
>> This situation raises some severe concerns:
>> 1. How come the namenode identified a failure in persisting its editlog and
>> didn't enter safe-mode? (The exception was logged only at WARN severity and
>> not as CRITICAL.)
>> 2. How come, after we freed inodes, we couldn't get the namenode to persist?
>> Maybe there should be a CLI command that enables us to force the namenode to
>> persist its editlog.
>>
>> Do you know of a JIRA opened for these issues, or should I open one?
>>
>> Thanks Guy
>>
>>
>
>
>

Re: NameNode - didn't persist the edit log

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Guy,

Several questions come to mind here:
- What was the exact WARN level message you saw?
- Did you have multiple dfs.name.dirs configured as recommended by
most setup guides?
- Did you try entering safemode and then running saveNamespace to
persist the image before shutting down the NN? This would have saved
your data.
- What exact version of HDFS were you running?

This is certainly not expected behavior... all of the places where an
edit log fails have a check against there being 0 edit logs remaining
and should issue a FATAL level message followed by a System.exit(-1).

-Todd

On Thu, Dec 15, 2011 at 1:16 AM, Guy Doulberg <gu...@conduit.com> wrote:
> Hi guys,
>
> We recently had the following problem on our production cluster:
>
> The filesystem containing the editlog and fsimage had no free inodes.
>  As a result, the namenode wasn't able to obtain an inode for the fsimage and
>  editlog after a checkpoint was reached, even though the previous files were
> freed.
>  Unfortunately, we had no monitoring on the number of free inodes, so it
> happened that the namenode ran in this state for a few hours.
>
> We noticed this failure on the namenode's DFS status page.
>
> But the namenode didn't enter safe-mode, so all the writes that were made
> couldn't be persisted to the editlog.
>
>
> After discovering the problem we freed inodes, and the file-system seemed to
> be okay again. We tried to force the namenode to persist the editlog, with no
> success.
>
> Eventually, we restarted the namenode, which of course caused us to lose all
> the data that was written to HDFS during those few hours (fortunately we
> have a backup of the recent writes, so we restored the data from there).
>
> This situation raises some severe concerns:
> 1. How come the namenode identified a failure in persisting its editlog and
> didn't enter safe-mode? (The exception was logged only at WARN severity and
> not as CRITICAL.)
> 2. How come, after we freed inodes, we couldn't get the namenode to persist?
> Maybe there should be a CLI command that enables us to force the namenode to
> persist its editlog.
>
> Do you know of a JIRA opened for these issues, or should I open one?
>
> Thanks Guy
>
>



-- 
Todd Lipcon
Software Engineer, Cloudera
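
Todd's suggestion above to enter safemode and run saveNamespace before shutting the namenode down corresponds to the dfsadmin commands sketched below. The Python wrapper is only an illustration (it assumes the hadoop binary is on the PATH of an HDFS superuser), and whether the save would have succeeded during this particular incident is exactly what HDFS-2701/2702/2703 track:

#!/usr/bin/env python
# Illustrative sketch of the "enter safemode, then saveNamespace"
# sequence: freeze namespace changes, then have the NameNode write a
# fresh fsimage and reset its edit log so a restart does not lose
# in-memory edits. Assumes the hadoop binary is on the PATH of an
# HDFS superuser.
import subprocess

def dfsadmin(*args):
    cmd = ["hadoop", "dfsadmin"] + list(args)
    print("running: %s" % " ".join(cmd))
    subprocess.check_call(cmd)  # raises CalledProcessError on failure

def main():
    dfsadmin("-safemode", "enter")  # block further namespace modifications
    dfsadmin("-saveNamespace")      # persist image and edits to the name dirs
    # Leave safemode only if the NameNode should keep serving writes;
    # skip this step when an immediate restart is planned anyway.
    dfsadmin("-safemode", "leave")

if __name__ == "__main__":
    main()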