Posted to dev@chukwa.apache.org by "Bill Graham (JIRA)" <ji...@apache.org> on 2010/05/10 22:50:31 UTC

[jira] Commented: (CHUKWA-487) Collector left in a bad state after temporary NN outage

    [ https://issues.apache.org/jira/browse/CHUKWA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865918#action_12865918 ] 

Bill Graham commented on CHUKWA-487:
------------------------------------

Here's what I saw in the logs when I had to restart my NN. It took a little while to exit safe mode. I had to restore from the secondary name node, so there might have been some data loss upon restore.

2010-05-06 17:32:19,515 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=318716 dataRate=10622
2010-05-06 17:32:49,518 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=196741 dataRate=6557
2010-05-06 17:33:06,367 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:129,numberchunks:217
2010-05-06 17:33:19,521 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-05-06 17:33:49,523 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
2010-05-06 17:34:01,142 WARN org.apache.hadoop.dfs.DFSClient$LeaseChecker@36b60b93 DFSClient - Problem renewing lease for DFSClient_-1088933168: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: Cannot renew lease for DFSClient_-1088933168. Name node is in safe mode.
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
        at org.apache.hadoop.dfs.FSNamesystem.renewLease(FSNamesystem.java:1823)
        at org.apache.hadoop.dfs.NameNode.renewLease(NameNode.java:458)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
        at org.apache.hadoop.ipc.Client.call(Client.java:716)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:781)
        at java.lang.Thread.run(Thread.java:619)

2010-05-06 17:34:01,608 WARN Timer-2094 SeqFileWriter - Got an exception in rotate
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: Cannot complete file /chukwa/logs/201006172737418_xxxxxxxxxcom_71ea99261284ab9f0566faa.chukwa. Name node is in safe mode.
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
        at org.apache.hadoop.dfs.FSNamesystem.completeFileInternal(FSNamesystem.java:1209)
        at org.apache.hadoop.dfs.FSNamesystem.completeFile(FSNamesystem.java:1200)
        at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:351)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
        at org.apache.hadoop.ipc.Client.call(Client.java:716)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.dfs.$Proxy0.complete(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy0.complete(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:2736)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:2657)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
        at org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter.rotate(SeqFileWriter.java:194)
        at org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter$1.run(SeqFileWriter.java:235)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
2010-05-06 17:34:01,647 FATAL Timer-2094 SeqFileWriter - IO Exception in rotate. Exiting!
2010-05-06 17:34:01,661 FATAL btpool0-6248 SeqFileWriter - IOException when trying to write a chunk, Collector is going to exit!
java.io.IOException: Stream closed.
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.isClosed(DFSClient.java:2245)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:2481)
        at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:155)
        at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
        at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:121)
        at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:112)
        at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:47)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1016)
        at org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter.add(SeqFileWriter.java:281)
        at org.apache.hadoop.chukwa.datacollection.collector.servlet.ServletCollector.accept(ServletCollector.java:152)
        at org.apache.hadoop.chukwa.datacollection.collector.servlet.ServletCollector.doPost(ServletCollector.java:190)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:487)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:362)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:729)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:324)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:505)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:843)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:647)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:211)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:380)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:395)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:450)
2010-05-06 17:34:06,370 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:28,numberchunks:0
2010-05-06 17:35:06,375 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
2010-05-06 17:36:06,379 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
2010-05-06 17:37:06,384 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
...
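
In the log above, a single failed rotate (the SafeModeException surfacing from FSDataOutputStream.close) is followed immediately by "IO Exception in rotate. Exiting!". As a rough illustration of the "keep trying" alternative mentioned in the issue description, here is a minimal sketch of a retry loop with exponential backoff. Everything in it is an assumption made for illustration: RotateRetry, IoAction, and retry are hypothetical names, not part of Chukwa or Hadoop, and the name node outage is simulated with a plain IOException.

```java
import java.io.IOException;

/** Hypothetical helper for illustration only; not part of Chukwa or Hadoop. */
final class RotateRetry {

    /** An action that may fail with an IOException, such as closing an output stream. */
    interface IoAction {
        void run() throws IOException;
    }

    /**
     * Runs the action until it succeeds or maxAttempts is reached,
     * doubling the delay after each failed attempt.
     * Returns true on success, false if every attempt failed
     * or the waiting thread was interrupted.
     */
    static boolean retry(IoAction action, int maxAttempts, long initialDelayMs) {
        long delayMs = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return true;
            } catch (IOException e) {
                System.err.println("attempt " + attempt + " failed: " + e.getMessage());
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayMs);
                    } catch (InterruptedException ie) {
                        // Preserve the interrupt flag and give up.
                        Thread.currentThread().interrupt();
                        return false;
                    }
                    delayMs *= 2;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulated name node that rejects the first two calls (assumption for the demo).
        int[] calls = {0};
        boolean ok = retry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Name node is in safe mode.");
            }
        }, 5, 10L);
        System.out.println("succeeded=" + ok + " after " + calls[0] + " attempts");
    }
}
```

This only shows the retry policy. The "Stream closed." error in the log suggests the output stream is unusable after the failed close, so a real change would probably also need to open a fresh file before accepting chunks again, and to decide what happens once the retries are exhausted.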

> Collector left in a bad state after temporary NN outage
> -------------------------------------------------------
>
>                 Key: CHUKWA-487
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-487
>             Project: Hadoop Chukwa
>          Issue Type: Bug
>          Components: data collection
>    Affects Versions: 0.4.0
>            Reporter: Bill Graham
>
> When the name node returns errors to the collector, at some point the collector dies halfway. This behavior should be changed to either resemble the agents and keep trying, or to shut down completely. Instead, what I'm seeing is that the collector logs that it's shutting down and the var/pidDir/Collector.pid file gets removed, but the collector continues to run without handling new data. This log entry is then repeated ad infinitum:
> 2010-05-06 17:35:06,375 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:36:06,379 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
> 2010-05-06 17:37:06,384 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
