Posted to common-user@hadoop.apache.org by Jeff Eastman <je...@collab.net> on 2008/01/03 18:26:23 UTC

Damage Control

I have a small cloud running with about 100 GB of data in the DFS. All
appeared normal until yesterday, when Eclipse could not access the DFS.
Investigating:

 

1. I logged onto the master machine and attempted to upload a local
file. Got 6 errors like:

 

08/01/02 21:34:43 WARN fs.DFSClient: Error while writing.
java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1656)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1744)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
        at org.apache.hadoop.fs.FileUtil.copyContent(FileUtil.java:263)
        at org.apache.hadoop.fs.FileUtil.copyContent(FileUtil.java:248)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:133)
        at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:776)
        at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:757)
        at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:115)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1220)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:187)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1333)

put: Broken pipe
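
For reference, the trace shows the shell "put" funneling through
FileSystem.copyFromLocalFile(). A minimal standalone reproduction of the
same code path, in case it helps anyone reproduce this (the class name
and paths are made up; the API calls are the ones the trace itself
shows):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutRepro {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads hadoop-site.xml
        FileSystem fs = FileSystem.get(conf);     // the configured DFS
        // Same call FsShell makes for "put"; the Broken pipe presumably
        // surfaced here when a datanode dropped the socket mid-write.
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"),
                             new Path("/user/jeastman/sample.txt"));
        fs.close();
    }
}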

 

2. I bounced the cloud

3. Now I had twice the number of nodes in the Node manager page (hosts
were all duplicated, with 0 blocks allocated in each duplicate)

4. I brought down the cloud

5. Jps still showed master processes, but none on slaves

6. Tried to bring the cloud down again; no change

7. Rebooted the master server (stupid move)

8. Brought up the cloud. No name node

 

[jeastman@cu027 hadoop]$ jps

2436 DataNode

2539 SecondaryNameNode

2781 Jps

2739 TaskTracker

2605 JobTracker

 

9. Node manager page is absent, cannot connect to Hadoop

10. Checking the name node log shows that the directory
/tmp/hadoop-jeastman/dfs/name is missing
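
For what it's worth, the namenode keeps its image and edit log wherever
dfs.name.dir points, which by default is ${hadoop.tmp.dir}/dfs/name, and
hadoop.tmp.dir defaults to /tmp/hadoop-${user.name}. That matches the
missing path exactly. A quick sketch to print the effective locations
(the class name is made up; the property names are the stock ones):

import org.apache.hadoop.conf.Configuration;

public class ShowDfsDirs {
    public static void main(String[] args) {
        // Reads hadoop-default.xml plus any hadoop-site.xml overrides.
        Configuration conf = new Configuration();
        // Anything under /tmp can be wiped on reboot, so a dfs.name.dir
        // there means the namenode image can simply vanish.
        System.out.println("hadoop.tmp.dir = " + conf.get("hadoop.tmp.dir"));
        System.out.println("dfs.name.dir   = " + conf.get("dfs.name.dir"));
        System.out.println("dfs.data.dir   = " + conf.get("dfs.data.dir"));
    }
}

Pointing dfs.name.dir (and dfs.data.dir) at persistent directories in
hadoop-site.xml would prevent a reboot from taking the image with it.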

 

The simplest thing would be to just reinitialize the DFS, since the data
is stored elsewhere. But I would like to understand what went wrong and,
if possible, fix it. Any suggestions?

 

Jeff


RE: Damage Control

Posted by Jeff Eastman <je...@collab.net>.
I decided just to reset the DFS and it is up again. Any ideas on what
might have happened?

Jeff

RE: Damage Control

Posted by Jeff Eastman <je...@collab.net>.
Digging into the logs some more, I found the namenode and
secondarynamenode logs full of exceptions like this, going back to Dec
25th (the oldest logs I have):

2007-12-25 00:03:38,497 INFO org.apache.hadoop.fs.FSNamesystem: Roll Edit Log from 204.16.107.165
2007-12-25 00:03:38,557 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 54310, call rollEditLog() from 204.16.107.165:38982: error: java.io.IOException: Attempt to roll edit log but edits.new exists
java.io.IOException: Attempt to roll edit log but edits.new exists
        at org.apache.hadoop.dfs.FSEditLog.rollEditLog(FSEditLog.java:577)
        at org.apache.hadoop.dfs.FSNamesystem.rollEditLog(FSNamesystem.java:3519)
        at org.apache.hadoop.dfs.NameNode.rollEditLog(NameNode.java:553)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:340)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:566)
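
As far as I can tell, this is the checkpoint protocol going wrong:
rolling the edit log switches the namenode to logging in edits.new while
the secondary pulls the image and old edits; when the checkpoint
finishes, edits.new replaces edits. If a checkpoint dies in between,
edits.new is left behind and every later roll fails with exactly this
message. A minimal sketch of that behavior (not the actual Hadoop
source; the file handling is simplified):

import java.io.File;
import java.io.IOException;

public class EditLogRollSketch {
    private final File dir;

    EditLogRollSketch(File dir) { this.dir = dir; }

    // Start of a checkpoint: switch logging to edits.new.
    synchronized void rollEditLog() throws IOException {
        File editsNew = new File(dir, "edits.new");
        if (editsNew.exists()) {
            throw new IOException("Attempt to roll edit log but edits.new exists");
        }
        if (!editsNew.createNewFile()) {
            throw new IOException("Could not create " + editsNew);
        }
    }

    // End of a successful checkpoint: edits.new replaces edits.
    synchronized void rollFsImage() throws IOException {
        File editsNew = new File(dir, "edits.new");
        if (!editsNew.renameTo(new File(dir, "edits"))) {
            throw new IOException("Could not rename " + editsNew);
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        new File(dir, "edits.new").delete(); // clean slate for the demo
        EditLogRollSketch log = new EditLogRollSketch(dir);
        log.rollEditLog(); // checkpoint starts, edits.new is created
        // ... the secondary dies here, rollFsImage() never runs ...
        log.rollEditLog(); // next checkpoint: "edits.new exists"
    }
}

If that reading is right, the secondary namenode had not completed a
checkpoint since at least Dec 25th.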


The datanode logs on my master system look OK until New Year's Eve, when
for some reason it started moving blocks around like crazy. I noticed
the next day that it seemed to have rebalanced the whole file system.
During this process there are a number of errors like:

2007-12-31 05:04:09,413 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-8158005346611535914 to 204.16.107.200:50010 got java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DataNode.sendBlock(DataNode.java:1231)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1280)
        at java.lang.Thread.run(Thread.java:595)

2007-12-31 05:04:09,415 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-8158005346611535914 to 204.16.107.200:50010 got java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DataNode.sendBlock(DataNode.java:1231)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1280)
        at java.lang.Thread.run(Thread.java:595)
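
I can't confirm it from these logs, but one reading is that the bounce
in step 2 made every datanode re-register (hence the duplicated,
zero-block hosts in the Node manager page), the namenode then considered
most blocks under-replicated, and the "rebalancing" was really a
re-replication storm whose transfers kept hitting dead or duplicated
peers. A quick way to see what the namenode currently thinks its
datanodes are (the class name is made up; the cast and
getDataNodeStats() are assumed from the dfs client API of this era,
which is what dfsadmin -report reads):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.dfs.DatanodeInfo;
import org.apache.hadoop.dfs.DistributedFileSystem;
import org.apache.hadoop.fs.FileSystem;

public class ListDataNodes {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        // Duplicated entries here would confirm the double registration.
        for (DatanodeInfo node : dfs.getDataNodeStats()) {
            System.out.println(node.getName());
        }
        fs.close();
    }
}

Running that before and after a clean restart would show whether the
duplicates persist.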

