Posted to common-dev@hadoop.apache.org by "Christian Kunz (JIRA)" <ji...@apache.org> on 2007/03/30 18:43:25 UTC

[jira] Commented: (HADOOP-1182) DFS Scalability issue with filecache in large clusters

    [ https://issues.apache.org/jira/browse/HADOOP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485560 ] 

Christian Kunz commented on HADOOP-1182:
----------------------------------------

Indeed, the namenode server is pegged at 99.9% CPU.
top output for the namenode process as the job is submitted (note the jump from ~4% to a sustained 99.9% CPU):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 2143 crawler   16   0 2199m 175m 6248 S  4.0  4.3 278:13.64 java
 2143 crawler   16   0 2199m 172m 6248 S 79.5  4.3 278:16.03 java
 2143 crawler   16   0 2199m 176m 6248 S 99.9  4.4 278:19.52 java
 2143 crawler   16   0 2199m 184m 6248 S 99.9  4.5 278:22.99 java
 2143 crawler   16   0 2199m 188m 6248 S 99.9  4.7 278:26.42 java
 2143 crawler   16   0 2199m 188m 6248 S 99.9  4.7 278:29.84 java

'Call queue overflow discarding oldest call' warning messages appear in the namenode log.

Typical exception in namenode log:
2007-03-30 09:29:08,490 WARN org.apache.hadoop.ipc.Server: handler output error
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:108)
        at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:89)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:78)
        at java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
        at org.apache.hadoop.io.UTF8.writeChars(UTF8.java:275)
        at org.apache.hadoop.io.UTF8.writeString(UTF8.java:247)
        at org.apache.hadoop.dfs.DatanodeID.write(DatanodeID.java:138)
        at org.apache.hadoop.dfs.DatanodeInfo.write(DatanodeInfo.java:248)
        at org.apache.hadoop.dfs.LocatedBlock.write(LocatedBlock.java:76)
        at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:154)
        at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:121)
        at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:65)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:573)
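
For what it's worth, here is a minimal sketch of the discard-oldest behavior that warning describes; it is not the actual org.apache.hadoop.ipc.Server code (the class below is made up), just an illustration of how a full call queue turns into client-side RPC timeouts and, once a client gives up and closes its socket, into the ClosedChannelException above when the handler finally tries to write its response:

import java.util.LinkedList;

// Illustration only: a bounded FIFO of pending RPC calls that drops the
// oldest entry on overflow, mirroring the warning message in the log.
class BoundedCallQueue<T> {
    private final LinkedList<T> calls = new LinkedList<T>();
    private final int capacity;

    BoundedCallQueue(int capacity) {
        this.capacity = capacity;
    }

    // The reader side enqueues each incoming call.
    synchronized void put(T call) {
        if (calls.size() >= capacity) {
            // "Call queue overflow discarding oldest call": the oldest request
            // is dropped, so its client never sees a response and times out.
            calls.removeFirst();
        }
        calls.addLast(call);
    }

    // Handler threads pull calls off the queue to process and answer them.
    synchronized T take() {
        return calls.isEmpty() ? null : calls.removeFirst();
    }
}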





> DFS Scalability issue with filecache in large clusters
> ------------------------------------------------------
>
>                 Key: HADOOP-1182
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1182
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.12.1
>            Reporter: Christian Kunz
>
> When using filecache to distribute supporting files for map/reduce applications in a 1000 node cluster, many map tasks fail because of timeouts. There was no such problem using a 200 node cluster for the same applications with comparable input data. Either the whole job fails because of too many map failures, or even worse, some map tasks hang indefinitely.
> java.net.SocketTimeoutException: timed out waiting for rpc response
> 	at org.apache.hadoop.ipc.Client.call(Client.java:473)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
> 	at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
> 	at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
> 	at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
> 	at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
> 	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
> 	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
> 	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
> 	at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
> 	at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
> 	at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
> 	at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)
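
To make the load pattern concrete, here is a rough sketch of what the code path in the trace above (ifExistsAndFresh -> createMD5) amounts to for each task attempt and each cached file. This is not the actual DistributedCache source; the class and method names below are made up for illustration:

import java.io.InputStream;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustration only: roughly the work each task attempt does per cache file
// when it checks whether its local copy is still fresh.
class CacheFreshnessSketch {

    static byte[] md5OfDfsFile(Configuration conf, Path cacheFile) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        // One namenode RPC per task attempt just to see whether the file is there.
        if (!fs.exists(cacheFile)) {
            throw new java.io.FileNotFoundException(cacheFile.toString());
        }
        // open() costs further namenode RPCs (block locations), and the whole
        // file is then streamed from datanodes just to compute a digest.
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        InputStream in = fs.open(cacheFile);
        try {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) > 0) {
                md5.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        return md5.digest();
    }
}

On a 1000-node cluster every node runs a check like this for every cached file on every task launch, so the exists()/open() RPCs all converge on the single namenode; that would fit the pegged namenode CPU and the call queue overflow reported above.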

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.