Posted to common-dev@hadoop.apache.org by "dhruba borthakur (JIRA)" <ji...@apache.org> on 2007/03/29 19:53:25 UTC
[jira] Updated: (HADOOP-1182) DFS Scalability issue with filecache in large clusters
[ https://issues.apache.org/jira/browse/HADOOP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
dhruba borthakur updated HADOOP-1182:
-------------------------------------
Component/s: dfs (was: mapred)
Summary: DFS Scalability issue with filecache in large clusters (was: scalability issue with filecache in large clusters)
I am changing the component to "dfs" because the stack trace shows that a connection to the namenode timed out, and I suspect that this is a DFS namenode issue.
> DFS Scalability issue with filecache in large clusters
> ------------------------------------------------------
>
> Key: HADOOP-1182
> URL: https://issues.apache.org/jira/browse/HADOOP-1182
> Project: Hadoop
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.12.1
> Reporter: Christian Kunz
>
> When using filecache to distribute supporting files for map/reduce applications in a 1000 node cluster, many map tasks fail because of timeouts. There was no such problem using a 200 node cluster for the same applications with comparable input data. Either the whole job fails because of too many map failures, or even worse, some map tasks hang indefinitely.
> java.net.SocketTimeoutException: timed out waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:473)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
> at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
> at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
> at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
> at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
> at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
> at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
> at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)
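The call chain in the trace (getLocalCache -> localizeCache -> ifExistsAndFresh -> createMD5 -> FileSystem.open -> exists) suggests that each task re-validates every cached file by opening and digesting it, which costs at least one namenode RPC per file per task. The following is a minimal sketch of that pattern. It assumes that createMD5 digests the file's full contents, and it uses the local filesystem via java.nio in place of DFS. The class and method names are hypothetical stand-ins and are not Hadoop's actual code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical illustration of the freshness check suggested by the stack trace.
// The local filesystem stands in for DFS; on DFS, each open() also triggered
// an exists() RPC to the namenode.
public class CacheFreshnessSketch {

    // Hypothetical stand-in for DistributedCache.createMD5:
    // streams the entire file through an MD5 digest.
    static byte[] createMd5(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md5.update(buf, 0, n);
            }
        }
        return md5.digest();
    }

    // Hypothetical stand-in for ifExistsAndFresh: the local copy is reused only
    // if the source still exists and its digest matches the cached digest.
    static boolean ifExistsAndFresh(Path source, byte[] cachedDigest)
            throws IOException, NoSuchAlgorithmException {
        if (!Files.exists(source)) {
            return false;
        }
        return Arrays.equals(createMd5(source), cachedDigest);
    }

    public static void main(String[] args) throws Exception {
        Path source = Files.createTempFile("cache-", ".dat");
        Files.write(source, "supporting data".getBytes(StandardCharsets.UTF_8));
        byte[] cached = createMd5(source);

        System.out.println(ifExistsAndFresh(source, cached)); // true: unchanged
        Files.write(source, "changed data".getBytes(StandardCharsets.UTF_8));
        System.out.println(ifExistsAndFresh(source, cached)); // false: contents differ
        Files.delete(source);
        System.out.println(ifExistsAndFresh(source, cached)); // false: file is gone
    }
}
```

Because this check runs in TaskRunner before each task starts, the number of exists/open calls reaching the namenode in this sketch grows with the number of concurrently starting tasks. That is consistent with the timeouts appearing on the 1000 node cluster but not on the 200 node cluster.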