You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Raghu Angadi (JIRA)" <ji...@apache.org> on 2007/06/13 08:26:26 UTC

[jira] Commented: (HADOOP-1263) retry logic when dfs exist or open fails temporarily, e.g because of timeout

    [ https://issues.apache.org/jira/browse/HADOOP-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12504135 ] 

Raghu Angadi commented on HADOOP-1263:
--------------------------------------

I am planning to use this framework for some new RPCs I am adding. I just want to confirm if my understanding is correct: This patch adds a random exponential back off timeout starting with 400 milliseconds for 5 times. In all 5 retries, this add a max of 12 seconds. Since client RPC timeout is 60sec, time it takes for such RPC to fail takes between 300-312 seconds over 6 attempts. Is this expected?, because it is not exponential back off but essentially constant timeout  of around 60sec for each retry.


> retry logic when dfs exist or open fails temporarily, e.g because of timeout
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-1263
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1263
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.12.3
>            Reporter: Christian Kunz
>            Assignee: Hairong Kuang
>             Fix For: 0.13.0
>
>         Attachments: retry.patch, retry1.patch
>
>
> Sometimes, when many (e.g. 1000+) map jobs start at about the same time and require supporting files from filecache, it happens that some map tasks fail because of rpc timeouts. With only the default number of 10 handlers on the namenode, the probability is high that the whole job fails (see Hadoop-1182). It is much better with a higher number of handlers, but some map tasks still fail.
> This could be avoided if rpc clients did retry when encountering a timeout before throwing an exception.
> Examples of exceptions:
> java.net.SocketTimeoutException: timed out waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:473)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
> at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
> at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
> at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
> at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.(ChecksumFileSystem.java:110)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
> at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
> at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
> at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
> at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)
> java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:473)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
>         at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
>         at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
>         at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
>         at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
>         at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
>         at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
>         at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
>         at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
>         at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
>         at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.