You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Dennis Kubes (JIRA)" <ji...@apache.org> on 2007/12/10 04:51:44 UTC

[jira] Commented: (HADOOP-2391) Speculative Execution race condition with output paths

    [ https://issues.apache.org/jira/browse/HADOOP-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549888 ] 

Dennis Kubes commented on HADOOP-2391:
--------------------------------------

After more looking I think I have a better understanding of what is happening.  Speculative execution would cause more tasks to be killed because in parallel tasks one task will finish before the other and the slower task will be killed.  Killed tasks are returned as completed immediately and have their output removed by a separate thread than the thread which completes a job.  Meaning it is possible for a job to complete and return before output for killed tasks is fully removed.

Would it be useful to have a structure that checks if killed tasks are removed before allowing a job to complete?  I think it would but I might not be fully understanding why killed tasks are returned immediately.

> Speculative Execution race condition with output paths
> ------------------------------------------------------
>
>                 Key: HADOOP-2391
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2391
>             Project: Hadoop
>          Issue Type: Bug
>         Environment: all
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>
> I am tracking a problem where when speculative execution is enabled, there is a race condition when trying to read output paths from a previously completed job.  More specifically when reduce tasks run their output is put into a working directory under the task name until the task in completed.  The directory name is something like workdir/_taskid.  Upon completion the output get moved into workdir.  Regular tasks are checked for this move and not considered completed until this move is made.  I have not verified it but all indications point to speculative tasks NOT having this same check for completion and more importantly removal when killed.  So what we end up with when trying to read the output of previous tasks with speculative execution enabled is the possibility that previous workdir/_taskid will be present when the output directory is read by a chained job.  Here is an error when supports my theory:
> Generator: org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename /u01/hadoop/mapred/temp/generate-temp-1197104928603/_task_200712080949_0005_r_000014_1
>         at org.apache.hadoop.dfs.NameNode.open(NameNode.java:234)
>         at sun.reflect.GeneratedMethodAccessor64.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:389)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:644)
>         at org.apache.hadoop.ipc.Client.call(Client.java:507)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:186)
>         at org.apache.hadoop.dfs.$Proxy0.open(Unknown Source)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at org.apache.hadoop.dfs.$Proxy0.open(Unknown Source)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:839)
>         at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:831)
>         at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:263)
>         at org.apache.hadoop.dfs.DistributedFileSystem.open(DistributedFileSystem.java:114)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1356)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1349)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1344)
>         at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:87)
>         at org.apache.nutch.crawl.Generator.generate(Generator.java:429)
>         at org.apache.nutch.crawl.Generator.run(Generator.java:563)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
>         at org.apache.nutch.crawl.Generator.main(Generator.java:526)
> I will continue to research this and post as I make progress on tracking down this bug.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.