Posted to common-dev@hadoop.apache.org by "Arun C Murthy (JIRA)" <ji...@apache.org> on 2008/01/04 12:27:34 UTC

[jira] Resolved: (HADOOP-2167) Reduce tips complete 100%, but job does not complete saying reduces still running.

     [ https://issues.apache.org/jira/browse/HADOOP-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy resolved HADOOP-2167.
-----------------------------------

    Resolution: Cannot Reproduce

We haven't seen this, nor have we been able to reproduce it. HADOOP-2216 also led us astray...

I'm closing this for now; please re-open if required.

> Reduce tips complete 100%, but job does not complete saying reduces still running.
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-2167
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2167
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amareshwari Sri Ramadasu
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.16.0
>
>
> The job's reduces are stuck at 99.43% progress, with 2 reduces in the running state, and the job does not complete.
> However, the reduce task list on the job tracker shows them as 100% complete and marked SUCCEEDED, and a finish time is available in jobtasks.jsp and in the job history as well.
> With ipc.client.timeout = 600000, the TTs running the reduces log the following exceptions.
> On one of the TTs, the logs show the following:
> 2007-11-07 08:34:16,092 INFO org.apache.hadoop.mapred.TaskTracker: Task task_200711070637_0001_r_000150_0 is done.
> 2007-11-07 08:35:34,013 INFO org.apache.hadoop.mapred.TaskTracker: Task task_200711070637_0001_r_000156_0 is done.
> 2007-11-07 08:42:44,751 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.net.SocketTimeoutException: timedout waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:484)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
>         at org.apache.hadoop.mapred.$Proxy0.heartbeat(Unknown Source)
>         at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:897)
>         at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:799)
>         at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1193)
>         at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2055)
> 2007-11-07 08:42:44,767 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to .................
> On the other TT,
> 2007-11-07 08:40:30,484 INFO org.apache.hadoop.mapred.TaskTracker: Task task_200711070637_0001_r_000160_0 is done.
> 2007-11-07 08:42:45,508 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.net.SocketTimeoutException: timedout waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:484)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
>         at org.apache.hadoop.mapred.$Proxy0.heartbeat(Unknown Source)
>         at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:897)
>         at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:799)
>         at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1193)
>         at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2055)
> 2007-11-07 08:42:45,508 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to ..........
> In the JT logs, the reduce tasks complete successfully:
> 2007-11-07 06:39:09,151 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200711070637_0001_r_000160_0' to tip tip_200711070637_0001_r_000160, for tracker 'x'
> 2007-11-07 08:42:45,708 INFO org.apache.hadoop.mapred.TaskRunner: Saved output of task 'task_200711070637_0001_r_000160_0' to 'y'
> 2007-11-07 08:42:45,708 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200711070637_0001_r_000160_0' has completed tip_200711070637_0001_r_000160 successfully.
> This suggests that when tasks finish before the timeout, the problem occurs in the progress update. The behaviour is not consistent, however, since other reduce tasks in the same situation succeed.
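
One way to check the timing relationship in the TaskTracker excerpts quoted above is to pair each "is done" line with the next heartbeat SocketTimeoutException in the same log and measure the gap. The following is only a rough Python sketch, assuming log lines in the format shown in this report; the function and pattern names are illustrative and are not part of Hadoop.

```python
import re
from datetime import datetime

# Matches "<timestamp> <LEVEL> <logger>: <message>", optionally prefixed by
# the "> " quoting used in this email. Stack-trace frames do not match.
LINE_RE = re.compile(
    r"^(?:>\s*)?(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+) (\S+): (.*)$"
)
DONE_RE = re.compile(r"Task (\S+) is done\.")


def parse_ts(text):
    """Parse a log timestamp such as '2007-11-07 08:34:16,092'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def done_to_timeout_gaps(lines):
    """Return {task_id: seconds} from each 'is done' line to the next
    SocketTimeoutException line that follows it in the same log."""
    pending = {}  # tasks reported done, still waiting for a timeout line
    gaps = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # stack frames and free text are skipped
        ts = parse_ts(m.group(1))
        message = m.group(4)
        done = DONE_RE.search(message)
        if done:
            pending[done.group(1)] = ts
        elif "SocketTimeoutException" in message:
            for task, done_ts in pending.items():
                gaps[task] = (ts - done_ts).total_seconds()
            pending.clear()
    return gaps
```

Run against the first TaskTracker excerpt above, this gives about 508.7 seconds for task_200711070637_0001_r_000150_0 and about 430.7 seconds for task_200711070637_0001_r_000156_0 between the "is done" line and the heartbeat timeout.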
