You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-user@hadoop.apache.org by Lance Riedel <la...@dotspots.com> on 2009/04/30 18:33:12 UTC

Infinite Loop Resending status from task tracker

Has anyone seen this before? Our task tracker produced a 2.7 gig log  
file in a few hours. The entry is all the same (every 2 ms):

2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:  
Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with  
reponseId '5341
2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:  
Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with  
reponseId '5341
2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:  
Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with  
reponseId '5341
2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:  
Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with  
reponseId '5341
2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:  
Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with  
reponseId '5341
2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:  
Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with  
reponseId '5341
2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:  
Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with  
reponseId '5341
... (And on and on and on...)


These are the few lines before it started:

2009-04-30 02:34:29,780 INFO  
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: xxx.xxx.xxx.xxx: 
50060, dest: 10.253.178.95:40268, bytes: 3341324, op: MAPRED_SHUFFLE,  
cliID: attempt_200904291917_0352_m_000006_0
2009-04-30 02:34:31,522 INFO org.apache.hadoop.mapred.TaskTracker:  
Sent out 418891 bytes for reduce: 12 from map:  
attempt_200904291917_0352_m_000007_0 given 418891/418887 from 4301462  
with (22, 171)
2009-04-30 02:34:31,522 INFO  
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: xxx.xxx.xxx.xxx: 
50060, dest: xxx.xxx.xxx.xxx:40268, bytes: 418891, op: MAPRED_SHUFFLE,  
cliID: attempt_200904291917_0352_m_000007_0
2009-04-30 02:34:35,382 INFO org.apache.hadoop.mapred.TaskTracker:  
attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of  
11 at 0.32 MB/s) >
2009-04-30 02:34:38,385 INFO org.apache.hadoop.mapred.TaskTracker:  
attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of  
11 at 0.32 MB/s) >
2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:  
Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with  
reponseId '5341
2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:  
Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with  
reponseId '5341
2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:  
Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with  
reponseId '5341
2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:  
Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with  
reponseId '5341
2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:  
Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with  
reponseId '5341
2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:  
Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with  
reponseId '5341
2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:  
Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with  
reponseId '5341

--And on for 	2+ gigs

Re: Infinite Loop Resending status from task tracker

Posted by Lance Riedel <la...@dotspots.com>.

Sorry, had missed that Todd had created Jira -
HADOOP-5761<https://issues.apache.org/jira/browse/HADOOP-5761>

Any progress there?

Thanks,
Lance

On Thu, May 14, 2009 at 8:52 AM, Lance Riedel <la...@dotspots.com> wrote:

> Here is the point in the logs where the infinite loop begins - see time
> stamp 2009-05-14 04:03:56,348 : (JobTracker)
>
> 2009-05-14 04:03:56,324 INFO org.apache.hadoop.mapred.JobTracker: Removed
> completed task 'attempt_200905122015_1168_m_000029_0' from
> 'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/
> 127.0.0.1:35214'
> 2009-05-14 04:03:56,326 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000007_0' to tip
> task_200905122015_1183_r_000007, for tracker
> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
> 127.0.0.1:58504'
> 2009-05-14 04:03:56,327 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task (cleanup)'attempt_200905122015_1184_m_000000_1' to tip
> task_200905122015_1184_m_000000, for tracker
> 'tracker_domU-12-31-38-00-80-21.compute-1.internal:localhost.localdomain/
> 127.0.0.1
> :57741'
> 2009-05-14 04:03:56,330 INFO org.apache.hadoop.mapred.JobInProgress: Task
> 'attempt_200905122015_1182_r_000011_0' has completed
> task_200905122015_1182_r_000011 successfully.
> 2009-05-14 04:03:56,330 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task (cleanup)'attempt_200905122015_1182_r_000010_1' to tip
> task_200905122015_1182_r_000010, for tracker
> 'tracker_domU-12-31-38-01-5C-41.compute-1.internal:localhost.localdomain/
> 127.0.0.1
> :46248'
> 2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200905122015_0499_r_000004_1
> 2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.JobInProgress: Task
> 'attempt_200905122015_1184_m_000004_1' has completed
> task_200905122015_1184_m_000004 successfully.
> 2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.ResourceEstimator:
> measured blowup on task_200905122015_1184_m_000004 was 20150008/21581175 =
> 0.93368447269437372009-05-14 04:03:56,331 INFO
> org.apache.hadoop.mapred.ResourceEstimator: new estimate is blowup =
> 0.9152383292400812
> 2009-05-14 04:03:56,335 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000008_0' to tip
> task_200905122015_1183_r_000008, for tracker
> 'tracker_domU-12-31-38-00-80-21.compute-1.internal:localhost.localdomain/
> 127.0.0.1:57741'
> 2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000009_0' to tip
> task_200905122015_1183_r_000009, for tracker
> 'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/
> 127.0.0.1:35214'
> 2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Removed
> completed task 'attempt_200905122015_1181_r_000009_0' from
> 'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/
> 127.0.0.1:35214'
> 2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200905122015_0499_r_000004_1
> 2009-05-14 04:03:56,337 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000010_0' to tip
> task_200905122015_1183_r_000010, for tracker
> 'tracker_domU-12-31-38-01-81-31.compute-1.internal:localhost.localdomain/
> 127.0.0.1:46518'
> 2009-05-14 04:03:56,343 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200905122015_1070_r_000014_1
> 2009-05-14 04:03:56,344 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000011_0' to tip
> task_200905122015_1183_r_000011, for tracker
> 'tracker_domU-12-31-38-01-AD-91.compute-1.internal:localhost.localdomain/
> 127.0.0.1:33929'
> 2009-05-14 04:03:56,348 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 9 on 54311, call
> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@a4fe4, false, true,
> 21461) from 10.253.178.95:50709: error: java.io.IOException:
> java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
>         at
> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>         at
> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>         at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown
> Source)        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
> 2009-05-14 04:03:56,351 INFO org.apache.hadoop.mapred.JobTracker:
> attempt_200905122015_1183_r_000007_0 is 24 ms debug.
> 2009-05-14 04:03:56,419 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000012_0' to tip
> task_200905122015_1183_r_000012, for tracker
> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
> 127.0.0.1:58504'
> 2009-05-14 04:03:56,459 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200905122015_1070_r_000014_1
> 2009-05-14 04:03:56,459 WARN org.apache.hadoop.mapred.TaskInProgress:
> Recieved duplicate status update of 'KILLED' for
> 'attempt_200905122015_1182_r_000009_1' of TIP
> 'task_200905122015_1182_r_000009'
> 2009-05-14 04:03:56,460 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000013_0' to tip
> task_200905122015_1183_r_000013, for tracker
> 'tracker_domU-12-31-38-01-AD-91.compute-1.internal:localhost.localdomain/
> 127.0.0.1:33929'
> 2009-05-14 04:03:56,460 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 5 on 54311, call
> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1f1c507, false, true,
> 21461) from 10.253.178.95:50709: error: java.io.IOException:
> java.lang.NullPointerExceptio
> n
> java.io.IOException: java.lang.NullPointerException
>         at
> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
> at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>         at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown Source)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
> 2009-05-14 04:03:56,464 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200905122015_1070_r_000014_1
> 2009-05-14 04:03:56,464 WARN org.apache.hadoop.mapred.TaskInProgress:
> Recieved duplicate status update of 'KILLED' for
> 'attempt_200905122015_1182_r_000009_1' of TIP
> 'task_200905122015_1182_r_000009'
> 2009-05-14 04:03:56,465 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000014_0' to tip
> task_200905122015_1183_r_000014, for tracker
> 'tracker_domU-12-31-38-01-AD-91.compute-1.internal:localhost.localdomain/
> 127.0.0.1:33929'
> 2009-05-14 04:03:56,465 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 2 on 54311, call
> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1d54fc9, false, true,
> 21461) from 10.253.178.95:50709: error: java.io.IOException:
> java.lang.NullPointerExceptio
> n
>
>
> <and on and on an on for 10 gigs worth in a few hours>
>
>
>
>
> On Thu, May 14, 2009 at 8:37 AM, Lance Riedel <la...@dotspots.com> wrote:
>
>> Just had another cluster crash with the same issue. This is still a huge
>> issue for us- still crashing our cluster every other night (actually almost
>> every night now).
>>
>> Should we move to .20? Is there more information i can provide?  Is this
>> related to my other email "Constantly getting DiskErrorExceptions - but
>> logged as INFO"? I haven't seen responses on that.
>>
>> Thanks!
>> Lance
>>
>> On Thu, May 14, 2009 at 7:48 AM, Lance Riedel <la...@dotspots.com> wrote:
>>
>>> Here is the latest here.. Haven't heard any more, but every other night
>>> we get 10 gigs logs and tons of failed tasks and have to restart the cluster
>>>
>>> ---------- Forwarded message ----------
>>> From: Lance Riedel <la...@dotspots.com>
>>> Date: Fri, May 8, 2009 at 10:49 AM
>>> Subject: Re: Infinite Loop Resending status from task tracker
>>> To: core-user@hadoop.apache.org
>>> Cc: Brian Long <br...@dotspots.com>
>>>
>>>
>>> Hi Todd,
>>> Sorry, my response got hung up in my outbox for a couple of days.. arghh
>>>
>>>
>>>
>>> Confirmed that 1) we are not running out of space and 2) that our
>>> mapred.local.dir directory is not in /tmp
>>>
>>> Not sure if this an ec2 problem with a mounted drive?
>>>
>>> We had the same thing happen again, exact same logs and symptoms
>>> (simultaneous in jobtracker and tasktracker)
>>>
>>> Thinking about moving to .20 because of this, any thoughts on that?
>>>
>>> Thanks,
>>> Lance
>>>
>>>
>>> On May 4, 2009, at 4:18 PM, Todd Lipcon wrote:
>>>
>>>  Hi Lance,
>>>>
>>>> Two thoughts here that might be the culprit:
>>>>
>>>> 1) Is it possible that the partition that your mapred.local.dir is on is
>>>> out
>>>> of space on that task tracker?
>>>>
>>>> 2) Is it possible that you're using a directory under /tmp for
>>>> mapred.local.dir and some system cron script cleared out /tmp?
>>>>
>>>> -Todd
>>>>
>>>> On Sat, May 2, 2009 at 9:01 AM, Lance Riedel <la...@dotspots.com>
>>>> wrote:
>>>>
>>>>  Hi Todd,
>>>>> Not sure if this is related, but our hadoop cluster in general is
>>>>> getting
>>>>> more and more unstable.  the logs are full of this error message (but
>>>>> having
>>>>> trouble tracking down the root problem):
>>>>>
>>>>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:49,297 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:54,298 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
>>>>> in any of the configured local directories
>>>>>
>>>>>
>>>>> Lance
>>>>>
>>>>>
>>>>> On Apr 30, 2009, at 12:04 PM, Todd Lipcon wrote:
>>>>>
>>>>> Hey Lance,
>>>>>
>>>>>>
>>>>>> Thanks for the logs. They definitely confirmed my suspicion. There are
>>>>>> two
>>>>>> problems here:
>>>>>>
>>>>>> 1) If the JobTracker throws an exception during processing of a
>>>>>> heartbeat,
>>>>>> the tasktracker retries with no delay, since lastHeartbeat isn't
>>>>>> updated
>>>>>> in
>>>>>> TaskTracker.offerService. This is related to HADOOP-3987
>>>>>>
>>>>>> 2) If the TaskTracker sends a task in COMMIT_PENDING state with an
>>>>>> invalid
>>>>>> task id, the jobtracker will trigger a NullPointerException in
>>>>>> JobTracker.getTasksToSave. Instead it should probably create a
>>>>>> KillTaskAction. I just filed a JIRA to track this issue:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/HADOOP-5761
>>>>>>
>>>>>> 3) The TaskTracker somehow had a task attempt in COMMIT_PENDING state
>>>>>> that
>>>>>> the JobTracker didn't know about. How it got there is a separate
>>>>>> problem
>>>>>> that's a bit harder to track down.
>>>>>>
>>>>>> Thanks
>>>>>> -Todd
>>>>>>
>>>>>> On Thu, Apr 30, 2009 at 11:17 AM, Lance Riedel <la...@dotspots.com>
>>>>>> wrote:
>>>>>>
>>>>>> Here are the job tracker logs from the same time (and yes.. there is
>>>>>>
>>>>>>> something there!!):
>>>>>>>
>>>>>>>
>>>>>>> 2009-04-30 02:34:28,484 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Serious
>>>>>>> problem.  While updating status, cannot find taskid
>>>>>>> attempt_200904291917_0252_r_000003_0
>>>>>>>
>>>>>>>
>>>>>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>>>> handler 2 on 54311, call
>>>>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1a93388, false,
>>>>>>> true,
>>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>>>> java.lang.NullPointerException
>>>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>>>    at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>>>    at
>>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>>>    at
>>>>>>>
>>>>>>>
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Serious
>>>>>>> problem.  While updating status, cannot find taskid
>>>>>>> attempt_200904291917_0296_r_000014_1
>>>>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Adding
>>>>>>> task 'attempt_200904291917_0352_r_000013_0' to tip
>>>>>>> task_200904291917_0352_r_000013, for tracker
>>>>>>>
>>>>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>>>>>> 127.0.0.1:42479'
>>>>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Removed
>>>>>>> completed task 'attempt_200904291917_0343_m_000003_0' from
>>>>>>>
>>>>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>>>>>> 127.0.0.1:42479'
>>>>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Removed
>>>>>>> completed task 'attempt_200904291917_0343_m_000007_0' from
>>>>>>>
>>>>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>>>>>> 127.0.0.1:42479'
>>>>>>>
>>>>>>>
>>>>>>> And then.. a LOT more
>>>>>>>
>>>>>>>
>>>>>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Serious
>>>>>>> problem.  While updating status, cannot find taskid
>>>>>>> attempt_200904291917_0252_r_000003_0
>>>>>>> 2009-04-30 02:34:40,433 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>>>> Recieved duplicate status update of 'KILLED' for
>>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>>>> 'task_200904291917_0352_m_000010'
>>>>>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>>>> handler 1 on 54311, call
>>>>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1b7b4c1, false,
>>>>>>> true,
>>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>>>> java.lang.NullPointerException
>>>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>>>    at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>>>    at
>>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>>>    at
>>>>>>>
>>>>>>>
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>>> 2009-04-30 02:34:40,441 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Serious
>>>>>>> problem.  While updating status, cannot find taskid
>>>>>>> attempt_200904291917_0252_r_000003_0
>>>>>>> 2009-04-30 02:34:40,441 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>>>> Recieved duplicate status update of 'KILLED' for
>>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>>>> 'task_200904291917_0352_m_000010'
>>>>>>> 2009-04-30 02:34:40,442 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>>>> handler 1 on 54311, call
>>>>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1598c57, false,
>>>>>>> true,
>>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>>>> java.lang.NullPointerException
>>>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>>>    at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>>>    at
>>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>>>    at
>>>>>>>
>>>>>>>
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Serious
>>>>>>> problem.  While updating status, cannot find taskid
>>>>>>> attempt_200904291917_0252_r_000003_0
>>>>>>> 2009-04-30 02:34:40,444 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>>>> Recieved duplicate status update of 'KILLED' for
>>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>>>> 'task_200904291917_0352_m_000010'
>>>>>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>>>> handler 7 on 54311, call
>>>>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@cb3d5, false,
>>>>>>> true,
>>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>>>> java.lang.NullPointerException
>>>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>>>    at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>>>    at
>>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>>>    at
>>>>>>>
>>>>>>>
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Serious
>>>>>>> problem.  While updating status, cannot find taskid
>>>>>>> attempt_200904291917_0252_r_000003_0
>>>>>>> 2009-04-30 02:34:40,447 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>>>> Recieved duplicate status update of 'KILLED' for
>>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>>>> 'task_200904291917_0352_m_000010'
>>>>>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>>>> handler 1 on 54311, call
>>>>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1f27d98, false,
>>>>>>> true,
>>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>>>> java.lang.NullPointerException
>>>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>>>    at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>>>    at
>>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>>>    at
>>>>>>>
>>>>>>>
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Apr 30, 2009, at 11:04 AM, Todd Lipcon wrote:
>>>>>>>
>>>>>>> Hey Lance,
>>>>>>>
>>>>>>>
>>>>>>>> Did you see any error messages in the JobTracker logs around the
>>>>>>>> time
>>>>>>>> this
>>>>>>>> started? I think I understand how this might happen.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>> On Thu, Apr 30, 2009 at 10:45 AM, Lance Riedel <la...@dotspots.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I have not been able to reproduce.  We are using version 19.1 with
>>>>>>>> the
>>>>>>>>
>>>>>>>>  following patches:
>>>>>>>>> 4780-2v19.patch (Jira  4780)
>>>>>>>>> closeAll3.patch (Jira 3998)
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Lance
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote:
>>>>>>>>>
>>>>>>>>> Hi Lance,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Can I ask what version you were running when you saw this? Is it
>>>>>>>>>> reproducible? I'm trying to look at the code path that might
>>>>>>>>>> produce
>>>>>>>>>> such
>>>>>>>>>> a
>>>>>>>>>> behavior and want to make sure I'm looking at the right version.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> -Todd
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <lance@dotspots.com
>>>>>>>>>> >
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Has anyone seen this before? Our task tracker produced a 2.7 gig
>>>>>>>>>> log
>>>>>>>>>> file
>>>>>>>>>>
>>>>>>>>>> in a few hours. The entry is all the same (every 2 ms):
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2009-04-30 02:34:40,207 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,398 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,403 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,411 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,414 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,417 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,420 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> ... (And on and on and on...)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> These are the few lines before it started:
>>>>>>>>>>>
>>>>>>>>>>> 2009-04-30 02:34:29,780 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>>>>>>> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes:
>>>>>>>>>>> 3341324,
>>>>>>>>>>> op:
>>>>>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0
>>>>>>>>>>> 2009-04-30 02:34:31,522 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Sent
>>>>>>>>>>> out
>>>>>>>>>>> 418891 bytes for reduce: 12 from map:
>>>>>>>>>>> attempt_200904291917_0352_m_000007_0
>>>>>>>>>>> given 418891/418887 from 4301462 with (22, 171)
>>>>>>>>>>> 2009-04-30 02:34:31,522 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>>>>>>> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes:
>>>>>>>>>>> 418891,
>>>>>>>>>>> op:
>>>>>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0
>>>>>>>>>>> 2009-04-30 02:34:35,382 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10
>>>>>>>>>>> of
>>>>>>>>>>> 11
>>>>>>>>>>> at
>>>>>>>>>>> 0.32 MB/s) >
>>>>>>>>>>> 2009-04-30 02:34:38,385 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10
>>>>>>>>>>> of
>>>>>>>>>>> 11
>>>>>>>>>>> at
>>>>>>>>>>> 0.32 MB/s) >
>>>>>>>>>>> 2009-04-30 02:34:40,207 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,398 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,403 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,411 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,414 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,417 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,420 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>>
>>>>>>>>>>> --And on for    2+ gigs
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>>>
>>
>

Re: Infinite Loop Resending status from task tracker

Posted by Lance Riedel <la...@dotspots.com>.

Here is the point in the logs where the infinite loop begins - see time
stamp 2009-05-14 04:03:56,348 : (JobTracker)

2009-05-14 04:03:56,324 INFO org.apache.hadoop.mapred.JobTracker: Removed
completed task 'attempt_200905122015_1168_m_000029_0' from
'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/
127.0.0.1:35214'
2009-05-14 04:03:56,326 INFO org.apache.hadoop.mapred.JobTracker: Adding
task 'attempt_200905122015_1183_r_000007_0' to tip
task_200905122015_1183_r_000007, for tracker
'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
127.0.0.1:58504'
2009-05-14 04:03:56,327 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (cleanup)'attempt_200905122015_1184_m_000000_1' to tip
task_200905122015_1184_m_000000, for tracker
'tracker_domU-12-31-38-00-80-21.compute-1.internal:localhost.localdomain/
127.0.0.1
:57741'
2009-05-14 04:03:56,330 INFO org.apache.hadoop.mapred.JobInProgress: Task
'attempt_200905122015_1182_r_000011_0' has completed
task_200905122015_1182_r_000011 successfully.
2009-05-14 04:03:56,330 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (cleanup)'attempt_200905122015_1182_r_000010_1' to tip
task_200905122015_1182_r_000010, for tracker
'tracker_domU-12-31-38-01-5C-41.compute-1.internal:localhost.localdomain/
127.0.0.1
:46248'
2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.JobTracker: Serious
problem.  While updating status, cannot find taskid
attempt_200905122015_0499_r_000004_1
2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.JobInProgress: Task
'attempt_200905122015_1184_m_000004_1' has completed
task_200905122015_1184_m_000004 successfully.
2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.ResourceEstimator:
measured blowup on task_200905122015_1184_m_000004 was 20150008/21581175 =
0.93368447269437372009-05-14 04:03:56,331 INFO
org.apache.hadoop.mapred.ResourceEstimator: new estimate is blowup =
0.9152383292400812
2009-05-14 04:03:56,335 INFO org.apache.hadoop.mapred.JobTracker: Adding
task 'attempt_200905122015_1183_r_000008_0' to tip
task_200905122015_1183_r_000008, for tracker
'tracker_domU-12-31-38-00-80-21.compute-1.internal:localhost.localdomain/
127.0.0.1:57741'
2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Adding
task 'attempt_200905122015_1183_r_000009_0' to tip
task_200905122015_1183_r_000009, for tracker
'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/
127.0.0.1:35214'
2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Removed
completed task 'attempt_200905122015_1181_r_000009_0' from
'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/
127.0.0.1:35214'
2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Serious
problem.  While updating status, cannot find taskid
attempt_200905122015_0499_r_000004_1
2009-05-14 04:03:56,337 INFO org.apache.hadoop.mapred.JobTracker: Adding
task 'attempt_200905122015_1183_r_000010_0' to tip
task_200905122015_1183_r_000010, for tracker
'tracker_domU-12-31-38-01-81-31.compute-1.internal:localhost.localdomain/
127.0.0.1:46518'
2009-05-14 04:03:56,343 INFO org.apache.hadoop.mapred.JobTracker: Serious
problem.  While updating status, cannot find taskid
attempt_200905122015_1070_r_000014_1
2009-05-14 04:03:56,344 INFO org.apache.hadoop.mapred.JobTracker: Adding
task 'attempt_200905122015_1183_r_000011_0' to tip
task_200905122015_1183_r_000011, for tracker
'tracker_domU-12-31-38-01-AD-91.compute-1.internal:localhost.localdomain/
127.0.0.1:33929'
2009-05-14 04:03:56,348 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 9 on 54311, call
heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@a4fe4, false, true,
21461) from 10.253.178.95:50709: error: java.io.IOException:
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
        at
org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
        at
org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
        at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown
Source)        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
2009-05-14 04:03:56,351 INFO org.apache.hadoop.mapred.JobTracker:
attempt_200905122015_1183_r_000007_0 is 24 ms debug.
2009-05-14 04:03:56,419 INFO org.apache.hadoop.mapred.JobTracker: Adding
task 'attempt_200905122015_1183_r_000012_0' to tip
task_200905122015_1183_r_000012, for tracker
'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
127.0.0.1:58504'
2009-05-14 04:03:56,459 INFO org.apache.hadoop.mapred.JobTracker: Serious
problem.  While updating status, cannot find taskid
attempt_200905122015_1070_r_000014_1
2009-05-14 04:03:56,459 WARN org.apache.hadoop.mapred.TaskInProgress:
Recieved duplicate status update of 'KILLED' for
'attempt_200905122015_1182_r_000009_1' of TIP
'task_200905122015_1182_r_000009'
2009-05-14 04:03:56,460 INFO org.apache.hadoop.mapred.JobTracker: Adding
task 'attempt_200905122015_1183_r_000013_0' to tip
task_200905122015_1183_r_000013, for tracker
'tracker_domU-12-31-38-01-AD-91.compute-1.internal:localhost.localdomain/
127.0.0.1:33929'
2009-05-14 04:03:56,460 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 5 on 54311, call
heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1f1c507, false, true,
21461) from 10.253.178.95:50709: error: java.io.IOException:
java.lang.NullPointerExceptio
n
java.io.IOException: java.lang.NullPointerException
        at
org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
        at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
2009-05-14 04:03:56,464 INFO org.apache.hadoop.mapred.JobTracker: Serious
problem.  While updating status, cannot find taskid
attempt_200905122015_1070_r_000014_1
2009-05-14 04:03:56,464 WARN org.apache.hadoop.mapred.TaskInProgress:
Recieved duplicate status update of 'KILLED' for
'attempt_200905122015_1182_r_000009_1' of TIP
'task_200905122015_1182_r_000009'
2009-05-14 04:03:56,465 INFO org.apache.hadoop.mapred.JobTracker: Adding
task 'attempt_200905122015_1183_r_000014_0' to tip
task_200905122015_1183_r_000014, for tracker
'tracker_domU-12-31-38-01-AD-91.compute-1.internal:localhost.localdomain/
127.0.0.1:33929'
2009-05-14 04:03:56,465 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 54311, call
heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1d54fc9, false, true,
21461) from 10.253.178.95:50709: error: java.io.IOException:
java.lang.NullPointerExceptio
n


<and on and on an on for 10 gigs worth in a few hours>



On Thu, May 14, 2009 at 8:37 AM, Lance Riedel <la...@dotspots.com> wrote:

> Just had another cluster crash with the same issue. This is still a huge
> issue for us- still crashing our cluster every other night (actually almost
> every night now).
>
> Should we move to .20? Is there more information i can provide?  Is this
> related to my other email "Constantly getting DiskErrorExceptions - but
> logged as INFO"? I haven't seen responses on that.
>
> Thanks!
> Lance
>
> On Thu, May 14, 2009 at 7:48 AM, Lance Riedel <la...@dotspots.com> wrote:
>
>> Here is the latest here.. Haven't heard any more, but every other night we
>> get 10 gigs logs and tons of failed tasks and have to restart the cluster
>>
>> ---------- Forwarded message ----------
>> From: Lance Riedel <la...@dotspots.com>
>> Date: Fri, May 8, 2009 at 10:49 AM
>> Subject: Re: Infinite Loop Resending status from task tracker
>> To: core-user@hadoop.apache.org
>> Cc: Brian Long <br...@dotspots.com>
>>
>>
>> Hi Todd,
>> Sorry, my response got hung up in my outbox for a couple of days.. arghh
>>
>>
>>
>> Confirmed that 1) we are not running out of space and 2) that our
>> mapred.local.dir directory is not in /tmp
>>
>> Not sure if this an ec2 problem with a mounted drive?
>>
>> We had the same thing happen again, exact same logs and symptoms
>> (simultaneous in jobtracker and tasktracker)
>>
>> Thinking about moving to .20 because of this, any thoughts on that?
>>
>> Thanks,
>> Lance
>>
>>
>> On May 4, 2009, at 4:18 PM, Todd Lipcon wrote:
>>
>>  Hi Lance,
>>>
>>> Two thoughts here that might be the culprit:
>>>
>>> 1) Is it possible that the partition that your mapred.local.dir is on is
>>> out
>>> of space on that task tracker?
>>>
>>> 2) Is it possible that you're using a directory under /tmp for
>>> mapred.local.dir and some system cron script cleared out /tmp?
>>>
>>> -Todd
>>>
>>> On Sat, May 2, 2009 at 9:01 AM, Lance Riedel <la...@dotspots.com> wrote:
>>>
>>>  Hi Todd,
>>>> Not sure if this is related, but our hadoop cluster in general is
>>>> getting
>>>> more and more unstable.  the logs are full of this error message (but
>>>> having
>>>> trouble tracking down the root problem):
>>>>
>>>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>
>>>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
>>>> in any of the configured local directories
>>>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>
>>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
>>>> in any of the configured local directories
>>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>
>>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
>>>> in any of the configured local directories
>>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>
>>>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
>>>> in any of the configured local directories
>>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>
>>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
>>>> in any of the configured local directories
>>>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>
>>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
>>>> in any of the configured local directories
>>>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>
>>>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
>>>> in any of the configured local directories
>>>> 2009-05-02 11:30:49,297 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>
>>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
>>>> in any of the configured local directories
>>>> 2009-05-02 11:30:54,298 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>
>>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
>>>> in any of the configured local directories
>>>>
>>>>
>>>> Lance
>>>>
>>>>
>>>> On Apr 30, 2009, at 12:04 PM, Todd Lipcon wrote:
>>>>
>>>> Hey Lance,
>>>>
>>>>>
>>>>> Thanks for the logs. They definitely confirmed my suspicion. There are
>>>>> two
>>>>> problems here:
>>>>>
>>>>> 1) If the JobTracker throws an exception during processing of a
>>>>> heartbeat,
>>>>> the tasktracker retries with no delay, since lastHeartbeat isn't
>>>>> updated
>>>>> in
>>>>> TaskTracker.offerService. This is related to HADOOP-3987
>>>>>
>>>>> 2) If the TaskTracker sends a task in COMMIT_PENDING state with an
>>>>> invalid
>>>>> task id, the jobtracker will trigger a NullPointerException in
>>>>> JobTracker.getTasksToSave. Instead it should probably create a
>>>>> KillTaskAction. I just filed a JIRA to track this issue:
>>>>>
>>>>> https://issues.apache.org/jira/browse/HADOOP-5761
>>>>>
>>>>> 3) The TaskTracker somehow had a task attempt in COMMIT_PENDING state
>>>>> that
>>>>> the JobTracker didn't know about. How it got there is a separate
>>>>> problem
>>>>> that's a bit harder to track down.
>>>>>
>>>>> Thanks
>>>>> -Todd
>>>>>
>>>>> On Thu, Apr 30, 2009 at 11:17 AM, Lance Riedel <la...@dotspots.com>
>>>>> wrote:
>>>>>
>>>>> Here are the job tracker logs from the same time (and yes.. there is
>>>>>
>>>>>> something there!!):
>>>>>>
>>>>>>
>>>>>> 2009-04-30 02:34:28,484 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>> Serious
>>>>>> problem.  While updating status, cannot find taskid
>>>>>> attempt_200904291917_0252_r_000003_0
>>>>>>
>>>>>>
>>>>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>>> handler 2 on 54311, call
>>>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1a93388, false,
>>>>>> true,
>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>>> java.lang.NullPointerException
>>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>>    at
>>>>>>
>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>>    at
>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>>    at
>>>>>>
>>>>>>
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>> Serious
>>>>>> problem.  While updating status, cannot find taskid
>>>>>> attempt_200904291917_0296_r_000014_1
>>>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>> Adding
>>>>>> task 'attempt_200904291917_0352_r_000013_0' to tip
>>>>>> task_200904291917_0352_r_000013, for tracker
>>>>>>
>>>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>>>>> 127.0.0.1:42479'
>>>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>> Removed
>>>>>> completed task 'attempt_200904291917_0343_m_000003_0' from
>>>>>>
>>>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>>>>> 127.0.0.1:42479'
>>>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>> Removed
>>>>>> completed task 'attempt_200904291917_0343_m_000007_0' from
>>>>>>
>>>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>>>>> 127.0.0.1:42479'
>>>>>>
>>>>>>
>>>>>> And then.. a LOT more
>>>>>>
>>>>>>
>>>>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>> Serious
>>>>>> problem.  While updating status, cannot find taskid
>>>>>> attempt_200904291917_0252_r_000003_0
>>>>>> 2009-04-30 02:34:40,433 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>>> Recieved duplicate status update of 'KILLED' for
>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>>> 'task_200904291917_0352_m_000010'
>>>>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>>> handler 1 on 54311, call
>>>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1b7b4c1, false,
>>>>>> true,
>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>>> java.lang.NullPointerException
>>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>>    at
>>>>>>
>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>>    at
>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>>    at
>>>>>>
>>>>>>
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>> 2009-04-30 02:34:40,441 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>> Serious
>>>>>> problem.  While updating status, cannot find taskid
>>>>>> attempt_200904291917_0252_r_000003_0
>>>>>> 2009-04-30 02:34:40,441 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>>> Recieved duplicate status update of 'KILLED' for
>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>>> 'task_200904291917_0352_m_000010'
>>>>>> 2009-04-30 02:34:40,442 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>>> handler 1 on 54311, call
>>>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1598c57, false,
>>>>>> true,
>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>>> java.lang.NullPointerException
>>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>>    at
>>>>>>
>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>>    at
>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>>    at
>>>>>>
>>>>>>
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>> Serious
>>>>>> problem.  While updating status, cannot find taskid
>>>>>> attempt_200904291917_0252_r_000003_0
>>>>>> 2009-04-30 02:34:40,444 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>>> Recieved duplicate status update of 'KILLED' for
>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>>> 'task_200904291917_0352_m_000010'
>>>>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>>> handler 7 on 54311, call
>>>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@cb3d5, false,
>>>>>> true,
>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>>> java.lang.NullPointerException
>>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>>    at
>>>>>>
>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>>    at
>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>>    at
>>>>>>
>>>>>>
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>> Serious
>>>>>> problem.  While updating status, cannot find taskid
>>>>>> attempt_200904291917_0252_r_000003_0
>>>>>> 2009-04-30 02:34:40,447 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>>> Recieved duplicate status update of 'KILLED' for
>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>>> 'task_200904291917_0352_m_000010'
>>>>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>>> handler 1 on 54311, call
>>>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1f27d98, false,
>>>>>> true,
>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>>> java.lang.NullPointerException
>>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>>    at
>>>>>>
>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>>    at
>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>>    at
>>>>>>
>>>>>>
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Apr 30, 2009, at 11:04 AM, Todd Lipcon wrote:
>>>>>>
>>>>>> Hey Lance,
>>>>>>
>>>>>>
>>>>>>> Did you see any error messages in the JobTracker logs around the time
>>>>>>> this
>>>>>>> started? I think I understand how this might happen.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Todd
>>>>>>>
>>>>>>> On Thu, Apr 30, 2009 at 10:45 AM, Lance Riedel <la...@dotspots.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> I have not been able to reproduce.  We are using version 19.1 with
>>>>>>> the
>>>>>>>
>>>>>>>  following patches:
>>>>>>>> 4780-2v19.patch (Jira  4780)
>>>>>>>> closeAll3.patch (Jira 3998)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Lance
>>>>>>>>
>>>>>>>>
>>>>>>>> On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote:
>>>>>>>>
>>>>>>>> Hi Lance,
>>>>>>>>
>>>>>>>>
>>>>>>>>  Can I ask what version you were running when you saw this? Is it
>>>>>>>>> reproducible? I'm trying to look at the code path that might
>>>>>>>>> produce
>>>>>>>>> such
>>>>>>>>> a
>>>>>>>>> behavior and want to make sure I'm looking at the right version.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> -Todd
>>>>>>>>>
>>>>>>>>> On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <la...@dotspots.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Has anyone seen this before? Our task tracker produced a 2.7 gig
>>>>>>>>> log
>>>>>>>>> file
>>>>>>>>>
>>>>>>>>> in a few hours. The entry is all the same (every 2 ms):
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>> with
>>>>>>>>>> reponseId '5341
>>>>>>>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>> with
>>>>>>>>>> reponseId '5341
>>>>>>>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>> with
>>>>>>>>>> reponseId '5341
>>>>>>>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>> with
>>>>>>>>>> reponseId '5341
>>>>>>>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>> with
>>>>>>>>>> reponseId '5341
>>>>>>>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>> with
>>>>>>>>>> reponseId '5341
>>>>>>>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>> with
>>>>>>>>>> reponseId '5341
>>>>>>>>>> ... (And on and on and on...)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> These are the few lines before it started:
>>>>>>>>>>
>>>>>>>>>> 2009-04-30 02:34:29,780 INFO
>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>>>>>> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes: 3341324,
>>>>>>>>>> op:
>>>>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0
>>>>>>>>>> 2009-04-30 02:34:31,522 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> Sent
>>>>>>>>>> out
>>>>>>>>>> 418891 bytes for reduce: 12 from map:
>>>>>>>>>> attempt_200904291917_0352_m_000007_0
>>>>>>>>>> given 418891/418887 from 4301462 with (22, 171)
>>>>>>>>>> 2009-04-30 02:34:31,522 INFO
>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>>>>>> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes: 418891,
>>>>>>>>>> op:
>>>>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0
>>>>>>>>>> 2009-04-30 02:34:35,382 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10
>>>>>>>>>> of
>>>>>>>>>> 11
>>>>>>>>>> at
>>>>>>>>>> 0.32 MB/s) >
>>>>>>>>>> 2009-04-30 02:34:38,385 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10
>>>>>>>>>> of
>>>>>>>>>> 11
>>>>>>>>>> at
>>>>>>>>>> 0.32 MB/s) >
>>>>>>>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>> with
>>>>>>>>>> reponseId '5341
>>>>>>>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>> with
>>>>>>>>>> reponseId '5341
>>>>>>>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>> with
>>>>>>>>>> reponseId '5341
>>>>>>>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>> with
>>>>>>>>>> reponseId '5341
>>>>>>>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>> with
>>>>>>>>>> reponseId '5341
>>>>>>>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>> with
>>>>>>>>>> reponseId '5341
>>>>>>>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>> with
>>>>>>>>>> reponseId '5341
>>>>>>>>>>
>>>>>>>>>> --And on for    2+ gigs
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
>>
>

Re: Infinite Loop Resending status from task tracker

Posted by Lance Riedel <la...@dotspots.com>.

Just had another cluster crash with the same issue. This is still a huge
issue for us- still crashing our cluster every other night (actually almost
every night now).

Should we move to .20? Is there more information i can provide?  Is this
related to my other email "Constantly getting DiskErrorExceptions - but
logged as INFO"? I haven't seen responses on that.

Thanks!
Lance
On Thu, May 14, 2009 at 7:48 AM, Lance Riedel <la...@dotspots.com> wrote:

> Here is the latest here.. Haven't heard any more, but every other night we
> get 10 gigs logs and tons of failed tasks and have to restart the cluster
>
> ---------- Forwarded message ----------
> From: Lance Riedel <la...@dotspots.com>
> Date: Fri, May 8, 2009 at 10:49 AM
> Subject: Re: Infinite Loop Resending status from task tracker
> To: core-user@hadoop.apache.org
> Cc: Brian Long <br...@dotspots.com>
>
>
> Hi Todd,
> Sorry, my response got hung up in my outbox for a couple of days.. arghh
>
>
>
> Confirmed that 1) we are not running out of space and 2) that our
> mapred.local.dir directory is not in /tmp
>
> Not sure if this an ec2 problem with a mounted drive?
>
> We had the same thing happen again, exact same logs and symptoms
> (simultaneous in jobtracker and tasktracker)
>
> Thinking about moving to .20 because of this, any thoughts on that?
>
> Thanks,
> Lance
>
>
> On May 4, 2009, at 4:18 PM, Todd Lipcon wrote:
>
>  Hi Lance,
>>
>> Two thoughts here that might be the culprit:
>>
>> 1) Is it possible that the partition that your mapred.local.dir is on is
>> out
>> of space on that task tracker?
>>
>> 2) Is it possible that you're using a directory under /tmp for
>> mapred.local.dir and some system cron script cleared out /tmp?
>>
>> -Todd
>>
>> On Sat, May 2, 2009 at 9:01 AM, Lance Riedel <la...@dotspots.com> wrote:
>>
>>  Hi Todd,
>>> Not sure if this is related, but our hadoop cluster in general is getting
>>> more and more unstable.  the logs are full of this error message (but
>>> having
>>> trouble tracking down the root problem):
>>>
>>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:49,297 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
>>> in any of the configured local directories
>>> 2009-05-02 11:30:54,298 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
>>> in any of the configured local directories
>>>
>>>
>>> Lance
>>>
>>>
>>> On Apr 30, 2009, at 12:04 PM, Todd Lipcon wrote:
>>>
>>> Hey Lance,
>>>
>>>>
>>>> Thanks for the logs. They definitely confirmed my suspicion. There are
>>>> two
>>>> problems here:
>>>>
>>>> 1) If the JobTracker throws an exception during processing of a
>>>> heartbeat,
>>>> the tasktracker retries with no delay, since lastHeartbeat isn't updated
>>>> in
>>>> TaskTracker.offerService. This is related to HADOOP-3987
>>>>
>>>> 2) If the TaskTracker sends a task in COMMIT_PENDING state with an
>>>> invalid
>>>> task id, the jobtracker will trigger a NullPointerException in
>>>> JobTracker.getTasksToSave. Instead it should probably create a
>>>> KillTaskAction. I just filed a JIRA to track this issue:
>>>>
>>>> https://issues.apache.org/jira/browse/HADOOP-5761
>>>>
>>>> 3) The TaskTracker somehow had a task attempt in COMMIT_PENDING state
>>>> that
>>>> the JobTracker didn't know about. How it got there is a separate problem
>>>> that's a bit harder to track down.
>>>>
>>>> Thanks
>>>> -Todd
>>>>
>>>> On Thu, Apr 30, 2009 at 11:17 AM, Lance Riedel <la...@dotspots.com>
>>>> wrote:
>>>>
>>>> Here are the job tracker logs from the same time (and yes.. there is
>>>>
>>>>> something there!!):
>>>>>
>>>>>
>>>>> 2009-04-30 02:34:28,484 INFO org.apache.hadoop.mapred.JobTracker:
>>>>> Serious
>>>>> problem.  While updating status, cannot find taskid
>>>>> attempt_200904291917_0252_r_000003_0
>>>>>
>>>>>
>>>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>> handler 2 on 54311, call
>>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1a93388, false,
>>>>> true,
>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>> java.lang.NullPointerException
>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>    at
>>>>>
>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>    at
>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>    at
>>>>>
>>>>>
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.mapred.JobTracker:
>>>>> Serious
>>>>> problem.  While updating status, cannot find taskid
>>>>> attempt_200904291917_0296_r_000014_1
>>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:
>>>>> Adding
>>>>> task 'attempt_200904291917_0352_r_000013_0' to tip
>>>>> task_200904291917_0352_r_000013, for tracker
>>>>>
>>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>>>> 127.0.0.1:42479'
>>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:
>>>>> Removed
>>>>> completed task 'attempt_200904291917_0343_m_000003_0' from
>>>>>
>>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>>>> 127.0.0.1:42479'
>>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:
>>>>> Removed
>>>>> completed task 'attempt_200904291917_0343_m_000007_0' from
>>>>>
>>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>>>> 127.0.0.1:42479'
>>>>>
>>>>>
>>>>> And then.. a LOT more
>>>>>
>>>>>
>>>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.mapred.JobTracker:
>>>>> Serious
>>>>> problem.  While updating status, cannot find taskid
>>>>> attempt_200904291917_0252_r_000003_0
>>>>> 2009-04-30 02:34:40,433 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>> Recieved duplicate status update of 'KILLED' for
>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>> 'task_200904291917_0352_m_000010'
>>>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>> handler 1 on 54311, call
>>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1b7b4c1, false,
>>>>> true,
>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>> java.lang.NullPointerException
>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>    at
>>>>>
>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>    at
>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>    at
>>>>>
>>>>>
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>> 2009-04-30 02:34:40,441 INFO org.apache.hadoop.mapred.JobTracker:
>>>>> Serious
>>>>> problem.  While updating status, cannot find taskid
>>>>> attempt_200904291917_0252_r_000003_0
>>>>> 2009-04-30 02:34:40,441 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>> Recieved duplicate status update of 'KILLED' for
>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>> 'task_200904291917_0352_m_000010'
>>>>> 2009-04-30 02:34:40,442 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>> handler 1 on 54311, call
>>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1598c57, false,
>>>>> true,
>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>> java.lang.NullPointerException
>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>    at
>>>>>
>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>    at
>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>    at
>>>>>
>>>>>
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.mapred.JobTracker:
>>>>> Serious
>>>>> problem.  While updating status, cannot find taskid
>>>>> attempt_200904291917_0252_r_000003_0
>>>>> 2009-04-30 02:34:40,444 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>> Recieved duplicate status update of 'KILLED' for
>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>> 'task_200904291917_0352_m_000010'
>>>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>> handler 7 on 54311, call
>>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@cb3d5, false,
>>>>> true,
>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>> java.lang.NullPointerException
>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>    at
>>>>>
>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>    at
>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>    at
>>>>>
>>>>>
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.mapred.JobTracker:
>>>>> Serious
>>>>> problem.  While updating status, cannot find taskid
>>>>> attempt_200904291917_0252_r_000003_0
>>>>> 2009-04-30 02:34:40,447 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>> Recieved duplicate status update of 'KILLED' for
>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>> 'task_200904291917_0352_m_000010'
>>>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>> handler 1 on 54311, call
>>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1f27d98, false,
>>>>> true,
>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>> java.lang.NullPointerException
>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>    at
>>>>>
>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>    at
>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>    at
>>>>>
>>>>>
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Apr 30, 2009, at 11:04 AM, Todd Lipcon wrote:
>>>>>
>>>>> Hey Lance,
>>>>>
>>>>>
>>>>>> Did you see any error messages in the JobTracker logs around the time
>>>>>> this
>>>>>> started? I think I understand how this might happen.
>>>>>>
>>>>>> Thanks,
>>>>>> -Todd
>>>>>>
>>>>>> On Thu, Apr 30, 2009 at 10:45 AM, Lance Riedel <la...@dotspots.com>
>>>>>> wrote:
>>>>>>
>>>>>> I have not been able to reproduce.  We are using version 19.1 with the
>>>>>>
>>>>>>  following patches:
>>>>>>> 4780-2v19.patch (Jira  4780)
>>>>>>> closeAll3.patch (Jira 3998)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lance
>>>>>>>
>>>>>>>
>>>>>>> On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote:
>>>>>>>
>>>>>>> Hi Lance,
>>>>>>>
>>>>>>>
>>>>>>>  Can I ask what version you were running when you saw this? Is it
>>>>>>>> reproducible? I'm trying to look at the code path that might produce
>>>>>>>> such
>>>>>>>> a
>>>>>>>> behavior and want to make sure I'm looking at the right version.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>> On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <la...@dotspots.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Has anyone seen this before? Our task tracker produced a 2.7 gig log
>>>>>>>> file
>>>>>>>>
>>>>>>>> in a few hours. The entry is all the same (every 2 ms):
>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>> with
>>>>>>>>> reponseId '5341
>>>>>>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>> with
>>>>>>>>> reponseId '5341
>>>>>>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>> with
>>>>>>>>> reponseId '5341
>>>>>>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>> with
>>>>>>>>> reponseId '5341
>>>>>>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>> with
>>>>>>>>> reponseId '5341
>>>>>>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>> with
>>>>>>>>> reponseId '5341
>>>>>>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>> with
>>>>>>>>> reponseId '5341
>>>>>>>>> ... (And on and on and on...)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These are the few lines before it started:
>>>>>>>>>
>>>>>>>>> 2009-04-30 02:34:29,780 INFO
>>>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>>>>> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes: 3341324,
>>>>>>>>> op:
>>>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0
>>>>>>>>> 2009-04-30 02:34:31,522 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> Sent
>>>>>>>>> out
>>>>>>>>> 418891 bytes for reduce: 12 from map:
>>>>>>>>> attempt_200904291917_0352_m_000007_0
>>>>>>>>> given 418891/418887 from 4301462 with (22, 171)
>>>>>>>>> 2009-04-30 02:34:31,522 INFO
>>>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>>>>> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes: 418891,
>>>>>>>>> op:
>>>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0
>>>>>>>>> 2009-04-30 02:34:35,382 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10
>>>>>>>>> of
>>>>>>>>> 11
>>>>>>>>> at
>>>>>>>>> 0.32 MB/s) >
>>>>>>>>> 2009-04-30 02:34:38,385 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10
>>>>>>>>> of
>>>>>>>>> 11
>>>>>>>>> at
>>>>>>>>> 0.32 MB/s) >
>>>>>>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>> with
>>>>>>>>> reponseId '5341
>>>>>>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>> with
>>>>>>>>> reponseId '5341
>>>>>>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>> with
>>>>>>>>> reponseId '5341
>>>>>>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>> with
>>>>>>>>> reponseId '5341
>>>>>>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>> with
>>>>>>>>> reponseId '5341
>>>>>>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>> with
>>>>>>>>> reponseId '5341
>>>>>>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>> with
>>>>>>>>> reponseId '5341
>>>>>>>>>
>>>>>>>>> --And on for    2+ gigs
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>
>

Re: Infinite Loop Resending status from task tracker

Posted by Lance Riedel <la...@dotspots.com>.

Hi Todd,
Sorry, my response got hung up in my outbox for a couple of days.. arghh



Confirmed that 1) we are not running out of space and 2) that our  
mapred.local.dir directory is not in /tmp

Not sure if this an ec2 problem with a mounted drive?

We had the same thing happen again, exact same logs and symptoms  
(simultaneous in jobtracker and tasktracker)

Thinking about moving to .20 because of this, any thoughts on that?

Thanks,
Lance

On May 4, 2009, at 4:18 PM, Todd Lipcon wrote:

> Hi Lance,
>
> Two thoughts here that might be the culprit:
>
> 1) Is it possible that the partition that your mapred.local.dir is  
> on is out
> of space on that task tracker?
>
> 2) Is it possible that you're using a directory under /tmp for
> mapred.local.dir and some system cron script cleared out /tmp?
>
> -Todd
>
> On Sat, May 2, 2009 at 9:01 AM, Lance Riedel <la...@dotspots.com>  
> wrote:
>
>> Hi Todd,
>> Not sure if this is related, but our hadoop cluster in general is  
>> getting
>> more and more unstable.  the logs are full of this error message  
>> (but having
>> trouble tracking down the root problem):
>>
>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200904301103_1111/ 
>> attempt_200904301103_1111_m_000001_1/output/file.out
>> in any of the configured local directories
>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200904301103_1675/ 
>> attempt_200904301103_1675_r_000012_1/output/file.out
>> in any of the configured local directories
>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200904301103_0944/ 
>> attempt_200904301103_0944_r_000015_0/output/file.out
>> in any of the configured local directories
>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200904301103_1111/ 
>> attempt_200904301103_1111_m_000001_1/output/file.out
>> in any of the configured local directories
>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200904301103_1675/ 
>> attempt_200904301103_1675_r_000012_1/output/file.out
>> in any of the configured local directories
>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200904301103_0944/ 
>> attempt_200904301103_0944_r_000015_0/output/file.out
>> in any of the configured local directories
>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200904301103_1111/ 
>> attempt_200904301103_1111_m_000001_1/output/file.out
>> in any of the configured local directories
>> 2009-05-02 11:30:49,297 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200904301103_1675/ 
>> attempt_200904301103_1675_r_000012_1/output/file.out
>> in any of the configured local directories
>> 2009-05-02 11:30:54,298 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200904301103_0944/ 
>> attempt_200904301103_0944_r_000015_0/output/file.out
>> in any of the configured local directories
>>
>>
>> Lance
>>
>>
>> On Apr 30, 2009, at 12:04 PM, Todd Lipcon wrote:
>>
>> Hey Lance,
>>>
>>> Thanks for the logs. They definitely confirmed my suspicion. There  
>>> are two
>>> problems here:
>>>
>>> 1) If the JobTracker throws an exception during processing of a  
>>> heartbeat,
>>> the tasktracker retries with no delay, since lastHeartbeat isn't  
>>> updated
>>> in
>>> TaskTracker.offerService. This is related to HADOOP-3987
>>>
>>> 2) If the TaskTracker sends a task in COMMIT_PENDING state with an  
>>> invalid
>>> task id, the jobtracker will trigger a NullPointerException in
>>> JobTracker.getTasksToSave. Instead it should probably create a
>>> KillTaskAction. I just filed a JIRA to track this issue:
>>>
>>> https://issues.apache.org/jira/browse/HADOOP-5761
>>>
>>> 3) The TaskTracker somehow had a task attempt in COMMIT_PENDING  
>>> state that
>>> the JobTracker didn't know about. How it got there is a separate  
>>> problem
>>> that's a bit harder to track down.
>>>
>>> Thanks
>>> -Todd
>>>
>>> On Thu, Apr 30, 2009 at 11:17 AM, Lance Riedel <la...@dotspots.com>
>>> wrote:
>>>
>>> Here are the job tracker logs from the same time (and yes.. there is
>>>> something there!!):
>>>>
>>>>
>>>> 2009-04-30 02:34:28,484 INFO org.apache.hadoop.mapred.JobTracker:  
>>>> Serious
>>>> problem.  While updating status, cannot find taskid
>>>> attempt_200904291917_0252_r_000003_0
>>>>
>>>>
>>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.ipc.Server: IPC  
>>>> Server
>>>> handler 2 on 54311, call
>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1a93388,  
>>>> false,
>>>> true,
>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>> java.lang.NullPointerException
>>>> java.io.IOException: java.lang.NullPointerException
>>>>     at
>>>> org 
>>>> .apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java: 
>>>> 2130)
>>>>     at
>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>     at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>     at
>>>>
>>>> sun 
>>>> .reflect 
>>>> .DelegatingMethodAccessorImpl 
>>>> .invoke(DelegatingMethodAccessorImpl.java:25)
>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.mapred.JobTracker:  
>>>> Serious
>>>> problem.  While updating status, cannot find taskid
>>>> attempt_200904291917_0296_r_000014_1
>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:  
>>>> Adding
>>>> task 'attempt_200904291917_0352_r_000013_0' to tip
>>>> task_200904291917_0352_r_000013, for tracker
>>>> 'tracker_domU-12-31-38-00- 
>>>> F0-41.compute-1.internal:localhost.localdomain/
>>>> 127.0.0.1:42479'
>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:  
>>>> Removed
>>>> completed task 'attempt_200904291917_0343_m_000003_0' from
>>>> 'tracker_domU-12-31-38-00- 
>>>> F0-41.compute-1.internal:localhost.localdomain/
>>>> 127.0.0.1:42479'
>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:  
>>>> Removed
>>>> completed task 'attempt_200904291917_0343_m_000007_0' from
>>>> 'tracker_domU-12-31-38-00- 
>>>> F0-41.compute-1.internal:localhost.localdomain/
>>>> 127.0.0.1:42479'
>>>>
>>>>
>>>> And then.. a LOT more
>>>>
>>>>
>>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.mapred.JobTracker:  
>>>> Serious
>>>> problem.  While updating status, cannot find taskid
>>>> attempt_200904291917_0252_r_000003_0
>>>> 2009-04-30 02:34:40,433 WARN  
>>>> org.apache.hadoop.mapred.TaskInProgress:
>>>> Recieved duplicate status update of 'KILLED' for
>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>> 'task_200904291917_0352_m_000010'
>>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.ipc.Server: IPC  
>>>> Server
>>>> handler 1 on 54311, call
>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1b7b4c1,  
>>>> false,
>>>> true,
>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>> java.lang.NullPointerException
>>>> java.io.IOException: java.lang.NullPointerException
>>>>     at
>>>> org 
>>>> .apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java: 
>>>> 2130)
>>>>     at
>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>     at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>     at
>>>>
>>>> sun 
>>>> .reflect 
>>>> .DelegatingMethodAccessorImpl 
>>>> .invoke(DelegatingMethodAccessorImpl.java:25)
>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>> 2009-04-30 02:34:40,441 INFO org.apache.hadoop.mapred.JobTracker:  
>>>> Serious
>>>> problem.  While updating status, cannot find taskid
>>>> attempt_200904291917_0252_r_000003_0
>>>> 2009-04-30 02:34:40,441 WARN  
>>>> org.apache.hadoop.mapred.TaskInProgress:
>>>> Recieved duplicate status update of 'KILLED' for
>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>> 'task_200904291917_0352_m_000010'
>>>> 2009-04-30 02:34:40,442 INFO org.apache.hadoop.ipc.Server: IPC  
>>>> Server
>>>> handler 1 on 54311, call
>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1598c57,  
>>>> false,
>>>> true,
>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>> java.lang.NullPointerException
>>>> java.io.IOException: java.lang.NullPointerException
>>>>     at
>>>> org 
>>>> .apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java: 
>>>> 2130)
>>>>     at
>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>     at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>     at
>>>>
>>>> sun 
>>>> .reflect 
>>>> .DelegatingMethodAccessorImpl 
>>>> .invoke(DelegatingMethodAccessorImpl.java:25)
>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.mapred.JobTracker:  
>>>> Serious
>>>> problem.  While updating status, cannot find taskid
>>>> attempt_200904291917_0252_r_000003_0
>>>> 2009-04-30 02:34:40,444 WARN  
>>>> org.apache.hadoop.mapred.TaskInProgress:
>>>> Recieved duplicate status update of 'KILLED' for
>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>> 'task_200904291917_0352_m_000010'
>>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.ipc.Server: IPC  
>>>> Server
>>>> handler 7 on 54311, call
>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@cb3d5,  
>>>> false, true,
>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>> java.lang.NullPointerException
>>>> java.io.IOException: java.lang.NullPointerException
>>>>     at
>>>> org 
>>>> .apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java: 
>>>> 2130)
>>>>     at
>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>     at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>     at
>>>>
>>>> sun 
>>>> .reflect 
>>>> .DelegatingMethodAccessorImpl 
>>>> .invoke(DelegatingMethodAccessorImpl.java:25)
>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.mapred.JobTracker:  
>>>> Serious
>>>> problem.  While updating status, cannot find taskid
>>>> attempt_200904291917_0252_r_000003_0
>>>> 2009-04-30 02:34:40,447 WARN  
>>>> org.apache.hadoop.mapred.TaskInProgress:
>>>> Recieved duplicate status update of 'KILLED' for
>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>> 'task_200904291917_0352_m_000010'
>>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.ipc.Server: IPC  
>>>> Server
>>>> handler 1 on 54311, call
>>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1f27d98,  
>>>> false,
>>>> true,
>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>> java.lang.NullPointerException
>>>> java.io.IOException: java.lang.NullPointerException
>>>>     at
>>>> org 
>>>> .apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java: 
>>>> 2130)
>>>>     at
>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>     at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>     at
>>>>
>>>> sun 
>>>> .reflect 
>>>> .DelegatingMethodAccessorImpl 
>>>> .invoke(DelegatingMethodAccessorImpl.java:25)
>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Apr 30, 2009, at 11:04 AM, Todd Lipcon wrote:
>>>>
>>>> Hey Lance,
>>>>
>>>>>
>>>>> Did you see any error messages in the JobTracker logs around the  
>>>>> time
>>>>> this
>>>>> started? I think I understand how this might happen.
>>>>>
>>>>> Thanks,
>>>>> -Todd
>>>>>
>>>>> On Thu, Apr 30, 2009 at 10:45 AM, Lance Riedel  
>>>>> <la...@dotspots.com>
>>>>> wrote:
>>>>>
>>>>> I have not been able to reproduce.  We are using version 19.1  
>>>>> with the
>>>>>
>>>>>> following patches:
>>>>>> 4780-2v19.patch (Jira  4780)
>>>>>> closeAll3.patch (Jira 3998)
>>>>>>
>>>>>> Thanks,
>>>>>> Lance
>>>>>>
>>>>>>
>>>>>> On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote:
>>>>>>
>>>>>> Hi Lance,
>>>>>>
>>>>>>
>>>>>>> Can I ask what version you were running when you saw this? Is it
>>>>>>> reproducible? I'm trying to look at the code path that might  
>>>>>>> produce
>>>>>>> such
>>>>>>> a
>>>>>>> behavior and want to make sure I'm looking at the right version.
>>>>>>>
>>>>>>> Thanks
>>>>>>> -Todd
>>>>>>>
>>>>>>> On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <lance@dotspots.com 
>>>>>>> >
>>>>>>> wrote:
>>>>>>>
>>>>>>> Has anyone seen this before? Our task tracker produced a 2.7  
>>>>>>> gig log
>>>>>>> file
>>>>>>>
>>>>>>> in a few hours. The entry is all the same (every 2 ms):
>>>>>>>>
>>>>>>>> 2009-04-30 02:34:40,207 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx- 
>>>>>>>> xx.compute-1.amazonaws.com' with
>>>>>>>> reponseId '5341
>>>>>>>> 2009-04-30 02:34:40,398 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx- 
>>>>>>>> xx.compute-1.amazonaws.com' with
>>>>>>>> reponseId '5341
>>>>>>>> 2009-04-30 02:34:40,403 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx- 
>>>>>>>> xx.compute-1.amazonaws.com' with
>>>>>>>> reponseId '5341
>>>>>>>> 2009-04-30 02:34:40,411 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx- 
>>>>>>>> xx.compute-1.amazonaws.com' with
>>>>>>>> reponseId '5341
>>>>>>>> 2009-04-30 02:34:40,414 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx- 
>>>>>>>> xx.compute-1.amazonaws.com' with
>>>>>>>> reponseId '5341
>>>>>>>> 2009-04-30 02:34:40,417 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx- 
>>>>>>>> xx.compute-1.amazonaws.com' with
>>>>>>>> reponseId '5341
>>>>>>>> 2009-04-30 02:34:40,420 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx- 
>>>>>>>> xx.compute-1.amazonaws.com' with
>>>>>>>> reponseId '5341
>>>>>>>> ... (And on and on and on...)
>>>>>>>>
>>>>>>>>
>>>>>>>> These are the few lines before it started:
>>>>>>>>
>>>>>>>> 2009-04-30 02:34:29,780 INFO
>>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>>>> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes:  
>>>>>>>> 3341324,
>>>>>>>> op:
>>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0
>>>>>>>> 2009-04-30 02:34:31,522 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> Sent
>>>>>>>> out
>>>>>>>> 418891 bytes for reduce: 12 from map:
>>>>>>>> attempt_200904291917_0352_m_000007_0
>>>>>>>> given 418891/418887 from 4301462 with (22, 171)
>>>>>>>> 2009-04-30 02:34:31,522 INFO
>>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>>>> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes:  
>>>>>>>> 418891,
>>>>>>>> op:
>>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0
>>>>>>>> 2009-04-30 02:34:35,382 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy  
>>>>>>>> (10 of
>>>>>>>> 11
>>>>>>>> at
>>>>>>>> 0.32 MB/s) >
>>>>>>>> 2009-04-30 02:34:38,385 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy  
>>>>>>>> (10 of
>>>>>>>> 11
>>>>>>>> at
>>>>>>>> 0.32 MB/s) >
>>>>>>>> 2009-04-30 02:34:40,207 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx- 
>>>>>>>> xx.compute-1.amazonaws.com' with
>>>>>>>> reponseId '5341
>>>>>>>> 2009-04-30 02:34:40,398 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx- 
>>>>>>>> xx.compute-1.amazonaws.com' with
>>>>>>>> reponseId '5341
>>>>>>>> 2009-04-30 02:34:40,403 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx- 
>>>>>>>> xx.compute-1.amazonaws.com' with
>>>>>>>> reponseId '5341
>>>>>>>> 2009-04-30 02:34:40,411 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx- 
>>>>>>>> xx.compute-1.amazonaws.com' with
>>>>>>>> reponseId '5341
>>>>>>>> 2009-04-30 02:34:40,414 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx- 
>>>>>>>> xx.compute-1.amazonaws.com' with
>>>>>>>> reponseId '5341
>>>>>>>> 2009-04-30 02:34:40,417 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx- 
>>>>>>>> xx.compute-1.amazonaws.com' with
>>>>>>>> reponseId '5341
>>>>>>>> 2009-04-30 02:34:40,420 INFO  
>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx- 
>>>>>>>> xx.compute-1.amazonaws.com' with
>>>>>>>> reponseId '5341
>>>>>>>>
>>>>>>>> --And on for    2+ gigs
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>

Re: Infinite Loop Resending status from task tracker

Posted by Todd Lipcon <to...@cloudera.com>.

Hi Lance,

Two thoughts here that might be the culprit:

1) Is it possible that the partition that your mapred.local.dir is on is out
of space on that task tracker?

2) Is it possible that you're using a directory under /tmp for
mapred.local.dir and some system cron script cleared out /tmp?

-Todd

On Sat, May 2, 2009 at 9:01 AM, Lance Riedel <la...@dotspots.com> wrote:

> Hi Todd,
> Not sure if this is related, but our hadoop cluster in general is getting
> more and more unstable.  the logs are full of this error message (but having
> trouble tracking down the root problem):
>
> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:49,297 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:54,298 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
> in any of the configured local directories
>
>
> Lance
>
>
> On Apr 30, 2009, at 12:04 PM, Todd Lipcon wrote:
>
>  Hey Lance,
>>
>> Thanks for the logs. They definitely confirmed my suspicion. There are two
>> problems here:
>>
>> 1) If the JobTracker throws an exception during processing of a heartbeat,
>> the tasktracker retries with no delay, since lastHeartbeat isn't updated
>> in
>> TaskTracker.offerService. This is related to HADOOP-3987
>>
>> 2) If the TaskTracker sends a task in COMMIT_PENDING state with an invalid
>> task id, the jobtracker will trigger a NullPointerException in
>> JobTracker.getTasksToSave. Instead it should probably create a
>> KillTaskAction. I just filed a JIRA to track this issue:
>>
>> https://issues.apache.org/jira/browse/HADOOP-5761
>>
>> 3) The TaskTracker somehow had a task attempt in COMMIT_PENDING state that
>> the JobTracker didn't know about. How it got there is a separate problem
>> that's a bit harder to track down.
>>
>> Thanks
>> -Todd
>>
>> On Thu, Apr 30, 2009 at 11:17 AM, Lance Riedel <la...@dotspots.com>
>> wrote:
>>
>>  Here are the job tracker logs from the same time (and yes.. there is
>>> something there!!):
>>>
>>>
>>> 2009-04-30 02:34:28,484 INFO org.apache.hadoop.mapred.JobTracker: Serious
>>> problem.  While updating status, cannot find taskid
>>> attempt_200904291917_0252_r_000003_0
>>>
>>>
>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 2 on 54311, call
>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1a93388, false,
>>> true,
>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>> java.lang.NullPointerException
>>> java.io.IOException: java.lang.NullPointerException
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>      at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.mapred.JobTracker: Serious
>>> problem.  While updating status, cannot find taskid
>>> attempt_200904291917_0296_r_000014_1
>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: Adding
>>> task 'attempt_200904291917_0352_r_000013_0' to tip
>>> task_200904291917_0352_r_000013, for tracker
>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>> 127.0.0.1:42479'
>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: Removed
>>> completed task 'attempt_200904291917_0343_m_000003_0' from
>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>> 127.0.0.1:42479'
>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: Removed
>>> completed task 'attempt_200904291917_0343_m_000007_0' from
>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>> 127.0.0.1:42479'
>>>
>>>
>>> And then.. a LOT more
>>>
>>>
>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.mapred.JobTracker: Serious
>>> problem.  While updating status, cannot find taskid
>>> attempt_200904291917_0252_r_000003_0
>>> 2009-04-30 02:34:40,433 WARN org.apache.hadoop.mapred.TaskInProgress:
>>> Recieved duplicate status update of 'KILLED' for
>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>> 'task_200904291917_0352_m_000010'
>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 1 on 54311, call
>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1b7b4c1, false,
>>> true,
>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>> java.lang.NullPointerException
>>> java.io.IOException: java.lang.NullPointerException
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>      at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>> 2009-04-30 02:34:40,441 INFO org.apache.hadoop.mapred.JobTracker: Serious
>>> problem.  While updating status, cannot find taskid
>>> attempt_200904291917_0252_r_000003_0
>>> 2009-04-30 02:34:40,441 WARN org.apache.hadoop.mapred.TaskInProgress:
>>> Recieved duplicate status update of 'KILLED' for
>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>> 'task_200904291917_0352_m_000010'
>>> 2009-04-30 02:34:40,442 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 1 on 54311, call
>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1598c57, false,
>>> true,
>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>> java.lang.NullPointerException
>>> java.io.IOException: java.lang.NullPointerException
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>      at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.mapred.JobTracker: Serious
>>> problem.  While updating status, cannot find taskid
>>> attempt_200904291917_0252_r_000003_0
>>> 2009-04-30 02:34:40,444 WARN org.apache.hadoop.mapred.TaskInProgress:
>>> Recieved duplicate status update of 'KILLED' for
>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>> 'task_200904291917_0352_m_000010'
>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 7 on 54311, call
>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@cb3d5, false, true,
>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>> java.lang.NullPointerException
>>> java.io.IOException: java.lang.NullPointerException
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>      at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.mapred.JobTracker: Serious
>>> problem.  While updating status, cannot find taskid
>>> attempt_200904291917_0252_r_000003_0
>>> 2009-04-30 02:34:40,447 WARN org.apache.hadoop.mapred.TaskInProgress:
>>> Recieved duplicate status update of 'KILLED' for
>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>> 'task_200904291917_0352_m_000010'
>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 1 on 54311, call
>>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1f27d98, false,
>>> true,
>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>> java.lang.NullPointerException
>>> java.io.IOException: java.lang.NullPointerException
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>      at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Apr 30, 2009, at 11:04 AM, Todd Lipcon wrote:
>>>
>>> Hey Lance,
>>>
>>>>
>>>> Did you see any error messages in the JobTracker logs around the time
>>>> this
>>>> started? I think I understand how this might happen.
>>>>
>>>> Thanks,
>>>> -Todd
>>>>
>>>> On Thu, Apr 30, 2009 at 10:45 AM, Lance Riedel <la...@dotspots.com>
>>>> wrote:
>>>>
>>>> I have not been able to reproduce.  We are using version 19.1 with the
>>>>
>>>>> following patches:
>>>>> 4780-2v19.patch (Jira  4780)
>>>>> closeAll3.patch (Jira 3998)
>>>>>
>>>>> Thanks,
>>>>> Lance
>>>>>
>>>>>
>>>>> On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote:
>>>>>
>>>>> Hi Lance,
>>>>>
>>>>>
>>>>>> Can I ask what version you were running when you saw this? Is it
>>>>>> reproducible? I'm trying to look at the code path that might produce
>>>>>> such
>>>>>> a
>>>>>> behavior and want to make sure I'm looking at the right version.
>>>>>>
>>>>>> Thanks
>>>>>> -Todd
>>>>>>
>>>>>> On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <la...@dotspots.com>
>>>>>> wrote:
>>>>>>
>>>>>> Has anyone seen this before? Our task tracker produced a 2.7 gig log
>>>>>> file
>>>>>>
>>>>>>  in a few hours. The entry is all the same (every 2 ms):
>>>>>>>
>>>>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> ... (And on and on and on...)
>>>>>>>
>>>>>>>
>>>>>>> These are the few lines before it started:
>>>>>>>
>>>>>>> 2009-04-30 02:34:29,780 INFO
>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>>> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes: 3341324,
>>>>>>> op:
>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0
>>>>>>> 2009-04-30 02:34:31,522 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Sent
>>>>>>> out
>>>>>>> 418891 bytes for reduce: 12 from map:
>>>>>>> attempt_200904291917_0352_m_000007_0
>>>>>>> given 418891/418887 from 4301462 with (22, 171)
>>>>>>> 2009-04-30 02:34:31,522 INFO
>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>>> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes: 418891,
>>>>>>> op:
>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0
>>>>>>> 2009-04-30 02:34:35,382 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of
>>>>>>> 11
>>>>>>> at
>>>>>>> 0.32 MB/s) >
>>>>>>> 2009-04-30 02:34:38,385 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of
>>>>>>> 11
>>>>>>> at
>>>>>>> 0.32 MB/s) >
>>>>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>>
>>>>>>> --And on for    2+ gigs
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>
>

Re: Infinite Loop Resending status from task tracker

Posted by Lance Riedel <la...@dotspots.com>.

Hi Todd,
Not sure if this is related, but our hadoop cluster in general is  
getting more and more unstable.  the logs are full of this error  
message (but having trouble tracking down the root problem):

2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:  
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find  
taskTracker/jobcache/job_200904301103_1111/ 
attempt_200904301103_1111_m_000001_1/output/file.out in any of the  
configured local directories
2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:  
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find  
taskTracker/jobcache/job_200904301103_1675/ 
attempt_200904301103_1675_r_000012_1/output/file.out in any of the  
configured local directories
2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:  
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find  
taskTracker/jobcache/job_200904301103_0944/ 
attempt_200904301103_0944_r_000015_0/output/file.out in any of the  
configured local directories
2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:  
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find  
taskTracker/jobcache/job_200904301103_1111/ 
attempt_200904301103_1111_m_000001_1/output/file.out in any of the  
configured local directories
2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:  
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find  
taskTracker/jobcache/job_200904301103_1675/ 
attempt_200904301103_1675_r_000012_1/output/file.out in any of the  
configured local directories
2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:  
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find  
taskTracker/jobcache/job_200904301103_0944/ 
attempt_200904301103_0944_r_000015_0/output/file.out in any of the  
configured local directories
2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:  
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find  
taskTracker/jobcache/job_200904301103_1111/ 
attempt_200904301103_1111_m_000001_1/output/file.out in any of the  
configured local directories
2009-05-02 11:30:49,297 INFO org.apache.hadoop.mapred.TaskTracker:  
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find  
taskTracker/jobcache/job_200904301103_1675/ 
attempt_200904301103_1675_r_000012_1/output/file.out in any of the  
configured local directories
2009-05-02 11:30:54,298 INFO org.apache.hadoop.mapred.TaskTracker:  
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find  
taskTracker/jobcache/job_200904301103_0944/ 
attempt_200904301103_0944_r_000015_0/output/file.out in any of the  
configured local directories


Lance

On Apr 30, 2009, at 12:04 PM, Todd Lipcon wrote:

> Hey Lance,
>
> Thanks for the logs. They definitely confirmed my suspicion. There  
> are two
> problems here:
>
> 1) If the JobTracker throws an exception during processing of a  
> heartbeat,
> the tasktracker retries with no delay, since lastHeartbeat isn't  
> updated in
> TaskTracker.offerService. This is related to HADOOP-3987
>
> 2) If the TaskTracker sends a task in COMMIT_PENDING state with an  
> invalid
> task id, the jobtracker will trigger a NullPointerException in
> JobTracker.getTasksToSave. Instead it should probably create a
> KillTaskAction. I just filed a JIRA to track this issue:
>
> https://issues.apache.org/jira/browse/HADOOP-5761
>
> 3) The TaskTracker somehow had a task attempt in COMMIT_PENDING  
> state that
> the JobTracker didn't know about. How it got there is a separate  
> problem
> that's a bit harder to track down.
>
> Thanks
> -Todd
>
> On Thu, Apr 30, 2009 at 11:17 AM, Lance Riedel <la...@dotspots.com>  
> wrote:
>
>> Here are the job tracker logs from the same time (and yes.. there is
>> something there!!):
>>
>>
>> 2009-04-30 02:34:28,484 INFO org.apache.hadoop.mapred.JobTracker:  
>> Serious
>> problem.  While updating status, cannot find taskid
>> attempt_200904291917_0252_r_000003_0
>>
>>
>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 2 on 54311, call
>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1a93388,  
>> false, true,
>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>> java.lang.NullPointerException
>> java.io.IOException: java.lang.NullPointerException
>>       at
>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java: 
>> 2130)
>>       at
>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>       at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>       at
>> sun 
>> .reflect 
>> .DelegatingMethodAccessorImpl 
>> .invoke(DelegatingMethodAccessorImpl.java:25)
>>       at java.lang.reflect.Method.invoke(Method.java:597)
>>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.mapred.JobTracker:  
>> Serious
>> problem.  While updating status, cannot find taskid
>> attempt_200904291917_0296_r_000014_1
>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:  
>> Adding
>> task 'attempt_200904291917_0352_r_000013_0' to tip
>> task_200904291917_0352_r_000013, for tracker
>> 'tracker_domU-12-31-38-00- 
>> F0-41.compute-1.internal:localhost.localdomain/
>> 127.0.0.1:42479'
>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:  
>> Removed
>> completed task 'attempt_200904291917_0343_m_000003_0' from
>> 'tracker_domU-12-31-38-00- 
>> F0-41.compute-1.internal:localhost.localdomain/
>> 127.0.0.1:42479'
>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:  
>> Removed
>> completed task 'attempt_200904291917_0343_m_000007_0' from
>> 'tracker_domU-12-31-38-00- 
>> F0-41.compute-1.internal:localhost.localdomain/
>> 127.0.0.1:42479'
>>
>>
>> And then.. a LOT more
>>
>>
>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.mapred.JobTracker:  
>> Serious
>> problem.  While updating status, cannot find taskid
>> attempt_200904291917_0252_r_000003_0
>> 2009-04-30 02:34:40,433 WARN org.apache.hadoop.mapred.TaskInProgress:
>> Recieved duplicate status update of 'KILLED' for
>> 'attempt_200904291917_0352_m_000010_1' of TIP
>> 'task_200904291917_0352_m_000010'
>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 1 on 54311, call
>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1b7b4c1,  
>> false, true,
>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>> java.lang.NullPointerException
>> java.io.IOException: java.lang.NullPointerException
>>       at
>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java: 
>> 2130)
>>       at
>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>       at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>       at
>> sun 
>> .reflect 
>> .DelegatingMethodAccessorImpl 
>> .invoke(DelegatingMethodAccessorImpl.java:25)
>>       at java.lang.reflect.Method.invoke(Method.java:597)
>>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>> 2009-04-30 02:34:40,441 INFO org.apache.hadoop.mapred.JobTracker:  
>> Serious
>> problem.  While updating status, cannot find taskid
>> attempt_200904291917_0252_r_000003_0
>> 2009-04-30 02:34:40,441 WARN org.apache.hadoop.mapred.TaskInProgress:
>> Recieved duplicate status update of 'KILLED' for
>> 'attempt_200904291917_0352_m_000010_1' of TIP
>> 'task_200904291917_0352_m_000010'
>> 2009-04-30 02:34:40,442 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 1 on 54311, call
>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1598c57,  
>> false, true,
>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>> java.lang.NullPointerException
>> java.io.IOException: java.lang.NullPointerException
>>       at
>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java: 
>> 2130)
>>       at
>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>       at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>       at
>> sun 
>> .reflect 
>> .DelegatingMethodAccessorImpl 
>> .invoke(DelegatingMethodAccessorImpl.java:25)
>>       at java.lang.reflect.Method.invoke(Method.java:597)
>>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.mapred.JobTracker:  
>> Serious
>> problem.  While updating status, cannot find taskid
>> attempt_200904291917_0252_r_000003_0
>> 2009-04-30 02:34:40,444 WARN org.apache.hadoop.mapred.TaskInProgress:
>> Recieved duplicate status update of 'KILLED' for
>> 'attempt_200904291917_0352_m_000010_1' of TIP
>> 'task_200904291917_0352_m_000010'
>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 7 on 54311, call
>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@cb3d5, false,  
>> true,
>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>> java.lang.NullPointerException
>> java.io.IOException: java.lang.NullPointerException
>>       at
>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java: 
>> 2130)
>>       at
>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>       at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>       at
>> sun 
>> .reflect 
>> .DelegatingMethodAccessorImpl 
>> .invoke(DelegatingMethodAccessorImpl.java:25)
>>       at java.lang.reflect.Method.invoke(Method.java:597)
>>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.mapred.JobTracker:  
>> Serious
>> problem.  While updating status, cannot find taskid
>> attempt_200904291917_0252_r_000003_0
>> 2009-04-30 02:34:40,447 WARN org.apache.hadoop.mapred.TaskInProgress:
>> Recieved duplicate status update of 'KILLED' for
>> 'attempt_200904291917_0352_m_000010_1' of TIP
>> 'task_200904291917_0352_m_000010'
>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 1 on 54311, call
>> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1f27d98,  
>> false, true,
>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>> java.lang.NullPointerException
>> java.io.IOException: java.lang.NullPointerException
>>       at
>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java: 
>> 2130)
>>       at
>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>       at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>       at
>> sun 
>> .reflect 
>> .DelegatingMethodAccessorImpl 
>> .invoke(DelegatingMethodAccessorImpl.java:25)
>>       at java.lang.reflect.Method.invoke(Method.java:597)
>>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Apr 30, 2009, at 11:04 AM, Todd Lipcon wrote:
>>
>> Hey Lance,
>>>
>>> Did you see any error messages in the JobTracker logs around the  
>>> time this
>>> started? I think I understand how this might happen.
>>>
>>> Thanks,
>>> -Todd
>>>
>>> On Thu, Apr 30, 2009 at 10:45 AM, Lance Riedel <la...@dotspots.com>
>>> wrote:
>>>
>>> I have not been able to reproduce.  We are using version 19.1 with  
>>> the
>>>> following patches:
>>>> 4780-2v19.patch (Jira  4780)
>>>> closeAll3.patch (Jira 3998)
>>>>
>>>> Thanks,
>>>> Lance
>>>>
>>>>
>>>> On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote:
>>>>
>>>> Hi Lance,
>>>>
>>>>>
>>>>> Can I ask what version you were running when you saw this? Is it
>>>>> reproducible? I'm trying to look at the code path that might  
>>>>> produce
>>>>> such
>>>>> a
>>>>> behavior and want to make sure I'm looking at the right version.
>>>>>
>>>>> Thanks
>>>>> -Todd
>>>>>
>>>>> On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <la...@dotspots.com>
>>>>> wrote:
>>>>>
>>>>> Has anyone seen this before? Our task tracker produced a 2.7 gig  
>>>>> log
>>>>> file
>>>>>
>>>>>> in a few hours. The entry is all the same (every 2 ms):
>>>>>>
>>>>>> 2009-04-30 02:34:40,207 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>>>> with
>>>>>> reponseId '5341
>>>>>> 2009-04-30 02:34:40,398 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>>>> with
>>>>>> reponseId '5341
>>>>>> 2009-04-30 02:34:40,403 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>>>> with
>>>>>> reponseId '5341
>>>>>> 2009-04-30 02:34:40,411 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>>>> with
>>>>>> reponseId '5341
>>>>>> 2009-04-30 02:34:40,414 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>>>> with
>>>>>> reponseId '5341
>>>>>> 2009-04-30 02:34:40,417 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>>>> with
>>>>>> reponseId '5341
>>>>>> 2009-04-30 02:34:40,420 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>>>> with
>>>>>> reponseId '5341
>>>>>> ... (And on and on and on...)
>>>>>>
>>>>>>
>>>>>> These are the few lines before it started:
>>>>>>
>>>>>> 2009-04-30 02:34:29,780 INFO
>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes:  
>>>>>> 3341324, op:
>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0
>>>>>> 2009-04-30 02:34:31,522 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker: Sent
>>>>>> out
>>>>>> 418891 bytes for reduce: 12 from map:
>>>>>> attempt_200904291917_0352_m_000007_0
>>>>>> given 418891/418887 from 4301462 with (22, 171)
>>>>>> 2009-04-30 02:34:31,522 INFO
>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes:  
>>>>>> 418891, op:
>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0
>>>>>> 2009-04-30 02:34:35,382 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy  
>>>>>> (10 of 11
>>>>>> at
>>>>>> 0.32 MB/s) >
>>>>>> 2009-04-30 02:34:38,385 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy  
>>>>>> (10 of 11
>>>>>> at
>>>>>> 0.32 MB/s) >
>>>>>> 2009-04-30 02:34:40,207 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>>>> with
>>>>>> reponseId '5341
>>>>>> 2009-04-30 02:34:40,398 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>>>> with
>>>>>> reponseId '5341
>>>>>> 2009-04-30 02:34:40,403 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>>>> with
>>>>>> reponseId '5341
>>>>>> 2009-04-30 02:34:40,411 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>>>> with
>>>>>> reponseId '5341
>>>>>> 2009-04-30 02:34:40,414 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>>>> with
>>>>>> reponseId '5341
>>>>>> 2009-04-30 02:34:40,417 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>>>> with
>>>>>> reponseId '5341
>>>>>> 2009-04-30 02:34:40,420 INFO  
>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>>>> with
>>>>>> reponseId '5341
>>>>>>
>>>>>> --And on for    2+ gigs
>>>>>>
>>>>>>
>>>>>>
>>>>
>>

Re: Infinite Loop Resending status from task tracker

Posted by Todd Lipcon <to...@cloudera.com>.

Hey Lance,

Thanks for the logs. They definitely confirmed my suspicion. There are two
problems here:

1) If the JobTracker throws an exception during processing of a heartbeat,
the tasktracker retries with no delay, since lastHeartbeat isn't updated in
TaskTracker.offerService. This is related to HADOOP-3987

2) If the TaskTracker sends a task in COMMIT_PENDING state with an invalid
task id, the jobtracker will trigger a NullPointerException in
JobTracker.getTasksToSave. Instead it should probably create a
KillTaskAction. I just filed a JIRA to track this issue:

https://issues.apache.org/jira/browse/HADOOP-5761

3) The TaskTracker somehow had a task attempt in COMMIT_PENDING state that
the JobTracker didn't know about. How it got there is a separate problem
that's a bit harder to track down.

Thanks
-Todd

On Thu, Apr 30, 2009 at 11:17 AM, Lance Riedel <la...@dotspots.com> wrote:

> Here are the job tracker logs from the same time (and yes.. there is
> something there!!):
>
>
> 2009-04-30 02:34:28,484 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200904291917_0252_r_000003_0
>
>
> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 2 on 54311, call
> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1a93388, false, true,
> 5341) from 10.253.134.191:42688: error: java.io.IOException:
> java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
>        at
> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>        at
> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>        at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200904291917_0296_r_000014_1
> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200904291917_0352_r_000013_0' to tip
> task_200904291917_0352_r_000013, for tracker
> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
> 127.0.0.1:42479'
> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: Removed
> completed task 'attempt_200904291917_0343_m_000003_0' from
> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
> 127.0.0.1:42479'
> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: Removed
> completed task 'attempt_200904291917_0343_m_000007_0' from
> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
> 127.0.0.1:42479'
>
>
> And then.. a LOT more
>
>
> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200904291917_0252_r_000003_0
> 2009-04-30 02:34:40,433 WARN org.apache.hadoop.mapred.TaskInProgress:
> Recieved duplicate status update of 'KILLED' for
> 'attempt_200904291917_0352_m_000010_1' of TIP
> 'task_200904291917_0352_m_000010'
> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 1 on 54311, call
> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1b7b4c1, false, true,
> 5341) from 10.253.134.191:42688: error: java.io.IOException:
> java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
>        at
> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>        at
> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>        at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
> 2009-04-30 02:34:40,441 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200904291917_0252_r_000003_0
> 2009-04-30 02:34:40,441 WARN org.apache.hadoop.mapred.TaskInProgress:
> Recieved duplicate status update of 'KILLED' for
> 'attempt_200904291917_0352_m_000010_1' of TIP
> 'task_200904291917_0352_m_000010'
> 2009-04-30 02:34:40,442 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 1 on 54311, call
> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1598c57, false, true,
> 5341) from 10.253.134.191:42688: error: java.io.IOException:
> java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
>        at
> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>        at
> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>        at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200904291917_0252_r_000003_0
> 2009-04-30 02:34:40,444 WARN org.apache.hadoop.mapred.TaskInProgress:
> Recieved duplicate status update of 'KILLED' for
> 'attempt_200904291917_0352_m_000010_1' of TIP
> 'task_200904291917_0352_m_000010'
> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 7 on 54311, call
> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@cb3d5, false, true,
> 5341) from 10.253.134.191:42688: error: java.io.IOException:
> java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
>        at
> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>        at
> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>        at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200904291917_0252_r_000003_0
> 2009-04-30 02:34:40,447 WARN org.apache.hadoop.mapred.TaskInProgress:
> Recieved duplicate status update of 'KILLED' for
> 'attempt_200904291917_0352_m_000010_1' of TIP
> 'task_200904291917_0352_m_000010'
> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 1 on 54311, call
> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1f27d98, false, true,
> 5341) from 10.253.134.191:42688: error: java.io.IOException:
> java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
>        at
> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>        at
> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>        at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>
>
>
>
>
>
>
>
>
>
> On Apr 30, 2009, at 11:04 AM, Todd Lipcon wrote:
>
>  Hey Lance,
>>
>> Did you see any error messages in the JobTracker logs around the time this
>> started? I think I understand how this might happen.
>>
>> Thanks,
>> -Todd
>>
>> On Thu, Apr 30, 2009 at 10:45 AM, Lance Riedel <la...@dotspots.com>
>> wrote:
>>
>>  I have not been able to reproduce.  We are using version 19.1 with the
>>> following patches:
>>> 4780-2v19.patch (Jira  4780)
>>> closeAll3.patch (Jira 3998)
>>>
>>> Thanks,
>>> Lance
>>>
>>>
>>> On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote:
>>>
>>> Hi Lance,
>>>
>>>>
>>>> Can I ask what version you were running when you saw this? Is it
>>>> reproducible? I'm trying to look at the code path that might produce
>>>> such
>>>> a
>>>> behavior and want to make sure I'm looking at the right version.
>>>>
>>>> Thanks
>>>> -Todd
>>>>
>>>> On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <la...@dotspots.com>
>>>> wrote:
>>>>
>>>> Has anyone seen this before? Our task tracker produced a 2.7 gig log
>>>> file
>>>>
>>>>> in a few hours. The entry is all the same (every 2 ms):
>>>>>
>>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>> reponseId '5341
>>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>> reponseId '5341
>>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>> reponseId '5341
>>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>> reponseId '5341
>>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>> reponseId '5341
>>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>> reponseId '5341
>>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>> reponseId '5341
>>>>> ... (And on and on and on...)
>>>>>
>>>>>
>>>>> These are the few lines before it started:
>>>>>
>>>>> 2009-04-30 02:34:29,780 INFO
>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes: 3341324, op:
>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0
>>>>> 2009-04-30 02:34:31,522 INFO org.apache.hadoop.mapred.TaskTracker: Sent
>>>>> out
>>>>> 418891 bytes for reduce: 12 from map:
>>>>> attempt_200904291917_0352_m_000007_0
>>>>> given 418891/418887 from 4301462 with (22, 171)
>>>>> 2009-04-30 02:34:31,522 INFO
>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes: 418891, op:
>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0
>>>>> 2009-04-30 02:34:35,382 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of 11
>>>>> at
>>>>> 0.32 MB/s) >
>>>>> 2009-04-30 02:34:38,385 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of 11
>>>>> at
>>>>> 0.32 MB/s) >
>>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>> reponseId '5341
>>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>> reponseId '5341
>>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>> reponseId '5341
>>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>> reponseId '5341
>>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>> reponseId '5341
>>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>> reponseId '5341
>>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>> reponseId '5341
>>>>>
>>>>> --And on for    2+ gigs
>>>>>
>>>>>
>>>>>
>>>
>

Re: Infinite Loop Resending status from task tracker

Posted by Lance Riedel <la...@dotspots.com>.

Here are the job tracker logs from the same time (and yes.. there is  
something there!!):


2009-04-30 02:34:28,484 INFO org.apache.hadoop.mapred.JobTracker:  
Serious problem.  While updating status, cannot find taskid  
attempt_200904291917_0252_r_000003_0


2009-04-30 02:34:40,215 INFO org.apache.hadoop.ipc.Server: IPC Server  
handler 2 on 54311, call  
heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1a93388, false,  
true, 5341) from 10.253.134.191:42688: error: java.io.IOException:  
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
         at  
org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
         at  
org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
         at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
         at  
sun 
.reflect 
.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java: 
25)
         at java.lang.reflect.Method.invoke(Method.java:597)
         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
2009-04-30 02:34:40,215 INFO org.apache.hadoop.mapred.JobTracker:  
Serious problem.  While updating status, cannot find taskid  
attempt_200904291917_0296_r_000014_1
2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:  
Adding task 'attempt_200904291917_0352_r_000013_0' to tip  
task_200904291917_0352_r_000013, for tracker 'tracker_domU-12-31-38-00- 
F0-41.compute-1.internal:localhost.localdomain/127.0.0.1:42479'
2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:  
Removed completed task 'attempt_200904291917_0343_m_000003_0' from  
'tracker_domU-12-31-38-00- 
F0-41.compute-1.internal:localhost.localdomain/127.0.0.1:42479'
2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:  
Removed completed task 'attempt_200904291917_0343_m_000007_0' from  
'tracker_domU-12-31-38-00- 
F0-41.compute-1.internal:localhost.localdomain/127.0.0.1:42479'


And then.. a LOT more


2009-04-30 02:34:40,433 INFO org.apache.hadoop.mapred.JobTracker:  
Serious problem.  While updating status, cannot find taskid  
attempt_200904291917_0252_r_000003_0
2009-04-30 02:34:40,433 WARN org.apache.hadoop.mapred.TaskInProgress:  
Recieved duplicate status update of 'KILLED' for  
'attempt_200904291917_0352_m_000010_1' of TIP  
'task_200904291917_0352_m_000010'
2009-04-30 02:34:40,433 INFO org.apache.hadoop.ipc.Server: IPC Server  
handler 1 on 54311, call  
heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1b7b4c1, false,  
true, 5341) from 10.253.134.191:42688: error: java.io.IOException:  
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
         at  
org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
         at  
org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
         at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
         at  
sun 
.reflect 
.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java: 
25)
         at java.lang.reflect.Method.invoke(Method.java:597)
         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
2009-04-30 02:34:40,441 INFO org.apache.hadoop.mapred.JobTracker:  
Serious problem.  While updating status, cannot find taskid  
attempt_200904291917_0252_r_000003_0
2009-04-30 02:34:40,441 WARN org.apache.hadoop.mapred.TaskInProgress:  
Recieved duplicate status update of 'KILLED' for  
'attempt_200904291917_0352_m_000010_1' of TIP  
'task_200904291917_0352_m_000010'
2009-04-30 02:34:40,442 INFO org.apache.hadoop.ipc.Server: IPC Server  
handler 1 on 54311, call  
heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1598c57, false,  
true, 5341) from 10.253.134.191:42688: error: java.io.IOException:  
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
         at  
org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
         at  
org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
         at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
         at  
sun 
.reflect 
.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java: 
25)
         at java.lang.reflect.Method.invoke(Method.java:597)
         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
2009-04-30 02:34:40,444 INFO org.apache.hadoop.mapred.JobTracker:  
Serious problem.  While updating status, cannot find taskid  
attempt_200904291917_0252_r_000003_0
2009-04-30 02:34:40,444 WARN org.apache.hadoop.mapred.TaskInProgress:  
Recieved duplicate status update of 'KILLED' for  
'attempt_200904291917_0352_m_000010_1' of TIP  
'task_200904291917_0352_m_000010'
2009-04-30 02:34:40,444 INFO org.apache.hadoop.ipc.Server: IPC Server  
handler 7 on 54311, call  
heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@cb3d5, false,  
true, 5341) from 10.253.134.191:42688: error: java.io.IOException:  
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
         at  
org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
         at  
org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
         at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
         at  
sun 
.reflect 
.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java: 
25)
         at java.lang.reflect.Method.invoke(Method.java:597)
         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
2009-04-30 02:34:40,447 INFO org.apache.hadoop.mapred.JobTracker:  
Serious problem.  While updating status, cannot find taskid  
attempt_200904291917_0252_r_000003_0
2009-04-30 02:34:40,447 WARN org.apache.hadoop.mapred.TaskInProgress:  
Recieved duplicate status update of 'KILLED' for  
'attempt_200904291917_0352_m_000010_1' of TIP  
'task_200904291917_0352_m_000010'
2009-04-30 02:34:40,447 INFO org.apache.hadoop.ipc.Server: IPC Server  
handler 1 on 54311, call  
heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1f27d98, false,  
true, 5341) from 10.253.134.191:42688: error: java.io.IOException:  
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
         at  
org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
         at  
org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
         at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
         at  
sun 
.reflect 
.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java: 
25)
         at java.lang.reflect.Method.invoke(Method.java:597)
         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)









On Apr 30, 2009, at 11:04 AM, Todd Lipcon wrote:

> Hey Lance,
>
> Did you see any error messages in the JobTracker logs around the  
> time this
> started? I think I understand how this might happen.
>
> Thanks,
> -Todd
>
> On Thu, Apr 30, 2009 at 10:45 AM, Lance Riedel <la...@dotspots.com>  
> wrote:
>
>> I have not been able to reproduce.  We are using version 19.1 with  
>> the
>> following patches:
>> 4780-2v19.patch (Jira  4780)
>> closeAll3.patch (Jira 3998)
>>
>> Thanks,
>> Lance
>>
>>
>> On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote:
>>
>> Hi Lance,
>>>
>>> Can I ask what version you were running when you saw this? Is it
>>> reproducible? I'm trying to look at the code path that might  
>>> produce such
>>> a
>>> behavior and want to make sure I'm looking at the right version.
>>>
>>> Thanks
>>> -Todd
>>>
>>> On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <la...@dotspots.com>  
>>> wrote:
>>>
>>> Has anyone seen this before? Our task tracker produced a 2.7 gig  
>>> log file
>>>> in a few hours. The entry is all the same (every 2 ms):
>>>>
>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>> with
>>>> reponseId '5341
>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>> with
>>>> reponseId '5341
>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>> with
>>>> reponseId '5341
>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>> with
>>>> reponseId '5341
>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>> with
>>>> reponseId '5341
>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>> with
>>>> reponseId '5341
>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>> with
>>>> reponseId '5341
>>>> ... (And on and on and on...)
>>>>
>>>>
>>>> These are the few lines before it started:
>>>>
>>>> 2009-04-30 02:34:29,780 INFO
>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes: 3341324,  
>>>> op:
>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0
>>>> 2009-04-30 02:34:31,522 INFO  
>>>> org.apache.hadoop.mapred.TaskTracker: Sent
>>>> out
>>>> 418891 bytes for reduce: 12 from map:
>>>> attempt_200904291917_0352_m_000007_0
>>>> given 418891/418887 from 4301462 with (22, 171)
>>>> 2009-04-30 02:34:31,522 INFO
>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes:  
>>>> 418891, op:
>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0
>>>> 2009-04-30 02:34:35,382 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10  
>>>> of 11
>>>> at
>>>> 0.32 MB/s) >
>>>> 2009-04-30 02:34:38,385 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10  
>>>> of 11
>>>> at
>>>> 0.32 MB/s) >
>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>> with
>>>> reponseId '5341
>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>> with
>>>> reponseId '5341
>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>> with
>>>> reponseId '5341
>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>> with
>>>> reponseId '5341
>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>> with
>>>> reponseId '5341
>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>> with
>>>> reponseId '5341
>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'  
>>>> with
>>>> reponseId '5341
>>>>
>>>> --And on for    2+ gigs
>>>>
>>>>
>>

Re: Infinite Loop Resending status from task tracker

Posted by Todd Lipcon <to...@cloudera.com>.

Hey Lance,

Did you see any error messages in the JobTracker logs around the time this
started? I think I understand how this might happen.

Thanks,
-Todd

On Thu, Apr 30, 2009 at 10:45 AM, Lance Riedel <la...@dotspots.com> wrote:

> I have not been able to reproduce.  We are using version 19.1 with the
> following patches:
> 4780-2v19.patch (Jira  4780)
> closeAll3.patch (Jira 3998)
>
> Thanks,
> Lance
>
>
> On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote:
>
>  Hi Lance,
>>
>> Can I ask what version you were running when you saw this? Is it
>> reproducible? I'm trying to look at the code path that might produce such
>> a
>> behavior and want to make sure I'm looking at the right version.
>>
>> Thanks
>> -Todd
>>
>> On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <la...@dotspots.com> wrote:
>>
>>  Has anyone seen this before? Our task tracker produced a 2.7 gig log file
>>> in a few hours. The entry is all the same (every 2 ms):
>>>
>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>> reponseId '5341
>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>> reponseId '5341
>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>> reponseId '5341
>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>> reponseId '5341
>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>> reponseId '5341
>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>> reponseId '5341
>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>> reponseId '5341
>>> ... (And on and on and on...)
>>>
>>>
>>> These are the few lines before it started:
>>>
>>> 2009-04-30 02:34:29,780 INFO
>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes: 3341324, op:
>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0
>>> 2009-04-30 02:34:31,522 INFO org.apache.hadoop.mapred.TaskTracker: Sent
>>> out
>>> 418891 bytes for reduce: 12 from map:
>>> attempt_200904291917_0352_m_000007_0
>>> given 418891/418887 from 4301462 with (22, 171)
>>> 2009-04-30 02:34:31,522 INFO
>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes: 418891, op:
>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0
>>> 2009-04-30 02:34:35,382 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of 11
>>> at
>>> 0.32 MB/s) >
>>> 2009-04-30 02:34:38,385 INFO org.apache.hadoop.mapred.TaskTracker:
>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of 11
>>> at
>>> 0.32 MB/s) >
>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>> reponseId '5341
>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>> reponseId '5341
>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>> reponseId '5341
>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>> reponseId '5341
>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>> reponseId '5341
>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>> reponseId '5341
>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>> reponseId '5341
>>>
>>> --And on for    2+ gigs
>>>
>>>
>

Re: Infinite Loop Resending status from task tracker

Posted by Lance Riedel <la...@dotspots.com>.

I have not been able to reproduce.  We are using version 19.1 with the  
following patches:
4780-2v19.patch (Jira  4780)
closeAll3.patch (Jira 3998)

Thanks,
Lance

On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote:

> Hi Lance,
>
> Can I ask what version you were running when you saw this? Is it
> reproducible? I'm trying to look at the code path that might produce  
> such a
> behavior and want to make sure I'm looking at the right version.
>
> Thanks
> -Todd
>
> On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <la...@dotspots.com>  
> wrote:
>
>> Has anyone seen this before? Our task tracker produced a 2.7 gig  
>> log file
>> in a few hours. The entry is all the same (every 2 ms):
>>
>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>> reponseId '5341
>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>> reponseId '5341
>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>> reponseId '5341
>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>> reponseId '5341
>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>> reponseId '5341
>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>> reponseId '5341
>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>> reponseId '5341
>> ... (And on and on and on...)
>>
>>
>> These are the few lines before it started:
>>
>> 2009-04-30 02:34:29,780 INFO
>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes: 3341324, op:
>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0
>> 2009-04-30 02:34:31,522 INFO org.apache.hadoop.mapred.TaskTracker:  
>> Sent out
>> 418891 bytes for reduce: 12 from map:  
>> attempt_200904291917_0352_m_000007_0
>> given 418891/418887 from 4301462 with (22, 171)
>> 2009-04-30 02:34:31,522 INFO
>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes: 418891,  
>> op:
>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0
>> 2009-04-30 02:34:35,382 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10  
>> of 11 at
>> 0.32 MB/s) >
>> 2009-04-30 02:34:38,385 INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10  
>> of 11 at
>> 0.32 MB/s) >
>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>> reponseId '5341
>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>> reponseId '5341
>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>> reponseId '5341
>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>> reponseId '5341
>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>> reponseId '5341
>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>> reponseId '5341
>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>> reponseId '5341
>>
>> --And on for    2+ gigs
>>

Re: Infinite Loop Resending status from task tracker

Posted by Todd Lipcon <to...@cloudera.com>.

Hi Lance,

Can I ask what version you were running when you saw this? Is it
reproducible? I'm trying to look at the code path that might produce such a
behavior and want to make sure I'm looking at the right version.

Thanks
-Todd

On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <la...@dotspots.com> wrote:

> Has anyone seen this before? Our task tracker produced a 2.7 gig log file
> in a few hours. The entry is all the same (every 2 ms):
>
> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
> reponseId '5341
> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
> reponseId '5341
> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
> reponseId '5341
> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
> reponseId '5341
> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
> reponseId '5341
> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
> reponseId '5341
> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
> reponseId '5341
> ... (And on and on and on...)
>
>
> These are the few lines before it started:
>
> 2009-04-30 02:34:29,780 INFO
> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes: 3341324, op:
> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0
> 2009-04-30 02:34:31,522 INFO org.apache.hadoop.mapred.TaskTracker: Sent out
> 418891 bytes for reduce: 12 from map: attempt_200904291917_0352_m_000007_0
> given 418891/418887 from 4301462 with (22, 171)
> 2009-04-30 02:34:31,522 INFO
> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes: 418891, op:
> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0
> 2009-04-30 02:34:35,382 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of 11 at
> 0.32 MB/s) >
> 2009-04-30 02:34:38,385 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of 11 at
> 0.32 MB/s) >
> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
> reponseId '5341
> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
> reponseId '5341
> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
> reponseId '5341
> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
> reponseId '5341
> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
> reponseId '5341
> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
> reponseId '5341
> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
> reponseId '5341
>
> --And on for    2+ gigs
>