Posted to common-dev@hadoop.apache.org by "Enis Soztutar (JIRA)" <ji...@apache.org> on 2007/09/03 13:03:19 UTC
[jira] Commented: (HADOOP-1763) Too many lost task trackers - Job failures

    [ https://issues.apache.org/jira/browse/HADOOP-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524497 ] 

Enis Soztutar commented on HADOOP-1763:
---------------------------------------

Devaraj, have you checked the IPC call queue sizes? 
The {{Server.Listener#doAccept()}} method prints this information to the debug log. 

It seems that choosing the right number of IPC/RPC server handlers will always be fragile. As an alternative, I think we could make the number of handlers dynamic: monitor the call queue usage and adjust the handler count to match. Any thoughts?
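
To make the dynamic-handler idea concrete, here is a minimal sketch of one possible sizing policy. It is only an illustration: the class name HandlerSizer, the watermark parameters, and the "double on pressure, shrink by one when idle" rule are all assumptions and not part of Hadoop's ipc.Server. It only computes a target handler count from an observed queue depth; wiring it to real handler threads is left out.

```java
/**
 * Hypothetical sizing policy for IPC handler threads (not Hadoop API).
 * Given the current handler count and the observed call-queue depth,
 * it suggests the handler count to use next.
 */
final class HandlerSizer {
    private final int minHandlers;
    private final int maxHandlers;
    private final double lowWater;   // queue fill fraction below which we shrink
    private final double highWater;  // queue fill fraction above which we grow

    HandlerSizer(int minHandlers, int maxHandlers, double lowWater, double highWater) {
        if (minHandlers < 1 || maxHandlers < minHandlers) {
            throw new IllegalArgumentException("bad handler bounds");
        }
        if (lowWater < 0.0 || highWater > 1.0 || lowWater >= highWater) {
            throw new IllegalArgumentException("bad watermarks");
        }
        this.minHandlers = minHandlers;
        this.maxHandlers = maxHandlers;
        this.lowWater = lowWater;
        this.highWater = highWater;
    }

    /** Returns the suggested handler count, clamped to [minHandlers, maxHandlers]. */
    int next(int current, int queued, int queueCapacity) {
        if (queueCapacity <= 0) {
            throw new IllegalArgumentException("queueCapacity must be positive");
        }
        double fill = (double) queued / queueCapacity;
        int target = current;
        if (fill > highWater) {
            target = current * 2;   // grow quickly when calls are backing up
        } else if (fill < lowWater) {
            target = current - 1;   // shrink slowly when the queue is nearly empty
        }
        return Math.max(minHandlers, Math.min(maxHandlers, target));
    }
}
```

A periodic monitor could call next() with the current queue size and start or retire handler threads to reach the returned count. Growing fast and shrinking slowly is one way to avoid oscillation; the right watermarks would have to be found by measurement.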

> Too many lost task trackers - Job failures
> ------------------------------------------
>
>                 Key: HADOOP-1763
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1763
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>         Environment: Version: 0.15.0-dev, r565628
> Compiled: Tue Aug 14 20:55:37 UTC 2007 by hadoopqa
> 1400 Node cluster
>            Reporter: Srikanth Kakani
>            Assignee: Devaraj Das
>            Priority: Blocker
>         Attachments: 1763.patch, 1763.try.patch
>
>
> Steps to reproduce:
> 1. Run a map reduce application running more than 3000 mappers, each running longer than
> 2. Observe the lost task trackers.
> Observations:
> 1. Most of the lost taskTracker messages correspond to maps that have already completed
> 2. Based on the logs below the taskTracker is unable to connect to the job tracker and so the jobTracker deletes the job after 20 minutes
> One example:
> task_200708210155_0003_m_000000_0	node1	KILLED	0.00%		21-Aug-2007 09:39:09 	Lost task tracker   <-- Please note the time
> Counters:
> Map-Reduce Framework
> 	Map input records 	28,861
> 	Map output records 	1,349,114
> 	Map input bytes 	200,018,562
> 	Map output bytes 	714,878,712
> Node 1 task tracker logs:
> 2007-08-21 09:08:51,109 INFO org.apache.hadoop.mapred.TaskTracker: Task task_200708210155_0003_m_000000_0 is done. <-- Please note the time
> .
> .
> .
> 2007-08-21 09:08:52,212 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on SocketListener0@0.0.0.0:50060
> 2007-08-21 09:08:52,217 WARN org.mortbay.http.SocketListener: OUT OF THREADS: SocketListener0@0.0.0.0:50060
> .
> .
> .
> 2007-08-21 09:18:53,877 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:472)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:165)
>         at org.apache.hadoop.mapred.$Proxy0.heartbeat(Unknown Source)
>         at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:941)
>         at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:840)
>         at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1227)
>         at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1911)
> .
> .
> .
> 2007-08-21 09:47:45,207 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'wm501219' with reponseId '5247
> 2007-08-21 09:47:46,023 INFO org.apache.hadoop.mapred.TaskTracker: Recieved RenitTrackerAction from JobTracker
> 2007-08-21 09:47:46,041 INFO org.apache.hadoop.mapred.TaskRunner: task_200708210155_0003_m_000000_0 done; removing files.
> 2007-08-21 09:47:46,240 INFO org.apache.hadoop.mapred.TaskRunner: task_200708210155_0003_m_002237_0 done; removing files.
> Tasktracker is pretty active otherwise:
> tracker_wm511293.inktomisearch.com:50050	wm511293.inktomisearch.com	1	6	3
> JobTracker logs:
> 2007-08-21 09:01:11,951 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200708210155_0003_m_000000_0' to tip tip_200708210155_0003_m_000000, for tracker 'tracker_wm511293.inktomisearch.com:50050'
> .
> 2007-08-21 09:06:27,745 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200708210155_0003_m_000000_1' to tip tip_200708210155_0003_m_000000, for tracker 'tracker_wm511783.inktomisearch.com:50050'
> .
> 2007-08-21 09:08:51,212 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200708210155_0003_m_000000_0' has completed tip_200708210155_0003_m_000000 successfully.
> 2007-08-21 09:08:51,213 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_200708210155_0003_m_000000_0' has completed succesfully
> .
> 2007-08-21 09:11:27,227 INFO org.apache.hadoop.mapred.TaskInProgress: Already complete TIP tip_200708210155_0003_m_000000 has completed task task_200708210155_0003_m_000000_1
> .
> 2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200708210155_0003_m_000000_0: Lost task tracker
> 2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_200708210155_0003_m_000000_0' has been lost.
> .
> 2007-08-21 09:39:09,348 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200708210155_0003_m_000000_0' from 'tracker_wm511293.inktomisearch.com:50050'
> .
> 2007-08-21 09:47:20,855 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200708210155_0003_m_000000_1: Lost task tracker
> 2007-08-21 09:47:20,855 WARN org.apache.hadoop.mapred.TaskInProgress: Recieved duplicate status update of 'KILLED' for 'task_200708210155_0003_m_000000_1' of TIP 'tip_200708210155_0003_m_000000'
> Notes:
> 1. I do not see the taskTracker dying during that period
> 2. Is the retry logic not accurate/aggressive enough? (Did something change recently? This behavior is more evident in 0.15.)
> 3. Inconsistencies with jobTracker logs? Lost task tracker detection bad?
> 4. TaskTracker:
>           CPU usage: 9:10-9:20 50%
>                      9:20-9:40 0%
>           Network Usage: 6M incl dfs operations
> 5. JobTracker
>           CPU usage: Avg: 9%
>           Network Usage:  Negligible

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.