Posted to common-dev@hadoop.apache.org by "Runping Qi (JIRA)" <ji...@apache.org> on 2007/10/29 21:44:50 UTC

[jira] Created: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

JobTracker becomes non-responsive if the task trackers finish task too fast
---------------------------------------------------------------------------

                 Key: HADOOP-2119
                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
            Reporter: Runping Qi
             Fix For: 0.15.0



I ran a job with 0 reducers on a cluster with 390 nodes.
The mappers ran very fast.
The jobtracker lagged behind on committing completed mapper tasks.
The number of running mappers displayed on the web UI kept getting bigger and bigger.
The job tracker eventually stopped responding to the web UI.

No progress was reported afterwards.

The job tracker was running on a separate node.
The job tracker process consumed 100% CPU, with a VM size of 1.01g (reaching the heap space limit).



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538760 ] 

Devaraj Das commented on HADOOP-2119:
-------------------------------------

bq. Yes, the #running-mappers kept on going up and went beyond the actual number of RUNNING mappers on the task trackers 
Ok, maybe the task commit thread was slow in processing the task completion events, and one reason for that could be that the namenode was slow... 

bq. I've killed the jobtracker, thus cannot do "kill -3" anymore.
Since the problem is easy to reproduce, could you please do one more run and post the jobtracker's thread stack traces (kill -3, while the JT is at 100% CPU)?


> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>             Fix For: 0.16.0


[jira] Issue Comment Edited: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Srikanth Kakani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546535 ] 

srikantk edited comment on HADOOP-2119 at 11/28/07 6:28 PM:
-------------------------------------------------------------------

This patch batches up commits in the commit task, which greatly improves the commit rate. It does not completely fix the problem. To fix the problem completely we also need to:
1. make findnewmaptasks O(1)
2. improve locking in the jobtracker

Tested the patch; it works for a job with 100,000 mappers, with almost no delay between the last map tasks completing and the job completing.
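
To illustrate what an O(1) task lookup could mean here, the following is a minimal sketch only. It assumes (this thread does not confirm it) that the cost comes from scanning the whole task list on every request. All class and method names are hypothetical and are not taken from the Hadoop source.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: contrasts a linear scan with a queue of runnable tasks.
// The two variants keep independent state so they can be compared side by side.
class RunnableTaskQueueSketch {
    private final boolean[] assigned;                           // state for the scan variant
    private final Deque<Integer> runnable = new ArrayDeque<>(); // state for the queue variant

    RunnableTaskQueueSketch(int numTasks) {
        assigned = new boolean[numTasks];
        for (int i = 0; i < numTasks; i++) {
            runnable.addLast(i);
        }
    }

    /** O(n) per call: walks the array from the start until it finds an unassigned task. */
    int findByScan() {
        for (int i = 0; i < assigned.length; i++) {
            if (!assigned[i]) {
                assigned[i] = true;
                return i;
            }
        }
        return -1; // nothing left to run
    }

    /** O(1) per call: the next runnable task is always at the head of the queue. */
    int findByQueue() {
        Integer next = runnable.pollFirst();
        return next == null ? -1 : next;
    }

    public static void main(String[] args) {
        RunnableTaskQueueSketch tasks = new RunnableTaskQueueSketch(3);
        System.out.println(tasks.findByScan() + " " + tasks.findByQueue());
    }
}
```

With 100,000 tasks the scan variant does work proportional to the number of already-assigned tasks on every call, while the queue variant does a constant amount of work per call.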


      was (Author: srikantk):
    Batching up commits in the commit task. Greatly improves commit rate.
  
> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>             Fix For: 0.16.0
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt


[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Srikanth Kakani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549252 ] 

Srikanth Kakani commented on HADOOP-2119:
-----------------------------------------

For some reason this patch causes reducers to hang when fetching data from the mappers.

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Priority: Blocker
>             Fix For: 0.15.2
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt


[jira] Updated: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Srikanth Kakani (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Srikanth Kakani updated HADOOP-2119:
------------------------------------

    Attachment: hadoop-2119.patch

Batching up commits in the commit task. Greatly improves commit rate.

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>             Fix For: 0.16.0
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt


[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538665 ] 

Devaraj Das commented on HADOOP-2119:
-------------------------------------

Runping, do you mean to say that the #running-mappers kept on going up and went beyond the actual number of mappers? Is the problem happening consistently? Are the maps creating anything on the dfs? Could you please run "kill -3 <jobtracker-pid>" to capture the state of the threads when the job tracker is consuming 100% CPU, and post the output on JIRA?
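
For context, "kill -3" sends SIGQUIT to the JVM, which on typical JVMs responds by printing a stack trace for every thread without stopping the process. The sketch below is a rough in-process approximation using the standard Thread.getAllStackTraces() call; the class name is hypothetical and this is not how the jobtracker itself produces dumps.

```java
import java.util.Map;

// Hypothetical helper: builds a simplified thread dump from inside the JVM.
class ThreadDumpSketch {
    static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            // Thread name and state; BLOCKED means waiting to enter a synchronized block or method.
            sb.append('"').append(thread.getName()).append("\" state=")
              .append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

In a dump taken while the process is at 100% CPU, a large number of threads in the BLOCKED state on the same object would point towards lock contention.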

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>             Fix For: 0.16.0


[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545886 ] 

Christian Kunz commented on HADOOP-2119:
----------------------------------------

Two quick hacks brought the execution time down from 100 hrs to 6 hrs (the actual execution time for all 100,000 mappers was a little less than 1 hr):
1) Changed HEARTBEAT_INTERVAL to 60 secs
2) Changed the number of rpc handlers from 80 to 20
Each change reduced the time by about a factor of 4.
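
As a rough sanity check on these figures (a sketch, not a measurement): two independent factor-of-4 improvements multiply to a factor of 16, and 100 hrs / 16 is about 6 hrs. The helper names below are hypothetical. The 1400-tracker figure assumes one task tracker per node on the 1400-node cluster mentioned elsewhere in this thread, with one heartbeat per tracker per interval.

```java
// Hypothetical helpers; only the numbers 100, 4, 1400 and 60 come from this thread.
class HeartbeatLoadSketch {
    /** Heartbeat RPCs per second reaching the jobtracker, assuming one per tracker per interval. */
    static double heartbeatsPerSecond(int taskTrackers, double intervalSecs) {
        return taskTrackers / intervalSecs;
    }

    /** Run time after applying two independent speed-up factors to a baseline. */
    static double hoursAfter(double baselineHours, double factorA, double factorB) {
        return baselineHours / (factorA * factorB);
    }

    public static void main(String[] args) {
        System.out.println("hours after both hacks: " + hoursAfter(100, 4, 4));
        System.out.println("heartbeats per second:  " + heartbeatsPerSecond(1400, 60));
    }
}
```

Under these assumptions a 60-second interval means roughly 23 heartbeats per second arriving at the jobtracker.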

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>             Fix For: 0.16.0
>
>         Attachments: hadoop-jobtracker-thread-dump.txt


[jira] Updated: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-2119:
-----------------------------------

    Fix Version/s:     (was: 0.16.0)
                   0.15.2

This issue has become important to us - upgrading to blocker for 0.15.2

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>             Fix For: 0.15.2
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt


[jira] Updated: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-2119:
---------------------------------

    Fix Version/s:     (was: 0.15.0)
                   0.16.0

I don't think this is a blocker for 0.15.0.

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>             Fix For: 0.16.0


[jira] Updated: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-2119:
-----------------------------------

    Attachment: hadoop-jobtracker-thread-dump.txt

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>             Fix For: 0.16.0
>
>         Attachments: hadoop-jobtracker-thread-dump.txt


[jira] Updated: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-2119:
--------------------------------

    Status: Open  (was: Patch Available)

The approach for batching updates looks good. The code needs to be fixed for indentation and the outermost try-catch block should not be removed. I am cancelling the patch until we have those fixes and also the fix for findNewTask...

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>             Fix For: 0.16.0
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt


[jira] Updated: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-2119:
-----------------------------------

    Priority: Blocker  (was: Major)

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Priority: Blocker
>             Fix For: 0.15.2
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt


[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12552579 ] 

Christian Kunz commented on HADOOP-2119:
----------------------------------------

Because of the time required to produce a comprehensive patch, I changed the blocker to 0.16.

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Priority: Blocker
>             Fix For: 0.16.0
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt


[jira] Updated: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-2119:
-----------------------------------

    Fix Version/s:     (was: 0.15.2)
                   0.16.0

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Priority: Blocker
>             Fix For: 0.16.0
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt


[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538678 ] 

Runping Qi commented on HADOOP-2119:
------------------------------------

Yes, the #running-mappers kept on going up and went beyond the actual number of RUNNING mappers on the task trackers (which is roughly 3 * NUMBER OF TASK TRACKERS).

The problem happens consistently and is easy (for me) to reproduce.
The job has 0 reducers; it just scans through a large input data set, and the mappers do not write anything out.


I've killed the jobtracker, thus cannot do "kill -3" anymore.


> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>             Fix For: 0.16.0


[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549264 ] 

Christian Kunz commented on HADOOP-2119:
----------------------------------------

To be precise:

To find out why reduces get stuck waiting for map output from a certain number of mappers, besides rolling back Srikanth's patch we also rolled back the trivial-looking patch suggested by Devaraj:

Index: src/java/org/apache/hadoop/mapred/TaskInProgress.java
===================================================================
--- src/java/org/apache/hadoop/mapred/TaskInProgress.java	(revision 598581)
+++ src/java/org/apache/hadoop/mapred/TaskInProgress.java	(working copy)
@@ -663,7 +663,7 @@
    * Return whether this TIP still needs to run
    */
   boolean isRunnable() {
-    return !failed && (completes == 0);
+    return !isOnlyCommitPending() && !failed && (completes == 0);
   }
     
   /**


> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Priority: Blocker
>             Fix For: 0.15.2
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt


[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545284 ] 

Christian Kunz commented on HADOOP-2119:
----------------------------------------

I ran into a similar problem:
100,000 mappers with a single reducer on a 1400-node cluster with 2 tasks per node. All the waves of mappers finished in 2 hours, but they are stuck in COMMIT_PENDING. The jobtracker is using 100% CPU (distributed across 4 CPUs) and seems to take 3-4 secs to process a completed task, so the job will finish in about 100 hours.

Thread dump attached

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>             Fix For: 0.16.0


[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546568 ] 

Hadoop QA commented on HADOOP-2119:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12370493/hadoop-2119.patch
against trunk revision r599223.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests -1.  The patch failed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1196/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1196/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1196/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1196/console

This message is automatically generated.

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>             Fix For: 0.16.0
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt



[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Srikanth Kakani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546533 ] 

Srikanth Kakani commented on HADOOP-2119:
-----------------------------------------

After a lot of analysis, we think that most threads on the JobTracker block on the JobTracker object's lock, so much so that lock acquisition time becomes 1.8 s or more.

Looking at the commit thread, it goes through two synchronized(this) blocks and one promotion call for each promotion.

So instead of doing much useful work, the commit thread mostly waits to acquire locks.

The fix is for the commit thread to greedily commit as many tasks as possible while it holds the lock. This works well both for slower maps and for faster-completing maps.

I will be submitting a patch.
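
A minimal sketch of the batching idea, not the patch itself. The class, field, and method names (BatchCommitSketch, trackerLock, pending, commitGreedily) are hypothetical stand-ins, and a plain Object plays the role of the JobTracker monitor. It contrasts one lock acquisition per promoted task with draining every ready task and promoting the whole batch under a single lock hold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical stand-in for the commit thread; none of these names come from the patch. */
class BatchCommitSketch {
    // Stands in for the JobTracker object whose monitor every thread contends for.
    private final Object trackerLock = new Object();
    // Completed tasks waiting to be promoted (committed).
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final List<String> committed = new ArrayList<>();
    // Counter for the demo only; read from a single thread after the work is done.
    private int lockAcquisitions = 0;

    void taskFinished(String taskId) {
        pending.add(taskId);
    }

    /** One lock acquisition per promoted task. */
    void commitOneAtATime() {
        String task;
        while ((task = pending.poll()) != null) {
            synchronized (trackerLock) {
                lockAcquisitions++;
                committed.add(task);
            }
        }
    }

    /** Greedy variant: drain everything that is ready, then promote it under one lock hold. */
    void commitGreedily() {
        List<String> batch = new ArrayList<>();
        pending.drainTo(batch); // collected without holding the tracker lock
        if (batch.isEmpty()) {
            return;
        }
        synchronized (trackerLock) {
            lockAcquisitions++;
            committed.addAll(batch);
        }
    }

    int lockAcquisitions() { return lockAcquisitions; }

    int committedCount() { return committed.size(); }

    public static void main(String[] args) {
        BatchCommitSketch perTask = new BatchCommitSketch();
        BatchCommitSketch greedy = new BatchCommitSketch();
        for (int i = 0; i < 1000; i++) {
            perTask.taskFinished("task_" + i);
            greedy.taskFinished("task_" + i);
        }
        perTask.commitOneAtATime();
        greedy.commitGreedily();
        System.out.println("per-task lock acquisitions: " + perTask.lockAcquisitions());
        System.out.println("greedy lock acquisitions:   " + greedy.lockAcquisitions());
    }
}
```

When the lock is heavily contended, each acquisition can cost far more than the promotion itself, so cutting the number of acquisitions is where the gain would come from.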

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>             Fix For: 0.16.0
>
>         Attachments: hadoop-jobtracker-thread-dump.txt



[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561091#action_12561091 ] 

Amar Kamat commented on HADOOP-2119:
------------------------------------

Some doubts:
{quote}
 +            synchronized (JobTracker.this) {
 +                     for(count=0;count<jlist.size();count++) {
{quote}

This indicates that the array is scanned while the lock is held. Is that scalable, given that the array could be huge? Locking the JobTracker for that long could be a performance hit, or am I missing something? Currently only one task is considered at a time, so the lock is released in between and the JobTracker can make progress.
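
To make the trade-off concrete, here is a hedged sketch of one possible middle ground. It is not something the patch does: a cap on how many tasks are promoted per lock hold, so the lock is released between batches. All names (BoundedBatchSketch, maxBatch, commitBatch) are hypothetical, and a plain Object stands in for the JobTracker lock.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical illustration only; not taken from hadoop-2119.patch. */
class BoundedBatchSketch {
    // Stands in for the JobTracker monitor.
    private final Object trackerLock = new Object();
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> committed = new ArrayList<>();
    private final int maxBatch;
    private int lockHolds = 0;   // lock holds that committed at least one task
    private int longestHold = 0; // most tasks committed in a single hold

    BoundedBatchSketch(int maxBatch) {
        if (maxBatch < 1) {
            throw new IllegalArgumentException("maxBatch must be at least 1");
        }
        this.maxBatch = maxBatch;
    }

    void taskFinished(String taskId) {
        synchronized (trackerLock) {
            pending.add(taskId);
        }
    }

    /** Commits at most maxBatch tasks under one lock hold; returns how many were committed. */
    int commitBatch() {
        synchronized (trackerLock) {
            int n = 0;
            while (n < maxBatch && !pending.isEmpty()) {
                committed.add(pending.remove());
                n++;
            }
            if (n > 0) {
                lockHolds++;
                longestHold = Math.max(longestHold, n);
            }
            return n;
        }
    }

    /** Keeps committing until nothing is pending; other threads can take the lock between batches. */
    void commitAll() {
        while (commitBatch() > 0) {
            // the lock is released here, between batches
        }
    }

    int lockHolds() { synchronized (trackerLock) { return lockHolds; } }

    int longestHold() { synchronized (trackerLock) { return longestHold; } }

    int committedCount() { synchronized (trackerLock) { return committed.size(); } }
}
```

With maxBatch set to 1 this degenerates to the one-task-at-a-time behaviour, and with a very large maxBatch it approaches scanning the whole array under the lock.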

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Priority: Blocker
>             Fix For: 0.16.0
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt



[jira] Updated: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Srikanth Kakani (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Srikanth Kakani updated HADOOP-2119:
------------------------------------

    Comment: was deleted

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>             Fix For: 0.16.0
>
>         Attachments: hadoop-jobtracker-thread-dump.txt



[jira] Updated: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Nigel Daley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nigel Daley updated HADOOP-2119:
--------------------------------

    Fix Version/s:     (was: 0.16.0)
                   0.17.0
         Priority: Critical  (was: Blocker)

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Priority: Critical
>             Fix For: 0.17.0
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt



[jira] Updated: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Srikanth Kakani (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Srikanth Kakani updated HADOOP-2119:
------------------------------------

    Affects Version/s: 0.16.0
               Status: Patch Available  (was: Open)

Patch that batches up promotions.

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>             Fix For: 0.16.0
>
>         Attachments: hadoop-jobtracker-thread-dump.txt



[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12552553 ] 

Sameer Paranjpye commented on HADOOP-2119:
------------------------------------------

Do we need this in 0.15.2? The right way to fix this appears to involve a good amount of JobTracker refactoring -- implementing more efficient data structures for task assignment. The fix proposed here is just a band-aid. Srikanth also tells me that they have an application-level workaround.

I suggest pushing this to 0.16 or beyond.

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Priority: Blocker
>             Fix For: 0.15.2
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt



[jira] Commented: (HADOOP-2119) JobTracker becomes non-responsive if the task trackers finish task too fast

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555747#action_12555747 ] 

Christian Kunz commented on HADOOP-2119:
----------------------------------------

We ran jobs successfully with Srikanth's patch, with no stuck reducers. The conclusion would be that simply adding '!isOnlyCommitPending()' to the condition in 'isRunnable()' in TaskInProgress.java might have undesired consequences.

> JobTracker becomes non-responsive if the task trackers finish task too fast
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-2119
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2119
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Priority: Blocker
>             Fix For: 0.16.0
>
>         Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt
