You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org> on 2008/03/08 04:03:46 UTC

[jira] Created: (HADOOP-2980) slow reduce copies - map output locations not being fetched even when map complete

slow reduce copies - map output locations not being fetched even when map complete
----------------------------------------------------------------------------------

                 Key: HADOOP-2980
                 URL: https://issues.apache.org/jira/browse/HADOOP-2980
             Project: Hadoop Core
          Issue Type: Bug
    Affects Versions: 0.15.3
            Reporter: Joydeep Sen Sarma


maps are long finished. reduces are stuck looking for map locations. they make progress - but slowly. it almost seems like they get new map locations every minute or so:

2008-03-07 18:50:52,737 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 done copying task_200803041231_3586_m_004620_0 output from hadoop082.sf2p.facebook.com..
2008-03-07 18:50:53,733 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
2008-03-07 18:50:53,733 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Got 0 known map output location(s); scheduling...
...
2008-03-07 18:51:49,767 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Got 50 known map output location(s); scheduling...
2008-03-07 18:51:49,767 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Scheduled 41 of 50 known outputs (0 slow hosts and 9 dup hosts)

they get about 50 locations at a time and this 1 minute delay pattern is surprisingly common ..

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2980) slow reduce copies - map output locations not being fetched even when map complete

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576752#action_12576752 ] 

Devaraj Das commented on HADOOP-2980:
-------------------------------------

There is a fixed delay between two consecutive polls to the JobTracker. But as of 0.16, the way it works is if tasks run out of map output locations, the tasktracker polls the JT (minimum delay between two consecutive polls is set as 5 secs).

> slow reduce copies - map output locations not being fetched even when map complete
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-2980
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2980
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>
> maps are long finished. reduces are stuck looking for map locations. they make progress - but slowly. it almost seems like they get new map locations every minute or so:
> 2008-03-07 18:50:52,737 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 done copying task_200803041231_3586_m_004620_0 output from hadoop082.sf2p.facebook.com..
> 2008-03-07 18:50:53,733 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-03-07 18:50:53,733 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Got 0 known map output location(s); scheduling...
> ...
> 2008-03-07 18:51:49,767 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Got 50 known map output location(s); scheduling...
> 2008-03-07 18:51:49,767 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Scheduled 41 of 50 known outputs (0 slow hosts and 9 dup hosts)
> they get about 50 locations at a time and this 1 minute delay pattern is surprisingly common ..

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2980) slow reduce copies - map output locations not being fetched even when map complete

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577740#action_12577740 ] 

Joydeep Sen Sarma commented on HADOOP-2980:
-------------------------------------------

any guesses on what might be going wrong or what u might want us to log/look at? the tasktracker logs look clean to me (at least i dont see any exceptions and such). but this is happening all the time and i want to get to the bottom of it ..

> slow reduce copies - map output locations not being fetched even when map complete
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-2980
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2980
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>
> maps are long finished. reduces are stuck looking for map locations. they make progress - but slowly. it almost seems like they get new map locations every minute or so:
> 2008-03-07 18:50:52,737 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 done copying task_200803041231_3586_m_004620_0 output from hadoop082.sf2p.facebook.com..
> 2008-03-07 18:50:53,733 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-03-07 18:50:53,733 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Got 0 known map output location(s); scheduling...
> ...
> 2008-03-07 18:51:49,767 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Got 50 known map output location(s); scheduling...
> 2008-03-07 18:51:49,767 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Scheduled 41 of 50 known outputs (0 slow hosts and 9 dup hosts)
> they get about 50 locations at a time and this 1 minute delay pattern is surprisingly common ..

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2980) slow reduce copies - map output locations not being fetched even when map complete

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577767#action_12577767 ] 

Devaraj Das commented on HADOOP-2980:
-------------------------------------

The '50' can be explained since the default probe size (for mapcompletionevents) is 50. The probe size increases with the number of copier threads via the calculation: _max(numCopiers*5, 50)_ So one thing that you should do is to increase the number of copier threads (mapred.reduce.parallel.copies to something like 20).
But I am not able to clearly explain the 1 minute interval between the queries. This is one scenario when it can happen - discussed in HADOOP-1719 (https://issues.apache.org/jira/browse/HADOOP-1719?focusedCommentId=12522954#action_12522954). The other scenario where this can happen is when there is a failure to fetch output from a host and the task backs off between 1 & 5 minutes from that host. But you say that you don't see any exceptions... 
BTW the above issues are fixed in 0.16.

> slow reduce copies - map output locations not being fetched even when map complete
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-2980
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2980
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>
> maps are long finished. reduces are stuck looking for map locations. they make progress - but slowly. it almost seems like they get new map locations every minute or so:
> 2008-03-07 18:50:52,737 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 done copying task_200803041231_3586_m_004620_0 output from hadoop082.sf2p.facebook.com..
> 2008-03-07 18:50:53,733 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-03-07 18:50:53,733 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Got 0 known map output location(s); scheduling...
> ...
> 2008-03-07 18:51:49,767 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Got 50 known map output location(s); scheduling...
> 2008-03-07 18:51:49,767 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Scheduled 41 of 50 known outputs (0 slow hosts and 9 dup hosts)
> they get about 50 locations at a time and this 1 minute delay pattern is surprisingly common ..

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2980) slow reduce copies - map output locations not being fetched even when map complete

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579191#action_12579191 ] 

Devaraj Das commented on HADOOP-2980:
-------------------------------------

Before you spend time instrumenting the code, could you please apply https://issues.apache.org/jira/secure/attachment/12372447/1719.patch and see whether it helps. I checked that the patch applies cleanly to 0.15.3. Thanks!

> slow reduce copies - map output locations not being fetched even when map complete
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-2980
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2980
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>
> maps are long finished. reduces are stuck looking for map locations. they make progress - but slowly. it almost seems like they get new map locations every minute or so:
> 2008-03-07 18:50:52,737 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 done copying task_200803041231_3586_m_004620_0 output from hadoop082.sf2p.facebook.com..
> 2008-03-07 18:50:53,733 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-03-07 18:50:53,733 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Got 0 known map output location(s); scheduling...
> ...
> 2008-03-07 18:51:49,767 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Got 50 known map output location(s); scheduling...
> 2008-03-07 18:51:49,767 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Scheduled 41 of 50 known outputs (0 slow hosts and 9 dup hosts)
> they get about 50 locations at a time and this 1 minute delay pattern is surprisingly common ..

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2980) slow reduce copies - map output locations not being fetched even when map complete

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578530#action_12578530 ] 

Joydeep Sen Sarma commented on HADOOP-2980:
-------------------------------------------

yeah - it's not the 1719 jira. the copies are all done and no new locations come back before 1 additional minute. there are no slow copies and no backoffs (which is the 5 minute thing)  in the logs.

doesn't seem like there are any changes in 0.16 that would seem to resolve the situation then. i am sure i can find this with enough additional instrumentation. just haven't had the time to ..  

is there an ipc timeout of 60 seconds somewhere lower down the stack?



> slow reduce copies - map output locations not being fetched even when map complete
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-2980
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2980
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>
> maps are long finished. reduces are stuck looking for map locations. they make progress - but slowly. it almost seems like they get new map locations every minute or so:
> 2008-03-07 18:50:52,737 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 done copying task_200803041231_3586_m_004620_0 output from hadoop082.sf2p.facebook.com..
> 2008-03-07 18:50:53,733 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-03-07 18:50:53,733 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Got 0 known map output location(s); scheduling...
> ...
> 2008-03-07 18:51:49,767 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Got 50 known map output location(s); scheduling...
> 2008-03-07 18:51:49,767 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Scheduled 41 of 50 known outputs (0 slow hosts and 9 dup hosts)
> they get about 50 locations at a time and this 1 minute delay pattern is surprisingly common ..

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2980) slow reduce copies - map output locations not being fetched even when map complete

Posted by "Sameer Al-Sakran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624523#action_12624523 ] 

Sameer Al-Sakran commented on HADOOP-2980:
------------------------------------------

I'm seeing this behaviour in 0.17.0, only waits are anywhere from 1-60 minutes. Happens sporadically, but relatively frequently.

> slow reduce copies - map output locations not being fetched even when map complete
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-2980
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2980
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.15.3
>            Reporter: Joydeep Sen Sarma
>
> maps are long finished. reduces are stuck looking for map locations. they make progress - but slowly. it almost seems like they get new map locations every minute or so:
> 2008-03-07 18:50:52,737 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 done copying task_200803041231_3586_m_004620_0 output from hadoop082.sf2p.facebook.com..
> 2008-03-07 18:50:53,733 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-03-07 18:50:53,733 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Got 0 known map output location(s); scheduling...
> ...
> 2008-03-07 18:51:49,767 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Got 50 known map output location(s); scheduling...
> 2008-03-07 18:51:49,767 INFO org.apache.hadoop.mapred.ReduceTask: task_200803041231_3586_r_000021_0 Scheduled 41 of 50 known outputs (0 slow hosts and 9 dup hosts)
> they get about 50 locations at a time and this 1 minute delay pattern is surprisingly common ..

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.