Posted to common-dev@hadoop.apache.org by "dhruba borthakur (JIRA)" <ji...@apache.org> on 2009/06/12 10:02:07 UTC

[jira] Created: (HADOOP-6026) Improve the performance efficiency of task initialization at the JobTracker

Improve the performance efficiency of task initialization at the JobTracker
---------------------------------------------------------------------------

                 Key: HADOOP-6026
                 URL: https://issues.apache.org/jira/browse/HADOOP-6026
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
            Reporter: dhruba borthakur
            Assignee: Zheng Shao


The JobTracker reads the splits for a job at job initialization time. Then, for each location in the split, it invokes DNSToSwitchMapping.resolve(). This, in turn, typically invokes an external script that resolves the hostname to a network rack location. The time spent invoking this external script can be reduced if hostnames and their rack locations are inserted into a cache. JobTracker.resolveAndAddToTopology() can look up this cache first and avoid invoking the external "resolve" script in most cases.
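
As a rough illustration of the proposal, here is a minimal Java sketch of a cache placed in front of a slow resolver. RackResolver and CachingRackResolver are hypothetical names invented for this example, not Hadoop classes, and the sketch assumes the wrapped resolver returns one rack per hostname, in the same order as its input.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a rack resolver; not Hadoop's actual interface. */
interface RackResolver {
    List<String> resolve(List<String> hostnames);
}

/** Wraps a slow resolver and remembers every hostname-to-rack answer. */
class CachingRackResolver implements RackResolver {
    private final RackResolver delegate;
    private final Map<String, String> cache = new HashMap<>();

    CachingRackResolver(RackResolver delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized List<String> resolve(List<String> hostnames) {
        // Collect only the hostnames that are not cached yet (without duplicates).
        List<String> misses = new ArrayList<>();
        for (String host : hostnames) {
            if (!cache.containsKey(host) && !misses.contains(host)) {
                misses.add(host);
            }
        }
        // One call to the expensive resolver covers all misses.
        if (!misses.isEmpty()) {
            List<String> racks = delegate.resolve(misses);
            for (int i = 0; i < misses.size(); i++) {
                cache.put(misses.get(i), racks.get(i));
            }
        }
        // Answer every requested hostname from the cache, in input order.
        List<String> result = new ArrayList<>();
        for (String host : hostnames) {
            result.add(cache.get(host));
        }
        return result;
    }
}
```

With this shape, a second job whose splits live on the same hosts would not trigger the external script again. Entries are never evicted in this sketch.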

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6026) Improve the performance efficiency of task initialization at the JobTracker

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719396#action_12719396 ] 

Devaraj Das commented on HADOOP-6026:
-------------------------------------

If you are using ScriptBasedMapping as the implementation for resolution, the problem outlined in this jira doesn't exist. ScriptBasedMapping extends CachedDNSToSwitchMapping, which already does the necessary caching.
In fact, I don't think we should do this caching in the core framework (and then start worrying about the cache timeout, etc.). This should be left to the implementations of DNSToSwitchMapping.
Thoughts?
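
To make the design point concrete, the following Java sketch keeps the cache inside a base class that concrete mappings extend, so the caller never sees it. CachedMappingSketch and FixedRackMappingSketch are hypothetical names used only for illustration; they do not reproduce the actual CachedDNSToSwitchMapping or ScriptBasedMapping code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical base class: the caching lives here, not in the caller. */
abstract class CachedMappingSketch {
    private final Map<String, String> cache = new HashMap<>();

    /** Subclasses supply the expensive lookup, for example by running a script. */
    protected abstract String resolveUncached(String hostname);

    public synchronized List<String> resolve(List<String> hostnames) {
        List<String> racks = new ArrayList<>();
        for (String host : hostnames) {
            String rack = cache.get(host);
            if (rack == null) {
                // Cache miss: ask the subclass once and remember the answer.
                rack = resolveUncached(host);
                cache.put(host, rack);
            }
            racks.add(rack);
        }
        return racks;
    }
}

/** Hypothetical implementation standing in for a script-backed mapping. */
class FixedRackMappingSketch extends CachedMappingSketch {
    int lookups = 0; // counts how often the "expensive" path runs

    @Override
    protected String resolveUncached(String hostname) {
        lookups++;
        return hostname.startsWith("a") ? "/rack-a" : "/default-rack";
    }
}
```

Because the policy sits in the implementation, a different mapping could choose a different cache, or none at all, without any change to the framework code that calls resolve().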



[jira] Updated: (HADOOP-6026) Improve the performance efficiency of task initialization at the JobTracker

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HADOOP-6026:
-------------------------------

    Attachment: HADOOP-6026.1.patch

I agree with Dhruba's comment, but I think there is probably no such requirement from any real deployed environment at the moment. And if there is, a simple uniform timeout may not be the best way to expire an item in the cache.

I will vote for simplicity of the code for now. I've put a comment there. In the future, people can add a caching policy if such a requirement comes up.




[jira] Resolved: (HADOOP-6026) Improve the performance efficiency of task initialization at the JobTracker

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao resolved HADOOP-6026.
--------------------------------

    Resolution: Invalid

Already fixed in 0.19.1.



[jira] Commented: (HADOOP-6026) Improve the performance efficiency of task initialization at the JobTracker

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719849#action_12719849 ] 

Zheng Shao commented on HADOOP-6026:
------------------------------------

I see. We were on 0.17 where the class hierarchy is:
"public final class ScriptBasedMapping implements Configurable, DNSToSwitchMapping"

It seems that CachedDNSToSwitchMapping was added in 0.19.

I also agree that the caching should be done in the implementation, because different implementations may have different caching policies, etc.

I will close this jira.




[jira] Commented: (HADOOP-6026) Improve the performance efficiency of task initialization at the JobTracker

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718753#action_12718753 ] 

dhruba borthakur commented on HADOOP-6026:
------------------------------------------

One drawback to the above situation is that the mapping of a hostname to its rack location would be permanent for the lifetime of a JobTracker. To accommodate a more rapidly changing network topology, we can expire items from the cache every hour or so.
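
A minimal Java sketch of such an expiring cache follows. ExpiringRackCache, its constructor parameters, and the injected clock are all assumptions made for this example; nothing here is taken from the attached patch or from Hadoop itself. Each entry records when it was stored and is resolved again once it is older than the configured time-to-live.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical hostname-to-rack cache whose entries expire after a fixed TTL. */
class ExpiringRackCache {
    private static final class Entry {
        final String rack;
        final long storedAtMillis;

        Entry(String rack, long storedAtMillis) {
            this.rack = rack;
            this.storedAtMillis = storedAtMillis;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;                // injected so tests can control time
    private final Function<String, String> resolver; // the expensive lookup

    ExpiringRackCache(long ttlMillis, LongSupplier clock, Function<String, String> resolver) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        this.resolver = resolver;
    }

    synchronized String rackFor(String hostname) {
        long now = clock.getAsLong();
        Entry entry = entries.get(hostname);
        // Resolve again if the host is unknown or its entry is at least ttlMillis old.
        if (entry == null || now - entry.storedAtMillis >= ttlMillis) {
            entry = new Entry(resolver.apply(hostname), now);
            entries.put(hostname, entry);
        }
        return entry.rack;
    }
}
```

In production the clock would typically be System::currentTimeMillis and the TTL about one hour (3,600,000 ms). Stale entries are replaced lazily on the next lookup rather than by a background thread.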
