Posted to mapreduce-issues@hadoop.apache.org by "Xi Fang (JIRA)" <ji...@apache.org> on 2013/05/28 08:16:23 UTC

[jira] [Updated] (MAPREDUCE-5278) Perf: Distributed cache is broken when JT staging dir is not on the default FS

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xi Fang updated MAPREDUCE-5278:
-------------------------------

    Assignee: Xi Fang
    
> Perf: Distributed cache is broken when JT staging dir is not on the default FS
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5278
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5278
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1-win
>         Environment: Windows
>            Reporter: Xi Fang
>            Assignee: Xi Fang
>
> Today, we set the JobTracker staging dir ("mapreduce.jobtracker.staging.root.dir") to point to HDFS even though ASV is the default file system. There are a few reasons why this config was chosen:
> - It prevents the storage account credentials from leaking into the user's storage account (in other words, job.xml stays in the cluster). This is needed until HADOOP-444 is fixed.
> - It uses HDFS for the transient job files, which is good for two reasons: a) it does not flood the user's storage account with irrelevant data/files, and b) it leverages HDFS locality for small files.
> However, this approach conflicts with how distributed cache caching works, completely negating the feature's functionality.
> When files are added to the distributed cache (through the files/archives/libjars Hadoop generic options), they are copied to the JobTracker staging dir only if they reside on a file system different from the JobTracker's. Later on, this path is used as a "key" to cache the files locally on the TaskTracker's machine and to avoid localization (download/unzip) of the distributed cache files if they are already localized.
> In our configuration the caching is completely disabled and we always end up copying dist cache files to the JT staging dir first and localizing them on the tasktracker machine second.
> This is especially bad for Oozie scenarios, as Oozie uses the dist cache to distribute Hive/Pig jars throughout the cluster.
> An easy workaround is to configure mapreduce.jobtracker.staging.root.dir in mapred-site.xml to be on the default FS.

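The copy-then-key behaviour described above can be sketched in a few lines of Java. This is a simplified model and not the Hadoop implementation: the class, the helper names, the per-job directory layout, and the example URIs (container, namenode, job ids) are all hypothetical, and "same file system" is approximated by comparing URI scheme and authority. It only illustrates why a staging dir on a different file system gives every job a new cache key, and why the workaround of placing the staging dir on the default FS keeps the key stable.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Simplified model of the behaviour described in this issue; NOT Hadoop code.
final class DistCacheKeySketch {

    // Assumption: two URIs are on the same file system when scheme and authority match.
    static boolean sameFileSystem(URI a, URI b) {
        return Objects.equals(a.getScheme(), b.getScheme())
            && Objects.equals(a.getAuthority(), b.getAuthority());
    }

    // Returns the path used as the cache key for one job.
    // A file on a different file system is "copied" into a per-job staging
    // directory (hypothetical layout), so its key changes with every job.
    static URI cacheKey(URI file, URI stagingRoot, String jobId) {
        if (sameFileSystem(file, stagingRoot)) {
            return file; // no copy: the original path is the key
        }
        String path = file.getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        return URI.create(stagingRoot + "/" + jobId + "/files/" + name);
    }

    // Counts how often a tasktracker would have to localize the same file
    // across several jobs, given that it caches by key.
    static int localizations(URI file, URI stagingRoot, String... jobIds) {
        Set<URI> alreadyLocalized = new HashSet<>();
        int downloads = 0;
        for (String jobId : jobIds) {
            if (alreadyLocalized.add(cacheKey(file, stagingRoot, jobId))) {
                downloads++; // unseen key: download/unzip again
            }
        }
        return downloads;
    }

    public static void main(String[] args) {
        URI jar = URI.create("asv://container/lib/hive.jar");          // hypothetical path
        URI hdfsStaging = URI.create("hdfs://namenode:8020/staging");  // staging dir off the default FS
        URI asvStaging = URI.create("asv://container/staging");        // workaround: staging dir on the default FS
        System.out.println(localizations(jar, hdfsStaging, "job_0001", "job_0002"));
        System.out.println(localizations(jar, asvStaging, "job_0001", "job_0002"));
    }
}
```

In this model, two jobs that use the same jar trigger two localizations when the staging dir is on HDFS and only one when it is on the default FS.
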
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira