Posted to dev@oozie.apache.org by "Denes Bodo (JIRA)" <ji...@apache.org> on 2018/04/23 15:08:00 UTC

[jira] [Created] (OOZIE-3227) Eliminate duplicated dependencies from distributed cache

Denes Bodo created OOZIE-3227:
---------------------------------

             Summary: Eliminate duplicated dependencies from distributed cache
                 Key: OOZIE-3227
                 URL: https://issues.apache.org/jira/browse/OOZIE-3227
             Project: Oozie
          Issue Type: Sub-task
          Components: core
    Affects Versions: 5.0.0
            Reporter: Denes Bodo
            Assignee: Denes Bodo


Hadoop 3 does not allow multiple dependencies with the same file name on the list of *mapreduce.job.cache.files*.

The issue occurs when the same file name appears in multiple sharelib folders and/or in my application's lib folder. This can be avoided, but it is not always easy.
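For illustration, a value like the following (paths and versions are hypothetical) is rejected under Hadoop 3, because both entries resolve to the same file name *guava-11.0.2.jar*:
{code}
mapreduce.job.cache.files=hdfs:///user/oozie/share/lib/lib_20180423/oozie/guava-11.0.2.jar,hdfs:///user/me/myapp/lib/guava-11.0.2.jar
{code}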

I suggest removing the duplicates from this list.
A quick workaround in the JavaActionExecutor source code could look like this:
{code}
            removeDuplicatedDependencies(launcherJobConf, "mapreduce.job.cache.files");
            removeDuplicatedDependencies(launcherJobConf, "mapreduce.job.cache.archives");
......
    private void removeDuplicatedDependencies(JobConf conf, String key) {
        final String value = conf.get(key);
        if (value == null || value.isEmpty()) {
            return;
        }
        final Map<String, String> nameToPath = new HashMap<>();
        final StringBuilder uniqList = new StringBuilder();
        for (String dependency : value.split(",")) {
            // Hadoop 3 compares cache entries by file name, i.e. the last path segment
            final String[] arr = dependency.split("/");
            final String dependencyName = arr[arr.length - 1];
            if (nameToPath.containsKey(dependencyName)) {
                LOG.warn(dependencyName + " [" + dependency + "] is already defined in " + key + ". Skipping...");
            } else {
                nameToPath.put(dependencyName, dependency);
                uniqList.append(dependency).append(",");
            }
        }
        // Drop the trailing comma; the list is guaranteed non-empty at this point
        uniqList.setLength(uniqList.length() - 1);
        conf.set(key, uniqList.toString());
    }
{code}
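
A minimal sketch of the expected behaviour (assuming the method were made reachable from a test; the paths are made up):
{code}
@Test
public void testRemoveDuplicatedDependencies() {
    final JobConf conf = new JobConf();
    conf.set("mapreduce.job.cache.files",
            "hdfs:///a/lib.jar,hdfs:///b/lib.jar,hdfs:///c/other.jar");
    removeDuplicatedDependencies(conf, "mapreduce.job.cache.files");
    // Only the first occurrence of each file name is kept
    assertEquals("hdfs:///a/lib.jar,hdfs:///c/other.jar",
            conf.get("mapreduce.job.cache.files"));
}
{code}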

Another option is to eliminate the deprecated *org.apache.hadoop.filecache.DistributedCache*.
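
For reference, a minimal sketch of the non-deprecated alternative, which registers cache entries through *org.apache.hadoop.mapreduce.Job* instead (the URIs are made up):
{code}
Job job = Job.getInstance(conf);
job.addCacheFile(new URI("hdfs:///user/me/myapp/lib/some.jar"));
job.addCacheArchive(new URI("hdfs:///user/me/myapp/lib/some.tar.gz"));
{code}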

I am going to develop a deeper understanding of how we should use the distributed cache; all comments are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)