Posted to dev@oozie.apache.org by "Andras Piros (JIRA)" <ji...@apache.org> on 2018/04/24 05:58:00 UTC
[jira] [Comment Edited] (OOZIE-3227) Eliminate duplicated dependencies from distributed cache
[ https://issues.apache.org/jira/browse/OOZIE-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449335#comment-16449335 ]
Andras Piros edited comment on OOZIE-3227 at 4/24/18 5:57 AM:
--------------------------------------------------------------
[~dionusos] to me it seems quite dangerous to remove any JARs that come with any sharelib or user-provided folders or resources. Using proper symlinking instead seems a better idea.
To have a better understanding, can you please show a minimal reproduction case:
* {{workflow.xml}} and {{job.properties}}, to see which sharelibs are used
* the user-provided {{lib}} folder, as well as any JARs and/or folders provided
* which Hadoop version you build Oozie with, as well as which Hadoop version Oozie tries to connect to at runtime
Thanks!
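The symlinking idea mentioned above can be sketched without any Hadoop dependency: the distributed cache accepts a URI fragment ("path#linkname") as the localized link name, so colliding file names can be given unique links instead of being dropped. The class and method names below are hypothetical illustrations, not Oozie code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch (hypothetical names): when two cache-file URIs share a base
 * file name, append a "#n-name" fragment so each localizes under a distinct
 * symlink instead of colliding.
 */
public class SymlinkDedup {
    public static List<String> uniqueLinkNames(List<String> uris) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String uri : uris) {
            String name = uri.substring(uri.lastIndexOf('/') + 1);
            int count = seen.merge(name, 1, Integer::sum);
            // First occurrence keeps its natural name; later ones get a numbered link.
            out.add(count == 1 ? uri : uri + "#" + count + "-" + name);
        }
        return out;
    }
}
```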
> Eliminate duplicated dependencies from distributed cache
> --------------------------------------------------------
>
> Key: OOZIE-3227
> URL: https://issues.apache.org/jira/browse/OOZIE-3227
> Project: Oozie
> Issue Type: Sub-task
> Components: core
> Affects Versions: 5.0.0
> Reporter: Denes Bodo
> Assignee: Denes Bodo
> Priority: Major
>
> With Hadoop 3, it is not allowed to have multiple dependencies with the same file name on the list of *mapreduce.job.cache.files*.
> The issue occurs when I have the same file name in multiple sharelib folders and/or my application's lib folder. This can be avoided, but not easily in all cases.
> I suggest removing the duplicates from this list.
> A quick workaround for the source code in JavaActionExecutor is like:
> {code}
> removeDuplicatedDependencies(launcherJobConf, "mapreduce.job.cache.files");
> removeDuplicatedDependencies(launcherJobConf, "mapreduce.job.cache.archives");
> ......
> private void removeDuplicatedDependencies(JobConf conf, String key) {
>     final String value = conf.get(key);
>     if (value == null || value.isEmpty()) {
>         return;
>     }
>     // Keep only the first path seen for each file name
>     final Map<String, String> nameToPath = new HashMap<>();
>     final StringBuilder uniqList = new StringBuilder();
>     for (String dependency : value.split(",")) {
>         final String[] arr = dependency.split("/");
>         final String dependencyName = arr[arr.length - 1];
>         if (nameToPath.containsKey(dependencyName)) {
>             LOG.warn(dependencyName + " [" + dependency + "] is already defined in " + key + ". Skipping...");
>         } else {
>             nameToPath.put(dependencyName, dependency);
>             uniqList.append(dependency).append(",");
>         }
>     }
>     // Drop the trailing comma; the list is non-empty at this point
>     uniqList.setLength(uniqList.length() - 1);
>     conf.set(key, uniqList.toString());
> }
> {code}
> Another way is to eliminate the deprecated *org.apache.hadoop.filecache.DistributedCache*.
> I am going to get a deeper understanding of how we should use the distributed cache; all comments are welcome.
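The dedup logic of the workaround above can also be exercised without a JobConf, operating directly on the raw comma-separated property value. This is a self-contained sketch with hypothetical helper names, not actual Oozie code; it keeps the first path seen for each file name, like the quoted method.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Standalone sketch (hypothetical names): deduplicate a comma-separated list
 * of cache-file paths by base file name, preserving order and keeping the
 * first occurrence of each name.
 */
public class DedupCacheFiles {
    public static String removeDuplicates(String commaSeparated) {
        Set<String> seenNames = new LinkedHashSet<>();
        StringBuilder uniq = new StringBuilder();
        for (String dependency : commaSeparated.split(",")) {
            String name = dependency.substring(dependency.lastIndexOf('/') + 1);
            if (seenNames.add(name)) {        // add() returns false for duplicates
                if (uniq.length() > 0) {
                    uniq.append(',');
                }
                uniq.append(dependency);
            }
        }
        return uniq.toString();
    }
}
```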
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)