Posted to common-issues@hadoop.apache.org by "Allen Wittenauer (JIRA)" <ji...@apache.org> on 2016/03/02 19:42:18 UTC
[jira] [Commented] (HADOOP-12857) Rework hadoop-tools-dist

    [ https://issues.apache.org/jira/browse/HADOOP-12857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176220#comment-15176220 ] 

Allen Wittenauer commented on HADOOP-12857:
-------------------------------------------

I have some sample code working.  It was very enlightening and I know what to do now.  If we really do want to keep one directory, here's my current plan of attack:

* Truly optional components (s3, azure, openstack, kafka, etc.) will have a shellprofile built that users can enable by doing the necessary incantations.  I'm currently thinking I might be able to add content to hadoop-env.sh at build time so that these things can be turned on via either a single env-var setting or one setting per feature. No promises.  (Yes, I'm currently looking for my "Black Hat of Bash Wizardry" to make this happen.) Worst case, it'll be a "copy and rename to HADOOP_CONF_DIR".

* With some help from [~raviprak] to make me see the forest for the trees, I can now build shell-parseable dependency lists at build time.  I have two ways I can process these:  I can either store the lists in the hadoop-dist target directory, or store them in the target directory of the actual tools and use a well-known name plus find to build the necessary shell magic at build time.  I'm leaning towards the latter since that will allow mvn clean to work in hadoop-dist in an expected way: there won't be a hidden dependency on hadoop-tools having been run before the mvn package.

* distch, distcp, archive-logs, etc., are extremely problematic. Using shell profiles for these WILL NOT WORK since a) they aren't really optional and b) removing them from the command line tools won't really help anyone.  Currently these commands load all of HADOOP_TOOLS_PATH, which is awful. I want to add to libexec/ a tools directory that stores helper functions for the tools jars that are required by the various subcommands.  It will use similar but different code from the optional components.  It will key off a different filename for the dependency list, and there will need to be a contract between the helper function names and the dependency file name.  (This sounds worse than it is.)
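
To make the last two bullets more concrete, here is a minimal sketch of the "well-known name plus find" idea and of the contract between dependency file names and helper function names. Everything in it is an assumption for illustration only, not the actual implementation: the `.tools-deps.txt` suffix, the `tools_classpath_<tool>` naming scheme, the function names `generate_tool_helpers` and `load_tool_classpath`, and the `TOOL_CLASSPATH` and `TOOLS_LIB_DIR` variables are all hypothetical. It also assumes one jar name per line with no whitespace in the names.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: file suffix, function names, and variables are
# illustrative assumptions, not what Hadoop actually ships.

# Build time: find every dependency list under the tools build tree and
# write one helper script per tool.
# Assumed contract: <tool>.tools-deps.txt -> function tools_classpath_<tool>
generate_tool_helpers() {
  local tools_root out_dir
  tools_root=$1
  out_dir=$2
  mkdir -p "${out_dir}"
  find "${tools_root}" -type f -name '*.tools-deps.txt' | sort |
    while IFS= read -r deps_file; do
      tool=$(basename "${deps_file}" .tools-deps.txt)
      # Shell function names cannot contain '-', so map it to '_'.
      func="tools_classpath_$(printf '%s' "${tool}" | tr '-' '_')"
      # Flatten the one-jar-per-line list into a space-separated word list.
      jars=$(tr '\n' ' ' < "${deps_file}")
      {
        printf '%s() {\n' "${func}"
        printf '  local jar\n'
        printf '  for jar in %s; do\n' "${jars}"
        printf '    TOOL_CLASSPATH="${TOOL_CLASSPATH:+${TOOL_CLASSPATH}:}${TOOLS_LIB_DIR}/${jar}"\n'
        printf '  done\n'
        printf '}\n'
      } > "${out_dir}/${tool}.sh"
    done
}

# Run time: a subcommand sources only its own helper and calls the function
# whose name is derived from the tool name, instead of loading every jar.
load_tool_classpath() {
  local helper_dir tool func
  helper_dir=$1
  tool=$2
  func="tools_classpath_$(printf '%s' "${tool}" | tr '-' '_')"
  [ -f "${helper_dir}/${tool}.sh" ] || return 1
  . "${helper_dir}/${tool}.sh"
  "${func}"
}
```

In this sketch a subcommand such as distcp would call something like `load_tool_classpath "${libexec_dir}/tools" distcp` and then append `TOOL_CLASSPATH` to its real classpath. Because the lists are discovered with find under the tools tree, nothing has to be stashed in the hadoop-dist target directory ahead of time.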

I *wish* there was a way to dynamically add subcommands to hadoop, mapred, etc, but the code just isn't quite there yet.  We can do usage now, but not actual execution.

One big question: How should this work proceed?
# Single patch
# Multiple patches with a strict commit dependency order
# Separate branch followed by a branch merge

Given this work will likely be all or nothing, I'm not a fan of multiple patches.

> Rework hadoop-tools-dist
> ------------------------
>
>                 Key: HADOOP-12857
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12857
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: build
>    Affects Versions: 3.0.0
>            Reporter: Allen Wittenauer
>            Assignee: Allen Wittenauer
>
> As hadoop-tools grows bigger and bigger, it's becoming evident that having a single directory that gets sucked in is starting to become a big burden as the number of tools grows.  Let's rework this to be smarter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)