Posted to dev@bigtop.apache.org by "Konstantin Boudnik (JIRA)" <ji...@apache.org> on 2014/03/26 23:24:23 UTC
[jira] [Updated] (BIGTOP-1177) Puppet Recipes: Can we modularize them to foster HCFS initiatives?
[ https://issues.apache.org/jira/browse/BIGTOP-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Konstantin Boudnik updated BIGTOP-1177:
---------------------------------------
Component/s: Deployment
> Puppet Recipes: Can we modularize them to foster HCFS initiatives?
> ------------------------------------------------------------------
>
> Key: BIGTOP-1177
> URL: https://issues.apache.org/jira/browse/BIGTOP-1177
> Project: Bigtop
> Issue Type: Improvement
> Components: Deployment
> Affects Versions: 0.7.0
> Reporter: jay vyas
> Fix For: backlog
>
>
> In the spirit of interoperability, can we work toward modularizing the Bigtop Puppet recipes so that "hadoop_cluster_node" is not defined as an HDFS-specific class?
> I'm not a Puppet expert, but from testing on https://issues.apache.org/jira/browse/BIGTOP-1171 I'm starting to notice that the HDFS dependency can make deployment a little complex (i.e. the init-hdfs logic, etc.).
> For those of us not necessarily dependent on HDFS, this is a cumbersome service to maintain.
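> As a rough sketch of what decoupling could look like (the class and module names here are hypothetical, not Bigtop's actual recipes), the entry-point class would keep only filesystem-agnostic pieces, and HDFS would become opt-in:
>
>   # Hypothetical decoupled entry point -- illustrative only, not Bigtop's code.
>   class hadoop_cluster_node {
>     include hadoop::common_config   # core-site.xml, users, directories
>     # Note: no 'include hadoop::hdfs' here; the storage layer (HDFS or
>     # another HCFS) is selected per node rather than hardcoded.
>   }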
> Here are two reasons why decoupling "hadoop_cluster_node" from HDFS is beneficial:
> - For HDFS users: In some use cases we might want to use Bigtop to provision many nodes, only some of which are datanodes. For example: let's say our cluster is crawling the web in mappers, doing some machine learning, and distilling large pages into small relational database tuples that summarize the "entities" in each page. In this case we don't necessarily benefit much from locality, because we might be CPU-bound rather than network/IO-bound. So we might want to provision a cluster of 50 machines: 40 CPU-heavy multicore ones and just 10 datanodes to support the DFS. I know this is an extreme case, but it's a good example (see the sketch after this list).
> - For non-HDFS users: One important aspect of emerging Hadoop workflows is HCFS (https://wiki.apache.org/hadoop/HCFS/): the idea that filesystems like S3, OrangeFS, GlusterFileSystem, etc. are all just as capable, although not necessarily optimal, of supporting YARN and Hadoop operations as HDFS is.
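> To make the first point concrete, here is a minimal role-based classification sketch (class names hypothetical, continuing the example above); only the datanode role pulls in HDFS:
>
>   # Hypothetical site.pp fragment -- illustrative class names only.
>   node /compute-\d+/ {
>     include hadoop::common_config       # shared, filesystem-agnostic config
>     include hadoop::yarn::nodemanager   # the 40 CPU-heavy workers
>   }
>   node /datanode-\d+/ {
>     include hadoop::common_config
>     include hadoop::hdfs::datanode      # only the 10 DFS-backing nodes
>     include hadoop::yarn::nodemanager
>   }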
> This JIRA might have to be done in phases, and might need some refinement since I'm not a Puppet expert. But here is what seems logical:
> 1) hadoop_cluster_node shouldn't necessarily know about *jobtrackers, tasktrackers*, or any other non-essential YARN components.
> 2) Since YARN does need a DFS of some sort to run on, hadoop_cluster_node will need *definitions for that DFS*. These configuration properties (fs.defaultFS, fs.default.name, and the others listed below) could be put into the Puppet configuration and discovered that way (see the sketch after this list):
> - fs.defaultFS
> - fs.default.name (the deprecated alias of fs.defaultFS)
> - fs.AbstractFileSystem.<scheme>.impl (e.g. org.apache.hadoop.fs.local....)
> - fs.<scheme>.impl
> - hbase.rootdir
> - mapreduce.jobtracker.staging.root.dir
> - yarn.app.mapreduce.am.staging-dir
>
> 3) While we're at it: should the hadoop_cluster_node class even know about *specific ecosystem components* (ZooKeeper, etc.)? Some tools, such as ZooKeeper, don't even need Hadoop to run, so there is a lot of modularization to be done there.
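> As a sketch of item 2 (the parameter and template names are assumptions, not Bigtop's current layout), the default filesystem could become a class parameter, so any HCFS URI can satisfy it:
>
>   # Hypothetical parameterized class -- illustrative only.
>   class hadoop::common_config (
>     $default_fs = 'hdfs://namenode:8020'   # overridable with any HCFS URI
>   ) {
>     # A core-site.xml.erb template would emit fs.defaultFS (and, for older
>     # clients, fs.default.name) from $default_fs.
>     file { '/etc/hadoop/conf/core-site.xml':
>       content => template('hadoop/core-site.xml.erb'),
>     }
>   }
>
> A non-HDFS deployment would then just override the parameter, e.g. with an illustrative GlusterFS URI:
>
>   class { 'hadoop::common_config':
>     default_fs => 'glusterfs:///mnt/glusterfs',
>   }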
> Maybe this can be done in phases, but again, a Puppet expert will have to weigh in on what's feasible and practical, and maybe on how to phase these changes in an agile way. Any feedback is welcome - I realize this is a significant undertaking... but it's important to democratize the Hadoop stack, and Bigtop is the perfect place to do it!
--
This message was sent by Atlassian JIRA
(v6.2#6252)