Posted to common-dev@hadoop.apache.org by "Zheng, Kai" <ka...@intel.com> on 2016/06/13 13:02:03 UTC

A top container module like hadoop-cloud for cloud integration modules

Hi,

Noticing the obvious trend that Hadoop is supporting more and more cloud platforms, I suggest we have a top-level container module to hold such integration modules, like the ones for aws, openstack, azure and the upcoming one for aliyun. The rationale is simple, beyond the trend itself:

1. Existing modules are mixed into hadoop-tools, which has grown rather large at 18 modules now. Cloud-specific ones could be grouped together and separated out, which makes more sense;

2. Future abstraction and sharing of common specs and code would become easier, or at least possible;

3. A common testing approach could be defined together, for example the mechanisms discussed by Chris, Steve and Allen in HADOOP-12756;

4. Documentation for "Hadoop on Cloud"? Not sure it's needed, as we already have a section for "Hadoop compatible File Systems".

If this sounds good, the change would be a good fit for Hadoop 3.0, even though it should not have a big impact, since it can avoid affecting the published artifacts. It may cause some inconvenience for current development efforts, though.

Comments are welcome. Thanks!

Regards,
Kai


RE: A top container module like hadoop-cloud for cloud integration modules

Posted by "Zheng, Kai" <ka...@intel.com>.
Thanks Steve for the feedback and thoughts. 

It looks like people don't want to move the related modules around, as it may not add much real value. That's fine. I may offer better thoughts later once I have learned this area more deeply.

Regards,
Kai


Re: A top container module like hadoop-cloud for cloud integration modules

Posted by Steve Loughran <st...@hortonworks.com>.
> On 13 Jun 2016, at 14:02, Zheng, Kai <ka...@intel.com> wrote:
> 
> Hi,
> 
> Noticing the obvious trend that Hadoop is supporting more and more cloud platforms, I suggest we have a top-level container module to hold such integration modules, like the ones for aws, openstack, azure and the upcoming one for aliyun. The rationale is simple, beyond the trend itself:


I'm kind of +0 right now

> 
> 1. Existing modules are mixed into hadoop-tools, which has grown rather large at 18 modules now. Cloud-specific ones could be grouped together and separated out, which makes more sense;

The reason for having separate hadoop-aws and hadoop-openstack modules was always to permit the modules to use APIs exclusive to their cloud infrastructures, to structure the downstream dependencies, *and* to allow people like the EMR team to swap in their own closed-source version. I don't think anyone does that, though.

It also lets us completely isolate testing: each module's tests only run if you have the credentials.
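
As a rough illustration of that gating (the property names and class below are hypothetical, not the actual hadoop-aws wiring), a credential-gated integration test might look like:

    import org.junit.Assume;
    import org.junit.Before;
    import org.junit.Test;

    // Illustrative sketch only: skip (rather than fail) the whole test class
    // when no cloud credentials have been supplied to the build.
    public class ITestCloudStoreRoundTrip {

      private String accessKey;
      private String secretKey;

      @Before
      public void requireCredentials() {
        // Hypothetical property names; real modules read them from an
        // auth-keys.xml / site configuration file kept out of source control.
        accessKey = System.getProperty("test.cloud.access.key");
        secretKey = System.getProperty("test.cloud.secret.key");
        Assume.assumeTrue(accessKey != null && secretKey != null);
      }

      @Test
      public void testRoundTrip() throws Exception {
        // A real test would open a FileSystem against the store,
        // write data, read it back and verify the contents.
      }
    }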

> 
> 2. Future abstraction and sharing of common specs and code would become easier, or at least possible;

Right now hadoop-common is where cross-filesystem work and tests go. (Hint: reviewers for HADOOP-12807 needed.) I think we could start there with an org.apache.hadoop.cloud package and only split it out if compilation ordering merits it, or if it adds any dependencies to hadoop-common.
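
To make that concrete, here is the kind of small shared helper that could live in such a package; the class and method are purely hypothetical, not existing code:

    package org.apache.hadoop.cloud;

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Hypothetical sketch of a helper that the cloud connector modules
     * (s3a, wasb, swift, oss, ...) could share from hadoop-common instead
     * of carrying per-module copies.
     */
    public final class CloudStoreUtils {

      private CloudStoreUtils() {
      }

      /** Verify a store is reachable by probing its root directory. */
      public static void verifyStoreReachable(FileSystem fs) throws IOException {
        if (!fs.getFileStatus(new Path("/")).isDirectory()) {
          throw new IOException("Root of " + fs.getUri() + " is not a directory");
        }
      }
    }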

> 
> 3. A common testing approach could be defined together, for example the mechanisms discussed by Chris, Steve and Allen in HADOOP-12756;
> 


In SPARK-7481 I've added downstream tests for S3a and Azure in Spark; this turned up the fact that S3a in Hadoop 2.6 gets its block size wrong (0) in listings, so the splits all come out 1 byte long and the work dies. I think downstream tests in Spark, Hive, etc. would really round out cloud infra testing, but we can't put those into Hadoop as the build DAG prevents it. (Reviews for SPARK-7481 needed too, BTW.) System tests of the Aliyun and perhaps GFS connectors would need to go in there or in Bigtop, which is the other place I've discussed having cloud integration tests.
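
For reference, the arithmetic behind that failure mode, using the split-size formula FileInputFormat applies per file (the class below is just a worked example, not Hadoop code):

    // Worked example: why a reported block size of 0 collapses into 1-byte splits.
    // FileInputFormat computes splitSize = max(minSize, min(maxSize, blockSize)),
    // where minSize defaults to 1 and maxSize to Long.MAX_VALUE.
    public class SplitSizeSketch {

      static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
      }

      public static void main(String[] args) {
        long healthy = computeSplitSize(64L * 1024 * 1024, 1, Long.MAX_VALUE);
        long broken = computeSplitSize(0, 1, Long.MAX_VALUE); // block size from the buggy listing
        System.out.println("block size 64MB -> split size " + healthy); // 67108864
        System.out.println("block size 0    -> split size " + broken);  // 1, i.e. one split per byte
      }
    }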


> 4. Documentation for "Hadoop on Cloud"? Not sure it's needed, as we already have a section for "Hadoop compatible File Systems".

Again, we can stick this in common

> 
> If this sounds good, the change would be a good fit for Hadoop 3.0, even though it should not have a big impact, since it can avoid affecting the published artifacts. It may cause some inconvenience for current development efforts, though.
> 


I think it would make sense if other features went in. A good committer against object stores would be an example here: it depends on the MR libraries, so it can't go into common. Today it'd have to go into hadoop-mapreduce. This isn't too bad, as long as the APIs it uses are all in hadoop-common. It's only as things get more complex that it matters.
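
To show where that dependency comes from: even a skeletal object-store committer has to extend the MapReduce committer API. The class below is only an illustrative outline, not a real implementation:

    import java.io.IOException;

    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Outline only: a committer aimed at object stores would avoid rename()
    // and commit uploads directly, but every hook below comes from the
    // MapReduce client libraries, hence it cannot live in hadoop-common.
    public class ObjectStoreCommitterSketch extends OutputCommitter {

      @Override
      public void setupJob(JobContext context) throws IOException {
        // e.g. record a job ID under which task uploads will be tracked
      }

      @Override
      public void setupTask(TaskAttemptContext context) throws IOException {
      }

      @Override
      public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
        return true;
      }

      @Override
      public void commitTask(TaskAttemptContext context) throws IOException {
        // e.g. complete the uploads written by this task attempt
      }

      @Override
      public void abortTask(TaskAttemptContext context) throws IOException {
        // e.g. abort any pending uploads from this attempt
      }
    }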