Posted to common-dev@hadoop.apache.org by Steve Loughran <st...@hortonworks.com> on 2018/05/15 15:34:54 UTC

[DISCUSS] Branch Proposal: HADOOP 15407: ABFS

Hi

Chris Douglas and I have a proposal for a short-lived feature branch for the Azure ABFS connector to go into the hadoop-azure package. This will connect to the new Azure storage service, which will ultimately replace the one used by wasb. It's a big patch and, like all storage connectors, will inevitably take time to stabilize (i.e. nobody ever gets seek() right, even when we think we have).

Thomas & Esfandiar will do the coding: they've already done the paperwork. Chris, myself & anyone else interested can be involved in the review and testing.

Comments?

-------------

The initial HADOOP-15407 patch contains a new filesystem client for the forthcoming Azure ABFS, which is intended to replace Azure WASB as the Azure storage layer. The patch is large, as it contains the replacement client, tests, and generated code.

We propose a feature branch, so the module can be broken into salient, reviewable chunks. Internal constraints prevented this feature from being developed in Apache, so we want to ensure that all the code is discussed, maintainable, and documented by the community before it merges.

To effect this, we also propose adding two developers as branch committers: Thomas Marquardt (tmarq@microsoft.com) and Esfandiar Manii (esmanii@microsoft.com).

Beyond normal feature branch activity and merge criteria for FS modules, we want to add another merge criterion for ABFS. Some of the client APIs are not GA. It seems reasonable to require that this client works with public endpoints before it merges to trunk.

To test the Blob FS driver, the Blob FS team (including Esfandiar Manii and Thomas Marquardt) in Azure Storage will need the MSDN subscription ID(s) for all reviewers who want to run the tests. The ABFS team will then whitelist the subscription ID(s) for the Blob FS Preview. From then on, newly created storage accounts will have the Blob FS endpoint, <accountName>.dfs.core.windows.net, which the Blob FS driver relies on.

This is a temporary state during the (current) Private Preview and the early phases of Public Preview. In a few months, the whitelisting will not be required and anyone will be able to create a storage account with access to the Blob FS endpoint.

Thomas and Esfandiar have been active in the Hadoop project working on the WASB connector (see https://issues.apache.org/jira/browse/HADOOP-14552). They understand the processes and requirements of the software. Working on the branch directly will let them bring this significant feature into the hadoop-azure module without disrupting existing users.

Re: [DISCUSS] Branch Proposal: HADOOP 15407: ABFS

Posted by Steve Loughran <st...@hortonworks.com>.

On 15 May 2018, at 17:30, Thomas Marquardt <tm...@microsoft.com> wrote:

A feature branch seems reasonable to me too. Note that the WASB connector will continue to exist, and live side-by-side with the new Azure Blob Filesystem (ABFS) connector. We will encourage users to move to the new ABFS connector, and all of our new features and performance improvements will target the ABFS connector. ABFS will perform better at no additional cost, so I expect current users to migrate in time. The two connectors are compatible for mainline scenarios, but there are some uncommon features in WASB that we chose not to carry over in the initial implementation.

So we hope ABFS will replace the usage of WASB, but the WASB connector itself will continue to exist.  Maybe we can remove WASB in the future some day, if nobody is using it.


Migration strategies for connectors are interesting.

When the new S3 connector was first proposed (HADOOP-10400) we opted for a new name, "s3a", to allow things to go side-by-side until we were happy. For Hadoop 2.6-2.7, this worked well, as stuff stabilised. Now that things are good we've cut the old s3n connector from branch-3 entirely, leaving a stub entry telling people to migrate (HADOOP-14738). The stub is needed so that if anyone explicitly declares a mapping of scheme -> FS (as people do in Spark, more for superstition than need), they'll get a better message than "Class not found".

We could have tried to silently forward to the S3A FS, but that adds two issues:
* configuration options are all different
* it gets confusing when you return URLs from listings.

The strategy taken stops the switch being magic, but does seem to work. It also has a nice little side effect: if someone ever files a bug report with an s3n:// URL, we know to close it as invalid, or at least say "move to s3a then retry". If the scheme had stayed the same, you'd need to know the actual version number of Hadoop underneath to know whether this was with current or removed code.
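
As a rough illustration of the stub-entry idea only (this is not Hadoop's actual code or API; the registry, function names and error type below are all hypothetical), a scheme -> factory table can keep an entry for the removed scheme whose only job is to fail with a migration hint, rather than forwarding silently or failing with a generic lookup error:

```python
class RemovedConnectorError(Exception):
    """Raised when a URL uses a scheme whose connector has been removed."""


def open_s3a(url):
    # Hypothetical stand-in for creating a real filesystem client.
    return f"s3a-client({url})"


def removed_stub(old_scheme, new_scheme):
    """Build a factory that always fails with a pointer to the replacement."""
    def factory(url):
        raise RemovedConnectorError(
            f"The {old_scheme}:// connector has been removed; "
            f"migrate to {new_scheme}:// and retry (URL was {url})"
        )
    return factory


# Hypothetical scheme -> factory registry. The old scheme stays registered,
# but only as a stub, so an explicit mapping still resolves to something
# that explains what to do next.
REGISTRY = {
    "s3a": open_s3a,
    "s3n": removed_stub("s3n", "s3a"),
}


def open_filesystem(url):
    scheme, sep, _ = url.partition("://")
    if not sep:
        raise ValueError(f"URL has no scheme: {url!r}")
    try:
        factory = REGISTRY[scheme]
    except KeyError:
        # The unhelpful failure the stub is meant to avoid for known schemes.
        raise LookupError(f"No filesystem registered for scheme {scheme!r}") from None
    return factory(url)
```

Because the stub never forwards to the new connector, the two sets of configuration options never mix, and URLs returned from listings always carry the scheme the caller actually asked for.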

Given the filesystems will work with two different service endpoints, they should be isolated. We will need to keep the work on WASB alive though, if not for new features, then for security, regression tests and bug fixes.


I can confirm that nobody ever gets seek() right. :)

That and rename(), obviously, though the fact that nobody knows what rename() is meant to do makes it easy for everyone to argue that their interpretation is correct. Certainly I do.

-Steve

Re: [DISCUSS] Branch Proposal: HADOOP 15407: ABFS

Posted by Thomas Marquardt <tm...@microsoft.com.INVALID>.
A feature branch seems reasonable to me too. Note that the WASB connector will continue to exist, and live side-by-side with the new Azure Blob Filesystem (ABFS) connector. We will encourage users to move to the new ABFS connector, and all of our new features and performance improvements will target the ABFS connector. ABFS will perform better at no additional cost, so I expect current users to migrate in time. The two connectors are compatible for mainline scenarios, but there are some uncommon features in WASB that we chose not to carry over in the initial implementation.


So we hope ABFS will replace the usage of WASB, but the WASB connector itself will continue to exist.  Maybe we can remove WASB in the future some day, if nobody is using it.


I can confirm that nobody ever gets seek() right. :)


Thanks,

Thomas

________________________________
From: larry mccay <lm...@apache.org>
Sent: Tuesday, May 15, 2018 8:44 AM
To: Steve Loughran
Cc: Hadoop Common
Subject: Re: [DISCUSS] Branch Proposal: HADOOP 15407: ABFS

This seems like a reasonable and effective use of a feature branch and
branch committers to me.



Re: [DISCUSS] Branch Proposal: HADOOP 15407: ABFS

Posted by larry mccay <lm...@apache.org>.
This seems like a reasonable and effective use of a feature branch and
branch committers to me.



Re: [DISCUSS] Branch Proposal: HADOOP 15407: ABFS

Posted by Chris Douglas <cd...@apache.org>.
There's not a lot of (required) ceremony. Any committer can create the
branch, including branch committers after the PMC adds them (see
bylaws [1]). -C

[1]: http://hadoop.apache.org/bylaws.html

On Thu, May 17, 2018 at 9:16 AM, Steve Loughran <st...@hortonworks.com> wrote:
> Now, what's next? I know we have the normal vote process to merge a branch back in...what about the branch creation + giving people branch commit rights?
>
> -steve
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
> For additional commands, e-mail: common-dev-help@hadoop.apache.org
>



Re: [DISCUSS] Branch Proposal: HADOOP 15407: ABFS

Posted by Steve Loughran <st...@hortonworks.com>.
Now, what's next? I know we have the normal vote process to merge a branch back in...what about the branch creation + giving people branch commit rights?

-steve




Re: [DISCUSS] Branch Proposal: HADOOP 15407: ABFS

Posted by Steve Loughran <st...@hortonworks.com>.

On 15 May 2018, at 20:37, Sean Busbey <bu...@cloudera.com> wrote:

apologies, copying back in common-dev@ with my question about the code.

On Tue, May 15, 2018 at 2:36 PM, Sean Busbey <bu...@cloudera.com> wrote:
>  Internal constraints prevented this feature from being developed in Apache, so we want to ensure that all the code is discussed, maintainable, and documented by the community before it merges.

Has this code gone through ASF IP Clearance already?


It's been submitted as a large .patch on the JIRA, which is effectively a code donation, and Thomas and Esfandiar have done the paperwork to be able to commit straight to a branch. Part of the merge will include the usual verification that redistributed dependency artifacts aren't cat-X licensed.





Re: [DISCUSS] Branch Proposal: HADOOP 15407: ABFS

Posted by Sean Busbey <bu...@cloudera.com>.
apologies, copying back in common-dev@ with my question about the code.

On Tue, May 15, 2018 at 2:36 PM, Sean Busbey <bu...@cloudera.com> wrote:

> >  Internal constraints prevented this feature from being developed in
> Apache, so we want to ensure that all the code is discussed, maintainable,
> and documented by the community before it merges.
>
> Has this code gone through ASF IP Clearance already?
>
> --
> busbey
>



-- 
busbey