Posted to dev@accumulo.apache.org by Jeff Kubina <je...@gmail.com> on 2021/07/27 17:15:40 UTC

Accumulo with Native S3 Support

All,

Some of AWS's back-end services use a version of Accumulo modified to use
Amazon's S3 as its storage system. Amazon engineers forked Accumulo 2.0 and
merged that S3 support into it <https://github.com/cmilbert/accumulo/>.
Chris Milbert is the lead Amazon engineer who did the integration. Chris
and I would like to jump-start the conversation about how best to initiate
the pull request for these changes into Accumulo 2.1.

Mike Wall suggested using this as an opportunity to abstract out the
storage system of Accumulo and make it pluggable. He suggested the
following broad steps:

   1. Identify all the things HDFS provides, such as read, write,
   replication, and failover.
   2. Abstract out a file system interface with hooks for all those things
   (one that does not require loading Hadoop jars).
   3. Plug in HDFS as the default implementation of that interface, hiding
   all Hadoop jars there.
   4. Make another implementation that plugs in S3 and make it optionally
   configured.
   5. Run tests to make sure we didn't break things with HDFS.
   6. Run tests to see if S3 meets all the requirements.
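
To make the steps above concrete, here is a minimal sketch of what such a
pluggable storage interface might look like. Everything in it is an
assumption for illustration: StorageBackend, InMemoryBackend, and
StorageBackends are hypothetical names, not existing Accumulo or Hadoop
APIs, and the in-memory class only stands in for a real HDFS or S3
implementation. A real interface would also need to cover replication,
failover, and I/O error handling.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical storage interface (steps 1 and 2); not an existing Accumulo API. */
interface StorageBackend {
    void write(String path, byte[] data);

    /** Returns the stored bytes, or empty if nothing exists at the path. */
    Optional<byte[]> read(String path);

    boolean exists(String path);
}

/** In-memory stand-in for a real backend such as HDFS or S3 (steps 3 and 4). */
final class InMemoryBackend implements StorageBackend {
    private final Map<String, byte[]> files = new ConcurrentHashMap<>();

    @Override
    public void write(String path, byte[] data) {
        // Copy the array so later changes by the caller do not alter stored data.
        files.put(path, data.clone());
    }

    @Override
    public Optional<byte[]> read(String path) {
        byte[] data = files.get(path);
        return data == null ? Optional.empty() : Optional.of(data.clone());
    }

    @Override
    public boolean exists(String path) {
        return files.containsKey(path);
    }
}

/** Selects a backend by configured name, so the default can be swapped out. */
final class StorageBackends {
    private static final Map<String, Supplier<StorageBackend>> REGISTRY =
            new ConcurrentHashMap<>();

    private StorageBackends() {}

    static void register(String name, Supplier<StorageBackend> factory) {
        REGISTRY.put(name, factory);
    }

    static StorageBackend create(String name) {
        Supplier<StorageBackend> factory = REGISTRY.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("No storage backend registered as: " + name);
        }
        return factory.get();
    }
}
```

Under this sketch, an HDFS-backed class would be registered as the default
and an S3-backed class under its own name, with the Hadoop jars needed only
by the HDFS implementation. The same test suite could then be run against
each registered backend (steps 5 and 6).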

Ed Coleman also suggested first forking Accumulo 2.1 and merging the S3
changes into it.

Chris and I look forward to the discussion on how best to add S3 support to
Accumulo.

Thanks,
Jeff
-- 
Jeff Kubina

Re: [EXTERNAL] Accumulo with Native S3 Support

Posted by Christopher <ct...@apache.org>.
From what I saw from looking at the changes in Chris Milbert's fork,
the fork contains a couple of S3 implementations of Hadoop's FileSystem
interface in a separate module (similar to s3a:// and abfss://
implementations). It seems to add accS3mo:// and accS3nf://
implementations, which, in spite of their names, do not appear to be
Accumulo-specific (that's a good thing... as these could be reused by
other projects as well!).
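
As a rough illustration of why such implementations are not tied to
Accumulo: Hadoop typically picks a FileSystem class from the URI scheme,
via a configuration property named fs.<scheme>.impl (or via service-loader
discovery). The sketch below is a simplified model of that lookup in plain
Java, not Hadoop's actual code. The S3AFileSystem class name is the real
one shipped in hadoop-aws; the class name given for accS3mo is a made-up
placeholder, since the fork's actual class names are not stated here.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Simplified model of scheme-based FileSystem lookup; not Hadoop's actual code. */
final class SchemeLookup {
    private SchemeLookup() {}

    /** Builds the configuration key consulted for a URI's scheme, e.g. fs.s3a.impl. */
    static String implKey(URI uri) {
        return "fs." + uri.getScheme() + ".impl";
    }

    /** Returns the configured implementation class name for the URI, if any. */
    static Optional<String> implClass(URI uri, Map<String, String> conf) {
        return Optional.ofNullable(conf.get(implKey(uri)));
    }

    /** Sample configuration for illustration only. */
    static Map<String, String> sampleConf() {
        Map<String, String> conf = new HashMap<>();
        // Real class name from the hadoop-aws module.
        conf.put("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        // Placeholder class name: an assumption, not taken from the fork.
        conf.put("fs.accS3mo.impl", "com.example.AccS3moFileSystem");
        return conf;
    }
}
```

Any project that resolves paths through Hadoop's FileSystem API could point
at a new scheme this way, which is why the implementations do not need to
live in Accumulo's code base.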

In addition, these FileSystem implementations seem to be accompanied
by a few changes to Accumulo code itself, but I couldn't tell if these
were necessary to improve compatibility with these new FileSystems or
if they were unrelated additional enhancements to Accumulo. They also
appeared to be based on an older 2.0 branch, rather than the latest
2.1 / main branch, and conflict with some of the changes in 2.1
branch. So those changes will need to be rebased.

So, I suggest isolating the FileSystem implementations from the
changes to Accumulo. The FileSystem implementations don't need to be
merged into Accumulo's code base, or built as part of Accumulo at all.
They are completely independent from Accumulo and can exist in their
own repo, for use by any other user, just like s3a:// or abfss:// .
The Accumulo PMC could decide to accept responsibility for these
FileSystem implementations, but I don't think the Accumulo project at
the ASF is the best home for them, as they are not Accumulo-specific.
It might make more sense as a subproject of Hadoop instead of
Accumulo, since they are Hadoop FileSystem implementations, or remain
as a third-party repository on GitHub as part of the larger Hadoop
ecosystem. Finding the best home for these may take some additional
research on the part of its developers.

The changes to Accumulo itself, separate from the S3 FileSystem
implementations, will be easiest to incorporate into the 2.1 / main
branch if they are rebased first, and submitted from a fork on GitHub
(Chris Milbert's repo does not appear to be a "fork", but a
disconnected clone, so creating a PR using GitHub's UI won't be
possible without first recreating the repo using the "fork" feature on
GitHub). If there are multiple, discrete changes, serving independent
purposes, the changes should be teased apart and submitted as separate
PRs against the main branch, so they can be evaluated on their own
merits through the code review process. It is hard to consider their
merits without a pull request for those changes.

I think the discussion of abstracting the storage layer in Accumulo is
a worthy one, but I think it can be set aside for now. Abstracting the
storage layer from Hadoop would involve creating Accumulo-specific
storage APIs, and corralling Hadoop FileSystem API calls behind an
implementation of that Accumulo storage API. However, that's not
necessary for this. We currently use Hadoop's FileSystem APIs
throughout our own code, and Hadoop's FileSystem already provides
sufficient abstraction for the purposes of adding S3 support to
Accumulo, and that's what appears to have been done by Chris Milbert.
So, there's no need to complicate the discussion with additional
potential future work to further abstract Hadoop FileSystem API calls.
That abstraction doesn't appear to be a necessary prerequisite to
considering the work done by Chris in his repo.

To me, the main questions are:

1. Can the new FileSystem implementations be used as easily as other
drop-in implementations, like s3a:// and abfss:// ?
2. Where is the best home for these FileSystem implementations?
3. What benefits do the other changes to Accumulo serve, and can they
be rebased and submitted as separate PRs against Accumulo's main
branch?


On Tue, Jul 27, 2021 at 2:00 PM Arvind Shyamsundar
<ar...@microsoft.com.invalid> wrote:
>
> Hi Jeff, what would be the difference between this path, and what can be accomplished by using a Hadoop FileSystem interface based connector to talk to S3? Is it because of the consistency limitations with s3a:// (https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html)?
>
> As you probably know for Azure, we went with the abfss:// connector provided as part of hadoop-azure (https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html) with minimal effort. Just wondering what the key difference here is for S3.
>
> Thanks!
>
> Arvind.

RE: [EXTERNAL] Accumulo with Native S3 Support

Posted by Arvind Shyamsundar <ar...@microsoft.com.INVALID>.
Hi Jeff, what would be the difference between this path and what can be accomplished by using a connector based on the Hadoop FileSystem interface to talk to S3? Is it because of the consistency limitations with s3a:// (https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html)?

As you probably know for Azure, we went with the abfss:// connector provided as part of hadoop-azure (https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html) with minimal effort. Just wondering what the key difference here is for S3.

Thanks!

Arvind.

Re: Accumulo with Native S3 Support

Posted by Jeremy Kepner <ke...@ll.mit.edu>.
If this works it will be great.
There might also be interest in creating a Lustre plugin.
Regards.  -Jeremy