Posted to user@hadoop.apache.org by Elliot West <te...@gmail.com> on 2016/04/28 14:01:28 UTC

S3 Hadoop FileSystems

Hello,

I'm working on a project that moves data from HDFS file systems into S3 for
analysis with Hive on EMR. Recently I've become quite confused with the
state of play regarding the different FileSystems: s3, s3n, and s3a. For my
use case I require the following:

   - Support for the transfer of very large files.
   - MD5 checks on copy operations to provide data verification.
   - Excellent compatibility within an EMR/Hive environment.

To move data between clusters, it would seem that current versions of the
NativeS3FileSystem are my best bet; it appears that only s3n provides MD5
checking
<https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.java#L120>.
It is often cited that s3n does not support files over 5GB, but I can find
no indication of such a limitation in the source code; in fact, I see that
it switches over to multi-part upload for larger files
<https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.java#L130>.
So, has this limitation been removed in s3n?
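
For context, this is roughly how I am wiring up s3n on the source cluster.
It is only a minimal sketch: the bucket and path are placeholders, and the
multipart property names (introduced by HADOOP-9454, as far as I can tell)
should be checked against the core-default.xml of your release.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3nMultipartSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Credentials for the s3n (JetS3t-backed) connector; assumes the
        // usual AWS environment variables are exported.
        conf.set("fs.s3n.awsAccessKeyId", System.getenv("AWS_ACCESS_KEY_ID"));
        conf.set("fs.s3n.awsSecretAccessKey", System.getenv("AWS_SECRET_ACCESS_KEY"));
        // Multipart upload switches; defaults may differ between releases.
        conf.setBoolean("fs.s3n.multipart.uploads.enabled", true);
        conf.setLong("fs.s3n.multipart.uploads.block.size", 64 * 1024 * 1024);

        // "my-bucket" and "staging" are placeholders.
        FileSystem s3n = FileSystem.get(URI.create("s3n://my-bucket/"), conf);
        for (FileStatus status : s3n.listStatus(new Path("s3n://my-bucket/staging/"))) {
          System.out.println(status.getPath() + " " + status.getLen());
        }
      }
    }

My assumption is that, with these settings, anything larger than the
configured block size goes through the multipart path rather than a single
PUT.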

Within EMR, Amazon appears to recommend s3, supports s3n, and advises
against s3a
<http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-file-systems.html>.
So s3n would appear to win out here too? I assume that the s3n
implementation available in EMR is different from that in Apache Hadoop? I
find it hard to imagine that AWS would use JetS3t instead of their own AWS
Java client, but perhaps they do?

Finally, could I use NativeS3FileSystem to perform the actual transfer on
my Apache Hadoop cluster but then rewrite the table locations in my EMR
Hive metastore to use the s3:// protocol prefix? Could that work?
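
To make that last question concrete, the rewrite I have in mind is along
these lines. It is a rough sketch only, using the Hive metastore Java
client; the database, table, and bucket names are placeholders, and I have
not checked it against the metastore version that EMR ships.

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class RewriteTableLocation {
      public static void main(String[] args) throws Exception {
        // Assumes hive-site.xml on the classpath points at the EMR metastore.
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
        try {
          // "analytics" and "events" are placeholder database/table names.
          Table table = client.getTable("analytics", "events");
          String oldLocation = table.getSd().getLocation();
          // Swap the scheme used for the transfer (s3n://) for the one EMR
          // documents (s3://), leaving the rest of the path untouched.
          String newLocation = oldLocation.replaceFirst("^s3n://", "s3://");
          table.getSd().setLocation(newLocation);
          client.alter_table("analytics", "events", table);
          System.out.println(oldLocation + " -> " + newLocation);
        } finally {
          client.close();
        }
      }
    }

Partitioned tables would presumably need the same treatment applied to each
partition's location.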

I'd appreciate any light that can be shed on these questions, and any
feedback on my reasoning behind the proposal to use s3n for this
particular use case.

Thanks,

Elliot.

Re: S3 Hadoop FileSystems

Posted by Chris Nauroth <cn...@hortonworks.com>.
Hello Elliot,

You're welcome, and the time was not wasted at all.  This is exactly the kind of valuable discussion that we like to share on the user@ list.  As an outcome, we now have a more definitive answer about how MD5 verification works in s3a.  Thank you for starting the discussion.

--Chris Nauroth

From: Elliot West <te...@gmail.com>
Date: Tuesday, May 3, 2016 at 2:50 AM
To: Chris Nauroth <cn...@hortonworks.com>
Cc: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: S3 Hadoop FileSystems

Thank you,

I had a look at HADOOP-13076 and associated code snippets in the AWS SDK. I agree that the MD5 check does appear to be taking place after all. I appreciate your efforts in looking into that matter and raising the ticket.

Apologies for any time I may have wasted.

Cheers - Elliot.

On 30 April 2016 at 23:16, Chris Nauroth <cn...@hortonworks.com> wrote:
I have some more information regarding MD5 verification with s3a.  It turns out that s3a does have the MD5 verification.  It's just not visible from reading the s3a code, because the MD5 verification is performed entirely within the AWS SDK library dependency.  If you're interested in more details on how this works, or if you want to follow any further discussion on this topic, then please take a look at the comments on HADOOP-13076.

--Chris Nauroth

From: Chris Nauroth <cn...@hortonworks.com>
Date: Friday, April 29, 2016 at 9:03 PM
To: Elliot West <te...@gmail.com>, "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: S3 Hadoop FileSystems

Hello Elliot,

The current state of support for the various S3 file system implementations within the Apache Hadoop community can be summed up as follows:

s3: Soon to be deprecated, not actively maintained, appears to not work reliably at all in recent versions.
s3n: Not yet on its way to deprecation, but also not actively maintained.
s3a: This is seen as the direction forward for S3 integration, so this is where Hadoop contributors are currently focusing their energy.

Regarding interoperability with EMR, I can't speak from any of my own experience on how to achieve this.  We know that EMR runs custom code different from what you'll see in the Apache repos.  I think that creates a risk for interop.  My only suggestion would be to experiment and make sure to test any of your interop scenarios end-to-end very thoroughly.

As you noticed, s3n no longer has a 5 GB limitation.  Issue HADOOP-9454 introduced support for files larger than 5 GB by using multi-part upload.  This patch was released in Apache Hadoop 2.4.0.

Regarding lack of MD5 verification in s3a, I believe that is just an oversight, not an intentional design choice.  I filed HADOOP-13076 to track adding this feature in s3a.

--Chris Nauroth

From: Elliot West <te...@gmail.com>
Date: Thursday, April 28, 2016 at 5:01 AM
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: S3 Hadoop FileSystems

Hello,

I'm working on a project that moves data from HDFS file systems into S3 for analysis with Hive on EMR. Recently I've become quite confused with the state of play regarding the different FileSystems: s3, s3n, and s3a. For my use case I require the following:

  *   Support for the transfer of very large files.
  *   MD5 checks on copy operations to provide data verification.
  *   Excellent compatibility within an EMR/Hive environment.

To move data between clusters, it would seem that current versions of the NativeS3FileSystem are my best bet; it appears that only s3n provides MD5 checking <https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.java#L120>. It is often cited that s3n does not support files over 5GB, but I can find no indication of such a limitation in the source code; in fact, I see that it switches over to multi-part upload for larger files <https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.java#L130>. So, has this limitation been removed in s3n?

Within EMR, Amazon appears to recommend s3, supports s3n, and advises against s3a <http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-file-systems.html>. So s3n would appear to win out here too? I assume that the s3n implementation available in EMR is different from that in Apache Hadoop? I find it hard to imagine that AWS would use JetS3t instead of their own AWS Java client, but perhaps they do?

Finally, could I use NativeS3FileSystem to perform the actual transfer on my Apache Hadoop cluster but then rewrite the table locations in my EMR Hive metastore to use the s3:// protocol prefix? Could that work?

I'd appreciate any light that can be shed on these questions, and any feedback on my reasoning behind the proposal to use s3n for this particular use case.

Thanks,

Elliot.




Re: S3 Hadoop FileSystems

Posted by Elliot West <te...@gmail.com>.
Thank you,

I had a look at HADOOP-13076 and associated code snippets in the AWS SDK. I
agree that the MD5 check does appear to be taking place after all. I
appreciate your efforts in looking into that matter and raising the ticket.

Apologies for any time I may have wasted.

Cheers - Elliot.

On 30 April 2016 at 23:16, Chris Nauroth <cn...@hortonworks.com> wrote:

> I have some more information regarding MD5 verification with s3a.  It
> turns out that s3a does have the MD5 verification.  It's just not visible
> from reading the s3a code, because the MD5 verification is performed
> entirely within the AWS SDK library dependency.  If you're interested in
> more details on how this works, or if you want to follow any further
> discussion on this topic, then please take a look at the comments on
> HADOOP-13076.
>
> --Chris Nauroth
>
> From: Chris Nauroth <cn...@hortonworks.com>
> Date: Friday, April 29, 2016 at 9:03 PM
> To: Elliot West <te...@gmail.com>, "user@hadoop.apache.org" <
> user@hadoop.apache.org>
> Subject: Re: S3 Hadoop FileSystems
>
> Hello Elliot,
>
> The current state of support for the various S3 file system
> implementations within the Apache Hadoop community can be summed up as
> follows:
>
> s3: Soon to be deprecated, not actively maintained, appears to not work
> reliably at all in recent versions.
> s3n: Not yet on its way to deprecation, but also not actively maintained.
> s3a: This is seen as the direction forward for S3 integration, so this is
> where Hadoop contributors are currently focusing their energy.
>
> Regarding interoperability with EMR, I can't speak from any of my own
> experience on how to achieve this.  We know that EMR runs custom code
> different from what you'll see in the Apache repos.  I think that creates a
> risk for interop.  My only suggestion would be to experiment and make sure
> to test any of your interop scenarios end-to-end very thoroughly.
>
> As you noticed, s3n no longer has a 5 GB limitation.  Issue HADOOP-9454
> introduced support for files larger than 5 GB by using multi-part upload.
> This patch was released in Apache Hadoop 2.4.0.
>
> Regarding lack of MD5 verification in s3a, I believe that is just an
> oversight, not an intentional design choice.  I filed HADOOP-13076 to track
> adding this feature in s3a.
>
> --Chris Nauroth
>
> From: Elliot West <te...@gmail.com>
> Date: Thursday, April 28, 2016 at 5:01 AM
> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: S3 Hadoop FileSystems
>
> Hello,
>
> I'm working on a project that moves data from HDFS file systems into S3
> for analysis with Hive on EMR. Recently I've become quite confused with the
> state of play regarding the different FileSystems: s3, s3n, and s3a. For my
> use case I require the following:
>
>    - Support for the transfer of very large files.
>    - MD5 checks on copy operations to provide data verification.
>    - Excellent compatibility within an EMR/Hive environment.
>
> To move data between clusters, it would seem that current versions of the
> NativeS3FileSystem are my best bet; it appears that only s3n provides MD5
> checking
> <https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.java#L120>.
> It is often cited that s3n does not support files over 5GB, but I can find
> no indication of such a limitation in the source code; in fact, I see that
> it switches over to multi-part upload for larger files
> <https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.java#L130>.
> So, has this limitation been removed in s3n?
>
> Within EMR, Amazon appears to recommend s3, supports s3n, and advises
> against s3a
> <http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-file-systems.html>.
> So s3n would appear to win out here too? I assume that the s3n
> implementation available in EMR is different from that in Apache Hadoop? I
> find it hard to imagine that AWS would use JetS3t instead of their own AWS
> Java client, but perhaps they do?
>
> Finally, could I use NativeS3FileSystem to perform the actual transfer on
> my Apache Hadoop cluster but then rewrite the table locations in my EMR
> Hive metastore to use the s3:// protocol prefix? Could that work?
>
> I'd appreciate any light that can be shed on these questions, and any
> feedback on my reasoning behind the proposal to use s3n for this
> particular use case.
>
> Thanks,
>
> Elliot.
>
>
>

Re: S3 Hadoop FileSystems

Posted by Chris Nauroth <cn...@hortonworks.com>.
I have some more information regarding MD5 verification with s3a.  It turns out that s3a does have the MD5 verification.  It's just not visible from reading the s3a code, because the MD5 verification is performed entirely within the AWS SDK library dependency.  If you're interested in more details on how this works, or if you want to follow any further discussion on this topic, then please take a look at the comments on HADOOP-13076.
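
To make that concrete, the check performed inside the SDK is conceptually
similar to the sketch below.  This is not the SDK's internal code, just an
illustration of comparing a locally computed MD5 with the ETag returned by
a simple single-part PUT, where the two are expected to match; the bucket
and key are placeholders, and credentials come from the default provider
chain.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.PutObjectResult;
    import org.apache.commons.codec.digest.DigestUtils;

    public class Md5CheckSketch {
      public static void main(String[] args) throws Exception {
        File data = new File(args[0]);

        // Hex-encoded MD5 of the local file (commons-codec ships with Hadoop).
        String localMd5;
        try (InputStream in = new FileInputStream(data)) {
          localMd5 = DigestUtils.md5Hex(in);
        }

        // Single-part PUT; for a non-multipart, non-KMS upload the ETag is
        // the hex MD5 of the object.
        AmazonS3 s3 = new AmazonS3Client();
        PutObjectResult result = s3.putObject("my-bucket", "my/key", data);
        String remoteMd5 = result.getETag().replace("\"", "");

        if (!localMd5.equalsIgnoreCase(remoteMd5)) {
          throw new IllegalStateException(
              "MD5 mismatch: " + localMd5 + " vs " + remoteMd5);
        }
        System.out.println("MD5 verified: " + localMd5);
      }
    }

For multipart uploads the ETag is no longer a plain MD5 of the whole
object, which, as I understand it, is why the per-part checks inside the
SDK are the ones that matter there.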

--Chris Nauroth

From: Chris Nauroth <cn...@hortonworks.com>
Date: Friday, April 29, 2016 at 9:03 PM
To: Elliot West <te...@gmail.com>, "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: S3 Hadoop FileSystems

Hello Elliot,

The current state of support for the various S3 file system implementations within the Apache Hadoop community can be summed up as follows:

s3: Soon to be deprecated, not actively maintained, appears to not work reliably at all in recent versions.
s3n: Not yet on its way to deprecation, but also not actively maintained.
s3a: This is seen as the direction forward for S3 integration, so this is where Hadoop contributors are currently focusing their energy.

Regarding interoperability with EMR, I can't speak from any of my own experience on how to achieve this.  We know that EMR runs custom code different from what you'll see in the Apache repos.  I think that creates a risk for interop.  My only suggestion would be to experiment and make sure to test any of your interop scenarios end-to-end very thoroughly.

As you noticed, s3n no longer has a 5 GB limitation.  Issue HADOOP-9454 introduced support for files larger than 5 GB by using multi-part upload.  This patch was released in Apache Hadoop 2.4.0.

Regarding lack of MD5 verification in s3a, I believe that is just an oversight, not an intentional design choice.  I filed HADOOP-13076 to track adding this feature in s3a.

--Chris Nauroth

From: Elliot West <te...@gmail.com>
Date: Thursday, April 28, 2016 at 5:01 AM
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: S3 Hadoop FileSystems

Hello,

I'm working on a project that moves data from HDFS file systems into S3 for analysis with Hive on EMR. Recently I've become quite confused with the state of play regarding the different FileSystems: s3, s3n, and s3a. For my use case I require the following:

  *   Support for the transfer of very large files.
  *   MD5 checks on copy operations to provide data verification.
  *   Excellent compatibility within an EMR/Hive environment.

To move data between clusters, it would seem that current versions of the NativeS3FileSystem are my best bet; it appears that only s3n provides MD5 checking <https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.java#L120>. It is often cited that s3n does not support files over 5GB, but I can find no indication of such a limitation in the source code; in fact, I see that it switches over to multi-part upload for larger files <https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.java#L130>. So, has this limitation been removed in s3n?

Within EMR, Amazon appears to recommend s3, supports s3n, and advises against s3a <http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-file-systems.html>. So s3n would appear to win out here too? I assume that the s3n implementation available in EMR is different from that in Apache Hadoop? I find it hard to imagine that AWS would use JetS3t instead of their own AWS Java client, but perhaps they do?

Finally, could I use NativeS3FileSystem to perform the actual transfer on my Apache Hadoop cluster but then rewrite the table locations in my EMR Hive metastore to use the s3:// protocol prefix? Could that work?

I'd appreciate any light that can be shed on these questions, and any feedback on my reasoning behind the proposal to use s3n for this particular use case.

Thanks,

Elliot.



Re: S3 Hadoop FileSystems

Posted by Chris Nauroth <cn...@hortonworks.com>.
Hello Elliot,

The current state of support for the various S3 file system implementations within the Apache Hadoop community can be summed up as follows:

s3: Soon to be deprecated, not actively maintained, appears to not work reliably at all in recent versions.
s3n: Not yet on its way to deprecation, but also not actively maintained.
s3a: This is seen as the direction forward for S3 integration, so this is where Hadoop contributors are currently focusing their energy.

Regarding interoperability with EMR, I can't speak from any of my own experience on how to achieve this.  We know that EMR runs custom code different from what you'll see in the Apache repos.  I think that creates a risk for interop.  My only suggestion would be to experiment and make sure to test any of your interop scenarios end-to-end very thoroughly.
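
If it helps, an end-to-end probe can be quite small: write a marker object
through the connector on one side and read it back on the other.  The
sketch below is only that: the bucket is a placeholder, and the URI scheme
is left for you to vary between s3n, s3a, and whatever EMR resolves s3 to.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3InteropProbe {
      public static void main(String[] args) throws Exception {
        // Placeholder URI; run the write half on one cluster and the read
        // half on the other before trusting a larger transfer.
        Path probe = new Path(args.length > 0
            ? args[0] : "s3n://my-bucket/interop/probe.txt");
        byte[] expected = "interop probe".getBytes(StandardCharsets.UTF_8);

        Configuration conf = new Configuration();
        FileSystem fs = probe.getFileSystem(conf);

        try (FSDataOutputStream out = fs.create(probe, true)) {
          out.write(expected);
        }
        try (FSDataInputStream in = fs.open(probe)) {
          byte[] actual = new byte[expected.length];
          in.readFully(actual);
          System.out.println("round trip ok: " + Arrays.equals(expected, actual));
        }
      }
    }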

As you noticed, s3n no longer has a 5 GB limitation.  Issue HADOOP-9454 introduced support for files larger than 5 GB by using multi-part upload.  This patch was released in Apache Hadoop 2.4.0.

Regarding lack of MD5 verification in s3a, I believe that is just an oversight, not an intentional design choice.  I filed HADOOP-13076 to track adding this feature in s3a.

--Chris Nauroth

From: Elliot West <te...@gmail.com>
Date: Thursday, April 28, 2016 at 5:01 AM
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: S3 Hadoop FileSystems

Hello,

I'm working on a project that moves data from HDFS file systems into S3 for analysis with Hive on EMR. Recently I've become quite confused with the state of play regarding the different FileSystems: s3, s3n, and s3a. For my use case I require the following:

  *   Support for the transfer of very large files.
  *   MD5 checks on copy operations to provide data verification.
  *   Excellent compatibility within an EMR/Hive environment.

To move data between clusters, it would seem that current versions of the NativeS3FileSystem are my best bet; it appears that only s3n provides MD5 checking <https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.java#L120>. It is often cited that s3n does not support files over 5GB, but I can find no indication of such a limitation in the source code; in fact, I see that it switches over to multi-part upload for larger files <https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.java#L130>. So, has this limitation been removed in s3n?

Within EMR, Amazon appears to recommend s3, supports s3n, and advises against s3a <http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-file-systems.html>. So s3n would appear to win out here too? I assume that the s3n implementation available in EMR is different from that in Apache Hadoop? I find it hard to imagine that AWS would use JetS3t instead of their own AWS Java client, but perhaps they do?

Finally, could I use NativeS3FileSystem to perform the actual transfer on my Apache Hadoop cluster but then rewrite the table locations in my EMR Hive metastore to use the s3:// protocol prefix? Could that work?

I'd appreciate any light that can be shed on these questions, and any feedback on my reasoning behind the proposal to use s3n for this particular use case.

Thanks,

Elliot.


