Posted to common-issues@hadoop.apache.org by "Andrew Olson (Jira)" <ji...@apache.org> on 2020/03/03 21:40:00 UTC

[jira] [Comment Edited] (HADOOP-16900) Very large files can be truncated when written through S3AFileSystem

    [ https://issues.apache.org/jira/browse/HADOOP-16900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050575#comment-17050575 ] 

Andrew Olson edited comment on HADOOP-16900 at 3/3/20 9:39 PM:
---------------------------------------------------------------

[~stevel@apache.org] [~gabor.bota] I think we should fail fast here without writing anything. The fail-fast part apparently already happens, as seen in distcp task failures like this:
{noformat}
19/12/26 21:45:54 INFO mapreduce.Job: Task Id : attempt_1576854935249_175694_m_000003_0, Status : FAILED
Error: java.io.IOException: File copy failed: hdfs://cluster/path/to/file.avro --> s3a://bucket/path/to/file.avro
	at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:312)
	at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:270)
	at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:52)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hdfs://cluster/path/to/file.avro to s3a://bucket/path/to/file.avro
	at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
	at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:307)
	... 10 more
Caused by: java.lang.IllegalArgumentException: partNumber must be between 1 and 10000 inclusive, but is 10001
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115)
	at org.apache.hadoop.fs.s3a.S3AFileSystem$WriteOperationHelper.newUploadPartRequest(S3AFileSystem.java:3086)
	at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.uploadBlockAsync(S3ABlockOutputStream.java:492)
	at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$000(S3ABlockOutputStream.java:469)
	at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.uploadCurrentBlock(S3ABlockOutputStream.java:307)
	at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.write(S3ABlockOutputStream.java:289)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
	at java.io.DataOutputStream.write(DataOutputStream.java:107)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:299)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:216)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:146)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:116)
	at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
	... 11 more
{noformat}
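For context on where that IllegalArgumentException comes from: the trace shows a Guava precondition check on the part number inside WriteOperationHelper.newUploadPartRequest. A minimal standalone sketch of that kind of guard (my paraphrase, not the actual Hadoop source) makes it clear why nothing fails until the writer actually reaches part 10001:
{noformat}
// Paraphrase of the guard seen in the stack trace above, not the real Hadoop code.
// Each block upload validates its part number individually, so the error only
// surfaces once roughly 10,000 * fs.s3a.multipart.size bytes are already uploaded.
import com.google.common.base.Preconditions;

public class PartNumberCheck {
  // S3 caps a multipart upload at 10,000 parts.
  private static final int MAX_PARTS = 10000;

  static void validatePartNumber(int partNumber) {
    Preconditions.checkArgument(
        partNumber > 0 && partNumber <= MAX_PARTS,
        "partNumber must be between 1 and 10000 inclusive, but is %s", partNumber);
  }

  public static void main(String[] args) {
    validatePartNumber(10000); // fine: the last allowed part
    validatePartNumber(10001); // throws IllegalArgumentException, as in the trace
  }
}
{noformat}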
However... distcp has an optimization where it skips files that already exist in the destination (I think those that match a checksum of some number of initial bytes?), so when the task attempt was retried it found no work to do for this large file, and the distcp job ultimately succeeded. I should probably open a separate issue to have distcp compare file lengths before deciding that a path already present in the target location can safely be skipped.
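Something like this hypothetical check is what I have in mind (a sketch against the public Hadoop FileSystem API only; the class and method names are made up, not taken from distcp):
{noformat}
// Hypothetical length-aware skip check. Names here are illustrative and are
// not actual distcp code.
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LengthAwareSkipCheck {

  /** Skip only if the target exists and its length matches the source. */
  static boolean canSkip(FileSystem srcFs, Path src,
                         FileSystem dstFs, Path dst) throws IOException {
    if (!dstFs.exists(dst)) {
      return false; // nothing at the target yet, the copy must run
    }
    FileStatus srcStatus = srcFs.getFileStatus(src);
    FileStatus dstStatus = dstFs.getFileStatus(dst);
    // A length mismatch (e.g. 289789841890 vs 262144000000 bytes) means the
    // earlier attempt left a truncated object, so it must not be skipped.
    return srcStatus.getLen() == dstStatus.getLen();
  }
}
{noformat}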

We ran into this because we had tuned fs.s3a.multipart.size down to 25M. That effectively imposed a ~250 GB limit on our S3A file writes (10,000 parts * 25M = 262144000000 bytes) due to the 10,000-part limit on the AWS side. Adjusting the multipart chunk size to a larger value (100M) got us past this, since all our files are under 1 TB.
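A quick back-of-the-envelope check (assuming the configured sizes are parsed as binary megabytes, which matches the byte counts we observed) reproduces the truncated length exactly:
{noformat}
// Rough arithmetic behind the limit described above; the 25M/100M values are
// assumed to parse as binary megabytes.
public class MultipartLimit {
  private static final long MAX_PARTS = 10000L;   // S3 multipart upload cap

  public static void main(String[] args) {
    long partSize25M = 25L * 1024 * 1024;          // fs.s3a.multipart.size=25M
    long partSize100M = 100L * 1024 * 1024;        // fs.s3a.multipart.size=100M

    System.out.println(MAX_PARTS * partSize25M);   // 262144000000 (~244 GiB)
    // ^ exactly the truncated length seen below for the 289789841890-byte file
    System.out.println(MAX_PARTS * partSize100M);  // 1048576000000 (~976 GiB)
    // ^ enough headroom for our sub-1 TB files
  }
}
{noformat}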

Here's the evidence of the issue that we recorded, noting the unexpected file size difference (289789841890 bytes at the source vs. 262144000000 bytes at the destination), with names changed to protect the innocent.
{noformat}
$ hdfs dfs -ls /path/to/file.avro
-rwxrwxr-x   3 user group 289789841890 2016-12-20 06:37 /path/to/file.avro

$ hdfs dfs -conf conf.xml -ls s3a://bucket/path/to/file.avro
-rw-rw-rw-   1 user 262144000000 2019-12-26 21:45 s3a://bucket/path/to/file.avro
{noformat}


> Very large files can be truncated when written through S3AFileSystem
> --------------------------------------------------------------------
>
>                 Key: HADOOP-16900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16900
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.2.1
>            Reporter: Andrew Olson
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: s3
>
> If a written file size exceeds 10,000 * {{fs.s3a.multipart.size}}, a corrupt truncation of the S3 object will occur, as the maximum number of parts in a multipart upload is 10,000 as specified by the S3 API. There is an apparent bug where this failure is not fatal, and the multipart upload is allowed to be marked as completed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org