Posted to dev@iceberg.apache.org by Jack Ye <ye...@gmail.com> on 2021/10/06 05:34:01 UTC

Re: Error when writing large number of rows with S3FileIO

Hi Mayur, sorry I did not follow up on this. Were you able to fix the issue
with the AWS SDK upgrade?
-Jack Ye


RE: Error when writing large number of rows with S3FileIO

Posted by Mayur Srivastava <Ma...@twosigma.com>.
Jack, thanks for the follow-up.

The issue was with the multipart upload (MPU) support in our internal S3 backend. It currently doesn’t support content-type=‘binary/octet-stream’. It worked when we changed the request to the following:

CreateMultipartUploadRequest.builder().bucket(bucket).key(key).contentType("application/octet-stream").build();

We are changing our backend to support ‘binary/octet-stream’, which will fix the issue without any code change to Iceberg.
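
For reference, here is a minimal, self-contained sketch of that request against the AWS SDK v2 S3 client. The client setup, bucket, and key are placeholders for illustration, not the actual Iceberg or backend code:

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CreateMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.CreateMultipartUploadResponse;

public class MultipartContentTypeExample {
    public static void main(String[] args) {
        // Placeholder client; region and credentials come from the environment in practice.
        try (S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build()) {
            CreateMultipartUploadRequest request = CreateMultipartUploadRequest.builder()
                    .bucket("example-bucket")                 // placeholder bucket
                    .key("warehouse/db/table/data.parquet")   // placeholder key
                    // Set the content type explicitly; the backend rejected
                    // binary/octet-stream with the 415 error shown below.
                    .contentType("application/octet-stream")
                    .build();

            CreateMultipartUploadResponse response = s3.createMultipartUpload(request);
            System.out.println("Started multipart upload: " + response.uploadId());
        }
    }
}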

Thanks,
Mayur

From: Jack Ye <ye...@gmail.com>
Sent: Wednesday, October 6, 2021 1:34 AM
To: Iceberg Dev List <de...@iceberg.apache.org>
Subject: Re: Error when writing large number of rows with S3FileIO

Hi Mayur, sorry I did not follow up on this. Were you able to fix the issue with the AWS SDK upgrade?
-Jack Ye

On Thu, Sep 23, 2021 at 1:13 PM Mayur Srivastava <Ma...@twosigma.com> wrote:
I’ll try to upgrade the version and retry.

Thanks,
Mayur

From: Jack Ye <ye...@gmail.com>
Sent: Thursday, September 23, 2021 2:35 PM
To: Iceberg Dev List <de...@iceberg.apache.org>
Subject: Re: Error when writing large number of rows with S3FileIO

Thanks. While I am looking into this: this seems to be a very old version. Is there any reason to use that version specifically? Have you tried a newer version? I know there have been quite a few updates to the S3 package related to uploading since then, so maybe upgrading can solve the problem.

-Jack

On Thu, Sep 23, 2021 at 11:02 AM Mayur Srivastava <Ma...@twosigma.com> wrote:
No problem, Jack.

I’m using https://mvnrepository.com/artifact/software.amazon.awssdk/s3/2.10.53

Thanks,
Mayur

From: Jack Ye <ye...@gmail.com>
Sent: Thursday, September 23, 2021 1:24 PM
To: Iceberg Dev List <de...@iceberg.apache.org>
Subject: Re: Error when writing large number of rows with S3FileIO

Hi Mayur,

Thanks for reporting this issue. Could you tell us what version of AWS SDK V2 you are using?

Best,
Jack Ye

On Thu, Sep 23, 2021 at 8:39 AM Mayur Srivastava <Ma...@twosigma.com> wrote:
Hi,

I have an Iceberg table with 400+ columns and >100k rows, partitioned monthly by a single "time" column. I'm using Parquet files and PartitionedWriter<Record> + S3FileIO to write the data. When I write fewer than ~50k rows, the writer works, but it fails with the exception below when I write more than ~50k rows. The writer, however, works for the full >100k rows if I use HadoopFileIO.

Has anyone seen this error before, and does anyone know a way to fix it?

The writer code is as follows (writer is the PartitionedWriter<Record>):

// Start a new append operation on the table.
AppendFiles append = table.newAppend();

// Write every record through the partitioned writer.
for (GenericRecord record : records) {
    writer.write(record);
}

// Close the writer, then add all completed data files to the append and commit.
Arrays.stream(writer.complete().dataFiles()).forEach(append::appendFile);
append.commit();

Thanks,
Mayur

software.amazon.awssdk.services.s3.model.S3Exception: The specified media type is unsupported. Content type binary/octet-stream is not legal. (Service: S3, Status Code: 415, Request ID: xxxxxx)
              at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:158)
              at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:108)
              at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:86)
              at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:44)
              at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler$Crc32ValidationResponseHandler.handle(AwsSyncClientHandler.java:94)
              at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$4(BaseClientHandler.java:215)
              at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:40)
              at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:30)
              at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
              at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:74)
              at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:43)
              at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
              at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
              at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage$RetryExecutor.doExecute(RetryableStage.java:114)
              at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage$RetryExecutor.execute(RetryableStage.java:87)
              at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:63)
              at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:43)
              at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
              at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:57)
              at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:37)
              at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:81)
              at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:61)
              at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:43)
              at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
              at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
              at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
              at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
              at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:198)
              at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:122)
              at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:148)
              at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:102)
              at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
              at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:55)
              at software.amazon.awssdk.services.s3.DefaultS3Client.createMultipartUpload(DefaultS3Client.java:1410)
              at org.apache.iceberg.aws.s3.S3OutputStream.initializeMultiPartUpload(S3OutputStream.java:209)
              at org.apache.iceberg.aws.s3.S3OutputStream.write(S3OutputStream.java:168)
              at java.io.OutputStream.write(OutputStream.java:122)
              at org.apache.parquet.io.DelegatingPositionOutputStream.write(DelegatingPositionOutputStream.java:56)
              at org.apache.parquet.bytes.ConcatenatingByteArrayCollector.writeAllTo(ConcatenatingByteArrayCollector.java:46)
              at org.apache.parquet.hadoop.ParquetFileWriter.writeColumnChunk(ParquetFileWriter.java:620)
              at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:241)
              at org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:319)
              at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
              at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:566)
              at org.apache.iceberg.common.DynMethods$UnboundMethod.invokeChecked(DynMethods.java:65)
              at org.apache.iceberg.common.DynMethods$UnboundMethod.invoke(DynMethods.java:77)
              at org.apache.iceberg.common.DynMethods$BoundMethod.invoke(DynMethods.java:180)
              at org.apache.iceberg.parquet.ParquetWriter.flushRowGroup(ParquetWriter.java:176)
              at org.apache.iceberg.parquet.ParquetWriter.close(ParquetWriter.java:211)
              at org.apache.iceberg.io.DataWriter.close(DataWriter.java:71)
              at org.apache.iceberg.io.BaseTaskWriter$BaseRollingWriter.closeCurrent(BaseTaskWriter.java:282)
              at org.apache.iceberg.io.BaseTaskWriter$BaseRollingWriter.close(BaseTaskWriter.java:298)
              at org.apache.iceberg.io.PartitionedWriter.close(PartitionedWriter.java:82)
              at org.apache.iceberg.io.BaseTaskWriter.complete(BaseTaskWriter.java:83)