Posted to user@flink.apache.org by Harrison Xu <hx...@quora.com> on 2019/11/06 19:52:54 UTC

StreamingFileSink to S3 failure to complete multipart upload

Hello,
I'm seeing the following behavior in StreamingFileSink (1.9.1) uploading to
S3.
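
For context, the sink is wired up roughly as below. This is a minimal sketch of the stock API, not our actual code (we run a repackaged copy of the sink classes under com.quora.dataInfra.s3connector, as the logger names show); the bucket, paths, encoder, and bucket assigner pattern are all illustrative:

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

public class S3SinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // in-progress part files are persisted on each checkpoint

        // Stand-in for the real Kafka source.
        DataStream<String> stream = env.socketTextStream("localhost", 9999);

        // Row-format sink: buffered data for an in-progress part file is
        // persisted as parts of an S3 multipart upload (MPU) at checkpoint
        // time; the MPU is completed when the part file is rolled and committed.
        StreamingFileSink<String> s3DataSink = StreamingFileSink
                .forRowFormat(new Path("s3://my-bucket/tmp/kafka/meta"),
                              new SimpleStringEncoder<String>("UTF-8"))
                .withBucketAssigner(new DateTimeBucketAssigner<>("'dt='yyyy-MM-dd'T'HH"))
                .build();

        stream.addSink(s3DataSink).name("s3_data_sink");
        env.execute("s3-sink-sketch");
    }
}

The failure at checkpoint time: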

2019-11-06 15:50:58,081 INFO  com.quora.dataInfra.s3connector.flink.filesystem.Buckets - Subtask 1 checkpointing for checkpoint with id=5025 (max part counter=3406).
2019-11-06 15:50:58,448 INFO  org.apache.flink.streaming.api.operators.AbstractStreamOperator - Could not complete snapshot 5025 for operator Source: kafka_source -> (Sink: s3_metadata_sink, Sink: s3_data_sink) (2/18).
java.io.IOException: Uploading parts failed
    at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.awaitPendingPartUploadToComplete(RecoverableMultiPartUploadImpl.java:231)
    at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.awaitPendingPartsUpload(RecoverableMultiPartUploadImpl.java:215)
    at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.snapshotAndGetRecoverable(RecoverableMultiPartUploadImpl.java:151)
    at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.snapshotAndGetRecoverable(RecoverableMultiPartUploadImpl.java:56)
    ... 12 more
Caused by: java.io.FileNotFoundException: upload part on tmp/kafka/meta/auction_ads/dt=2019-11-06T15/partition_7/part-1-3403: org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchUpload; Request ID: 6D4B335FE7687B51; S3 Extended Request ID: OOqtRkyz1O4hA+Gfn+kRyZS/XSzD5WHlQZZbU/+OIO/9paITpCJmdKFqws1dDy/d/e4EXedrVNc=), S3 Extended Request ID: OOqtRkyz1O4hA+Gfn+kRyZS/XSzD5WHlQZZbU/+OIO/9paITpCJmdKFqws1dDy/d/e4EXedrVNc=:NoSuchUpload
    ... 10 more
...
2019-11-06 15:50:58,476 INFO  org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task Source: kafka_source -> (Sink: s3_metadata_sink, Sink: s3_data_sink) (2/18) (060d4deed87f3be96f3704474a5dc3e9).

Via the S3 console, the file in question (part-1-3403) does NOT exist, but
its in-progress temp file does:

_part-1-3402_tmp_38cbdecf-e5b5-4649-9754-bb7aa008f373
_part-1-3403_tmp_73e2a73b-0bac-46e8-8fdf-9455903d9da0
part-1-3395
part-1-3396
...
part-1-3401

The MPU lifecycle policy is configured to abort incomplete uploads after *3
days*, so it should not be a factor here.
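
To rule the lifecycle policy in or out more directly, the MPUs that are
still in progress under the sink prefix can be listed. A minimal sketch
using the AWS SDK v1 (the same API family that flink-s3-fs shades); the
bucket and prefix are placeholders:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListMultipartUploadsRequest;
import com.amazonaws.services.s3.model.MultipartUpload;
import com.amazonaws.services.s3.model.MultipartUploadListing;

public class ListPendingUploads {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Multipart uploads that were started but never completed or aborted.
        ListMultipartUploadsRequest req =
                new ListMultipartUploadsRequest("my-bucket")
                        .withPrefix("tmp/kafka/meta/auction_ads/");

        MultipartUploadListing listing = s3.listMultipartUploads(req);
        for (MultipartUpload u : listing.getMultipartUploads()) {
            System.out.println(u.getKey() + " -> uploadId=" + u.getUploadId());
        }
    }
}

If the upload ID that the checkpoint references no longer appears in this
listing, Flink has nothing left to resume against.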

Attempting to restore from the most recent checkpoint, *5025*, results in
similar issues for different topics. What I am seeing in S3 is essentially
two incomplete part files, such as:

_part-4-3441_tmp_da13ceba-a284-4353-bdd6-ef4005d382fc
_part-4-3442_tmp_fe0c0e00-c7f7-462f-a99f-464b2851a4cb

And the checkpoint restore operation fails with:

upload part on tmp/kafka/meta/feed_features/dt=2019-11-06T15/partition_0/part-4-3441: org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist.

(The upload indeed does not exist in S3.)
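
As far as I understand the resume path, this error is what it looks like
when the upload ID itself is gone: the recoverable writer keeps the MPU's
upload ID in checkpoint state, and on restore it uploads further parts
against that ID. A simplified sketch of what the resume amounts to (not
Flink's actual code; all values are illustrative):

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.UploadPartRequest;

public class ResumeUploadSketch {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Values that would come out of restored checkpoint state (illustrative).
        String bucket = "my-bucket";
        String key = "tmp/kafka/meta/feed_features/dt=2019-11-06T15/partition_0/part-4-3441";
        String restoredUploadId = "upload-id-from-checkpoint-state";
        File buffered = new File("/tmp/buffered-part-data");

        // Resuming means uploading the next part against the stored upload ID.
        // If S3 has already aborted or completed that MPU, this call fails with
        // 404 NoSuchUpload, which is exactly the error in the stack trace above.
        UploadPartRequest part = new UploadPartRequest()
                .withBucketName(bucket)
                .withKey(key)
                .withUploadId(restoredUploadId)
                .withPartNumber(4) // next part number from restored state
                .withFile(buffered)
                .withPartSize(buffered.length());
        s3.uploadPart(part);
    }
}

Once S3 no longer knows that upload ID, every restore that references it
hits the same 404.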

Any ideas?
As it stands, this job is basically unrecoverable right now because of this
error.
Thank you

Re: StreamingFileSink to S3 failure to complete multipart upload

Posted by Harrison Xu <hx...@quora.com>.
To add to this, attempting to restore from the most recent manually
triggered *savepoint* results in a similar, yet slightly different error:

java.io.FileNotFoundException: upload part on tmp/kafka/meta/ads_action_log_kafka_uncounted/dt=2019-11-06T00/partition_6/part-4-2158: org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchUpload

Looking in S3, I see that *two files with the same part number* exist:
_part-4-2158_tmp_03c7ebaa-a9e5-455a-b501-731badc36765
part-4-2158
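
Note that the error text says the upload "may have been aborted or
completed". The completed case would fit what I see here: part-4-2158 was
committed, yet the savepoint still references the (now finished) MPU behind
it. A quick cross-check sketch (same SDK assumption as above; values
illustrative):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class CheckCommitted {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // The key the savepoint is trying to resume (illustrative bucket name).
        String bucket = "my-bucket";
        String key = "tmp/kafka/meta/ads_action_log_kafka_uncounted/dt=2019-11-06T00/partition_6/part-4-2158";

        // If the committed object already exists, the MPU referenced by the
        // savepoint was most likely completed rather than aborted.
        if (s3.doesObjectExist(bucket, key)) {
            System.out.println(key + " is already committed; the savepoint's upload ID is stale.");
        }
    }
}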

And again, I cannot recover the job from this prior state.
Thanks so much for any input - would love to understand what is going on.
Happy to provide full logs if needed.

