Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/12 09:48:27 UTC

[GitHub] [iceberg] kbendick commented on issue #2033: Manifest list file and metadata file are not created when versioning is enabled for S3 bucket

kbendick commented on issue #2033:
URL: https://github.com/apache/iceberg/issues/2033#issuecomment-758538209


   I have been thinking about this, and I have some questions related to the bucket that has versioning enabled.
   
   It's not an uncommon situation to have a versioned S3 bucket that has neither a policy to remove expired object delete markers nor a policy to expire noncurrent versions; by default, versioned buckets have neither. In that situation, it's easy to end up with either a very large number of delete markers or a single key with a very high number of versions (sometimes in the millions), which can greatly reduce your S3 throughput.
   
   I have personally encountered this issue when using a versioned bucket with Flink (without using Iceberg) for storing checkpoint and savepoint data. For Flink, it's typical for the job manager to delete older checkpoints depending on how many are configured to be retained, so with regular checkpointing it's very easy to accumulate a very large number of delete markers that are never expired. Additionally, it's not uncommon to set up a Flink job to checkpoint to a bucket where much of the data shares a very similar key prefix (and therefore likely winds up in the same physical partition). For example, when using a per job cluster, where the job ids are always 0000000000000000, it's easy for your checkpoint and savepoint data to wind up with a long, consistent prefix in the key name (Flink provides configuration to add randomness wherever desired in the checkpoint path; see the sketch below).
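
   For reference, the entropy-injection configuration looks roughly like this in `flink-conf.yaml` (the marker string, length, and bucket name are just placeholders to adapt):

   ```yaml
   # Entropy injection for the S3 filesystem: Flink replaces the marker below
   # with random characters when writing checkpoint data files, which spreads
   # keys across prefixes (and therefore across S3 partitions).
   s3.entropy.key: _entropy_
   s3.entropy.length: 4

   # Put the marker wherever you want the randomness to appear in the path.
   state.checkpoints.dir: s3://my-checkpoint-bucket/_entropy_/checkpoints
   ```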
   
   Additionally, I know that with the RocksDB state backend in Flink there is a `/shared` directory when using incremental checkpointing, and I have observed it grow pretty much indefinitely. At my work, we have special logic in place to remove this directory when a valid savepoint is taken (among other criteria).
   
   **TLDR**: For versioned S3 buckets, particularly for `PUT` and `DELETE` requests, the likelihood of getting a 503 Slow Down response increases quite a lot, due to the combination of many object versions, a very large number of retained delete markers, and per-partition throughput limitations. **When using Apache Flink without a policy in place to aggressively remove expired object versions and delete markers, in my experience it's not uncommon to run into 503 Slow Down issues.**
   
   **What you can do to debug this issue**: First and foremost, if you have access to the console (or if you're the one managing the bucket), I'd make sure that when versioning is enabled the required lifecycle policies are in place: expiring noncurrent versions, removing expired object delete markers, and removing stale / failed parts of multipart uploads (more on that below). One thing you can do to debug your current bucket, without having to create additional buckets just for testing, is to enable logs for your S3 bucket. You can enable basic server access logs, which have no added cost beyond the writes to the target S3 bucket, following the instructions here: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/server-access-logging.html. Additionally, you can enable lifecycle logging and check for the relevant lifecycle events in CloudTrail to see what's happening with versions in your bucket: https://docs.aws.amazon.com/AmazonS3/latest/dev/lifecycle-and-other-bucket-config.html
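
   For reference, you can also turn on server access logging programmatically; here's a rough boto3 sketch (the bucket names are placeholders, and the target bucket still needs to permit S3 log delivery as described in the docs above):

   ```python
   import boto3

   s3 = boto3.client("s3")

   # Placeholder bucket names -- substitute your own.
   SOURCE_BUCKET = "my-iceberg-warehouse"
   LOG_BUCKET = "my-iceberg-warehouse-logs"

   # Enable server access logging on the bucket being debugged. The target
   # bucket must already grant the S3 log delivery group permission to write.
   s3.put_bucket_logging(
       Bucket=SOURCE_BUCKET,
       BucketLoggingStatus={
           "LoggingEnabled": {
               "TargetBucket": LOG_BUCKET,
               "TargetPrefix": "access-logs/",
           }
       },
   )
   ```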
   
   You can read more about it at the bottom of this page: https://docs.aws.amazon.com/AmazonS3/latest/dev/ObjectVersioning.html. I've quoted the relevant part below. It does not mention it, but not having a lifecycle rule in place to remove expired object delete markers will also cause this issue (I believe the underlying problem is that it affects HEAD requests, which are needed for both PUT and DELETE on versioned buckets).
   
   ```
   If you notice a significant increase in the number of HTTP 503-slow down responses received for Amazon S3 PUT or DELETE object requests to a bucket that has S3 Versioning enabled, you might have one or more objects in the bucket for which there are millions of versions. For more information, see Troubleshooting Amazon S3.
   ```
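
   To check whether you're in that situation, something like this rough boto3 sketch (bucket name and prefix are placeholders) can count object versions and delete markers per key so the hot keys stand out:

   ```python
   import boto3
   from collections import Counter

   s3 = boto3.client("s3")
   BUCKET = "my-iceberg-warehouse"  # placeholder name
   PREFIX = ""                      # narrow this to your table location if you like

   versions_per_key = Counter()
   delete_markers_per_key = Counter()

   # Walk every object version and delete marker under the prefix.
   paginator = s3.get_paginator("list_object_versions")
   for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
       for version in page.get("Versions", []):
           versions_per_key[version["Key"]] += 1
       for marker in page.get("DeleteMarkers", []):
           delete_markers_per_key[marker["Key"]] += 1

   # Keys with an unusually high number of versions or delete markers are the
   # likely source of 503 Slow Down responses on PUT/DELETE.
   for key, count in versions_per_key.most_common(10):
       print(f"{count:>8} versions        {key}")
   for key, count in delete_markers_per_key.most_common(10):
       print(f"{count:>8} delete markers  {key}")
   ```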
   
   I would also be interested in any error logs you have, @elkhand, as Iceberg retries requests but will eventually error out, so the final error messages would be helpful.
   
   Have you tested this using a fresh bucket, with no preexisting object keys? Did any transactions ever complete once versioning was enabled, or did the failures only start after some time? Additionally, have you observed this issue with a bucket that started its life as a versioned bucket (or at least did not have any non-versioned keys in it)? I've also encountered instances where versioning was enabled on a bucket after the fact, and a large number of old objects remained in the bucket indefinitely because I forgot to remove them.
   
   
   Lastly, and perhaps _most importantly_, here is the documentation on lifecycle rules: https://docs.aws.amazon.com/AmazonS3/latest/dev/intro-lifecycle-rules.html. I personally have experienced issues when writing Flink savepoints and checkpoints to versioned S3 buckets, mostly because of the high frequency with which Flink can create and delete objects, combined with not having the proper lifecycle rules to expire old versions, remove object delete markers, and abort failed in-flight multipart uploads (parts of a multipart upload that never successfully completed). While there's no definitive way for the bucket to know whether an upload has failed, it's common to simply pick a large enough time frame and remove the parts if the upload has not completed by then; I typically use 24 hours or even 7 days. The most important thing is just having the policy in place, which AWS does not add by default. A rough sketch of such a configuration follows below.
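
   A boto3 sketch of such a lifecycle configuration might look like the following; the bucket name and day counts are just placeholders, and you should verify the rule against the lifecycle docs above before applying it to a real bucket:

   ```python
   import boto3

   s3 = boto3.client("s3")
   BUCKET = "my-iceberg-warehouse"  # placeholder name

   # One rule covering the three cases discussed above: expiring noncurrent
   # versions, removing expired object delete markers, and aborting incomplete
   # multipart uploads. The day counts are illustrative only.
   s3.put_bucket_lifecycle_configuration(
       Bucket=BUCKET,
       LifecycleConfiguration={
           "Rules": [
               {
                   "ID": "expire-versions-markers-and-uploads",
                   "Filter": {"Prefix": ""},  # apply to the whole bucket
                   "Status": "Enabled",
                   "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
                   "Expiration": {"ExpiredObjectDeleteMarker": True},
                   "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
               }
           ]
       },
   )
   ```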
   
   If you do not explicitly add these policies to the bucket, these objects and their metadata will remain forever and can severely impact S3 performance on versioned buckets. Additionally, there is a cost associated with storing all of this useless (or potentially useless) data, such as very old object versions, since AWS still bills you for it. So it's extra important to ensure that these policies are all in place.
   
   Without error logs, I'm not sure I can be of much more help. But I've been thinking about this issue recently and thought I'd add my personal experience with using versioned S3 buckets with Flink. I have been able to use Flink to read and write checkpoint data as well as data files to versioned S3 buckets, so I don't personally think that alone is the issue. However, I have experienced a lot of headaches when writing to versioned buckets without aggressive S3 lifecycle policies in place to remove files, and without a separate process for removing files from the `/shared` directory of RocksDB incremental checkpoints stored on S3. Accumulating a large number of objects (including noncurrent object versions and delete markers) is relatively easy to do with Flink if a good lifecycle policy is not in place.
   
   Lastly, it might also be important to note that once you've enabled versioning on a bucket, it can never technically be reverted to a non-versioned bucket. When you turn versioning off, the bucket becomes a version-suspended S3 bucket. This means that your old objects, including delete markers and noncurrent object versions, still exist, and the change only applies to objects written going forward. So if you've had versioning enabled for some time and then turned it off, it's important to ensure that any unneeded noncurrent object versions and delete markers are removed; a sketch of one way to do that follows below.
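
   If you do end up with a version-suspended bucket full of leftover noncurrent versions and delete markers, a rough boto3 sketch for cleaning them up might look like this (the bucket name is a placeholder, and this permanently deletes data, so test against a narrow prefix first):

   ```python
   import boto3

   s3 = boto3.client("s3")
   BUCKET = "my-version-suspended-bucket"  # placeholder name

   # Collect the noncurrent versions and delete markers left over from when the
   # bucket was versioned. Note that removing a delete marker that is the
   # current version makes the previous object version visible again.
   to_delete = []
   paginator = s3.get_paginator("list_object_versions")
   for page in paginator.paginate(Bucket=BUCKET):
       for version in page.get("Versions", []):
           if not version["IsLatest"]:
               to_delete.append({"Key": version["Key"], "VersionId": version["VersionId"]})
       for marker in page.get("DeleteMarkers", []):
           to_delete.append({"Key": marker["Key"], "VersionId": marker["VersionId"]})

   # delete_objects accepts at most 1000 keys per request.
   for start in range(0, len(to_delete), 1000):
       s3.delete_objects(
           Bucket=BUCKET,
           Delete={"Objects": to_delete[start : start + 1000]},
       )
   ```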


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org