Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2022/08/29 11:21:00 UTC

[jira] [Comment Edited] (ARROW-17544) [C++/Python] Add support for S3 Bucket Versioning

    [ https://issues.apache.org/jira/browse/ARROW-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597153#comment-17597153 ] 

Antoine Pitrou edited comment on ARROW-17544 at 8/29/22 11:20 AM:
------------------------------------------------------------------

Two notes:

1) I'm lukewarm about adding new method definitions, as it will result in combinatorial explosion (we also need async versions of these APIs as well as FileInfo-taking versions). One possibility would be to add the version as an additional optional argument (e.g. `const util::optional<std::string>& version = {}`).

2) GCS has both generation and metageneration, and I'm not sure how to handle that nicely in the API. Should we just ignore the metageneration here? [~coryan]
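
To make note 1 concrete, here is a minimal sketch of the optional-argument shape. It is not Arrow code: `ToyVersionedStore`, `Put` and `Read` are hypothetical names for an in-memory stand-in, and `std::optional` is used in place of `util::optional` so the snippet compiles on its own with C++17. The only point it illustrates is that one method with a defaulted trailing argument can serve both "latest" and "specific version" reads.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical in-memory stand-in for a versioned object store (not Arrow's API).
class ToyVersionedStore {
  struct Entry {
    std::map<std::string, std::string> versions;  // version id -> content
    std::string latest;                           // id of the most recent Put
  };

 public:
  // Stores `data` as a new version of `key` and marks it as the latest one.
  void Put(const std::string& key, const std::string& version, std::string data) {
    assert(!version.empty());  // this toy store requires a non-empty version id
    Entry& entry = objects_[key];
    entry.versions[version] = std::move(data);
    entry.latest = version;
  }

  // Single entry point: omitting `version` reads the latest version, so
  // existing call sites would not need to change.
  std::string Read(const std::string& key,
                   const std::optional<std::string>& version = {}) const {
    auto it = objects_.find(key);
    if (it == objects_.end()) throw std::out_of_range("no such key: " + key);
    const Entry& entry = it->second;
    const std::string& wanted = version ? *version : entry.latest;
    auto found = entry.versions.find(wanted);
    if (found == entry.versions.end()) {
      throw std::out_of_range("no such version: " + wanted);
    }
    return found->second;
  }

 private:
  std::map<std::string, Entry> objects_;
};
```

With this shape, `store.Read("data.parquet")` keeps its current meaning, while `store.Read("data.parquet", std::string("v1"))` selects a specific version. The async and FileInfo-taking variants would each gain the same trailing argument instead of a new method name.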


was (Author: pitrou):
Two notes:

1) I'm lukewarm about adding new method definitions, as it will result in combinatorial explosion (we also need async versions of these APIs as well as FileInfo-taking versions). One possibility would be to add the version as an additional optional argument (e.g. {{const util::optional<std::string>version& = {} }}).

2) GCS has both generation and metageneration, I'm not sure how to handle that nicely in the API. Should we just ignore the metageneration here? [~coryan]

> [C++/Python] Add support for S3 Bucket Versioning
> -------------------------------------------------
>
>                 Key: ARROW-17544
>                 URL: https://issues.apache.org/jira/browse/ARROW-17544
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>    Affects Versions: 9.0.0
>            Reporter: Rusty Conover
>            Assignee: Rusty Conover
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Arrow offers a reasonably capable S3 interface, but it lacks support for S3 Buckets that have versioning enabled.  For information about what S3 bucket versioning is, see:
> [https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html]
> If Arrow is interacting with a bucket where versioning is enabled, there can be S3 keys that have multiple versions of content stored utilizing the same key name.  At the present moment, Arrow does not have the ability to:
>  # Access specific versions of an S3 key rather than only the latest version.  There is no ability to specify the VersionId parameter of S3's GetObject API.
>  # Report the VersionId created when a new S3 key is uploaded to a bucket.
> Along with S3, GCS also supports versioned buckets.
> [https://cloud.google.com/storage/docs/object-versioning]
> There are a few shortcomings of the Filesystem interface to support remote file systems that support versioning:
> 1. The parameters for open_input_stream() and open_input_file() do not easily lend themselves to adding an additional parameter of "version" because they would be passed to all other implemented filesystems.  Most other file systems that exist don't actually support versioning.
> 2. Upon completion of an S3 multipart upload (i.e., close() on an S3FileSystem output stream), there is not currently a way for the user to determine the VersionId or ETag of the S3 key that was created.  This is important to know because if there are multiple concurrent writers to S3, it should be possible to identify the written S3 key.
> Proposed solutions to enable S3 Bucket versioning:
> 1. To allow library callers to read specific versions of an S3 key, extend only the S3FileSystem interface with two new API calls:
> {{open_input_stream_with_version()}}
> {{open_input_file_with_version()}}
> Both are like their namesakes from the normal FileSystem interface but take an additional parameter of a "version," which is a string representation of the VersionId returned by S3 when the S3 Key is created.  If these functions are called with an empty string for the specified version, the latest version of the S3 key will be returned.
> I'm a bit reluctant to create these specialized functions just on the S3FileSystem interface, but I also don't think it is appropriate to change open_input_stream() and open_input_file()'s parameter list for all filesystems just for functionality that is only implemented by a small number of filesystems.
> 2. Allow callers to call ReadMetadata() on an S3FileSystem output stream to retrieve the metadata about the S3 key that has been written after the stream has been closed.  The metadata will likely include both a VersionId and a value for ETag.
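
As an illustration of proposed solution 2, the sketch below shows the intended call order: write, close, then read the metadata of the object that was created. It is a toy under stated assumptions, not the S3FileSystem implementation: `ToyVersionedOutputStream` and the `Metadata` alias are hypothetical, and the VersionId and ETag values are simulated rather than taken from a real upload response.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for key/value metadata (Arrow has its own metadata type).
using Metadata = std::map<std::string, std::string>;

// Toy output stream; a real stream would complete a multipart upload in Close().
class ToyVersionedOutputStream {
 public:
  // `generation` only seeds the simulated version id.
  explicit ToyVersionedOutputStream(int generation) : generation_(generation) {
    assert(generation >= 0);
  }

  void Write(const std::string& chunk) {
    if (closed_) throw std::logic_error("write after close");
    buffer_ += chunk;
  }

  // Stands in for finishing the upload. The values recorded here are
  // simulated; a real implementation would copy them from the service response.
  void Close() {
    if (closed_) return;
    closed_ = true;
    metadata_["VersionId"] = "v" + std::to_string(generation_);
    metadata_["ETag"] = "etag-" + std::to_string(buffer_.size());
  }

  // Only meaningful once the object exists, i.e. after Close().
  const Metadata& ReadMetadata() const {
    if (!closed_) throw std::logic_error("metadata is only available after close");
    return metadata_;
  }

 private:
  int generation_;
  bool closed_ = false;
  std::string buffer_;
  Metadata metadata_;
};
```

A writer that needs to identify its own object among concurrent writers would call `Close()` and then look up `"VersionId"` in the returned metadata. Whether reading metadata before close should be an error, as it is in this toy, is a design question the proposal leaves open.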


