Posted to github@arrow.apache.org by "elveshoern32 (via GitHub)" <gi...@apache.org> on 2023/03/20 16:29:45 UTC

[GitHub] [arrow] elveshoern32 commented on issue #32083: [C++][Python] S3 tag support on write

elveshoern32 commented on issue #32083:
URL: https://github.com/apache/arrow/issues/32083#issuecomment-1476562639

The original request uses the term 'tag'; later in the discussion the term 'metadata' is used.

I know S3 only:

In S3, 'metadata' (more precisely, 'user-defined object metadata') is considered part of the object and is thus immutable; it has to be added at creation time.
'Tags', however, are a different animal and can be added, changed, or removed at any time.
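
To illustrate the difference, here is a minimal sketch of the request parameters that boto3's S3 client takes for each case. The helper names (`build_put_object_kwargs`, `build_put_object_tagging_kwargs`), the bucket, and the keys are hypothetical; only `put_object` and `put_object_tagging` with their `Metadata` and `Tagging` parameters are real boto3 calls. This is not Arrow API, just the underlying S3 operations as I understand them.

```python
from urllib.parse import urlencode


def build_put_object_kwargs(bucket, key, body, metadata=None, tags=None):
    """Build keyword arguments for boto3's S3 client.put_object() (hypothetical helper)."""
    kwargs = {"Bucket": bucket, "Key": key, "Body": body}
    if metadata:
        # User-defined object metadata: stored with the object, fixed at creation time.
        kwargs["Metadata"] = dict(metadata)
    if tags:
        # put_object takes the initial tags as a URL-encoded query string.
        kwargs["Tagging"] = urlencode(tags)
    return kwargs


def build_put_object_tagging_kwargs(bucket, key, tags):
    """Build keyword arguments for client.put_object_tagging() (hypothetical helper).

    This call targets an object that already exists and replaces its tag set.
    """
    tag_set = [{"Key": k, "Value": v} for k, v in tags.items()]
    return {"Bucket": bucket, "Key": key, "Tagging": {"TagSet": tag_set}}


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_object(**build_put_object_kwargs(
#       "my-bucket", "data/part-0.parquet", b"...",
#       metadata={"origin": "etl"}, tags={"retention": "30d"}))
#   s3.put_object_tagging(**build_put_object_tagging_kwargs(
#       "my-bucket", "data/part-0.parquet", {"retention": "90d"}))
```

The first call sets both metadata and tags in the same request that creates the object; the second changes the tags afterwards, which has no counterpart for user-defined metadata.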

Currently, Arrow supports neither 'tags' nor 'user-defined object metadata', only a few 'system-defined object metadata' fields.
For my use case it would be helpful to be able to use at least one of the two.
The details of the API are of minor importance to me.

> We should try to do this in a way that's generic enough and can be implemented in other filesystem types.

Agreed. However, I feel that Arrow should not impose any further limits on which metadata are possible.
Different storage technologies show different characteristics; Arrow shouldn't implement just the smallest common subset.

> This is causing us to scan the bucket and re-apply the tags after a pyarrow-based process has run.

This is exactly what I'd like to avoid, because

1. S3 calls are actually quite costly (in terms of CPU and wall clock time) and

2. this approach leaves a time window of unknown length during which the objects carry the wrong set of tags, which might lead the ILM (Information Lifecycle Management) to make wrong decisions.
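
For illustration, the scan-and-retag workaround looks roughly like the sketch below. The function name `retag_prefix` and its call counter are hypothetical; `get_paginator("list_objects_v2")` and `put_object_tagging` are real boto3 S3 client calls. It assumes an already configured client is passed in.

```python
def retag_prefix(s3_client, bucket, prefix, tags):
    """Re-apply `tags` to every object under `prefix` (hypothetical helper).

    Returns the number of S3 requests issued, to make the cost visible.
    """
    tag_set = [{"Key": k, "Value": v} for k, v in tags.items()]
    calls = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        calls += 1  # one listing request per page
        for obj in page.get("Contents", []):
            # One extra request per object; until it completes, the object
            # still carries whatever tags it had (or none) when it was written.
            s3_client.put_object_tagging(
                Bucket=bucket, Key=obj["Key"], Tagging={"TagSet": tag_set}
            )
            calls += 1
    return calls
```

So for N objects this costs at least N + 1 requests after the write has already finished, and every object stays mis-tagged until its own request is done.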
