You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "xxgreg (via GitHub)" <gi...@apache.org> on 2023/05/26 05:08:11 UTC

[GitHub] [arrow] xxgreg opened a new issue, #35775: [GO] pqarrow.FileWriter allow adding parquet metadata after writing rowgroups

xxgreg opened a new issue, #35775:
URL: https://github.com/apache/arrow/issues/35775

   ### Describe the enhancement requested
   
   I'd like to add a key/value metadata field to the Parquet metadata. The value of the field is not known until after the row groups have been written.
   
   It looks like it is possible to do this when using `parquet/file.Writer` but it isn't possible when using `pqarrow.FileWriter`.
   
   Would it make sense to add some API to allow for this?
   
   Or perhaps it's already there, and I can't see it.
   
   
   
   ### Component(s)
   
   Go


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on issue #35775: [GO] pqarrow.FileWriter allow adding parquet metadata after writing rowgroups

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35775:
URL: https://github.com/apache/arrow/issues/35775#issuecomment-1564021351

   `Finish` api LGTM.
   cc @zeroshade 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on issue #35775: [GO] pqarrow.FileWriter allow adding parquet metadata after writing rowgroups

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on issue #35775:
URL: https://github.com/apache/arrow/issues/35775#issuecomment-1564526323

   thanks @xxgreg!! I'll go take a look at the PR. There's definitely interest in this and thanks for fixing the bug with persisting the metadata.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] xxgreg commented on issue #35775: [GO] pqarrow.FileWriter allow adding parquet metadata after writing rowgroups

Posted by "xxgreg (via GitHub)" <gi...@apache.org>.
xxgreg commented on issue #35775:
URL: https://github.com/apache/arrow/issues/35775#issuecomment-1564015486

   I added a quick change to allow this. See commits above.
   
   I can add some tests and send a PR if there's any interest.
   
   It turns out that although you can update the key value metadata on `parquet/file.Writer`, the changes don't get persisted, so I added a commit which changes this also.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tschaub commented on issue #35775: [Go][Parquet] pqarrow.FileWriter allow adding parquet metadata after writing rowgroups

Posted by "tschaub (via GitHub)" <gi...@apache.org>.
tschaub commented on issue #35775:
URL: https://github.com/apache/arrow/issues/35775#issuecomment-1724764855

   I've put together a proposed set of changes in #37786.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] xxgreg commented on issue #35775: [GO] pqarrow.FileWriter allow adding parquet metadata after writing rowgroups

Posted by "xxgreg (via GitHub)" <gi...@apache.org>.
xxgreg commented on issue #35775:
URL: https://github.com/apache/arrow/issues/35775#issuecomment-1571180684

   Apologies for the slow reply. And thanks for the super quick replies!
   
   I've got a couple of other priorities to sort out first. I'm planning to pick this up again in 2 weeks.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on issue #35775: [Go][Parquet] pqarrow.FileWriter allow adding parquet metadata after writing rowgroups

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on issue #35775:
URL: https://github.com/apache/arrow/issues/35775#issuecomment-1699810703

   @tschaub I think it would be fine to create a backwards incompatible change here (as long as we mark the PR appropriately). While a non-breaking change would always be preferred (such as creating a new function instead of changing the existing one) it's not a deal breaker on this.
   
   > I'm unclear on why the FileMetadata and KeyValueMetadata fields are exported on the file.Writer struct (but I'm probably missing other uses for both of these fields). Are these meant to settable before the writer is closed? Or should there be an explicit writer.SetKeyValueMetadata() method that would be called before closing a writer (somewhat similar to https://github.com/segmentio/parquet-go/pull/399).
   
   You are exactly correct, they were meant to be settable before the writer is closed as `Finish` / `Close` is where the metadata gets written so they can be set / updated anytime before the writer is closed when it will actually *write* the metadata. I think it's reasonable to change these to be no longer exported and add a `SetKeyValueMetadata` method if everyone thinks this would be a better API.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade closed issue #35775: [Go][Parquet] pqarrow.FileWriter allow adding parquet metadata after writing rowgroups

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade closed issue #35775: [Go][Parquet] pqarrow.FileWriter allow adding parquet metadata after writing rowgroups
URL: https://github.com/apache/arrow/issues/35775


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tschaub commented on issue #35775: [Go][Parquet] pqarrow.FileWriter allow adding parquet metadata after writing rowgroups

Posted by "tschaub (via GitHub)" <gi...@apache.org>.
tschaub commented on issue #35775:
URL: https://github.com/apache/arrow/issues/35775#issuecomment-1694105387

   It would also be useful to see an example of how the [`writer.FileMetadata`](https://github.com/apache/arrow/blob/go/v13.0.0/go/parquet/file/file_writer.go#L44) and [`writer.KeyValueMetadata`](https://github.com/apache/arrow/blob/go/v13.0.0/go/parquet/file/file_writer.go#L46) fields are intended to be used. 
   
   I only see `writer.FileMetadata` being set and then immediately being used in the [`file_writer.go` file](https://github.com/apache/arrow/blob/go/v13.0.0/go/parquet/file/file_writer.go).  So it isn't clear why this field needs to be exported at all (but perhaps it is used by other packages) - or even why it is is a struct field.
   
   The `writer.KeyValueMetadata` field is only set by the [`WithWriteMetadata()` function](https://github.com/apache/arrow/blob/go/v13.0.0/go/parquet/file/file_writer.go#L57-L61) and then only accessed by the [`NewParquetWriter()` function](https://github.com/apache/arrow/blob/go/v13.0.0/go/parquet/file/file_writer.go#L67-L83).
   
   I'm unclear on why the `FileMetadata` and `KeyValueMetadata` fields are exported on the `file.Writer` struct.  Are these meant to settable before the writer is closed?  Or should there be an explicit `writer.SetKeyValueMetadata()` method that would be called before closing a writer (somewhat similar to https://github.com/segmentio/parquet-go/pull/399). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tschaub commented on issue #35775: [Go][Parquet] pqarrow.FileWriter allow adding parquet metadata after writing rowgroups

Posted by "tschaub (via GitHub)" <gi...@apache.org>.
tschaub commented on issue #35775:
URL: https://github.com/apache/arrow/issues/35775#issuecomment-1694093670

   I'm also interested in being able to set the file metadata after writing row groups.  The [commit proposed](https://github.com/xxgreg/arrow/commit/8ef88204c2c4746b3985702faeeaabd78b4c58b8) by @xxgreg changes the signature of the `metadata.Finish()` method.  From @mapleFU's [comment above](https://github.com/apache/arrow/issues/35775#issuecomment-1564021351), it sounds like this API change would be acceptable.  If this backwards incompatible change is acceptable, I would like to help write tests to get this in.  Or alternatively, if a non-breaking change would be preferred, I could try to make that happen.  Any suggestions from maintainers on the preferred direction (non-breaking change only or breaking change ok) would be appreciated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org