Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/30 08:40:53 UTC

[GitHub] [arrow] westonpace commented on pull request #10628: ARROW-12364: [Python] [Dataset] Add metadata_collector option to ds.write_dataset()

westonpace commented on pull request #10628:
URL: https://github.com/apache/arrow/pull/10628#issuecomment-871209883


   This PR now includes #10619.  Some notes for review:
   
    * Ben and I took parallel approaches to this.  Ben's approach was to mirror the C++ API and create a FileWriter that wraps CFileWriter.  My approach was to create a WrittenFile class, which is just the path & metadata (if present), and to expose it through a `file_visitor` callback (see the first sketch after this list).  I'm happy to switch if we feel the other is better.  My rationale was "FileWriter is an internal class; best to hide the concept and only expose what is needed."
    * The existing metadata_collector is a bit clunky when working with partitioned datasets: the _metadata file does not contain the partition columns.  This does appear to be the intent (with _common_metadata, if it exists, containing the full schema), but without a working Spark/Hadoop setup I can't be completely certain.  The second sketch after this list shows the existing flow.
    * The existing tests for metadata_collector called write_dataset on the same directory multiple times and expected multiple files to be created (since the legacy writer uses a GUID for naming).  This seems somewhat related to ARROW-12358.  I updated the test to call write_dataset once with a partition column.
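
   For reference, here is a minimal sketch of the `file_visitor` approach described above.  It assumes the API lands as proposed in this PR; the data, paths, and partitioning scheme are illustrative:

   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds

   table = pa.table({"year": [2020, 2020, 2021], "value": [1.0, 2.0, 3.0]})

   collected = []

   def file_visitor(written_file):
       # written_file.path is the path of the file that was written;
       # written_file.metadata is the parquet FileMetaData (None for
       # formats that have no file-level metadata).
       collected.append((written_file.path, written_file.metadata))

   ds.write_dataset(
       table,
       "/tmp/my_dataset",
       format="parquet",
       partitioning=ds.partitioning(
           pa.schema([("year", pa.int64())]), flavor="hive"
       ),
       file_visitor=file_visitor,
   )
   ```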
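
   And for comparison, a sketch of the existing `metadata_collector` flow with the legacy writer (paths and data are again illustrative).  Each collected FileMetaData covers only the columns physically stored in its file, which is why the partition column ends up missing from `_metadata`:

   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   table = pa.table({"year": [2020, 2021], "value": [1.0, 2.0]})

   collector = []
   pq.write_to_dataset(
       table,
       root_path="/tmp/legacy_dataset",
       partition_cols=["year"],
       metadata_collector=collector,
   )

   # The partition column is not stored in the data files, so the schema
   # written to _metadata must exclude it to match the collected metadata.
   pq.write_metadata(
       table.schema.remove(table.schema.get_field_index("year")),
       "/tmp/legacy_dataset/_metadata",
       metadata_collector=collector,
   )
   ```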

