Posted to dev@parquet.apache.org by Richard Zamora <rz...@nvidia.com> on 2019/05/23 03:26:33 UTC

Proper use of the ColumnChunk `file_path` attribute

I’d like to solicit some feedback on the use of the `file_path` attribute for ColumnChunk metadata in Parquet.  How exactly is this attribute used in practice for both single-file and distributed datasets?

More specifically: Is it bad form to set the `file_path` value in footer metadata when the data is stored in the same file? Should the value only be set in the `_metadata` file, or in cases where the actual column-chunk data is stored in a different location? My intuition is that the answer to both of these questions is “yes,” but any feedback/details from people with strong Parquet experience is very welcome :)
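
To make the question concrete, here is a minimal sketch of how I understand a reader would locate column-chunk data, based on the comment in the Parquet Thrift definition (an unset `file_path` means "same file as the metadata", and a set one is relative to the file containing the metadata). The helper name `resolve_chunk_location` and the example paths are hypothetical and are not part of any Parquet or Arrow API; the sketch only illustrates the convention being asked about, and it may not match what real readers do.

```python
import posixpath
from typing import Optional


def resolve_chunk_location(metadata_file: str, file_path: Optional[str]) -> str:
    """Return the file assumed to hold a column chunk's data.

    Hypothetical helper for illustration only (not a Parquet library call).
    """
    if not file_path:
        # Unset (or empty) file_path: the data lives in the same file
        # as the footer metadata that describes it.
        return metadata_file
    # Set file_path: interpret it relative to the directory of the file
    # that contains the metadata (for example a `_metadata` file).
    return posixpath.join(posixpath.dirname(metadata_file), file_path)


# Single-file case: footer and data share one file, file_path left unset.
print(resolve_chunk_location("dataset/part.0.parquet", None))

# Distributed case: `_metadata` refers to a data file stored next to it.
print(resolve_chunk_location("dataset/_metadata", "part.0.parquet"))
```

Under this reading, both calls resolve to the same data file, which is why setting `file_path` in the footer of the data file itself looks redundant to me.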

Note that the context for these questions is an ongoing discussion about the necessary metadata API in `arrow.parquet` (e.g. https://github.com/apache/arrow/pull/4361 and https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16845670#comment-16845670)

Thanks for your help!
-Rick
