Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/10 13:09:05 UTC

[GitHub] [arrow] raduteo commented on pull request #8130: Parquet file writer snapshot API and proper ColumnChunk.file_path utilization

raduteo commented on pull request #8130:
URL: https://github.com/apache/arrow/pull/8130#issuecomment-690275695


   RowGroup level file name is certainly supported by fastparquet:
   https://github.com/dask/fastparquet/blob/0402257560e20b961a517ee6d770e0995e944163/fastparquet/api.py#L187
   
   and the Java code does read file_path (again with the one-file-per-rowgroup constraint):
   https://github.com/apache/parquet-mr/blob/65b95fb72be8f5a8a193a6f7bc4560fdcd742fc7/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1410
   
   but it’s less clear how it is used (certainly ParquetFileReader class seems to stick to a single file)
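
   To make the one-file-per-rowgroup constraint concrete, here is a minimal sketch of how a reader could resolve file_path per row group. The ColumnChunk and RowGroup classes and both helper functions are hypothetical stand-ins written for illustration; they are not the fastparquet, parquet-mr, or Arrow code. The sketch assumes an unset file_path means the data lives in the same file as the metadata.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ColumnChunk:
    # Hypothetical stand-in for the Thrift ColumnChunk; only file_path is modelled.
    name: str
    file_path: Optional[str] = None  # None: assumed to be the metadata file itself


@dataclass
class RowGroup:
    columns: List[ColumnChunk]


def row_group_file(rg: RowGroup, default_path: str) -> str:
    """Return the single file holding this row group; raise if its chunks disagree."""
    paths = {c.file_path or default_path for c in rg.columns}
    if len(paths) != 1:
        raise ValueError(f"row group does not map to exactly one file: {sorted(paths)}")
    return paths.pop()


def group_by_file(row_groups: List[RowGroup], default_path: str) -> Dict[str, List[int]]:
    """Map each data file to the indices of the row groups stored in it."""
    files: Dict[str, List[int]] = {}
    for index, rg in enumerate(row_groups):
        files.setdefault(row_group_file(rg, default_path), []).append(index)
    return files
```

   A reader built this way opens each file once and reads only the row groups listed for it; a row group whose column chunks point at different files is rejected up front.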
   
   More than anything, the feature is fully in line with the parquet spec (unless I am misreading something):
   
   https://github.com/apache/parquet-format/blob/01971a532e20ff8e5eba9d440289bfb753f0cf0b/src/main/thrift/parquet.thrift#L769
   
   Also, the code changes do not affect any existing behavior. Specifically, even if one uses the proposed `Snapshot` method during file writing, the final file is still readable by the Java and fastparquet implementations.
   
   I am happy to open a discussion on the parquet list and push for broader support around this feature, but given that it is spec compliant and backward compatible with the existing code, I hope we can allow this PR to proceed independently.
   
   > On Sep 10, 2020, at 1:46 AM, emkornfield <no...@github.com> wrote:
   > 
   > 
   > I don't think we should support this unless we can get consensus on dev@parquet mailing list that we want to support this across java and C++ (if java already supports it a pointer would be useful).
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org