Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2023/01/15 09:29:07 UTC

[GitHub] [parquet-mr] wgtmac commented on pull request #1014: PARQUET-2227: Refactor several file rewriters to use a new unified ParquetRewriter implementation

wgtmac commented on PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1383101451

   > > I am afraid some implementations may drop characters after `'\n'` when displaying the string content. Let me do some investigation.
   > 
   > I do not have a strong opinion for `'\n'` only that we need a character that probably won't be used by any systems writing parquet files.
   
   As we are discussing a new key-value metadata entry (`original.created.by`), I need to raise two related issues that will arise once we support rewriting (merging) several files into one:
   - We would need to merge `original.created.by` from all input files, which makes it difficult to tell which `created_by` came from which input file. Therefore, `original.created.by` should be dropped in this case.
   - Is there any key-value metadata that may conflict across input files and should be handled by the rewriter? For now we simply keep all the key-value metadata from the original file.
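
   To make the two points above concrete, here is a rough sketch of one possible merge policy, written against plain `java.util` maps rather than any parquet-mr class. The class `KeyValueMetadataMerger`, the `MergeResult` type, and the "first value wins, conflicts are reported" rule are all hypothetical and only meant to illustrate the question; they are not what the rewriter currently does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of parquet-mr.
public class KeyValueMetadataMerger {
    // The key under discussion in this thread.
    public static final String ORIGINAL_CREATED_BY = "original.created.by";

    /** Merged entries plus the keys whose values differed between inputs. */
    public static final class MergeResult {
        public final Map<String, String> merged = new LinkedHashMap<>();
        public final Set<String> conflictingKeys = new TreeSet<>();
    }

    /**
     * Merges the key-value metadata of several input files.
     * Assumed policy: the first value seen for a key wins, differing values
     * are reported as conflicts, and original.created.by is dropped as soon
     * as more than one input is merged. Values are assumed to be non-null.
     */
    public static MergeResult merge(List<Map<String, String>> inputs) {
        MergeResult result = new MergeResult();
        boolean multipleInputs = inputs.size() > 1;
        for (Map<String, String> input : inputs) {
            for (Map.Entry<String, String> entry : input.entrySet()) {
                String key = entry.getKey();
                if (multipleInputs && ORIGINAL_CREATED_BY.equals(key)) {
                    continue; // cannot be attributed to a single input file
                }
                String previous = result.merged.putIfAbsent(key, entry.getValue());
                if (previous != null && !Objects.equals(previous, entry.getValue())) {
                    result.conflictingKeys.add(key);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> first = new HashMap<>();
        first.put(ORIGINAL_CREATED_BY, "writer A");
        first.put("schema.owner", "team-x");

        Map<String, String> second = new HashMap<>();
        second.put(ORIGINAL_CREATED_BY, "writer B");
        second.put("schema.owner", "team-y");

        List<Map<String, String>> inputs = new ArrayList<>();
        inputs.add(first);
        inputs.add(second);

        MergeResult result = merge(inputs);
        System.out.println("merged: " + result.merged);
        System.out.println("conflicts: " + result.conflictingKeys);
    }
}
```

   Other policies (fail on conflict, last value wins, or concatenating values) would fit the same shape; which one the rewriter should adopt is exactly the open question.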
   
   @gszadovszky @ggershinsky @shangxinli Thoughts?
   
   If this behavior requires further discussion, I'd suggest keeping the current handling of `created_by` unchanged in this pull request, which is already large enough. All existing rewriters (ColumnPruner, CompressionConverter, ColumnMasker, and ColumnEncrypter) drop the original `created_by` and store the current writer version in the footer.
   
   

